pd1_graftsheet_41limit keep crashing

Message boards : Number crunching : pd1_graftsheet_41limit keep crashing

Ananas
Joined: 1 Jan 06
Posts: 232
Credit: 752,471
RAC: 0
Message 76965 - Posted: 7 Jul 2014, 6:24:45 UTC
Last modified: 7 Jul 2014, 6:26:07 UTC

Error code 0xc0000005 (protection fault / access violation)

It's not only me: my wingmen don't seem to have any more luck with these.

p.s.: already reported here, sorry, I had not seen this before I posted.
ID: 76965
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,524,889
RAC: 7,500
Message 76966 - Posted: 7 Jul 2014, 7:44:16 UTC - in response to Message 76965.  

Error code 0xc0000005 (protection fault / access violation)

It's not only me: my wingmen don't seem to have any more luck with these.

p.s.: already reported here, sorry, I had not seen this before I posted.


+1.
All WUs of this kind crash after a few seconds.
Please stop this batch.
ID: 76966
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,524,889
RAC: 7,500
Message 76971 - Posted: 8 Jul 2014, 5:47:55 UTC - in response to Message 76966.  

Please stop this batch.


Again, a lot of pd1_graftsheet errors (and I killed the downloads of these WUs).
Please stop this batch; don't waste our time.

ID: 76971
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 76972 - Posted: 8 Jul 2014, 8:32:35 UTC - in response to Message 76971.  

Please stop this batch.


Again, a lot of pd1_graftsheet errors (and I killed the downloads of these WUs).
Please stop this batch; don't waste our time.


You are assuming that the scientists can't get any useful data from these failures.

From other reports on the forums, some of the pd1 tasks are failing while others are succeeding. The results may hold useful lessons on why some fail and others don't. Also, for most participants the errors are occurring after a matter of seconds, so there is little "wasted" time (and the time you do use is granted credit by the overnight script).


It would be useful though if one of the scientists gave us a reply to say whether the results are useful or just a case of someone missing a decimal point when setting up the batch...
ID: 76972
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,524,889
RAC: 7,500
Message 76974 - Posted: 8 Jul 2014, 9:41:23 UTC - in response to Message 76972.  

It would be useful though if one of the scientists gave us a reply to say whether the results are useful or just a case of someone missing a decimal point when setting up the batch...


If the scientist wants only to debug this particular batch, they can use ralph@home.... :-)
ID: 76974
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,524,889
RAC: 7,500
Message 76976 - Posted: 8 Jul 2014, 9:51:22 UTC - in response to Message 76972.  

Also, for most participants the errors are occurring after a matter of seconds, so there is little "wasted" time (and the time you do use is granted credit by the overnight script).


Not only wasted time, but also wasted ADSL bandwidth (every pd1_graftsheet download is 80 MB).
:-(

ID: 76976
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 76977 - Posted: 8 Jul 2014, 12:07:13 UTC
Last modified: 8 Jul 2014, 12:09:11 UTC

hi all,
I'm running a Linux host, and I get some errors too (see the Minirosetta 3.52 thread).
Fortunately for me, even those tasks that errored out ran to completion. I did not get credit at first, but later I found that credit is allocated to the task itself, which probably means the job ran completely.

It apparently has to do with some null pointer errors and seems to affect this particular job.

However, I do see cases on the Windows platform, for the same tasks reassigned to me, where the job terminates, some almost as soon as it starts.

For Windows users: have you tried resetting the project so that the Rosetta apps and database are downloaded again? Perhaps we could share some of that feedback in this thread, e.g. if resetting solves the issue, it might just be the solution.
ID: 76977
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 76978 - Posted: 8 Jul 2014, 15:08:27 UTC - in response to Message 76977.  

I did not get credit at first, but later I found that credit is allocated to the task itself, which probably means the job ran completely.


There is an automatic script that runs each night to award credit to tasks that ended in an error. Because the script is a modification to the normal BOINC process, the granted credit only shows up on the task page.


For Windows users: have you tried resetting the project so that the Rosetta apps and database are downloaded again? Perhaps we could share some of that feedback in this thread, e.g. if resetting solves the issue, it might just be the solution.


It is just a problem with this batch. Other tasks are running fine, so it is unlikely to be a problem with the app or database files. Also, the fact that so many Windows users are affected suggests that something is wrong with the task design; these things usually turn out to be a scientist missing a decimal place or leaving a stray reference in one of the task calculations.



If the scientist wants only to debug this particular batch, they can use ralph@home.... :-)


Sorry, I was a little unclear in my earlier comment. I don't think the scientists deliberately released a bad batch. I was trying to point out that the limited results from this batch could still be useful despite the problems.
ID: 76978
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,524,889
RAC: 7,500
Message 76979 - Posted: 8 Jul 2014, 15:58:28 UTC - in response to Message 76978.  

Sorry, I was a little unclear in my earlier comment. I don't think the scientists deliberately released a bad batch. I was trying to point out that the limited results from this batch could still be useful despite the problems.


I have over 300 messages and 400k points on Ralph, so I know what a beta test is. I think, like you, that they didn't release a bad batch deliberately. But I also think that a "stop" is the best solution. After that, move the code to Ralph and test it thoroughly.

ID: 76979
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 76980 - Posted: 8 Jul 2014, 16:41:20 UTC
Last modified: 8 Jul 2014, 17:41:17 UTC

I'm making some guesses as to whether things might have improved.

I have an instance of pd1_graftsheet_41limit that ran without errors:
https://boinc.bakerlab.org/rosetta/result.php?resultid=673049941

Apparently the same task errored out when someone else ran the same(?) job:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=610410753

I guess it is hit and miss for now.

I agree with [VENETO] boboviz: at least reduce the number of pd1_graftsheet_41limit tasks being pushed out, fix the issues, and perhaps beta test them.

Each task pushes out 80 MB, and when it fails it gets reassigned; if there are hundreds of jobs, that may add up to (tens of) gigabytes of wasted bandwidth.
ID: 76980
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,524,889
RAC: 7,500
Message 76983 - Posted: 9 Jul 2014, 6:28:33 UTC - in response to Message 76980.  

I agree with [VENETO] boboviz: at least reduce the number of pd1_graftsheet_41limit tasks being pushed out, fix the issues, and perhaps beta test them.


I continue to receive a lot of pd1_graftsheet WUs (all errors).
Do the admins read the forum??

ID: 76983
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 76987 - Posted: 9 Jul 2014, 17:45:08 UTC - in response to Message 76983.  

I agree with [VENETO] boboviz: at least reduce the number of pd1_graftsheet_41limit tasks being pushed out, fix the issues, and perhaps beta test them.


I continue to receive a lot of pd1_graftsheet WUs (all errors).
Do the admins read the forum??



I contacted the scientist responsible for this particular batch of jobs. He is looking at it now. Thank you for reporting the errors!
ID: 76987
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,524,889
RAC: 7,500
Message 76990 - Posted: 9 Jul 2014, 20:06:57 UTC - in response to Message 76987.  

I contacted the scientist responsible for this particular batch of jobs. He is looking at it now. Thank you for reporting the errors!


Thank you!!

ID: 76990
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,524,889
RAC: 7,500
Message 76998 - Posted: 12 Jul 2014, 19:11:27 UTC - in response to Message 76987.  

I contacted the scientist responsible for this particular batch of jobs. He is looking at it now. Thank you for reporting the errors!


I continue to receive this kind of WU (with errors). Please stop it.

ID: 76998
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77002 - Posted: 12 Jul 2014, 21:24:40 UTC

Hi VENETO,
I talked to the researcher who started these jobs... Apparently the failure is "normal": the task tries different parameters, and if they don't pass the filters it doesn't return a result. The issue is that BOINC reports an error when no result is seen after the run, even though the fact that it didn't pass the filter is itself a result!

Future jobs will behave more nicely and not appear to crash. This batch is almost complete.
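
That distinction, that "no structure returned" is a legitimate outcome rather than a crash, can be sketched as follows. This is an illustrative Python sketch only, not the actual Rosetta/BOINC code; the function name and status labels are invented:

```python
def report_outcome(structure):
    """Hand a result record back to the scheduler (hypothetical sketch).

    Instead of returning nothing when a run is filtered out (which the
    reporting layer would interpret as an error), a dead-end run reports
    an explicit 'filtered_out' record, so 'no structure' != 'crash'.
    """
    if structure is None:
        # The run finished, but nothing passed the filters: still a result.
        return {"status": "filtered_out", "structure": None}
    return {"status": "ok", "structure": structure}
```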
ID: 77002
Indigo
Joined: 5 Dec 07
Posts: 1
Credit: 133,409
RAC: 0
Message 77013 - Posted: 14 Jul 2014, 18:56:32 UTC

Hi all. This is my code and it is doing exactly what it is supposed to do.
It turns out that BOINC reports an "error" when no protein structure is returned.

The reason many jobs are not returning structures is that I'm using a strategy called "dead-end elimination."
The reason we need massively parallel computing/simulations is that it's impossible to know ahead of time whether a single simulation will return the results we need (i.e. "stochastic sampling"). However, after a simulation has run for a while, we can sometimes tell that it's not going anywhere, and it's a waste of everyone's resources to continue it and save the output; instead, a new job is spawned with a different starting point.
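
The sampling strategy described above might look roughly like this. This is an illustrative Python sketch only, not Rosetta code; the threshold, the score model, and all names are invented for demonstration:

```python
import random

FILTER_THRESHOLD = 0.95  # invented cutoff for a "useful" structure


def run_trajectory(seed, steps=10):
    """One stochastic sampling run (toy stand-in for a simulation).

    Returns a score (our stand-in for a structure) if it passes the
    filter, or None if it dead-ends and is abandoned early.
    """
    rng = random.Random(seed)
    best = 0.0
    for step in range(1, steps + 1):
        best = max(best, rng.random())
        # Dead-end elimination: halfway through, a run still far below
        # the threshold is abandoned rather than wasting more compute.
        if step == steps // 2 and best < 0.5:
            return None  # abandoned: no structure to report
    return best if best >= FILTER_THRESHOLD else None


def run_batch(seeds):
    """Only trajectories that pass the filter contribute structures;
    a batch can legitimately return nothing at all."""
    return [r for r in map(run_trajectory, seeds) if r is not None]
```

The key point the sketch illustrates is that an empty return is expected behavior of the sampling scheme, not a malfunction.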

I've been a Rosetta developer and scientist for years, but this is my first time using R@H instead of our own supercomputing clusters. I'm going to reconfigure my job-submission strategy to play more nicely with BOINC's point system.


Thanks everybody!
Chris Indigo King
Bakerlab
ID: 77013
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77022 - Posted: 16 Jul 2014, 15:11:54 UTC

hi Indigo,

thanks for coming out into the open and sharing this info; I'd guess many participants (including me) appreciate the feedback very much :)

yup, I'd think there may be ways to improve the credit system, or even Minirosetta itself (especially for the Windows-platform volunteers), to grant some credit in these 'special case' failures. After all, those participants may have downloaded multiples of the 80 MB start files only to 'crash' (some almost on starting up) :)


ID: 77022
Trotador
Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 2
Message 77023 - Posted: 16 Jul 2014, 20:11:37 UTC

Yes, thank you for posting the explanations, Indigo. It is really appreciated.

ID: 77023
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77034 - Posted: 19 Jul 2014, 0:33:19 UTC
Last modified: 19 Jul 2014, 0:41:31 UTC

hi Indigo,

Take a look at this workunit

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=612393502

when a different (Windows) PC ran it, apparently it 'crashed':
https://boinc.bakerlab.org/rosetta/result.php?resultid=675364476

the task then got reassigned to my (Linux) PC:
https://boinc.bakerlab.org/rosetta/result.php?resultid=675380858
apparently it generated 108 decoys/models, with no errors

I just wanted to bring up some details, in case they are interesting or perhaps point to problems beyond simply not finding structures.
ID: 77034
LumenDan
Joined: 26 Apr 07
Posts: 3
Credit: 5,549,929
RAC: 1,065
Message 77045 - Posted: 20 Jul 2014, 9:15:55 UTC
Last modified: 20 Jul 2014, 9:17:59 UTC

It turns out that BOINC reports an "error" when no protein structure is returned.


It is a relief to know that the failed units did not need to continue computation and have in fact contributed to the scientific process in a meaningful way.

From an application programmer's point of view, reporting an error is one thing; generating an access violation (Windows) is another:
"Reason: Access Violation (0xc0000005) at address 0x00757DEB write attempt to address 0x00000000"

The Rosetta application (or BOINC core) has definitely crashed when this error occurs, leaving the operating system to clean up the mess. I don't think a write to a null pointer should ever be considered normal behaviour, and I hope that the fault can be avoided in new batches or future releases of Minirosetta.

Please add a null pointer check to the application, or create a placeholder structure to return when a dead-end calculation terminates, to avoid fatal exceptions.
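
The guard being requested could look something like this. The real fix would be a null-pointer check in the C++ application; this is only a Python analogue of the pattern, with invented names and an invented output format:

```python
def write_output(structure, path):
    """Serialize the final structure for upload (hypothetical sketch).

    Check for a missing (None/null) structure BEFORE touching it, and
    write a placeholder record instead: the analogue of dereferencing a
    null pointer here would be the 0xc0000005 crash seen in this batch.
    """
    with open(path, "w") as out:
        if structure is None:
            # Placeholder record: still a valid, uploadable result file.
            out.write("status: dead_end\n")
        else:
            out.write("status: ok\n")
            out.write(f"structure: {structure}\n")
```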

My personal reaction when I see batches with ongoing failures is to question whether my computer configuration is at fault and whether there is something I need to change to avoid returning bad results. Your responses have certainly put me at ease in that respect :).

Thanks for considering credit allocation for dead-end units. All of mine seem to have failed within the first 30 seconds, so I didn't expect any; but in cases where more substantial computing time has been invested, a lack of due credit could steer people away from certain batches.

Best Regards,
LumenDan
ID: 77045



©2024 University of Washington
https://www.bakerlab.org