Message boards : Number crunching : pd1_graftsheet_41limit keep crashing
Ananas (Joined: 1 Jan 06, Posts: 232, Credit: 752,471, RAC: 0)

Error code 0xc0000005 (protection fault / access violation). Not only for me; my wingmen don't seem to have any more luck with these.

P.S.: already reported here, sorry; I had not seen that thread before I posted.
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,524,889, RAC: 7,500)

> Error code 0xc0000005 (protection fault / access violation)

+1. All WUs of this kind crash after a few seconds. Please stop this batch.
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,524,889, RAC: 7,500)

> Please, stop this batch

Again, a lot of pd1_graftsheet errors (and I am aborting the downloads of these WUs). Please stop this batch; don't waste our time.
Murasaki (Joined: 20 Apr 06, Posts: 303, Credit: 511,418, RAC: 0)

> Please, stop this batch

You are assuming that the scientists can't get any useful data from these failures. From other reports on the forums, some of the pd1 tasks are failing while others are succeeding, so the results may hold useful lessons on why some fail and others don't. Also, for most participants the errors are occurring after a matter of seconds, so there is little "wasted" time (and the time you do use is granted credit by the overnight script). It would be useful, though, if one of the scientists gave us a reply to say whether the results are useful or just a case of someone missing a decimal point when setting up the batch...
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,524,889, RAC: 7,500)

> It would be useful, though, if one of the scientists gave us a reply to say whether the results are useful or just a case of someone missing a decimal point when setting up the batch...

If the scientists only want to debug this particular batch, they can use ralph@home... :-)
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,524,889, RAC: 7,500)

> Also, for most participants the errors are occurring after a matter of seconds, so there is little "wasted" time (and the time you do use is granted credit by the overnight script).

It's not only wasted time, but also wasted ADSL bandwidth (every pd1_graftsheet download is 80 MB) :-(
sgaboinc (Joined: 2 Apr 14, Posts: 282, Credit: 208,966, RAC: 0)

Hi all, I'm running a Linux host and I get some errors too (see the Minirosetta 3.52 thread). Fortunately for me, even the tasks that errored out ran to completion. I did not get credit at first, but later I found that credit is allocated to the task itself, which probably means the job ran completely. It apparently has to do with some null pointer errors and seems to affect this particular job. However, I do see cases where the same task was reassigned to me after terminating on a Windows host, some almost as soon as they started.

For Windows users: have you tried resetting the project so that the Rosetta apps and database are downloaded again? Perhaps we could collect some of that feedback in this thread, e.g. if resetting solves the issue, it might just be the solution.
Murasaki (Joined: 20 Apr 06, Posts: 303, Credit: 511,418, RAC: 0)

> I did not get credit at first, but later I found that credit is allocated to the task itself, which probably means the job ran completely.

There is an automatic script that runs each night to award credit to tasks that ended in an error. Because the script is a modification of the normal BOINC process, the granted credit only shows up on the task page.

> For Windows users: have you tried resetting the project so that the Rosetta apps and database are downloaded again?

It is just a problem with this batch. Other tasks are running fine, so it is unlikely to be a problem with the app or database files. Also, the fact that so many Windows users are affected suggests that something is wrong with the task design; these things usually turn out to be a scientist missing a decimal place or leaving a stray reference in one of the task calculations.

> If the scientists only want to debug this particular batch, they can use ralph@home... :-)

Sorry, I was a little unclear in my earlier comment. I don't think the scientists deliberately released a bad batch; I was trying to point out that the limited results from this batch could still be useful despite the problems.
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,524,889, RAC: 7,500)

> Sorry, I was a little unclear in my earlier comment. I don't think the scientists deliberately released a bad batch; I was trying to point out that the limited results from this batch could still be useful despite the problems.

I have over 300 messages and 400k points on Ralph, so I know what a beta test is. Like you, I think they didn't release a bad batch deliberately. But I also think that a "stop" is the best solution; after that, move the code to Ralph and test it thoroughly.
sgaboinc (Joined: 2 Apr 14, Posts: 282, Credit: 208,966, RAC: 0)

I'm making some guesses as to whether things might have improved. I have an instance of pd1_graftsheet_41limit that runs without errors:
https://boinc.bakerlab.org/rosetta/result.php?resultid=673049941
Apparently the same task errored out when someone else ran the same(?) job:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=610410753
I guess it is hit and miss for now. I agree with [VENETO] boboviz: at least reduce the number of pd1_graftsheet_41limit tasks being pushed out, fix the issues, and perhaps beta test them. Each task pushes out 80 MB; when a task fails it gets reassigned, and with hundreds of jobs that may add up to (tens of) gigabytes of wasted bandwidth.
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,524,889, RAC: 7,500)

> I agree with [VENETO] boboviz: at least reduce the number of pd1_graftsheet_41limit tasks being pushed out, fix the issues, and perhaps beta test them.

I continue to receive a lot of pd1_graftsheet WUs (all errors). Do the admins read the forum??
krypton, Volunteer moderator / Project developer / Project scientist (Joined: 16 Nov 11, Posts: 108, Credit: 2,164,309, RAC: 0)

> I agree with [VENETO] boboviz: at least reduce the number of pd1_graftsheet_41limit tasks being pushed out, fix the issues, and perhaps beta test them.

I contacted the scientist responsible for this particular batch of jobs. He is looking at it now. Thank you for reporting the errors!
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,524,889, RAC: 7,500)

> I contacted the scientist responsible for this particular batch of jobs. He is looking at it now. Thank you for reporting the errors!

Thank you!!
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,524,889, RAC: 7,500)

> I contacted the scientist responsible for this particular batch of jobs. He is looking at it now. Thank you for reporting the errors!

I continue to receive this kind of WU (with errors). Please stop it.
krypton, Volunteer moderator / Project developer / Project scientist (Joined: 16 Nov 11, Posts: 108, Credit: 2,164,309, RAC: 0)

Hi VENETO, I talked to the researcher starting these jobs... Apparently the failure is "normal" in that the job tries different parameters, and if a model doesn't pass the filters it doesn't return a result. The issue is that BOINC reports an error when no result is seen after the run, even though the fact that nothing passed the filter is itself a result! Future jobs will behave more nicely and not appear to crash. This batch is almost complete.
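To make that failure mode concrete, here is a minimal sketch of what such a filtered-sampling run might look like. It is not actual Rosetta code; sample_trajectory, passes_filters, the filter threshold, and the result.out filename are all invented for illustration. The point is that if nothing clears the filter, no output file is ever written, and the client can only interpret the missing result as an error:

```cpp
#include <cstdlib>
#include <fstream>
#include <random>

// One stochastic trajectory, stood in for here by a random "score".
double sample_trajectory(std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(rng);
}

// Hypothetical strict filter: most trajectories fail it.
bool passes_filters(double score) { return score > 0.95; }

int main() {
    std::mt19937 rng(std::random_device{}());
    bool wrote_result = false;
    std::ofstream out;
    for (int attempt = 0; attempt < 10; ++attempt) {
        double score = sample_trajectory(rng);
        if (!passes_filters(score)) continue;  // discard: nothing is saved
        if (!wrote_result) out.open("result.out");
        out << "model score " << score << "\n";
        wrote_result = true;
    }
    // On many runs nothing passes, result.out is never created, and the
    // client treats the absent file as a task error, even though
    // "nothing passed the filters" is itself a scientific result.
    return wrote_result ? EXIT_SUCCESS : EXIT_FAILURE;
}
```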
Indigo (Joined: 5 Dec 07, Posts: 1, Credit: 133,409, RAC: 0)

Hi all. This is my code, and it is doing exactly what it is supposed to do. It turns out that BOINC reports an "error" when no protein structure is returned. The reason many jobs are not returning structures is that I'm using a strategy called "dead-end elimination." The reason we need massively parallel computing/simulations is that it's impossible to know ahead of time whether a single simulation will return the results we need (i.e. "stochastic sampling"). However, after a simulation has run for a while, we can sometimes tell that it's not going anywhere, and it's a waste of everyone's resources to continue it and save the output, so a new job is spawned with a different starting point.

I've been a Rosetta developer and scientist for years, but this is my first time using R@H instead of our own supercomputing clusters. I'm going to reconfigure my job submission strategy to play nicer with BOINC's point system.

Thanks, everybody!
Chris (Indigo King), Bakerlab
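One way such dead-end runs could be made to look like successes to the client, in line with what Indigo describes, is to always emit a result file, even one that only records that the trajectory was abandoned. The following is a minimal sketch under that assumption; finalize_result, TrajectoryStatus, and the REMARK text are invented names, not actual Rosetta or BOINC APIs:

```cpp
#include <fstream>

enum class TrajectoryStatus { Converged, DeadEnd };

// Always write the expected output file, even when the trajectory was
// pruned, so the client finds a valid (if nearly empty) result rather
// than a missing file it must flag as an error.
bool finalize_result(TrajectoryStatus status, const char* out_path) {
    std::ofstream out(out_path);
    if (!out) return false;  // a genuine I/O failure, worth reporting
    if (status == TrajectoryStatus::DeadEnd) {
        // Sentinel record: "no structure passed the filters" is data too.
        out << "REMARK DEAD_END no_structure_passed_filters\n";
    } else {
        out << "REMARK CONVERGED\n";  // real decoys would be written here
    }
    return true;
}

int main() {
    // A dead-end trajectory still produces a file and a zero exit code.
    return finalize_result(TrajectoryStatus::DeadEnd, "result.out") ? 0 : 1;
}
```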
sgaboinc (Joined: 2 Apr 14, Posts: 282, Credit: 208,966, RAC: 0)

Hi Indigo, thanks for coming out into the open and sharing this information; I'd guess many participants (including me) appreciate the feedback very much :) I'd also think there may be ways to improve the credit system, or even minirosetta itself, especially so that volunteers on the Windows platform get some credit for these 'special case' failures. After all, those participants may have downloaded multiples of the 80 MB start files only to 'crash' (some almost on starting up) :)
Trotador (Joined: 30 May 09, Posts: 108, Credit: 291,214,977, RAC: 2)

Yes, thank you for posting the explanations, Indigo. It is really appreciated.
sgaboinc (Joined: 2 Apr 14, Posts: 282, Credit: 208,966, RAC: 0)

Hi Indigo, take a look at this workunit:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=612393502
When a different (Windows) PC ran it, it apparently 'crashed':
https://boinc.bakerlab.org/rosetta/result.php?resultid=675364476
The task was then reassigned to my (Linux) PC:
https://boinc.bakerlab.org/rosetta/result.php?resultid=675380858
and it apparently generated 108 decoys/models with no errors. I'd just like to bring up these details in case they are interesting, or perhaps suggest a problem beyond simply not finding structures.
LumenDan (Joined: 26 Apr 07, Posts: 3, Credit: 5,549,929, RAC: 1,065)

> It turns out that BOINC reports an "error" when no protein structure is returned.

It is a relief to know that the failed units were not required to continue computation and have in fact contributed to the scientific process in a meaningful way. From an application programmer's point of view, though, reporting an error is one thing; generating an access violation (on Windows) is another:

"Reason: Access Violation (0xc0000005) at address 0x00757DEB write attempt to address 0x00000000"

When this error occurs, the Rosetta application (or the BOINC core) has definitely crashed and left the operating system to clean up the mess. I don't think a write to a null pointer should ever be considered normal behaviour, and I hope the fault can be avoided in new batches or future releases of minirosetta. Please add a null pointer check to the application, or create a placeholder structure to return when a dead-end calculation terminates, to avoid fatal exceptions.

My personal reaction when I see batches with ongoing failures is to question whether my computer configuration is at fault and whether there is something I need to change to avoid returning bad results. Your responses have certainly put me at ease in that respect :).

Thanks for considering credit allocation for dead-end units. All of mine seem to have failed in the first 30 seconds, so I didn't expect any, but in cases where more substantial computing time has been invested, a lack of due credit could skew people away from certain batches.

Best regards,
LumenDan
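For illustration, the guard LumenDan asks for is a single test before the write; the "write attempt to address 0x00000000" in the log above is exactly an unchecked write through a null pointer. The sketch below uses invented names (Structure, emit_result) and is not minirosetta code:

```cpp
#include <cstdio>

struct Structure { double score; };

// Unguarded version (the likely crash): if the dead-end path left
// `best` null, `best->score = final_score;` writes to address 0 and
// raises 0xc0000005 on Windows.
void emit_result(Structure* best, double final_score) {
    if (best == nullptr) {
        // No structure survived: report the dead end as an expected
        // outcome instead of dereferencing a null pointer.
        std::fprintf(stderr, "no structure passed filters; exiting cleanly\n");
        return;
    }
    best->score = final_score;
    std::printf("final score: %.2f\n", best->score);
}

int main() {
    emit_result(nullptr, -123.4);  // dead-end case: clean message, no crash
    return 0;
}
```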