Message boards : Number crunching : Problem with task "exited with zero status but no 'finished' file" error
Author | Message |
---|---|
sirlampsalot Joined: 18 Feb 11 Posts: 1 Credit: 336,823 RAC: 0 |
Hi - I am a very occasional Rosetta cruncher. This may have happened in the past, but today was the first time I noticed this error, which is reported for every work unit: "3/8/2015 8:40:28 AM | rosetta@home | Task rb_03_07_54085_99588_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_246104_1456_0 exited with zero status but no 'finished' file". I have two computers that generate this error (different hardware and OS, but both running BOINC 7.4.36). The log suggested resetting the project, which I did on both PCs. The error promptly returned after resetting. Is this an error message I can ignore? If not, how can I fix it? I found this thread, https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6586, which did not show up when I searched for "zero status but no 'finished' file". My CPU time is set to default. Many thanks |
Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
Do you have a link to the workunit? I checked the workunits on your machines and most of them seem to be completing without issue - which ones threw these errors? |
Erik Joined: 25 Jun 09 Posts: 11 Credit: 2,904,454 RAC: 0 |
Allowing BOINC to use 100% of CPU time got rid of that error for me, possibly in conjunction with a 12-hour run time. In practice, CPU usage tends to run between 75 and 90%. I also found that rebooting the machine causes an in-progress work unit to fail. |
Ananas Joined: 1 Jan 06 Posts: 232 Credit: 752,471 RAC: 0 |
I have the same problem on one computer. From what I can see, it is caused by the heavy HDD activity when unpacking and starting each Rosetta WU, combined with the outdated API version, which still has the original and much too low heartbeat tolerance setting. If your HDD is not very fast, this activity slows down the BOINC core client, and the same Rosetta workunit that caused the slowdown doesn't receive the heartbeat in time and restarts itself before it has managed to unzip the result. It also kills other Rosetta WUs in the process, since they run into the same heartbeat error. This goes on for quite a while until the core client gives up on restarting the workunit. The only other project that seems to use such an old API is Leiden Classical; their WUs also fall victim to the heartbeat bug now and then. Other projects are more robust when it comes to that problem. AFAIK, the heartbeat timeout has been increased considerably in later API versions. On the box where I have that problem I can start one Rosetta WU, and it usually manages to start, sometimes after one heartbeat crash. Trying to start a second one kills both, and neither of the two will ever recover from that restart/crash loop. P.S.: another way to fix it might be to split the compressed database into smaller parts, hoping that the core client can use the breaks between the unzips to refresh the heartbeat in shared memory, or to exclude those .gz files from the .zip file and deliver them (only the needed ones!) separately, as packing it like that is very stupid anyway. @Erik: I don't think that it is entirely fixed on your system; the WUs still show "unpacking ..." way too often. In a clean run, it would unpack the database and unpack the workunit file: two unpack commands per workunit and that's it. |
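
The heartbeat failure described in the last post can be illustrated with a small simulation. This is only a sketch under stated assumptions, not BOINC's actual code: the class `HeartbeatWatchdog`, the function `survives_stall`, the one-second tick, and all timeout and stall values are hypothetical and chosen purely for illustration. It models a client that normally refreshes a heartbeat every second but is blocked for a while by disk activity, and a worker that gives up if the heartbeat is older than its tolerance.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWatchdog:
    """Worker-side view of the heartbeat (hypothetical model, not BOINC code)."""
    timeout: float          # tolerance in seconds; illustrative value only
    last_beat: float = 0.0  # time at which the last heartbeat was seen

    def beat(self, now: float) -> None:
        # The client refreshed the heartbeat at time `now`.
        self.last_beat = now

    def expired(self, now: float) -> bool:
        # True if the heartbeat is older than the tolerance.
        return now - self.last_beat > self.timeout


def survives_stall(timeout: float, stall: int,
                   stall_start: int = 10, total: int = 120) -> bool:
    """Return True if the worker never sees an expired heartbeat.

    The simulated client beats once per second, except while it is
    blocked for `stall` seconds (for example by heavy disk I/O).
    """
    dog = HeartbeatWatchdog(timeout)
    for now in range(total):
        client_blocked = stall_start <= now < stall_start + stall
        if not client_blocked:
            dog.beat(float(now))
        if dog.expired(float(now)):
            return False  # the worker would give up and exit here
    return True


if __name__ == "__main__":
    for timeout in (30, 60):
        ok = survives_stall(timeout=timeout, stall=45)
        print(f"timeout={timeout}s, stall=45s -> survives: {ok}")
```

In this toy model a longer tolerance lets the worker ride out the same stall, which matches the suggestion in the thread that later API versions with a larger heartbeat timeout are more robust. Shortening the stall (for instance by unpacking smaller archives) has the same effect.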