Message boards : Number crunching : Bad WU Batch - "TEMP_" all fail
Author | Message |
---|---|
pterosoft Send message Joined: 27 Nov 09 Posts: 1 Credit: 1,447,216 RAC: 0 |
I ran into a bad WU batch yesterday. Each WU ran for almost 3 hours and then failed at the very end with the same error. I had 9 WUs across 3 different systems fail this way, and the wingmen on each WU are failing with the same error. I finally aborted the 3 I had left from the batch. WUs from other batches are completing just fine. That is a lot of wasted CPU, since these run for 3 hours before failing. All of the names start with "TEMP_", and the application is Rosetta Mini 2.16. The error is: Compute error, Too many error results, <error_code>-161</error_code>. Here are the bad WUs I know about: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343254772 TEMP_5_control_1ctf__SAVE_ALL_OUT_22400_1; https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343274715 TEMP_0.05_control_1ew4A_SAVE_ALL_OUT_22400_38; https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343297411 TEMP_50_control_1b3aA_SAVE_ALL_OUT_22400_34; https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343307185 TEMP_0.01_control_1cg5B_SAVE_ALL_OUT_22400_97; https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343320953 TEMP_0.05_control_1tul__SAVE_ALL_OUT_22400_65; https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343325313 TEMP_10_control_1urnA_SAVE_ALL_OUT_22400_73; https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343335483 TEMP_5_control_1enh__SAVE_ALL_OUT_22400_92; https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343322361 TEMP_1_control_1dhn__SAVE_ALL_OUT_22400_68; https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343254959 TEMP_0.1_control_1c9oA_SAVE_ALL_OUT_22400_2; https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343269869 TEMP_0.05_control_1vie__SAVE_ALL_OUT_22400_28; https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343275297 TEMP_0.5_control_1c9oA_SAVE_ALL_OUT_22400_38; https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343311485 TEMP_1_control_1enh__SAVE_ALL_OUT_22400_49 |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I see what you mean, and it is definitely not just you. It appears there is a problem with the templates used to build these. I've asked the Project Team to look into these TEMP_ tasks. This probably explains why folks are having trouble getting new tasks today. Rosetta Moderator: Mod.Sense |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 137 |
<message> https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343331790 There are others, look at my results. >>> since they run for 3 hours before failing. 3? You were lucky! Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Does anyone know if the decoys generated by these tasks contain any valid data which can be salvaged? If this is just a "shutdown and reporting" problem but good data is being generated, I'll let them run. I don't care about the loss of credits. However, if the output is totally FUBAR and unusable, then I might as well set up a simple script to abort them before they burn five or six hours of CPU time. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
By the name of the file, the fact that they ran to completion, and the fact that the task fails, it seems to me as though the output file is not reaching the server... or is not being produced. My TEMP_ tasks are suspended, awaiting further word. But I am inclined to believe they are unusable. The output file that is not being received on the server is what carries your results. Without it, I see no good beyond correcting the problem and getting it on a list of problems to avoid in the future. Rosetta Moderator: Mod.Sense |
Robert Send message Joined: 10 Nov 05 Posts: 5 Credit: 9,609 RAC: 0 |
I'm looking into this now. The protocol appears to be producing useful structures; I think there's an issue with validation. Definitely feel free to cancel these work units if you'd like. We may still be able to validate these results, but there's definitely no harm if you run other work units instead. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Robert, correct me if I'm mistaken here, but you aren't getting a file to validate... are you? At least that's what the message would imply. Rosetta Moderator: Mod.Sense |
Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
OK - I took the middle ground - those "TEMP" tasks which are running will be allowed to complete as best they can - however, I just filtered out the "TEMP" tasks in a "ready to start" state. Just a little side note - during the short time I processed SETI before discovering Rosetta, I noted that when you displayed a system's task list, you could further select by status - ready to start, awaiting validation, and ERROR. It would be so handy at times to be able to display just the tasks which had an error instead of scrolling through page after page of tasks for a dozen computers. |
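The filtering described in the post above (let running "TEMP" tasks finish, drop the ones that have not started) could be scripted roughly as follows. This is only a sketch under stated assumptions, not what the poster actually ran: it assumes `boinccmd --get_tasks` prints each task as a numbered block of `key: value` lines containing `name`, `project URL`, and `active_task_state` fields, and that a task which has not started reports `UNINITIALIZED` (or omits the field). The exact output layout differs between BOINC client versions, so check it on your own machine first. The helper names `parse_tasks`, `select_unstarted`, and `abort_commands` are made up for illustration.

```python
import re
import subprocess


def parse_tasks(text):
    """Parse `boinccmd --get_tasks` style output into one dict per task.

    ASSUMPTION: each task starts with a line like "1) -----------" and is
    followed by indented "key: value" lines.
    """
    tasks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if re.match(r"^\d+\) -+$", line):
            # Start of a new task record.
            current = {}
            tasks.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key.strip()] = value.strip()
    return tasks


def select_unstarted(tasks, prefix="TEMP_"):
    """Keep tasks whose name starts with `prefix` and that have not started.

    ASSUMPTION: a task that never ran reports active_task_state
    UNINITIALIZED, or has no such field at all.
    """
    return [
        t for t in tasks
        if t.get("name", "").startswith(prefix)
        and t.get("active_task_state", "UNINITIALIZED") == "UNINITIALIZED"
    ]


def abort_commands(tasks):
    """Build one `boinccmd --task <url> <name> abort` command per task."""
    return [
        ["boinccmd", "--task", t["project URL"], t["name"], "abort"]
        for t in tasks
    ]


def main():
    """Dry run: print the abort commands instead of executing them."""
    result = subprocess.run(
        ["boinccmd", "--get_tasks"],
        capture_output=True, text=True, check=True,
    )
    for cmd in abort_commands(select_unstarted(parse_tasks(result.stdout))):
        print(" ".join(cmd))
```

Calling `main()` only prints the commands, so the list can be reviewed before anything is aborted; running each printed line by hand (or passing it to `subprocess.run`) performs the actual abort.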
Bernd Schnitker Send message Joined: 2 Jan 09 Posts: 10 Credit: 62,009 RAC: 0 |
I have one of these that errored out. Hope folks will be able to fix them. Both I and my wingman had an error at the end. |
Tom Send message Joined: 8 Oct 06 Posts: 8 Credit: 1,533,336 RAC: 0 |
Is it alright to process TEMP_ WU's? I have one in my queue and I'm wondering if I should throw it out. |
TeAm Enterprise Send message Joined: 28 Sep 05 Posts: 18 Credit: 27,907,583 RAC: 67 |
All the WUs I ran yesterday that started with TEMP also failed, as did the wingmen on all the ones I checked. Quoting pterosoft: "I ran into a bad WU batch yesterday. Each ran for almost 3 hours and then failed at the very end, with the same error. I had 9 WU across 3 different systems fail this way. Wingmen on each WU are also failing with same error. I finally aborted the 3 I had left from the batch. Other WU from other batches are completing just fine. Lot of wasted CPU with this batch, since they run for 3 hours before failing." Crunch with friends - TeAm Anandtech |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
...that was what I figured would happen, but wasn't positive. They cut another application version to fix the validator, but it can't correct the problem with the existing work units. So, Tom, it is "alright", as in they don't do any harm to your machine or anything. But I don't believe there is any chance they will get credit. So I would suggest aborting the TEMP_ tasks from Rosetta version 2.16. Rosetta Moderator: Mod.Sense |
Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
"I would suggest aborting the TEMP_ tasks from Rosetta version 2.16." Rosetta v2.17 does not seem to make them any better; here is the stderr_txt from task 376369621:
. |
[AF>france>pas-de-calais]symaski62 Send message Joined: 19 Sep 05 Posts: 47 Credit: 33,871 RAC: 0 |
http://www.boinc-wiki.info/Error_Code ERR_NOT_FOUND || -161 || not found || inconsistent client state :) |
[AF>france>pas-de-calais]symaski62 Send message Joined: 19 Sep 05 Posts: 47 Credit: 33,871 RAC: 0 |
http://www.boinc-wiki.info/Error_Code http://boincfaq.mundayweb.com/index.php?view=77&sessionID=a4d6f938094d567da63ac51e1a31dfb4 ERR_NOT_FOUND -161 This happens when you have an inconsistent client_state.xml file. Files aren't written to it. Task not found would be the error message. |
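For anyone scanning many failed results, the `<error_code>` tag quoted in the first post can be pulled out and matched against the name given above. A minimal sketch: the lookup table holds only the single code mentioned in this thread (-161, ERR_NOT_FOUND) and is not a complete list of BOINC error codes, and the function names are made up for illustration.

```python
import re

# Only the code quoted in this thread; NOT a complete BOINC error table.
KNOWN_CODES = {-161: "ERR_NOT_FOUND"}


def error_code(report):
    """Return the integer inside <error_code>...</error_code>, or None."""
    match = re.search(r"<error_code>\s*(-?\d+)\s*</error_code>", report)
    return int(match.group(1)) if match else None


def describe(report):
    """Summarize the error code found in a result report, if any."""
    code = error_code(report)
    if code is None:
        return "no error code found"
    return f"{code} ({KNOWN_CODES.get(code, 'unknown')})"
```

For example, `describe("Compute error Too many error results <error_code>-161</error_code>")` yields the code together with its name.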
©2024 University of Washington
https://www.bakerlab.org