minirosetta 2.05

Author	Message
Sarel Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0	Message 65159 - Posted: 31 Jan 2010, 20:32:50 UTC - in response to Message 65157.
Hello, yes! The critstubs jobs have the same sort of unevenly credited trajectories as the gbnnotyr jobs. You can have a look at Eva-Maria Strauch's message on the Protein-interface design thread for details.
> Some tasks (for example, the abinitio_relax_homfrag_natfrag .... type) take a lot of steps, up to 200000 - 400000 for 1 model. Is this normal?
> And on another type of job (critStubs_profiled_1dnA_...), one task received abnormally little credit (1.97 cr and 9 decoys for 7k CPU seconds): https://boinc.bakerlab.org/rosetta/result.php?resultid=314040179
> Other WUs of the same type look normal (about 20 cr and 80 decoys for the same CPU time). Is this a bug with partial loss of the calculation results? Or is crediting simply uneven for this type of task? (As with the gbnnotyr tasks, where "small" and "huge" models are combined in the same type of task.)

Sarel Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0	Message 65160 - Posted: 31 Jan 2010, 21:43:31 UTC - in response to Message 65159.
Rosetta @ Home has produced many very high-quality designs for our Protein-interface design team! So we're likely to submit many more jobs to Rosetta @ Home. To help you recognize these jobs, from now on we'll add a _Protein_Interface_Design_ note to the name of every related job. This way you'll be able to follow these jobs. I also hope that this will make it easier to see where the variable-credit issue is coming from.
> Hello, yes! The critstubs jobs have the same sort of unevenly credited trajectories as the gbnnotyr jobs. You can have a look at Eva-Maria Strauch's message on the Protein-interface design thread for details.
> > Some tasks (for example, the abinitio_relax_homfrag_natfrag .... type) take a lot of steps, up to 200000 - 400000 for 1 model. Is this normal?
> > And on another type of job (critStubs_profiled_1dnA_...), one task received abnormally little credit (1.97 cr and 9 decoys for 7k CPU seconds): https://boinc.bakerlab.org/rosetta/result.php?resultid=314040179
> > Other WUs of the same type look normal (about 20 cr and 80 decoys for the same CPU time). Is this a bug with partial loss of the calculation results? Or is crediting simply uneven for this type of task? (As with the gbnnotyr tasks, where "small" and "huge" models are combined in the same type of task.)

fredmeyer2470 Joined: 6 Jun 09 Posts: 1 Credit: 1,741,466 RAC: 0	Message 65161 - Posted: 31 Jan 2010, 23:20:50 UTC
The Rosetta application is spinning its wheels. It keeps running a task even though the task is 100% complete. There is another task to run, but Rosetta won't switch to it.

Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 29,766,470 RAC: 6,958	Message 65162 - Posted: 1 Feb 2010, 3:14:03 UTC
To Sarel: Thanks for the explanation. And what about this?
> Some tasks (for example, the abinitio_relax_homfrag_natfrag .... type) take a very large number of steps, up to 200000 - 400000 for 1 model. Is this normal?
And at the same time, another note: it seems that jobs of this type ignore the target CPU time: resa_sel_core_1.5_low200_beta_low200_nostart_texcst_05_hb_t328__IGNORE_THE_REST_17378_267_0
For example, this WU spent about 2.5 hours calculating 1 model (already longer than the target time), but after the 1st model, instead of sending the result, it started calculating a 2nd model. Total: 18850 seconds vs. cpu_run_time_pref = 7200 seconds.
In this example all ended well, but in other circumstances this could exceed cpu_run_time_pref by more than 3 times, trigger the watchdog, and lose the results. In addition, some members may think that the task is stuck and abort it...

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 65165 - Posted: 1 Feb 2010, 16:27:56 UTC
Max, the watchdog should be expected to kick in if a model exceeds the target runtime by more than 4 hours. So with a short 2 hour preference, you will tend to see much more variation in runtimes than if you had a longer preference.
However, I would expect that once the target runtime is exceeded and a model is completed, work on the WU should end and it should be reported back. So if you have a 2hr preference, and the first model took more than 2hrs to complete, I would have expected the WU to end once the first model was completed. In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, so I would expect that scenario to end after the first model as well.
Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (Looking at the WU you linked, it appears to have had the 7200sec preference during the entire run.)
So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful.
Rosetta Moderator: Mod.Sense

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 65169 - Posted: 1 Feb 2010, 21:39:59 UTC
A couple of t287__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901 WUs on two different Linux machines failed after a few seconds, reporting "process got signal 11".
https://boinc.bakerlab.org/rosetta/result.php?resultid=314826769
https://boinc.bakerlab.org/rosetta/result.php?resultid=314751622

Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 29,766,470 RAC: 6,958	Message 65170 - Posted: 2 Feb 2010, 1:09:41 UTC
To Mod.Sense: Thanks for the clarification on the watchdog. Previously I had seen it kick in after 6 hours of calculation and thought that it fired after exceeding the CPU target time (CPU TT) x 3 (2h * 3 = 6h in my case). So in fact the correct formula is CPU TT + 4h, right? (In my case it just happens to give the same result: 2h + 4h = 6h.)
> In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well.
Yes, that is usually what happens. Here's an example of such a task: https://boinc.bakerlab.org/rosetta/result.php?resultid=313861637
Calculating the 1st model took 5145 sec and the program ended processing, because a second model would have exceeded the CPU TT (5145 * 2 = 10290 > 7200).
Or another example: https://boinc.bakerlab.org/rosetta/result.php?resultid=314455813
Calculating two models took 4995 sec and the program ended processing, because a third model would have exceeded the CPU TT ((4995 / 2) * 3 ≈ 7492 > 7200).
In these cases (and most others) the program's logic works correctly. But in the example above, this algorithm seems to fail.
> Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (Looking at the WU you linked, it appears to have had the 7200sec preference during the entire run.) So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful.
No, I have not changed my runtime preference in the last 2 weeks. I have no more recent examples yet, but before that I had 2 other tasks that also seemed to ignore the runtime preference (although I'm not 100% sure about it, because I did not follow their progress; perhaps just a 1st model was designed quickly, and the last took much longer than expected...). Here they are:
cst2.loopbuild_threading_hb_i1496_IGNORE_THE_REST_17154_387_0
t364__boinc_filtered_loopbuild_threading_cst_all_tex_IGNORE_THE_REST_16902_4455_0
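
The stopping rule Mad_Max works through above can be written down as a small check. This is only a sketch inferred from the arithmetic in the posts (average CPU time per finished model, extrapolated to one more model); the function name and parameters are hypothetical and this is not the actual minirosetta code.

```python
def should_start_next_model(cpu_seconds_used: float,
                            models_done: int,
                            target_seconds: float) -> bool:
    """Guess whether one more model fits inside the target CPU time.

    Assumption taken from the posts: the next model is expected to cost
    the average of the models finished so far.
    """
    if models_done == 0:
        return True  # nothing to extrapolate from yet
    average_per_model = cpu_seconds_used / models_done
    predicted_total = average_per_model * (models_done + 1)
    return predicted_total <= target_seconds


TARGET = 7200  # cpu_run_time_pref quoted in the thread, in seconds

# The two well-behaved examples from the post:
print(should_start_next_model(5145, 1, TARGET))  # 5145 * 2 = 10290 > 7200
print(should_start_next_model(4995, 2, TARGET))  # (4995 / 2) * 3 = 7492.5 > 7200
```

Both calls report that no further model should be started, which matches the behaviour described for those two results. Under the same rule, a task whose first model already took about 2.5 hours (roughly 9000 seconds) would also be expected to stop, which is why the resa_sel_core task looks anomalous.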

KnopperHarley Joined: 1 Nov 06 Posts: 2 Credit: 788,560 RAC: 0	Message 65175 - Posted: 2 Feb 2010, 11:18:57 UTC
Hey there! I've got a problem with two tasks at the moment. Yesterday I wondered why the remaining time was set to 30.5h per WU when I saw it, but I didn't care about it ... perhaps a test with more work per WU ... who knows. ;-)
But now one task is 'stuck' at 58.285% (+0.002% in more than 12h now) and the other one at 82.419% work done. The runtimes for these WUs are at around 28h and 11.75h, and both elapsed and remaining time keep counting up -_- . So I asked the task manager for help and it says the following: these two WUs are using 218mb and 300mb of memory ... and no longer using ANY CPU resources ... 0% for both (CPU time is still counting up at 1sec/sec).
Did something go wrong on my PC while crunching? Or what's the matter here?
Tasks:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=286264240
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=287080918
greetings
PS: both paused for now

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 65177 - Posted: 2 Feb 2010, 15:21:50 UTC Last modified: 2 Feb 2010, 15:24:05 UTC
Max:
> perhaps just a 1st model was designed quickly, and the last took much longer than expected
Right, and that is exactly what Sarel's new tasks do. Run 5 models in 5 minutes, then hit one that looks interesting and run it for (for example) 80 minutes. Now 6 models have been completed in 85 minutes, and with a 2hr runtime preference we guess we can complete more models in the 2 hours. If that next one happens to be interesting as well, you run long. Some of the improvements Sarel is making and working on will help the longer models run faster. So this should avoid some of those that were taking several hours for a single model, and make completion times closer to your preference.
Yes, Max. The watchdog USED to be based on 4 times the runtime preference. This was fine for short runtime preferences, but those with the preference set to over 12 hours wanted to kill the task sooner and get on with others. Now it is runtime pref. plus 4 hrs, with the thought that all properly running models will complete in less than 4 hours. The rest of what you say sounds right to me. So keep watching and jotting notes. This information will be key to resolving this issue.
KnopperHarley: This is one of the few remaining problems that some people are seeing in version 2.05. It seems to be rather rare, and perhaps to occur only on Windows. I see you are running Win XP (I highlight that just to make it easy for the Project Team to see it, not because it should be a problem). I believe suspending and resuming the tasks seems to get them going again. Could I ask you how your machine is configured? Specifically, do you leave tasks in memory while preempted? Do you run other BOINC projects? Do you allow BOINC to run 100% of CPU? Do you power your machine off each day?
Rosetta Moderator: Mod.Sense
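
To make the difference between the two watchdog rules in the post above concrete, here is a minimal sketch. The function names are hypothetical and the formulas are taken only from Mod.Sense's description (old: 4 times the runtime preference; new: runtime preference plus 4 hours), not from the Rosetta source.

```python
HOUR = 3600  # seconds


def watchdog_limit_old(runtime_pref_s: int) -> int:
    """Old rule as described in the thread: 4 times the runtime preference."""
    return 4 * runtime_pref_s


def watchdog_limit_new(runtime_pref_s: int) -> int:
    """Current rule as described in the thread: runtime preference plus 4 hours."""
    return runtime_pref_s + 4 * HOUR


# Compare both rules for a short, a long, and a very long preference.
for pref_hours in (2, 12, 24):
    pref = pref_hours * HOUR
    old_h = watchdog_limit_old(pref) // HOUR
    new_h = watchdog_limit_new(pref) // HOUR
    print(f"pref {pref_hours}h: old limit {old_h}h, new limit {new_h}h")
```

For a 2 hour preference the new rule gives 6 hours, which agrees with the cutoff Mad_Max observed. For long preferences the new limit is far shorter than the old one, which is the motivation given in the post.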

KnopperHarley Joined: 1 Nov 06 Posts: 2 Credit: 788,560 RAC: 0	Message 65178 - Posted: 2 Feb 2010, 15:47:13 UTC
Uhm, well ... I tried a few things (restarted BOINC) and (you might guess): it works. ^^' CPU time jumped back to 3h and 6h or something and it's using the cores again. Seems like something really screwed up the Rosetta apps while working. So never mind ... ignore my posting above. ;-) I lost a bit of time, but the WUs are obviously (hopefully?!) undamaged and one has been completed in the meantime, so happy crunching again. o/
greetings
PS: Would it make sense to send the WUs a second time to another participant to confirm the results ... just to be sure?! Especially the second WU mentioned in my post above (probably more than 7.5h in the end) plus another WU with almost 6.75h (t293__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901_4919) that finished last night are, let's say ... (maybe not impossible but) 'unusual' (to me :-) ).
PPS: for the record g
- Leave applications in memory while suspended? no
- Rosetta + SETI (50:50)
- Use at most 100 percent of CPU time
- it's off almost every day for a period of time (except on the weekend once in a while)

Greg_BE Joined: 30 May 06 Posts: 5765 Credit: 6,139,760 RAC: 45	Message 65182 - Posted: 2 Feb 2010, 23:12:43 UTC
compute error
t323__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_2006_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=314347348
<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
]]>

Greg_BE Joined: 30 May 06 Posts: 5765 Credit: 6,139,760 RAC: 45	Message 65183 - Posted: 2 Feb 2010, 23:14:45 UTC
compute error with unhandled exception dump
https://boinc.bakerlab.org/rosetta/result.php?resultid=310017128
homopt_fa_cstmc_1.t370_.t370_.IGNORE_THE_REST.S_00003_0000784_010.pdb.JOB_16898_23_0
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x7C90120E

l_mckeon Joined: 5 Jun 07 Posts: 44 Credit: 180,717 RAC: 0	Message 65184 - Posted: 3 Feb 2010, 0:40:23 UTC
I aborted lr15clus_opt_.1ptq.1ptq.IGNORE_THE_REST.c.3.21.pdb.pdb.JOB_17471_3_0 after 50 minutes. Stuck on model 1, step 0, with funny looking graphics. I no longer have the patience to see how these turn out.

Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0	Message 65187 - Posted: 3 Feb 2010, 11:54:35 UTC - in response to Message 65184.
> I aborted lr15clus_opt_.1ptq.1ptq.IGNORE_THE_REST.c.3.21.pdb.pdb.JOB_17471_3_0 after 50 minutes. Stuck on model 1, step 0, with funny looking graphics. I no longer have the patience to see how these turn out.
Instead of aborting, just try closing and restarting BOINC. That often does the trick.

John Hunt Joined: 18 Sep 05 Posts: 446 Credit: 200,755 RAC: 0	Message 65189 - Posted: 3 Feb 2010, 15:20:10 UTC Last modified: 3 Feb 2010, 15:21:06 UTC
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=287053961 has been running now for 56 hrs and is still only 57.019% complete. Core2Quad Q6600 @ 2.4GHz & Windows XP Home. Keep going or abort?

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 65190 - Posted: 3 Feb 2010, 16:50:20 UTC
> Keep going or abort?
As Evan points out, often such conditions get reset if you suspend and resume the task, or end and restart BOINC... But first, I'd like to ask you to go to the advanced view, tasks tab, select the task that's been running so long, and then click the properties button that appears over on the left. There are three time figures there that I would like you to report:
CPU time at last checkpoint:
CPU time:
Elapsed time:
It will take you a minute or so to jot that down. Then close the window, click the properties button for the task again, and see if the CPU time has changed at all.
Rosetta Moderator: Mod.Sense

John Hunt Joined: 18 Sep 05 Posts: 446 Credit: 200,755 RAC: 0	Message 65191 - Posted: 3 Feb 2010, 17:52:51 UTC Last modified: 3 Feb 2010, 18:12:49 UTC
O.K. I've suspended the WU and then restarted it. Here are the figures requested.
When suspended:
CPU time at last checkpoint: 02:05:26
CPU time: 02:05:27
Elapsed time: 58:38:24
After restart:
CPU time at last checkpoint: 02:05:26
CPU time: 02:10:22
Elapsed time: 58:43:35
WU completed shortly afterwards with a computation error. Thank you!
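
The figures reported above can be compared mechanically. The sketch below only illustrates the idea behind Mod.Sense's request (check whether CPU time moves between two readings, and how small it is relative to elapsed time). The helper names are hypothetical and nothing here is part of BOINC; note also that the two readings in the post were taken before and after a suspend and resume, not a minute apart.

```python
def hms_to_seconds(text: str) -> int:
    """Convert an HH:MM:SS string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_is_advancing(cpu_before: str, cpu_after: str) -> bool:
    """True if the CPU time grew between two readings of the same task."""
    return hms_to_seconds(cpu_after) > hms_to_seconds(cpu_before)


def cpu_share(cpu_time: str, elapsed_time: str) -> float:
    """Fraction of the elapsed wall-clock time actually spent on the CPU."""
    return hms_to_seconds(cpu_time) / hms_to_seconds(elapsed_time)


# Readings from the post: while suspended, then after the restart.
print(cpu_is_advancing("02:05:27", "02:10:22"))
print(f"{cpu_share('02:05:27', '58:38:24'):.1%} of elapsed time was CPU time")
```

With these numbers the CPU time does advance after the restart, and only a few percent of the roughly 58 elapsed hours were spent computing, which is consistent with a task that had stalled rather than one that was simply slow.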

Admin Joined: 13 Apr 07 Posts: 42 Credit: 260,782 RAC: 0	Message 65195 - Posted: 4 Feb 2010, 1:09:32 UTC
Just took a look at my graphics and saw this. Is it normal? I've been watching it for a while now and it seems to be stuck on model 2, step 0. Any ideas on what I should do?

Admin Joined: 13 Apr 07 Posts: 42 Credit: 260,782 RAC: 0	Message 65198 - Posted: 4 Feb 2010, 3:08:44 UTC
Strange, it seems to be fine now. You can disregard my earlier post.

Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 29,766,470 RAC: 6,958	Message 65208 - Posted: 4 Feb 2010, 20:14:30 UTC Last modified: 4 Feb 2010, 20:24:06 UTC
I do not think that should be ignored. This type of task is behaving very strangely on my computer, too. Here's an example where the protein is coiled into a ring:
The model has already been in this state for about 30 minutes. Sometimes the ring begins to unfold, but then it rolls back into a ring.