Message boards : Number crunching : Mini Rosetta Version 3.41.
Author | Message |
---|---|
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,845,777 RAC: 12,122 |
No, the minirosetta app has a checkpointing feature inside one decoy, so checkpoints should work (and usually do work) with long models too. But in some cases (including the examples I listed above) it does not work. I think the calculation of these jobs just "hangs" at an early stage (before the first checkpoint) and just wastes CPU resources until the watchdog stops the work. |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
"No, minirosetta app has checkpointing feature inside one decoy, so checkpoints should work(and usually works) with long models too." From my observations: some tasks have "stages"; those checkpoint inside one decoy, and you can see that in the slot directory of such tasks (files with "stage" in the file name and "Starting work on structure: _00001" entries in std_err). Like this one. All others checkpoint at the end of each decoy, which might take a few hours to compute. "I think the calculation of these jobs just 'hangs' at an early stage(before the first checkpoint). And just waste CPU resources until the watchdog will not stop the work." As I have pointed out above, in some cases 7 hours might not be enough for 1 decoy. But such jobs should only be sent out to hosts which allow higher runtimes, not to those at default settings. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
And yet more silence from the team... as usual. Because ralph is run by the same group as rosetta and ralph is a testing platform, whatever goes wrong there goes wrong and they will see it in the results files. I gave up on ralph long ago due to too many errors and no communication. The only Baker labs stuff I am signed on for is Rosie. |
ArcSedna Joined: 23 Oct 11 Posts: 14 Credit: 69,190,403 RAC: 19,011 |
"Was there not a bug some time ago that was making work fail at 1201 seconds?" It might be. That was happening on MiniRosetta version 3.22. https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5909&nowrap=true#72471 And I'm still having these validate errors... https://boinc.bakerlab.org/rosetta/result.php?resultid=531638117 Task ID 531638117 Name lr_ab_bench_3solA_SAVE_ALL_OUT_IGNORE_THE_REST_58333_1091_2 Workunit 482120816 Created 13 Sep 2012 21:39:39 UTC Sent 13 Sep 2012 21:40:02 UTC Received 14 Sep 2012 0:09:47 UTC Server state Over Outcome Validate error Client state Done Exit status 0 (0x0) Computer ID 1530475 Report deadline 23 Sep 2012 21:40:02 UTC CPU time 239.3506 stderr out <core_client_version>6.12.43</core_client_version> <![CDATA[ <stderr_txt> [2012- 9-14 8:50:32:] :: BOINC:: Initializing ... ok. [2012- 9-14 8:50:32:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev50262.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/input_lr_ab_bench_3solA_yfsong.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... BOINC:: Worker startup. Starting watchdog... Watchdog active. 
====================================================== DONE :: 1 starting structures 1201 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: WS_max 0 BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> ]]> Validate state Invalid Claimed credit 1.04346308677662 Granted credit 0 application version 3.41 |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. I've had a few tasks now that don't seem to want to stop when they should. My selected runtime is 4 hrs, but this last one went for 7 hrs 42 mins, as you can see. Why didn't it stop at the 4 hr mark and do, say, ~100 models? These tasks are balls'n up my d.c.f. on my rigs. On a side note, the credits are not great for the runtime. ========================================================== Ebolanator3_1LOUA_ProteinInterfaceDesign_2Sep2012_58540_53_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=483693654 Setting database description ... Setting up checkpointing ... Setting up graphics native ... BOINC:: Worker startup. Starting watchdog... Watchdog active. # cpu_run_time_pref: 14400 ====================================================== DONE :: 215 starting structures 27414.5 cpu seconds This process generated 215 decoys from 215 attempts ====================================================== BOINC :: WS_max 0 BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> ]]> Validate state Valid Claimed credit 282.012672541005 Granted credit 116.338517867505 application version 3.41 |
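Editor's note: the overrun reported above can be read straight out of the stderr text. The following is a minimal Python sketch, assuming the stderr lines look like the excerpt quoted in this post; the function name `summarize_stderr` and the returned field names are hypothetical, not part of BOINC or minirosetta.

```python
import re

# Patterns modelled on the stderr excerpt quoted in the post above.
PREF_RE = re.compile(r"#\s*cpu_run_time_pref:\s*(\d+)")
DONE_RE = re.compile(r"DONE\s*::\s*(\d+)\s+starting structures\s+([\d.]+)\s+cpu seconds")
DECOYS_RE = re.compile(r"generated\s+(\d+)\s+decoys from\s+(\d+)\s+attempts")


def summarize_stderr(text):
    """Return runtime figures from a task's stderr, or None if a field is missing."""
    pref = PREF_RE.search(text)
    done = DONE_RE.search(text)
    decoys = DECOYS_RE.search(text)
    if not (pref and done and decoys):
        return None
    target = int(pref.group(1))
    cpu_seconds = float(done.group(2))
    n_decoys = int(decoys.group(1))
    return {
        "target_seconds": target,
        "cpu_seconds": cpu_seconds,
        "decoys": n_decoys,
        # How many times longer than the preferred runtime the task ran.
        "overrun_ratio": cpu_seconds / target,
        # Average CPU time per decoy; undefined when no decoy was produced.
        "seconds_per_decoy": cpu_seconds / n_decoys if n_decoys else None,
    }


SAMPLE = (
    "Watchdog active. # cpu_run_time_pref: 14400 "
    "DONE :: 215 starting structures 27414.5 cpu seconds "
    "This process generated 215 decoys from 215 attempts"
)

summary = summarize_stderr(SAMPLE)
print(summary)
```

For the task quoted here this gives an overrun ratio of about 1.9 (27414.5 CPU seconds against a 14400 second preference) and an average of roughly 127.5 seconds per decoy. An average hides any single slow model, so it says nothing about which decoy caused the overrun.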
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,845,777 RAC: 12,122 |
Hi. It's an old known issue: this type of task (ProteinInterfaceDesign) actually has 2 types of models: 1st, simple and quickly calculated models (preliminary selection), and 2nd, detailed models with slow calculation, which you get if you are lucky and the calculations find something promising. So you may see 2 variants: 1) If you get a WU with only "garbage" (1st type models), the calculation goes fast, the WU produces a lot of models (= a lot of credit too, because Cr is granted based on how many models are in the WU) and finishes at exactly the right (target) time. 2) If you get a WU which finds 1 or more interesting models in the garbage (worth detailed calculation), the WU cannot stop at the right moment, because 2nd type models take a few hours of calculation each and the WU can't stop and report its result until it finishes the model. And the model counter is low, so Cr is low too (if you are very "lucky" and hit a good model at the very beginning, and/or a few such models in one WU, it can be only 5-10 Cr instead of a few hundred). But on average over a large number of WUs, Cr/hour (day) is nearly the same as for other types of tasks, because high Cr WUs compensate for low Cr WUs. So Cr is not a problem. For example, WUs from the same series of tasks (my comp): Low Cr WUs (which probably contain "interesting" slow models), runtimes far exceed the target time (= default 3 hours/10800 sec): Ebolanator3_2jhqa_ProteinInterfaceDesign_2Sep2012_58540_36_0 24k sec of CPU time = 113 models, Claimed credit 149, Granted credit 33 Ebolanator3_2r48a_ProteinInterfaceDesign_2Sep2012_58541_42_0 25k sec = 153 models, Claimed credit 157, Granted credit 73 High Cr WUs (only preselection models), runtimes = target time: Ebolanator3_3mw8a_ProteinInterfaceDesign_2Sep2012_58540_39_0 ~11k sec = 272 models, Claimed credit 67, Granted credit 119 Ebolanator3_2d1va_ProteinInterfaceDesign_2Sep2012_58541_45_0 ~11k sec = 410 models, Claimed credit 67, Granted credit 101 Ebolanator3_2NRRA_ProteinInterfaceDesign_2Sep2012_58540_50_0 ~11k sec = 332 models, Claimed credit 67, Granted credit 108 |
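Editor's note: the behaviour described in this post (a work unit can only stop between models, so one slow model pushes it far past the target time) can be illustrated with a toy model. This is only a sketch under stated assumptions: the per-model durations below are made up for illustration, and `run_work_unit` is a hypothetical function, not minirosetta code.

```python
def run_work_unit(model_durations, target_seconds):
    """Run models in order; the stop condition is only checked between models."""
    elapsed = 0.0
    completed = 0
    for duration in model_durations:
        if elapsed >= target_seconds:
            break  # target reached, so no further model is started
        elapsed += duration  # a model that has started always runs to completion
        completed += 1
    return completed, elapsed


TARGET = 10800  # the default 3-hour target mentioned in the post, in seconds

# Hypothetical durations: 40 s for a quick preselection model,
# 7200 s (2 hours) for a slow detailed model.
fast_only = [40.0] * 1000
with_slow = [40.0] * 50 + [7200.0] + [40.0] * 30 + [7200.0] + [40.0] * 100

print(run_work_unit(fast_only, TARGET))
print(run_work_unit(with_slow, TARGET))
```

With these made-up numbers the fast-only work unit stops exactly at the target after 270 models, while the one containing two slow models finishes only 82 models and runs for 17600 seconds. That is the same pattern as the low-credit, long-running work units listed above: fewer models and a runtime well past the target.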
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,269,631 RAC: 2,588 |
I suspect that users would like it more if the detailed ProteinInterfaceDesign models contributed much more to the number of credits than the preliminary ProteinInterfaceDesign models. |
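Editor's note: to put numbers on the credit question, granted credit per CPU hour can be computed for the five work units Mad_Max listed. This is a rough sketch only: the CPU times are the rounded figures from his post ("24k", "25k", "~11k"), five work units are far too few to judge the long-run average he describes, and the function names are hypothetical.

```python
# (short name, CPU seconds, models, granted credit), as quoted in Mad_Max's post.
# CPU times are the rounded values given there, so the results are approximate.
WORK_UNITS = [
    ("Ebolanator3_2jhqa", 24000, 113, 33),
    ("Ebolanator3_2r48a", 25000, 153, 73),
    ("Ebolanator3_3mw8a", 11000, 272, 119),
    ("Ebolanator3_2d1va", 11000, 410, 101),
    ("Ebolanator3_2NRRA", 11000, 332, 108),
]


def credit_per_hour(cpu_seconds, granted_credit):
    """Granted credit divided by CPU hours."""
    return granted_credit * 3600.0 / cpu_seconds


def pooled_credit_per_hour(units):
    """Total granted credit divided by total CPU hours over all units."""
    total_seconds = sum(u[1] for u in units)
    total_credit = sum(u[3] for u in units)
    return total_credit * 3600.0 / total_seconds


for name, seconds, models, granted in WORK_UNITS:
    print(f"{name}: {models} models, {credit_per_hour(seconds, granted):.2f} Cr/hour")

print(f"pooled: {pooled_credit_per_hour(WORK_UNITS):.2f} Cr/hour")
```

On these figures the two long-running work units earn roughly 5 and 10.5 Cr/hour, the three that finished on target earn roughly 33 to 39 Cr/hour, and the pooled rate over all five is about 19 Cr/hour.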
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Task 532135642 (Ebolanator3_1hsla_ProteinInterfaceDesign_2Sep2012_58541_60_1) failed on W7 ERROR: unknown atom_name: ILE CG ERROR:: Exit from: ......srccorechemicalResidueType.cc line: 1702 called boinc_finish </stderr_txt> ] |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
I got half a dozen of these garbage tasks today, and all failed with the same problem. lr_ab_bench Watchdog active. # cpu_run_time_pref: 14400 ====================================================== DONE :: 1 starting structures 1201 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: WS_max 0 Validate error |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Had this one error out not long ago. 2w9pA_newfrag_abinitio_SAVE_ALL_OUT_61300_1309_1 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=488023761 <core_client_version>6.10.58</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> BOINC:: Worker startup. Starting watchdog... Watchdog active. Starting work on structure: _00001 ERROR: Assertion failure: runtime_assert( ( begin + size - 1 ) <= pose.total_residue() ); ERROR:: Exit from: src/protocols/simple_moves/FragmentMover.cc line: 260 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> ]]> |
Viking69 Joined: 3 Oct 05 Posts: 20 Credit: 6,780,190 RAC: 2,146 |
Lots of computing errors on my WUs too. What's up? Hi all you enthusiastic crunchers..... |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Another error. cyst_d17_0000_abinitio_SAVE_ALL_OUT_61396_213_1 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=488307382 <core_client_version>6.10.58</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> ---- Setting database description ... Setting up checkpointing ... Setting up graphics native ... Setting up folding (abrelax) ... ERROR: ERROR: FragmentIO: could not open file aa0000109_05.200_v1_3 ERROR:: Exit from: src/core/fragment/FragmentIO.cc line: 233 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> ]]> |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Another error. rb_10_12_34084_64234__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_61699_37_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=488755685 <core_client_version>6.10.58</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> <stderr_txt> ======= # cpu_run_time_pref: 14400 ERROR: can't open file: minirosetta_database//sampling/filtered.vall.dat.2006-05-05 ERROR:: Exit from: src/core/fragment/picking_old/vall/vall_io.cc line: 63 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> ]]> |
Bjarke Joined: 14 Feb 06 Posts: 5 Credit: 1,634,479 RAC: 0 |
Numerous errors for both of my computers: Tasks for host 1569084 Tasks for host 1569102 All wu's for the past two days are failing... |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,845,777 RAC: 12,122 |
"Numerous errors for both of my computers:" Congratulations! You are "lucky" and got a special bug: now all your WUs will be counted as errors. This is an old known bug, but its cause and how to fix it are still not known. Details in this thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6050 Think about what could have changed over the 2 days: installing new software (or drivers), changing settings (hardware or software), etc. |
Bjarke Joined: 14 Feb 06 Posts: 5 Credit: 1,634,479 RAC: 0 |
"Numerous errors for both of my computers:" F***ing crap. I will definitely switch to another project then. |
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,269,631 RAC: 2,588 |
"Numerous errors for both of my computers:" The last time I had a similar problem (on a different BOINC project) I already had several projects enabled, so I set the one with the problem to no new tasks, let the computer finish all tasks for that project, then told BOINC Manager to reset that project, then allowed more workunits for that project. The new workunits started working correctly. I have no way to tell if this will work for your problem, though. |
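Editor's note: the steps robertmiles describes doing in BOINC Manager can also be issued from the command line with `boinccmd`, which accepts the project operations `nomorework`, `reset` and `allowmorework`. The Python sketch below only builds the command lines and runs nothing until `run_step` is called. Assumptions: `boinccmd` is on the PATH and allowed to talk to the local client, the project URL shown is a placeholder that should be replaced with the exact URL reported by `boinccmd --get_project_status`, and the names `build_commands` and `run_step` are hypothetical.

```python
import subprocess

# Placeholder (assumption): use the exact project URL your client reports.
PROJECT_URL = "https://boinc.bakerlab.org/rosetta/"


def build_commands(project_url):
    """Return the boinccmd invocations for each step, in the order described above."""
    base = ["boinccmd", "--project", project_url]
    return {
        "no_new_tasks": base + ["nomorework"],         # step 1: stop fetching work
        "reset": base + ["reset"],                     # step 3: reset the project
        "allow_new_tasks": base + ["allowmorework"],   # step 4: fetch work again
    }


def run_step(step, project_url=PROJECT_URL):
    """Run one step; raises CalledProcessError if boinccmd reports failure."""
    subprocess.run(build_commands(project_url)[step], check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, so nothing is reset by accident.
    for step, cmd in build_commands(PROJECT_URL).items():
        print(step, "->", " ".join(cmd))
```

Step 2 (letting the queued tasks for that project finish) is deliberately not automated here. Wait until the project has no tasks left before calling `run_step("reset")`, since a reset discards any work in progress.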
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
rb_10_13_34107_64244__t000__0_C2_SAVE_ALL_OUT_IGNORE_THE_REST_61777_916_0 This task 537475495 failed on Mac OS X with this error message # cpu_run_time_pref: 21600 dof_atom1 atomno= 3 rsd= 1 atom1 atomno= 1 rsd= 1 atom2 atomno= 2 rsd= 1 atom3 atomno= 5 rsd= 1 atom4 atomno= 6 rsd= 1 THETA1 nan THETA3 nan PHI2 0 ERROR: AtomTree::torsion_angle_dof_id: angle range error ERROR:: Exit from: src/core/kinematics/AtomTree.cc line: 771 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> ]]> |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
An error. rb_10_19_34246_64381__t000__0_C2_SAVE_ALL_OUT_IGNORE_THE_REST_62288_83_1 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=490064035 <core_client_version>6.10.58</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> <stderr_txt> Setting database description ... Setting up checkpointing ... Setting up graphics native ... BOINC:: Worker startup. Starting watchdog... Watchdog active. ERROR: can't open file: minirosetta_database//sampling/filtered.vall.dat.2006-05-05 ERROR:: Exit from: src/core/fragment/picking_old/vall/vall_io.cc line: 63 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> ]]> |
©2024 University of Washington
https://www.bakerlab.org