Message boards : Number crunching : Rosetta 4.1+ and 4.2+
Author | Message |
---|---|
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1673 Credit: 17,596,129 RAC: 22,414 |
Some problems with "12v1n_" wus. I've processed these ones with no problems so far. 12v1n_al_12mer_design_00003_000575_0001_SAVE_ALL_OUT_913418_39_0 12v1n_al_12mer_design_00062_014552_0001_SAVE_ALL_OUT_913824_41_0 12v1n_al_12mer_design_00166_018161_0001_SAVE_ALL_OUT_914183_55_0 12v1n_al_12mer_design_00178_008639_0001_SAVE_ALL_OUT_914209_113_0 12v1n_al_12mer_design_00329_016075_0001_SAVE_ALL_OUT_914468_22_0 They finish early (done in 3hrs with 8hr Target CPU time), but all are Valid. I have 4hs wus (in my profile), but these are crunching over 10hs and with NO checkpoint. It's been sent to another system, so we'll see how it goes. Grant Darwin NT |
Stret Send message Joined: 18 Mar 20 Posts: 7 Credit: 529,664 RAC: 0 |
Please move to the relevant forum; I was struggling to find a help section. One of my work units has been running for over a day (unusual in and of itself) and is not making any progress: it says it is 10 minutes from finishing, but that hasn't changed in over 12 hours. Based on my rudimentary programming knowledge, I suspect that it has hit an infinite loop. What is the best way forward? There's no point in it hitting its deadline and doing the same on another machine. Copy and paste from the properties of the WU: Application Rosetta 4.15 Name 12v1n_al_12mer_design_00003_000575_0001_SAVE_ALL_OUT_913418_17 State Running Received 15/04/2020 07:16:00 Report deadline 18/04/2020 07:16:02 Estimated computation size 80,000 GFLOPs CPU time 1d 05:34:42 CPU time since checkpoint 1d 05:34:42 Elapsed time 1d 05:55:59 Estimated time remaining 00:10:07 Fraction done 99.439% Virtual memory size 244.53 MB Working set size 48.34 MB Directory slots/5 Process ID 26968 Progress rate 3.240% per hour Executable rosetta_4.15_windows_x86_64.exe |
Charles Dennett Send message Joined: 27 Sep 05 Posts: 102 Credit: 2,075,292 RAC: 189 |
Yeah, I had a similar one whose name started the same way as yours. It ran over a day and a half before I aborted it. Apparently others have reported issues with tasks with names like that. Just abort it. -Charlie |
James W Send message Joined: 25 Nov 12 Posts: 130 Credit: 1,766,254 RAC: 0 |
Task: 1150978005 Task: 12v1n_al_12mer_design_00026_019077_0001_SAVE_ALL_OUT_913633_58 CPU time: 15:01:25 CPU time since checkpoint: 15:01:25 Elapsed time: 15:28:46 Estimated time remaining: 00:10:18 (which varies between 00:10:17-00:10:20) Fraction done: 98.901% (Which has gradually increased over last 1/2 hr or so from 98.893% or so) Original estimated time was 8 hrs, so now is 7.5 hrs over this. Shouldn't Watchdog have stopped processing, as now over 4 hrs longer than estimated processing time? Fraction done is slowly rising, so reluctant to abort at this point. Concerning that only checkpoint was when task first started. If BOINC manager stops or suspends, afraid task will want to start over from scratch! I'll keep an eye on this for now and if no change in hour or so, will likely need to abort at that time. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
The watchdog kicks in 4 hours after the runtime preference, not the estimated runtime shown in the BOINC Manager. Once the WU reports back it will show the runtime preference it was run with. But you are correct, no checkpoints for over an hour is not a good sign, and your other work units seem to be running with the default 8 hour preference. Is this with your i5 Windows 7 Professional machine? It looks like your i3, also with Win7, has already run a few similar tasks with dozens of models completed in the same period of time, even with less memory per core. Are the BOINC settings the same for both systems? Rosetta Moderator: Mod.Sense |
rsNeutrino Send message Joined: 22 Mar 20 Posts: 10 Credit: 4,883,530 RAC: 5,961 |
Task 1152764941 also drove into the 12 hour timeout, reaching 98% around 10 minutes before that. <core_client_version>7.16.5</core_client_version> <![CDATA[ <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_x86_64.exe @rb_04_16_21806_21365_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 2 2 2 1 1 1 1 2 1 1 1 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_04_16_21806_21365_ab_t000__robetta.zip -frag3 rb_04_16_21806_21365_ab_t000__robetta.200.3mers.index.gz -fragA rb_04_16_21806_21365_ab_t000__robetta.200.4mers.index.gz -fragB rb_04_16_21806_21365_ab_t000__robetta.200.7mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 5000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1285868 Starting watchdog... Watchdog active. BOINC:: CPU time: 43301.9s, 14400s + 28800s[2020- 4-18 6:26:34:] :: BOINC WARNING! cannot get file size for default.out.gz: could not open file. Output exists: default.out.gz Size: -1 InternalDecoyCount: 0 (GZ) ----- 0 ----- Stream information inconsistent. Writing W_0000001 ====================================================== DONE :: 1 starting structures 43301.9 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== 06:26:34 (10032): called boinc_finish(0) </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0_r1479539452_0</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error> </message> ]]> Task 1153161145 is probably going to end up the same, 34% at 4h 10min. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1673 Credit: 17,596,129 RAC: 22,414 |
Task 1152764941 also drove into the 12 hour timeout. Looks like there was a file transfer problem there as well. Grant Darwin NT |
rsNeutrino Send message Joined: 22 Mar 20 Posts: 10 Credit: 4,883,530 RAC: 5,961 |
Looks like there was a file transfer problem there as well. Maybe because it didn't have any checkpoint or result file to upload since it wasn't finished with its first decoy. That's probably also the reason for the long runtime: it HAS to finish one before shutdown, else it keeps going until the watchdog kills it. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1673 Credit: 17,596,129 RAC: 22,414 |
From the couple I've had, it looks like the Watchdog will let it run for up to an extra 4 hours, and if it doesn't finish in that time then the next time it goes to Checkpoint, the Watchdog then ends it and it is considered finished. Looks like there was a file transfer problem there as well. Maybe because it didn't have any checkpoint or result file to upload since it wasn't finished. ====================================================== DONE :: 1 starting structures 43301.9 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== 06:26:34 (10032): called boinc_finish(0) If it had returned the result, it would have (or at least should have) Validated. Grant Darwin NT |
James W Send message Joined: 25 Nov 12 Posts: 130 Credit: 1,766,254 RAC: 0 |
The watchdog kicks in 4 hours after the runtime preference, not the estimated runtime shown in the BOINC Manager. Both computers are set with the default CPU runtime of 8 hrs (28800 seconds). Is this with your i5 Windows 7 Professional machine? Yes, the times, etc. I quoted in my previous post were for the i5 Windows 7 PC. As of 07:00 UTC: Task: 1150978005 Task Name: 12v1n_al_12mer_design_00026_019077_0001_SAVE_ALL_OUT_913633_58 CPU time: 19:29:08 CPU time since checkpoint: 19:29:08 Elapsed time: 20:03:42 Estimated time remaining: 00:10:17 Fraction done: 99.152% Fraction done moved up slightly in the last 4.5 hrs, though estimated time remaining has stayed the same. CPU runtime is also more than 11 hrs over the default/set time. I doubt this task would be valid, even if the Watchdog stops processing, because it has only the one checkpoint from the start of processing. I will probably need to abort if BOINC manager/project doesn't stop processing. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1673 Credit: 17,596,129 RAC: 22,414 |
... and then I get 2 Tasks where the watchdog hasn't kicked in at all even after 1hr plus over the 4hrs and multiple checkpoints in that time. From the couple I've had, it looks like the Watchdog will let it run for up to an extra 4 hours, and if it doesn't finish in that time then the next time it goes to Checkpoint, the Watchdog then ends it and it is considered finished. Looks like there was a file transfer problem there as well. Maybe because it didn't have any checkpoint or result file to upload since it wasn't finished. Grant Darwin NT |
rsNeutrino Send message Joined: 22 Mar 20 Posts: 10 Credit: 4,883,530 RAC: 5,961 |
From the couple I've had, it looks like the Watchdog will let it run for up to an extra 4 hours, and if it doesn't finish in that time then the next time it goes to Checkpoint, the Watchdog then ends it and it is considered finished. My target time is 8 hours. The task reached 12 hours, so it did run for 4 extra hours. I had my eye on that task before it ended, and BOINC told me in the task properties that "CPU time since checkpoint" was equal to the "CPU time" of that task. That means not even one checkpoint was saved in the 12h since the start of that task. The second task shows the same symptoms at the moment: CPU time 04:40:xx, CPU time since checkpoint 04:40:xx, Elapsed time 04:45:xx. My understanding is that the watchdog is there to kill the task at target time + 4h, regardless of whether there are any results: 18.04.2020 06:26:41 | Rosetta@home | Output file rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0_r1479539452_0 for task rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0 absent Also, the watchdog seems to look at "cpu seconds", i.e. "CPU time", not the slightly longer Elapsed time. The point is, it seems to me that there are some models that are either buggy or need much more time to produce even a single result, and the watchdog doesn't like it. If the model can't be changed to fit in an 8h timeslot, raising the watchdog timeout could be a necessary option, which MAY have already happened in your and James' cases, but not in mine. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,523,781 RAC: 8,309 |
Some problems with "12v1n_" wus. Now seems well for me too... 1153276213 |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,523,781 RAC: 8,309 |
My understanding is that the watchdog is there to kill the task at target time + 4h, regardless of whether there are any results: Problems with cstwt wus are well known |
Admin Project administrator Send message Joined: 1 Jul 05 Posts: 4805 Credit: 0 RAC: 0 |
There very well could be jobs that have long run times per model. I increased the watchdog to 10 hours. This will only take effect on new jobs. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2117 Credit: 41,140,182 RAC: 15,917 |
There very well could be jobs that have long run times per model. I increased the watchdog to 10 hours. This will only take effect on new jobs. Really? That doesn't seem a great move. The first problem seems to me that "CPU time since checkpoint" is equal to the "CPU time" of the task. That is, even after the requested runtime PLUS the existing 4hr watchdog, the task hasn't checkpointed at all. The watchdog is there for tasks that've gone "rogue", not to wait for a single and first checkpoint. OK, if the task has completed several decoys already but the last one is taking an unexpectedly long time, a longer watchdog is maybe appropriate, but is that what's being reported? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I would think that it means they really need to see some of these extreme models completed. That all models might take a long time. Your scenario with early models completed and then one long one doesn't sound like the reason one would make the change described. Rest assured that my experience with the project has always been that model runtimes and the consistency of model runtimes improve with updates to the specific protocols. But, in the meantime, extending the watchdog sounds like the fastest way for them to get some results. {edit} I don't mean to sound like I am refuting any of the desirable attributes of WUs that Sid mentioned. The Project Team is very aware of the desirability of checkpoints, fast and consistent model runtimes, etc. The fact that they chose to extend the watchdog to 10 hours really tells me that we're down to either do this, or don't get the data you need to continue your COVID study. I'm confident it will not be a permanent change. Rosetta Moderator: Mod.Sense |
CIA Send message Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
Some problems with "12v1n_" wus. Not all of the 12v1n's are having issues, but I've had the issue mentioned above about long run time (seems to stall). (Task linked below) First time I ran it, it went over 24 hours, got to 99.4%, and then for other reasons I rebooted my machine. The same task after reboot reset to 0% and started over. I let it run 12+ hours the second time, and it was exhibiting the same behavior. I quit BOINC and relaunched, and again the same task reset to 0% and started over. I aborted it. Now it's on an Android machine, we'll see if it goes anywhere. Aborted task: https://boinc.bakerlab.org/rosetta/result.php?resultid=1151472990 Where it lives now: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1035999438 I'm intrigued to see if the second computer is able to finish it. /edit. Should add, I have finished several of the 12v1n tasks without issue, so it's not widespread. |
rsNeutrino Send message Joined: 22 Mar 20 Posts: 10 Credit: 4,883,530 RAC: 5,961 |
There very well could be jobs that have long run times per model. I increased the watchdog to 10 hours. This will only take effect on new jobs. After 12h 8min CPU time this one finished successfully with 1 decoy: 1153161145 Did the watchdog end it? BOINC:: CPU time: 43719.1s, 14400s + 28800s[2020- 4-18 16: 7:48:] :: BOINC |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
After 12h 8min CPU time this one finished successfully with 1 decoy: 1153161145 Yes, this looks like a good example why, in future WUs, the watchdog will be set to only kick in 10 hours after the preferred runtime (versus the prior 4 hours past setting). Rosetta Moderator: Mod.Sense |
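
For anyone who wants to check their own tasks against the rule discussed in this thread (watchdog cutoff = runtime preference + grace period, measured in CPU time), here is a minimal Python sketch. It is not project code: the function names are made up for illustration, and reading the stderr fragment `14400s + 28800s` as "grace period + runtime preference" is an assumption based on those values matching the 4 hours and 8 hours mentioned above.

```python
import re

# Grace period reported in the thread for existing jobs (4 h);
# new jobs are said to get 10 h instead.
WATCHDOG_GRACE_S = 4 * 3600


def parse_duration(text):
    """Convert a BOINC Manager duration such as '1d 05:34:42' or '15:01:25' to seconds."""
    m = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if m is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def watchdog_cutoff(runtime_pref_s, grace_s=WATCHDOG_GRACE_S):
    """CPU seconds after which the watchdog is expected to end the task."""
    return runtime_pref_s + grace_s


def never_checkpointed(cpu_time, cpu_since_checkpoint):
    """True when 'CPU time since checkpoint' equals 'CPU time' (no checkpoint saved yet)."""
    return parse_duration(cpu_time) == parse_duration(cpu_since_checkpoint)


def parse_watchdog_line(line):
    """Extract (cpu_s, grace_s, runtime_pref_s) from a stderr line like
    'BOINC:: CPU time: 43301.9s, 14400s + 28800s'. Field meaning is assumed."""
    m = re.search(r"CPU time:\s*([\d.]+)s,\s*(\d+)s\s*\+\s*(\d+)s", line)
    if m is None:
        raise ValueError("no watchdog summary found in line")
    return float(m.group(1)), int(m.group(2)), int(m.group(3))


if __name__ == "__main__":
    # 8 h preference + 4 h grace = 43200 CPU seconds (12 h)
    print(watchdog_cutoff(8 * 3600))

    # Values copied from task properties posted in this thread
    print(never_checkpointed("15:01:25", "15:01:25"))

    cpu, grace, pref = parse_watchdog_line(
        "BOINC:: CPU time: 43301.9s, 14400s + 28800s[2020- 4-18 6:26:34:] :: BOINC"
    )
    print(cpu >= watchdog_cutoff(pref, grace))
```

With the 10 hour grace period announced for new jobs, the same call would be `watchdog_cutoff(8 * 3600, grace_s=10 * 3600)`, i.e. 18 hours of CPU time at the default preference.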
©2024 University of Washington
https://www.bakerlab.org