Strange work unit.

adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 137
Message 90461 - Posted: 2 Mar 2019, 12:02:20 UTC

I have a work unit on here, this one...

https://boinc.bakerlab.org/rosetta/result.php?resultid=1059492200

... which has behaved in an unusual fashion. I have the run time set to 12 hours, and understand that the task runs several times in the time allowed, starting with a new random number each time. I saw that the above unit had run for over two days and was only a little over 8% complete. I suspended it and then released it; it dropped back to the start, is running again now, and has passed the point where it stopped before. Indeed, it is showing 13.757% progress.

I infer from that that the task is sensitive to certain random numbers, which is a little odd, indeed worrying. I have Rosetta running as one of the projects on machines that I do not see every day.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 90489 - Posted: 6 Mar 2019, 15:59:32 UTC

That one random start may expose a bug that is not seen by a million others is certainly a possibility. But when a task gets reset like you describe, it will actually restart processing on the same random start. So it sounds like a BOINC issue, one that I've not heard brought up for some time, where the BOINC Manager shows the task is running and is recording time, but the task does not actually get CPU time dispatched to it. You can look at the Windows Task Manager or the properties of the task in the BOINC Manager to see if it is actually showing CPU time accumulating.
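The check described above can be sketched in Python. This is only an illustration under stated assumptions: the `Sample` type, the function names, and the 5% threshold are hypothetical, and the two readings are values you would note down by hand from the task properties. It does not call any BOINC API.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One manual reading of a task (hypothetical helper type)."""
    wall_seconds: float  # elapsed (wall-clock) time shown for the task
    cpu_seconds: float   # CPU time shown for the task


def cpu_share(earlier: Sample, later: Sample) -> float:
    """Fraction of the wall-clock interval in which the task received CPU."""
    wall = later.wall_seconds - earlier.wall_seconds
    if wall <= 0:
        raise ValueError("samples must be taken at increasing wall-clock times")
    return (later.cpu_seconds - earlier.cpu_seconds) / wall


def looks_stalled(earlier: Sample, later: Sample, min_share: float = 0.05) -> bool:
    """True if elapsed time advanced but CPU time barely moved.

    The 5% default threshold is an arbitrary assumption for this sketch.
    """
    return cpu_share(earlier, later) < min_share
```

Two readings taken ten minutes apart are enough: if elapsed time has advanced by 600 seconds and CPU time has not moved, the task is being timed but not actually run.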

If the task is not accumulating CPU time, even when shown as active by the BOINC Manager... I don't recall the workaround. Was this one of the symptoms of using the BOINC setting to use less than 100% of CPU time? If your settings use less than 100% of the machine's CPU time, another approach that doesn't seem to have the problems is to use less than 100% of the number of CPUs instead. In other words, rather than running at 90% CPU on an 8-core machine, set BOINC to use 87% of the CPUs that the machine has (i.e. 7 cores).
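The core-count arithmetic can be checked with a short Python sketch. The function name and the `rounding` parameter are hypothetical, and how the BOINC client actually rounds a fractional result is an assumption here, not something stated in this thread. On 8 cores, 87% is 6.96 cores, which is 7 only if rounded to the nearest whole number; 87.5% gives exactly 7 under either rule.

```python
import math


def cores_allowed(total_cores: int, percent: float, rounding: str = "floor") -> int:
    """Whole cores available when limited to `percent` of `total_cores`.

    `rounding` is a knob for this sketch only: "floor" truncates the
    fractional result, "nearest" rounds half up. Which rule the real
    client applies is not confirmed here.
    """
    exact = total_cores * percent / 100.0
    if rounding == "floor":
        n = math.floor(exact)
    elif rounding == "nearest":
        n = math.floor(exact + 0.5)
    else:
        raise ValueError("rounding must be 'floor' or 'nearest'")
    return max(1, n)  # assume at least one core is always usable


if __name__ == "__main__":
    for pct in (87, 87.5, 90):
        print(pct, cores_allowed(8, pct, "floor"), cores_allowed(8, pct, "nearest"))
```

If the setting does not produce the expected number of running tasks, choosing a percentage that lands exactly on a whole core (87.5% for 7 of 8) avoids depending on the rounding rule.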
Rosetta Moderator: Mod.Sense
adrianxw
Message 90490 - Posted: 6 Mar 2019, 17:16:42 UTC
Last modified: 6 Mar 2019, 17:18:39 UTC

I kept half an eye on it after it restarted, but it appeared to run completely normally, finished and uploaded. I agree that if it restarts with the same random number, then my theory above is incorrect. I am not sufficiently familiar with the code to comment further really. It ran to a normal completion. Something upset it: cosmic ray, neutrino interaction, could be anything I suppose. I have not changed anything here, so the project continues to run as it always has. Forget it.
adrianxw
Message 90594 - Posted: 30 Mar 2019, 12:39:59 UTC

I've just had another of these, this one:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=959097968.
adrianxw
Message 91127 - Posted: 18 Sep 2019, 8:42:20 UTC
Last modified: 18 Sep 2019, 8:45:38 UTC

And another. I presume that there is a safety kill mechanism which will abort a task if it exceeds some threshold time value. I ask because I have Rosetta in the portfolio of a couple of machines I do not see every day.
Mod.Sense
Volunteer moderator

Message 91129 - Posted: 18 Sep 2019, 13:48:53 UTC
Last modified: 18 Sep 2019, 13:49:04 UTC

"I presume that there is a safety kill mechanism"

Yes, we call it the watchdog. It ensures that tasks that run more than 4 hours longer than their preferred runtime are ended, and any completed models of that WU are returned.
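As a rough illustration of that rule, here is a minimal Python sketch. The function names are hypothetical, the 4-hour grace period is the figure quoted in this post, and whether the real watchdog measures CPU time or wall-clock time is not stated here, so the sketch just takes an elapsed duration.

```python
from datetime import timedelta

# Grace period quoted in the post; everything else below is an assumption.
WATCHDOG_GRACE = timedelta(hours=4)


def watchdog_limit(preferred_runtime: timedelta) -> timedelta:
    """Run time beyond which the task would be ended under the quoted rule."""
    return preferred_runtime + WATCHDOG_GRACE


def would_be_ended(elapsed: timedelta, preferred_runtime: timedelta) -> bool:
    """True once the task has run MORE than 4 hours past its preferred runtime."""
    return elapsed > watchdog_limit(preferred_runtime)


if __name__ == "__main__":
    elapsed = timedelta(hours=16, minutes=40, seconds=50)
    # 12-hour preference: limit is 16:00:00, so this task is over it.
    print(would_be_ended(elapsed, timedelta(hours=12)))
    # 14-hour preference: limit is 18:00:00, so the task may continue.
    print(would_be_ended(elapsed, timedelta(hours=14)))
```

Under this reading, raising the preferred runtime while a task is still running moves the limit by the same amount.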
adrianxw
Message 91130 - Posted: 18 Sep 2019, 14:56:02 UTC - in response to Message 91129.  

Good, that is what I expected. Thanks.
adrianxw
Message 91131 - Posted: 19 Sep 2019, 8:37:49 UTC
Last modified: 19 Sep 2019, 9:12:12 UTC

It is a little unfortunate! I have a task running now that is 98.965% complete and VERY slowly increasing, with 10 minutes odd shown as remaining, but it has now run for 16:40:50, so it is at the point where it is more than 4 hours over my 12:00:00 run time. A few minutes ago it showed 98.963%, and after finishing this post it says 98.968%, so it IS doing something.

<edit>
Okay, I managed to up the run time to 14:00:00 before it got the chop, so I hope it will get there. It shows 98.971% right now. Interestingly, the time remaining is not decreasing; it has been 00:10:27 since I started.
</edit>

<edit again>
Yes! It suddenly jumped to 100% after 16:50:47.
</edit>

<edit again>
The task is:

rb_08_27_7614_7823_ab_t000_robetta_cstwt_5.0_FT_IGNORE_THE_REST_08_06_857976_594

Hope it is a good one. I'll leave the time at 14:00:00 in case there are others like this one.
</edit>

<edit AGAIN>

Task | Work unit | Computer | Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Credit | Application
1093987561 | 985420727 | 3117659 | 18 Sep 2019, 5:10:20 UTC | 19 Sep 2019, 6:59:05 UTC | Completed and validated | 44,963.77 | 43,178.52 | 569.88 | Rosetta Mini v3.78 windows_x86_64
1093972969 | 984034477 | 3117659 | 18 Sep 2019, 3:52:35 UTC | 19 Sep 2019, 8:51:40 UTC | Completed and validated | 60,647.23 | 57,949.45 | 398.76 | Rosetta v4.07 windows_x86_64
1093968124 | 985403294 | 3117659 | 18 Sep 2019, 2:51:36 UTC | 19 Sep 2019, 4:16:12 UTC | Completed and validated | 44,333.80 | 43,112.95 | 625.53 | Rosetta Mini v3.78 windows_x86_64
1093967112 | 985402459 | 3161065 | 18 Sep 2019, 2:36:42 UTC | 19 Sep 2019, 3:42:42 UTC | Completed and validated | 43,181.07 | 43,133.70 | 512.24 | Rosetta Mini v3.78 windows_intelx86
1093967259 | 985402586 | 3161065 | 18 Sep 2019, 2:36:42 UTC | 19 Sep 2019, 1:43:24 UTC | Completed and validated | 43,139.44 | 43,080.78 | 590.64 | Rosetta Mini v3.78 windows_intelx86

Credit column is interesting. Mini looks to be maxy.
</edit>
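To put a number on the credit comparison, here is a small Python sketch that works out credit per CPU hour for the five results listed above. It assumes the last two numeric columns are CPU time in seconds and granted credit; the function names are hypothetical.

```python
from collections import defaultdict

# (application, CPU time in seconds, credit) copied from the results above.
# Reading the columns this way is an assumption about the table layout.
ROWS = [
    ("Rosetta Mini v3.78", 43178.52, 569.88),
    ("Rosetta v4.07", 57949.45, 398.76),
    ("Rosetta Mini v3.78", 43112.95, 625.53),
    ("Rosetta Mini v3.78", 43133.70, 512.24),
    ("Rosetta Mini v3.78", 43080.78, 590.64),
]


def credit_per_cpu_hour(cpu_seconds: float, credit: float) -> float:
    """Credit granted per hour of CPU time."""
    return credit / (cpu_seconds / 3600.0)


def mean_rate_by_app(rows):
    """Average credit per CPU hour for each application name."""
    rates = defaultdict(list)
    for app, cpu_seconds, credit in rows:
        rates[app].append(credit_per_cpu_hour(cpu_seconds, credit))
    return {app: sum(values) / len(values) for app, values in rates.items()}


if __name__ == "__main__":
    for app, rate in mean_rate_by_app(ROWS).items():
        print(f"{app}: {rate:.1f} credit per CPU hour")
```

With only one Rosetta v4.07 result in the list, this is a single data point rather than a trend.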

©2024 University of Washington
https://www.bakerlab.org