Message boards : Number crunching : Should I abort already finished workunits?
TSD | Joined: 10 Oct 08 | Posts: 7 | Credit: 2,189,714 | RAC: 0

Sometimes I get a workunit tagged with "Timed out - no response". Someone else got such a workunit before me - and has not finished it within 3 days. And sometimes such a workunit is "Completed and validated" - after the deadline - while my system is working on the same workunit. Would it be a good idea to abort such a workunit and continue with another one? Or should I let it finish? I don't care about getting credits. I hope this makes sense. And I apologize if this has been answered before; I have tried to make some searches.
Grant (SSSF) | Joined: 28 Mar 20 | Posts: 1681 | Credit: 17,854,150 | RAC: 20,118

> Would it be a good idea to abort such a workunit and continue with another one? Or should I let it finish? I don't care about getting credits.

I'd let it finish, as it will still be returning a useful result.

Grant
Darwin NT
Brian Nixon | Joined: 12 Apr 20 | Posts: 293 | Credit: 8,432,366 | RAC: 0

If you notice that the _0 task for the workunit has completed and validated before your _1 task has finished, you might as well abort it and start something new, because your results will be identical to the set already submitted and (unless your machine is substantially faster than the other, or your run time is set higher) won't add anything.

[So that's two conflicting answers you've got, which leaves you in no better place than before you asked…]
Grant (SSSF) | Joined: 28 Mar 20 | Posts: 1681 | Credit: 17,854,150 | RAC: 20,118

> If you notice that the _0 task for the workunit has completed and validated before your _1 task has finished, you might as well abort it and start something new because your results will be identical to the set already submitted

No, they won't. That's the thing with Rosetta work: when a Task is processed it uses a random seed value. So you could process the exact same Task 100 times on the same system and get 100 different results, all of them valid. That's why it's worth completing a Task, even if its original issue has been returned and validated.

Grant
Darwin NT
Brian Nixon | Joined: 12 Apr 20 | Posts: 293 | Credit: 8,432,366 | RAC: 0

It seems exceptionally unlikely that scientific experiments are being conducted by random chance, or that the tasks sent out are anything but entirely deterministic. Far more likely is that the seeds are determined by the server in advance, per workunit, to ensure a known distribution over the range of starting points. Observe that every task is sent with the -constant_seed option set, and a specific -jran value.
Grant (SSSF) | Joined: 28 Mar 20 | Posts: 1681 | Credit: 17,854,150 | RAC: 20,118

> Observe that every task is sent with the -constant_seed option set, and a specific -jran value.

And those values (or at least the -jran one) are different for each Task. So for a given workunit, each replication that is sent out starts with a different value. Hence there is no comparison of returned Tasks for validation, as occurs on other projects.

> It seems exceptionally unlikely that scientific experiments are being conducted by random chance, or that the tasks sent out are anything but entirely deterministic.

Yet that is partly what is happening - look at some of the stderr outputs for WUs where one Task errors out but another doesn't. There are a huge number of possible combinations to try, and you need to try as many as possible to discard those that aren't of use and find those that are. Using random variables (within a given range of values) as a seed value helps achieve that.

Grant
Darwin NT
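As a rough picture of the behaviour being debated here, the toy sketch below (plain Python, not Rosetta code; the "energy" function and the search are made up for illustration) shows how a simple Monte Carlo search lands in different minima when started from different seeds, yet reproduces the same run exactly when the seed is fixed.

```python
import math
import random

def toy_energy(x):
    """Hypothetical rugged 'energy' landscape with many local minima."""
    return (x * x) / 100.0 + 3.0 * math.sin(x)

def monte_carlo_search(seed, steps=10_000):
    rng = random.Random(seed)        # per-run seed, loosely analogous to the -jran value above
    x = rng.uniform(-50.0, 50.0)     # random starting point
    best_x, best_e = x, toy_energy(x)
    for _ in range(steps):
        candidate = x + rng.uniform(-1.0, 1.0)
        # Accept downhill moves always, uphill moves occasionally.
        if toy_energy(candidate) < toy_energy(x) or rng.random() < 0.1:
            x = candidate
            if toy_energy(x) < best_e:
                best_x, best_e = x, toy_energy(x)
    return round(best_x, 3), round(best_e, 3)

# Same "task", different seeds: every result is a legitimate sample of the
# landscape, but the runs generally land in different minima.
for seed in (1, 2, 3):
    print(seed, monte_carlo_search(seed))

# Same seed twice: the run is fully reproducible (the fixed-seed idea).
print(monte_carlo_search(42) == monte_carlo_search(42))  # True
```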
TSD | Joined: 10 Oct 08 | Posts: 7 | Credit: 2,189,714 | RAC: 0

I am a little confused. I don't know much about what the -constant_seed or -jran options mean or do; it is a little too technical for me. What I have described as a problem does not happen very often, and so far I think I will just do nothing and let workunits finish. @Grant (SSSF) & Brian Nixon: Thanks for your replies. Appreciated.
Brian Nixon | Joined: 12 Apr 20 | Posts: 293 | Credit: 8,432,366 | RAC: 0

-jran is set per workunit, not per task. (Evidence: _0 · _1) What would be the point of resending a failed workunit to a second machine if it wasn't going to retry the exact thing that failed?

You're right that there are some nondeterministic-looking outcomes where a workunit will fail on one machine but succeed on another. As you've pointed out to people many times, those cases are more likely down to hardware failures (which, over the installed base of users' computers, genuinely are random) or latent platform bugs (such as where we see workunits fail on Windows but not on Linux, and the randomness is in the assignment of tasks to machines) than to unpredictable progression of the tasks themselves.

Using random numbers will lead to a random outcome, not a useful one. That is gambling, not science.
Brian Nixon | Joined: 12 Apr 20 | Posts: 293 | Credit: 8,432,366 | RAC: 0

Your question has sparked a discussion between Grant and me, based on our different understandings of how the project works. The truth is that neither of us really knows, and unless we get a definitive answer from an administrator, we never will. As you say: the situation you're asking about happens so rarely that it will make no significant difference in the grand scheme of the project whether you abort the workunits or let them continue.
Falconet | Joined: 9 Mar 09 | Posts: 353 | Credit: 1,227,479 | RAC: 1,013

I would probably run it, especially if it already has a significant amount of CPU time.

An alternative to aborting a task is to lower the CPU runtime target in the Rosetta@home preferences so that the WU finishes earlier (say the WU already has 5 hours of CPU time: change the target to 4 hours if you want it to finish ASAP, or 6 hours so it finishes soon). After updating the preferences, do a project update in the BOINC Manager. Then I usually either wait a bit or disable LAIM (Leave Applications In Memory) in the BOINC settings, suspend the task and resume it immediately - it should end soon after.

Rarely, I miss the deadline by a couple of hours and a replacement task gets sent to another device and returned before I finish my own, because that replacement host is set to a very low CPU runtime target (1 hour, come on lol). Annoying, but I deliver my task with its 8 hours of CPU runtime, since an 8-hour task is probably better than a mere 1-hour task.
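For anyone who prefers the command line, the same update / suspend / resume sequence can be scripted with boinccmd. A minimal sketch (assuming boinccmd is on the PATH and the local client accepts RPCs; the task name below is a placeholder you would copy from the output of `boinccmd --get_tasks`):

```python
import subprocess

PROJECT_URL = "https://boinc.bakerlab.org/rosetta/"  # Rosetta@home project URL
TASK_NAME = "example_task_name_here"                 # placeholder task name

def boinccmd(*args):
    """Run a boinccmd subcommand and return its output."""
    result = subprocess.run(["boinccmd", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Contact the project so the client picks up the lowered runtime preference.
boinccmd("--project", PROJECT_URL, "update")

# Suspend and immediately resume the task; with the lower runtime target it
# should wrap up its current model and finish shortly afterwards.
boinccmd("--task", PROJECT_URL, TASK_NAME, "suspend")
boinccmd("--task", PROJECT_URL, TASK_NAME, "resume")
```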
Grant (SSSF) | Joined: 28 Mar 20 | Posts: 1681 | Credit: 17,854,150 | RAC: 20,118

> Using random numbers will lead to a random outcome, not a useful one. That is gambling, not science.

Not so. Many great discoveries have come about by chance. Statistics is a good example, where randomness is essential in order to produce valid results. Science is about repeatability: if you get a particular result using a particular set of variables, then you should always get the same result under the same identical conditions.

When it comes to proteins there are billions upon billions upon billions etc. of possibilities. Most of them aren't viable, but that still leaves a mind-boggling number of possibilities that are. So many that the models the researchers release are just a punt - an educated punt based on past experience, but a punt nonetheless. And using a random seed variable, within limits set by the researcher, will produce a range of valid results that will show whether their model is on the right track, and if so in which direction they should head.

Grant
Darwin NT
mikey | Joined: 5 Jan 06 | Posts: 1895 | Credit: 9,169,305 | RAC: 3,400

> Using random numbers will lead to a random outcome, not a useful one. That is gambling, not science.
> Not so.

AGREED. It's a lot like the projects looking for prime numbers: sure, they go one by one as they look for them, but the project scientists really do have a very good idea whether there will be a prime number found in the current batch, the next batch, or the batch after that, as they are fairly predictable - not exactly, but pretty close. Yes, eventually Rosetta, or some other project, could circle back and pick up all the ranges of things they are currently not doing, but predictability is how most BOINC projects work, and most seem to work pretty well.
KWSN_Sir_Frank_of_the_Wood | Joined: 17 Mar 21 | Posts: 1 | Credit: 23,589 | RAC: 0

New to Rosetta - less than 10 days... Noticed this morning that 12 workunits (out of 30 or so in the last batch) had been flushed/discarded by the server before my machine started on them... Each had been re-sent to me shortly after the 72-hour deadline had passed - then the previous cruncher had completed the unit a few hours later. Seems to me that this is better than a lot of wheel-spinning on processing units that have already been completed successfully... and apparently the server thinks that a re-sent unit is exactly the same as the original unit.

frank
Grant (SSSF) | Joined: 28 Mar 20 | Posts: 1681 | Credit: 17,854,150 | RAC: 20,118

> New to Rosetta - less than 10 days...

Having a smaller cache fixes that:
Store at least 0.1 days of work
Store up to an additional 0.01 days of work

Grant
Darwin NT
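Those two values can be set in the BOINC Manager's computing preferences, or scripted through BOINC's local override file. A minimal sketch (assumes a Linux install with the usual data directory and boinccmd on the PATH; the path and required permissions differ on other platforms):

```python
import subprocess
from pathlib import Path

# BOINC data directory; this is the usual Debian/Ubuntu location.
DATA_DIR = Path("/var/lib/boinc-client")

# Keep only a tiny work buffer so the client rarely holds tasks
# it cannot start before their deadline.
OVERRIDE = """<global_preferences>
   <work_buf_min_days>0.1</work_buf_min_days>
   <work_buf_additional_days>0.01</work_buf_additional_days>
</global_preferences>
"""

(DATA_DIR / "global_prefs_override.xml").write_text(OVERRIDE)

# Tell the running client to re-read the override file.
subprocess.run(["boinccmd", "--read_global_prefs_override"], check=True)
```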
©2024 University of Washington
https://www.bakerlab.org