Minirosetta 3.62-3.65

martyn_2010

Joined: 5 Oct 10
Posts: 1
Credit: 390,318
RAC: 0
Message 79118 - Posted: 26 Nov 2015, 18:30:25 UTC

An annoying aspect of this project is the 'Validate Error'.

From the log it appears that the workunit has processed successfully, yet when the results are uploaded and checked there appears to be a Validate Error.

This happened to me 3 times for large workunits in September 2015. Have avoided this project until now.

Today I have had 2 more Validate Errors -
rb_11_24_60863_105242__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_312475_210
rb_11_24_60883_105260__t000__4_C1_SAVE_ALL_OUT_IGNORE_THE_REST_312527_72

Please explain why this error occurs and why I should lose credit for what appears to be work finishing successfully.
ID: 79118 · Rating: 0

dcdc
Joined: 3 Nov 05
Posts: 1831
Credit: 119,603,057
RAC: 11,102
Message 79123 - Posted: 28 Nov 2015, 13:10:52 UTC - in response to Message 79118.  

I don't know why it happens, but sometimes things go wrong when you're running a science project. The tasks are credited though - it doesn't show up in the "Tasks for computer" page but does under the task:

https://boinc.bakerlab.org/rosetta/result.php?resultid=774133344

An annoying aspect of this project is the 'Validate Error'.

From the log it appears that the workunit has processed successfully, yet when the results are uploaded and checked there appears to be a Validate Error.

This happened to me 3 times for large workunits in September 2015. Have avoided this project until now.

Today I have had 2 more Validate Errors -
rb_11_24_60863_105242__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_312475_210
rb_11_24_60883_105260__t000__4_C1_SAVE_ALL_OUT_IGNORE_THE_REST_312527_72

Please explain why this error occurs and why I should lose credit for what appears to be work finishing successfully.


ID: 79123 · Rating: 0

Betting Slip
Joined: 26 Sep 05
Posts: 71
Credit: 5,702,246
RAC: 0
Message 79124 - Posted: 28 Nov 2015, 13:50:09 UTC - in response to Message 79123.  
Last modified: 28 Nov 2015, 14:12:24 UTC

I don't know why it happens, but sometimes things go wrong when you're running a science project. The tasks are credited though - it doesn't show up in the "Tasks for computer" page but does under the task:


Only up to a MAX of 300

If you run tasks for longer periods like myself you will lose.

This was supposed to have been fixed a long, long time ago but, as with everything else on this project, nothing, nothing to be heard, nothing to be seen and certainly nothing to be done.

EDIT TO ADD

There have been more reported sightings of "Elvis" in London alone over the last 5 years than scientists on this project.
ID: 79124 · Rating: 0

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 79150 - Posted: 3 Dec 2015, 3:29:33 UTC - in response to Message 79118.  

An annoying aspect of this project is the 'Validate Error'.

From the log it appears that the workunit has processed successfully, yet when the results are uploaded and checked there appears to be a Validate Error.

This happened to me 3 times for large workunits in September 2015. Have avoided this project until now.

Today I have had 2 more Validate Errors -
rb_11_24_60863_105242__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_312475_210
rb_11_24_60883_105260__t000__4_C1_SAVE_ALL_OUT_IGNORE_THE_REST_312527_72

Please explain why this error occurs and why I should lose credit for what appears to be work finishing successfully.


I'm not sure what happened with these two results. The logs show:

[CRITICAL] check_set: init_result([RESULT#773910587 rb_11_24_60863_105242__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_312475_210_0]) failed: -1

but the result files have already been cleaned. The files were possibly corrupt but if this continues to happen, let us know. Credit was granted.
ID: 79150 · Rating: 0

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 79151 - Posted: 3 Dec 2015, 3:39:54 UTC - in response to Message 79124.  

I don't know why it happens, but sometimes things go wrong when you're running a science project. The tasks are credited though - it doesn't show up in the "Tasks for computer" page but does under the task:


Only up to a MAX of 300

If you run tasks for longer periods like myself you will lose.

This was supposed to have been fixed a long, long time ago but, as with everything else on this project, nothing, nothing to be heard, nothing to be seen and certainly nothing to be done.

EDIT TO ADD

There have been more reported sightings of "Elvis" in London alone over the last 5 years than scientists on this project.



What was supposed to be fixed, the validator or the credit limit? I can increase the credit limit to something more reasonable for those who run long jobs like yourself. I'm not sure what exactly causes the validation errors; they can be computer-specific, but I'll take a closer look.
ID: 79151 · Rating: 0

Betting Slip
Joined: 26 Sep 05
Posts: 71
Credit: 5,702,246
RAC: 0
Message 79154 - Posted: 4 Dec 2015, 1:45:28 UTC - in response to Message 79151.  
Last modified: 4 Dec 2015, 1:45:57 UTC



What was supposed to be fixed, the validator or the credit limit? I can increase the credit limit to something more reasonable for those who run long jobs like yourself. I'm not sure what exactly causes the validation errors; they can be computer-specific, but I'll take a closer look.


It's hard to find a post that would prove it conclusively but here is one from October https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6716&nowrap=true#78892
ID: 79154 · Rating: 0

Link
Joined: 4 May 07
Posts: 356
Credit: 382,349
RAC: 0
Message 79155 - Posted: 5 Dec 2015, 10:24:43 UTC
Last modified: 5 Dec 2015, 10:26:45 UTC

Validate errors on rb_11_2* tasks:

Task 774759576

<core_client_version>6.10.18</core_client_version>
<![CDATA[
<stderr_txt>
[2015-11-30 17:36:44:] :: BOINC:: Initializing ... ok.
[2015-11-30 17:36:44:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully. 
command: projects/boinc.bakerlab.org_rosetta/minirosetta_3.65_windows_intelx86.exe @rb_11_28_60979_105347_ab_stage0_h003___robetta_FLAGS -psipred_ss2 h003_.psipred_ss2 -in::file::fasta h003_.fasta -kill_hairpins h003_.nobuformat.psipred_ss2 -in:file:boinc_wu_zip rb_11_28_60979_105347_ab_stage0_h003___robetta.zip -frag3 rb_11_28_60979_105347_ab_stage0_h003___robetta_h003_.200.3mers.index.gz -fragA rb_11_28_60979_105347_ab_stage0_h003___robetta_h003_.200.11mers.index.gz -fragB rb_11_28_60979_105347_ab_stage0_h003___robetta_h003_.200.12mers.index.gz -nstruct 10000 -cpu_run_time 10800 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2611634
Registering options.. 
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok 
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize()  End reached
Loaded options.... ok 
Processed options.... ok 
Initializing random generators... ok 
Initialization complete. 
Initializing options.... ok 
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize()  End reached
Loaded options.... ok 
Processed options.... ok 
Initializing random generators... ok 
Initialization complete. 
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_b7c7d78.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/rb_11_28_60979_105347_ab_stage0_h003___robetta.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ... 
BOINC:: Worker startup. 
Starting watchdog...
Watchdog active.
Starting work on structure: _00001
# cpu_run_time_pref: 86400
Starting work on structure: _00002
Starting work on structure: _00003
Starting work on structure: _00004
Starting work on structure: _00005
Starting work on structure: _00006
Starting work on structure: _00007
Starting work on structure: _00008
Starting work on structure: _00009
Starting work on structure: _00010
Starting work on structure: _00011
Starting work on structure: _00012
Starting work on structure: _00013
Starting work on structure: _00014
Starting work on structure: _00015
Starting work on structure: _00016
Starting work on structure: _00017
Starting work on structure: _00018
Starting work on structure: _00019
Starting work on structure: _00020
Starting work on structure: _00021
Starting work on structure: _00022
Starting work on structure: _00023
Starting work on structure: _00024
Starting work on structure: _00025
Starting work on structure: _00026
Starting work on structure: _00027
Starting work on structure: _00028
Starting work on structure: _00029
Starting work on structure: _00030
Starting work on structure: _00031
Starting work on structure: _00032
Starting work on structure: _00033
Starting work on structure: _00034
Starting work on structure: _00035
Starting work on structure: _00036
Starting work on structure: _00037
Starting work on structure: _00038
Starting work on structure: _00039
Starting work on structure: _00040
Starting work on structure: _00041
Starting work on structure: _00042
Starting work on structure: _00043
Starting work on structure: _00044
Starting work on structure: _00045
Starting work on structure: _00046
Starting work on structure: _00047
Starting work on structure: _00048
Starting work on structure: _00049
Starting work on structure: _00050
Starting work on structure: _00051
Starting work on structure: _00052
Starting work on structure: _00053
Starting work on structure: _00054
Starting work on structure: _00055
Starting work on structure: _00056
Starting work on structure: _00057
Starting work on structure: _00058
Starting work on structure: _00059
Starting work on structure: _00060
Starting work on structure: _00061
Starting work on structure: _00062
Starting work on structure: _00063
Starting work on structure: _00064
Starting work on structure: _00065
Starting work on structure: _00066
Starting work on structure: _00067
Starting work on structure: _00068
Starting work on structure: _00069
Starting work on structure: _00070
Starting work on structure: _00071
Starting work on structure: _00072
Starting work on structure: _00073
Starting work on structure: _00074
Starting work on structure: _00075
Starting work on structure: _00076
Starting work on structure: _00077
Starting work on structure: _00078
Starting work on structure: _00079
Starting work on structure: _00080
Starting work on structure: _00081
Starting work on structure: _00082
Starting work on structure: _00083
Starting work on structure: _00084
Starting work on structure: _00085
Starting work on structure: _00086
Starting work on structure: _00087
Starting work on structure: _00088
Starting work on structure: _00089
Starting work on structure: _00090
Starting work on structure: _00091
Starting work on structure: _00092
Starting work on structure: _00093
Starting work on structure: _00094
Starting work on structure: _00095
Starting work on structure: _00096
Starting work on structure: _00097
Starting work on structure: _00098
Starting work on structure: _00099
======================================================
DONE ::     1 starting structures    58190 cpu seconds
This process generated     99 decoys from      99 attempts
======================================================
BOINC :: WS_max 2.28844e+008

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>




Task 775179895

<core_client_version>6.10.18</core_client_version>
<![CDATA[
<stderr_txt>
[2015-12- 2 11:44:41:] :: BOINC:: Initializing ... ok.
[2015-12- 2 11:44:41:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully. 
command: projects/boinc.bakerlab.org_rosetta/minirosetta_3.65_windows_intelx86.exe @rb_11_29_60395_105396_ab_stage0_t000___robetta_FLAGS -psipred_ss2 t000_.psipred_ss2 -in::file::fasta t000_.fasta -kill_hairpins t000_.nobuformat.psipred_ss2 -in:file:boinc_wu_zip rb_11_29_60395_105396_ab_stage0_t000___robetta.zip -frag3 rb_11_29_60395_105396_ab_stage0_t000___robetta_t000_.200.3mers.index.gz -fragA rb_11_29_60395_105396_ab_stage0_t000___robetta_t000_.200.17mers.index.gz -fragB rb_11_29_60395_105396_ab_stage0_t000___robetta_t000_.200.6mers.index.gz -nstruct 10000 -cpu_run_time 10800 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2057108
Registering options.. 
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok 
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize()  End reached
Loaded options.... ok 
Processed options.... ok 
Initializing random generators... ok 
Initialization complete. 
Initializing options.... ok 
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize()  End reached
Loaded options.... ok 
Processed options.... ok 
Initializing random generators... ok 
Initialization complete. 
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_b7c7d78.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/rb_11_29_60395_105396_ab_stage0_t000___robetta.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ... 
BOINC:: Worker startup. 
Starting watchdog...
Watchdog active.
Starting work on structure: _00001
# cpu_run_time_pref: 86400
Starting work on structure: _00002
Starting work on structure: _00003
Starting work on structure: _00004
Starting work on structure: _00005
Starting work on structure: _00006
Starting work on structure: _00007
Starting work on structure: _00008
Starting work on structure: _00009
Starting work on structure: _00010
Starting work on structure: _00011
Starting work on structure: _00012
Starting work on structure: _00013
Starting work on structure: _00014
Starting work on structure: _00015
Starting work on structure: _00016
Starting work on structure: _00017
Starting work on structure: _00018
Starting work on structure: _00019
Starting work on structure: _00020
Starting work on structure: _00021
Starting work on structure: _00022
Starting work on structure: _00023
Starting work on structure: _00024
Starting work on structure: _00025
Starting work on structure: _00026
Starting work on structure: _00027
Starting work on structure: _00028
Starting work on structure: _00029
Starting work on structure: _00030
Starting work on structure: _00031
Starting work on structure: _00032
Starting work on structure: _00033
Starting work on structure: _00034
Starting work on structure: _00035
Starting work on structure: _00036
Starting work on structure: _00037
Starting work on structure: _00038
Starting work on structure: _00039
Starting work on structure: _00040
Starting work on structure: _00041
Starting work on structure: _00042
Starting work on structure: _00043
Starting work on structure: _00044
======================================================
DONE ::     1 starting structures  86100.8 cpu seconds
This process generated     44 decoys from      44 attempts
======================================================
BOINC :: WS_max 2.76714e+008

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>




Both workunits were canceled. How about sending a kill command to the clients in order to avoid wasting resources?
.
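
For anyone comparing logs like the two above, here is a small sketch that pulls the summary numbers out of a minirosetta stderr dump and works out CPU seconds per decoy. The function name `decoy_throughput` is made up for illustration, and the regular expressions only assume the two summary lines look exactly as they do in these logs.

```python
import re


def decoy_throughput(stderr_text):
    """Return (cpu_seconds, decoys, cpu_seconds_per_decoy) from a stderr summary.

    Assumes the "DONE ::" and "This process generated" lines are formatted
    as in the logs quoted in this thread.
    """
    cpu = re.search(
        r"DONE ::\s+\d+ starting structures\s+([\d.]+) cpu seconds", stderr_text
    )
    dec = re.search(
        r"This process generated\s+(\d+) decoys from\s+(\d+) attempts", stderr_text
    )
    if cpu is None or dec is None:
        raise ValueError("summary lines not found in stderr text")
    cpu_seconds = float(cpu.group(1))
    decoys = int(dec.group(1))
    if decoys == 0:
        raise ValueError("no decoys were generated")
    return cpu_seconds, decoys, cpu_seconds / decoys


# Summary lines copied from task 774759576 above.
sample = (
    "DONE ::     1 starting structures    58190 cpu seconds\n"
    "This process generated     99 decoys from      99 attempts\n"
)
cpu_seconds, decoys, per_decoy = decoy_throughput(sample)
print(f"{decoys} decoys in {cpu_seconds:.0f} s, {per_decoy:.1f} s per decoy")
```

Run on the two tasks above, this comes to roughly 588 and 1957 CPU seconds per decoy, so both ran normally for many hours before the server rejected them.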
ID: 79155 · Rating: 0

David Ball
Joined: 25 Nov 05
Posts: 25
Credit: 1,439,333
RAC: 0
Message 79162 - Posted: 7 Dec 2015, 21:39:56 UTC - in response to Message 79155.  

Validate errors on rb_11_2* tasks


Both workunits were canceled. How about sending a kill command to the clients in order to avoid wasting resources?


I'm also getting validate errors on some workunits that say they completed OK on the client but get a validate error on the server with the workunit details saying they were cancelled. I've started aborting any WUs that start with "rb_" since yours were in the "rb_11_2*" range and mine were in the "rb_12_06_61173_105565_ab_stage0_t000*" range. I also noticed that some in that range were completing and validating.

workunits:

failed: rb_12_06_61173_105565_ab_stage0_t000___robetta_IGNORE_THE_REST_04_10_313585_76_0
failed: rb_12_06_61173_105562_ab_stage0_t000___robetta_IGNORE_THE_REST_07_12_313580_207_0

passed: rb_12_06_61173_105565_ab_stage0_t000___robetta_IGNORE_THE_REST_05_10_313585_82_0
passed: rb_12_06_61173_105562_ab_stage0_t000___robetta_IGNORE_THE_REST_07_12_313580_47_0

If there's a way to cancel WUs without them running for many hours on the client, then I really wish Rosetta would use it.

Thanks,

David

Have you read a good Science Fiction book lately?
ID: 79162 · Rating: 0

Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 79164 - Posted: 8 Dec 2015, 14:02:20 UTC - in response to Message 79162.  

I've started aborting any WUs that start with "rb_" since yours were in the "rb_11_2*" range and mine were in the "rb_12_06_61173_105565_ab_stage0_t000*" range. I also noticed that some in that range were completing and validating.


This is a bit of a bazooka-to-kill-a-housefly type of a solution. I'd encourage anyone not to abort WUs based on something as broad as starting with 'rb_' as 'rb_' work units are part of the Robetta prediction server and serve an incredibly wide range of research projects.

Secondly, I'll note that the two 'failed' WUs you listed above, David, actually DID grant you full credit. (Click on the WU link you posted and scroll to the bottom; you'll see credit was granted. Even though it doesn't show in the summary of your WUs, it does count towards your total.)

Lastly, looking at some of the WUs you've aborted, most of them (like this one, and this one, and this one, for example) were successfully completed by other users after being aborted on your end :).
ID: 79164 · Rating: 0

PanicMan
Joined: 31 Jan 10
Posts: 7
Credit: 276,651
RAC: 0
Message 79165 - Posted: 8 Dec 2015, 20:08:27 UTC

I was just looking around a bit and found this issue too. All have been rb-12 or rb-11 workunits, a total of 6 of them in the last few days. There was only 1 before that as far as my history on the site shows, and that was an rb-11 task. I did get credit for 5/7 of them, but boy, it isn't nice to see 14k computation seconds thrown out the window twice in the last week. It seems obvious from what I have read in this thread that the issue is with the rb tasks. Admittedly I have no idea what all these numbers mean, but I assume someone does and could do something to check these units before sending them? I'm not sure how many others this has happened to, as the site only goes back to 11/24 for me.
ID: 79165 · Rating: 0

David Ball
Joined: 25 Nov 05
Posts: 25
Credit: 1,439,333
RAC: 0
Message 79166 - Posted: 8 Dec 2015, 20:14:52 UTC - in response to Message 79164.  

I've started aborting any WUs that start with "rb_" since yours were in the "rb_11_2*" range and mine were in the "rb_12_06_61173_105565_ab_stage0_t000*" range. I also noticed that some in that range were completing and validating.


This is a bit of a bazooka-to-kill-a-housefly type of a solution. I'd encourage anyone not to abort WUs based on something as broad as starting with 'rb_' as 'rb_' work units are part of the Robetta prediction server and serve an incredibly wide range of research projects.

Secondly, I'll note that the two 'failed' WUs you listed above, David, actually DID grant you full credit. (Click on the WU link you posted and scroll to the bottom; you'll see credit was granted. Even though it doesn't show in the summary of your WUs, it does count towards your total.)

Lastly, looking at some of the WUs you've aborted, most of them (like this one, and this one, and this one, for example) were successfully completed by other users after being aborted on your end :).


Thanks for the info. Basically I was waiting for more information and will start letting them run. BTW, I could be remembering wrong but when I checked the failed WUs shortly after they failed, ISTR that the granted credit on the workunit details was something like "-----".

Anyway, I'll let them process from now on.

-- David
Have you read a good Science Fiction book lately?
ID: 79166 · Rating: 0

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 79167 - Posted: 8 Dec 2015, 21:06:59 UTC

Unfortunately, there is no easy way to efficiently abort the tasks mid process and keep the results/work. I'll likely change the delay bound for these Robetta jobs so they do not get sent to computers that can't finish them in time. This change may prevent some computers from getting work if there are only Robetta jobs in the queue which would be rare. Or maybe there is a better fix?
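
To make the delay-bound idea concrete, here is a rough sketch of the kind of check a scheduler could apply before sending a job. It is not BOINC's actual scheduler logic; the function name and the parameters `on_fraction` and `queued_work_s` are assumptions chosen for illustration.

```python
def can_finish_in_time(cpu_run_time_s, on_fraction, queued_work_s, delay_bound_s):
    """Estimate whether a host would return a job before its delay bound.

    cpu_run_time_s : target run time of the job, in CPU seconds
    on_fraction    : assumed fraction of wall time the host actually computes (0..1]
    queued_work_s  : CPU seconds of work already queued ahead of this job
    delay_bound_s  : seconds allowed between sending the job and its deadline
    """
    if not 0 < on_fraction <= 1:
        raise ValueError("on_fraction must be in (0, 1]")
    # Wall-clock time needed to clear the queue and then finish this job.
    estimated_wall_s = (queued_work_s + cpu_run_time_s) / on_fraction
    return estimated_wall_s <= delay_bound_s


# A 24-hour run time preference on a host that computes half the time:
print(can_finish_in_time(86400, 0.5, 0, 2 * 86400))  # fits a 2-day bound
print(can_finish_in_time(86400, 0.5, 0, 1 * 86400))  # does not fit a 1-day bound
```

The trade-off mentioned above shows up directly: the shorter the delay bound, the more hosts fail this check and get no Robetta work.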
ID: 79167 · Rating: 0

wiueiwue
Joined: 7 Sep 15
Posts: 1
Credit: 0
RAC: 0
Message 79183 - Posted: 11 Dec 2015, 6:15:53 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=704600829

zibochen.helix.151129ZCwh4start2_fold_and_dock_SAVE_ALL_OUT_312981_3717

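Task ID | Computer | Sent | Time reported | Server state | Outcome | Client state | CPU time (sec) | Claimed credit | Granted credit
777217222 | 2201387 | 10 Dec 2015 7:08:56 UTC | 10 Dec 2015 7:29:33 UTC | Over | Client error | Compute error | 848.38 | 1.96 | ---
777220523 | 1729478 | 10 Dec 2015 7:30:20 UTC | 11 Dec 2015 3:02:33 UTC | Over | Client error | Compute error | 4,093.50 | 41.12 | ---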
ID: 79183 · Rating: 0

Link
Joined: 4 May 07
Posts: 356
Credit: 382,349
RAC: 0
Message 79193 - Posted: 12 Dec 2015, 10:08:55 UTC - in response to Message 79167.  

Unfortunately, there is no easy way to efficiently abort the tasks mid process and keep the results/work.

No, but you could at least abort them on the next scheduler request if they have not started yet.

Or, if you still want the results back, simply do not generate new WUs for that job, but do not cancel the already sent out ones (should be possible I think).



I'll likely change the delay bound for these Robetta jobs so they do not get sent to computers that can't finish them in time. This change may prevent some computers from getting work if there are only Robetta jobs in the queue which would be rare. Or maybe there is a better fix?

Not sure what you mean, the "canceled" tasks were finished before deadline...
.
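
A minimal sketch of the first suggestion above, assuming the server can see which tasks a client reports as not yet started. The names `ClientTask`, `started`, `workunit_of` and `tasks_to_abort` are hypothetical and are not BOINC's scheduler API; the only detail taken from this thread is that a task name is its workunit name plus a trailing `_N` suffix.

```python
from dataclasses import dataclass


@dataclass
class ClientTask:
    name: str      # task (result) name as reported by the client
    started: bool  # hypothetical flag: has the client begun computing it?


def workunit_of(task_name):
    """Strip the trailing "_N" instance suffix to get the workunit name."""
    return task_name.rsplit("_", 1)[0]


def tasks_to_abort(reported, cancelled_workunits):
    """Pick unstarted tasks whose workunit was cancelled on the server.

    Tasks that are already running are left alone so their results can
    still be reported and credited.
    """
    return [
        task.name
        for task in reported
        if not task.started and workunit_of(task.name) in cancelled_workunits
    ]


# Made-up task names for illustration only.
reported = [
    ClientTask("wu_a_0", started=False),
    ClientTask("wu_a_1", started=True),
    ClientTask("wu_b_0", started=False),
]
print(tasks_to_abort(reported, {"wu_a"}))
```

Because the client initiates every exchange, a check like this can only run when the client next contacts the scheduler, which is why it helps with unstarted tasks but not with ones already crunching.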
ID: 79193 · Rating: 0

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 79198 - Posted: 12 Dec 2015, 20:54:12 UTC - in response to Message 79193.  

Unfortunately, there is no easy way to efficiently abort the tasks mid process and keep the results/work.

No, but you could at least abort them on the next scheduler request if they have not started yet.

Or, if you still want the results back, simply do not generate new WUs for that job, but do not cancel the already sent out ones (should be possible I think).



I'll likely change the delay bound for these Robetta jobs so they do not get sent to computers that can't finish them in time. This change may prevent some computers from getting work if there are only Robetta jobs in the queue which would be rare. Or maybe there is a better fix?

Not sure what you mean, the "canceled" tasks were finished before deadline...


The Robetta jobs often finished before the deadline but I updated the deadline and Robetta no longer cancels jobs so hopefully this will help. I may have to adjust the delay bound to find an optimal value.
ID: 79198 · Rating: 0

sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 79201 - Posted: 13 Dec 2015, 5:36:12 UTC - in response to Message 79167.  
Last modified: 13 Dec 2015, 6:26:41 UTC

Unfortunately, there is no easy way to efficiently abort the tasks mid process and keep the results/work. I'll likely change the delay bound for these Robetta jobs so they do not get sent to computers that can't finish them in time. This change may prevent some computers from getting work if there are only Robetta jobs in the queue which would be rare. Or maybe there is a better fix?


i'd guess users hitting those issues may perhaps need to tune their computing preferences so that the boinc client does not download too many tasks

parameters like the Computing preferences > Maintain enough work for an additional n days should be limited to a manageable number e.g. i used 0.5 days

changing parameters on the server side unfortunately would run into various dilemmas as those normally affect everyone. e.g. having too distant an expiry date would lead to the researchers waiting too long for results to be turned around. and if there are orphaned tasks that'd be worse, as the boinc server only comes to know that they are orphaned after waiting till expiry, and those get re-assigned say after a 2-week expiry period https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6747

having too short an expiry period risks tasks getting cancelled/aborted by the server before they are complete

the other thing is for users to practice a quick turnaround rather than downloading a large cache of jobs, e.g. i normally get a sufficient number of tasks for the current session which i intend to run and set 'no new tasks' once there are adequate tasks, so that those tasks are completed and returned to the server promptly

i'd guess that helps me keep a short Average turnaround time of 0.24 days
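
as a back-of-envelope illustration of the cache setting above, this sketch estimates how many tasks a work buffer of a given length corresponds to. it is only arithmetic, not how the boinc client actually decides what to fetch; the function name and the 4-core host are assumptions, while the 10800 and 86400 second run times are the `-cpu_run_time` and `cpu_run_time_pref` values seen in the logs earlier in this thread.

```python
import math


def tasks_for_buffer(buffer_days, ncpus, task_run_time_s):
    """Rough number of tasks needed to keep ncpus busy for buffer_days."""
    if task_run_time_s <= 0:
        raise ValueError("task_run_time_s must be positive")
    cpu_seconds_wanted = buffer_days * 86400 * ncpus
    return math.ceil(cpu_seconds_wanted / task_run_time_s)


# Hypothetical 4-core host with a 0.5 day buffer.
print(tasks_for_buffer(0.5, 4, 10800))  # 3-hour tasks
print(tasks_for_buffer(0.5, 4, 86400))  # 24-hour tasks
```

a larger buffer or a shorter run time both increase the number of tasks sitting in the queue, which is what lengthens turnaround.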
ID: 79201 · Rating: 0

Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 79202 - Posted: 13 Dec 2015, 6:51:35 UTC - in response to Message 79198.  
Last modified: 13 Dec 2015, 7:01:13 UTC

I updated the deadline and Robetta no longer cancels jobs so hopefully this will help. I may have to adjust the delay bound to find an optimal value.


.. Wondering if that part of things - Robetta canceling jobs - was my fault (as per this thread which resulted in David adding logic to cancel tasks once a Robetta job completes, to solve a separate issue of jobs being sent out for already-completed Robetta runs). Really hard to please everyone now isn't it :)
ID: 79202 · Rating: 0

Link
Joined: 4 May 07
Posts: 356
Credit: 382,349
RAC: 0
Message 79210 - Posted: 13 Dec 2015, 15:21:14 UTC - in response to Message 79201.  

i'd guess users hitting those issues may perhaps need to tune their computing preferences so that the boinc client does not download too many tasks

The issue here is not that someone has too large a cache; the work units were done long before the deadline. The issue is that jobs are canceled on the server and this information is not passed to the client.
.
ID: 79210 · Rating: 0

sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 79214 - Posted: 13 Dec 2015, 18:27:32 UTC - in response to Message 79210.  
Last modified: 13 Dec 2015, 19:04:05 UTC


The issue here is not that someone has too large a cache; the work units were done long before the deadline. The issue is that jobs are canceled on the server and this information is not passed to the client.


Thanks Link, i see your point. there are various/many limitations with boinc, i'd guess in part due to the protocol design. For the most part it works well, but in the real world we have the extremes which fall out of the 'normal' design ranges of boinc i'd guess. i read that boinc is based on a 'one way polling' design where all network requests are initiated by the client, which rules out 'push' notifications as a solution.

there are probably ways to resolve that e.g. using a 2 phase closure state design (e.g. when the batch/jobs are cancelled on the server, jobs which have been downloaded become 1/2 closed, when the client completes and submits the results, the server can then assign credits and mark the job closed) but that may make the boinc code more complicated & it'd take effort to do so.
i'd guess it may be good to post some of these circumstances/issues in the boinc message boards http://boinc.berkeley.edu/dev/ or log bug reports https://github.com/BOINC/boinc/issues so that the developers could consider them as boinc codes are enhanced.
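
a tiny sketch of the 2 phase closure idea described above, just to show the state transitions. nothing here exists in boinc; `JobState`, `Job`, `cancel_batch` and `report_result` are all hypothetical names, and the credit handling is simplified to "grant whatever was claimed".

```python
from enum import Enum


class JobState(Enum):
    OPEN = "open"                # sent to a client, batch still wanted
    HALF_CLOSED = "half_closed"  # batch cancelled, but a client still holds the job
    CLOSED = "closed"            # result received and credit assigned


class Job:
    def __init__(self, name):
        self.name = name
        self.state = JobState.OPEN
        self.credit = 0.0

    def cancel_batch(self):
        """Server cancels the batch: downloaded jobs become half closed."""
        if self.state is JobState.OPEN:
            self.state = JobState.HALF_CLOSED

    def report_result(self, claimed_credit):
        """Client reports in: grant credit and close, even if half closed."""
        if self.state is JobState.CLOSED:
            raise ValueError(f"{self.name} is already closed")
        self.credit = claimed_credit
        self.state = JobState.CLOSED
        return self.credit


job = Job("example_job_0")    # made-up name
job.cancel_batch()            # batch cancelled while the client is still crunching
print(job.state)
print(job.report_result(41.12), job.state)
```

the point of the middle state is that a late result for a cancelled batch still has somewhere to land, without the server needing to push anything to the client.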

actually there are many pressing issues, such as servers tending to be overloaded due to design, e.g. heavy 'polling' by clients. those again are complicated protocol design issues which may take significant effort to enhance the code, test it, install/update servers and distribute updated clients to all participants.
ID: 79214 · Rating: 0

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 79216 - Posted: 14 Dec 2015, 5:40:46 UTC - in response to Message 79202.  

I updated the deadline and Robetta no longer cancels jobs so hopefully this will help. I may have to adjust the delay bound to find an optimal value.


.. Wondering if that part of things - Robetta canceling jobs - was my fault (as per this thread which resulted in David adding logic to cancel tasks once a Robetta job completes, to solve a separate issue of jobs being sent out for already-completed Robetta runs). Really hard to please everyone now isn't it :)


Yep, but I think reducing the delay bound was necessary so bringing the issue(s) to light was great. It may need further adjusting but it's better than before IMO. Thanks!
ID: 79216 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org