Message boards : Number crunching : Minirosetta 3.62-3.65
Author | Message |
---|---|
martyn_2010 Joined: 5 Oct 10 Posts: 1 Credit: 390,318 RAC: 0 |
An annoying aspect of this project is the 'Validate Error'. From the log it appears that the workunit has processed successfully, yet when the results are uploaded and checked there appears to be a Validate Error. This happened to me 3 times for large workunits in September 2015, and I have avoided this project until now. Today I have had 2 more Validate Errors: rb_11_24_60863_105242__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_312475_210 and rb_11_24_60883_105260__t000__4_C1_SAVE_ALL_OUT_IGNORE_THE_REST_312527_72. Please explain why this error occurs and why I should lose credit for what appears to be work finishing successfully. |
dcdc Joined: 3 Nov 05 Posts: 1831 Credit: 119,598,331 RAC: 10,707 |
I don't know why it happens, but sometimes things go wrong when you're running a science project. The tasks are credited though - it doesn't show up in the "Tasks for computer" page but does under the task: https://boinc.bakerlab.org/rosetta/result.php?resultid=774133344 An annoying aspect of this project is the 'Validate Error'. |
Betting Slip Joined: 26 Sep 05 Posts: 71 Credit: 5,702,246 RAC: 0 |
I don't know why it happens, but sometimes things go wrong when you're running a science project. The tasks are credited though - it doesn't show up in the "Tasks for computer" page but does under the task: Only up to a MAX of 300. If you run tasks for longer periods like myself you will lose. This was supposed to have been fixed a long, long time ago but as with everything else on this project, nothing, nothing to be heard, nothing to be seen and certainly nothing to be done. EDIT TO ADD: There have been more reported sightings of "Elvis" in London alone over the last 5 years than of scientists on this project. |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
An annoying aspect of this project is the 'Validate Error'. I'm not sure what happened with these two results. The logs show: [CRITICAL] check_set: init_result([RESULT#773910587 rb_11_24_60863_105242__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_312475_210_0]) failed: -1 but the result files have already been cleaned. The files were possibly corrupt but if this continues to happen, let us know. Credit was granted. |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I don't know why it happens, but sometimes things go wrong when you're running a science project. The tasks are credited though - it doesn't show up in the "Tasks for computer" page but does under the task: What was supposed to be fixed? The validator or the credit limit? I can increase the credit limit to something more reasonable for those that run long jobs like yourself. I'm not sure what exactly causes the validation errors; it can be computer specific, but I'll take a closer look. |
Betting Slip Joined: 26 Sep 05 Posts: 71 Credit: 5,702,246 RAC: 0 |
It's hard to find a post that would prove it conclusively but here is one from October https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6716&nowrap=true#78892 |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
Validate errors on rb_11_2* tasks: Task 774759576 <core_client_version>6.10.18</core_client_version> <![CDATA[ <stderr_txt> [2015-11-30 17:36:44:] :: BOINC:: Initializing ... ok. [2015-11-30 17:36:44:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. command: projects/boinc.bakerlab.org_rosetta/minirosetta_3.65_windows_intelx86.exe @rb_11_28_60979_105347_ab_stage0_h003___robetta_FLAGS -psipred_ss2 h003_.psipred_ss2 -in::file::fasta h003_.fasta -kill_hairpins h003_.nobuformat.psipred_ss2 -in:file:boinc_wu_zip rb_11_28_60979_105347_ab_stage0_h003___robetta.zip -frag3 rb_11_28_60979_105347_ab_stage0_h003___robetta_h003_.200.3mers.index.gz -fragA rb_11_28_60979_105347_ab_stage0_h003___robetta_h003_.200.11mers.index.gz -fragB rb_11_28_60979_105347_ab_stage0_h003___robetta_h003_.200.12mers.index.gz -nstruct 10000 -cpu_run_time 10800 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2611634 Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_b7c7d78.zip Unpacking WU data ... 
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/rb_11_28_60979_105347_ab_stage0_h003___robetta.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... Setting up folding (abrelax) ... Beginning folding (abrelax) ... BOINC:: Worker startup. Starting watchdog... Watchdog active. Starting work on structure: _00001 # cpu_run_time_pref: 86400 Starting work on structure: _00002 Starting work on structure: _00003 Starting work on structure: _00004 Starting work on structure: _00005 Starting work on structure: _00006 Starting work on structure: _00007 Starting work on structure: _00008 Starting work on structure: _00009 Starting work on structure: _00010 Starting work on structure: _00011 Starting work on structure: _00012 Starting work on structure: _00013 Starting work on structure: _00014 Starting work on structure: _00015 Starting work on structure: _00016 Starting work on structure: _00017 Starting work on structure: _00018 Starting work on structure: _00019 Starting work on structure: _00020 Starting work on structure: _00021 Starting work on structure: _00022 Starting work on structure: _00023 Starting work on structure: _00024 Starting work on structure: _00025 Starting work on structure: _00026 Starting work on structure: _00027 Starting work on structure: _00028 Starting work on structure: _00029 Starting work on structure: _00030 Starting work on structure: _00031 Starting work on structure: _00032 Starting work on structure: _00033 Starting work on structure: _00034 Starting work on structure: _00035 Starting work on structure: _00036 Starting work on structure: _00037 Starting work on structure: _00038 Starting work on structure: _00039 Starting work on structure: _00040 Starting work on structure: _00041 Starting work on structure: _00042 Starting work on structure: _00043 Starting work on structure: _00044 Starting work on structure: _00045 Starting work on structure: _00046 Starting work on structure: 
_00047 Starting work on structure: _00048 Starting work on structure: _00049 Starting work on structure: _00050 Starting work on structure: _00051 Starting work on structure: _00052 Starting work on structure: _00053 Starting work on structure: _00054 Starting work on structure: _00055 Starting work on structure: _00056 Starting work on structure: _00057 Starting work on structure: _00058 Starting work on structure: _00059 Starting work on structure: _00060 Starting work on structure: _00061 Starting work on structure: _00062 Starting work on structure: _00063 Starting work on structure: _00064 Starting work on structure: _00065 Starting work on structure: _00066 Starting work on structure: _00067 Starting work on structure: _00068 Starting work on structure: _00069 Starting work on structure: _00070 Starting work on structure: _00071 Starting work on structure: _00072 Starting work on structure: _00073 Starting work on structure: _00074 Starting work on structure: _00075 Starting work on structure: _00076 Starting work on structure: _00077 Starting work on structure: _00078 Starting work on structure: _00079 Starting work on structure: _00080 Starting work on structure: _00081 Starting work on structure: _00082 Starting work on structure: _00083 Starting work on structure: _00084 Starting work on structure: _00085 Starting work on structure: _00086 Starting work on structure: _00087 Starting work on structure: _00088 Starting work on structure: _00089 Starting work on structure: _00090 Starting work on structure: _00091 Starting work on structure: _00092 Starting work on structure: _00093 Starting work on structure: _00094 Starting work on structure: _00095 Starting work on structure: _00096 Starting work on structure: _00097 Starting work on structure: _00098 Starting work on structure: _00099 ====================================================== DONE :: 1 starting structures 58190 cpu seconds This process generated 99 decoys from 99 attempts 
====================================================== BOINC :: WS_max 2.28844e+008 BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> ]]> Task 775179895 <core_client_version>6.10.18</core_client_version> <![CDATA[ <stderr_txt> [2015-12- 2 11:44:41:] :: BOINC:: Initializing ... ok. [2015-12- 2 11:44:41:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. command: projects/boinc.bakerlab.org_rosetta/minirosetta_3.65_windows_intelx86.exe @rb_11_29_60395_105396_ab_stage0_t000___robetta_FLAGS -psipred_ss2 t000_.psipred_ss2 -in::file::fasta t000_.fasta -kill_hairpins t000_.nobuformat.psipred_ss2 -in:file:boinc_wu_zip rb_11_29_60395_105396_ab_stage0_t000___robetta.zip -frag3 rb_11_29_60395_105396_ab_stage0_t000___robetta_t000_.200.3mers.index.gz -fragA rb_11_29_60395_105396_ab_stage0_t000___robetta_t000_.200.17mers.index.gz -fragB rb_11_29_60395_105396_ab_stage0_t000___robetta_t000_.200.6mers.index.gz -nstruct 10000 -cpu_run_time 10800 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2057108 Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. 
Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_b7c7d78.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/rb_11_29_60395_105396_ab_stage0_t000___robetta.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... Setting up folding (abrelax) ... Beginning folding (abrelax) ... BOINC:: Worker startup. Starting watchdog... Watchdog active. Starting work on structure: _00001 # cpu_run_time_pref: 86400 Starting work on structure: _00002 Starting work on structure: _00003 Starting work on structure: _00004 Starting work on structure: _00005 Starting work on structure: _00006 Starting work on structure: _00007 Starting work on structure: _00008 Starting work on structure: _00009 Starting work on structure: _00010 Starting work on structure: _00011 Starting work on structure: _00012 Starting work on structure: _00013 Starting work on structure: _00014 Starting work on structure: _00015 Starting work on structure: _00016 Starting work on structure: _00017 Starting work on structure: _00018 Starting work on structure: _00019 Starting work on structure: _00020 Starting work on structure: _00021 Starting work on structure: _00022 Starting work on structure: _00023 Starting work on structure: _00024 Starting work on structure: _00025 Starting work on structure: _00026 Starting work on structure: _00027 Starting work on structure: _00028 Starting work on structure: _00029 Starting work on structure: _00030 Starting work on structure: _00031 Starting work on structure: _00032 Starting work on structure: _00033 Starting work on structure: _00034 Starting work on structure: _00035 Starting work on structure: _00036 Starting work on structure: _00037 Starting work on structure: _00038 Starting work on structure: _00039 Starting work on structure: _00040 Starting work on structure: _00041 Starting work on structure: _00042 Starting work on structure: 
_00043 Starting work on structure: _00044 ====================================================== DONE :: 1 starting structures 86100.8 cpu seconds This process generated 44 decoys from 44 attempts ====================================================== BOINC :: WS_max 2.76714e+008 BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> ]]> Both workunits were canceled. How about sending a kill command to the clients in order to avoid wasting resources? . |
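Both logs above end with the same summary block, which is what makes the tasks look successful on the client side. A minimal sketch of pulling those numbers out of a stderr dump, assuming the summary always has the wording seen in these two logs (the function name `summarize_stderr` and the regular expression are illustrative, not part of BOINC or minirosetta):

```python
import re

# Matches the end-of-task summary as it appears in the two logs above.
DONE_RE = re.compile(
    r"DONE :: (\d+) starting structures\s+([\d.]+) cpu seconds\s+"
    r"This process generated (\d+) decoys from (\d+) attempts"
)


def summarize_stderr(stderr_text):
    """Return a summary dict, or None if no DONE block was printed."""
    match = DONE_RE.search(stderr_text)
    if match is None:
        return None
    return {
        "cpu_seconds": float(match.group(2)),
        "decoys": int(match.group(3)),
        "attempts": int(match.group(4)),
        # The client-side log ends with this marker when the app exits normally.
        "finished_cleanly": "called boinc_finish" in stderr_text,
    }


# Shortened excerpt of the second log (task 775179895).
sample = (
    "DONE :: 1 starting structures 86100.8 cpu seconds "
    "This process generated 44 decoys from 44 attempts "
    "BOINC :: Watchdog shutting down... called boinc_finish"
)
print(summarize_stderr(sample))
```

A summary like this only shows that the client finished its work; it says nothing about why the server-side validator rejected the result.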
David Ball Joined: 25 Nov 05 Posts: 25 Credit: 1,439,333 RAC: 0 |
Validate errors on rb_11_2* tasks I'm also getting validate errors on some workunits that say they completed OK on the client but get a validate error on the server, with the workunit details saying they were cancelled. I've started aborting any WUs that start with "rb_" since yours were in the "rb_11_2*" range and mine were in the "rb_12_06_61173_105565_ab_stage0_t000*" range. I also noticed that some in that range were completing and validating. Workunits: failed: rb_12_06_61173_105565_ab_stage0_t000___robetta_IGNORE_THE_REST_04_10_313585_76_0; failed: rb_12_06_61173_105562_ab_stage0_t000___robetta_IGNORE_THE_REST_07_12_313580_207_0; passed: rb_12_06_61173_105565_ab_stage0_t000___robetta_IGNORE_THE_REST_05_10_313585_82_0; passed: rb_12_06_61173_105562_ab_stage0_t000___robetta_IGNORE_THE_REST_07_12_313580_47_0. If there's a way to cancel WUs without them running for many hours on the client then I really wish rosetta would use it. Thanks, David Have you read a good Science Fiction book lately? |
Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
I've started aborting any WUs that start with "rb_" since yours were in the "rb_11_2*" range and mine were in the "rb_12_06_61173_105565_ab_stage0_t000*" range. I also noticed that some in that range were completing and validating. This is a bit of a bazooka-to-kill-a-housefly type of solution. I'd encourage anyone not to abort WUs based on something as broad as starting with 'rb_', as 'rb_' work units are part of the Robetta prediction server and serve an incredibly wide range of research projects. Secondly, I'll note that the two 'failed' WUs you listed above, David, actually DID grant you full credit (click on the WU link you posted and scroll to the bottom; you'll see credit was granted, and even though it doesn't show in the summary of your WUs it does count towards your total). Lastly, looking at some of the WUs you've aborted, most of them (like this one, and this one, and this one, for example) were successfully completed by other users after being aborted on your end :). |
PanicMan Joined: 31 Jan 10 Posts: 7 Credit: 276,651 RAC: 0 |
i was just looking around a bit and found this issue also... all have been rb-12 or 11 workunits... a total of 6 of them in the last few days... only 1 before that as far as my history shows on site, and that was an rb-11 task... i did get credit for 5/7 of them but boy it isn't nice to see 14k computation seconds thrown out the window twice in the last week... seems obvious from what i have read in this thread the issue is with the rb tasks... admittedly i have no idea as to what all these numbers mean but i assume someone does and could do something to check these units before sending? not sure how many others this has happened to as the site only goes back to 11/24 for me. |
David Ball Joined: 25 Nov 05 Posts: 25 Credit: 1,439,333 RAC: 0 |
I've started aborting any WUs that start with "rb_" since yours were in the "rb_11_2*" range and mine were in the "rb_12_06_61173_105565_ab_stage0_t000*" range. I also noticed that some in that range were completing and validating. Thanks for the info. Basically I was waiting for more information and will start letting them run. BTW, I could be remembering wrong but when I checked the failed WUs shortly after they failed, ISTR that the granted credit on the workunit details was something like "-----". Anyway, I'll let them process from now on. -- David Have you read a good Science Fiction book lately? |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Unfortunately, there is no easy way to efficiently abort the tasks mid process and keep the results/work. I'll likely change the delay bound for these Robetta jobs so they do not get sent to computers that can't finish them in time. This change may prevent some computers from getting work if there are only Robetta jobs in the queue which would be rare. Or maybe there is a better fix? |
wiueiwue Joined: 7 Sep 15 Posts: 1 Credit: 0 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=704600829 zibochen.helix.151129ZCwh4start2_fold_and_dock_SAVE_ALL_OUT_312981_3717 777217222 2201387 10 Dec 2015 7:08:56 UTC 10 Dec 2015 7:29:33 UTC Over Client error Compute error 848.38 1.96 --- 777220523 1729478 10 Dec 2015 7:30:20 UTC 11 Dec 2015 3:02:33 UTC Over Client error Compute error 4,093.50 41.12 --- |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
Unfortunately, there is no easy way to efficiently abort the tasks mid process and keep the results/work. No, but you could at least abort them on the next scheduler request if they have not started yet. Or, if you still want the results back, simply do not generate new WUs for that job, but do not cancel the already sent out ones (should be possible I think). I'll likely change the delay bound for these Robetta jobs so they do not get sent to computers that can't finish them in time. This change may prevent some computers from getting work if there are only Robetta jobs in the queue which would be rare. Or maybe there is a better fix? Not sure what you mean, the "canceled" tasks were finished before deadline... . |
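The first suggestion above (abort only the tasks that have not started yet, and let running ones finish) can be sketched as a small decision function. This is a hypothetical illustration of the idea only: the `ClientTask` type, the function name, and the IDs are invented for the example and do not describe how the BOINC scheduler is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class ClientTask:
    """A task the client reports in a scheduler request (hypothetical shape)."""
    name: str
    workunit_id: int
    started: bool


def plan_scheduler_reply(client_tasks, canceled_workunits):
    """Split reported tasks into names to abort and names to keep.

    Only tasks of a canceled workunit that have NOT started are aborted;
    tasks already running are kept so their CPU time is not wasted.
    """
    abort, keep = [], []
    for task in client_tasks:
        if task.workunit_id in canceled_workunits and not task.started:
            abort.append(task.name)
        else:
            keep.append(task.name)
    return abort, keep


# Hypothetical data: workunit 1 was canceled on the server, workunit 2 was not.
tasks = [
    ClientTask("task_a", workunit_id=1, started=False),
    ClientTask("task_b", workunit_id=1, started=True),
    ClientTask("task_c", workunit_id=2, started=False),
]
abort, keep = plan_scheduler_reply(tasks, canceled_workunits={1})
print(abort, keep)
```

Because the client initiates every contact, a decision like this can only take effect the next time the client sends a scheduler request.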
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Unfortunately, there is no easy way to efficiently abort the tasks mid process and keep the results/work. The Robetta jobs often finished before the deadline but I updated the deadline and Robetta no longer cancels jobs so hopefully this will help. I may have to adjust the delay bound to find an optimal value. |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
Unfortunately, there is no easy way to efficiently abort the tasks mid process and keep the results/work. I'll likely change the delay bound for these Robetta jobs so they do not get sent to computers that can't finish them in time. This change may prevent some computers from getting work if there are only Robetta jobs in the queue which would be rare. Or maybe there is a better fix? i'd guess users hitting those issues may perhaps need to tune their computing preferences so that the boinc client does not download too many tasks. parameters like Computing preferences > "Maintain enough work for an additional n days" should be limited to a manageable number, e.g. i used 0.5 days. changing parameters on the server side unfortunately would run into various dilemmas as those normally affect everyone. e.g. having too far an expiry date would lead to the researchers waiting too long for results to be turned around. and if there are orphaned tasks that'd be worse, as the boinc server only comes to know that they are orphaned after waiting till expiry, and those get re-assigned say after a 2 week expiry period (see https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6747 ). having too short an expiry period risks tasks getting cancelled/aborted by the server before they are complete. the other thing is for users to practice a quick turn around rather than to download a large cache of jobs, e.g. i normally get a sufficient number of tasks for the current session which i intend to run and set 'no new tasks' once there are adequate tasks, so that those tasks are completed and returned to the server promptly. i'd guess that helps me keep a short average turnaround time of 0.24 days |
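The trade-off between cache size and turnaround described above can be put into rough numbers. The sketch below is a deliberately simplified model, assuming one task runs per core at a time and every task runs for exactly its target CPU time; the real BOINC work-fetch logic is more involved, and both function names are made up for the example. The 3 hour and 24 hour run times correspond to the `-cpu_run_time 10800` and `cpu_run_time_pref: 86400` values visible in the logs earlier in the thread.

```python
import math


def queued_tasks_per_core(buffer_days, task_hours):
    """Tasks needed per core to fill a work buffer of `buffer_days` days."""
    return math.ceil(buffer_days * 24 / task_hours)


def worst_case_turnaround_days(buffer_days, task_hours):
    """Days until the last queued task on a core finishes (simplified model)."""
    return queued_tasks_per_core(buffer_days, task_hours) * task_hours / 24


# A 0.5 day buffer with 3 hour tasks, as in the preference mentioned above.
print(queued_tasks_per_core(0.5, 3), worst_case_turnaround_days(0.5, 3))

# A 10 day buffer with 24 hour tasks keeps the last task waiting far longer.
print(queued_tasks_per_core(10, 24), worst_case_turnaround_days(10, 24))
```

Under this model the last task in the queue is returned roughly one buffer length after download, which is why a large cache pushes results closer to the deadline.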
Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
I updated the deadline and Robetta no longer cancels jobs so hopefully this will help. I may have to adjust the delay bound to find an optimal value. Wondering if that part of things - Robetta canceling jobs - was my fault (as per this thread, which resulted in David adding logic to cancel tasks once a Robetta job completes, to solve a separate issue of jobs being sent out for already-completed Robetta runs). Really hard to please everyone now, isn't it :) |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
i'd guess users hitting those issues may perhaps need to tune their computing preferences so that the boinc client do not download too many tasks The issue here is not that someone has a too large cache, the work units were done long before deadline, the issue is that jobs are canceled on the server and this information is not passed to the client. . |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
Thanks link, i see your point. there are various/many limitations with boinc, i'd guess in part due to the protocol design. For the most part it works well, then in the real world we have the extremes which fall out of the 'normal' design ranges of boinc i'd guess. i read that boinc is based on a 'one way polling' design where all network requests are initiated by the client; this rules out 'push' notifications as a solution. there are probably ways to resolve that, e.g. using a 2 phase closure state design (e.g. when the batch/jobs are cancelled on the server, jobs which have been downloaded become 1/2 closed; when the client completes and submits the results, the server can then assign credits and mark the job closed), but that may make boinc codes more complicated & it'd take effort to do so. i'd guess it may be good to post some of these circumstances/issues in the boinc message boards http://boinc.berkeley.edu/dev/ or log bug reports https://github.com/BOINC/boinc/issues so that the developers could consider them as boinc codes are enhanced. actually there are many pressing issues, such as that servers tend to be overloaded due to design, e.g. heavy 'polling' by clients. those again are complicated protocol design issues which may take significant effort to enhance the codes, test and to install/update servers and distribute updated clients to all participants. |
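The "2 phase closure" idea above can be read as a small state machine on the server side. The following is a minimal sketch of that proposal as described in the post, not of anything BOINC actually does: the state names, the `Job` class, and its methods are all hypothetical.

```python
from enum import Enum


class JobState(Enum):
    OPEN = "open"                # sent to a client, batch still wanted
    HALF_CLOSED = "half_closed"  # batch canceled, result still accepted
    CLOSED = "closed"            # result received and credited


class Job:
    """One downloaded job as tracked by a hypothetical server."""

    def __init__(self, name):
        self.name = name
        self.state = JobState.OPEN
        self.credit = 0.0

    def cancel_batch(self):
        # Phase 1: canceling does not discard jobs that are already out.
        if self.state is JobState.OPEN:
            self.state = JobState.HALF_CLOSED

    def report_result(self, claimed_credit):
        # Phase 2: the client reports in; credit is granted either way.
        if self.state is JobState.CLOSED:
            raise ValueError("job already closed")
        self.credit = claimed_credit
        self.state = JobState.CLOSED


job = Job("example_job")  # hypothetical job name
job.cancel_batch()
print(job.state)
job.report_result(claimed_credit=41.12)
print(job.state, job.credit)
```

The point of the half-closed state is that a cancellation on the server never turns a finished, uploaded result into an error; it only stops new copies from being sent out.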
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I updated the deadline and Robetta no longer cancels jobs so hopefully this will help. I may have to adjust the delay bound to find an optimal value. Yep, but I think reducing the delay bound was necessary so bringing the issue(s) to light was great. It may need further adjusting but it's better than before IMO. Thanks! |
©2024 University of Washington
https://www.bakerlab.org