Message boards : Number crunching : minirosetta 2.05
Author | Message |
---|---|
ramostol Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0 |
My first Protein_interface (validation related?) error as far as I know - MacOS 10.5: tyrsim_3gbn_2esa_Protein_interface_design_01Feb2010_17949_9_2 Outcome Success Client state Done Exit status 0 (0x0) CPU time 21540.8 <core_client_version>6.10.36</core_client_version> <![CDATA[ <stderr_txt> [...] # cpu_run_time_pref: 21600 ====================================================== DONE :: 327 starting structures 21540.3 cpu seconds This process generated 327 decoys from 327 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> ]]> Validate state Workunit error - check skipped One of two wingmen validated successfully after his deadline, but with far fewer decoys completed. |
AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
My first Protein_interface (validation related?) error as far as I know - MacOS 10.5: There is nothing wrong on your end. This is a very old (and rare) bug in the boinc server software. Take a look here. Wait a second, the trac item claims that the bug is fixed. Maybe it is time for Rosetta to update the server-code. AdeB |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=322413556 tyrsim_3gbn_q.gz_Protein_interface_design_25Feb2010_18415_276_1 Outcome Client error Client state Compute error Exit status 1 (0x1) CPU time 4.4375 stderr out <core_client_version>6.10.18</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Looks like there are still problems with this app: the same task just restarted near the end and I got it in the neck. Not impressed. tyrsim_3gbn_1c81_Protein_interface_design_25Feb2010_18415_410_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=294414088 # cpu_run_time_pref: 14400 ====================================================== DONE :: 348 starting structures 14397.5 cpu seconds This process generated 348 decoys from 348 attempts ====================================================== # cpu_run_time_pref: 14400 ====================================================== DONE :: 2 starting structures 14498.9 cpu seconds This process generated 2 decoys from 2 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> ]]> Validate state Valid Claimed credit 102.297287162446 Granted credit 0.384433279143336 application version 2.05 |
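The stderr quoted above contains two DONE summaries, the first reporting 348 decoys and the second only 2, which is the visible trace of the restart. A minimal Python sketch for pulling these summaries out of a stderr dump follows. The function names are hypothetical, and treating more than one DONE block as a sign of a restart is an assumption drawn from this post, not documented minirosetta behaviour.

```python
import re

# Matches the summary block as it appears in the stderr quoted in this thread.
DONE_RE = re.compile(
    r"DONE :: (\d+) starting structures\s+([\d.]+) cpu seconds\s+"
    r"This process generated (\d+) decoys from (\d+) attempts"
)

def done_summaries(stderr_text):
    """Return one dict per DONE block found in the stderr text."""
    return [
        {
            "structures": int(structures),
            "cpu_seconds": float(cpu),
            "decoys": int(decoys),
            "attempts": int(attempts),
        }
        for structures, cpu, decoys, attempts in DONE_RE.findall(stderr_text)
    ]

def looks_restarted(stderr_text):
    """Heuristic (assumption): more than one DONE block suggests a restart."""
    return len(done_summaries(stderr_text)) > 1

# Text copied from the post above.
sample = (
    "DONE :: 348 starting structures 14397.5 cpu seconds "
    "This process generated 348 decoys from 348 attempts "
    "====== # cpu_run_time_pref: 14400 ====== "
    "DONE :: 2 starting structures 14498.9 cpu seconds "
    "This process generated 2 decoys from 2 attempts"
)

for block in done_summaries(sample):
    print(block["decoys"], "decoys in", block["cpu_seconds"], "cpu seconds")
print("restart suspected:", looks_restarted(sample))
```

Only the last block reports 2 decoys, so a result that is scored from the final summary alone would look far smaller than the work actually done. Whether that is how the granted credit was computed here is not confirmed in the thread.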
apohawk Joined: 13 Sep 08 Posts: 5 Credit: 30,438,070 RAC: 0 |
This work unit reports "success" despite having errors at the end. https://boinc.bakerlab.org/rosetta/result.php?resultid=323517090 application: minirosetta 2.05 name of work unit: ina2inaN_to_NOE__18638_5045_0 Outcome: Success Exit status: 0 (0x0) CPU time: 2212.594 but at the end of the result we got: # cpu_run_time_pref: 7200 ERROR: Unrecognized edge type! ERROR:: Exit from: ..\..\src\core\kinematics\util.cc line: 1422 called boinc_finish CPU: Phenom II 945 OS: WinXP 64 SP2 |
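The mismatch described above (exit status 0 alongside ERROR lines in stderr) can be spotted mechanically. A small Python sketch follows, assuming the stderr is available as plain text with one message per line. Both function names are hypothetical, and the sample keeps only the first ERROR line from the post.

```python
def error_lines(stderr_text):
    """Return the stripped lines of a stderr dump that begin with 'ERROR'."""
    stripped = (line.strip() for line in stderr_text.splitlines())
    return [line for line in stripped if line.startswith("ERROR")]

def suspicious_success(exit_status, stderr_text):
    """True when the task exited with status 0 but stderr still logged errors."""
    return exit_status == 0 and bool(error_lines(stderr_text))

# Abbreviated from the result quoted above.
sample = (
    "# cpu_run_time_pref: 7200\n"
    "ERROR: Unrecognized edge type!\n"
    "called boinc_finish\n"
)

print(error_lines(sample))
print(suspicious_success(0, sample))
```

A check like this only flags results for a closer look; it says nothing about why the application still exited with status 0.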
Duzz Joined: 14 Nov 05 Posts: 1 Credit: 13,148 RAC: 0 |
Over the last few days I have had several WUs go idle after some time of computation. The Windows XP Task Manager shows no CPU activity. If one does not notice this, many hours of WU processing are lost, which is very unproductive for the project. |
AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
In workunit gunn_fragments_SAVE_ALL_OUT_-1wtyA__18642_1106 both tasks (324092645 and 323994500) ended with the same error: ERROR: ct == final_atoms AdeB |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,847,836 RAC: 11,935 |
Today I got strange validation errors: "Task was reported too late to validate" But there are 4 days until deadline (19 Mar)! Links to the tasks: https://boinc.bakerlab.org/rosetta/result.php?resultid=323161767 https://boinc.bakerlab.org/rosetta/result.php?resultid=323181972 https://boinc.bakerlab.org/rosetta/result.php?resultid=323205144 |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
What is odd is the way the tasks were reissued before he reported the completed ones back. That wouldn't normally happen. That isn't dependent upon Mad Max's machine, so I doubt they did a restore or anything. I'll have to see what we can find out. Rosetta Moderator: Mod.Sense |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,847,836 RAC: 11,935 |
The error with "detached" is BOINC related. I did not actually detach from the project; I connected a new computer. But after that the BOINC client initially went mad: first it started to download to the new computer (Athlon II X2 250) tasks that had already been downloaded to the old computer (Athlon XP 2600+), then at some point it thought better of it, registered the new computer on the server under a new ID, and then deleted the mistakenly downloaded tasks. (I think this is the point that was recorded on the server as "detached".) Note: there was no transfer of any BOINC-related files from the old computer to the new one. The new client was a clean install from the distribution. So I do not know what caused this behavior. Maybe the fact that the computer connects to the internet from the same IP? Hmm, now I think that, in principle, such a validate error could happen because of it. If one computer "cancels" the (mistakenly downloaded) tasks while the second is working on them, could the server issue the same WU to another volunteer's computer and shift the deadline? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
True, not a problem specific to v2.05 Rosetta. Perhaps BOINC server, or client. Either way, we should start another thread if further problem tasks are found. Certainly many users that have multiple machines are connecting from same IP address (I'm talking the router's public IP address that the project servers see). And many other users come in via dynamic IPs, and so it is always different. My understanding is that BOINC uses many factors to determine if a given machine is the same as an existing registered one to keep it all straight and separated correctly. Factors such as the user ID, host name, any existing BOINC host ID, machine type, installed OS, last RPC sequence number... so a fresh install should not have caused the client to "go mad" on either machine. Indeed many users have identically configured machines at same site coming in via same IP. Rosetta Moderator: Mod.Sense |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
This took 8hrs, 2min on my 3GHz Intel, with a four-hour run time preference. aqp9__boinc_aqp9_fast_run01_yfsong_loopbuild_threading_cst_relax_superfast_yfsong_IGNORE_THE_REST_18658_1421_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=296064742 # cpu_run_time_pref: 14400 Continuing computation from checkpoint: chk_S_2B6OA_15_0001_Remodel__loop_1_0_0_S ... success! BOINC:: CPU time: 28914.7s, 14400s + 14400s[2010- 3-17 13:39:17:] :: BOINC InternalDecoyCount: 0 ====================================================== DONE :: 1 starting structures 28914.7 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== called boinc_finish SIGSEGV: segmentation violation Stack trace (15 frames): [0x96c49b3] [0x96ee888] [0xb7fe9420] [0x91d6455] [0x842671e] [0x83e85d3] [0x80a7840] [0x84381fe] [0x812a54a] [0x812b82d] [0x86aa16b] [0x8243cf5] [0x8049897] [0x974c15c] [0x8048121] Exiting... </stderr_txt> ]]> Validate state Valid Claimed credit__69.3077894676244 Granted credit__25.52312719487 -- for 8hrs. |
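The figures in the log line "CPU time: 28914.7s, 14400s + 14400s" can be checked with a little arithmetic. The sketch below assumes the two 14400 s values are the run time preference plus an equal grace period; that reading is an assumption, since the thread does not say what the second figure is. The function name and its parameters are hypothetical.

```python
def overrun(cpu_seconds, run_time_pref, grace):
    """Return (seconds past pref + grace, CPU time as a multiple of the pref)."""
    limit = run_time_pref + grace
    return cpu_seconds - limit, cpu_seconds / run_time_pref

# Values taken from the stderr quoted above.
excess, ratio = overrun(28914.7, 14400, 14400)
hours, remainder = divmod(28914.7, 3600)

print(f"{excess:.1f} s past the assumed limit")
print(f"{ratio:.2f}x the run time preference")
print(f"{int(hours)} h {remainder / 60:.0f} min of CPU time")
```

The CPU time works out to roughly 8 hours and 2 minutes, which agrees with the duration reported in the post.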
Sid Celery Joined: 11 Feb 08 Posts: 2117 Credit: 41,161,072 RAC: 15,284 |
On this desktop I got a Compute error Exit status -177 (0xffffff4f) in the following task: aqp9__boinc_aqp9_fast_run01_blast_yfsong_loopbuild_threading_cst_relax_superfast_yfsong_IGNORE_THE_REST_18653_30510_0 <message> I did notice while it was running that it was about 2 hours over my 8-hour runtime, on Model 6 Step 19051, but it reported 0 CPU time in the end. I allow 10GB disk space for BOINC and have about 581MB in use on 5 current or waiting tasks, 9.43GB free. Also, on this laptop I got a validate error on the following task a few days back: t290__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_8451_0 |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,847,836 RAC: 11,935 |
@Mod.Sense: Yes, it is certainly not a problem with minirosetta 2.05. It looks like some rare bug in the BOINC server, probably connected with the fact that the computer had the same IP (not only the "external" router IP, but the internal one too) and the same network name. The new computer was a replacement for the old one, so I gave the new one the same name as the previous one, after first renaming the old one. Actually, this should not be a factor, because BOINC identifies hosts by the internal ID (such as 1211592) and not by Windows names. But a bug is a bug, and something did not go as intended :) In any case, I have not come across any more such errors, so I think this can be forgotten. @Sid Celery: I also had a lot of errors in tasks such as *__boinc_filtered_loopbuild_threading_*. In fact, every second job terminated with an error. Every task of this type also overran the target CPU time, and there were strange-looking things in the graphics (such as RMSD from 20 to 50 and odd-looking models). So now I am canceling all jobs of this type if I see them in the job queue. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Sid, each task also has a configured maximum disk space. So that must be the limit that was hit by the task you mention. This is just one more failsafe that is in place to help assure things keep running smoothly. Rosetta Moderator: Mod.Sense |
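The result pages in this thread show each exit status twice, as a signed number and as hex, for example "-177 (0xffffff4f)". The two forms are the same 32-bit value, as the Python sketch below illustrates. The helper name is hypothetical. As far as I know, BOINC's error number list names -177 ERR_RSC_LIMIT_EXCEEDED, which would fit the per-task limit described above, but treat that mapping as an assumption to verify against the BOINC source.

```python
def exit_status_hex(status):
    """Format a signed exit status with its 32-bit two's-complement hex form."""
    return f"{status} (0x{status & 0xFFFFFFFF:x})"

# Statuses that appear in results quoted in this thread.
for status in (0, 1, -177):
    print(exit_status_hex(status))
```

Masking with 0xFFFFFFFF is what turns the negative status into the unsigned hex value shown on the result page.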
Sid Celery Joined: 11 Feb 08 Posts: 2117 Credit: 41,161,072 RAC: 15,284 |
I also had a lot of errors in tasks such as *__boinc_filtered_loopbuild_threading_*. In fact, every second job terminated by an error. And violating the target CPU time in each of the first (ie all tasks of this type) + strange looking things in graphics part (such as RMSD from 20 to 50 and odd-looking models) It's the only error I've had in the last week on that W7 laptop, and credit was granted in the clean-up job, so I'm not worried by it - I don't understand any of these validate errors but while I was reporting the other one I thought I'd just mention it. I don't think my errors are the same as yours in that case. I'm more surprised by the disk-usage issue on the Vista desktop which is otherwise very well behaved. I did suspect the task type, but others have gone through now with no problem at all, so maybe it just went a bit 'rogue' on me. I just thought it was worth describing seeing as I noticed it was a bit odd while running for 10 hours, yet the task details didn't indicate anything more than it failed on startup, which wasn't actually the case. One for the backroom team to ponder. |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Miscellaneous computation errors: ---- 327069193 (v2FcInnerW_1dAl_3GM3_ProteinInterfaceDesign_15Mar2010_18672_254_0) failed on Mac OS X. Similar failure from wingman. ERROR: f.check_fold_tree() ERROR:: Exit from: src/protocols/docking/DockingProtocol.cc line: 405 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> ]]> ---- 326722657 (placestub_alt_denovo_1zvy_1z2m_ProteinInterfaceDesign_21Mar2010_18705_22_0) failed on W7 ERROR: in::file::zip minirosetta_database.zip does not exist! ERROR:: Exit from: ..\..\src\apps\public\boinc\minirosetta.cc line: 137 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> ---- 326721814 (tedor-cs_-tdonly-1-calbindin__18708_33_1) failed on W7. Similar failure from wingman. ERROR: rsd_type_list.size() ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> ]]> |
Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
326722657 (placestub_alt_denovo_1zvy_1z2m_ProteinInterfaceDesign_21Mar2010_18705_22_0) failed on W7 Add me to the list with tedor-cs_-tdonly-1-gb3__18708_4647 ERROR: rsd_type_list.size() ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62 BOINC:: Error reading and gzipping output datafile: default.out |
allenandholmes Joined: 17 Dec 07 Posts: 1 Credit: 7,563 RAC: 0 |
I have been processing my current minirosetta task for 4 or 5 days now and have had a suspicion about its checkpointing capabilities. I shut my PC down each night and restart it the next morning for BOINC processing. However the elapsed time displayed resets to 0, the time to completion continues to increase all day long (and between sessions) and the processed percentage is dramatically different from a ratio of elapsed/completion times. Am I wasting my time? |
Sid Celery Joined: 11 Feb 08 Posts: 2117 Credit: 41,161,072 RAC: 15,284 |
One unusual error I haven't seen before - W7-64bit laptop: Rossmann3x3_abinitio_SAVE_ALL_OUT_design_k031_001_18698_1551_0 Outcome Client error |