Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Shawn Volunteer moderator Project developer Project scientist Send message Joined: 22 Jan 10 Posts: 17 Credit: 53,741 RAC: 0 |
We are aware that we have had some issues with bad jobs on Rosetta@home recently. We try to ensure that these bad jobs don't slip through, but they occasionally do. When that happens, your efforts to alert us to these problems are extremely important and very much appreciated. In order to ensure that we address technical issues promptly, graduate students in the Baker lab (such as myself) will be regularly monitoring this message board for such problems. This will be in addition to the help of Mod.Sense, our vigilant forum moderator who has done a lot to ensure that these projects run as smoothly as possible. I ask that you alert us to new issues in this thread so that we can find them more easily. Thank you all once again for your commitment to Rosetta@home! |
Hank Barta Send message Joined: 6 Feb 11 Posts: 14 Credit: 3,943,460 RAC: 0 |
> I ask that you alert us to new issues in this thread so that we can find them more easily.
Thank you for helping to deal with these. This morning I have seen a number of errors with a different signature. These run for 3-5 minutes before producing an error and exiting. The characteristic error seems to be:
ERROR: ct == final_atoms
An example is https://boinc.bakerlab.org/rosetta/workunit.php?wuid=382081360
Thanks, Hank |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,524,889 RAC: 7,500 |
I hope these changes involve ralph@home!!! |
Shawn Volunteer moderator Project developer Project scientist Send message Joined: 22 Jan 10 Posts: 17 Credit: 53,741 RAC: 0 |
> Thank you for helping to deal with these. This morning I have seen a number of errors with a different signature. These run for 3-5 minutes before producing an error and exiting. The characteristic error seems to be:
Hey Hank, thanks for letting us know. This job has been deleted and is no longer in the queue. Apparently, this was a small test job that reported failure early, and its author marked it for deletion right away, but sometimes those jobs propagate for a while anyway. In any case, you shouldn't see this particular job anymore, but if for some reason it persists, please give us an update! |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
> I ask that you alert us to new issues in this thread so that we can find them more easily.
I thought these were tested on RALPH before being brought over to Rosetta? If that is the case, then this job should not have slipped through. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,524,889 RAC: 7,500 |
Ralph has had big problems since December: few WUs, no communication from the team, etc. I hope this situation changes. If you need our help to "check" the code, please give us some information, news, details, etc. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
We're in the process of upgrading RALPH. The current server is very unstable. We do need to be far better at providing information about new projects/jobs that we test on RALPH and I'll stress that point to the lab members. The RALPH WU flow will depend on whether or not we have new jobs to test. Many jobs have already been tested. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,524,889 RAC: 7,500 |
Thanks for the information :-) |
Speedy Send message Joined: 25 Sep 05 Posts: 163 Credit: 808,098 RAC: 0 |
Task 420656625 FOLD_N_DOCK_dagk_D2symm got Validate state Invalid after CPU time 2010.416, when the run time was meant to be 3 hours. The corresponding work unit, 420591203, also got Validate state Invalid, after CPU time 3843.709 (it has a debug message). I posted the above in the minirosetta 2.17 thread on 06/05/11. Edit: added clickable links. Have a crunching good day!! |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I got out-of-memory error codes on these tasks; that should not be possible, as I have 3.24GB of RAM.
FOLD_N_DOCK_2kqt_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26674_9746_0
FOLD_N_DOCK_2kqt_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26674_1528_0
FOLD_N_DOCK_dagk_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26520_9259_1
Error message: - Unhandled Exception Record - Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7C812AFB |
Zydor Send message Joined: 4 May 11 Posts: 7 Credit: 12,648 RAC: 0 |
A couple of possible problem WUs for you - they are 1-hour WUs that ran for around 25-30 minutes and failed to progress beyond 2-3% completion. Other 1-hour ones had a completion percentage roughly in line with the time done so far, so I aborted both.
https://boinc.bakerlab.org/rosetta/result.php?resultid=421246729
https://boinc.bakerlab.org/rosetta/result.php?resultid=421246619
Regards Zy |
Ray Wang Send message Joined: 9 Mar 09 Posts: 8 Credit: 230,454 RAC: 0 |
> I got out-of-memory error codes on these tasks; that should not be possible, as I have 3.24GB of RAM.
Hi Speedy and Greg_BE, I am Ray, a graduate student in the Baker lab. I will be taking care of the issues caused by the "FOLD_N_DOCK"-related jobs. As Greg_BE said, it really isn't likely that these jobs could use all of that 3.24GB of RAM. Thank you all for letting us know about the problems, and for your contribution to Rosetta@home!!! |
Zydor Send message Joined: 4 May 11 Posts: 7 Credit: 12,648 RAC: 0 |
Could someone take a peek at the list for my laptop? I'm not new to BOINC, but I am a total newb at Rosetta, so at present I will miss the obvious until my feet are under the table. (I'm running a few WUs to get used to Rosetta, ready for the Pentathlon in a day or so.)
https://boinc.bakerlab.org/rosetta/results.php?hostid=1441160&offset=20
I made a post two up about the slow ones, but I'm wondering if it's a bad batch. I'm running two from the same date/time batch and they are slow as well (18-19% done after circa 2 hrs 45 min for 1-hour WUs). The two running at present are Task IDs 421246725 and 421246743. I am starting to wonder if they really are 1-hour WUs; maybe there are longer ones in that batch, but there were 1-hour ones I did previously in the same batch, so it's a bit strange. Ignore the laptop preference as set at present; it was set to 1 hour when that batch was downloaded.
Regards Zy |
Elizabeth Send message Joined: 24 Nov 06 Posts: 1 Credit: 6,905 RAC: 0 |
> A couple of possible problem WUs for you - they are 1-hour WUs that ran for around 25-30 minutes and failed to progress beyond 2-3% completion. Other 1-hour ones had a completion percentage roughly in line with the time done so far, so I aborted both.
Hi Zy, this job is currently returning models at a reasonable rate, but we're looking into the problem. Thanks for the heads up! |
Adam Gajdacs (Mr. Fusion) Send message Joined: 26 Nov 05 Posts: 13 Credit: 2,790,385 RAC: 2,424 |
> I am Ray, a graduate student in the Baker lab. I will be taking care of the issues caused by the "FOLD_N_DOCK"-related jobs. As Greg_BE said, it really isn't likely that these jobs could use all of that 3.24GB of RAM.
They definitely can. I just noticed that one of my two rigs had started thrashing like hell. It turned out that a single one of these FOLD_N_DOCK WUs (https://boinc.bakerlab.org/rosetta/result.php?resultid=421379634) was using 1.45GB of virtual memory on a system with only 1GB of physical memory; it was effectively running from the disk. The other core was idle because there was no memory left to run another WU on it, but if there had been, it would've been about 3GB total. |
Zydor Send message Joined: 4 May 11 Posts: 7 Credit: 12,648 RAC: 0 |
Quick note to close the loop on my posts above. I've ended up having to do a detach/attach (after aborting the held WUs) on my machines. Sorry about the aborts, but I felt I had no choice. On restart, the problem has disappeared and, at present at least, all appears to be progressing normally. I have yet to complete one since the detach, but all three machines appear to be behaving now. No idea of the reason; it's strange that it hit all three machines. There's no hangover from other worries elsewhere as far as I know, as things have been stable in my recent travels around BOINC. Anyway... for what it's worth, detaching and reattaching resolved my problems, though I have absolutely no idea why :)
Regards Zy |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Zydor, your expectations of what a 1-hour work unit is are not realistic for R@h, and you will only confuse yourself further by modifying the runtime preference frequently. The runtime preference is a runtime characteristic - it takes effect when the task runs - so your preference at the time of download is not relevant. Here are a couple of threads that discuss how the runtime works:
- Discussion on increasing the default run time
- Newbie Q&A: discussion on runtime under "Q: I am on a dial-up connection, how can I use less modem time?"
- Newbie Q&A: "Q: Progress Percent not advancing?"
- Newbie Q&A: "Q: I'm familiar with SETI and BOINC already, but what should I know about Rosetta?"
I'm sure you were trying to complete and return work as quickly as possible for immediate credit recognition during the Pentathlon... unfortunately, when you do that, some of the other nice things, such as an accurate progress % and consistently completing within such a limited timeframe, go out the door. Each task must complete at least one model. For some tasks you will see a model every 5 minutes or so; for others, it can take several hours. So not all tasks are going to complete within your one-hour target, and that is normal and to be expected. You might want to start a thread to discuss suggested settings for Pentathlon participants.
Rosetta Moderator: Mod.Sense |
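To make the runtime behavior above concrete, here is a minimal sketch, not the actual minirosetta code: `build_model` and the timing are placeholder assumptions. It shows how a runtime preference acts as a target rather than a hard limit, because the task always finishes the model it is working on before checking the clock.

```python
import time

def run_task(build_model, runtime_preference_secs):
    """Run models until the preferred runtime has elapsed.

    build_model stands in for one full modeling trajectory;
    a real one may take anywhere from minutes to hours.
    """
    start = time.monotonic()
    models = []
    while True:
        models.append(build_model())          # never interrupted mid-model
        if time.monotonic() - start >= runtime_preference_secs:
            break                             # stop only between models
    return models
```

Under this scheme, a one-hour preference combined with a model that takes three hours still yields one completed model, which is why a "1-hour" task can legitimately run well past the hour.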
Zydor Send message Joined: 4 May 11 Posts: 7 Credit: 12,648 RAC: 0 |
Spoke too soon :) Another one for you, from the laptop - it has had a total detach/attach and clean-out, so this one started on a pristine, clean default setup, no tweaks or o/c - but with some more detail this time, as I was trying to watch out for it.
Task ID 421414504 finished in normal time.
Task ID 421414503 had started at exactly the same time as the one that finished, except it had only completed 20% by the time the one above finished. It also was using (and still is) 270MB of memory. That figure has slowly risen the whole time it has run - not fast, but steadily (it still rises at about 0.5MB per minute, with no wild fluctuation barring the odd 100KB or so, just a steady, inexorable rise). Memory leak? An overused phrase, but not impossible. The one that went through OK (421414504) was using 63MB of memory when it finished. The replacement task that has started began using 43MB of memory; too early to say whether that's a bad one as well.
Good luck on the hunt... fingers crossed you nail it tomorrow with the Pentathlon coming up.
EDIT: Just seen your post above... it's not Pentathlon-related as such; when that starts, the longer the WU the better for me - less messing around. The short ones were selected only because the option was there and I wanted to do some quick ones to check all was well before the event starts tomorrow night, my not being used to Rosetta. Point noted, I will change it to the default 3 hours for now. I can start a thread re the Pentathlon if it helps you, but I'm not knowledgeable enough yet on Rosetta to comment or set it up properly. I'll give it a whirl if you want me to... ??
Regards Zy |
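As a rough way to quantify a steady rise like the ~0.5MB per minute reported above, one could sample the task process's resident memory at fixed intervals. This is a minimal sketch using the third-party psutil package; the process ID is a placeholder you would look up in BOINC Manager or the OS task list, and nothing here is part of Rosetta itself.

```python
import time
import psutil  # third-party: pip install psutil

def watch_memory(pid, samples=10, interval_secs=60):
    """Print RSS once per interval and estimate the average growth rate."""
    proc = psutil.Process(pid)
    readings = []
    for _ in range(samples):
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        readings.append(rss_mb)
        print(f"RSS: {rss_mb:.1f} MB")
        time.sleep(interval_secs)
    growth_per_min = (readings[-1] - readings[0]) / ((samples - 1) * interval_secs / 60)
    print(f"Average growth: {growth_per_min:.2f} MB/min")
```

Memory that climbs within a model but drops back once the model completes matches the normal pattern described in the next post; a slope that never resets would be the signature of a genuine leak.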
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Your observations are consistent with what I was saying: different tasks run rather differently. They have different memory requirements, they need different amounts of CPU time to complete a model, and they are attempting different approaches to solving the problem so that the "better" approach can be revealed. As for memory, yes, as a given model progresses it often will gradually use more and more memory. Once the model is completed, the memory is released and, if the runtime preference permits, another model is begun... and then from that new local low in memory usage it will gradually use more and more as the model progresses. As for creating a new thread, what I was suggesting was to create a thread asking what traits you'd like to optimize or minimize for the Pentathlon and see what suggestions others may have for you. Rosetta Moderator: Mod.Sense |
Zydor Send message Joined: 4 May 11 Posts: 7 Credit: 12,648 RAC: 0 |
Re Thread - Okie Doke, will do Regards Zy |