Problems and Technical Issues with Rosetta@home

Author	Message
Rocco Moretti Joined: 18 May 10 Posts: 66 Credit: 585,745 RAC: 0	Message 70834 - Posted: 31 Jul 2011, 23:53:24 UTC - in response to Message 70824. Oh yeah, once again someone said awhile back that they would be monitoring the boards for discussions like this one. Once again the system fails and no one sees or says anything about it. It's not really an issue about the system failing - it's simply that we don't (currently) have jobs that are ready to run right this minute. The running of the jobs on R@h is only one step in the process - it takes a while to figure out what sorts of jobs will give usable scientific results, to set up the jobs, test them to make sure they won't cause a huge failure rate, and then at the end of the runs to process the results to figure out what the next round should do. Usually we have enough things going on that the computational lull in one project will be covered by the compute phase of a different one. We just happen to have hit a point where none of the currently active projects is in an active compute phase. (And it doesn't help that we're maximally distant from both the previous and next CASP - as you've probably noticed, activity seems to ramp up before [mad rush to finalize improvements], during, and after [post-analysis] CASP.) We're aware that the queue is empty - a message has been sent out on the appropriate internal mailing list. While we want to provide you with work units, we don't want to waste your time with scientifically pointless make-work. It's somewhat trivial to re-run old jobs, but is that worth doing if no one is going to look at the results? I hesitate to say this, as I don't want it to sound like we're chasing you away, but I'd agree with the implicit recommendation stated above to crunch other projects while we have this momentary lull. You can increase your stats on other projects secure in the knowledge that no one will gain on you with Rosetta@home.
With any luck, we'll have new jobs for you early next week. (e.g. "We apologize for the inconvenience - Regular service should resume shortly.") We really do appreciate your efforts. Having access to the computational resources of R@h allows us to do things we couldn't do otherwise. Frankly speaking, I was surprised how quickly and easily R@h handled my recent jobs. I would have monopolized our local computational resources, but R@h crunched through it like it was nothing. It's prompted me to think about possible process improvement experiments that I probably wouldn't have otherwise considered due to the computational cost. (Unfortunately, it's in the very preliminary stages and nowhere near the point where I could actually launch any jobs.)

googloo Joined: 15 Sep 06 Posts: 135 Credit: 23,895,011 RAC: 68	Message 70836 - Posted: 1 Aug 2011, 2:53:22 UTC Please give us a reason for not letting us know sooner. This behavior is rude and insulting to your volunteers. You are not showing any consideration, much less appreciation.

Michael G.R. Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0	Message 70838 - Posted: 1 Aug 2011, 4:59:33 UTC Calm down everybody. We're trying to cure diseases and make science progress. A little patience. The project is doing excellent science, and while communication could be better sometimes, they're allowed to miss a few days here and there. We might not see it on our end, but I'm sure things are pretty hectic on the other side of the screen...

TPCBF Joined: 29 Nov 10 Posts: 111 Credit: 5,930,751 RAC: 27	Message 70839 - Posted: 1 Aug 2011, 6:23:10 UTC - in response to Message 70838. Calm down everybody. We're trying to cure diseases and make science progress. A little patience. The project is doing excellent science, and while communication could be better sometimes, they're allowed to miss a few days here and there. We might not see it on our end, but I'm sure things are pretty hectic on the other side of the screen... Sorry, but basic communication from the side of the researchers/sysadmins is an absolute must... At least for me (and I am sure quite a few others will think similarly) the problem is not that there is an outage of WU's, whether knowingly (like this time) or by "accident" (as was the case the last few times since the change of the year), but that there is an absolute lack of communication from their side. They need the collaboration of the people running the WU's but time and time again, they don't seem to bother to keep those people informed. As I already said, a simple message on the home page or a quick note here in the forum, up front or within reasonable time, is all that it takes... Ralf

HiFiTubeGuy Joined: 12 Jan 10 Posts: 22 Credit: 6,291,999 RAC: 0	Message 70840 - Posted: 1 Aug 2011, 6:49:31 UTC Personally, I don't believe the people at Rosetta owe me anything for the crunching I do (I'm not doing it as a favor to them), nor do I crunch to earn the most credits, or to compete. I volunteer to crunch for the goal of curing / preventing disease, and I believe that the people at Rosetta volunteer their time and energy for the same goal. I think we're all on the same team here. In times of temporary lack of WU's, or other problems with Rosetta, better communication would be nice, but it's no big deal to me if it takes a few days; it's no loss for me. As long as there continues to be SOME communication; and so far there is, even if a little slow. Just my personal feelings.

Sid Celery Joined: 11 Feb 08 Posts: 2488 Credit: 46,551,772 RAC: 3,092	Message 70842 - Posted: 1 Aug 2011, 11:58:06 UTC - in response to Message 70834. While we want to provide you with work units, we don't want to waste your time with scientifically pointless make-work. Quite right. Sorry, but it absolutely doesn't look like there is any appreciation. Get over yourself. Volunteering is 100% about giving and 0% about receiving, even appreciation. Needy much? I switched to 24-hour job runs for Rosetta a few days ago to eke out my remaining jobs & selected a back-up project set at low priority for just these eventualities. The only guarantee is there are no guarantees. All projects have downtime.

muddocktor Joined: 11 May 07 Posts: 17 Credit: 14,543,886 RAC: 0	Message 70843 - Posted: 1 Aug 2011, 13:56:16 UTC That's all fine and dandy, Sid, but taking 5 minutes out of their day to post a message like Rocco Moretti posted before running out of work is not asking a whole lot and it would help keep people informed and they could make plans in advance to set up another project to switch to in case of times such as this. In my situation for example, I am at work on a drilling rig hundreds of miles away from my computers, which I presently have set to run 100% Rosetta. If I knew this was coming up I could have set them up to run another project too, in case something happened with Rosetta (like now). My machines still seem to be crunching work units for now, but when they run out the 5 systems at the house will not be doing any work, just burning my electricity for no return. And that could simply be minimized or eliminated by a little better communication from the Baker Labs. That is what gets people upset, not the point that they ran out of work.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 70844 - Posted: 1 Aug 2011, 14:01:16 UTC - in response to Message 70842. While we want to provide you with work units, we don't want to waste your time with scientifically pointless make-work. Quite right. Sorry, but it absolutely doesn't look like there is any appreciation. Get over yourself. Volunteering is 100% about giving and 0% about receiving, even appreciation. Needy much? I switched to 24-hour job runs for Rosetta a few days ago to eke out my remaining jobs & selected a back-up project set at low priority for just these eventualities. The only guarantee is there are no guarantees. All projects have downtime. For example, Milkyway went down due to their main A/C unit failing and having to get a portable system installed to keep their servers cool. They were completely offline for a few days if not a week I think it was. I run a total of 4 projects including R@H to keep my system busy.

GarageFarm.net Joined: 21 Apr 10 Posts: 19 Credit: 17,915,923 RAC: 0	Message 70845 - Posted: 1 Aug 2011, 17:12:03 UTC Last modified: 1 Aug 2011, 17:16:42 UTC The HDD in one of my computers died last night and I had to replace the whole system, no new numbers for this guy... :) Just wondering, how will you redistribute lost jobs, like those from the crashed disk on one of my machines?

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 70846 - Posted: 1 Aug 2011, 19:27:06 UTC - in response to Message 70844. While we want to provide you with work units, we don't want to waste your time with scientifically pointless make-work. Quite right. Sorry, but it absolutely doesn't look like there is any appreciation. Get over yourself. Volunteering is 100% about giving and 0% about receiving, even appreciation. Needy much? I switched to 24-hour job runs for Rosetta a few days ago to eke out my remaining jobs & selected a back-up project set at low priority for just these eventualities. The only guarantee is there are no guarantees. All projects have downtime. For example, Milkyway went down due to their main A/C unit failing and having to get a portable system installed to keep their servers cool. They were completely offline for a few days if not a week I think it was. I run a total of 4 projects including R@H to keep my system busy. Well now the R@H infection is spreading. Milkyway is offline. Poem is out of work. So just Einstein is sending out work. 1/4 projects?!?!? weird!!

Tex1954 Joined: 3 Apr 11 Posts: 9 Credit: 3,394,752 RAC: 0	Message 70847 - Posted: 1 Aug 2011, 19:51:11 UTC - in response to Message 70846. Well now the R@H infection is spreading. Milkyway is offline. Poem is out of work. So just Einstein is sending out work. 1/4 projects?!?!? weird!! Good Grief, all projects have downtime at some point. Sometimes a "master task" is finishing up, sends out some stray WU's to complete, then it's all processed before the next "master task" generates a gazillion WU's for us. I know I shouldn't be surprised at the impatience of folks since I see it everywhere in life, but can't help it. BOINC-SIMAP has been out of work forever and Orbit is off until they get more funding in who knows how long. Sometimes projects just end! Look at Archived projects on Boincstats!!! But, there never seems to be an end to complaining... and to those that do, I say "Join the other millions of us and do something else!" Patience is required... these folks are researchers... can't rush good research! And complaining about hardware and A/C failures... well hell, complain to the wall so the rest of us don't have to hear about it. I spent the night in a motel room when my apartment A/C went out the other day, big deal! Stuff happens! Get a life and be an adult!! JMHO Tex1954

Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0	Message 70849 - Posted: 1 Aug 2011, 22:13:02 UTC - in response to Message 70845. The HDD in one of my computers died last night and I had to replace the whole system, no new numbers for this guy... :) Just wondering, how will you redistribute lost jobs, like those from the crashed disk on one of my machines? Once your WUs are past their deadline, the same WU is sent to another host.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 70852 - Posted: 2 Aug 2011, 3:03:11 UTC - in response to Message 70847. Well now the R@H infection is spreading. Milkyway is offline. Poem is out of work. So just Einstein is sending out work. 1/4 projects?!?!? weird!! Good Grief, all projects have downtime at some point. Sometimes a "master task" is finishing up, sends out some stray WU's to complete, then it's all processed before the next "master task" generates a gazillion WU's for us. I know I shouldn't be surprised at the impatience of folks since I see it everywhere in life, but can't help it. BOINC-SIMAP has been out of work forever and Orbit is off until they get more funding in who knows how long. Sometimes projects just end! Look at Archived projects on Boincstats!!! But, there never seems to be an end to complaining... and to those that do, I say "Join the other millions of us and do something else!" Patience is required... these folks are researchers... can't rush good research! And complaining about hardware and A/C failures... well hell, complain to the wall so the rest of us don't have to hear about it. I spent the night in a motel room when my apartment A/C went out the other day, big deal! Stuff happens! Get a life and be an adult!! JMHO Tex1954 That was not a complaint, it was an observation, though it could be seen as a complaint. It was not intended that way. Just saying that out of 4 projects 3 of them went dark at the same time. That's just weird.

GarageFarm.net Joined: 21 Apr 10 Posts: 19 Credit: 17,915,923 RAC: 0	Message 70853 - Posted: 2 Aug 2011, 3:37:21 UTC - in response to Message 70852. That was not a complaint, it was an observation, though it could be seen as a complaint. It was not intended that way. Just saying that out of 4 projects 3 of them went dark at the same time. That's just weird. End of the world is coming, possibly. ...rapture it was, right? :D

Sid Celery Joined: 11 Feb 08 Posts: 2488 Credit: 46,551,772 RAC: 3,092	Message 70854 - Posted: 2 Aug 2011, 4:48:09 UTC - in response to Message 70843. That's all fine and dandy, Sid, but taking 5 minutes out of their day to post a message like Rocco Moretti posted before running out of work is not asking a whole lot and it would help keep people informed and they could make plans in advance to set up another project to switch to in case of times such as this. In my situation for example, I am at work on a drilling rig hundreds of miles away from my computers, which I presently have set to run 100% Rosetta. If I knew this was coming up I could have set them up to run another project too, in case something happened with Rosetta (like now). My machines still seem to be crunching work units for now, but when they run out the 5 systems at the house will not be doing any work, just burning my electricity for no return. And that could simply be minimized or eliminated by a little better communication from the Baker Labs. That is what gets people upset, not the point that they ran out of work. I was going to appreciate your position given your circumstances, but then I had a look at how close you came to running out of work. Holy Moly! I think I just found out where all the work was getting sucked to. Is that a full 10 days you've got there on some pretty hefty machines? Maybe you slipped down to your last 7-8 days at worst, with a couple of jobs on one machine timing out? I can't even believe you said anything at all. "All for one and I hope it's me..." would be a fine motto. As it happens some new work has slipped through this morning and I think we're both up to a full complement now. I'll be sticking to 24hr runs until the supply is more assured so I don't hog WUs.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 70857 - Posted: 2 Aug 2011, 12:05:55 UTC - in response to Message 70853. That was not a complaint, it was an observation, though it could be seen as a complaint. It was not intended that way. Just saying that out of 4 projects 3 of them went dark at the same time. That's just weird. End of the world is coming, possibly. ...rapture it was, right? :D mutters something about an improbability drive and a manic depressed robot while having dinner at the restaurant at the end of the universe.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 70858 - Posted: 2 Aug 2011, 12:13:07 UTC As of 12:12 GMT there are 50,304 tasks ready to send!!

googloo Joined: 15 Sep 06 Posts: 135 Credit: 23,895,011 RAC: 68	Message 70859 - Posted: 2 Aug 2011, 17:30:51 UTC Ready to send 3

Brett Collins Joined: 13 Feb 11 Posts: 2 Credit: 147,888 RAC: 0	Message 70862 - Posted: 2 Aug 2011, 20:44:57 UTC I do not understand what is causing these errors:
4.4E+08 4.01E+08 2 Aug 2011 12:03:34 UTC 2 Aug 2011 17:34:27 UTC Over Client error Compute error 0.44 0 ---
4.4E+08 4.01E+08 2 Aug 2011 12:03:34 UTC 2 Aug 2011 17:34:27 UTC Over Client error Compute error 0 0 ---
4.4E+08 4.01E+08 2 Aug 2011 12:03:34 UTC 2 Aug 2011 17:34:27 UTC Over Client error Compute error 0.48 0 ---
4.4E+08 4.01E+08 2 Aug 2011 12:03:34 UTC 2 Aug 2011 17:34:27 UTC Over Client error Downloading 0 0 ---
4.4E+08 4.01E+08 2 Aug 2011 12:03:34 UTC 2 Aug 2011 17:34:27 UTC Over Client error Compute error 0 0 ---
4.4E+08 4.01E+08 2 Aug 2011 12:03:34 UTC 2 Aug 2011 17:34:27 UTC Over Client error Downloading 0 0 ---
4.4E+08 4.01E+08 2 Aug 2011 12:03:34 UTC 2 Aug 2011 17:34:27 UTC Over Client error Compute error 0.36 0 ---
4.4E+08 4.01E+08 2 Aug 2011 12:03:34 UTC 2 Aug 2011 17:34:27 UTC Over Client error Downloading 0 0 ---
4.4E+08 4.01E+08 2 Aug 2011 11:44:35 UTC 2 Aug 2011 17:34:27 UTC Over Client error Compute error 8.28 0.03 ---
4.39E+08 4.01E+08 2 Aug 2011 5:19:23 UTC 2 Aug 2011 7:07:14 UTC Over Client error Compute error 3.96 0.01
Do you have any suggestions?

dango Joined: 22 Dec 08 Posts: 3 Credit: 75,820 RAC: 0	Message 70863 - Posted: 2 Aug 2011, 20:48:34 UTC Join WCG! There is still work ;)