whats with the hanging of computers (related to projects with no active WU)

Profile nasher

Joined: 5 Nov 05
Posts: 98
Credit: 618,288
RAC: 0
Message 5748 - Posted: 9 Dec 2005, 22:55:29 UTC

OK, right now I am still at sea (we pull into port tonight), so I will be able to work on my computers then.

My problem apparently is that since I have had SETI (sharing about 5% of the CPU time), it hasn't sent any results in 4 days, and I am assuming that it's stalled.

Since I get deployed now and then (I'm in the military), is there a way through BOINC or such to ensure this won't happen when I can't get to a computer for a month or so? Or is this just an isolated case? (Yeah, I'm going to suspend SETI for the next underway (Monday morning) so I won't have a problem with that, but I'm worried that some other program may run into a similar problem sometime and cause another hiccup.) Is there a way to prevent this, or any suggestions on ways to prevent this from happening?
I run Rosetta for science and SETI as a conversation piece, since it's easier for me to explain SETI@home and then branch out and explain other DC projects. So no, I'm probably always going to have some time set up for SETI. But I will have the majority running projects like this one, and others that I think may also be worth it. Every now and then I like doing things like running SETI, playing solitaire, and other things with my computer that don't help science.
ID: 5748 · Rating: 0
Profile Tern

Joined: 25 Oct 05
Posts: 576
Credit: 4,695,359
RAC: 13
Message 5757 - Posted: 9 Dec 2005, 23:59:57 UTC - in response to Message 5748.  
Last modified: 10 Dec 2005, 0:01:18 UTC

My problem apparently is that since I have had SETI (sharing about 5% of the CPU time), it hasn't sent any results in 4 days, and I am assuming that it's stalled.


This particular situation _shouldn't_ be possible... but we all know how that goes. JMVII has tweaked the scheduler to cover any imaginable, and most unimaginable, circumstances. And then this comes along. IF your problem is SETI work that's stuck in "downloading" status, JMVII and the Einstein crew are currently discussing whether a solution is even possible and, if so, what it would be. JMVII correctly states that "downloading" work _could_ show up at any second, so it must be included in the "on hand" list to avoid overloading the cache. The Einstein people are correctly stating that this shouldn't allow a CPU to sit idle. Each side agrees with the other; they just differ on the priorities. When they find some middle ground, and figure out how to get there, we'll see a new BOINC version that will at least solve _this_ new case.
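
To make the two positions concrete, here is a minimal sketch in Python. It is NOT the real BOINC scheduler code; `Task`, `on_hand_seconds`, `cpu_is_idle`, `should_fetch` and all the numbers are made up for illustration. It only shows why counting "downloading" work as on hand can leave a CPU idle while the cache still looks full.

```python
from dataclasses import dataclass

@dataclass
class Task:
    project: str
    est_seconds: int
    state: str  # "ready" (can run now) or "downloading" (files not here yet)

def on_hand_seconds(tasks, count_downloading=True):
    """Estimated work in the cache; optionally include stuck downloads."""
    return sum(
        t.est_seconds
        for t in tasks
        if t.state == "ready" or (count_downloading and t.state == "downloading")
    )

def cpu_is_idle(tasks):
    """The CPU has nothing to crunch if no task is actually ready."""
    return not any(t.state == "ready" for t in tasks)

def should_fetch(tasks, cache_seconds, count_downloading=True):
    """Ask for more work only when the cache estimate is below target."""
    return on_hand_seconds(tasks, count_downloading) < cache_seconds

# Hypothetical stalled box: five 10-hour SETI results, all stuck downloading.
stuck = [Task("SETI", 10 * 3600, "downloading") for _ in range(5)]
ONE_DAY = 24 * 3600

print(cpu_is_idle(stuck))                                    # nothing runnable
print(should_fetch(stuck, ONE_DAY))                          # cache looks full
print(should_fetch(stuck, ONE_DAY, count_downloading=False)) # would fetch
```

With the downloads counted, the cache looks full and no new work is requested, even though nothing can run. With them ignored, the client would fetch, and risks being overloaded the moment the downloads finish. That is the trade-off the two sides are weighing.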

Now, advice for you... Best would be if you could find some other cruncher you trust (trust not to put his own account key on your CPUs...) and give him VPN, or even physical, access to check your boxes once in a while when you're gone.

Second possibility would be to increase your cache to just as much as each box can handle - if it's Rosetta 95%, SETI 5%, and a decent CPU speed, that may even be the full 10 days. (Or more... there _is_ a hack...) If you're gone for 6 months, and something breaks the day after you leave, it won't help, but it just _might_ get you past any smaller hiccups.

Third would be to rethink your project priorities and look really hard at running "CPDN-only" when you're going to be away for a while. I love Rosetta, but if you're away from the CPU, that one-year deadline for a ClimatePrediction "sulphur" WU sure looks good. No network access, project down, whatever? No problem. As long as the CPU keeps running, you can come back six weeks later and fix whatever other problem there is, and you'll still just be part way through the result anyway. When you're able to keep an eye on things, unsuspend Rosetta and let it rip.

More complicated version of #3 - set everything up Rosetta 60%, CPDN 35%, SETI 5%. Keep your cache _small_ instead of large. When you're home, suspend CPDN if you want to give Rosetta extra time. (Just don't starve CPDN _too_ much; even with a one year deadline, and even with my overclocked 3700+, I will need around 1000 hours during that year to avoid problems...) When you're away, suspend SETI. If anything goes wrong with Rosetta, you'll lose only the small amount in your cache, but CPDN will keep crunching.
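
For a rough feel of what those shares mean in hours, here is a back-of-the-envelope sketch in Python. It assumes the box runs 24/7, that long-run CPU time follows resource share proportionally, and that a suspended project's share is split among the remaining ones. `hours_per_project` is a made-up helper, not anything from BOINC.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def hours_per_project(shares, suspended=(), uptime_fraction=1.0):
    """Split a year's CPU hours among the non-suspended projects by share."""
    active = {name: s for name, s in shares.items() if name not in suspended}
    total = sum(active.values())
    available = HOURS_PER_YEAR * uptime_fraction
    return {name: available * s / total for name, s in active.items()}

shares = {"Rosetta": 60, "CPDN": 35, "SETI": 5}

home = hours_per_project(shares)                      # everything running
away = hours_per_project(shares, suspended={"SETI"})  # SETI suspended

for label, split in (("home", home), ("away", away)):
    print(label, {name: round(h) for name, h in split.items()})
```

Under those assumptions a 35% share works out to about 3066 hours a year for CPDN. How many hours a CPDN result actually needs depends on the CPU, so treat this as headroom arithmetic only.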

Any way I can figure, every answer that covers long, unmonitored, away periods, with a goal of zero idle CPU time, has to have CPDN in the mix somewhere. There isn't another project that can survive without network contact at _least_ every week or two.

ID: 5757 · Rating: 1
Profile dgnuff

Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 5781 - Posted: 10 Dec 2005, 7:34:37 UTC - in response to Message 5757.  

JMVII correctly states that "downloading" work _could_ show up at any second, so it must be included in the "on hand" list to avoid overloading the cache. The Einstein people are correctly stating that this shouldn't allow a CPU to sit idle. Each side agrees with the other; they just differ on the priorities. When they find some middle ground, and figure out how to get there, we'll see a new BOINC version that will at least solve _this_ new case.


OK, I see both sides of this argument. Giving the scheduler permission to do one thing could help alleviate this. The user would need to understand what they are doing, but suppose the scheduler had the authority to abort WUs before work had started on them.

That'd be a minor annoyance, in that it'd tend to clog up results, but what holes can be punched in the following scenario?

SETI and Rosetta, both have network access, both merrily crunching away.

SETI loses network access, and gets stuck with a ton of "pending" downloads. Eventually the scheduler code gets to the point where Rosetta is running out of work. "Oh well," it says, "I can get work from Rosetta, which isn't happening for SETI. I'll get some Rosetta work so that I can keep crunching something." It then does so.

Two minutes later, SETI un-wedges, and all the pending downloads complete.

"Oh my gosh!" says the scheduler, "I'm overcommitted. I'd better panic." It promptly panics, and starts aborting WUs based on resource share / LTD etc. to get the cache back under control.

The key issue here is that it must be made very clear to the user what the consequences of giving the scheduler that authority will be. And I would add that the default would be for this to be disabled.
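
A minimal sketch of that idea in Python, just to have something to poke holes in. It is not real BOINC code: `WorkUnit`, `trim_cache` and the `allow_abort` flag are all hypothetical names, the numbers are invented, and it uses resource share only (LTD is left out to keep it short).

```python
from dataclasses import dataclass

@dataclass
class WorkUnit:
    name: str
    project: str
    est_seconds: int
    started: bool = False

def trim_cache(queue, cache_seconds, shares, allow_abort=False):
    """Return (kept, aborted). Aborts nothing unless the user opted in,
    and never touches a WU that has already started."""
    if not allow_abort:  # default: feature disabled
        return list(queue), []
    kept, aborted = list(queue), []
    total_share = sum(shares.values())

    def excess(project):
        # How far this project's queued work exceeds its share of the cache.
        queued = sum(w.est_seconds for w in kept if w.project == project)
        return queued - cache_seconds * shares[project] / total_share

    while sum(w.est_seconds for w in kept) > cache_seconds:
        candidates = [w for w in kept if not w.started]
        if not candidates:
            break  # only started work left; nothing we are allowed to abort
        victim = max(candidates, key=lambda w: excess(w.project))
        kept.remove(victim)
        aborted.append(victim)
    return kept, aborted

# Hypothetical state right after SETI "un-wedges".
shares = {"Rosetta": 95, "SETI": 5}
queue = [
    WorkUnit("r1", "Rosetta", 30_000, started=True),
    WorkUnit("r2", "Rosetta", 30_000),
    WorkUnit("r3", "Rosetta", 30_000),
    WorkUnit("s1", "SETI", 10_000),
    WorkUnit("s2", "SETI", 10_000),
    WorkUnit("s3", "SETI", 10_000),
    WorkUnit("s4", "SETI", 10_000),
]

kept_default, aborted_default = trim_cache(queue, 100_000, shares)
kept_optin, aborted_optin = trim_cache(queue, 100_000, shares, allow_abort=True)
print([w.name for w in aborted_default])
print([w.name for w in aborted_optin])
```

In this made-up case the opted-in run aborts unstarted SETI WUs, since SETI is furthest over its 5% share, and stops as soon as the queue fits the cache. With the flag left at its default, nothing is aborted.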
ID: 5781 · Rating: 0
