Message boards : Number crunching : What's with all the errors???
Author | Message |
---|---|
FernValleyIT Send message Joined: 1 Dec 05 Posts: 7 Credit: 84,334 RAC: 0 |
Four machines - approx 2 out of 5 WU's error out. Another 1 out of 5 must be aborted by user to proceed. Less than 50% success rate. Reset project twice. No difference. Have now stopped getting WU's from Docking. Might be the switching back and forth. Will know more Friday. Anyone else? |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 9,592 |
Have you got 'leave applications in memory while suspended' checked? Also what's your switch time between projects? The longer the better... |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 113 |
I can see from my results that I have also seen errors, not many though. They fail after just a few seconds without intervention, and are failed by others as well. Not a big deal for me as I don't have to do anything, and the size of the download is not a problem, but could be for others. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
FernValleyIT Send message Joined: 1 Dec 05 Posts: 7 Credit: 84,334 RAC: 0 |
<< Have you got 'leave applications in memory while suspended' checked? Also what's your switch time between projects? The longer the better... >> ...did not have the memory thing checked. Switching set to 1 hour. Never had an issue. Not sure if they're related, but this behavior began after the holiday downtime, then ran okay for a couple weeks, and now has begun again. I'll give it some more time with the new prefs. Thanks! |
FernValleyIT Send message Joined: 1 Dec 05 Posts: 7 Credit: 84,334 RAC: 0 |
...just wanted to add, this issue does not happen with Docking WU's. Thanks! |
FernValleyIT Send message Joined: 1 Dec 05 Posts: 7 Credit: 84,334 RAC: 0 |
No better. 18 out of the last 33 in error. What a waste. I'm detaching now. I'll watch for 2.06 and maybe come back then. Thanks. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2117 Credit: 41,159,202 RAC: 15,498 |
<< No better. 18 out of the last 33 in error. What a waste. I'm detaching now. I'll watch for 2.06 and maybe come back then. Thanks. >> Before you go, Roger, can you check your Boinc Manager preferences in the Advanced menu and go to the Processor Usage tab? If the "Use at most xx% CPU time" figure is less than 100%, you get errors in WUs characterised by the message "Can't acquire lockfile - exiting" in the WU details. I see these errors in your problem WUs. After you change this, re-boot and try again. If the lockfile problem persists, the zero-byte lockfiles are usually found in your Boinc slots folder. Try to delete them manually. If they refuse to go, close down Boinc Manager, go into Task Manager and end all Boinc processes, then try again. They should go. Re-boot and try one last time. In summary: 100% CPU usage in processor usage preferences, re-boot, ensure lockfiles have truly disappeared, re-boot and try one last time. If it still doesn't work, I'm out of ideas. Hope it works. |
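The manual lockfile cleanup described above could be sketched roughly as follows. This is only an illustration under assumptions: the location of the slots folder varies by installation and must be supplied by the user, and the helper names (`find_zero_byte_lockfiles`, `delete_lockfiles`) and the rule "zero-byte file with `lock` in its name" are hypothetical choices, not part of BOINC. Stop BOINC before deleting anything.

```python
from pathlib import Path


def find_zero_byte_lockfiles(slots_dir, name_hint="lock"):
    """Return zero-byte files under slots_dir whose name contains name_hint.

    Both the folder and the name filter are assumptions supplied by the caller.
    """
    root = Path(slots_dir)
    matches = []
    for path in root.rglob("*"):
        if (
            path.is_file()
            and name_hint in path.name.lower()
            and path.stat().st_size == 0
        ):
            matches.append(path)
    return sorted(matches)


def delete_lockfiles(paths):
    """Try to delete each file and return the ones that could not be removed."""
    stuck = []
    for path in paths:
        try:
            path.unlink()
        except OSError:
            # Most likely a BOINC process still holds the file open.
            stuck.append(path)
    return stuck
```

If `delete_lockfiles` returns a non-empty list, that matches the "they refuse to go" case: end the remaining Boinc processes and run it again.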
FernValleyIT Send message Joined: 1 Dec 05 Posts: 7 Credit: 84,334 RAC: 0 |
<< In summary: 100% CPU usage in processor usage preferences, re-boot, ensure lockfiles have truly disappeared, re-boot and try one last time. >> Thanks Sid. I don't really want to run 100% 24/7. I'm throttled at 50 and would like to stay that way. Hopefully 2.06 will address this. Thanks again! |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2117 Credit: 41,159,202 RAC: 15,498 |
<< Thanks Sid. I don't really want to run 100% 24/7. I'm throttled at 50 and would like to stay that way. Hopefully 2.06 will address this. Thanks again! >> That's ok if that's your intention, but be aware this is not a Rosetta issue, as I understand it, but a Boinc issue. It's just that Rosetta WUs seem to be most susceptible to falling over as a result. Version 2.07 (or whatever) won't solve it. I couldn't hazard a reason why, it just is. A new Boinc version might solve it, but I don't keep up with what they're working on so have no idea if it will. While looking for this solution, previously given, I googled the phrase "can't acquire lockfile" and found reports of the same problem on the Seti project. If you insist on running at less than 100% then the errors will continue because of your choice. Can I ask you just to switch to 100% for one day to confirm whether the problem clears for you? If it does, then you know you're in control of a solution yourself, and if you choose not to go with it then it's your own choice and you already know the outcome. Hopefully this one day will also confirm for you whether running at 100% is the problem you perceive or not. I originally ran at 50% CPU, because I thought Boinc would slow my machine down, but when I switched to 100% to solve this very issue I didn't notice any slowdown at all. |
Snags Send message Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
<< In summary: 100% CPU usage in processor usage preferences, re-boot, ensure lockfiles have truly disappeared, re-boot and try one last time. >> Sid is right that this is a BOINC issue not restricted to Rosetta, but there is something else you can do on multiple processor systems. If you currently allow BOINC to use 50% of CPU time on 100% of your processors, switch these numbers: allow 100% of CPU time but restrict BOINC to half of your cores. I believe this solution has worked for others. Snags |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,135,730 RAC: 4,670 |
<< In summary: 100% CPU usage in processor usage preferences, re-boot, ensure lockfiles have truly disappeared, re-boot and try one last time. >> You do know that the 50% doesn't mean your cpu only uses 50% of itself, right? It just means Boinc only runs 50% of the time available to the cpu. It runs at 100% but only 50% of the time; it does not just use 50% of the cpu at any one moment. It might be better to set it to 100% and then change how Boinc runs on your pc. In Boinc Manager, Advanced, Preferences, you can set Boinc to only run during certain hours of the day, you can set it to only run after the machine has been idle for x amount of time, or you can set it to use less than 100% of the processors, meaning you can set it to only use 1 core of a dual core machine, or 3 cores of a quad core etc. You can do this thru Boinc Manager, and then it is machine specific, or you can do it on the webpage and it will be globally set for all your pc's. I have 16 pc's running right now so I do it thru Boinc Manager on each pc. In the end though it is your choice, as are the projects you choose to crunch for. |
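The on/off behaviour described above can be illustrated with a small model. It is a sketch, not BOINC's actual scheduler: the one-second cycle length is an assumption chosen for readability, and the function names are made up for this example. Times are in whole milliseconds to avoid floating-point rounding.

```python
def is_running(t_ms, cpu_time_pct, period_ms=1000):
    """True if a throttled task is in the 'on' part of its cycle at time t_ms.

    period_ms is an assumed cycle length, not a documented BOINC value.
    """
    on_ms = period_ms * cpu_time_pct // 100
    return (t_ms % period_ms) < on_ms


def load_samples(cpu_time_pct, total_ms, step_ms=250, period_ms=1000):
    """Sample the instantaneous load (0 or 100 percent) every step_ms."""
    return [
        100 if is_running(t, cpu_time_pct, period_ms) else 0
        for t in range(0, total_ms, step_ms)
    ]


samples = load_samples(cpu_time_pct=50, total_ms=2000)
average = sum(samples) / len(samples)
print(samples)   # only the values 0 and 100 ever appear
print(average)   # the long-run average is what the percentage setting controls
```

In this model the core is never at "50%": each sample is either fully busy or fully idle, and only the average over time matches the setting.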
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 9,592 |
I'll 2nd that for netburst (P4) CPUs with HT turned on... Only using the real CPUs is better than 50% of all real and virtual CPUs... |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2117 Credit: 41,159,202 RAC: 15,498 |
<< You do know that the 50% doesn't mean your cpu only uses 50% of itself, right? It just means Boinc only runs 50% of the time available to the cpu. It runs at 100% but only 50% of the time, it does not just use 50% of the cpu at any one moment. >> I run a sidebar gadget on Vista (System Monitor, I think) which shows the CPU usage of each of my 4 cores. When I was running at 50% CPU time it always confused me that every core was alternating 0% then 100% several times a second, every second, every minute, hour, day etc. So yes, you're right I think. Madness, when I think about it now... |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 9,592 |
<< You do know that the 50% doesn't mean your cpu only uses 50% of itself, right? It just means Boinc only runs 50% of the time available to the cpu. It runs at 100% but only 50% of the time, it does not just use 50% of the cpu at any one moment. >> It has to work like that on some time scale. The time-scale could be decreased to something so small that you wouldn't notice the fluctuations in real-time, but the devs might have opted against that because it might cause more cache-swapping and therefore reduce efficiency... I guess the most efficient way to do it is to have BOINC at 100% at all times and reduce the clock-rate to scale. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2117 Credit: 41,159,202 RAC: 15,498 |
Errr, not sure about that. I guess I thought it would run constantly but put a ceiling of 50% on the processor usage directed toward Boinc/Rosetta. It was only when I understood that Boinc runs at low priority, so that everything else that wanted to run went ahead of it, that I was reassured it was going to do what I actually wanted - i.e. not get in the way of any foreground apps I was running. That said, mod.sense rightly pulled me up about the RAM Rosetta uses getting in the way of other apps, but I guess any adjustment to CPU time doesn't improve that side of things. Saying that, I do limit the memory usage on the "disk and memory usage" tab while I'm running. I haven't heard about (nor had) any issues with that setting, though having 8GB RAM makes it a little redundant. I still use it to reduce any slowdown issues from using virtual RAM too much. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
The % of CPU time is basically used for BOINC to assure that it leaves at LEAST as much slack time as required to meet the setting. So if you run at say 70%, it means BOINC will not even try to run for 30% of the time. And during the 70% of the time it is to run, it activates and places its chip in line for CPU, but higher priority tasks may not yield it. And as a result, BOINC may not actually get 70%, but something less than that. It is a rather rudimentary control, but it resolves the heat problems, which were, I believe, its major purpose in life. So, what I'm saying is just that it isn't smart enough to figure out that it only actually got 5 seconds of CPU time during the last 7 seconds. And so it is going to go idle for the next 3 seconds, regardless of that fact. It really only would be noticeable on a busy machine ...which is perhaps why some wish to limit CPU% in the first place. I wasn't sure which machine you were referring to. But it would indeed be more efficient to use 50% of the number of CPUs than to run 50% of the time. The reason being you would have half as many tasks active at any given time, and therefore be using much less memory. Ex: 4 CPUs running half of the CPUs would only begin 2 tasks. CPU should still run just as cool, but you'll only be using memory for 2. If you said to run all CPUs but only 50% of the time, 4 tasks will begin and all 4 will cycle on and off. Rosetta Moderator: Mod.Sense |
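The two points in this post can be put into numbers with a short sketch. It models only the arithmetic described above, not BOINC itself; the function names and the 500 MB per-task memory figure are hypothetical values picked for illustration.

```python
def effective_share(window_s, run_s, cpu_granted_s):
    """Fraction of a window in which BOINC actually received CPU.

    run_s is the time BOINC was allowed to run (e.g. 7 of 10 seconds at 70%),
    cpu_granted_s is what it really got once higher priority tasks are served.
    The idle remainder of the window is taken regardless.
    """
    if not 0 <= cpu_granted_s <= run_s <= window_s:
        raise ValueError("expected 0 <= cpu_granted_s <= run_s <= window_s")
    return cpu_granted_s / window_s


def compare_strategies(n_cpus, mem_per_task_mb):
    """Compare 'half the cores, all the time' with 'all cores, half the time'."""
    half_cores = {
        "tasks_in_memory": n_cpus // 2,
        "memory_mb": (n_cpus // 2) * mem_per_task_mb,
        "average_busy_cores": n_cpus / 2,
    }
    half_time = {
        "tasks_in_memory": n_cpus,
        "memory_mb": n_cpus * mem_per_task_mb,
        "average_busy_cores": n_cpus / 2,
    }
    return half_cores, half_time


# 70% setting, but only 5 of the 7 allowed seconds were actually granted.
print(effective_share(window_s=10, run_s=7, cpu_granted_s=5))

# 4 CPUs with an assumed 500 MB per task.
half_cores, half_time = compare_strategies(n_cpus=4, mem_per_task_mb=500)
print(half_cores)
print(half_time)
```

Both strategies keep the same average number of busy cores, so heat should be similar, but the half-the-cores option holds only half as many tasks in memory.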
FernValleyIT Send message Joined: 1 Dec 05 Posts: 7 Credit: 84,334 RAC: 0 |
<< The % of CPU time is basically... >> Well put. I was earlier going to comment about the memory thing. Load up a 4-core machine with 12-20 hour Docking WU's and you've used up 4GB memory. That's a major hit if you start swapping to hard drive. Not so bad with the smaller Rosetta WU's. The heat thing too is why I wanted to throttle back. So I'm back and put the number of processors at 50% working 100% of the time. Believe it or not, my core-2 spreads a nice smooth 30-70% across both cores, not just one. The 4-core machines use only 2 cores (one off of each processor) at about 80% each. Much smoother, better performance, and no errors yet. Thanks! |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,135,730 RAC: 4,670 |
<< The % of CPU time is basically... >> I almost hate to do this but.....you can't tell Boinc WHICH of the 2 processors to use, it uses any 2 it wants to, but does respect your 50% of the total. Now if you have HT that is a different story, but for 4 real live processors, Boinc will pick which 2 it uses. As for practical use, you should not notice any difference at all, you will now have 2 processors for your exclusive usage no matter what. If you needed to use only processors 1 and 3 for your own program for example, then you probably wouldn't be running Boinc anyway. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2117 Credit: 41,159,202 RAC: 15,498 |
<< The % of CPU time is basically... >> I've just taken a look at each of your machines and they all have half a dozen or more completed WUs, all completed and validated successfully. Excellent news and a good discussion thread for this issue. At one time I had this problem persist for months (literally) with no known solution. Now we've hit it quickly and, hopefully, have a happier user, rather than the discontented ones in the past. Good job, guys. |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,135,730 RAC: 4,670 |
<< The % of CPU time is basically... >> Alright! I love it when in the end the User is happy and crunching! |
©2024 University of Washington
https://www.bakerlab.org