Message boards : Number crunching : Lots of Stuck Tasks
Author | Message |
---|---|
Jonathan Jeckell, Joined: 17 Dec 05, Posts: 7, Credit: 4,685,188, RAC: 0 |
I'm having about one stuck task per computer per day, sometimes more. One of my machines seems less prone to the problem. This seems to be a little different than the stuck unit description in the FAQ. The progress bar advances normally up to some random % complete and then stops, and the "time remaining" is replaced with "--". The CPU time reported in the task's properties is much lower than the elapsed time shown in the BOINC client. I'm not sure whether I am having this problem on Mac OS X or Windows 10, but it is happening on all of my Ubuntu Linux boxes, all of which are running v3.19.x. I've maxed out the memory allocation to BOINC through BOINC Manager, but I'm not sure whether that has helped. Any ideas, other than aborting the tasks or letting them waste days' worth of processing time until they die on their own? I've also tried suspending and resuming the task, but I don't think that has ever worked. |
David E K (Volunteer moderator, Project administrator, Project developer, Project scientist), Joined: 1 Jul 05, Posts: 1018, Credit: 4,334,829, RAC: 0 |
"I'm having about one stuck task per computer per day, sometimes more. One of my machines seems less prone to the problem. This seems to be a little different than the stuck unit description in the FAQ." Can you tell us which work units (job names) are getting stuck, the computer ID, and any other info you think might help us diagnose the issue further? Thanks! |
Jonathan Jeckell, Joined: 17 Dec 05, Posts: 7, Credit: 4,685,188, RAC: 0 |
I have only been able to monitor a few of my machines closely enough to know for sure that it is happening on them. One of them is 2381976, which is running Ubuntu Linux 3.19.0-30. It has an i5-3210M CPU and 4GB RAM. I have the RAM allocation maxed out in BOINC Manager, and I don't have any other processes running except what Linux normally starts on its own. It looks like the watchdog caught TaskID 784730752 and 784067042, but I aborted 784718526, 784718525, 784568649, 784013724, 783981597, 783601399, 783601372, etc. I think I am seeing this happen at least once a day on this machine. I am reluctant to post the stderr text from these because it is really long. I'm also not very conversant with Rosetta error logs, but these two messages keep popping up in the bad tasks: "No heartbeat from core client for 30 sec - exiting" and "FILE_LOCK::unlock(): close failed.: Bad file descriptor". I have the same problem, though maybe a little less frequently, with machine 2383125, but this one gets rebooted more often, which could be why. It's also running Ubuntu Linux (3.19.0-33), on an i5-3320M with 4GB RAM and the RAM allocation maxed out in the BOINC client. I also have another similarly configured machine, an i5-2450M with 4GB RAM running Ubuntu, that doesn't seem to get stuck work units often, if at all. I appreciate any thoughts you might have on this. I'd prefer not to have to micromanage these things, but I want to maximize the efficiency of my contributions to this project. |
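
The symptom described in this thread (progress no longer advancing while CPU time falls far behind elapsed time) could be turned into a rough check for stuck tasks. The sketch below is only an illustration under stated assumptions: the `TaskSample` fields, the `looks_stuck` helper, and both thresholds are hypothetical, and the values would have to be copied by hand or by a separate script from whatever the BOINC client reports. It is not part of BOINC or Rosetta.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    """One observation of a running task (hypothetical field names)."""
    name: str
    elapsed_s: float      # wall-clock run time shown by the client
    cpu_s: float          # CPU time shown in the task's properties
    fraction_done: float  # progress, from 0.0 to 1.0


def looks_stuck(prev, curr, min_cpu_ratio=0.5, min_window_s=600.0):
    """Return True if, between two samples of the same task, progress did
    not advance and CPU time grew much more slowly than elapsed time.

    min_cpu_ratio and min_window_s are assumed thresholds, not values
    taken from the project.
    """
    if prev.name != curr.name:
        raise ValueError("samples must describe the same task")
    wall = curr.elapsed_s - prev.elapsed_s
    if wall < min_window_s:
        return False  # window too short to judge either way
    cpu = curr.cpu_s - prev.cpu_s
    no_progress = curr.fraction_done <= prev.fraction_done
    low_cpu = (cpu / wall) < min_cpu_ratio
    return no_progress and low_cpu


# Example: an hour passes, progress stays at 40%, only 10 s of CPU is used.
earlier = TaskSample("example_task", 3600.0, 3500.0, 0.40)
later = TaskSample("example_task", 7200.0, 3510.0, 0.40)
print(looks_stuck(earlier, later))
```

Comparing two samples rather than one avoids flagging a task that was merely suspended early in its run; a flagged task would still need a manual look before aborting it.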
©2024 University of Washington
https://www.bakerlab.org