Lots of Stuck Tasks

Profile Jonathan Jeckell

Joined: 17 Dec 05
Posts: 7
Credit: 4,685,188
RAC: 0
Message 79401 - Posted: 12 Jan 2016, 15:52:30 UTC

I'm having about one stuck task per computer per day, sometimes more. One of my machines seems less prone to the problem. This seems to be a little different than the stuck unit description in the FAQ.

The progress bar advances normally up to some random % complete and then stops, and the "time remaining" is replaced with "--". The actual elapsed CPU time reported in the task's properties is much lower than the elapsed time shown in the BOINC client.
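For anyone who wants to check for this symptom without watching the manager, here is a minimal sketch of the comparison described above: take two readings of a task some time apart and flag it if the progress has not moved and the CPU time has grown far more slowly than the wall-clock time. The `TaskSample` type, the `looks_stuck` function, and the 0.5 ratio threshold are all hypothetical names and values chosen for illustration. They are not part of BOINC, and the numbers still have to be read from the task's properties by hand or by some other means.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    """One reading of a task (hypothetical structure, not a BOINC type)."""
    name: str
    elapsed_s: float      # wall-clock time shown by the BOINC client
    cpu_s: float          # CPU time from the task's properties
    fraction_done: float  # progress, 0.0 to 1.0


def looks_stuck(prev: TaskSample, curr: TaskSample,
                min_cpu_ratio: float = 0.5) -> bool:
    """Heuristic only: True if progress did not advance between the two
    samples and CPU time grew much more slowly than wall-clock time.
    The 0.5 threshold is an assumption, not a documented value."""
    wall_delta = curr.elapsed_s - prev.elapsed_s
    if wall_delta <= 0:
        return False  # samples out of order or identical; cannot judge
    cpu_delta = curr.cpu_s - prev.cpu_s
    no_progress = curr.fraction_done <= prev.fraction_done
    low_cpu = (cpu_delta / wall_delta) < min_cpu_ratio
    return no_progress and low_cpu


# Example: one hour apart, same progress, almost no CPU time used.
before = TaskSample("example_task", 1000.0, 950.0, 0.40)
after = TaskSample("example_task", 4600.0, 960.0, 0.40)
print(looks_stuck(before, after))
```

This is only a heuristic. A task that is legitimately waiting (for example, suspended because the computer is in use) would also be flagged, so treat a positive result as a prompt to look at the task, not as proof that it is stuck.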

I'm not sure whether I am having this problem on Mac OS X or Windows 10, but it is happening on all of my Ubuntu Linux boxes, all of which are running kernel v3.19.x. I've set the memory allocation for BOINC to the maximum in BOINC Manager, but I'm not sure whether that has helped.

Any ideas other than aborting them, or letting them waste days' worth of processing time until they die on their own? I've also tried suspending and resuming the task, but I don't think that has ever worked.
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 79404 - Posted: 12 Jan 2016, 19:41:06 UTC - in response to Message 79401.  

Can you tell us which work unit(s) are getting stuck (job names), the computer ID, and any other information you think might help us diagnose the issue further? Thanks!
Profile Jonathan Jeckell

Joined: 17 Dec 05
Posts: 7
Credit: 4,685,188
RAC: 0
Message 79407 - Posted: 13 Jan 2016, 14:42:01 UTC - in response to Message 79404.  





I have only been able to monitor a few of my machines closely enough to know for sure it is happening with them. One of them is 2381976, which is running Ubuntu Linux 3.19.0-30. It's got an i5-3210M CPU and 4GB RAM. I have the RAM allocation maxed out in the BOINC manager and I don't have any other processes running except what normally would kick in with Linux.

It looks like the watchdog caught task IDs 784730752 and 784067042, but I aborted 784718526, 784718525, 784568649, 784013724, 783981597, 783601399, 783601372, etc. I think I am seeing this happen at least once a day on this machine. I am reluctant to post the stderr text from these tasks because it is really long.

I'm also not very conversant with Rosetta error logs, but this seems to keep popping up in the bad tasks:
"No heartbeat from core client for 30 sec - exiting"
"FILE_LOCK::unlock(): close failed.: Bad file descriptor"
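Since the stderr text is too long to post, one option is to count how often these two lines appear in each task's output and share only the counts. The sketch below does that for a block of stderr text pasted into a string. The function name `count_markers` is hypothetical, and the only strings it looks for are the two quoted above; it makes no assumption about the rest of the log format.

```python
# The two messages quoted above, copied exactly.
KNOWN_MARKERS = (
    "No heartbeat from core client for 30 sec - exiting",
    "FILE_LOCK::unlock(): close failed.: Bad file descriptor",
)


def count_markers(stderr_text: str) -> dict:
    """Return how many times each known message occurs in the text."""
    return {marker: stderr_text.count(marker) for marker in KNOWN_MARKERS}


# Example with made-up surrounding text; only the two messages are real.
sample = (
    "some earlier output\n"
    "No heartbeat from core client for 30 sec - exiting\n"
    "FILE_LOCK::unlock(): close failed.: Bad file descriptor\n"
    "some later output\n"
    "No heartbeat from core client for 30 sec - exiting\n"
)
for marker, count in count_markers(sample).items():
    print(f"{count} x {marker}")
```

Posting the counts alongside the task IDs keeps the message short while still showing whether the same errors recur in every bad task.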

I see the same problem, perhaps a little less frequently, on machine 2383125; that one gets rebooted more often, which may be why. It's also running Ubuntu Linux 3.19.0-33, on an i5-3320M with 4GB RAM and the RAM allocation maxed out in the BOINC client. I also have another similarly configured machine, an i5-2450M with 4GB RAM running Ubuntu, that doesn't seem to get stuck work units often, if at all.

I appreciate any thoughts you might have on this. I'd prefer not to have to micromanage these things, but I want to maximize the efficiency of my contributions to this project.
