Message boards : Number crunching : Stuck on uploading is a new problem?
Author | Message |
---|---|
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Sure thought that I had seen this behavior before, but no mention of "uploading" anywhere here? At least the search reports there is no such comment? Anyway, my client has a completed unit from the 160122cc... project that has been stuck in the "uploading" status for a couple of days now. Just now I saw another unit from the same project get completed and uploaded successfully, which shows enough of the servers are running properly (though the server status shows most of them are down again). The problem work unit shows on the Transfers tab with the status "Upload: retry in..." Clicking on the "Retry Now" results in a few seconds of retrying, and then it goes back to that waiting-to-retry status. The deadline of the stuck unit is the 17th, so maybe it will get unstuck before that or the work will just get discarded when the deadline arrives. However, as mentioned, it's already been stuck for a couple of days. As regards the old problems with bad scheduling and wasted bandwidth, mostly I quit worrying about it. There were various complicated and tedious suggestions offered. I think most of them were well intentioned and even sincere, but some of them are just wildly guessing and mostly I just don't want to be bothered. At this point I mostly don't care, but I will add the minor observation that it seems the BOINC client for Macs works "properly". At least it always seems to start the units based on the correct deadlines and (in contrast to the Windows and Linux clients) I've never noticed it in obvious trouble with downloaded units that cannot be completed. My usage pattern for the Mac is most similar to one of the Windows machines, so I don't see that as a cause. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Are you running the same version of BOINC Manager on the Windows and Mac you refer to? The projects and work units have no control over which is dispatched next. BOINC Manager controls the workflow and dispatching of CPU resources. Rosetta Moderator: Mod.Sense |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Mod.Sense wrote: "Are you running the same version of BOINC Manager on the Windows and Mac you refer to? The projects and work units have no control over which is dispatched next. BOINC Manager controls the workflow and dispatching of CPU resources." Sorry, I don't want to spend a lot of time beyond trying to make sure everything is running the latest version of everything. Perhaps more to the point, I'm not sure about the point of your question, since each platform has to have some platform-dependent code. Since you seem to be asking about the "by the way" part, perhaps I should clarify that what seems to be going on is that the Mac always picks work based on earliest deadlines, whereas the Windows and Linux clients sometimes pick units with much later deadlines. (On all platforms, there are short-deadline units that get jumped to the front, causing other work to be suspended, sometimes a bit awkwardly, but that's just the checkpointing problem.) Anyway, on the original question, I can add a bit of data. The problem is not specific to the sub-project: I just noticed that another computer has a stuck-on-uploading re12dslf... unit. The deadline for that one is the 18th, to be compared with the deadline of the 17th for the one that's stuck on this computer. Both are Windows machines, but I'm not putting many hours on the Linux boxen these days. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I have one that has been stuck for several hours. I am running Win7 64-bit and BOINC 7.7.2 (x64). failed upload 512 rosetta@home 4/13/2017 9:35:35 PM Started upload of UN-NM_C4Yang_001512_2L8HC4-12_DHR62_0009.pdb_C4Yang_17_04_20_48_34_localDocking_0_SAVE_ALL_OUT_479518_11_0_0 513 4/13/2017 9:35:56 PM Project communication failed: attempting access to reference site 514 rosetta@home 4/13/2017 9:35:56 PM Temporarily failed upload of UN-NM_C4Yang_001512_2L8HC4-12_DHR62_0009.pdb_C4Yang_17_04_20_48_34_localDocking_0_SAVE_ALL_OUT_479518_11_0_0: transient HTTP error 515 rosetta@home 4/13/2017 9:35:56 PM Backing off 03:56:54 on upload of UN-NM_C4Yang_001512_2L8HC4-12_DHR62_0009.pdb_C4Yang_17_04_20_48_34_localDocking_0_SAVE_ALL_OUT_479518_11_0_0 516 4/13/2017 9:35:57 PM Internet access OK - project servers may be temporarily down. |
Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
I also have a task that is stuck uploading. I will try to post more details about it tomorrow if it's still stuck when I wake up. I tried putting back the hosts information in case it's a DNS problem, and also tried flushing the DNS resolver cache and removing any host entries. It doesn't seem to be DNS related. 38 cores crunching for R@H on behalf of cancercomputer.org - a non-profit supporting High Performance Computing in Cancer Research |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
The bad one is still stuck after retrying 17 times. But another Rosetta WU on that machine has since uploaded successfully in the last hour, so it appears that the server is OK. I tried rebooting the PC to try to fix the stuck one, but the BOINC Manager would not reconnect to the client (red dot in icon), and the desktop froze. I had to force a reboot, and then run Windows repair to regain the desktop. I then tried to abort the stuck one, but it would not abort. I will have to try using Task Manager to get rid of it somehow. So clearly it is the work unit itself that is bad. |
Ace Casino Joined: 16 Jul 07 Posts: 18 Credit: 13,858,424 RAC: 9,948 |
I have 2 stuck WUs. I'm running Windows 10 and BOINC 7.6.33, if that's of any help. I tried hitting retry a few times... no luck. Oh well... stuff happens. Happy Crunch'n |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Just got a second one on the first machine. This one is an rb... unit, so now I have two units stuck uploading on one machine and one on another. Three different projects, but I've also seen at least one successful upload of a completed unit from one of the same three projects. Not sure if it's a useful diagnostic, but when told to try again, it seems they fail in two ways. Sometimes nothing goes up, and other times a small packet goes up. Also the return to waiting mode is sometimes faster, but usually it takes a while. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Luigi R. Joined: 7 Feb 14 Posts: 39 Credit: 2,045,527 RAC: 0 |
5 tasks stuck on uploading here. client_state.xml <file> <name>rb_03_23_72525_116778__t000__ab_robetta_IGNORE_THE_REST_474917_815_0_0</name> <nbytes>530178.000000</nbytes> <max_nbytes>25000000.000000</max_nbytes> <md5_cksum>221c7cf702ff15910a96060fed236335</md5_cksum> <status>1</status> <upload_url>http://srv1.bakerlab.org/rosetta_cgi/file_upload_handler</upload_url> <persistent_file_xfer> <num_retries>11</num_retries> <first_request_time>1492170993.617707</first_request_time> <next_request_time>1492253144.919740</next_request_time> <time_so_far>2552.406322</time_so_far> <last_bytes_xferred>32768.000000</last_bytes_xferred> <is_upload>1</is_upload> </persistent_file_xfer> </file> <file> <name>rb_03_23_72525_116778__t000__4_C1_SAVE_ALL_OUT_IGNORE_THE_REST_474917_259_0_0</name> <nbytes>887888.000000</nbytes> <max_nbytes>25000000.000000</max_nbytes> <md5_cksum>33e5d718ef03b6c814fecaa4d08c9b81</md5_cksum> <status>1</status> <upload_url>http://srv4.bakerlab.org/rosetta_cgi/file_upload_handler</upload_url> <persistent_file_xfer> <num_retries>11</num_retries> <first_request_time>1492169234.382923</first_request_time> <next_request_time>1492257568.171602</next_request_time> <time_so_far>2357.694327</time_so_far> <last_bytes_xferred>207.000000</last_bytes_xferred> <is_upload>1</is_upload> </persistent_file_xfer> </file> <file> <name>UN-NM_C4Yang_000006_2L8HC4-12_DHR32_0019.pdb_C4Yang_17_04_20_47_25_localDocking_9_SAVE_ALL_OUT_479492_23_0_0</name> <nbytes>22502.000000</nbytes> <max_nbytes>50000000.000000</max_nbytes> <md5_cksum>db8309e5c372f565885d1075ac2b9683</md5_cksum> <status>1</status> <upload_url>http://srv4.bakerlab.org/rosetta_cgi/file_upload_handler</upload_url> <persistent_file_xfer> <num_retries>8</num_retries> <first_request_time>1492198192.963454</first_request_time> <next_request_time>1492258150.352014</next_request_time> <time_so_far>2485.533146</time_so_far> <last_bytes_xferred>22502.000000</last_bytes_xferred> <is_upload>1</is_upload> </persistent_file_xfer> 
</file> <file> <name>3566f810a5e0096440dc8f17796115d2_eehee_pd1-docking_CancerImmunotherapy_17_04_13_32_36_globalDocking_4_SAVE_ALL_OUT_478149_7_0_0</name> <nbytes>99159.000000</nbytes> <max_nbytes>50000000.000000</max_nbytes> <md5_cksum>a72161808b6852c6bb6f86c8fc85619f</md5_cksum> <status>1</status> <upload_url>http://srv3.bakerlab.org/rosetta_cgi/file_upload_handler</upload_url> <persistent_file_xfer> <num_retries>5</num_retries> <first_request_time>1492214322.948390</first_request_time> <next_request_time>1492248107.442623</next_request_time> <time_so_far>2275.530723</time_so_far> <last_bytes_xferred>32768.000000</last_bytes_xferred> <is_upload>1</is_upload> </persistent_file_xfer> </file> <file> <name>14dslfv5_14re4np_gb_0037_0001_30_0002_SAVE_ALL_OUT_480050_322_0_0</name> <nbytes>337741.000000</nbytes> <max_nbytes>50000000.000000</max_nbytes> <md5_cksum>8b05674e7048a0d3632f82d93a4d9571</md5_cksum> <status>1</status> <upload_url>http://srv1.bakerlab.org/rosetta_cgi/file_upload_handler</upload_url> <persistent_file_xfer> <num_retries>5</num_retries> <first_request_time>1492243267.332962</first_request_time> <next_request_time>1492251204.143131</next_request_time> <time_so_far>1553.738969</time_so_far> <last_bytes_xferred>32768.000000</last_bytes_xferred> <is_upload>1</is_upload> </persistent_file_xfer> </file> |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Third machine now. That's my last Windows 10 box, so all of them have at least one. Still seeing some units go through without any problem. The Mac's okay so far, and I'll run a Linux box next week to see how it's going there. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
This has been a very elusive issue. Our sys admins have been working pretty hard at trying to figure out what is going on. It may be a network issue on the UW side but we are not sure. Still working on trying to figure this out. Thanks for all the feedback/updates. |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
David E K wrote: "This has been a very elusive issue. Our sys admins have been working pretty hard at trying to figure out what is going on. It may be a network issue on the UW side but we are not sure. Still working on trying to figure this out. Thanks for all the feedback/updates." Not sure if this is a helpful clue, but I notice that the "Elapsed" column on the "Transfers" tab does not seem to make any sense. At least not if it is supposed to be related to elapsed time for the current transfer, which is how I've been interpreting it. The values never seem to go to zero now. Clicking on "Retry Now" causes them to start ticking while the client is trying and the transfer is active, but then the counter stops while the transfer is in the "Upload: retry in..." status. One of them is now up to 13:14 and the other is at 4:31, but they don't ever get back to zero, so maybe it's some kind of initialization failure in the transfer of those stuck units, and then it never resets properly, so it isn't really retrying? #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Sid Celery Joined: 11 Feb 08 Posts: 2117 Credit: 41,155,895 RAC: 16,061 |
I was going to say I wasn't seeing this on 3 of my devices (Android, AMD desktop, Intel desktop) but just got home to find one each on my main AMD desktop and Intel laptop. Almost all other tasks go through, but when one sticks it keeps on failing. Could some flag be getting set on these individual tasks? (Guessing) |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Just checking one of my other machines, and I notice that the stuck packets seem to be hanging at the same point. Can you see it on your machine? Either 64K if the results are small, or 0.06 for larger results, but I think that's just rounding from 64K. If that's true, then it seems about 40 packets of the results are uploaded before something goes wrong. One more thought. Anyone running other projects? Is this only a rosetta@home problem or is it some new problem at the BOINC client level? (Seems unlikely since the client software hasn't been upgraded recently.) #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Luigi R. Joined: 7 Feb 14 Posts: 39 Credit: 2,045,527 RAC: 0 |
I'm running LHC@Home, WCG and NumberFields too. No problem for these projects. |
John C MacAlister Joined: 6 Dec 10 Posts: 16 Credit: 944,813 RAC: 0 |
I am running WCG and FAH with no problems. One Rosetta task stuck at 100% upload progress quoting 'Upload retry in 5:10:37' |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
The problem is not just limited to Windows. I now have one stuck on my Ubuntu machine. Stuck Ubuntu upload 2159 rosetta@home 4/16/2017 12:57:28 PM Temporarily failed upload of rb_03_29_73500_116896__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_477229_82_0_0: transient HTTP error 2160 rosetta@home 4/16/2017 12:57:28 PM Backing off 00:02:38 on upload of rb_03_29_73500_116896__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_477229_82_0_0 2161 4/16/2017 12:57:31 PM Internet access OK - project servers may be temporarily down. |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Hmm... Interesting, but at least no reports for the Mac client. Pretty sure that code is significantly different from the other BOINC clients. My own Mac continues to run without stuck-unit problems. Minor data point: The 64K thing has changed. Now most of the stuck units seem to upload slightly different amounts of data before they freeze. Here are the four results I currently have stuck on this machine: 0.00/1.62 MB 0.21/511.73 KB 0.22/219.68 KB 0.20/407.93 KB (The last two are rb... and the other two on this machine are different ones. The oldest one will hit its deadline today.) Just woke up another machine with a stuck unit. It was stuck at 64K, but after telling it to retry, the new stuck condition is 0.19/496.52 KB for an re12dslf... project. Some kind of input buffer size problem? Maybe the rosetta@home people are trying to increase the buffer sizes for incoming data? Perhaps the problem is somehow related to certain work units requesting smaller buffers than they actually need, and then getting stuck because they can't send the rest of their data? #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Iceshard_ Joined: 3 Dec 16 Posts: 1 Credit: 651,985 RAC: 0 |
Got a stuck upload 4/16/2017 12:34:06 PM | | Project communication failed: attempting access to reference site 4/16/2017 12:34:06 PM | rosetta@home | Temporarily failed upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0: transient HTTP error 4/16/2017 12:34:06 PM | rosetta@home | Backing off 00:02:15 on upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0 4/16/2017 12:34:08 PM | | Internet access OK - project servers may be temporarily down. 4/16/2017 12:36:21 PM | rosetta@home | Started upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0 4/16/2017 12:36:22 PM | rosetta@home | [error] Error reported by file upload server: [8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0] locked by file_upload_handler PID=2741 4/16/2017 12:36:22 PM | rosetta@home | Temporarily failed upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0: transient upload error 4/16/2017 12:36:22 PM | rosetta@home | Backing off 00:06:39 on upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0 |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
shanen wrote: "Hmm... Interesting, but at least no reports for the Mac client. Pretty sure that code is significantly different from the other BOINC clients. My own Mac continues to run without stuck-unit problems." Retried a few minutes later, and the new stuck status is: 0.06/1.62 MB 64.00/511.73 KB 64.00/219.68 KB 64.00/407.93 KB It would seem to mean something, but what? Of course I'm going to go meta on you again... Confidence in the quality of the code has a relationship to confidence in the quality of the results. *sigh* #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
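
For anyone who wants to compare their own stuck transfers with the client_state.xml excerpt Luigi R. posted earlier in the thread, here is a rough Python sketch that pulls the retry bookkeeping out of `<file>` entries. It assumes the fragment uses the same tags as that excerpt (`persistent_file_xfer`, `num_retries`, `last_bytes_xferred`, `next_request_time`) and that the timestamps are Unix epoch seconds, which the values suggest but nobody in the thread confirms. The function name `summarize_stuck_uploads` and the wrapper `<root>` element are made up for illustration; none of this is part of BOINC itself.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# One <file> entry copied from the client_state.xml excerpt in this thread.
SAMPLE = """
<file>
    <name>rb_03_23_72525_116778__t000__4_C1_SAVE_ALL_OUT_IGNORE_THE_REST_474917_259_0_0</name>
    <nbytes>887888.000000</nbytes>
    <max_nbytes>25000000.000000</max_nbytes>
    <md5_cksum>33e5d718ef03b6c814fecaa4d08c9b81</md5_cksum>
    <status>1</status>
    <upload_url>http://srv4.bakerlab.org/rosetta_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>11</num_retries>
        <first_request_time>1492169234.382923</first_request_time>
        <next_request_time>1492257568.171602</next_request_time>
        <time_so_far>2357.694327</time_so_far>
        <last_bytes_xferred>207.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
"""


def summarize_stuck_uploads(fragment):
    """Return one dict per pending upload found in a client_state.xml fragment."""
    # The excerpt is a run of sibling <file> elements, so wrap it in a
    # throwaway root (hypothetical name) to parse it as a single document.
    root = ET.fromstring("<root>" + fragment + "</root>")
    rows = []
    for file_el in root.iter("file"):
        xfer = file_el.find("persistent_file_xfer")
        # Skip files with no pending transfer, and downloads.
        if xfer is None or xfer.findtext("is_upload") != "1":
            continue
        # Assumption: next_request_time is seconds since the Unix epoch.
        next_retry = datetime.fromtimestamp(
            float(xfer.findtext("next_request_time")), tz=timezone.utc
        )
        rows.append({
            "name": file_el.findtext("name"),
            "server": file_el.findtext("upload_url"),
            "retries": int(xfer.findtext("num_retries")),
            "sent_bytes": int(float(xfer.findtext("last_bytes_xferred"))),
            "total_bytes": int(float(file_el.findtext("nbytes"))),
            "next_retry_utc": next_retry,
        })
    return rows


if __name__ == "__main__":
    for row in summarize_stuck_uploads(SAMPLE):
        print(row["retries"], "retries,",
              row["sent_bytes"], "of", row["total_bytes"], "bytes,",
              "next try", row["next_retry_utc"].isoformat())
```

Running it against a copy of the real file (not the live one) would make it easier to see whether the stuck entries share a server or a byte count, which is the pattern several posters were guessing at.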
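
The event-log excerpts posted in this thread can be tallied the same way. The sketch below is only a guess at a useful summary: it collects the "Backing off" intervals per file and notes any "locked by file_upload_handler" errors. The line layout is copied from Iceshard_'s post and is assumed to be one event per line; the regular expressions and the function name `scan_upload_log` are hypothetical and were written against those few lines only, not against any BOINC documentation.

```python
import re
from datetime import timedelta

# Four lines copied from the event log posted in this thread,
# assumed to appear one event per line in the original log.
SAMPLE_LOG = """\
4/16/2017 12:34:06 PM | rosetta@home | Temporarily failed upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0: transient HTTP error
4/16/2017 12:34:06 PM | rosetta@home | Backing off 00:02:15 on upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0
4/16/2017 12:36:22 PM | rosetta@home | [error] Error reported by file upload server: [8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0] locked by file_upload_handler PID=2741
4/16/2017 12:36:22 PM | rosetta@home | Backing off 00:06:39 on upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0
"""

# Patterns inferred from the sample lines above (an assumption, not a spec).
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")
LOCKED_RE = re.compile(r"\[([^\]]+)\] locked by file_upload_handler PID=(\d+)")


def scan_upload_log(text):
    """Return (backoffs, locks): backoff intervals and lock PIDs keyed by file name."""
    backoffs = {}
    locks = {}
    for line in text.splitlines():
        match = BACKOFF_RE.search(line)
        if match:
            hours, minutes, seconds, name = match.groups()
            interval = timedelta(
                hours=int(hours), minutes=int(minutes), seconds=int(seconds)
            )
            backoffs.setdefault(name, []).append(interval)
            continue
        match = LOCKED_RE.search(line)
        if match:
            # Remember the server-side PID reported as holding the file.
            locks[match.group(1)] = int(match.group(2))
    return backoffs, locks


if __name__ == "__main__":
    found_backoffs, found_locks = scan_upload_log(SAMPLE_LOG)
    for name, intervals in found_backoffs.items():
        print(name, [str(i) for i in intervals], "lock PID:", found_locks.get(name))
```

A file whose backoff keeps growing and that also shows a lock error might be worth reporting separately from one that only shows transient HTTP errors, though that distinction is speculation on the basis of a single log.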
©2024 University of Washington
https://www.bakerlab.org