Message boards : Number crunching : Stalled downloads
Author | Message |
---|---|
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 28 |
Okay, good - I've not had that stall for a while, but it is always good to know that an actual fault was found. Peter: closing word before you get the feeling that all is perfect here... we live about 2km from the town centre, and cycle to/from. On a bike, you are going slow enough to veer around potholes, but slow enough also that you notice how many there are. Last summer, all the equipment arrived and we thought "at last". Next day the equipment was gone, the potholes were still there, but the pavements were bright shiny and new!!! Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,845,183 RAC: 9,025 |
Okay, good - I've not had that stall for a while, but it is always good to know that an actual fault was found. Mine have also severely reduced. I got one a day ago on one of my four computers, but I'm seeing one a week or two instead of a few a day. Peter: closing word before you get the feeling that all is perfect here... we live about 2km from the town centre, and cycle to/from. On a bike, you are going slow enough to veer around potholes, but slow enough also that you notice how many there are. Last summer, all the equipment arrived and we thought "at last". Next day the equipment was gone, the potholes were still there, but the pavements were bright shiny and new!!! Yip, round here they seem to retarmac things to some weird formula, which certainly has nothing to do with the actual condition of the road. At least on a bike you can use the pavement or the road. I often go on the pavement to avoid the speedbumps, those are deadly to cyclists. If I didn't have full suspension I could imagine going over the handlebars quite easily. Yet another piece of health and softy shooting itself in the foot. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
Thanks all for the specific details. Even firefox hangs trying to download the file. Blimey! Hope! Maybe this host will finally shift after 4 idle but unattended days |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,845,183 RAC: 9,025 |
Thanks all for the specific details. Even firefox hangs trying to download the file. So if you don't cancel the stuck download, does it sit there forever? I assumed Boinc would be clever enough to abort at some point.... I've seen mine count up to 10 retries, but I've always stopped it at that point or earlier. |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 393 Credit: 12,114,842 RAC: 4,200 |
They cancel when the WU reaches its deadline at which time my remote machine switches to running only Rosetta for the next three days until the next stalled download :-) |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Stalled downloads issue should now be resolved! They found the problem with the stalled downloads: there were firewalls and switches elsewhere in the university network that were detecting byte sequences in the zip files that tripped a malware filter. Any currently hung downloads should complete on their next retry. Future downloads should not run into the problem. Thanks to all for your reports and tracking down some specific example files. Please keep an eye on it and please report if you still see any further problems with downloads hanging. Rosetta Moderator: Mod.Sense |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 393 Credit: 12,114,842 RAC: 4,200 |
Stalled downloads issue should now be resolved! Glory hallelujah :-) Thank you. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,845,183 RAC: 9,025 |
Stalled downloads issue should now be resolved! Wow, when I worked at a University they had bugger all protection. The Nimda virus infected every machine within an hour. But.... those firewalls detecting the problem, did they not flag it to someone who would then have investigated it? |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,845,183 RAC: 9,025 |
Oh my god, Boinc needs severely reprogramming. If it can't download something, it should try something else, not just sit there. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 28 |
>>> If it can't download something, it should try something else, not just sit there. +1 Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,845,183 RAC: 9,025 |
>>> If it can't download something, it should try something else, not just sit there. I've spoken about it on the Boinc forums, but they show no interest and just blame Rosetta. Boinc needs to be made more resilient. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 28 |
What I guess it should do, is work unit zyyxzyzx has a download issue, okay, leave that one pending and start zyyxzyzx+1. It goes through, fine, issue with the old one, cancel it. If it sticks on the same file, flag that to the staff, and stop sending jobs with that file. It should be robust enough to see and act on eventualities such as this. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
Stalled downloads issue should now be resolved! I did notice this - that it went through rather than aborting and ran to completion. I was very surprised, but grateful I didn't need a 200+ round trip or suffer 3 days more of running the wrong project. New tasks coming down too, running and reporting, though it seems WCG had to clear down a lot of tasks for the bulk of the following 24hrs |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
So if you don't cancel the stuck download, does it sit there forever? I assumed Boinc would be clever enough to abort at some point.... I've seen mine count up to 10 retries, but I've always stopped it at that point or earlier. The issue isn't with running tasks - those continue unhindered - it's that when downloads are stalled it blocks downloading anything else for that whole project until it succeeds. In my case, for 5+ days and would've continued until the deadline passed (8 days). The task buffer gets exhausted and any backup project is allowed to download unhindered and then run because they're the only tasks available. When I had this issue at an attended machine and aborted the stuck files, I'd often find that the next half dozen tasks were of the same batch and would also stall. Very easy to imagine circumstances where several tasks are stuck, seem to fill the buffer, then prevent backup project tasks coming down due to the expectation of having enough work and end up having nothing to run at all, which might conceivably be worse. So maybe allow repeated attempts to download for a maximum 24hrs, else abort? But could be a loss of connectivity at the user end, or at the project end? I don't know the answer. Above my pay-grade |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
So maybe allow repeated attempts to download for a maximum 24hrs, else abort? But could be a loss of connectivity at the user end, or at the project end? Yes, some means for BOINC to give up is required. Perhaps it should wait even longer if the file is "particularly large" (intentionally left for others to define). The other issue is that in the case here at R@h the problems seemed to be with small files, but the small file was required to run a given WU. That WU might have had many GBs of other required files that came down ok. So the actual size of the particular file having problems is not really the only consideration. If you abort the WU, all of the WU files are being lost (unless other WUs your machine has on-board use them as well). I should think the BOINC Manager can see the difference between a lack of connectivity being the cause, and a dropped frame. So that info. could be used. If there was a loss of connectivity, then perhaps you double the maximum timeout. ...all good discussion for BOINC message boards. Rosetta Moderator: Mod.Sense |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,845,183 RAC: 9,025 |
So maybe allow repeated attempts to download for a maximum 24hrs, else abort? But could be a loss of connectivity at the user end, or at the project end? It seems Boinc hasn't heard of a single file being a problem. It can cope with no connectivity, but it assumes that not being able to get one file means the whole project is screwed. I'd change it to try another WU, then if a few get stuck, run another project for a bit and retry periodically. And of course abort after several retries and get different WUs. If I'm phoning a company and get an engaged tone, I don't just keep trying, I try their other number. ...all good discussion for BOINC message boards. I've tried, they don't listen. It's all Rosetta's fault.... |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 28 |
Zip the required files for the job, and other files that may be useful for future jobs. Cruncher end task unzips and does what is needed with the contents. A simple "what to do" text file would probably suffice. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,845,183 RAC: 9,025 |
Zip the required files for the job, and other files that may be useful for future jobs. Cruncher end task unzips and does what is needed with the contents. A simple "what to do" text file would probably suffice. Not sure what you mean here. It's zip files that were getting stuck. And transmitting files we might already have is a waste of server resources. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
The problem has been resolved. Sounds like the piece of the network that thought it saw malware caused the frame to just be dropped. That's probably pretty rare. The servers have been whitelisted on the malware detection software. Also, if https were used for file downloads, it would probably prevent intermediate nodes from analyzing the packet content. Although I suppose it might also run into a case where the encrypted packets trip some detection if the software doesn't ignore the encrypted packets. The project is also looking into using https. Rosetta Moderator: Mod.Sense |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,845,183 RAC: 9,025 |
The problem has been resolved. Sounds like the piece of the network that thought it saw malware caused the frame to just be dropped. That's probably pretty rare. The servers have been whitelisted on the malware detection software. Why were no techs informed by this firewall? When a computer detects a problem, a human should be informed. |
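
The give-up rule floated in the thread (retry for at most 24 hours, allow double that if the host itself lost connectivity, and let other downloads proceed while one file is stuck) can be sketched roughly as below. This is only an illustration of the idea as posted, not how the BOINC client behaves or is implemented; `Transfer`, `give_up_limit`, `decide`, `next_to_download` and the 24-hour constant are all made-up names and values for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: the 24-hour cap suggested in the thread, not a BOINC setting.
BASE_LIMIT_HOURS = 24.0


@dataclass
class Transfer:
    """Hypothetical record of one pending file download."""
    name: str
    stalled_hours: float = 0.0       # time spent retrying without success
    connectivity_lost: bool = False  # True if the host itself was offline


def give_up_limit(t: Transfer) -> float:
    """Hours to keep retrying; doubled when the host lost connectivity."""
    return BASE_LIMIT_HOURS * (2 if t.connectivity_lost else 1)


def decide(t: Transfer) -> str:
    """Return 'abort' once the limit is reached, otherwise 'retry'."""
    return "abort" if t.stalled_hours >= give_up_limit(t) else "retry"


def next_to_download(queue: List[Transfer]) -> Optional[Transfer]:
    """Pick the first transfer that has not stalled, so that one stuck
    file does not block everything queued behind it."""
    for t in queue:
        if t.stalled_hours == 0:
            return t
    return None
```

With this rule a file stuck for 25 hours is aborted on a connected host but retried on one that was offline, and a queue of `[stuck.zip, ok.zip]` moves on to `ok.zip`. It deliberately ignores file size and the cost of discarding a work unit's other downloaded files, both of which were raised above as open questions.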
©2024 University of Washington
https://www.bakerlab.org