Stalled downloads

Message boards : Number crunching : Stalled downloads

adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 34
Message 92410 - Posted: 27 Mar 2020, 19:10:22 UTC

Okay, good - I've not had that stall for a while, but it is always good to know that an actual fault was found.

Peter: closing word before you get the feeling that all is perfect here... we live about 2km from the town centre, and cycle to/from. On a bike, you are going slow enough to veer around potholes, but slow enough also that you notice how many there are. Last summer, all the equipment arrived and we thought "at last". Next day the equipment was gone, the potholes were still there, but the pavements were bright shiny and new!!!
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 92410 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,845,183
RAC: 9,025
Message 92412 - Posted: 27 Mar 2020, 19:23:40 UTC - in response to Message 92410.  
Last modified: 27 Mar 2020, 19:24:33 UTC

Okay, good - I've not had that stall for a while, but it is always good to know that an actual fault was found.


Mine have also severely reduced. I got one a day ago on one of my four computers, but I'm seeing one a week or two instead of a few a day.

Peter: closing word before you get the feeling that all is perfect here... we live about 2km from the town centre, and cycle to/from. On a bike, you are going slow enough to veer around potholes, but slow enough also that you notice how many there are. Last summer, all the equipment arrived and we thought "at last". Next day the equipment was gone, the potholes were still there, but the pavements were bright shiny and new!!!


Yip, round here they seem to retarmac things to some weird formula, which certainly has nothing to do with the actual condition of the road. At least on a bike you can use the pavement or the road. I often go on the pavement to avoid the speedbumps, those are deadly to cyclists. If I didn't have full suspension I could imagine going over the handlebars quite easily. Yet another piece of health and softy shooting itself in the foot.
ID: 92412 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,249,734
RAC: 9,368
Message 92429 - Posted: 28 Mar 2020, 0:03:44 UTC - in response to Message 92407.  

Thanks all for the specific details. Even firefox hangs trying to download the file.
I have shared the details with the Project Team and they were able to recreate the hang problem. So, hopefully they are on their way to tracking down the issue.

Blimey! Hope!

Maybe this host will finally shift after 4 idle but unattended days
ID: 92429 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,845,183
RAC: 9,025
Message 92431 - Posted: 28 Mar 2020, 0:11:05 UTC - in response to Message 92429.  

Thanks all for the specific details. Even firefox hangs trying to download the file.
I have shared the details with the Project Team and they were able to recreate the hang problem. So, hopefully they are on their way to tracking down the issue.

Blimey! Hope!

Maybe this host will finally shift after 4 idle but unattended days


So if you don't cancel the stuck download, does it sit there forever? I assumed Boinc would be clever enough to abort at some point.... I've seen mine count up to 10 retries, but I've always stopped it at that point or earlier.
ID: 92431 · Rating: 0
Bryn Mawr
Joined: 26 Dec 18
Posts: 393
Credit: 12,114,842
RAC: 4,200
Message 92445 - Posted: 28 Mar 2020, 12:06:16 UTC - in response to Message 92431.  


So if you don't cancel the stuck download, does it sit there forever? I assumed Boinc would be clever enough to abort at some point.... I've seen mine count up to 10 retries, but I've always stopped it at that point or earlier.


They cancel when the WU reaches its deadline at which time my remote machine switches to running only Rosetta for the next three days until the next stalled download :-)
ID: 92445 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92454 - Posted: 28 Mar 2020, 13:57:42 UTC
Last modified: 28 Mar 2020, 13:58:33 UTC

Stalled downloads issue should now be resolved!

They found the problem with the stalled downloads: firewalls and switches elsewhere in the university network were detecting byte sequences in the zip files that tripped a malware filter. Any currently hung downloads should complete on their next retry. Future downloads should not run into the problem.
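
As a rough illustration of how a byte-sequence filter can trip on a harmless file, here is a minimal Python sketch. The `SIGNATURES` table, the `find_signature` function and the two-byte pattern are all made up for the example; nothing here reflects the filter actually used on the university network.

```python
from typing import Dict, Optional

# Hypothetical signature table. Real products use longer patterns and more context.
SIGNATURES: Dict[str, bytes] = {"demo-sig": b"\x4d\x5a"}


def find_signature(payload: bytes, signatures: Dict[str, bytes]) -> Optional[str]:
    """Return the name of the first signature found in the payload, else None."""
    for name, pattern in signatures.items():
        if pattern in payload:
            return name
    return None


# A harmless payload that happens to contain the two bytes of the made-up pattern.
clean_but_unlucky = b"PK\x03\x04" + b"compressed-data" + b"\x4d\x5a" + b"more-data"
print(find_signature(clean_but_unlucky, SIGNATURES))
```

Compressed data looks close to random, so the shorter the pattern, the more likely it turns up by chance in a perfectly clean zip file.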

Thanks to all for your reports and tracking down some specific example files.

Please keep an eye on it and please report if you still see any further problems with downloads hanging.
Rosetta Moderator: Mod.Sense
ID: 92454 · Rating: 0
Bryn Mawr
Joined: 26 Dec 18
Posts: 393
Credit: 12,114,842
RAC: 4,200
Message 92463 - Posted: 28 Mar 2020, 15:05:15 UTC - in response to Message 92454.  

Stalled downloads issue should now be resolved!

They found the problem with the stalled downloads: firewalls and switches elsewhere in the university network were detecting byte sequences in the zip files that tripped a malware filter. Any currently hung downloads should complete on their next retry. Future downloads should not run into the problem.

Thanks to all for your reports and tracking down some specific example files.

Please keep an eye on it and please report if you still see any further problems with downloads hanging.


Glory hallelujah :-)

Thank you.
ID: 92463 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,845,183
RAC: 9,025
Message 92466 - Posted: 28 Mar 2020, 16:14:59 UTC - in response to Message 92454.  

Stalled downloads issue should now be resolved!

They found the problem with the stalled downloads: firewalls and switches elsewhere in the university network were detecting byte sequences in the zip files that tripped a malware filter. Any currently hung downloads should complete on their next retry. Future downloads should not run into the problem.

Thanks to all for your reports and tracking down some specific example files.

Please keep an eye on it and please report if you still see any further problems with downloads hanging.


Wow, when I worked at a University they had bugger all protection. The Nimda virus infected every machine within an hour.

But.... those firewalls detecting the problem, did they not flag it to someone who would then have investigated it?
ID: 92466 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,845,183
RAC: 9,025
Message 92467 - Posted: 28 Mar 2020, 16:15:50 UTC - in response to Message 92445.  


So if you don't cancel the stuck download, does it sit there forever? I assumed Boinc would be clever enough to abort at some point.... I've seen mine count up to 10 retries, but I've always stopped it at that point or earlier.


They cancel when the WU reaches its deadline at which time my remote machine switches to running only Rosetta for the next three days until the next stalled download :-)


Oh my god, Boinc needs severely reprogramming. If it can't download something, it should try something else, not just sit there.
ID: 92467 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 34
Message 92479 - Posted: 28 Mar 2020, 18:03:21 UTC

>>> If it can't download something, it should try something else, not just sit there.

+1
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 92479 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,845,183
RAC: 9,025
Message 92482 - Posted: 28 Mar 2020, 18:14:01 UTC - in response to Message 92479.  

>>> If it can't download something, it should try something else, not just sit there.

+1


I've spoken about it on the Boinc forums, but they show no interest and just blame Rosetta. Boinc needs to be made more resilient.
ID: 92482 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 34
Message 92485 - Posted: 28 Mar 2020, 20:00:05 UTC
Last modified: 28 Mar 2020, 20:01:53 UTC

What I guess it should do is this: work unit zyyxzyzx has a download issue, okay, leave that one pending and start zyyxzyzx+1. If that goes through, fine, the issue is with the old one, so cancel it. If it sticks on the same file, flag that to the staff, and stop sending jobs with that file. It should be robust enough to see and act on eventualities such as this.
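
A minimal Python sketch of that idea, under stated assumptions: `plan_downloads`, `download_ok` and `flag_threshold` are hypothetical names, not anything in BOINC, and the sketch cancels a stuck work unit as soon as it moves on rather than leaving it pending first.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def plan_downloads(
    work_units: Dict[str, List[str]],
    download_ok: Callable[[str], bool],
    flag_threshold: int = 2,
) -> Tuple[List[str], List[str], List[str]]:
    """Walk work units in order, skipping past stuck ones.

    work_units maps a work unit name to the files it needs.
    download_ok stands in for the real transfer and returns False on a stall.
    Returns (runnable, cancelled, flagged_files).
    """
    failures: Counter = Counter()
    runnable: List[str] = []
    cancelled: List[str] = []
    flagged: set = set()
    for wu, files in work_units.items():
        # Do not even try a work unit that needs a file already flagged as bad.
        if any(f in flagged for f in files):
            cancelled.append(wu)
            continue
        stuck = [f for f in files if not download_ok(f)]
        if not stuck:
            runnable.append(wu)
            continue
        cancelled.append(wu)
        for f in stuck:
            failures[f] += 1
            # The same file stalling repeatedly is what gets reported to staff.
            if failures[f] >= flag_threshold:
                flagged.add(f)
    return runnable, cancelled, sorted(flagged)


queue = {
    "wu1": ["a.zip", "bad.zip"],
    "wu2": ["b.zip", "bad.zip"],
    "wu3": ["bad.zip"],
    "wu4": ["c.zip"],
}
print(plan_downloads(queue, lambda name: name != "bad.zip"))
```

In this example the third work unit is never attempted, because the file it needs has already stalled twice and been flagged.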
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 92485 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,249,734
RAC: 9,368
Message 92515 - Posted: 29 Mar 2020, 10:22:19 UTC - in response to Message 92454.  

Stalled downloads issue should now be resolved!

They found the problem with the stalled downloads: firewalls and switches elsewhere in the university network were detecting byte sequences in the zip files that tripped a malware filter. Any currently hung downloads should complete on their next retry. Future downloads should not run into the problem.

Thanks to all for your reports and tracking down some specific example files.

Please keep an eye on it and please report if you still see any further problems with downloads hanging.

I did notice this - that it went through rather than aborting and ran to completion. I was very surprised, but grateful I didn't need a 200+ round trip or suffer 3 days more of running the wrong project.
New tasks coming down too, running and reporting, though it seems WCG had to clear down a lot of tasks for the bulk of the following 24hrs
ID: 92515 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,249,734
RAC: 9,368
Message 92519 - Posted: 29 Mar 2020, 10:46:27 UTC - in response to Message 92467.  

So if you don't cancel the stuck download, does it sit there forever? I assumed Boinc would be clever enough to abort at some point.... I've seen mine count up to 10 retries, but I've always stopped it at that point or earlier.

They cancel when the WU reaches its deadline at which time my remote machine switches to running only Rosetta for the next three days until the next stalled download :-)

Oh my god, Boinc needs severely reprogramming. If it can't download something, it should try something else, not just sit there.

The issue isn't with running tasks - those continue unhindered - it's that when downloads are stalled it blocks downloading anything else for that whole project until it succeeds. In my case, that was for 5+ days, and it would've continued until the deadline passed (8 days). The task buffer gets exhausted and any backup project is allowed to download unhindered and then run because they're the only tasks available.

When I had this issue at an attended machine and aborted the stuck files, I'd often find that the next half dozen tasks were of the same batch and would also stall. It's very easy to imagine circumstances where several tasks are stuck and seem to fill the buffer, which then prevents backup project tasks coming down due to the expectation of having enough work, so you end up having nothing to run at all, which might conceivably be worse.

So maybe allow repeated attempts to download for a maximum 24hrs, else abort? But could be a loss of connectivity at the user end, or at the project end?

I don't know the answer. Above my pay-grade
ID: 92519 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92542 - Posted: 29 Mar 2020, 16:20:10 UTC - in response to Message 92519.  

So maybe allow repeated attempts to download for a maximum 24hrs, else abort? But could be a loss of connectivity at the user end, or at the project end?


Yes, some means for BOINC to give up is required. Perhaps it should wait even longer if the file is "particularly large" (intentionally left for others to define). The other issue is that in the case here at R@h the problems seemed to be with small files, but the small file was required to run a given WU. That WU might have had many GBs of other required files that came down ok. So the actual size of the particular file having problems is not really the only consideration. If you abort the WU, all of the WU files are being lost (unless other WUs your machine has on-board use them as well).

I should think the BOINC Manager can see the difference between a lack of connectivity being the cause, and a dropped frame. So that info. could be used. If there was a loss of connectivity, then perhaps you double the maximum timeout.
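
To make the timeout idea concrete, here is a small Python sketch. It is only an assumption of how such a rule might look: `give_up_after` is a hypothetical function, the 24-hour base comes from the suggestion quoted above, and the 1 GB value for "particularly large" is a placeholder, since that was intentionally left undefined.

```python
from datetime import timedelta


def give_up_after(
    file_size_bytes: int,
    connectivity_lost: bool,
    base: timedelta = timedelta(hours=24),
    large_file_bytes: int = 1_000_000_000,  # placeholder for "particularly large"
) -> timedelta:
    """Return how long to keep retrying a download before giving up."""
    limit = base
    if file_size_bytes >= large_file_bytes:
        limit *= 2  # wait longer for a large file
    if connectivity_lost:
        limit *= 2  # double the maximum timeout after a loss of connectivity
    return limit


print(give_up_after(10_000, connectivity_lost=False))
print(give_up_after(10_000, connectivity_lost=True))
```

As noted above, the size of the one stuck file is not the only consideration, so a real rule would probably also weigh how much of the rest of the WU has already come down.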

...all good discussion for BOINC message boards.
Rosetta Moderator: Mod.Sense
ID: 92542 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,845,183
RAC: 9,025
Message 92561 - Posted: 29 Mar 2020, 19:03:21 UTC - in response to Message 92542.  

So maybe allow repeated attempts to download for a maximum 24hrs, else abort? But could be a loss of connectivity at the user end, or at the project end?


Yes, some means for BOINC to give up is required. Perhaps it should wait even longer if the file is "particularly large" (intentionally left for others to define). The other issue is that in the case here at R@h the problems seemed to be with small files, but the small file was required to run a given WU. That WU might have had many GBs of other required files that came down ok. So the actual size of the particular file having problems is not really the only consideration. If you abort the WU, all of the WU files are being lost (unless other WUs your machine has on-board use them as well).

I should think the BOINC Manager can see the difference between a lack of connectivity being the cause, and a dropped frame. So that info. could be used. If there was a loss of connectivity, then perhaps you double the maximum timeout.


It seems Boinc hasn't heard of a single file being a problem. It can cope with no connectivity, but it assumes that not being able to get one file means the whole project is screwed. I'd change it to try another WU, then if a few get stuck, run another project for a bit and retry periodically. And of course abort after several retries and get different WUs. If I'm phoning a company and get an engaged tone, I don't just keep trying, I try their other number.
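
Roughly what I mean, as a Python sketch. None of this is real BOINC code: `Action`, `next_action` and both thresholds are made up for illustration, with 10 retries picked only because that is as far as I have seen the counter go.

```python
from enum import Enum


class Action(Enum):
    TRY_OTHER_WU = "fetch a different work unit"
    RUN_OTHER_PROJECT = "run another project and retry periodically"
    ABORT = "abort the work unit and get a different one"


def next_action(
    retries: int,
    stuck_work_units: int,
    max_retries: int = 10,  # hypothetical limit
    max_stuck: int = 3,     # hypothetical value for "a few"
) -> Action:
    """Decide what to do about a stalled download.

    retries is the retry count for the stalled work unit;
    stuck_work_units is how many work units are stalled right now, including it.
    """
    if retries >= max_retries:
        return Action.ABORT
    if stuck_work_units >= max_stuck:
        return Action.RUN_OTHER_PROJECT
    return Action.TRY_OTHER_WU


print(next_action(retries=2, stuck_work_units=1).value)
```

The point is that one stuck file only ever escalates step by step, instead of blocking the whole project straight away.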

...all good discussion for BOINC message boards.


I've tried, they don't listen. It's all Rosetta's fault....
ID: 92561 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 34
Message 92562 - Posted: 29 Mar 2020, 19:21:10 UTC
Last modified: 29 Mar 2020, 19:25:19 UTC

Zip the required files for the job, and other files that may be useful for future jobs. The cruncher-end task unzips and does what is needed with the contents. A simple "what to do" text file would probably suffice.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 92562 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,845,183
RAC: 9,025
Message 92563 - Posted: 29 Mar 2020, 19:31:54 UTC - in response to Message 92562.  

Zip the required files for the job, and other files that may be useful for future jobs. The cruncher-end task unzips and does what is needed with the contents. A simple "what to do" text file would probably suffice.


Not sure what you mean here. It's zip files that were getting stuck. And transmitting files we might already have is a waste of server resources.
ID: 92563 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92575 - Posted: 30 Mar 2020, 1:40:28 UTC

The problem has been resolved. Sounds like the piece of the network that thought it saw malware caused the frame to just be dropped. That's probably pretty rare. The servers have been whitelisted on the malware detection software.

Also, if https were used for file downloads, it would probably prevent intermediate nodes from analyzing the packet content. Although I suppose encrypted packets might also trip some detection if the software doesn't ignore them. The project is also looking into using https.
Rosetta Moderator: Mod.Sense
ID: 92575 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,845,183
RAC: 9,025
Message 92659 - Posted: 30 Mar 2020, 21:21:41 UTC - in response to Message 92575.  

The problem has been resolved. Sounds like the piece of the network that thought it saw malware caused the frame to just be dropped. That's probably pretty rare. The servers have been whitelisted on the malware detection software.

Also, if https were used for file downloads, it would probably prevent intermediate nodes from analyzing the packet content. Although I suppose encrypted packets might also trip some detection if the software doesn't ignore them. The project is also looking into using https.


Why were no techs informed by this firewall? When a computer detects a problem, a human should be informed.
ID: 92659 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org