Message boards : Number crunching : minirosetta 2.05
Author | Message |
---|---|
TomaszPawel Send message Joined: 28 Apr 07 Posts: 54 Credit: 2,791,145 RAC: 0 |
|
Craig Dickinson Send message Joined: 7 May 07 Posts: 8 Credit: 1,021,887 RAC: 0 |
Anyone else seeing the following consistent error:- It does not show which server it is connecting to, I tried the url you provided and srv3 as well. Both of these download at 136KB/sec consistently until the download hits 4.57MB then the transfer rate drops to 0KB/sec and after 5 minutes or so it times out the connection. |
Evan Send message Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
This one failed after about 20 seconds lr15clusfa_opt_.1ail.1ail.IGNORE_THE_REST.c.1.24.pdb.pdb.JOB_17559_3 Exit status -1073741819 (0xc0000005) Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000 |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
...even so, the reaction on the client should be to see that it did not receive the entire length of the file, and then try a second request for just the remainder of the file (this is done with an HTTP Range header). I just tried the link from home and it worked fine for me (5.1MB). So that would explain why others are not seeing the same. Is it possible your ISP is limiting the time of each connection or something? Even so, I'm still puzzled why it doesn't sound like it is doing a retry from where it left off. What is the reaction on the client after the connection gets the 4.57MB and then times out? It should schedule a retry on the file until it gets it. And the retry should pick up where the first attempt left off at 4.57MB. You should see this in the advanced view, in the Transfers tab. Perhaps there is something up with Win7? I'd be curious to have a look at a Wireshark trace if you would take the time to gather one. Rosetta Moderator: Mod.Sense |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi jcorn. Either this is an old task or the memory limit hasn't been changed yet, this one had the same problem on the same rig, would you believe! Only ran for 10min this time. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=288907163 igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0 Wed 10 Feb 2010 16:19:03 EST|rosetta@home|Aborting task igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0: exceeded memory limit 910.28MB > 909.78MB Wed 10 Feb 2010 16:19:05 EST|rosetta@home|Output file igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0_0 for task absent <core_client_version>6.2.14</core_client_version> <![CDATA[ <message> Maximum memory exceeded </message> |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
Some lr15clusfa_opt WUs are failing after a few seconds: https://boinc.bakerlab.org/rosetta/result.php?resultid=316845967 https://boinc.bakerlab.org/rosetta/result.php?resultid=316788455 https://boinc.bakerlab.org/rosetta/result.php?resultid=316769175 https://boinc.bakerlab.org/rosetta/result.php?resultid=316754661 https://boinc.bakerlab.org/rosetta/result.php?resultid=316741225 |
Craig Dickinson Send message Joined: 7 May 07 Posts: 8 Credit: 1,021,887 RAC: 0 |
I have the Wireshark trace from re-attaching the Rosetta project to the BOINC client. It shows the original download failure of the graphics file and the first 2 attempts to resume that download. Once again, all other Rosetta files and the first work unit downloaded successfully. Where do you want me to send the Wireshark trace? |
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 25,847,410 RAC: 11,938 |
This is a good idea, but I think the specific WU I mentioned had another problem. It continued to take memory until the maximum available was reached. So maybe it would have taken more RAM if I had more in my PC. By the way, it looks like a typical memory leak, a fairly common error in computer programs. Hi jcorn. Yes, it's old. Hint: the name of the task contains the date when it was scheduled, 2 Feb 2010 in this case. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Credit-wise, this task: https://boinc.bakerlab.org/rosetta/result.php?resultid=315895911 (igfhum_looprefine_placestub2_2dsrI_2WA0_ProteinInterfaceDesign_2Feb2010_17660_334_0) wasn't even worth my CPU time. I claimed 99 credit and got only 3! It ran the full length of time, 15000+ seconds, ran 44 models and generated 2 decoys. Something is wrong with those numbers, especially the granted credit. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=315583449 lr15clusfa_opt_.1dhn.1dhn.IGNORE_THE_REST.c.14.1.pdb.pdb.JOB_17574_1_0 Compute error -177 (0xffffff4f) Got full credit though. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2117 Credit: 41,159,890 RAC: 15,363 |
Credit-wise, this task: https://boinc.bakerlab.org/rosetta/result.php?resultid=315895911 (igfhum_looprefine_placestub2_2dsrI_2WA0_ProteinInterfaceDesign_2Feb2010_17660_334_0) wasn't even worth my CPU time. I claimed 99 credit and got only 3! But the times you were awarded more than claimed credit weren't a problem? Funny how that works. It's an average and you're ahead of average generally. I am too but I thought best not to mention it ;) |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Let's not get testy Sid. It looks like he ran 46 models and got credit for only the last 2. I've asked the Project Team to look into these "double headers" as I call them. Thanks for reporting it Greg. If you have any hints about any rare events that may have occurred on your PC about the time those last two models would have been run, that would be great. Did you happen to power off or shut down BOINC about that time? Rosetta Moderator: Mod.Sense |
Evan Send message Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
It would appear that some of these lr15clusfa.. work units have a problem. lr15clusfa_opt_.2cmx.2cmx.SAVE_ALL_OUT_IGNORE_THE_REST.c.2.28.pdb.pdb.JOB_17759_1_0 The previous one I reported has also failed on its second attempt |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Craig and I arranged exchange of the trace via PMs... Craig, I've reviewed the trace and it seems the BOINC client is requesting a partial transfer from the next server in Rosetta's cluster, just as expected. But your machine never acknowledges receipt of any of the packets. And so the 3rd and 4th try are asking for the same point in the file that the second did. Plenty of time elapses, and the frame is resent about 10 times. But never acknowledged. Eventually the connection is closed. The only times I've seen such fundamental things not work properly are when (bear with me :) there is a fundamental problem. Some examples from my past... bad Ethernet port in the hub. Bad cable. But most commonly there's a router between the client and the server that's in need of a code refresh. That, or perhaps Win7 itself or the driver for your LAN adapter needs an update. Because it sounds like it is happening so consistently at the same point, I tend to think there is a code problem here, because a hardware problem like a bad cable or port would probably be more intermittent. The only other thing I can think of to try is that the downloads started out looking good, so after that first one gets most of the file and fails, shut down BOINC, and shut down the LAN adapter ("disable"), reenable the adapter, and fire up BOINC again. Now the retry will occur with a fresh start on the adapter and perhaps get you over the hump. But ultimately, that is just a workaround. I think the true fix is going to be drivers, Win7, or router/firewall code updates. You might also try disabling the IPv6 support on the adapter. Is anyone aware of any specific TCP fixes for Win7? Notes to self (cuz this little scrap of paper is sure to disappear as soon as anyone asks me anything else about it!): client starts on srv1, retries occur to srv5, srv6, and srv3. Local ports used are 51221, 51229, 51235, 51241. These make good filters, such as: tcp.port == 51235 in the Wireshark display. Rosetta Moderator: Mod.Sense |
mfbabb2 Send message Joined: 10 Oct 08 Posts: 4 Credit: 10,345 RAC: 0 |
What is up with the low credit? Task 316913595, work unit 289066389; sent 10 Feb 2010 15:09:57 UTC; reported 12 Feb 2010 0:36:02 UTC; Over / Success / Done; CPU time 12,371.91 sec; claimed credit 36.80; granted credit 2.13 |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2117 Credit: 41,159,890 RAC: 15,363 |
Let's not get testy Sid. I didn't mean it that way - sorry if that's how it came across. I just recalled Sarel's comment way up the thread that "The critstubs jobs are of the same sort of unevenly crediting trajectories as the *gbnnotyr*. You can have a look at Eva-Maria Strauch's message on the Protein-interface design thread for details" so I'm pretty much ignoring all the vagaries of credit awards against claims. It averages out so we win some, we lose some. Is that not right? If it's not then I can report quite a few too, for what it's worth. Probably of more benefit I should report some compute errors, much the same as reported by others: BOINC client version 6.10.18 for windows_x86_64 Processor: 2 GenuineIntel Intel(R) Core(TM)2 Duo CPU T6600@2.20GHz [Intel64 Family 6 Model 23 Stepping 10] OS: Microsoft Windows 7: Home Premium x64 Edition, (06.01.7600.00) Memory: 4.00 GB physical # cpu_run_time_pref: 28800 Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000 CPU time 20.65453 lr15clusfa_opt_.1ctf.1ctf.IGNORE_THE_REST.c.18.2.pdb.pdb.JOB_17573_10_0 BOINC client version 6.10.18 for windows_x86_64 Processor: AMD Phenom(tm) 9850 Quad-Core Processor [AMD64 Family 16 Model 2 Stepping 3] OS: Microsoft Windows Vista Home Premium x64 Edition, Service Pack 2, 06.00.6002.00) Memory: 8.00 GB physical # cpu_run_time_pref: 28800 Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000 CPU time 14.77329 lr15clusfa_opt_.1scj.1scj.IGNORE_THE_REST.c.2.32.pdb.pdb.JOB_17610_1_0 CPU time 15.2101 lr15clusfa_opt_.1iib.1iib.IGNORE_THE_REST.c.9.2.pdb.pdb.JOB_17588_5_1 CPU time 15.1477 lr15clusfa_opt_.1ttz.1ttz.IGNORE_THE_REST.c.0.27.pdb.pdb.JOB_17619_4_1 CPU time 15.0073 lr15clusfa_opt_.1ail.1ail.IGNORE_THE_REST.c.4.11.pdb.pdb.JOB_17559_8_1 |
Copelco Send message Joined: 11 Feb 10 Posts: 1 Credit: 8,097 RAC: 0 |
I'm a new user running latest version. The first work unit you sent ran fine to about 70% then stopped and dropped off the task list as submitted. Account shows no work units submitted. May be a problem. Thanks, TC |
Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
Now I've also got quite low credit: WU 288293546. I usually need something like 450-650 CPU-seconds for 1Cr; on this WU I got 1Cr/1972sec. |
Admin Send message Joined: 13 Apr 07 Posts: 42 Credit: 260,782 RAC: 0 |
Compute error - exit status 1 lrmixclus_opt_.1hz6.1hz6.SAVE_ALL_OUT_IGNORE_THE_REST.c.20.2.pdb.pdb.JOB_17816_1_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=317250268 ERROR: start_res != middle_res ERROR:: Exit from: ....srcprotocolsmovesKinematicMover.cc line: 132 BOINC:: Error reading and gzipping output datafile: default.out |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
This one failed after just 14 sec. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=289171483 lr15clusfa_opt_.1ail.1ail.SAVE_ALL_OUT_IGNORE_THE_REST.c.4.20.pdb.pdb.JOB_17676_7_0 Fri 12 Feb 2010 21:40:02 EST|rosetta@home|Output file lr15clusfa_opt_.1ail.1ail.SAVE_ALL_OUT_IGNORE_THE_REST.c.4.20.pdb.pdb.JOB_17676_7_0_0 for task absent <core_client_version>6.2.14</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> BOINC:: Worker startup. Starting watchdog... Watchdog active. SIGSEGV: segmentation violation Stack trace (8 frames): [0x96c49b3] [0x96ee888] [0xb7fd1420] [0x80a8721] [0x808fcc1] [0x804985f] [0x974c15c] [0x8048121] Exiting... </stderr_txt> |
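
A note on the resume behaviour Mod.Sense describes in the thread (retrying for just the remainder of the file with an HTTP Range header): the idea can be sketched roughly as below. This is only an illustration, not the BOINC client's actual code. The function names `resume_headers`, `toy_server` and `download` are made up for the example, and the "server" is an in-memory stand-in so the sketch runs without any network access.

```python
def resume_headers(bytes_received):
    """Build request headers; ask only for the missing tail once we hold some data."""
    if bytes_received < 0:
        raise ValueError("bytes_received must be non-negative")
    if bytes_received == 0:
        return {}  # first attempt: plain request for the whole file
    # "bytes=N-" means "everything from offset N to the end of the file"
    return {"Range": "bytes=%d-" % bytes_received}


def toy_server(payload, headers):
    """Hypothetical in-memory server that honours 'bytes=N-' range requests."""
    value = headers.get("Range")
    if value is None:
        return 200, payload  # 200 OK: full body
    start = int(value.split("=", 1)[1].rstrip("-"))
    return 206, payload[start:]  # 206 Partial Content: only the tail


def download(payload, stall_after):
    """Simulate transfers where attempt i dies after stall_after[i] bytes arrive."""
    received = b""
    requests = []
    attempt = 0
    while len(received) < len(payload):
        headers = resume_headers(len(received))
        requests.append(headers.get("Range", "(full file)"))
        _status, body = toy_server(payload, headers)
        # Attempts beyond the stall list are allowed to run to completion.
        limit = stall_after[attempt] if attempt < len(stall_after) else len(body)
        received += body[:limit]
        attempt += 1
    return received, requests


payload = bytes(range(256)) * 4  # 1024-byte stand-in for the graphics file
data, requests = download(payload, stall_after=[457, 100])
print(requests)
```

Each retry starts at the number of bytes already held, so the offset only advances if data from the previous attempt actually arrived. That matches what the trace showed in Craig's case: no packets were acknowledged after the first stall, so the later retries asked for the same point in the file.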
©2024 University of Washington
https://www.bakerlab.org