minirosetta 2.05

Author	Message
Aroundomaha Joined: 11 Sep 08 Posts: 14 Credit: 55,623,619 RAC: 0	Message 64996 - Posted: 15 Jan 2010, 21:46:29 UTC - in response to Message 64951.
For the past two days my Windows 7 machine has been bombing with occasional blue screen of death crashes. I ran the Microsoft debugger and it points to an issue with minirosetta 2.05.

--------- enclosed debug information -----------------
3: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

MULTIPLE_IRP_COMPLETE_REQUESTS (44)
A driver has requested that an IRP be completed (IoCompleteRequest()), but
the packet has already been completed. This is a tough bug to find because
the easiest case, a driver actually attempted to complete its own packet
twice, is generally not what happened. Rather, two separate drivers each
believe that they own the packet, and each attempts to complete it. The
first actually works, and the second fails. Tracking down which drivers
in the system actually did this is difficult, generally because the trails
of the first driver have been covered by the second. However, the driver
stack for the current request can be found by examining the DeviceObject
fields in each of the stack locations.
Arguments:
Arg1: fffffa800afb3320, Address of the IRP
Arg2: 0000000000000eae
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------
IRP_ADDRESS: fffffa800afb3320
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
BUGCHECK_STR: 0x44
PROCESS_NAME: minirosetta_2.
CURRENT_IRQL: 2
LAST_CONTROL_TRANSFER: from fffff8000285fb95 to fffff80002875f00

Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 29,766,470 RAC: 5,707	Message 65002 - Posted: 16 Jan 2010, 3:02:32 UTC - in response to Message 64953. Last modified: 16 Jan 2010, 3:04:02 UTC
Sarel wrote: Hi, I'll be resubmitting the gbnnotyr protein design trajectories to boinc over the next few hours. The tests I ran on ralph showed that the checkpointing issue is resolved. To make sure that there are no other issues, I will submit these trajectories 'slowly' starting with a modest sized batch, and according to the responses I get on the thread I will increase the number of work units over the next few days. Please keep me posted about these problems. Your reports have been invaluable in tracking this problem down! Sarel.

I have finally received enough WUs of this type to check. My result: there are still problems with checkpointing. Unlike in version 2.03, the "CPU time at last checkpoint" information is now displayed correctly, which lets the BOINC client switch between projects, but after a restart the calculation still starts from the beginning.
Here is an example task that I watched: 8gbnnotyr_3gbn_2iug_9Jan2010_16915_7_0
Before the restart it had used 0:33 hours of CPU time with 27 models done; after the restart it ran another 1:27 hours and calculated 72 more models. But apparently only the 72 models calculated after the restart are reflected in the report; the first 27 models are missing, and the task also completed with a Validate error.
Here is another example: 8gbnnotyr_3gbn_1ijt_9Jan2010_16915_1_0
The same result: the report contains only the models calculated after the restart, and there is a Validate error too.
For comparison, here is a task of this type that was computed without breaks: 8gbnnotyr_3gbn_1woj_9Jan2010_16909_12_0
Without interruption, 2 hours of CPU time resulted in 94 models (compare with 72 and 67 in the previous cases for the same 2 hours of CPU time), and Validate state = Valid.
The difference corresponds to roughly 0.5 hours of CPU time, which is about how much time had passed before the restarts.
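The comparison in the post above can be checked with a little arithmetic. The following is only a rough sketch: it assumes the uninterrupted task (94 models in 2 hours) represents a constant model rate, and that each interrupted task ran for the same 2 hours in total. The function name `lost_cpu_hours` is made up for illustration and is not part of BOINC or Rosetta.

```python
def lost_cpu_hours(reference_models, reference_hours, reported_models):
    """Estimate how much CPU time the missing models represent.

    Assumes a constant model rate taken from the uninterrupted reference task
    and the same total CPU time for the interrupted task.
    """
    rate_per_hour = reference_models / reference_hours   # models per CPU hour
    missing_models = reference_models - reported_models  # models absent from the report
    return missing_models / rate_per_hour


# Figures from the post: 94 models uninterrupted, 72 and 67 reported after restarts.
for reported in (72, 67):
    print(f"{reported} models reported: about "
          f"{lost_cpu_hours(94, 2.0, reported):.2f} CPU hours unaccounted for")
```

Both estimates land near the half hour of CPU time spent before the restarts, which fits the observation that the pre-restart work is what goes missing.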

Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 29,766,470 RAC: 5,707	Message 65003 - Posted: 16 Jan 2010, 3:22:05 UTC - in response to Message 64995.
Quoted from Message 64995: Please don't presume that the information from the Project Team is an inaccurate description and that your memory observations are a new and permanent condition for all to enjoy going forward. As Sarel points out, they introduced a new type of work unit which has a new low-memory phase to execution. And so you are only going to see the lower memory usage when that specific type of task is being worked on. And this new type of work unit was introduced in prior versions, so the actual delta to v2.05 is small. Since this new type of work is a current area of review, you may see a high concentration of this type of work for a period of time. But it doesn't mean we can presume more than was stated.

Yes, I was mistaken here. With the new version 2.05, for some time at the beginning I received ONLY the new type of WU, which uses little RAM, and from that I came to a (wrong) conclusion. But now some WUs of the old types are arriving, and for them memory usage is about the same as in version 2.03.

Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0	Message 65011 - Posted: 16 Jan 2010, 23:27:01 UTC
https://boinc.bakerlab.org/rosetta/result.php?resultid=310901552
This one stalled twice at about 5 hrs 35 mins but was running for over 9 hours. I restarted boinc and it then stalled again in the same place.

Mike_Solo Joined: 16 Nov 09 Posts: 2 Credit: 67,261 RAC: 0	Message 65013 - Posted: 17 Jan 2010, 11:06:30 UTC
Soooo... this new version hangs too often. 2.0.3 was much more stable. It hangs on my 2xAthlonMP 2800 as well as on the Intel E8400, so the CPU is not the issue. I think about 15% of tasks get stuck in the middle, consuming >200 MB of RAM but no CPU. I'm thinking of leaving Rosetta for a while until a new version is ready, as I'm tired of kicking off broken tasks every morning :(

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 65015 - Posted: 17 Jan 2010, 11:47:55 UTC
Looks like Mike Solo has 3 machines:
One WinXP using BOINC version 6.10.18
One WinXP using BOINC version 6.10.18
One WinServer 2003 using BOINC version 6.10.18
Rosetta Moderator: Mod.Sense

Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 29,766,470 RAC: 5,707	Message 65020 - Posted: 17 Jan 2010, 18:11:01 UTC Last modified: 17 Jan 2010, 18:12:57 UTC
2 more tasks of type gbnnotyr with the same result: when run without stops everything works normally, but if there was a break during the calculation, the results from before the break disappear and the task ends with a validate error.
In total I have:
2 WUs processed without stops, both of which seem OK:
https://boinc.bakerlab.org/rosetta/result.php?resultid=310752146
https://boinc.bakerlab.org/rosetta/result.php?resultid=311145245
And 3 WUs with a break in processing, all of which completed with validate errors:
https://boinc.bakerlab.org/rosetta/result.php?resultid=310935403
https://boinc.bakerlab.org/rosetta/result.php?resultid=310946429
https://boinc.bakerlab.org/rosetta/result.php?resultid=311163725
P.S. The last of these 3 (id 311163725) was stopped at the very beginning of its run, before the first checkpoint had been written. However, after the restart it still completed with a validate error. So it is possible that validate errors in this type of WU are not directly linked to checkpoints, and that these are 2 different bugs.

Sarel Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0	Message 65021 - Posted: 17 Jan 2010, 19:06:18 UTC - in response to Message 65020.
Thanks! We'll have a look at this as soon as possible and let you know what we find. Best, Sarel.

svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0	Message 65022 - Posted: 17 Jan 2010, 19:40:25 UTC
In the last week I've had to abort 11 tasks on W7 because the tasks are hung consuming 0% CPU time. I was hoping that the combination of upgrading to the latest BOINC and the new 2.05 version of R@h would fix the problem but no: it continues as before. Tasks on Mac OS X seem to be unaffected by this problem. Until there's some indication this problem is fixed I'm not getting any more tasks for W7.

AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0	Message 65023 - Posted: 17 Jan 2010, 21:10:47 UTC
Task: 311103842
Workunit: homopt_nat2.t368_.t368_.IGNORE_THE_REST.S_00003_0000018_07.pdb_00003.pdb.JOB_16835_29
ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file
AdeB

P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0	Message 65024 - Posted: 17 Jan 2010, 21:17:05 UTC Last modified: 17 Jan 2010, 22:15:46 UTC
Here's another Validate error, it didn't seem to have any problems running.
Edit/ This was on 64bit linux.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=283574991
8gbnnotyr_3gbn_1s68_9Jan2010_16915_22_0
# cpu_run_time_pref: 14400
======================================================
DONE :: 37 starting structures 14469.9 cpu seconds
This process generated 37 decoys from 37 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish
Validate error__Done__14,470.06
=========================================================================
Edit/ added this.
This one was on linux 32bit, again didn't seem to have a problem. Very low credits.
8gbnnotyr_3gbn_1opd_9Jan2010_16915_42_0
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=283817716
# cpu_run_time_pref: 14400
======================================================
DONE :: 8 starting structures 12134.6 cpu seconds
This process generated 8 decoys from 8 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish
Success__Done__12,135.35__28.60__4.61
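For anyone comparing several of these task reports, the summary line can be pulled out with a small script. This is a minimal sketch that assumes the `DONE :: N starting structures T cpu seconds` line always has the shape shown in the two logs above; the function name `seconds_per_structure` is illustrative only and not part of BOINC or Rosetta.

```python
import re

# Matches the summary line as it appears in the logs quoted above.
DONE_RE = re.compile(
    r"DONE\s*::\s*(\d+)\s+starting structures\s+([\d.]+)\s+cpu seconds"
)


def seconds_per_structure(log_text):
    """Return average CPU seconds per structure, or None if no summary line is found."""
    match = DONE_RE.search(log_text)
    if match is None:
        return None
    structures = int(match.group(1))
    cpu_seconds = float(match.group(2))
    if structures == 0:
        return None
    return cpu_seconds / structures


log_64bit = "DONE :: 37 starting structures 14469.9 cpu seconds"
log_32bit = "DONE :: 8 starting structures 12134.6 cpu seconds"
print(round(seconds_per_structure(log_64bit), 1))
print(round(seconds_per_structure(log_32bit), 1))
```

On these two logs the second task spent roughly four times as long per structure as the first, which may be relevant to the very low credit it received.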

Admin Joined: 13 Apr 07 Posts: 42 Credit: 260,782 RAC: 0	Message 65025 - Posted: 17 Jan 2010, 23:06:39 UTC
Validate Error on Win7, successfully completed by a wingman on win xp
https://boinc.bakerlab.org/rosetta/result.php?resultid=311128874
name: 8gbnnotyr_3gbn_1iuk_9Jan2010_16915_131_0

Sid Celery Joined: 11 Feb 08 Posts: 2454 Credit: 46,464,996 RAC: 2,192	Message 65026 - Posted: 18 Jan 2010, 1:15:29 UTC Last modified: 18 Jan 2010, 1:17:58 UTC
About time I updated my recent fault lists. I've had several errors under 2.03, but only this under 2.05:
On Intel T5500 laptop running W7 and Boinc 6.10.18
Outcome Validate error
8gbnnotyr_3gbn_2onu_9Jan2010_16909_17_0
# cpu_run_time_pref: 28800
======================================================
DONE :: 345 starting structures 28787.1 cpu seconds
This process generated 345 decoys from 345 attempts
======================================================
Note: On several occasions the following line appears:
No heartbeat from core client for 30 sec - exiting
Edit: Wingman running XP also received a validate error on apparently successful completion.

MVeiga Joined: 15 Oct 07 Posts: 1 Credit: 3,443,193 RAC: 8	Message 65029 - Posted: 18 Jan 2010, 12:24:34 UTC
Hi guys, let me just tell you: if you're using Windows 7, the beta version 6.10.24 or even the new beta 6.10.29 is much more stable. I've used the beta 6.10.24 for a long time and I had no problem at all with Rosetta. For me it's much more stable than 6.10.18, on Windows 7 of course. Anyway, that's just my case.

Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 29,766,470 RAC: 5,707	Message 65031 - Posted: 18 Jan 2010, 13:45:20 UTC - in response to Message 65023.
AdeB wrote: Task: 311103842 Workunit: homopt_nat2.t368_.t368_.IGNORE_THE_REST.S_00003_0000018_07.pdb_00003.pdb.JOB_16835_29 ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file AdeB

I too had the same error in this type of WU: https://boinc.bakerlab.org/rosetta/result.php?resultid=310238605
And so did the 2nd computer processing this WU: https://boinc.bakerlab.org/rosetta/result.php?resultid=310471681
True, that was still version 2.03, so I did not write about it, but above is an example of the same error in version 2.05.

Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 29,766,470 RAC: 5,707	Message 65032 - Posted: 18 Jan 2010, 14:51:56 UTC - in response to Message 65024.
P . P . L . wrote: Here's another Validate error, it didn't seem to have any problems running. Edit/ This was on 64bit linux. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=283574991 8gbnnotyr_3gbn_1s68_9Jan2010_16915_22_0

It seems there is only one problem with that WU: it was restarted too (maybe a switch to another project?) and hit the bug related to that.

P . P . L . wrote: This one was on linux 32bit, again didn't seem to have a problem. Very low credits. 8gbnnotyr_3gbn_1opd_9Jan2010_16915_42_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=283817716

I have such an example too: https://boinc.bakerlab.org/rosetta/result.php?resultid=311202691
Claimed credit = 54.35 vs Granted credit = 1.83 (about 30 times lower)
And I can even tell what exactly happened to it. Usually in this type of WU a model is calculated very quickly, in about one to several minutes per model. This task started the same way: in approximately 15 minutes, 13 models were calculated (~500 steps in each). But on the 14th something happened: the calculation did not stop at the 500th step and went on much longer. I saw the counter pass 40000 steps and did not watch any further (I think it was about 60000-70000 steps in total). I was already thinking of aborting this task, since I thought the calculation had gone into a loop, but after 5 hours (instead of several minutes) the calculation of the 14th model did complete. I.e. 13 models took about 15 minutes, and the 14th about 5 hours. Hence such a small Granted credit, since it is calculated in proportion to the number of models. (If not for this 14th model, about 300 models would have been calculated in 5 hours instead of 14, and Granted credit would be close to Claimed credit.) I think much the same happened in your task...
P.S. Quite probably this is NOT an error but a feature of the algorithm: if it finds something interesting, it probably starts a more detailed calculation of that model. It would be good to have this clarified by the scientists responsible for this type of WU.
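The figures in the post above can be reproduced with a back-of-the-envelope calculation. This is only a sketch of the poster's reasoning: it assumes the early rate (13 models in 15 minutes) would have held for the whole run, and the 5.25-hour total (15 minutes plus about 5 hours) is an assumption derived from the post. It does not describe how the project actually computes granted credit.

```python
def projected_models(models_done, minutes_taken, total_hours):
    """Project how many models would finish in total_hours at the observed early rate."""
    rate_per_hour = models_done / (minutes_taken / 60.0)
    return rate_per_hour * total_hours


claimed_credit = 54.35  # from the task report quoted in the post
granted_credit = 1.83   # from the same report

# Ratio of claimed to granted credit ("about 30 times lower" in the post).
print(round(claimed_credit / granted_credit, 1))

# Models expected at the early rate over the assumed 5.25 hours of total CPU time.
print(round(projected_models(13, 15, 5.25)))
```

The projection comes out somewhat below the "about 300" estimated in the post, so it supports the general explanation while leaving the exact credit formula open.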

Sarel Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0	Message 65034 - Posted: 18 Jan 2010, 18:39:22 UTC
Hello, based on the reports of validator issues, David Kim has now fixed the validator. He also asked me to remind people that credit is granted based on the client's claimed credit, regardless of validator results. Let us know if you see more such problems. Thanks, Sarel.

Sid Celery Joined: 11 Feb 08 Posts: 2454 Credit: 46,464,996 RAC: 2,192	Message 65035 - Posted: 18 Jan 2010, 19:27:57 UTC
Thanks for the information Sarel - and David for the fix. No further errors today, but a cursory check has revealed I haven't re-booted my desktop since Dec 15th! I'm sure I've had various updates since then, but that's a ridiculous amount of uptime for me... Back in 5... ;)

Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0	Message 65036 - Posted: 18 Jan 2010, 19:57:23 UTC - in response to Message 65034.
Sarel wrote: credit is granted based on the client's claimed credit, regardless of validator results.

Does that not apply only to results with compute errors or validate errors?

Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 29,766,470 RAC: 5,707	Message 65037 - Posted: 18 Jan 2010, 23:49:52 UTC - in response to Message 64959. Last modified: 18 Jan 2010, 23:56:15 UTC
hellotheworld wrote: Hi, I have a strange graphic I wanted to show you... I think there might be a problem... Please have a look at this screenshot: http://www.flickr.com/photos/37828392@N08/4273113531/ (Capitain Flam is my account on Flickr) Possible bug for the application BOINC / ROSETTA, because the protein is completely folded, in a tiny meat ball ;-) I hope this is NOT a bug, or even, I hope it will help you to solve it ;)

Oxfez wrote: One of my tasks has "meatballed" too: lr5_no_pro_close_no_dun_A_rlbd_1rnb_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_16701_583_0 Running new 2.05 According to the time to completion, it's going to be a long old process too.

I have another "meatball" too.
Task: https://boinc.bakerlab.org/rosetta/result.php?resultid=311361747
Some screenshots:
http://s001.radikal.ru/i193/1001/1f/cffd2181b53b.jpg
http://i073.radikal.ru/1001/d9/c87d3083bfb9.jpg
http://s41.radikal.ru/i094/1001/8e/a86dfd3a7d6a.jpg
Plus, for about the last 2 hours of computation (or ~20 steps) there were no changes in Energy or RMSD at all. (I did not take more screenshots, since nothing changed after that except CPU Time and Steps count.)
I do not think this is an error in the software, but probably a weak spot in the scientific algorithm itself, so it should be addressed not to the programmers but to the scientists.