Problems with Rosetta version 5.64

Message boards : Number crunching : Problems with Rosetta version 5.64

Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 40461 - Posted: 7 May 2007, 1:26:26 UTC

Please post any issues here. In particular, let us know if you think there's a big problem with "checkpointing" or with PowerPC Macs. Thanks!



dev

Joined: 1 Dec 05
Posts: 3
Credit: 6,590
RAC: 0
Message 40501 - Posted: 7 May 2007, 22:20:29 UTC - in response to Message 40461.  


I have been unable to start any work on any PPC machine running OS 10.3 or 10.3.9 with Rosetta 5.62 or 5.64. Work units fail with a computation error right after download, and the client keeps downloading more work even though each unit errors out as soon as its download finishes. Intel OS 10.4.x is stable and x86 Linux is stable. I am aware that this is a known problem and am furnishing this information for the ongoing investigation. I have set up a PPC machine with 10.3 and 5.64 on Ralph and am awaiting new work.
dev

Joined: 1 Dec 05
Posts: 3
Credit: 6,590
RAC: 0
Message 40502 - Posted: 7 May 2007, 22:27:55 UTC - in response to Message 40501.  

Further info: the log below follows a fresh download after joining the project with a G3 running OS 10.3.

Log info (BOINC 5.2.13):

Mon May 7 15:00:02 2007|rosetta@home|Finished download of rosetta_5.64_powerpc-apple-darwin
Mon May 7 15:00:02 2007|rosetta@home|Throughput 67112 bytes/sec
Mon May 7 15:00:15 2007||request_reschedule_cpus: files downloaded
Mon May 7 15:00:16 2007|rosetta@home|Starting result 2j03_FOLD_AND_DOCK_SYMM_RELAX_1701_1936_0 using rosetta version 564
Mon May 7 15:00:19 2007|rosetta@home|Unrecoverable error for result 2j03_FOLD_AND_DOCK_SYMM_RELAX_1701_1936_0 (process got signal 5)
Mon May 7 15:00:19 2007||request_reschedule_cpus: process exited
Mon May 7 15:00:19 2007|rosetta@home|Computation for result 2j03_FOLD_AND_DOCK_SYMM_RELAX_1701_1936_0 finished
Mon May 7 15:00:32 2007||request_reschedule_cpus: project op
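
For anyone who wants to pull the failure out of a log like this automatically, here is a minimal sketch. It assumes only what is visible above: each line has the form `timestamp|project|message`. The function name `parse_boinc_line` is made up for illustration and is not part of BOINC.

```python
from datetime import datetime

def parse_boinc_line(line):
    """Split one 'timestamp|project|message' log line into its three fields."""
    stamp, project, message = line.split("|", 2)
    when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return when, project, message

# Two lines copied from the log above.
log = [
    "Mon May 7 15:00:16 2007|rosetta@home|Starting result 2j03_FOLD_AND_DOCK_SYMM_RELAX_1701_1936_0 using rosetta version 564",
    "Mon May 7 15:00:19 2007|rosetta@home|Unrecoverable error for result 2j03_FOLD_AND_DOCK_SYMM_RELAX_1701_1936_0 (process got signal 5)",
]
parsed = [parse_boinc_line(entry) for entry in log]

# Seconds between the task starting and the unrecoverable error.
elapsed = (parsed[1][0] - parsed[0][0]).total_seconds()
errors = [msg for _, _, msg in parsed if msg.startswith("Unrecoverable error")]
print(elapsed, len(errors))
```

With these two lines the task ran for only three seconds before failing, which matches the "errors out as soon as it starts" description.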
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 40503 - Posted: 7 May 2007, 23:15:13 UTC

dev,

We support OS X 10.3.9 or later, so the errors on OS X 10.3 are expected. I would either detach your computers running 10.3 from the project or upgrade the OS. The 10.3.9 errors came from our 5.62 Rosetta version, which had a bug. It should be fixed now with our recent application update.
dev

Joined: 1 Dec 05
Posts: 3
Credit: 6,590
RAC: 0
Message 40508 - Posted: 8 May 2007, 0:42:39 UTC - in response to Message 40503.  

Thank you sir, I will make a note of that!
Fivestar Crashtest

Joined: 12 Jul 06
Posts: 2
Credit: 141,777
RAC: 0
Message 40514 - Posted: 8 May 2007, 6:15:42 UTC

Result ID 78012509
Name 1utg__BOINC_ABRELAX_SAVE_ALL_OUT-1utg_-frags83__1705_1344_1
Workunit 70077736
Created 8 May 2007 1:14:56 UTC
Sent 8 May 2007 1:15:19 UTC
Received 8 May 2007 6:01:15 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 139 (0x8b)
Computer ID 487966
Report deadline 18 May 2007 1:15:19 UTC
CPU time 8212.397242
stderr out

<core_client_version>5.8.17</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
# random seed: 3367377
No heartbeat from core client for 31 sec - exiting
SIGSEGV: segmentation violation
Stack trace (13 frames):
[0x8cbf0fb]
[0x8cb9f2c]
[0xffffe500]
[0x8c2957f]
[0x8b30e02]
[0x8c1106f]
[0x849608c]
[0x80dad29]
[0x85b4d1b]
[0x86d8113]
[0x86d81be]
[0x8d22ff4]
[0x8048111]

Exiting...
SIGSEGV: segmentation violation
SIGABRT: abort called

repeat sigabrt about a million times and then:

SIGABRT: abort called
SIGABRT: abort called

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 34.5143106775097
Granted credit 0
application version 5.64
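
A side note on reading the numbers in this report: the exit status 139 and the "process got signal 11" message appear to describe the same event, because a process killed by signal N is conventionally reported with status 128 + N. The sketch below decodes that convention. The small signal table is an assumption based on the usual Linux and macOS numbering; check `kill -l` on your own system.

```python
# Conventional signal numbers on Linux and macOS (assumed, not taken from the thread).
SIGNAL_NAMES = {5: "SIGTRAP", 6: "SIGABRT", 11: "SIGSEGV"}

def decode_exit_status(status):
    """Return the signal name if status follows the 128 + N convention, else None."""
    if status > 128:
        number = status - 128
        return SIGNAL_NAMES.get(number, "signal %d" % number)
    return None

print(hex(139), decode_exit_status(139))  # prints: 0x8b SIGSEGV
```

Under the same table, the "signal 5" in the earlier PowerPC log would be SIGTRAP.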
ramostol

Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 40524 - Posted: 8 May 2007, 12:35:25 UTC

Rosetta is working, so I can't complain too loudly. But 5.64 has somehow turned my Mac (10.3.9) into a megalomaniac.

Of course its claimed credits have never agreed too closely with the granted credits, but this typical example of current results is surely ridiculous?:
Claimed credit: 497.631627972514
Granted credit: 3.98991572197545

-- R. A. Mostol
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 40529 - Posted: 8 May 2007, 13:31:34 UTC
Last modified: 8 May 2007, 13:34:25 UTC

ramostol, your BOINC benchmarks got inflated somehow. Rerun your benchmarks and your claimed credit should come back to normal: switch to the Advanced view, open the Advanced pulldown menu, and choose Run CPU benchmarks. Once the benchmarks complete, update the project.

If that doesn't correct it, then you should report it as a problem with your 5.8.17 beta release of BOINC.

CPU reported at:
Measured floating point speed 14102.56 million ops/sec
Measured integer speed 95755.51 million ops/sec

Rosetta Moderator: Mod.Sense
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 40532 - Posted: 8 May 2007, 13:53:52 UTC
Last modified: 8 May 2007, 14:14:53 UTC

I agree, Mod.Sense.

Below is data from all the PowerBook 6,5s running on a small project. He could use this as a comparison. However, it does appear he should be getting more than 3 credits for two hours of work. Claiming roughly 250/hour is a bit high.

[Image: data from PowerBook 6,5 hosts]
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 40544 - Posted: 8 May 2007, 17:23:47 UTC

Looks like you're right. His benchmarks are now:

Measured floating point speed 582.32 million ops/sec
Measured integer speed 1816.64 million ops/sec
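
To see how much the benchmarks alone move the claim, here is a rough sketch. It assumes the classic benchmark-based BOINC formula (100 credits per CPU-day on a reference machine scoring 1000 million ops/sec on both benchmarks); this may not be exactly what the 5.8.17 client computes. The 7200 second run is a hypothetical 2-hour task, not a figure from his results.

```python
SECONDS_PER_DAY = 86400.0

def claimed_credit(cpu_seconds, float_mops, int_mops):
    """Benchmark-based claim under the assumed 100-credits-per-reference-day rule."""
    mean_gops = (float_mops + int_mops) / 2.0 / 1000.0
    return cpu_seconds / SECONDS_PER_DAY * 100.0 * mean_gops

# Same hypothetical 2-hour task, first with the inflated benchmarks reported
# earlier in the thread, then with the rerun benchmarks above.
inflated = claimed_credit(7200, 14102.56, 95755.51)
rerun = claimed_credit(7200, 582.32, 1816.64)
print(round(inflated, 1), round(rerun, 1))  # prints: 457.7 10.0
```

Under that assumption the inflated benchmarks alone account for a claim in the hundreds, while the rerun figures bring it down to about 10 for the same CPU time.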

Could the granted credit being lower than the claimed credit (whether using peer data or his new data) be because the Mac app isn't as optimized as the others? If so, it just does less work per hour. I seem to remember this being an issue months ago, but I am too lazy to look up very old threads.
ramostol

Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 40584 - Posted: 9 May 2007, 9:15:04 UTC
Last modified: 9 May 2007, 9:22:26 UTC

Thanks both of you. As you have seen I have performed the benchmarks but not compared results yet (I am offline most of the time). Of course I should have been more precise in stating that I was not surprised at the granted credits, merely at the claimed credits. (This is not a thread for discussing credits, but some results from Rosetta 5.59 might illustrate:

CPU time: 14131.78 -- Granted credit: 3.84102681959318
CPU time: 8999.71 -- Granted credit: 2.41438254677557
CPU time: 10,540.56 -- Granted credit: 2.70
CPU time: 28,207.87 [1 model 7h 50 min] -- Granted credit: 7.46)

For anyone wondering what may have caused these inflated benchmarks: they developed after I temporarily changed the default CPU time from 2 hours to 8 hours.


By the way, if the task 1bm8__BOINC_ABRELAX_SAVE_ALL_OUT-1bm8_-frags83__1705_174_0 is supposed to be checkpointing I have seen no sign of it. My computer was shut down after 41 min crunching (34% cpl.), and now the wu has just restarted from scratch with model 1 step 1.

-- R. A. Mostol
ramostol

Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 40586 - Posted: 9 May 2007, 11:15:05 UTC

Hm, the second running of 1bm8__BOINC_ABRELAX_SAVE_ALL_OUT-1bm8_-frags83__1705_174_0 lasted merely 1 h 25 min., generating this message:

CPU time 5140.18
stderr out
<core_client_version>5.8.17</core_client_version>
<![CDATA[
<stderr_txt>
Rosetta@home Macintosh Stack Size checker.
Original size: 0.
Maximum size: 8388608.
RLIM_INFINITY 0
# cpu_run_time_pref: 7200
# random seed: 3446547
*** malloc[761]: error for object 0xb4ad460: Incorrect checksum for freed object - object was probably modified after being freed; break at szone_error
SIGSEGV: segmentation violation
Rosetta@home Macintosh Stack Size checker.
Original size: 0.
Maximum size: 8388608.
RLIM_INFINITY 0
# cpu_run_time_pref: 7200
======================================================
DONE :: 1 starting structures 5140.18 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
]]>

Validate state Valid


But since Rosetta is satisfied...

-- R. A. Mostol
McSummation

Joined: 30 May 06
Posts: 2
Credit: 62,822
RAC: 0
Message 40593 - Posted: 9 May 2007, 14:21:33 UTC - in response to Message 40461.  

Quoting Message 40461:
Please post any issues here. In particular, let us know if you think there's a big problem with "checkpointing" or with PowerPC Macs. Thanks!

I'm running 5.64 on 2 machines, both with BOINC 5.8.16. On the machine running XP Home, the checkpointing is working properly. However, on the one running Win98SE, the checkpointing does not appear to work properly. When I turned the machine off last night, it was over 5 hours into a WU. This morning, it restarted that WU.

mdettweiler
Joined: 15 Oct 06
Posts: 33
Credit: 2,509
RAC: 0
Message 40598 - Posted: 9 May 2007, 18:06:36 UTC - in response to Message 40584.  

Quoting Message 40584:
By the way, if the task 1bm8__BOINC_ABRELAX_SAVE_ALL_OUT-1bm8_-frags83__1705_174_0 is supposed to be checkpointing I have seen no sign of it. My computer was shut down after 41 min crunching (34% cpl.), and now the wu has just restarted from scratch with model 1 step 1.


Ditto for me. XP Pro SP2, Intel P4 3.2 GHz HT, BOINC v5.4.11, if that helps. I've had a few similar workunits with problems exactly like what he's describing: they always start over from the beginning after I shut down my computer and turn it back on. (This isn't a problem when the task is preempted, though, because I have my preferences set to "leave apps in memory while preempted".) I've set my CPU Run Time preference to 1 hour as a semi-workaround, to make the workunits easier to handle even without checkpoints (it was previously set to 10 hours), but the models in the recent workunits seem to be taking a lot longer than an hour--upwards of 2-3 hours, I've noticed.
Andy Lee Robinson

Joined: 30 May 06
Posts: 1
Credit: 74,015
RAC: 0
Message 40608 - Posted: 9 May 2007, 22:53:51 UTC

I've had a couple of errors now on Linux FC6 (P4 and AMD) where the 5.64 app would just stop and appear to sleep without aborting and moving on...
The WU result indicates "SIGSEGV: segmentation violation" and tries to exit.

Unfortunately this means that a core on the affected host is effectively idle for a few hours until I notice and abort manually :(

I think the new app still needs a little more scrutiny.

Andy.
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 40612 - Posted: 9 May 2007, 23:43:48 UTC
Last modified: 9 May 2007, 23:45:06 UTC

5/10/2007 1:40:29 AM|rosetta@home|Restarting task 1cg5B_BOINC_ABRELAX_SAVE_ALL_OUT_BARCODE-1cg5B-frags83__1706_1219_1 using rosetta version 564

percentage complete went from 9% to .1%
cpu time went from x number of minutes completed to 0:00 completed
model reverted back to 1 and step 1

thought 5.64 was supposed to stop this from happening.
did it not benchmark at 9%?

using boinc 5.8.16
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 40650 - Posted: 10 May 2007, 14:17:15 UTC

what is the minimum completion before a work unit benchmarks?
mdettweiler
Joined: 15 Oct 06
Posts: 33
Credit: 2,509
RAC: 0
Message 40658 - Posted: 10 May 2007, 16:51:32 UTC - in response to Message 40650.  

Quoting Message 40650:
what is the minimum completion before a work unit benchmarks?


I think you mean checkpoint rather than benchmark. Checkpointing is when a workunit saves its state so it can resume later; benchmarking is what your BOINC client does every week or so to see how fast your CPU is, and thus claim amounts of credit based on that (some projects will grant credit based on claimed credit, whereas some--I believe Rosetta is one--will grant credit based on less variable methods).

As for your question, no, unfortunately, I don't know the answer to that. :-( That actually would be something I would like to know myself.
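
To make the checkpoint idea concrete, here is a minimal sketch of "save state so the work can resume later". It is a toy loop, not how Rosetta does it; the file name, the state layout, and the `stop_after` parameter that simulates a shutdown are all made up for illustration.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}

def run(path, steps, stop_after=None):
    state = load_checkpoint(path)
    while state["step"] < steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulate the computer being shut down
        state["total"] += state["step"]  # the "work" for this step
        state["step"] += 1
        save_checkpoint(path, state)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "work.json")
run(ckpt, steps=10, stop_after=4)  # interrupted after 4 steps
final = run(ckpt, steps=10)        # resumes at step 4 instead of step 0
print(final)
```

Without the checkpoint file the second call would start again at step 0, which is the "restarted from model 1 step 1" behaviour people are reporting in this thread.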
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 40683 - Posted: 10 May 2007, 23:09:04 UTC - in response to Message 40658.  

yes you are correct, checkpoint is what i was meaning to ask.
well if you get the answer let me know...

anyone know why at 9% it reset to .2% after a reboot?

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 40688 - Posted: 11 May 2007, 0:21:10 UTC

There are three types of checkpointing, listed here from the longest to the shortest interval between checkpoints.

1. After each model is produced. This checkpointing is done for every type of job, and how often it happens depends on the rate of model production, which in turn depends on a number of factors such as the size of the protein, the type of experiment, and the computer.

2. During the standard relax protocol (the protein jiggles around a little in the graphics and uses full sidechains). There are a number of spots in the control flow where a checkpoint can be made while a model is being computed. This also depends on the factors described above, but it is only available for specific types of jobs that use the relax protocol.

3. A more recent addition: checkpointing for pose and jumping jobs. These types of jobs should checkpoint at intervals that depend on your disk write interval preference.
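
As a rough illustration of the third type, the sketch below shows one way an application could tie checkpoints to a write interval: the code reaches a possible checkpoint spot often, but only writes when enough time has passed since the last write. This is not Rosetta's actual code; the class name, the 20 second spacing of opportunities, and the 60 second interval are all hypothetical.

```python
class CheckpointGate:
    """Allow a checkpoint only if at least `interval` seconds have passed since the last one."""

    def __init__(self, interval, now=0.0):
        self.interval = interval
        self.last = now

    def due(self, now):
        return now - self.last >= self.interval

    def mark(self, now):
        self.last = now

def simulate(opportunities, interval):
    """Return the times at which a checkpoint would actually be written."""
    gate = CheckpointGate(interval)
    written = []
    for t in opportunities:
        if gate.due(t):
            written.append(t)
            gate.mark(t)
    return written

# A checkpoint opportunity every 20 s of simulated time, with a 60 s write interval.
print(simulate(range(20, 201, 20), interval=60))  # prints: [60, 120, 180]
```

The same gate explains why a long write interval combined with few checkpoint spots (types 1 and 2) can leave a lot of work unsaved at shutdown.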
©2024 University of Washington
https://www.bakerlab.org