Message boards : Number crunching : "Rosetta v4.12 i686-pc-linux-gnu" : fixed 20 h CPU time, fixed 20 credits
Author | Message |
---|---|
xii5ku Send message Joined: 29 Nov 16 Posts: 22 Credit: 13,815,783 RAC: 128 |
Hi, I downloaded tasks on April 6, 04:49 UTC, onto six Linux x86-64 computers. They received a mixture of "Rosetta v4.12 x86_64-pc-linux-gnu" and "Rosetta v4.12 i686-pc-linux-gnu" tasks. The x86_64 tasks behave as expected:
|
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Your machines are hidden, so no one can examine them or your WUs to help you understand what is going on. There are still some cases where the watchdog is finding long-running models and kicking in to end tasks. This extended runtime without reaching the end of a model results in poor credit. Have you seen any similar issues with the new application version running on Ralph? Rosetta Moderator: Mod.Sense |
xii5ku Send message Joined: 29 Nov 16 Posts: 22 Credit: 13,815,783 RAC: 128 |
Example of a good task (v4.12 x86-64): 1140903423 Example of a bad task from the same host (v4.12 i686): 1140917005 This was at a time when I had 16 h target CPU time configured. stderr of the bad task: <core_client_version>7.8.3</core_client_version> <![CDATA[ <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.12_i686-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol predictor_v11_boinc--fuse--covid_spike_design_boinc_v1.xml @flags_jhr_cv -in:file:silent 3qt0mq9p_Junior_HalfRoid_vs_COVID-19_design1.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 3qt0mq9p_Junior_HalfRoid_vs_COVID-19_design1.zip @3qt0mq9p_Junior_HalfRoid_vs_COVID-19_design1.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 Starting watchdog... Watchdog active. BOINC:: CPU time: 72161s, 14400s + 57600s[2020- 4- 7 5:12: 9:] :: BOINC WARNING! cannot get file size for default.out.gz: could not open file. Output exists: default.out.gz Size: -1 InternalDecoyCount: 0 (GZ) ----- 0 ----- Stream information inconsistent. Writing W_0000001 ====================================================== DONE :: 1 starting structures 72161 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== 05:12:09 (86554): called boinc_finish(0) </stderr_txt> ]]> From spot checks, all other tasks of this faulty type, also on other computers of mine, show the same pattern, i.e. "...default.out.gz: could not open file." and "This process generated 1 decoys from 1 attempts". I am not yet registered at Ralph. Maybe I'll try it tomorrow. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Yep, the one you called a bad task was ended by the watchdog. As you say, the watchdog kicks in 4 hours after the runtime preference. I'm hopeful that the new version coming soon will reduce the number of these. Rosetta Moderator: Mod.Sense |
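The figures in the stderr line "CPU time: 72161s, 14400s + 57600s" can be checked against that rule. The following is only a minimal sketch of the arithmetic as described in this thread (runtime preference plus a 4-hour margin); the function names are hypothetical and this is not taken from the Rosetta or BOINC source.

```python
# Sketch of the watchdog arithmetic as described in the thread:
# cutoff = configured target CPU time + a 4-hour margin.
# Function names are hypothetical, chosen for illustration only.

WATCHDOG_MARGIN_S = 4 * 3600  # 14400 s, the "4 hours" mentioned above


def watchdog_cutoff_s(target_cpu_hours: float) -> int:
    """CPU seconds after which the watchdog would end a task."""
    return int(target_cpu_hours * 3600) + WATCHDOG_MARGIN_S


def ended_by_watchdog(cpu_seconds: int, target_cpu_hours: float) -> bool:
    """True if the reported CPU time reached the watchdog cutoff."""
    return cpu_seconds >= watchdog_cutoff_s(target_cpu_hours)


if __name__ == "__main__":
    # 16 h target (57600 s) + 14400 s margin
    print(watchdog_cutoff_s(16))
    # The bad task reported 72161 CPU seconds with a 16 h target.
    print(ended_by_watchdog(72161, 16))
```

With a 16 h target this gives a cutoff of 20 h of CPU time, which matches the "fixed 20 h CPU time" in the thread title.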
xii5ku Send message Joined: 29 Nov 16 Posts: 22 Credit: 13,815,783 RAC: 128 |
I wrote: All in all I have >800 valid results from these downloads by now (there are dual-processor servers among these computers), but I haven't counted how many of them are x86-64 and how many i686. However, all i686 tasks on all computers which got these are showing this behavior, while all x86-64 tasks on the very same computers behave well. I counted now: circa 700 v4.12 x86-64 results on 5 of 6 active computers, all of these tasks good; exactly 110 v4.12 i686 results on 3 of the 6 computers, all of these tasks bad in the same way. I had more i686 tasks in the queue and cancelled them. (That is, 1 computer got only i686 tasks until the point when I went checking, 2 computers got both kinds, the other 3 computers only x86-64 tasks. The 1 unlucky computer will begin working on x86-64 tasks now like the others.) So, in my observation, it is a systematic, 100% repeatable fault of the i686 build, which does not occur with the x86-64 build. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Thank you for the analysis and summary. That is very helpful in pinpointing problem areas. I have reported the issue of i686 Linux WUs never completing first model, causing watchdog end, to the Project Team. Rosetta Moderator: Mod.Sense |
biodoc Send message Joined: 19 Feb 06 Posts: 14 Credit: 30,717,792 RAC: 0 |
@xii5ku, thank you for tracking this problem down. My caches are full of the linux i686 tasks. I would think it would be a good idea to stop the server from sending this work until the bug in the app is fixed. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
To any that have found this thread because they are having Linux i686 issues, please join Ralph (project url is: http://ralph.bakerlab.org/) with your machine. This will help with testing when changes are made to address this, and confirm they are working. No promises on when a new version will be available there. It may take some time. If you would like to try creating a simple cc_config.xml file, you can get BOINC to use the x86_64 version instead of the i686 version that is having trouble. There is an example here. Rosetta Moderator: Mod.Sense |
xii5ku Send message Joined: 29 Nov 16 Posts: 22 Credit: 13,815,783 RAC: 128 |
Last night I received a bunch of tasks from Ralph to 4 of the same 6 computers. I had the default target CPU time configured at Ralph, which is 1 hour. I have 257 valid results, of 257 tasks received:
So there is slight progress from v4.12 to v4.15 on my hosts, but not a breakthrough yet. |
magiceye04 Send message Joined: 11 May 11 Posts: 11 Credit: 1,702,178 RAC: 0 |
To any that have found this thread because they are having Linux i686 issues, please join Ralph (project url is: http://ralph.bakerlab.org/) with your machine. This will help with testing when changes are made to address this, and confirm they are working. I have some broken WUs on my PC with the new version 4.15 *i686*. Why are INTEL686 WUs sent to AMD PCs? The x86 ones run perfectly; please keep these i686 WUs away from non-Intel PCs. https://boinc.bakerlab.org/rosetta/result.php?resultid=1148091774 |
magiceye04 Send message Joined: 11 May 11 Posts: 11 Credit: 1,702,178 RAC: 0 |
What exactly from this example is needed? <no_alt_platform>1</no_alt_platform> ? I have an existing config file and only want to add the relevant line. |
Millenium Send message Joined: 20 Sep 05 Posts: 68 Credit: 184,283 RAC: 0 |
Lol, they are called "intel686" exactly because they are x86, not because they are for intel cpu. And they aren't 64bit. |
Laurent Send message Joined: 15 Mar 20 Posts: 14 Credit: 88,800 RAC: 0 |
All AMD CPUs starting with the Athlon K7 implement i686. That was around 2000, about 20 years ago. How old are your computers? It is called i686 because Intel created the instruction set. AMD can run it (and Windows) because they bought the right to use the platform from Intel. The SSSE problem popping up is related to 64-bit AMDs not implementing a part of Intel's stuff. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Yes, so you could omit the section called <log_flags>. This is only for Linux hosts. And only until a new version resolves the i686 issue where all tasks are never finishing their first model, and are ended by the watchdog. You still need the rest of the shell, so it can be reduced to this: <cc_config> <options> <no_alt_platform>1</no_alt_platform> </options> </cc_config> Rosetta Moderator: Mod.Sense |
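For anyone who already has a cc_config.xml and only wants to add the one line, here is a minimal Python sketch that inserts `<no_alt_platform>1</no_alt_platform>` into existing config text while leaving other sections alone. The function name and the sample `<log_flags>` content are assumptions for illustration; the file location on your host and how BOINC picks up the change are not covered here.

```python
import xml.etree.ElementTree as ET


def add_no_alt_platform(config_xml: str) -> str:
    """Return config text with <no_alt_platform>1</no_alt_platform> set
    inside <options>, creating <options> if it is missing.
    Hypothetical helper, not part of BOINC."""
    root = ET.fromstring(config_xml)
    if root.tag != "cc_config":
        raise ValueError("root element must be <cc_config>")
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    flag = options.find("no_alt_platform")
    if flag is None:
        flag = ET.SubElement(options, "no_alt_platform")
    flag.text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Sample existing config; the <log_flags> content is only an example.
    existing = "<cc_config><log_flags><task>1</task></log_flags></cc_config>"
    print(add_no_alt_platform(existing))
```

Existing sections such as `<log_flags>` are kept as they are; only the `<options>` block is added or updated.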
magiceye04 Send message Joined: 11 May 11 Posts: 11 Credit: 1,702,178 RAC: 0 |
OK - then i686=32bit and x64 = 64bit? But the needed solution is still: run only the 64bit WUs on AMD CPUs. Why is the 20-year-old i686 code still in use if the problem is known? |
Laurent Send message Joined: 15 Mar 20 Posts: 14 Credit: 88,800 RAC: 0 |
OK - then i686=32bit and x64 = 64bit? All current Intel and AMD CPUs for PCs (Windows) are based on the x86 architecture, introduced in 1978. The i686 is the 6th generation and works just fine. x86-64, also called AMD64, is the 64-bit extension of x86. It was invented by AMD and cross-licensed to Intel. All current PC CPUs, even the ones from Intel, include that extension as well as all the previous extensions (i286, i386, i486, ... up to roughly generation 12, depending on how you count the generations). There is no problem with i686; there is a problem with Rosetta. You are barking up the wrong tree. Don't blame Intel or AMD. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Can a more experienced Linux user help here? Tasks not starting. active_task_state: UNINITIALIZED. Rosetta Moderator: Mod.Sense |
JohnDK Send message Joined: 6 Apr 20 Posts: 33 Credit: 2,390,240 RAC: 0 |
I've put <no_alt_platform>1</no_alt_platform> in the cc_config.xml file, but still got an i686 WU. I did choose "read config files" but did not restart BOINC; is that necessary? https://boinc.bakerlab.org/rosetta/results.php?hostid=4063805 |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Is the no_alt_platform tag within the options tag, within the cc_config tag? Rosetta Moderator: Mod.Sense |
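One way to check the nesting without reading the file by eye is a small script. This is only a sketch under the assumption that the file is plain well-formed XML; the function name is hypothetical, and it checks structure only, not whether the BOINC client has actually re-read the file.

```python
import xml.etree.ElementTree as ET


def no_alt_platform_enabled(config_xml: str) -> bool:
    """True only if <no_alt_platform>1</no_alt_platform> sits inside
    <options>, which in turn sits directly inside <cc_config>.
    Hypothetical helper, not part of BOINC."""
    try:
        root = ET.fromstring(config_xml)
    except ET.ParseError:
        return False  # malformed XML counts as "not enabled"
    if root.tag != "cc_config":
        return False
    flag = root.find("options/no_alt_platform")
    return flag is not None and (flag.text or "").strip() == "1"


if __name__ == "__main__":
    good = ("<cc_config><options><no_alt_platform>1</no_alt_platform>"
            "</options></cc_config>")
    misplaced = "<cc_config><no_alt_platform>1</no_alt_platform></cc_config>"
    print(no_alt_platform_enabled(good))
    print(no_alt_platform_enabled(misplaced))
```

A tag placed directly under `<cc_config>`, outside `<options>`, is reported as not enabled by this check.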
JohnDK Send message Joined: 6 Apr 20 Posts: 33 Credit: 2,390,240 RAC: 0 |
This is my cc_config <cc_config> |
©2024 University of Washington
https://www.bakerlab.org