Rosetta 4.0+

Message boards : Number crunching : Rosetta 4.0+

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,054,272
RAC: 7,218
Message 88113 - Posted: 19 Jan 2018, 17:39:50 UTC - in response to Message 88095.  

Version 4.06: Getting "computational error" after 1 second of trying on a Mac Pro, Boinc 7.6.33.

Mini 3.78 works fine.


It looks to me like Rosetta 4.06 version is compiled with AVX2 enabled. My rosetta_4.06_x86_64-pc-linux-gnu binary had a number of AVX2 instructions, but I doubt it makes much performance difference. Your Harpertown computer does not support any AVX instructions.

All the 3.78 binaries passed (no AVX2).
All the 4.06 binaries failed (AVX2).

IMO, it looks like someone at Rosetta turned on the AVX compile switch for 4.06 without fixing the server job dispatcher to send 4.06 jobs ONLY to CPUs that support them.
Negative impact ... burns network traffic, power, disk space, slows job completion for Rosetta job submitters, ... UGH!
Asleep at the switch.

I don't see any PREFERENCE to tell Rosetta to stop sending the version 4.06 AVX jobs, so it appears that you and everyone else in that situation are stuck.

Too bad they could not have figured this out using RALPH .... 8-)
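For anyone unsure whether their own host falls into the affected group, here is a minimal Python sketch of the check. It assumes a Linux host, where /proc/cpuinfo lists CPU features on a "flags" line; the function name cpu_flags is made up for illustration, and none of this reflects how the Rosetta scheduler actually decides what to send.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


if __name__ == "__main__":
    # Assumption: Linux host; other systems expose CPU features differently.
    try:
        with open("/proc/cpuinfo") as f:
            flags = cpu_flags(f.read())
    except OSError:
        flags = set()
    for feature in ("avx", "avx2"):
        print(feature, "supported" if feature in flags else "not reported")
```

If neither flag is reported, a binary built with AVX instructions would be expected to abort on that host.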

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 8,387
Message 88115 - Posted: 19 Jan 2018, 19:34:22 UTC - in response to Message 88113.  
Last modified: 19 Jan 2018, 19:38:44 UTC

It looks to me like Rosetta 4.06 version is compiled with AVX2 enabled. My rosetta_4.06_x86_64-pc-linux-gnu binary had a number of AVX2 instructions,

Uh, in the Windows version I see only 64-bit active. I will look deeper, if I'm able to.

but I doubt it makes much performance difference.

Maybe this is only the beginning. Maybe they will put bigger simulations in AVX WUs.

IMO, it looks like someone at Rosetta turned on the AVX compile switch for 4.06 without fixing the server job dispatcher to send 4.06 jobs ONLY to CPUs that support them.

With the new servers this is easier to do.

Too bad they could not have figured this out using RALPH .... 8-)

I would rather have WUs crash in Ralph than in Rosetta.
Sometimes I think that Rosetta's admins are afraid to use Ralph....

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,054,272
RAC: 7,218
Message 88116 - Posted: 19 Jan 2018, 20:17:47 UTC - in response to Message 88115.  
Last modified: 19 Jan 2018, 20:19:07 UTC


The Windows exception event should show "ILLEGAL INSTRUCTION" as the cause of the abort.

If you can disassemble, look for instructions using the YMM registers. That is the easiest way on Linux. I just use "objdump -d binary > binary.od" and then look at the registers used. If you see ymm registers, the binary was compiled with AVX enabled (ymm registers arrive with AVX; AVX2 extends it).

grep ymm binary.od
5a23591: c4 e3 7d 18 44 0f 10 vinsertf128 $0x1,0x10(%rdi,%rcx,1),%ymm0,%ymm0
5a235a0: c4 c3 7d 19 44 0a 20 vextractf128 $0x1,%ymm0,0x20(%r10,%rcx,1)
5a23faa: c4 e2 7d 19 45 c8 vbroadcastsd -0x38(%rbp),%ymm0
5a23fb4: c4 c1 7d 7f 02 vmovdqa %ymm0,(%r10)
5a24f4c: c4 e3 7d 18 40 10 01 vinsertf128 $0x1,0x10(%rax),%ymm0,%ymm0
5a24f66: c4 e3 7d 19 84 24 90 vextractf128 $0x1,%ymm0,0x190(%rsp)
5a24f76: c4 e3 7d 18 40 30 01 vinsertf128 $0x1,0x30(%rax),%ymm0,%ymm0
5a24f86: c4 e3 7d 19 84 24 b0 vextractf128 $0x1,%ymm0,0x1b0(%rsp)
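
The same grep can be sketched in Python for anyone who wants a count instead of a wall of output. This is only an illustration: the names YMM_RE and ymm_lines are made up, and it assumes AT&T-syntax objdump output like the lines above. Note that ymm registers were introduced with AVX, so a hit indicates AVX or later; telling AVX from AVX2 would need a look at the individual mnemonics.

```python
import re

# Matches AT&T-syntax ymm register operands such as %ymm0 .. %ymm15.
YMM_RE = re.compile(r"%ymm\d+")


def ymm_lines(disassembly_text):
    """Return the disassembly lines that reference a ymm register."""
    return [line for line in disassembly_text.splitlines() if YMM_RE.search(line)]


def count_ymm_in_file(path):
    """Count ymm-referencing lines in a saved objdump listing (e.g. binary.od)."""
    with open(path, errors="replace") as f:
        return len(ymm_lines(f.read()))
```

Calling count_ymm_in_file("binary.od") on the listing produced by the objdump command above returns the number of matching lines; zero would suggest the binary does not use 256-bit vector registers.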

Saenger
Joined: 19 Sep 05
Posts: 271
Credit: 824,883
RAC: 0
Message 88124 - Posted: 20 Jan 2018, 14:12:57 UTC

The last three of my 4.06-WUs (967991507, 967991500 and 967991525) all errored out after a few seconds with the following message or similar:
ERROR: Error in simple_cycpep_predict app: The N-methylation position indices must be within the pose!
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 1399
BACKTRACE:
[0xe60f258]
[0x8914d8a]
[0x891762b]
[0x805620d]
[0xeabf881]
[0xeabfa7d]
[0x82f2057]
BOINC:: Error reading and gzipping output datafile: default.out
08:35:42 (30370): called boinc_finish(1)


Four others (967991472, 967991483, 967991493 and 967991490) are currently running without problems. I fail to see any pattern in the list on my host with regard to names or such. I will have to wait approximately another 6-10h for the first of the currently running ones to finish, to see whether it errors out later.
Greetings from Sänger

James W
Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 88131 - Posted: 20 Jan 2018, 23:06:00 UTC

Re: My host is Windows XP with a Pentium 4 CPU. I have had an issue with v4.06 windows_intelx86 since I began getting these workunits. They will not actually process, and I get these messages and errors soon after starting the WUs.
01/20/2018 1:42:00 PM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/20/2018 1:42:01 PM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/20/2018 1:42:02 PM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/20/2018 1:42:03 PM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/20/2018 1:42:04 PM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/20/2018 1:42:08 PM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/20/2018 1:42:09 PM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/20/2018 1:42:09 PM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/20/2018 1:42:09 PM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/20/2018 1:42:10 PM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)

01/20/2018 1:42:13 PM | Rosetta@home | Computation for task ec8a5_MEM_1.5len.cg5fa5.bnd3_aivan_SAVE_ALL_OUT_03_09_541378_1156_0 finished
01/20/2018 1:42:13 PM | Rosetta@home | Output file ec8a5_MEM_1.5len.cg5fa5.bnd3_aivan_SAVE_ALL_OUT_03_09_541378_1156_0_r1956498269_0 for task ec8a5_MEM_1.5len.cg5fa5.bnd3_aivan_SAVE_ALL_OUT_03_09_541378_1156_0 absent
01/20/2018 1:42:13 PM | Rosetta@home | Computation for task PF06577.11_aivan_SAVE_ALL_OUT_03_09_541721_1576_0 finished
01/20/2018 1:42:13 PM | Rosetta@home | Output file PF06577.11_aivan_SAVE_ALL_OUT_03_09_541721_1576_0_r632274840_0 for task PF06577.11_aivan_SAVE_ALL_OUT_03_09_541721_1576_0 absent
01/20/2018 1:42:23 PM | | Suspending computation - CPU is busy
01/20/2018 1:42:33 PM | | Resuming computation

Exit status -185 (0xFFFFFF47) ERR_RESULT_START
Stderr output
<core_client_version>7.8.3</core_client_version>
<![CDATA[
<message>
couldn't start app: CreateProcess() failed - (unknown error)</message>
]]>

I've reset the project with no change in processing.

FYI - Thanks.
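
The numbers in the log above are easier to compare once they are written in the same base. A small Python sketch of the arithmetic follows; the helper name as_unsigned32 is made up. As far as I know, Windows error code 193 is ERROR_BAD_EXE_FORMAT ("is not a valid Win32 application"), which would point at the executable file itself and not at the workunit data.

```python
def as_unsigned32(value):
    """Reinterpret a signed 32-bit integer as unsigned (two's complement)."""
    return value & 0xFFFFFFFF


# -185 is the BOINC exit status shown above (ERR_RESULT_START).
print(hex(as_unsigned32(-185)))
# 193 is the Windows error code from the log; 0xc1 is the same number in hex.
print(hex(193))
```

This prints 0xffffff47 and 0xc1, matching the values BOINC reports in parentheses.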

Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 9,701
Message 88151 - Posted: 24 Jan 2018, 2:01:09 UTC

No errors in the last week on mini-rosetta 3.78 but all these on 4.06

PF04295.12_aivan_SAVE_ALL_OUT_03_09_541715_1877_0
PF09868.8_aivan_SAVE_ALL_OUT_03_09_541716_745_0
PF12787.6_aivan_SAVE_ALL_OUT_03_09_541721_799_0
PF06763.10_aivan_SAVE_ALL_OUT_03_09_541721_1445_0
PF10076.8_aivan_SAVE_ALL_OUT_03_09_541721_1445_0
PF11732.7_aivan_SAVE_ALL_OUT_03_09_541716_1461_0
PF07762.13_aivan_SAVE_ALL_OUT_03_09_541716_1248_1
All the above show this same error, which I first mentioned in October
std::cerr: Exception was thrown:
chi angle must be between -180 and 180: nan


CycA_AGPF_6res_hydrophobic_designs_2_CycA_AGPF_c.17.8_0001_SAVE_ALL_OUT_542098_301_0
CycA_AGPF_6res_hydrophobic_designs_2_CycA_AGPF_c.31.6_0001_SAVE_ALL_OUT_542108_763_0
Both the above show this error
ERROR: Error in simple_cycpep_predict app: The N-methylation position indices must be within the pose!
ERROR:: Exit from: ..\..\..\src\protocols\cyclic_peptide_predict\SimpleCycpepPredictApplication.cc line: 1398
BOINC:: Error reading and gzipping output datafile: default.out


CycA_HP_6res_hydrophobic_automated_c.2.5_0001_SAVE_ALL_OUT_542206_138_0
ERROR: in::file::boinc_wu_zip CycA_HP_6res_hydrophobic_designs_2_c.2.5_0001.zip does not exist!
ERROR:: Exit from: ..\..\..\src\apps\public\boinc\minirosetta.cc line: 180
BOINC:: Error reading and gzipping output datafile: default.out
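
Since the failures above fall into a few distinct groups, a rough way to sort a pile of stderr outputs is to match on the message text. The sketch below is just that, a sketch: the substrings are copied from the errors quoted in this post, while the labels and the names SIGNATURES and classify are invented for illustration and are not Rosetta or BOINC terminology.

```python
# Substrings copied from the stderr excerpts above; the labels are made up.
SIGNATURES = [
    ("chi angle must be between -180 and 180: nan", "nan-chi-angle"),
    ("The N-methylation position indices must be within the pose", "n-methylation-index"),
    ("does not exist!", "missing-input-zip"),
]


def classify(stderr_text):
    """Return the label of the first known signature found, else 'unknown'."""
    for needle, label in SIGNATURES:
        if needle in stderr_text:
            return label
    return "unknown"
```

Anything that comes back as "unknown" is a candidate for a new entry in the list.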


James W
Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 88152 - Posted: 24 Jan 2018, 7:11:31 UTC - in response to Message 88131.  

Another v4.06 windows_intelx86 error for my host Windows XP with Pentium 4 CPU, which occurred shortly after starting the WU.
Re: Workunit 873834971 CycA_AGPF_7res_hydrophobic_designs_CycA_AGPF_7res_c.6.10_0001_SAVE_ALL_OUT_542366_836_0

Client state Compute error
Exit status -185 (0xFFFFFF47) ERR_RESULT_START
Computer ID 1580783

Stderr output: <core_client_version>7.8.3</core_client_version>
<message>couldn't start app: CreateProcess() failed - (unknown error)</message>

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 8,387
Message 88154 - Posted: 24 Jan 2018, 13:31:42 UTC

It seems that 4.06 will need some debugging.

[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 42
Credit: 1,258,039
RAC: 0
Message 88155 - Posted: 24 Jan 2018, 16:12:57 UTC

PLUS it would be nice if the admin who created this topic would come and read it sometimes!

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 8,387
Message 88168 - Posted: 26 Jan 2018, 7:59:27 UTC - in response to Message 88155.  

PLUS it would be nice if the admin who created this topic would come and read it sometimes!


The lack of communication at Rosetta is a long (and sad) story.

Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 9,701
Message 88188 - Posted: 30 Jan 2018, 2:24:40 UTC
Last modified: 30 Jan 2018, 2:25:43 UTC

Another week goes by, another 7 PF* tasks coming up with the same "nan" error after running to apparent completion

PF14335.5_aivan_SAVE_ALL_OUT_03_09_541721_2511_0
PF11824.7_aivan_SAVE_ALL_OUT_03_09_541716_1913_0
PF11981.7_aivan_SAVE_ALL_OUT_03_09_541721_3743_0
PF10092.8_aivan_SAVE_ALL_OUT_03_09_541721_2663_0
PF03169.14_aivan_SAVE_ALL_OUT_03_09_541715_2865_0
PF10972.7_aivan_SAVE_ALL_OUT_03_09_541721_4953_0
PF10070.8_aivan_SAVE_ALL_OUT_03_09_541721_4953_0
<core_client_version>7.8.3</core_client_version>
<![CDATA[
<message>
Incorrect function.
(0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.06_windows_x86_64.exe @PF10070.8.flags -in:file:boinc_wu_zip PF10070.8.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3099308
Starting watchdog...
Watchdog active.
std::cerr: Exception was thrown:
chi angle must be between -180 and 180: nan

</stderr_txt>
]]>


James W
Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 88202 - Posted: 1 Feb 2018, 5:35:21 UTC - in response to Message 88131.  

Re: My host is Windows XP with a Pentium 4 CPU. I have had an issue with v4.06 windows_intelx86 since I began getting these workunits. It will not actually process the WUs, and I get these messages and errors soon after starting them.


01/31/2018 2:17:15 AM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/31/2018 2:17:16 AM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/31/2018 2:17:16 AM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/31/2018 2:17:17 AM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/31/2018 2:17:17 AM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)

01/31/2018 2:17:21 AM | Rosetta@home | Computation for task PF11000.7_jumps_aivan_SAVE_ALL_OUT_03_09_543757_128_0 finished
01/31/2018 2:17:21 AM | Rosetta@home | Output file PF11000.7_jumps_aivan_SAVE_ALL_OUT_03_09_543757_128_0_r1320681983_0 for task PF11000.7_jumps_aivan_SAVE_ALL_OUT_03_09_543757_128_0 absent

01/31/2018 2:19:41 AM | Rosetta@home | task PF11067.7_aivan_SAVE_ALL_OUT_03_09_541721_6990_0 resumed by user
01/31/2018 2:19:56 AM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/31/2018 2:19:57 AM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/31/2018 2:19:58 AM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/31/2018 2:19:58 AM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/31/2018 2:19:58 AM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
01/31/2018 2:20:02 AM | Rosetta@home | Computation for task PF11067.7_aivan_SAVE_ALL_OUT_03_09_541721_6990_0 finished
01/31/2018 2:20:02 AM | Rosetta@home | Output file PF11067.7_aivan_SAVE_ALL_OUT_03_09_541721_6990_0_r1604317983_0 for task PF11067.7_aivan_SAVE_ALL_OUT_03_09_541721_6990_0 absent

Is the above a problem unique to XP, or does it occur across various OSes?

Juha
Joined: 28 Mar 16
Posts: 13
Credit: 705,034
RAC: 0
Message 88210 - Posted: 1 Feb 2018, 19:34:33 UTC - in response to Message 88131.  

Windows XP. Issue with v4.06


Looks like 4.06 was compiled with Visual Studio 2015, which by default doesn't create XP-compatible program files.

I don't know if the project decided to drop XP support or if it happened accidentally.
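
One way to check this without a Windows toolchain is to read the subsystem version stamped into the executable's PE header. The Python sketch below is only an illustration with made-up function names. The header offsets follow the published PE format; the XP rule is my assumption: as far as I know, 32-bit XP (NT 5.1) refuses images whose subsystem version is higher than 5.1, and newer Visual Studio toolsets stamp 6.0 unless told otherwise.

```python
import struct


def subsystem_version(pe_bytes):
    """Return (major, minor) subsystem version from a PE image's optional header."""
    if pe_bytes[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew at offset 0x3C points to the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", pe_bytes, 0x3C)
    if pe_bytes[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # Signature (4 bytes) + COFF header (20 bytes) precede the optional header.
    optional_header = pe_offset + 24
    # MajorSubsystemVersion / MinorSubsystemVersion sit at offsets 48 and 50
    # in both PE32 and PE32+ optional headers.
    return struct.unpack_from("<HH", pe_bytes, optional_header + 48)


def loads_on_xp_32bit(version):
    """Assumption: 32-bit Windows XP is NT 5.1 and rejects higher subsystem versions."""
    return version <= (5, 1)
```

Reading the first few kilobytes of the downloaded 4.06 executable and passing them to subsystem_version would show which version it was linked for.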

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 8,387
Message 88217 - Posted: 2 Feb 2018, 9:54:07 UTC - in response to Message 88210.  

I don't know if the project decided to drop XP support or if it happened accidentally.


From Apps page:
Microsoft Windows (98 or later) running on an Intel x86-compatible CPU 4.06

So it seems that they haven't dropped XP
(even though I think abandoning XP is a good idea)

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 8,387
Message 88246 - Posted: 7 Feb 2018, 21:11:26 UTC

972378950

ERROR: Assertion `! lines.empty()` failed.
ERROR:: Exit from: ..\..\..\src\core\io\pdb\pdb_reader.cc line: 78
BOINC:: Error reading and gzipping output datafile: default.out
22:06:16 (14768): called boinc_finish(1)

pututu
Joined: 12 Jun 16
Posts: 5
Credit: 10,028,325
RAC: 0
Message 88261 - Posted: 10 Feb 2018, 23:11:52 UTC

Got a few of these errors running Rosetta v4.06 over the past few days.

Task ID   WU name
973051977 PF09826.8_bnd_aivan_SAVE_ALL_OUT_03_09_543807_2050_0
973046720 PF06980.10_bnd_aivan_SAVE_ALL_OUT_03_09_543807_2040_0
973000035 PF06980.10_bnd_aivan_SAVE_ALL_OUT_03_09_543807_1949_0
972990240 PF10070.8_bnd_aivan_SAVE_ALL_OUT_03_09_543807_1934_0
972988377 PF13584.5_bnd_aivan_SAVE_ALL_OUT_03_09_543807_1931_0
972971744 PF10070.8_bnd_aivan_SAVE_ALL_OUT_03_09_543807_1906_0


Sample error message:

<core_client_version>7.8.3</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.06_windows_intelx86.exe @PF09826.8.bnd.flags -in:file:boinc_wu_zip PF09826.8.bnd.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3839119
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 21742.4s, 14400s + 7200s[2018- 2-10 12:37:57:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 21742.4 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
12:37:57 (5308): called boinc_finish(0)

</stderr_txt>

Trotador
Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 1
Message 88267 - Posted: 11 Feb 2018, 19:27:01 UTC - in response to Message 88261.  

This is the same error I reported 2 months ago, and I can see it still continues.

It is the reason I'm not crunching Rosetta anymore, except on some Android devices that are not affected by this error.


mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,400
Message 88268 - Posted: 12 Feb 2018, 0:09:16 UTC - in response to Message 88267.  

This is the same error I reported 2 months ago, and I can see it still continues.


I'm having that problem, plus the problem of something 'preempting' my workunits, so I have taken one of my machines off of here!! It's a BOINC-ONLY machine and nothing is preempting anything anywhere!! I have other machines that aren't over the 50% error mark; most are under 10%.

Aladar42
Joined: 14 Nov 17
Posts: 2
Credit: 67,864
RAC: 0
Message 88326 - Posted: 20 Feb 2018, 15:11:36 UTC

Getting a good amount of errors myself:

https://boinc.bakerlab.org/workunit.php?wuid=878899708
https://boinc.bakerlab.org/workunit.php?wuid=878668362
https://boinc.bakerlab.org/workunit.php?wuid=878668207
https://boinc.bakerlab.org/workunit.php?wuid=878668371
https://boinc.bakerlab.org/workunit.php?wuid=878668369

LarryMajor
Joined: 1 Apr 16
Posts: 22
Credit: 31,533,212
RAC: 0
Message 88337 - Posted: 22 Feb 2018, 12:25:02 UTC

I started getting errors about a week ago. The common points are that the jobs are all PF*_bnd_aivan_SAVE_ALL_OUT*, and that I only get errors on the machine with AMD Opterons. Some WUs with this name run successfully, and the ones that fail all exceed the target CPU time by four hours before failing.
The error, in part, is “WARNING! cannot get file size for default.out.gz: could not open file” and “Output exists: default.out.gz Size: -1.” The Exit Status is 11.

About half the jobs fail when re-sent to other machines, but when I looked at one that finished successfully on another machine, I see the same errors in both outputs:
Failed:
https://boinc.bakerlab.org/result.php?resultid=974837214
Completed:
https://boinc.bakerlab.org/result.php?resultid=975103716

After seeing the same errors, but an Exit Status 0 on the re-send, I’m really confused about where the problem lies, and will appreciate any help you guys can give me.



©2024 University of Washington
https://www.bakerlab.org