Rosetta 4.0+

Message boards : Number crunching : Rosetta 4.0+

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,633,537
RAC: 7,232
Message 91377 - Posted: 17 Nov 2019, 14:47:56 UTC - in response to Message 91375.  

RAM requirements more than 3 GB / WU ... isn't fun anymore

And... not always.
Sometimes I also get these errors even when I have more than 7 GB of RAM free.
I think it's a memory allocation problem.
ID: 91377 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,633,537
RAC: 7,232
Message 91378 - Posted: 17 Nov 2019, 14:49:15 UTC

After 6 hours of calculation (my default run time is 2 hours):
1105183556
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 21704.1s, 14400s + 7200s[2019-11-17 15:29:37:] :: BOINC
Writing W_0000001
======================================================
DONE :: 1 starting structures 21704.1 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
15:29:37 (11620): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>ref_MonomerDesign2019_BOINC_SAVE_ALL_OUT_880858_181_1_r481700808_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
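
As a rough check on the numbers in that log, the `CPU time` line can be split into the CPU seconds actually used and the two allowances printed after it. Reading `14400s + 7200s` as a 4-hour grace period on top of the 2-hour target is an assumption based on the 2-hour default mentioned above; the log itself does not label the two values. The helper name `parse_cpu_time` is hypothetical.

```python
import re

LOG_LINE = "BOINC:: CPU time: 21704.1s, 14400s + 7200s[2019-11-17 15:29:37:] :: BOINC"

def parse_cpu_time(line):
    """Return (cpu_seconds_used, summed_allowance_seconds) from a 'CPU time' line."""
    match = re.search(r"CPU time: ([\d.]+)s, (\d+)s \+ (\d+)s", line)
    if match is None:
        raise ValueError("no CPU time fields found")
    used = float(match.group(1))
    # Assumption: the two printed values are meant to be added together.
    allowance = int(match.group(2)) + int(match.group(3))
    return used, allowance

used, allowance = parse_cpu_time(LOG_LINE)
print(f"used {used / 3600:.2f} h against a {allowance / 3600:.2f} h allowance")
print(f"difference: {used - allowance:.1f} s")
```

If that reading is right, the task stopped shortly after the combined allowance ran out, which would fit a watchdog ending the run rather than the task finishing on its own schedule.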

ID: 91378 · Rating: 0
Johannes Kingma
Joined: 4 Dec 17
Posts: 33
Credit: 9,453,639
RAC: 0
Message 91452 - Posted: 12 Dec 2019, 9:56:01 UTC - in response to Message 87456.  

For a couple of months I have had a problem where the Windows 4.07 app stalls on uploads that reach 100%. They then move to the status "Upload: pending (project backoff)". All data seems to upload fine. stdoutdae.txt reads:


    12-Dec-2019 10:36:27 [Rosetta@home] update requested by user
    12-Dec-2019 10:36:29 [Rosetta@home] Sending scheduler request: Requested by user.
    12-Dec-2019 10:36:29 [Rosetta@home] Reporting 1 completed tasks
    12-Dec-2019 10:36:29 [Rosetta@home] Not requesting tasks: too many uploads in progress
    12-Dec-2019 10:36:31 [Rosetta@home] Scheduler request completed
    12-Dec-2019 10:36:48 [Rosetta@home] work fetch suspended by user
    12-Dec-2019 10:36:49 [Rosetta@home] work fetch resumed by user



There are about 50 uploads stuck at 100% progress like this. I can abort them, but the same thing happens the next time.

ID: 91452 · Rating: 0
James W
Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 91480 - Posted: 26 Dec 2019, 10:49:25 UTC

Application version: Rosetta v4.07 windows_intelx86
Device: 3710630, Task: 1112811554, and WU 1002360285.
Name: foldit_2007855_0003_fold_and_dock_SAVE_ALL_OUT_849400_5017_0
Status: Error while computing
Exit status: 1 (0x00000001) Unknown error code
Incorrect function. (0x1) - exit code 1 (0x1)

ERROR: Assertion `std::abs( coordsys_rot.det() - 1.0 ) < 1e-6` failed.
ERROR:: Exit from: ..\..\..\src\core\pose\symmetry\util.cc line: 898
BOINC:: Error reading and gzipping output datafile: default.out
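
The failed assertion requires the determinant of `coordsys_rot` to be within 1e-6 of 1.0, which is the property a proper rotation matrix has. The sketch below only reproduces that numeric condition on plain 3x3 lists; it is not Rosetta code, and `det3`, `passes_det_check`, and `rot_z` are hypothetical names chosen for illustration.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )

def passes_det_check(m, tol=1e-6):
    """Same condition as the assertion text: |det(m) - 1| < tol."""
    return abs(det3(m) - 1.0) < tol

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

clean = rot_z(0.3)
# Scaling every entry by 0.1% changes the determinant by about 0.3%.
drifted = [[x * 1.001 for x in row] for row in clean]
# A mirror has determinant -1, so it is not a proper rotation.
mirrored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]

for name, m in [("clean", clean), ("drifted", drifted), ("mirrored", mirrored)]:
    print(name, passes_det_check(m))
```

With a tolerance this tight, even small scaling or rounding drift in the input coordinate frame would be enough to trip the check, so the failure does not by itself say whether the input data or the arithmetic was at fault.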
ID: 91480 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,633,537
RAC: 7,232
Message 91503 - Posted: 2 Jan 2020, 11:54:54 UTC

Again a lot of errors
1114556241
1114555979
Etc
ID: 91503 · Rating: 0
James W
Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 91509 - Posted: 3 Jan 2020, 7:26:12 UTC

Application version: Rosetta v4.07 windows_intelx86
Device: 3710630, Task: 1113518581, and WU 1002915300.
Name: rb_12_26_12887_12992__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_884505_148
Status: Error while computing
Errors: Too many errors (may have bug) Too many total results
Exit status: -529697949 (0xE06D7363) Unknown error code
<message>(unknown error) - exit code -529697949 (0xe06d7363)</message>

Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7591C5AF
Engaging BOINC Windows Runtime Debugger...

Note that the prior task errored out due to: Unknown error code
Incorrect function. (0x1) - exit code 1 (0x1)
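
The decimal and hexadecimal forms of the exit status in this report are the same 32-bit value read as signed and unsigned. A minimal sketch of that conversion follows; the function names are hypothetical. The value 0xE06D7363 is the code the Microsoft C++ runtime uses when a C++ exception is thrown, and its low three bytes spell "msc" in ASCII, which matches the "C++ Exception" wording in the report.

```python
def exit_code_as_hex(code):
    """Render a signed 32-bit exit status as unsigned, zero-padded hex."""
    return f"0x{code & 0xFFFFFFFF:08X}"

def low_bytes_as_ascii(code):
    """Decode the low three bytes of the code as ASCII text."""
    return (code & 0x00FFFFFF).to_bytes(3, "big").decode("ascii")

print(exit_code_as_hex(-529697949))
print(exit_code_as_hex(1))
print(low_bytes_as_ascii(-529697949))
```

The code only says that some C++ exception went unhandled; the "Out Of Memory" reason printed by the debugger is the part that narrows it down.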
ID: 91509 · Rating: 0
James W
Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 91605 - Posted: 24 Jan 2020, 5:49:28 UTC

Application version: Rosetta v4.07 windows_intelx86
Device: 3710630, Task: 1118475697, and WU 1007449108.
Name: rb_01_23_14034_14473__t000__0_C2_SAVE_ALL_OUT_IGNORE_THE_REST_886490_64
Status: Error while downloading
Errors: Too many errors (may have bug). Too many total results.
Exit status: -186 (0xFFFFFF46) ERR_RESULT_DOWNLOAD
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>flags_rb_01_23_14034_14473__t000__0_C2_robetta</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
ID: 91605 · Rating: 0
James W
Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 91720 - Posted: 17 Feb 2020, 3:19:58 UTC

Application version: Rosetta v4.07 windows_intelx86
Device: 1759960, Task: 1121955810, and WU: 1010612613.
Name: rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0
Status: Error while computing
Exit status: 1 (0x00000001) Unknown error code
<message>Incorrect function. (0x1) - exit code 1 (0x1)</message>

There was also additional info in Event Log:
2/16/2020 4:33:48 AM | Rosetta@home | Computation for task rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0 finished
2/16/2020 4:33:48 AM | Rosetta@home | Output file rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0_r524257430_0 for task rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0 absent

Similar error with Task: 1121955812 and WU: 1010612617.
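
For anyone collecting many of these, Event Log lines in the `timestamp | project | message` layout shown above can be scanned for the "Output file ... absent" pattern. This is a minimal sketch that assumes the US-style timestamp format seen in this post; `parse_event` and `absent_outputs` are hypothetical helper names, not part of BOINC.

```python
import re
from datetime import datetime

EVENT_LOG = """\
2/16/2020 4:33:48 AM | Rosetta@home | Computation for task rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0 finished
2/16/2020 4:33:48 AM | Rosetta@home | Output file rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0_r524257430_0 for task rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0 absent
"""

ABSENT = re.compile(r"Output file (\S+) for task (\S+) absent")

def parse_event(line):
    """Split one Event Log line into (datetime, project, message)."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message

def absent_outputs(log_text):
    """Return (when, task_name, file_name) for every 'absent' output line."""
    found = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        when, _project, message = parse_event(line)
        match = ABSENT.search(message)
        if match:
            found.append((when, match.group(2), match.group(1)))
    return found

for when, task, file_name in absent_outputs(EVENT_LOG):
    print(when.isoformat(), task, file_name)
```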
ID: 91720 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,402
Message 91722 - Posted: 17 Feb 2020, 3:46:56 UTC - in response to Message 91720.  
Last modified: 17 Feb 2020, 3:50:22 UTC

Application version: Rosetta v4.07 windows_intelx86
Device: 1759960, Task: 1121955810, and WU: 1010612613.
Name: rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0
Status: Error while computing
Exit status: 1 (0x00000001) Unknown error code
<message>Incorrect function. (0x1) - exit code 1 (0x1)</message>

There was also additional info in Event Log:
2/16/2020 4:33:48 AM | Rosetta@home | Computation for task rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0 finished
2/16/2020 4:33:48 AM | Rosetta@home | Output file rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0_r524257430_0 for task rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0 absent

Similar error with Task: 1121955812 and WU: 1010612617.

Looks like the important lines are:

Status: Error while computing
<message>Incorrect function. (0x1) - exit code 1 (0x1)</message>

All the rest appears to be the result of that.

I'd like to see more information shown about WHAT the error in computing was, though, whenever such an error occurs.
ID: 91722 · Rating: 0
James W
Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 91739 - Posted: 19 Feb 2020, 6:12:59 UTC - in response to Message 91720.  

Application version: Rosetta v4.07 windows_intelx86
Device: 1759960, Task: 1121955810, and WU: 1010612613.
Name: rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0
Status: Error while computing
Exit status: 1 (0x00000001) Unknown error code
<message>Incorrect function. (0x1) - exit code 1 (0x1)</message>

There was also additional info in Event Log:
2/16/2020 4:33:48 AM | Rosetta@home | Computation for task rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0 finished
2/16/2020 4:33:48 AM | Rosetta@home | Output file rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0_r524257430_0 for task rb_02_08_15652_15556__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_891233_2568_0 absent

Similar error with Task: 1121955812 and WU: 1010612617.

The 2nd host has now completed crunching this task, and had a successful outcome. However, the app used in this case was Rosetta v4.07 windows_x86_64 and not Rosetta v4.07 windows_intelx86.

Concerning WU 1010612617, both my host and the 2nd host were using v4.07 windows_intelx86, and that one also failed validation for the same reason as on my host. It appears the v4.07 windows_intelx86 app needs to be checked in connection with this type of task. Note also that my system is 64-bit.
ID: 91739 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,402
Message 91801 - Posted: 29 Feb 2020, 1:08:49 UTC

Are there multiple upload servers, with at least one not working?

I have a 4.07 workunit that is finished, but an output file repeatedly refuses to upload.

Another workunit that finished after this one quickly uploaded, and was marked completed.

Another attempt to do the upload is scheduled for about 5.5 hours from now.

The file size for the upload is 2.53 MB.
ID: 91801 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 91807 - Posted: 29 Feb 2020, 15:14:20 UTC - in response to Message 91801.  

Yes, there are multiple upload servers. The BOINC Manager sets a delay on the next attempt, and I believe the WU has a list of possible upload servers that BOINC will rotate through on future attempts.
Rosetta Moderator: Mod.Sense
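
The behaviour described in this reply (wait after a failure, then try the next server in the list) can be pictured with a small simulation. It is only an illustration of "retry, wait longer, move on to the next server"; it is not taken from the BOINC client, and the host names, `try_upload`, `base_delay`, and the doubling rule are all assumptions.

```python
from itertools import cycle

def try_upload(servers, attempt_upload, max_attempts=6, base_delay=60.0):
    """Rotate through `servers`, doubling the scheduled delay after each failure.

    `attempt_upload(server)` is a caller-supplied function returning True on success.
    Returns (server_that_worked_or_None, delays_scheduled_after_each_failure).
    """
    delays = []
    delay = base_delay
    for _attempt, server in zip(range(max_attempts), cycle(servers)):
        if attempt_upload(server):
            return server, delays
        delays.append(delay)  # a real client would wait this long before retrying
        delay *= 2
    return None, delays

# Hypothetical server list in which only the second host accepts uploads.
SERVERS = ["upload-a.example", "upload-b.example"]

winner, waited = try_upload(SERVERS, lambda s: s == "upload-b.example")
print(winner, waited)

# If every server refuses, the scheduled delays keep growing.
nobody, waited_all = try_upload(SERVERS, lambda s: False, max_attempts=4)
print(nobody, waited_all)
```

Under this model a single dead server costs one delay per pass through the list, whereas a file that no server will accept just accumulates longer and longer delays.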
ID: 91807 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,402
Message 91809 - Posted: 29 Feb 2020, 16:59:47 UTC - in response to Message 91807.  
Last modified: 29 Feb 2020, 17:03:59 UTC

Yes, there are multiple upload servers. The BOINC Manager sets a delay on the next attempt, and I believe the WU has a list of possible upload servers that BOINC will rotate through on future attempts.

I see the delays on next attempts. Up to several hours now.

If the WU has a list of upload servers, it has failed to upload to many of them over the last few days.

Does this mean that the list of upload servers included in the WU should be examined to determine whether it includes any valid servers? Can users do that, or does it need to be done at the project end of the connection?
ID: 91809 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 91810 - Posted: 29 Feb 2020, 19:02:39 UTC - in response to Message 91809.  

Does this mean that the list of upload servers included in the WU should be examined to determine whether it includes any valid servers?

It means that it is a bad work unit. All your others are going through, and I don't see the upload problem at all.
Ditch it.
ID: 91810 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,402
Message 91920 - Posted: 10 Mar 2020, 18:42:34 UTC

Another small download is stalled, which blocks my computer from downloading any tasks.

3/10/2020 11:47:11 AM | Rosetta@home | Not requesting tasks: some download is stalled
3/10/2020 11:47:13 AM | Rosetta@home | Scheduler request completed
3/10/2020 12:12:53 PM | Rosetta@home | Started download of rb_03_01_17261_17076_ab_t000__robetta.zip
3/10/2020 12:18:00 PM | | Project communication failed: attempting access to reference site
3/10/2020 12:18:00 PM | Rosetta@home | Temporarily failed download of rb_03_01_17261_17076_ab_t000__robetta.zip: transient HTTP error
3/10/2020 12:18:00 PM | Rosetta@home | Backing off 04:04:15 on download of rb_03_01_17261_17076_ab_t000__robetta.zip
3/10/2020 12:18:01 PM | | Internet access OK - project servers may be temporarily down.

4.05 KB of a 5.37 KB file was downloaded.

The last file I saw this problem on was also a *.zip file.

Could this mean that the server gets confused about when it's time to stop sending more of a *.zip file?

Or could it mean that when a download of a small file fails, the next attempt always uses the same server?
ID: 91920 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 34
Message 91936 - Posted: 11 Mar 2020, 8:37:35 UTC - in response to Message 91920.  

I had one of these earlier this week: a download stalled at 90+% complete. I tried restarting the process, and the machine, but ultimately killed the job. It was completed and validated by my wingman. He has Windows 10, I have 8.1, but I doubt that is significant.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 91936 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,402
Message 91939 - Posted: 11 Mar 2020, 12:13:28 UTC - in response to Message 91936.  
Last modified: 11 Mar 2020, 12:16:45 UTC

I had one of these earlier this week: a download stalled at 90+% complete. I tried restarting the process, and the machine, but ultimately killed the job. It was completed and validated by my wingman. He has Windows 10, I have 8.1, but I doubt that is significant.

Probably not - I also use Windows 10.

Another such failure:

3/11/2020 6:41:34 AM | Rosetta@home | Started download of twc_method_msd_cpp_c580_9mer_gb_000073_msd.zip
3/11/2020 6:46:41 AM | Rosetta@home | Temporarily failed download of twc_method_msd_cpp_c580_9mer_gb_000073_msd.zip: transient HTTP error
3/11/2020 6:46:41 AM | Rosetta@home | Backing off 01:26:54 on download of twc_method_msd_cpp_c580_9mer_gb_000073_msd.zip
3/11/2020 6:46:42 AM | | Project communication failed: attempting access to reference site
3/11/2020 6:46:44 AM | | Internet access OK - project servers may be temporarily down.

Several attempts, each failed after getting 2.63 KB of the expected 3.04 KB.

I now just abort the download in such cases - nothing else seems to help if several attempts to download the same file have failed.
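
One detail in these logs that is easy to miss is how long each attempt runs before it is marked as failed. The sketch below takes the "Started download" and "Temporarily failed download" timestamps from the two excerpts posted so far and computes the gap; `seconds_until_failure` is a hypothetical helper, and the timestamp format is assumed to be the US-style one shown in the logs.

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"

# (started, temporarily failed) pairs copied from the Event Log excerpts.
PAIRS = [
    ("3/10/2020 12:12:53 PM", "3/10/2020 12:18:00 PM"),
    ("3/11/2020 6:41:34 AM", "3/11/2020 6:46:41 AM"),
]

def seconds_until_failure(started, failed):
    """Whole seconds between the start of a download and its failure message."""
    delta = datetime.strptime(failed, FMT) - datetime.strptime(started, FMT)
    return int(delta.total_seconds())

elapsed = [seconds_until_failure(s, f) for s, f in PAIRS]
print(elapsed)
```

Both attempts fail after almost exactly the same interval, a little over five minutes. That would be consistent with a transfer hitting some fixed timeout rather than being refused outright, although nothing in the logs confirms what triggers it.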
ID: 91939 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,402
Message 91952 - Posted: 13 Mar 2020, 12:32:37 UTC

Another case of a small input file repeatedly failing to download:

3/13/2020 5:57:03 AM | Rosetta@home | Started download of twc_method_msd_cpp_JC_9_34334_1_msd.zip
3/13/2020 6:02:10 AM | Rosetta@home | Temporarily failed download of twc_method_msd_cpp_JC_9_34334_1_msd.zip: transient HTTP error
3/13/2020 6:02:10 AM | Rosetta@home | Backing off 04:42:28 on download of twc_method_msd_cpp_JC_9_34334_1_msd.zip
3/13/2020 6:02:11 AM | | Project communication failed: attempting access to reference site
3/13/2020 6:02:13 AM | | Internet access OK - project servers may be temporarily down.
3/13/2020 6:43:44 AM | Rosetta@home | Sending scheduler request: To report completed tasks.
3/13/2020 6:43:44 AM | Rosetta@home | Reporting 1 completed tasks
3/13/2020 6:43:44 AM | Rosetta@home | Not requesting tasks: some download is stalled
3/13/2020 6:43:46 AM | Rosetta@home | Scheduler request completed

Only 2.63 KB of the expected 3.03 KB would download.

Windows 10 did an automatic update last night. It is unclear whether this was involved in the download problem.

I aborted this download to allow downloading more tasks.
ID: 91952 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,402
Message 92020 - Posted: 16 Mar 2020, 22:12:28 UTC

Another repeatedly failing download of a small input file:

3/16/2020 4:54:30 PM | Rosetta@home | Started download of 9v1nm_gb_c3065_9mer_gb_001680.zip
3/16/2020 4:59:38 PM | Rosetta@home | Temporarily failed download of 9v1nm_gb_c3065_9mer_gb_001680.zip: transient HTTP error
3/16/2020 4:59:38 PM | Rosetta@home | Backing off 00:58:42 on download of 9v1nm_gb_c3065_9mer_gb_001680.zip
3/16/2020 4:59:39 PM | | Project communication failed: attempting access to reference site
3/16/2020 4:59:41 PM | | Internet access OK - project servers may be temporarily down.

Got 2.63 KB of the expected 2.93 KB.
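
Putting the four reported stalls side by side makes the pattern easier to see. The figures below are the "got / expected" sizes quoted in this thread for each file; the helper `percent_received` is a hypothetical name, and the sizes are the rounded KB values as displayed, so the percentages are approximate.

```python
# (file name, KB received, KB expected) as reported in the posts.
REPORTS = [
    ("rb_03_01_17261_17076_ab_t000__robetta.zip", 4.05, 5.37),
    ("twc_method_msd_cpp_c580_9mer_gb_000073_msd.zip", 2.63, 3.04),
    ("twc_method_msd_cpp_JC_9_34334_1_msd.zip", 2.63, 3.03),
    ("9v1nm_gb_c3065_9mer_gb_001680.zip", 2.63, 2.93),
]

def percent_received(got_kb, expected_kb):
    """Share of the file that arrived, as a percentage rounded to one decimal."""
    return round(100.0 * got_kb / expected_kb, 1)

percents = [percent_received(got, expected) for _name, got, expected in REPORTS]
same_cutoff = sum(1 for _name, got, _expected in REPORTS if got == 2.63)

for (name, got, expected), pct in zip(REPORTS, percents):
    print(f"{name}: {got} of {expected} KB ({pct}%)")
print(f"{same_cutoff} of {len(REPORTS)} stalls stopped at 2.63 KB")
```

Three of the four transfers stop at the same displayed byte count even though the files differ in size, which may be a more useful detail to report than the percentage alone.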

Is it worthwhile to report this type of download failure?
ID: 92020 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 92021 - Posted: 16 Mar 2020, 22:33:03 UTC - in response to Message 92020.  

Is it worthwhile to report this type of download failure?

I tried.
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1000
ID: 92021 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org