High number of invalid tasks

Message boards : Number crunching : High number of invalid tasks


tvdsluis

Joined: 27 Mar 20
Posts: 11
Credit: 514,960
RAC: 0
Message 103126 - Posted: 8 Nov 2021, 12:33:54 UTC

Lately I have a high number of validation errors.
Like this one:
https://boinc.bakerlab.org/rosetta/result.php?resultid=1446981031
Nothing changed on my end, and the system is running fine and stable as always.
This system also runs a number of WCG tasks without any problems or errors.
Also the reported runtime of 2 minutes is not correct.
This task ran 4+ hours before being invalidated.
It's a Win10 Pro system with an AMD Ryzen 7 2700 and 16GB of memory.
It also runs FAH tasks on GPU and CPU without any problems.

Anyone else have a growing number of validation errors?
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 103128 - Posted: 8 Nov 2021, 13:17:38 UTC - in response to Message 103126.  
Last modified: 8 Nov 2021, 14:17:20 UTC

Anyone else have a growing number of validation errors?

Yes, I have so far today a 10% invalid rate, which is unusually high.
I have five Ubuntu machines on it, three with VirtualBox to run the pythons too.
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,162,382
RAC: 4,112
Message 103131 - Posted: 8 Nov 2021, 19:33:48 UTC - in response to Message 103126.  

Lately I have a high number of validation errors.
Like this one:
https://boinc.bakerlab.org/rosetta/result.php?resultid=1446981031
Nothing changed on my end, and the system is running fine and stable as always.
This system also runs a number of WCG tasks without any problems or errors.
Also the reported runtime of 2 minutes is not correct.
This task ran 4+ hours before being invalidated.
It's a Win10 Pro system with an AMD Ryzen 7 2700 and 16GB of memory.
It also runs FAH tasks on GPU and CPU without any problems.

Anyone else have a growing number of validation errors?


Each Rosetta task blocks off 8gb of memory for itself, so the problem could be that your other tasks, plus whatever else you do with the PC, are trying to use more than the remaining 8gb. If they also push the envelope, then both are trying to use the same blocks of memory and invalid units are inevitable. On my Win10/11 PC the WCG Africa Rainfall tasks use 5gb each; I don't run FAH, so I don't know how much memory it uses.
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 103132 - Posted: 8 Nov 2021, 21:06:47 UTC - in response to Message 103126.  

I see the same message:

ERROR: [ERROR] Unable to open constraints file: m_09051a5a5815e8c4a7a718313fa04930_0001_000000061_0001_1_35_49_H_._HHH_b2_06813_0002_1_0001.MSAcst
https://boinc.bakerlab.org/rosetta/result.php?resultid=1447459313

It is strange that it is classified as "invalid" rather than "error".
sam6861

Joined: 25 Mar 20
Posts: 4
Credit: 2,411,420
RAC: 0
Message 103133 - Posted: 8 Nov 2021, 21:19:14 UTC - in response to Message 103126.  

So far, my computer's invalids appear to happen randomly on 5nvx_graft_buwei_xab and 5nvx_graft_buwei_xad.

For my computer, nearly all of my invalids ran for less than 3 minutes. The log shows an error.
Sent time: 8 Nov 2021, 9:07:29 UTC
Received: 8 Nov 2021, 9:10:38 UTC
Run time: 2 min 54 sec
Validate state: Invalid
ERROR: [ERROR] Unable to open constraints file
https://boinc.bakerlab.org/rosetta/result.php?resultid=1447267533
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1679
Credit: 17,798,444
RAC: 22,599
Message 103135 - Posted: 8 Nov 2021, 22:35:48 UTC - in response to Message 103126.  

Anyone else have a growing number of validation errors?
Nope, for me there is a slightly higher than usual number of compute Errors.
Don't worry about it: 5nvx_graft_buwei_ Tasks have produced a steady stream of Invalids & Errors ever since they were first released, and the percentage of them compared to Valids does vary as you get batches of work that have more or less than the usual number of error-producing Tasks in them. And most of them die within a matter of minutes.
If you start getting Invalids or Errors that aren't 5nvx_graft_buwei_ Tasks (other than the very occasional RB Task), and the computer that processes the resent Task doesn't get an error, then it's time to be concerned.
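That rule of thumb can be written down as a small check. This is only a sketch: the task-name prefix comes from this thread, while the function name, its arguments and the returned labels are made up for illustration, and it does not special-case the occasional RB Task.

```python
# Prefix of the Task family reported in this thread as producing Invalids/Errors.
KNOWN_BAD_PREFIX = "5nvx_graft_buwei_"


def triage(task_name: str, resend_succeeded: bool) -> str:
    """Classify a failed Task using the rule of thumb described above.

    task_name: the Task name as shown on the result page.
    resend_succeeded: True if the host that got the resent Task finished
        it without an error (hypothetical input you look up yourself).
    """
    if task_name.startswith(KNOWN_BAD_PREFIX):
        # Failures in this family are expected; nothing to fix locally.
        return "known bad batch"
    if resend_succeeded:
        # Another host completed the same work, so look at this machine.
        return "check this host"
    # The resend failed too, which points at the Task rather than the host.
    return "probably a bad task"
```

For example, `triage("5nvx_graft_buwei_xad_SAVE_ALL_OUT_IGNORE_THE_REST_5gk5dq7m_1731808_18_1", True)` falls into the first branch regardless of what the resend did.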



Also the reported runtime of 2 minutes is not correct.
This task ran 4+ hours before being invalidated.
Not so.
It did run for only a few minutes.
Run time 2 min 46 sec
CPU time 2 min 25 sec

Grant
Darwin NT
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1679
Credit: 17,798,444
RAC: 22,599
Message 103136 - Posted: 8 Nov 2021, 22:39:19 UTC - in response to Message 103132.  

It is strange that it is classified as "invalid" rather than "error".
Yep, ever since those Tasks were released the exact same error in the stderr output can result in either a Computation error or a Validation error.
Grant
Darwin NT
tvdsluis

Joined: 27 Mar 20
Posts: 11
Credit: 514,960
RAC: 0
Message 103139 - Posted: 9 Nov 2021, 10:10:38 UTC

Thanks for the responses.
For now I will limit Rosetta to 1 task at a time, to see if the memory constraint is the factor.
I changed it from 1 to 2 not so long ago, so we'll see what happens in the next few days.
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1679
Credit: 17,798,444
RAC: 22,599
Message 103143 - Posted: 9 Nov 2021, 22:22:06 UTC - in response to Message 103139.  
Last modified: 9 Nov 2021, 22:23:50 UTC

Thanks for the responses.
For now I will limit Rosetta to 1 task at a time, to see if the memory constraint is the factor.
Why? The problem has nothing to do with memory issues. As we mentioned, the problem is with those particular Tasks.
Yes, your system is very low on RAM for the number of cores/threads it has (For Rosetta you need to allow 1.3GB RAM per core/thread in use to avoid problems due to lack of RAM; more if you process Python Tasks). But since you aren't using all of them for Rosetta it's not going to be an issue. Memory issues will usually result in an unhandled exception error.
Grant
Darwin NT
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1679
Credit: 17,798,444
RAC: 22,599
Message 103153 - Posted: 11 Nov 2021, 1:49:04 UTC

My Invalids have more than quadrupled overnight. A particularly bad group of 5nvx_graft_buwei_ Tasks making their way through the system.
Grant
Darwin NT
Bryn Mawr

Joined: 26 Dec 18
Posts: 392
Credit: 12,095,213
RAC: 5,468
Message 103155 - Posted: 11 Nov 2021, 7:21:42 UTC - in response to Message 103153.  
Last modified: 11 Nov 2021, 7:24:52 UTC

My Invalids have more than quadrupled overnight. A particularly bad group of 5nvx_graft_buwei_ Tasks making their way through the system.


I had 10, 9 of which were the standard “Unable to open constraints file” but the 10th lasted the full 8 hours and had several hundred “Invalid pointer” errors :-

Task 1448313622
Name 5nvx_graft_buwei_xad_SAVE_ALL_OUT_IGNORE_THE_REST_5gk5dq7m_1731808_18_1
Workunit 1291016882
Created 10 Nov 2021, 8:43:42 UTC
Sent 10 Nov 2021, 8:47:51 UTC
Report deadline 13 Nov 2021, 8:47:51 UTC
Received 10 Nov 2021, 17:38:40 UTC
Server state Over
Outcome Validate error
Client state Done
Exit status 0 (0x00000000)
Computer ID 3563484
Run time 8 hours 2 min
CPU time 8 hours 0 min 10 sec
Validate state Invalid
Credit 470.80
Device peak FLOPS 7.03 GFLOPS
Application version Rosetta v4.20
x86_64-pc-linux-gnu
Peak working set size 995.81 MB
Peak swap size 1,130.45 MB
Peak disk usage 30.48 MB
Stderr output
<core_client_version>7.16.17</core_client_version>
<![CDATA[
<stderr_txt>
*
*** Error in `../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu': free(): invalid pointer: 0x00000000067bd783 ***
*** Error in `../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu': free(): invalid pointer: 0x00000000067bd783 ***
.
.
.
*** Error in `../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu': free(): invalid pointer: 0x00000000067bd783 ***
*** Error in `../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu': free(): invalid pointer: 0x00000000067bd783 ***
*** Error in `../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu': free(): invalid pointer: 0x00000000067bd783 ***
*** Error in `../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu': free(): invalid pointer: 0x00000000067bd783 ***
======================================================
DONE :: 649 starting structures 28810.4 cpu seconds
This process generated 649 decoys from 649 attempts
======================================================
BOINC :: WS_max 1.04103e+09
17:01:05 (1950): called boinc_finish(0)

</stderr_txt>
]]>
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1679
Credit: 17,798,444
RAC: 22,599
Message 103157 - Posted: 11 Nov 2021, 7:36:53 UTC - in response to Message 103155.  

I had 10, 9 of which were the standard “Unable to open constraints file” but the 10th lasted the full 8 hours and had several hundred “Invalid pointer” errors :-
======================================================
DONE :: 649 starting structures 28810.4 cpu seconds
This process generated 649 decoys from 649 attempts
======================================================
So even though it produced valid work, it gave a Validation error due to the invalid pointer issue.
At least you got Credit for the time spent & work done, even if it still gets counted as Invalid.
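The two signals being weighed here, the decoy summary and the repeated invalid-pointer lines, can be pulled out of a stderr dump with a few lines of Python. A minimal sketch, assuming the stderr text has been copied from the result page into a string; the function name and dictionary keys are hypothetical, and the patterns are taken only from the output quoted in this thread.

```python
import re


def summarize_stderr(text: str) -> dict:
    """Tally the error signatures seen in this thread in a stderr dump."""
    # Count the glibc "free(): invalid pointer" lines.
    pointer_errors = len(re.findall(r"free\(\): invalid pointer", text))

    # Pick up the end-of-run summary, if the Task got that far.
    match = re.search(r"generated (\d+) decoys from (\d+) attempts", text)
    decoys = int(match.group(1)) if match else None
    attempts = int(match.group(2)) if match else None

    return {
        "invalid_pointer_lines": pointer_errors,
        "missing_constraints_file": "Unable to open constraints file" in text,
        "decoys": decoys,
        "attempts": attempts,
    }
```

A Task that reports decoys and also has a non-zero `invalid_pointer_lines` count matches the case discussed here: the run completed, yet it was still marked Invalid.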
Grant
Darwin NT
Bryn Mawr

Joined: 26 Dec 18
Posts: 392
Credit: 12,095,213
RAC: 5,468
Message 103167 - Posted: 11 Nov 2021, 15:41:33 UTC - in response to Message 103157.  

I had 10, 9 of which were the standard “Unable to open constraints file” but the 10th lasted the full 8 hours and had several hundred “Invalid pointer” errors :-
======================================================
DONE :: 649 starting structures 28810.4 cpu seconds
This process generated 649 decoys from 649 attempts
======================================================
So even though it produced valid work, it gave a Validation error due to the invalid pointer issue.
At least you got Credit for the time spent & work done, even if it still gets counted as Invalid.


That’s about the size of it.

I posted it as I’ve not seen one like that before.
tvdsluis

Joined: 27 Mar 20
Posts: 11
Credit: 514,960
RAC: 0
Message 103337 - Posted: 16 Nov 2021, 10:20:03 UTC - in response to Message 103131.  
Last modified: 16 Nov 2021, 10:21:46 UTC

An update:
After switching back to running RAH on just one core, all invalids are gone.
I now have 27 consecutive valid tasks.
@mikey, it looks like you're spot on with the memory requirements.
The other 7 cores now run FAH and WCG, and because I only have 16gb on this system, just 1 runs RAH.
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,162,382
RAC: 4,112
Message 103341 - Posted: 16 Nov 2021, 12:20:34 UTC - in response to Message 103337.  

An update:
After switching back to running RAH on just one core, all invalids are gone.
I now have 27 consecutive valid tasks.

@mikey, it looks like you're spot on with the memory requirements.
The other 7 cores now run FAH and WCG, and because I only have 16gb on this system, just 1 runs RAH.


And that's exactly why I also limit Rosetta tasks to one at a time on all my machines; most only have 16gb of RAM anyway.
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1679
Credit: 17,798,444
RAC: 22,599
Message 103354 - Posted: 16 Nov 2021, 21:51:27 UTC - in response to Message 103337.  

An update:
After switching back to running RAH on just one core, all invalids are gone.
That is not what happened.
What happened is that a new batch of work was released, that doesn't include the types of Task that were producing the errors you were posting about. So regardless of whether you use 1 core or 256, the errors won't occur if you're not processing those particular Tasks, and they will re-occur if such Tasks are released again- all regardless of the number of cores you use.

If the errors were due to memory issues, you would have got a different error message (and in many cases it would have occurred after the Task had been processed for some time, not when it had just started up).
As long as you allow 1.3GB of RAM per Rosetta 4.20 Task being processed, you won't have any issues with memory.
If you do, then go to your account, Computing preferences, Memory, and set both "When computer is in use, use at most" and "When computer is not in use, use at most" to 95 % each.
With the amount of RAM you have versus the number of cores/threads, that will allow you to process 12 Tasks at a time without issue. Most of the time it would actually be possible to do 16, but the amount of RAM used by Tasks does vary: presently it's between 700MB and 1GB, and it can be as high as 2GB or as low as 200MB.
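The arithmetic behind the "12 Tasks" figure is just total RAM divided by the per-Task allowance, rounded down. A back-of-the-envelope sketch, using the 16GB and 1.3GB numbers from this thread; the function and parameter names are made up for illustration, and it ignores whatever the OS and other projects are using.

```python
import math


def max_concurrent_tasks(total_ram_gb: float,
                         per_task_gb: float = 1.3,
                         usable_fraction: float = 1.0) -> int:
    """Estimate how many Tasks fit in RAM at once.

    per_task_gb: RAM allowance per Task (1.3GB is the figure quoted above).
    usable_fraction: share of RAM made available to BOINC, e.g. 0.95 for
        a 95 % memory preference (an assumption about how you apply it).
    """
    if per_task_gb <= 0:
        raise ValueError("per_task_gb must be positive")
    return math.floor(total_ram_gb * usable_fraction / per_task_gb)


print(max_concurrent_tasks(16))                        # 1.3GB allowance
print(max_concurrent_tasks(16, per_task_gb=1.0))       # typical usage of about 1GB
print(max_concurrent_tasks(16, usable_fraction=0.95))  # with a 95 % cap applied
```

With the full 16GB counted this gives 12 at the 1.3GB allowance and 16 at about 1GB per Task; applying the 95 % cap to the same calculation lowers the first figure by one.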

However, 1 Python Task would use half the system's RAM, limiting the amount of other work that could be done.
If you don't use VirtualBox for any other projects, then re-installing BOINC using the version that doesn't include VirtualBox would solve that problem.



@mikey, it looks like you're spot on with the memory requirements.
Mikey has been spouting unhelpful rubbish.
Only the Python Tasks require 8GB of RAM due to their use of VirtualBox; as I pointed out above, Rosetta 4.20 Tasks don't require nearly as much RAM.
Grant
Darwin NT




©2024 University of Washington
https://www.bakerlab.org