Acourbert_xxx units taking over 40 GB RAM

Message boards : Number crunching : Acourbert_xxx units taking over 40 GB RAM


Trotador

Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 2
Message 81117 - Posted: 2 Feb 2017, 3:42:19 UTC

I have found several of these units behaving this way, and I've had to abort them. Even when the host has enough memory, they end up disrupting the other units being crunched and bringing the host down.

Is this the expected performance for these units (needing so much RAM)?
ID: 81117 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 81118 - Posted: 2 Feb 2017, 5:58:49 UTC

I see you have a couple of large server systems you are using for Rosetta. Are you saying that a single work unit is somehow taking 40GB? If so, that is not at all expected. Another way to ask the question is, how much more memory do you feel these Acourbert tasks are consuming than other work you have processed?
Rosetta Moderator: Mod.Sense
ID: 81118 · Rating: 0
Trotador

Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 2
Message 81119 - Posted: 2 Feb 2017, 6:11:50 UTC

Hi,

I mean a single unit. Most units stay below 800 MB and some can reach 1-1.2 GB, but I've seen several of these (not all of them) go up to 40 GB and more.
ID: 81119 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 81120 - Posted: 2 Feb 2017, 6:43:23 UTC

With 72 and 88 CPUs, I'm guessing you mean all active tasks at a given time combine to use 40GB of memory. Yes, 500MB per active work unit would not be uncommon.
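A rough back-of-the-envelope check of that guess, as a small Python sketch. The 500 MB per-task figure and the 72/88 CPU counts come from this thread; the assumption that every core runs exactly one task of that size, and the function name, are illustrative only.

```python
def combined_memory_gb(active_tasks: int, mb_per_task: float = 500.0) -> float:
    """Estimate the memory (in decimal GB) used by all active tasks together.

    Assumes every task uses the same amount of memory, which is a
    simplification made for this example.
    """
    return active_tasks * mb_per_task / 1000.0


# One task per CPU on each of the two hosts mentioned above.
for cpus in (72, 88):
    print(f"{cpus} tasks x 500 MB = {combined_memory_gb(cpus):.0f} GB")
```

Under these assumptions the two hosts would sit at roughly 36 GB and 44 GB in total, which is why a combined figure near 40 GB would not be surprising.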

You can set preferences in the BOINC Manager for how much memory BOINC is allowed to use. The BOINC Manager will then basically moderate the number of active tasks to live within your stated preference. This can help ensure some memory is available for non-BOINC work. However, it does bring about periods of time when some CPU cores are idle.

I often suggest adding another BOINC project that has lower memory requirements for a portion of resource share. The mix of high and low memory work units can keep all of the CPU cores busy, and reduce total memory consumed by BOINC workloads. It also gives you a second project to crunch on if R@h should hit a period where it runs out of work or has some server downtime. I'll suggest WCG as a project with some similar objectives to R@h, and generally their projects have much lower memory requirements. Even just a 10% resource share should be enough to significantly improve on the memory constraint.
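To illustrate why mixing in a low-memory project can keep every core busy under a memory cap, here is a minimal Python sketch. It is not how the BOINC client actually schedules work; the 24 GB budget and the 100 MB low-memory task size are hypothetical numbers, while the 72 cores and 500 MB per task come from this thread.

```python
def single_project(cores, budget_mb, task_mb):
    """Return (running_tasks, idle_cores) when every task needs task_mb."""
    running = min(cores, budget_mb // task_mb)
    return running, cores - running


def plan_task_mix(cores, budget_mb, high_mb, low_mb):
    """Return (high_tasks, low_tasks, idle_cores) that fit within budget_mb.

    Prefers as many high-memory tasks as possible and fills the remaining
    cores with low-memory tasks. This greedy rule is an assumption made
    for the example only.
    """
    for high in range(cores, -1, -1):
        low = cores - high
        if high * high_mb + low * low_mb <= budget_mb:
            return high, low, 0
    # Even low-memory tasks alone cannot fill every core within the budget.
    low = budget_mb // low_mb
    return 0, low, cores - low


# Hypothetical host: 72 cores, BOINC limited to 24,000 MB.
print(single_project(72, 24000, 500))      # high-memory work only
print(plan_task_mix(72, 24000, 500, 100))  # high- and low-memory work mixed
```

With these example numbers, high-memory work alone runs 48 tasks and leaves 24 cores idle, while the mix runs 42 high-memory and 30 low-memory tasks with no idle cores.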
Rosetta Moderator: Mod.Sense
ID: 81120 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 81121 - Posted: 2 Feb 2017, 6:46:07 UTC
Last modified: 2 Feb 2017, 6:49:12 UTC

I mean a single unit...


No, that is neither normal nor expected. Please abort these tasks and provide links to, or the full WU names of, the problem tasks from your BOINC Manager event log.
Rosetta Moderator: Mod.Sense
ID: 81121 · Rating: 0
Trotador

Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 2
Message 81124 - Posted: 2 Feb 2017, 9:37:56 UTC - in response to Message 81121.  

There is no such message from the BOINC Manager; the host simply becomes starved and the units would stay there forever without making real progress. I detect it by looking at the system monitor: all minirosetta threads are halted, one of them is occupying 40.3 GB, and the system monitor indicates that it is an Acourbert unit. I kill it and the host starts recovering until it returns to normal operation.

I can try to take some screenshots if I receive more units of this type.
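The manual check described above could be sketched in Python roughly as follows. The process list is a hand-written snapshot rather than a real reading: the PIDs and the 4 GB alert threshold are hypothetical, and only the memory figures (0.8, 1.2 and 40.3 GB) echo numbers mentioned in this thread.

```python
def find_runaway(processes, limit_gb=4.0):
    """Return the processes whose resident memory exceeds limit_gb, largest first."""
    over = [p for p in processes if p["rss_gb"] > limit_gb]
    return sorted(over, key=lambda p: p["rss_gb"], reverse=True)


# Hypothetical snapshot of what a system monitor might show.
snapshot = [
    {"pid": 101, "name": "minirosetta", "rss_gb": 0.8},
    {"pid": 102, "name": "minirosetta", "rss_gb": 1.2},
    {"pid": 103, "name": "minirosetta", "rss_gb": 40.3},
]

for proc in find_runaway(snapshot):
    print(proc["pid"], proc["name"], proc["rss_gb"])
```

In this made-up snapshot only the 40.3 GB process is flagged; a real check would need to read the process list from the operating system.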
ID: 81124 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 81125 - Posted: 2 Feb 2017, 17:40:37 UTC

Oh, so you are killing the task from the task list, not aborting it from BOINC Manager? If you do use the BOINC Manager abort, then a message is posted to the event log with the exact WU name, which you could copy/paste into a message here.

If you killed it from the task list, I should think that the problem WU would then restart. Did this happen long enough ago that we can presume the task later completed normally after it restarted? Or did it get hung up again?
Rosetta Moderator: Mod.Sense
ID: 81125 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 81126 - Posted: 2 Feb 2017, 17:44:53 UTC

In the task list, aren't all of the tasks shown as "minirosetta..."? Where did you see "Acourbert..."?
Rosetta Moderator: Mod.Sense
ID: 81126 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 81127 - Posted: 2 Feb 2017, 18:52:25 UTC

I sent a message to the researcher who owns these jobs.
ID: 81127 · Rating: 0
Trotador

Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 2
Message 81128 - Posted: 2 Feb 2017, 19:15:01 UTC - in response to Message 81126.  

In the task list, aren't all of the tasks shown as "minirosetta..."? Where did you see "Acourbert..."?


These ones
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=810281216
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=810280862

While looking for them, I've seen that this one also has many errors (unrelated to memory usage):

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=811137203
ID: 81128 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 81131 - Posted: 2 Feb 2017, 22:06:32 UTC

Perfect. Thanks for tracking those down. That will help a great deal in resolving the problem.

DK, note that both of these failed on other hosts as well, with out-of-memory errors and segmentation violations. The third is one I noticed as well (I actually saw a few like this) and emailed DK about earlier in the day.
Rosetta Moderator: Mod.Sense
ID: 81131 · Rating: 0
acourbet
Volunteer moderator
Project developer
Project scientist

Joined: 2 Feb 17
Posts: 1
Credit: 103,615
RAC: 0
Message 81132 - Posted: 2 Feb 2017, 23:02:43 UTC - in response to Message 81131.  

Hi everyone,

I am really sorry for this mistake on my part. I have just identified the problem that was causing an exponential increase in memory use in some of my jobs.

Thanks for your help, and again, my apologies.

ID: 81132 · Rating: 0
Trotador

Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 2
Message 81134 - Posted: 3 Feb 2017, 10:09:29 UTC - in response to Message 81132.  

You're welcome. We are glad to help. Keep up the good work.
ID: 81134 · Rating: 0




©2024 University of Washington
https://www.bakerlab.org