Message boards : Number crunching : Acourbert_xxx units taking over 40 GB RAM
Author | Message |
---|---|
Trotador Send message Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 2 |
I have found several of these units behaving this way and I've had to abort them. Even if the host has enough memory, they end up disrupting the rest of the units being crunched and bringing the host down. Is this the expected behavior for these units (needing so much RAM)? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I see you have a couple of large server systems you are using for Rosetta. Are you saying that a single work unit is somehow taking 40GB? If so, that is not at all expected. Another way to ask the question is, how much more memory do you feel these Acourbert tasks are consuming than other work you have processed? Rosetta Moderator: Mod.Sense |
Trotador Send message Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 2 |
Hi, I mean a single unit. Most units are below 800 MB, some can reach 1-1.2 GB, but I've seen several of these ones (not all of them) go up to 40 GB and more. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
With 72 and 88 CPUs, I'm guessing you mean all active tasks at a given time combine to use 40 GB of memory. Yes, 500 MB per active work unit would not be uncommon. You can set preferences in the BOINC Manager for how much memory BOINC is allowed to use. The BOINC Manager will then basically moderate the number of active tasks to live within your stated preference. This can help ensure some memory is available for non-BOINC work. However, it does bring about periods of time when some CPU cores are idle. I often suggest adding another BOINC project that has lower memory requirements for a portion of resource share. The mix of high and low memory work units can keep all of the CPU cores busy, and reduce total memory consumed by BOINC workloads. It also gives you a second project to crunch on if R@h should hit a period where it runs out of work or has some server downtime. I'll suggest WCG as a project with some similar objectives to R@h, and generally their projects have much lower memory requirements. Even just a 10% resource share should be enough to significantly improve on the memory constraint. Rosetta Moderator: Mod.Sense |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I mean a single unit... No, that is not normal nor expected. Please abort these tasks and provide links or the full WU name of the problem tasks from your BOINC Manager event log. Rosetta Moderator: Mod.Sense |
Trotador Send message Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 2 |
I mean a single unit... There is no such message from the BOINC Manager; the host simply becomes starved and the units would stay forever without real progress. I detect it by looking at the system monitor: all minirosetta threads are halted and one of them is occupying 40.3 GB, and the system monitor indicates that it is an Acourbert unit. I kill it and the host starts recovering until it returns to normal operation. I can try to take some screenshots if I receive more units of this type. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Oh, so you are killing the task from the task list, not aborting it from BOINC Manager? If you do use BOINC Manager abort, then a message is posted to the event log with the exact WU name that you could copy/paste to a message here. If you killed it from the task list, I should think that the problem WU would then restart. Did this happen long enough ago that we can presume the task later completed normally after it restarted? Or did it get hung up again? Rosetta Moderator: Mod.Sense |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
In the task list, aren't all of the tasks shown as "minirosetta..."? Where did you see " Acourbert..."? Rosetta Moderator: Mod.Sense |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I sent a message to the researcher who owns these jobs. |
Trotador Send message Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 2 |
In the task list, aren't all of the tasks shown as "minirosetta..."? Where did you see " Acourbert..."? These ones https://boinc.bakerlab.org/rosetta/workunit.php?wuid=810281216 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=810280862 Looking for them, I've seen that these ones also have many errors (unrelated to memory usage) https://boinc.bakerlab.org/rosetta/workunit.php?wuid=811137203 |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Perfect. Thanks for tracking those down. That will help a great deal in resolving the problem. DK, note these both failed on other hosts as well, with out of memory error and segment violations. And the third is the one I noticed as well (actually saw a few like this) and EMailed DK about earlier in the day. Rosetta Moderator: Mod.Sense |
acourbet Volunteer moderator Project developer Project scientist Send message Joined: 2 Feb 17 Posts: 1 Credit: 103,615 RAC: 0 |
Perfect. Thanks for tracking those down. That will help a great deal in resolving the problem. Hi everyone, I am really sorry for this mistake on my part. I just identified the problem causing an exponential increase in the memory use of some of my jobs. Thanks for your help, and again all my apologies. |
Trotador Send message Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 2 |
Perfect. Thanks for tracking those down. That will help a great deal in resolving the problem. You're welcome, we are glad to help. Continue the good work. |
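
For anyone wanting to check the numbers discussed in this thread, here is a rough Python sketch of the arithmetic. It is not how the BOINC client actually schedules work; the function names, the 10x outlier factor, and the task names in the example are assumptions made up for illustration. The first function estimates how many tasks fit inside a memory allowance at a typical per-task footprint (about 500 MB, per the moderator's estimate), and the second flags any task whose memory use is far above the typical figure, as the 40 GB unit was.

```python
import math


def max_active_tasks(cpu_count, memory_budget_gb, per_task_gb):
    """Estimate how many tasks fit in the memory budget, capped at one per CPU.

    Hypothetical helper: a back-of-the-envelope estimate only.
    """
    if per_task_gb <= 0:
        raise ValueError("per_task_gb must be positive")
    return min(cpu_count, math.floor(memory_budget_gb / per_task_gb))


def flag_outliers(task_memory_gb, typical_gb, factor=10.0):
    """Return the names of tasks using more than factor * typical_gb of memory.

    The factor of 10 is an arbitrary assumption, not a project threshold.
    """
    limit = factor * typical_gb
    return sorted(name for name, used in task_memory_gb.items() if used > limit)


if __name__ == "__main__":
    # 72 cores at roughly 0.5 GB each need about 36 GB to keep every core busy.
    print(max_active_tasks(cpu_count=72, memory_budget_gb=36, per_task_gb=0.5))

    # With only 32 GB allowed, an 88-core host would leave some cores idle.
    print(max_active_tasks(cpu_count=88, memory_budget_gb=32, per_task_gb=0.5))

    # Made-up task names; memory figures taken from the posts above.
    observed = {"task_a": 0.8, "task_b": 1.2, "task_c": 40.3}
    print(flag_outliers(observed, typical_gb=0.8))
```

With these assumptions, a host where every task stays near 500 MB lands close to the 40 GB total the moderator first guessed at, while a single task at 40.3 GB stands out by more than an order of magnitude.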
©2024 University of Washington
https://www.bakerlab.org