Message boards : Number crunching : Occasional VirtualBox failures
Author | Message |
---|---|
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,658,896 RAC: 10,801 |
In addition to the thread for computers that won't run any VirtualBox tasks (which seems to be hardware-related somehow), there are regular failures on my machines that are usually happy to run VirtualBox tasks. I haven't looked at the VirtualBox preview for many of them yet, but I have now seen two in a row showing this message: `Spectre V2 : Spectre mitigation: LFENCE not serializing, switching to generic retpoline` (screenshot: https://ibb.co/FDkPMtB). If the logs from these are useful then I'll collect and post them. I presume VBox.log, VBoxHardening.log and one of the BOINC logs would be the appropriate ones to post? |
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,658,896 RAC: 10,801 |
On my machines that usually run VirtualBox tasks successfully, most of the tasks that stop running still fail with this Spectre V2 error on screen. Unfortunately they will often sit like this for days if they're not spotted. Fortunately I have BOINCTasks running, so I usually spot them sooner than that on my local machines, but not on the remote ones. |
computezrmle Joined: 9 Dec 11 Posts: 63 Credit: 9,680,103 RAC: 0 |
The Spectre mitigation message appears at every VM start. I don't think it causes the error. Instead, it's the last info printed on the console before the VM hangs. |
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,658,896 RAC: 10,801 |
"The Spectre mitigation message appears at every VM start." OK, yeah, that makes sense. So it's at least narrowed down to any point after that! Would the VBox logs be helpful in diagnosing it, assuming anyone on the project is interested? |
.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
If the `Occasional VirtualBox failures` you are talking about are the ones that fail after a few seconds, lock up, and use no more CPU time, can you alter the thread title to show this? From studying my own error rate and looking at wingmen that have completed the same tasks as valid: are the fail-at-startup work units more common on systems with a large number of CPU cores/threads, where the system is so busy that the application borks itself by not waiting for another instance of Rosetta to finish reading from a file? The output files from these work units often, or mostly, have a line in them like: `'F:\ProgramData\BOINC\slots\12\vm_image.vdi' is locked for reading by another task},` The `slots\number` path can appear several times in one output file with a different `number` each time, as if several instances of Rosetta are fighting each other to read the file {race condition} and so crash the work unit. From what I have seen, is any system with more than approximately 12 CPU cores/threads more likely to have the startup faults than 4 or 8 core systems? A full top-down view that only the admin can get may rubbish this idea in seconds; it's the best I have got on it, so here it is for you to consider [and tell me I am talking carp]. Also, things like {The object is not ready} make me think the app is tripping over itself, and {The object functionality is limited} could be because some of the required components of the `slots` folder have not loaded in time. |
computezrmle Joined: 9 Dec 11 Posts: 63 Credit: 9,680,103 RAC: 0 |
The slots/x directories are the working directories for each task. They should be cleared by BOINC when a task ends, and (in the case of vbox tasks) the vdi image should be deregistered. Your messages indicate that either a vdi file from a previous task is still in the slot, e.g. after a crash or a timeout, or the corresponding entry has not been removed from the VirtualBox medium manager. Both have to be cleaned up manually. - Shut down BOINC - Wait until all corresponding processes are closed - Delete garbage from the slots; be careful not to remove anything from currently "in progress" tasks - Open the VirtualBox Manager and run the medium manager from the menu - Remove orphaned disk entries; also be careful to ... (same as above) - Restart BOINC My explanation would be that systems under heavy load (lots of concurrently running tasks with heavy I/O) sooner or later run into timeout problems and leave garbage in the slots. |
.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
I did find two dud / zombies in there so will have to keep an eye on that, thanks. |
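
As a rough aid to the manual cleanup computezrmle describes, the Python sketch below lists which slot directories currently hold a VDI image and pulls the locked image paths out of a task's output text. It is only an illustration under stated assumptions: the file name `vm_image.vdi` and the "is locked for reading by another task" wording are copied from the messages quoted in this thread, while the helper names `slots_with_vdi` and `locked_images` are hypothetical and not part of BOINC or VirtualBox. It deletes nothing.

```python
import re
from pathlib import Path

# Assumption: image file name as it appears in the error quoted in the thread.
VDI_NAME = "vm_image.vdi"

# Assumption: wording copied from the quoted message; real logs may differ.
LOCK_RE = re.compile(r"'([^']+)' is locked for reading by another task")


def slots_with_vdi(slots_dir):
    """Return the names of slot subdirectories that contain a VDI image.

    Names are sorted as strings, so "12" sorts before "3".
    """
    root = Path(slots_dir)
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / VDI_NAME).is_file()
    )


def locked_images(log_text):
    """Return distinct image paths reported as locked, in first-seen order."""
    seen = []
    for match in LOCK_RE.finditer(log_text):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen
```

Running `slots_with_vdi` on the BOINC slots directory while BOINC is shut down gives a list of candidates only. Each one still has to be compared by hand against the tasks that are "in progress" and against the entries in the VirtualBox medium manager before anything is removed.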
©2024 University of Washington
https://www.bakerlab.org