errors both computers

Message boards : Number crunching : errors both computers

Robby1959

Joined: 10 May 07
Posts: 38
Credit: 9,298,741
RAC: 0
Message 77796 - Posted: 2 Jan 2015, 6:20:30 UTC

Both of my computers are having compute errors. Neither is overclocked. Have there been WU changes? Any ideas?
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 77797 - Posted: 2 Jan 2015, 12:36:51 UTC - in response to Message 77796.  

It helps if you can give a link to some of the affected tasks, or at least give the task ID. However, looking through the logs your two computers seem to have developed unrelated faults. Have you done any hardware or software upgrades recently?


Computer 1615483 (Windows 7 SP1):

Reports of memory errors (both RAM and disk errors) in tasks 707727918, 707727911, 707727894, 707722337 and 706971831 among others. The older of those tasks have been completed successfully by other participants, which suggests it is an issue on your side, not the WU.


Computer 1614373 (Windows 7):

"Too many exit(0)s" errors in all tasks. That implies something is preventing your tasks from completing - new software installed or anti-virus upgrade? The older tasks were completed successfully by other users.

Affected tasks include 708484332, 708114044, 707980497 and 707859337, among others.


In the second machine there was a sudden change of behaviour on all tasks received since 26 December. Did you make any changes on Christmas day?

The first machine's problems date back to the start of December.
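
The reasoning used above (if other participants completed the same work unit successfully, the fault is probably on your side) can be sketched in a few lines of Python. This is only an illustration: the function name `likely_fault`, the host labels and the outcome values are all made up for the example and are not taken from the project's logs or from any BOINC API.

```python
def likely_fault(outcomes, my_host):
    """Guess where a fault lies for ONE work unit.

    outcomes: dict mapping a host label to True (task succeeded)
              or False (task errored). Hypothetical input format.
    my_host:  the label of the host being investigated.
    """
    # Nothing to diagnose if this host did not fail the work unit.
    if outcomes.get(my_host, True):
        return "no error on this host"

    # Results returned by everyone else for the same work unit.
    others = [ok for host, ok in outcomes.items() if host != my_host]

    if any(others):
        # Someone else finished it, so the work unit itself looks fine.
        return "host"
    if others:
        # Everyone who tried it failed, so suspect the work unit.
        return "work unit"
    # No other results yet, so there is nothing to compare against.
    return "unknown"


# Illustrative data only: host_a is the machine under suspicion.
print(likely_fault({"host_a": False, "host_b": True}, "host_a"))
print(likely_fault({"host_a": False, "host_b": False}, "host_a"))
```

A single work unit is weak evidence either way; the check is more convincing when it agrees across several tasks, as in the lists above.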
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77798 - Posted: 2 Jan 2015, 14:29:43 UTC
Last modified: 2 Jan 2015, 14:31:49 UTC

There is a function (which DK always carefully reminds me is NOT actually part of the "watchdog") that runs as a task begins. This is where prior checkpointed work is loaded up and everything is initialized. It keeps a simple counter of how many times it has seen the task start up from this exact point before. If it has seen the same point start 5 times, then it ends the task.

The purpose of this is to tolerate a few restarts, for example when a machine does a series of reboots to install updates, but to put a limit on such patience and be able to "cut bait" and keep things moving along.

So here are some ways a task might find itself restarting over and over:
1) Your BOINC Manager computing preferences do not "keep tasks in memory".
2) BOINC is being ended and restarted several times over a period of time short enough that the task is unable to reach the next checkpoint (usually less than an hour).
3) The machine is rebooting several times over a "period of time" as above.
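
As a rough illustration of the counter described above, here is a minimal Python sketch. It is not Rosetta's actual code: the function name `should_abort`, the `checkpoint_id` labels and the in-memory dictionary are assumptions made for the example (the real application would have to keep its count somewhere that survives a restart). Only the limit of 5 comes from the description above.

```python
MAX_RESTARTS = 5  # the limit described above


def should_abort(restart_counts, checkpoint_id, limit=MAX_RESTARTS):
    """Record one start-up from checkpoint_id.

    restart_counts: dict mapping a checkpoint label to how many times
                    the task has started from it (hypothetical storage).
    Returns True once the same point has been started `limit` times.
    """
    restart_counts[checkpoint_id] = restart_counts.get(checkpoint_id, 0) + 1
    return restart_counts[checkpoint_id] >= limit


# A task that keeps restarting from the same point without progress.
counts = {}
results = [should_abort(counts, "checkpoint-3") for _ in range(5)]
print(results)  # [False, False, False, False, True]

# Reaching a new checkpoint starts a fresh count for that point.
print(should_abort(counts, "checkpoint-4"))  # False
```

The point of keying the count on the checkpoint is that a task which makes progress between restarts is never penalised; only one that is stuck at the same point runs out of patience.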
Rosetta Moderator: Mod.Sense
Robby1959

Joined: 10 May 07
Posts: 38
Credit: 9,298,741
RAC: 0
Message 77803 - Posted: 3 Jan 2015, 1:00:40 UTC
Last modified: 3 Jan 2015, 1:02:12 UTC

I have not made any changes. The dual core named unit is rarely touched, but my available time is less than half. I did an update and later a scan for any malware. The other is only slowed by Facebook. I lowered the utilization to 97%, dropping one Rosetta core and speeding up the GPU usage. Other than the graphics cards and hard drives, the CPU, memory and motherboard are the same. Also, I defragged the drives, but the dual named system is set to auto-defrag once a week.
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 77804 - Posted: 3 Jan 2015, 2:07:18 UTC - in response to Message 77803.  

I have not made any changes. The dual core named unit is rarely touched, but my available time is less than half. I did an update and later a scan for any malware. The other is only slowed by Facebook. I lowered the utilization to 97%, dropping one Rosetta core and speeding up the GPU usage. Other than the graphics cards and hard drives, the CPU, memory and motherboard are the same. Also, I defragged the drives, but the dual named system is set to auto-defrag once a week.


With public access I can't see any differences between your computers other than the OS installation, so I will continue to use that to differentiate between the machines.


Have you set 97% CPU utilisation on the Windows 7 machine? "Too many exit(0)s" errors can sometimes be caused when CPU utilisation is below 100%. I'd suggest putting the setting back to 100% for a short while to see if that clears up the problem; as a second test, temporarily deactivate use of the GPU and see if that resolves things.



For the SP1 computer, it is possible that the strain of 4 tasks on your CPU plus a separate task on your GPU is causing BOINC to write a lot of data to the memory swap file on your computer. Do you use an SSD in this computer? Apparently BOINC doesn't always react well to using the swap file on an SSD under Windows 7 and 8. I'd suggest first deactivating the GPU, and then reducing the number of CPU cores in use, to see if either change clears up the problem. If either test works you may want to try reducing the amount of Rosetta work you do, perhaps by giving more work to a less memory-hungry project.

If the problem persists even with less strain on the CPU and GPU I'd suggest doing diagnostic tests on your memory to see if there is a fault with your hardware.
Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 77806 - Posted: 3 Jan 2015, 19:08:09 UTC

It is possible that although you haven't made any changes to the computer, an automatic software update (Windows, antivirus, etc.) could have been installed on its own. To rule this out, I'd suggest doing a system restore (not a full reinstall of the OS, just a Windows System Restore, which can roll back recent changes) to a restore point a few days before these issues started happening. I would suggest doing it in this order:
- Detach BOINC from any BOINC projects.
- Restore computer back a couple of weeks
- After restore is complete, re-attach to any BOINC projects.
- Of course, also verify your date, time, and timezone settings - sometimes I've seen PCs be stuck in the wrong month/year and it can cause all sorts of funny stuff to happen.

If the issue is gone after restoring your system back a few weeks (it just rolls back any software changes/updates, but won't mess up any personal files), then that shows it was a software issue due to some sort of software change or automatic update.




©2024 University of Washington
https://www.bakerlab.org