More checkpointing problems

Sid Celery Joined: 11 Feb 08 Posts: 2417 Credit: 46,339,640 RAC: 28,820
Message 89508 - Posted: 9 Sep 2018, 22:38:14 UTC - in response to Message 89506.

Application: Rosetta 4.07
Name: PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044
State: Running
CPU time: 08:38:41
CPU time since checkpoint: 01:15:19
Elapsed time: 14:30:57
Estimated time remaining: 00:16:47
Fraction done: 98.108%

PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044_0

Seeing as this task hasn't finished yet it may be worthwhile tracking how it's getting on with just an excerpt of its attributes

Application: Rosetta 4.07
Name: PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044
State: Running
CPU time: 08:54:25
CPU time since checkpoint: 01:31:03
Elapsed time: 15:48:38
Estimated time remaining: 00:17:45
Fraction done: 98.163%

So, 78 mins have passed, just 16 mins of CPU time, no further checkpoint, estimated time remaining actually increased by 1 minute.
No other PF tasks (or RB) are doing this. 2 later PF tasks completed normally around the 8hr mark as expected.
No idea what's going on.

shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0
Message 89510 - Posted: 10 Sep 2018, 2:42:29 UTC - in response to Message 89507.

Followup data: The task with 8 hours uncheckpointed actually did checkpoint sometime before 10 hours and it finally finished around 12 hours.

Right now I'm actually on a Linux box, one of my machines that rarely runs for a long period. It has a small supply of non PF... units and none of them appear to be sick puppies. I'm trying to avoid downloading any of the PF... units here, but worse than that, the project has apparently switched to the short-term rb... units. I see that one of them did the fancy finish with the Computation Error. If it crashed quickly (and I suspect it did), then there is little waste of my machine's computation time, but the Rosetta project is just wasting bandwidth for any data that was sent.

It should NOT be a battle to participate "effectively" in the project. If the project is having trouble retaining volunteers, then perhaps there is a connection?

#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

Sid Celery Joined: 11 Feb 08 Posts: 2417 Credit: 46,339,640 RAC: 28,820
Message 89511 - Posted: 10 Sep 2018, 2:48:57 UTC - in response to Message 89508.

Application: Rosetta 4.07
Name: PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044
State: Running
CPU time: 08:38:41
CPU time since checkpoint: 01:15:19
Elapsed time: 14:30:57
Estimated time remaining: 00:16:47
Fraction done: 98.108%

PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044_0

Seeing as this task hasn't finished yet it may be worthwhile tracking how it's getting on with just an excerpt of its attributes

Application: Rosetta 4.07
Name: PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044
State: Running
CPU time: 08:54:25
CPU time since checkpoint: 01:31:03
Elapsed time: 15:48:38
Estimated time remaining: 00:17:45
Fraction done: 98.163%

So, 78 mins have passed, just 16 mins of CPU time, no further checkpoint, estimated time remaining actually increased by 1 minute.
No other PF tasks (or RB) are doing this. 2 later PF tasks completed normally around the 8hr mark as expected.
No idea what's going on.

All a bit weird - still running...

CPU time: 09:44:54
CPU time since checkpoint: 02:21:32
Elapsed time: 20:02:51
Estimated time remaining: 00:20:33
Fraction done: 98.319%

Another 250mins have passed, only 50mins of CPU time further on, still no checkpoint, remaining time 3 minutes more <shrug>
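
The interval figures quoted above (78 mins elapsed for 16 mins of CPU, then roughly 250 mins for 50 mins of CPU) can be checked with a short script. This is only a minimal sketch: the helper names `hms` and `cpu_share` are made up for illustration, and the timestamps are copied from the three snapshots in the posts above. It suggests the task was getting only about a fifth of a core over both intervals.

```python
from datetime import timedelta


def hms(text: str) -> timedelta:
    """Parse an HH:MM:SS string as shown in the task properties."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# (CPU time, elapsed time) snapshots copied from the posts above.
snapshots = [
    ("08:38:41", "14:30:57"),
    ("08:54:25", "15:48:38"),
    ("09:44:54", "20:02:51"),
]


def cpu_share(earlier, later):
    """Return (elapsed minutes, CPU minutes, CPU share) between two snapshots."""
    cpu_delta = hms(later[0]) - hms(earlier[0])
    wall_delta = hms(later[1]) - hms(earlier[1])
    share = cpu_delta / wall_delta  # timedelta / timedelta gives a float
    return wall_delta.total_seconds() / 60, cpu_delta.total_seconds() / 60, share


intervals = [cpu_share(a, b) for a, b in zip(snapshots, snapshots[1:])]
for wall_min, cpu_min, share in intervals:
    print(f"{wall_min:.0f} min elapsed, {cpu_min:.0f} min CPU, {share:.0%} of one core")
```

A healthy CPU-bound task would normally sit much closer to 100% of one core, so a steady share near 20% points at the task spending most of its time on something other than computing (for example, waiting on disk writes), though the numbers alone cannot say what.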

Sid Celery Joined: 11 Feb 08 Posts: 2417 Credit: 46,339,640 RAC: 28,820
Message 89516 - Posted: 10 Sep 2018, 12:48:09 UTC - in response to Message 89511.

Ok, so it died not long after with a compute error. Final figures and std err report at the end.

Application: Rosetta 4.07
Name: PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044
State: Running
CPU time: 08:38:41
CPU time since checkpoint: 01:15:19
Elapsed time: 14:30:57
Estimated time remaining: 00:16:47
Fraction done: 98.108%

PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044_0

Seeing as this task hasn't finished yet it may be worthwhile tracking how it's getting on with just an excerpt of its attributes

Application: Rosetta 4.07
Name: PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044
State: Running
CPU time: 08:54:25
CPU time since checkpoint: 01:31:03
Elapsed time: 15:48:38
Estimated time remaining: 00:17:45
Fraction done: 98.163%

So, 78 mins have passed, just 16 mins of CPU time, no further checkpoint, estimated time remaining actually increased by 1 minute.
No other PF tasks (or RB) are doing this. 2 later PF tasks completed normally around the 8hr mark as expected.
No idea what's going on.

All a bit weird - still running...

CPU time: 09:44:54
CPU time since checkpoint: 02:21:32
Elapsed time: 20:02:51
Estimated time remaining: 00:20:33
Fraction done: 98.319%

Another 250mins have passed, only 50mins of CPU time further on, still no checkpoint, remaining time 3 minutes more <shrug>

CPU time: 09:50:24
Elapsed time: 20:29:39

Stderr report (edited for brevity)

<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message> Disk usage limit exceeded</message>
<stderr_txt>
range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
[Deleted 460 repeated lines]
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
Unhandled Exception Detected...
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x000007FEFCB531F2
Engaging BOINC Windows Runtime Debugger...
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range [Deleted 30 more repeated lines] sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range ******************** BOINC Windows Runtime Debugger Version 7.9.0 Dump Timestamp : 09/10/18 04:11:44 Install Directory : C:Program FilesBOINC Data Directory : C:ProgramDataBOINC Project Symstore : https://boinc.bakerlab.org/rosetta/symstore sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range LoadLibraryA( C:ProgramDataBOINCdbghelp.dll ): GetLastError = 126 sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range Loaded Library : dbghelp.dll LoadLibraryA( C:ProgramDataBOINCsymsrv.dll ): GetLastError = 126 LoadLibraryA( symsrv.dll ): GetLastError = 126 LoadLibraryA( C:ProgramDataBOINCsrcsrv.dll ): GetLastError = 126 LoadLibraryA( srcsrv.dll ): GetLastError = 126 sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range LoadLibraryA( C:ProgramDataBOINCversion.dll ): GetLastError = 126 Loaded Library : version.dll sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal rangeDebugger Engine : 4.0.5.0 Symbol Search Path: C:ProgramDataBOINCslots6;C:ProgramDataBOINCprojectsboinc.bakerlab.org_rosetta;srvC:ProgramDataBOINCprojectsboinc.bakerlab.org_rosettasymbolshttp://msdl.microsoft.com/download/symbols;srvC:ProgramDataBOINCprojectsboinc.bakerlab.org_rosettasymbolshttps://boinc.bakerlab.org/rosetta/symstore [Deleted section] * Dump of the Process Statistics: * - I/O 
Operations Counters - Read: 50841, Write: 480090826, Other 34490324 - I/O Transfers Counters - Read: 362491692, Write: 1490402847, Other -408524486 - Paged Pool Usage - QuotaPagedPoolUsage: 283472, QuotaPeakPagedPoolUsage: 283480 QuotaNonPagedPoolUsage: 15000, QuotaPeakNonPagedPoolUsage: 15720 - Virtual Memory Usage - VirtualSize: 437776384, PeakVirtualSize: 1152479232 - Pagefile Usage - PagefileUsage: 437776384, PeakPagefileUsage: 585773056 - Working Set Size - WorkingSetSize: 439787520, PeakWorkingSetSize: 604778496, PageFaultCount: 1885542 * Dump of thread ID 1196 (state: Initialized): * - Information - Status: Base Priority: Normal, Priority: Normal, , Kernel Time: 0.000000, User Time: 0.000000, Wait Time: 0.000000 - Unhandled Exception Record - Reason: Breakpoint Encountered (0x80000003) at address 0x000007FEFCB531F2 - Registers - rax=0000000000000000 rbx=0000000000000001 rcx=00000000432b82e0 rdx=0000000019b5f3e0 rsi=0000000000000000 rdi=0000000000000000 r8=0000000019b5f3e0 r9=00000000432b82d0 r10=0000000000000001 r11=0000000000000fff r12=0000000000000000 r13=0000000000000000 r14=0000000000000000 r15=0000000000000000 rip=00000000fcb531f2 rsp=0000000019b5f3b8 rbp=0000000000000000 cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246 - Callstack - ChildEBP RetAddr Args to Child 19b5f3b0 417358ee 00000001 19b5f3e0 19b5f3e0 432b82d0 KERNELBASE!DebugBreak+0x0 sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range 19b5f7f0 417368e0 00000000 00000000 00000000 00000000 rosetta_4.07_windows_x86_64!cppdb::backend::statements_cache::statements_cache+0x0 sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range 19b5fa50 76df59cd 00000000 00000000 00000000 00000000 rosetta_4.07_windows_x86_64!cppdb::backend::statements_cache::statements_cache+0x0 sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range 19b5fa80 76f5383d sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos 
value legal range 00000000 00000000 00000000 00000000 kernel32!BaseThreadInitThunk+0x0 sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range 19b5fad0 sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range 00000000 00000000 00000000 00000000 00000000 ntdll!RtlUserThreadStart+0x0 * Dump of thread ID 30689287 (state: Initialized): * - Information - Status: Base Priority: Normal, Priority: Unknown, , Kernel Time: 17179869184.000000, User Time: 21475590144.000000, Wait Time: 0.000000 - Registers - rax=0000000000000000 rbx=0000000000000000 rcx=0000000000000000 rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000 r8=0000000000000000 r9=0000000000000000 r10=0000000000000000 r11=0000000000000000 r12=0000000000000000 r13=0000000000000000 r14=0000000000000000 r15=0000000000000000 rip=0000000000000000 rsp=0000000000000000 rbp=0000000000000000 cs=0000 ss=0000 ds=0000 es=0000 fs=0000 gs=0000 efl=00000000 - Callstack - ChildEBP RetAddr Args to Child sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range (-nosymbols- PC == 0) 00000000 00000000 00000000 00000000 00000000 00000000 !+0x0 sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range * Debug Message Dump * Foreground Window Data *** Window Name : Window Class : Window Process ID: 0 Window Thread ID : 0 Exiting... </stderr_txt> ]]> ID: 89516 · Rating: 0 · rate: / Reply Quote
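
Trimming of the kind marked "[Deleted 460 repeated lines]" in the report above can be automated before posting a long stderr. The following is a minimal sketch, not a BOINC tool: `collapse_repeats` is a hypothetical helper, and the sample input is synthetic (466 copies of the error line is an illustrative number, not a count taken from the real log).

```python
from itertools import groupby


def collapse_repeats(lines, keep=3):
    """Collapse runs of identical consecutive lines, keeping `keep` copies of each run."""
    out = []
    for text, run in groupby(lines):
        count = sum(1 for _ in run)
        out.extend([text] * min(count, keep))
        if count > keep:
            out.append(f"[Deleted {count - keep} repeated lines]")
    return out


ERROR = "sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range"

# Synthetic sample: a long run of the same error followed by the crash notice.
sample = [ERROR] * 466 + ["Unhandled Exception Detected..."]

result = collapse_repeats(sample)
for line in result:
    print(line)
```

Only consecutive duplicates are collapsed, so distinct lines such as the `<message>` element or the exception record stay visible where they occurred.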

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 2
Message 89522 - Posted: 10 Sep 2018, 15:19:14 UTC - in response to Message 89516.

Sid, did you see the "Disk usage limit exceeded" error message in the STDERR?
If BOINC exceeded your allocated disk space, disk writes would fail.

<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message> Disk usage limit exceeded</message> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<stderr_txt>
range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range

Admin Project administrator Joined: 1 Jul 05 Posts: 5145 Credit: 0 RAC: 0
Message 89523 - Posted: 10 Sep 2018, 17:50:38 UTC

I talked to Ivan, the owner of these jobs. He said there may be a few very large targets in his benchmark that take a while to generate models. He said he doesn't have plans for any more such targets. Sorry for any inconvenience.

Sid Celery Joined: 11 Feb 08 Posts: 2417 Credit: 46,339,640 RAC: 28,820
Message 89524 - Posted: 10 Sep 2018, 21:04:07 UTC - in response to Message 89522.

Sid, did you see the "Disk usage limit exceeded" error message in the STDERR?
If BOINC exceeded your disk allocated, disk writes would fail.

<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message> Disk usage limit exceeded</message> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<stderr_txt>
range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -nan(ind) is outside of [-1,+1] sin and cos value legal range

No, I didn't notice it. Thanks for pointing it out. I have to say I was blinded by the extreme length of the report and glossed over that part. To be fair, this STDERR report is only revealed after the task has reported, so I didn't have any evidence of it earlier.

That said, I allocate 10GB of disk space to Rosetta and the ~40 tasks I hold in my buffer consume just short of 5GB, leaving just over 5GB spare. There was no sign of this limit being hit while the job was running. I will add a couple of GB more now though, as I have plenty to spare.

While the disk line is obviously caused by 'something', I can't help looking at the 500 separate ERROR lines saying values are out of range. In my ignorance it does seem kind of relevant as to why this task has gone rogue the way it has. The job did run over 20 hours before crashing. Am I right to be more concerned by those 20hrs than the eventual crash it resulted in? I'll leave that to the experts, none of whom are me.

I should emphasise, while I have plenty of issues with PF* tasks - reported over the last 8 months in the pinned thread - this particular one is a one-off.
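
On the roughly 500 ERROR lines: `-nan(ind)` is how the Microsoft C runtime prints an "indefinite" NaN (not a number). A NaN compares false against every bound, so any [-1,+1] range check rejects it, and once a NaN enters a calculation it carries through later arithmetic, which would be consistent with the same message repeating for every subsequent evaluation. The sketch below only illustrates that floating-point behaviour; `in_sin_cos_range` is a hypothetical stand-in, not Rosetta's actual check, and nothing here identifies where the NaN first came from.

```python
import math


def in_sin_cos_range(x: float) -> bool:
    """Hypothetical stand-in for a [-1, +1] range check on a sine or cosine value."""
    return -1.0 <= x <= 1.0


nan = float("nan")

# An ordinary value passes the check.
print(in_sin_cos_range(0.5))

# NaN fails it, because every ordered comparison involving NaN is False.
print(in_sin_cos_range(nan))

# NaN survives further arithmetic, so later checks keep failing too.
downstream = nan * 0.0 + 1.0
print(math.isnan(downstream))
```

This is why a single bad value early in a model can produce hundreds of identical range errors rather than one.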

Sid Celery Joined: 11 Feb 08 Posts: 2417 Credit: 46,339,640 RAC: 28,820
Message 89525 - Posted: 10 Sep 2018, 21:15:20 UTC - in response to Message 89523.

I talked to Ivan, the owner of these jobs. He said there may be a few very large targets in his benchmark that take a while to generate models. He said he doesn't have plans for any more such targets. Sorry for any inconvenience.

One thing I haven't mentioned is that a lot of these PF tasks get to 567 hours still on the 1st model with like 580,000 steps. This particular one was on the 6th model, not just the 1st, if that makes a difference. This applies to pretty much all PF tasks I've looked at. Maybe this is why PF tasks generally lend themselves to problems, though I'm obviously guessing here.

I'd appreciate it if someone took a look at the errors reported in the Rosetta 4.0x thread as well. Those show a much more common issue in my experience, resulting in Computing Errors.

shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0
Message 89797 - Posted: 29 Oct 2018, 6:49:05 UTC - in response to Message 89525.

I wonder if that's in reference to the PF problems? Still running about 25% sick puppies when I don't get them nuked before they start. Same policy towards rb units.

Current puppy has over an hour with no checkpoint, and I want to reboot the machine, so I've already queued some "safe" tasks and will nuke that one before shutting down (unless it managed to checkpoint itself while I'm writing this message).

During the recent task shortage I actually switched to a different project. I noticed that most of their tasks are on the order of 2 to 4 hours now. If the goal of longer work units is to save bandwidth, it certainly doesn't seem to be working in my case with all the nuking of likely sick puppies and other problematic work units that's going on.

#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

Sid Celery Joined: 11 Feb 08 Posts: 2417 Credit: 46,339,640 RAC: 28,820
Message 89799 - Posted: 29 Oct 2018, 12:26:05 UTC - in response to Message 89797.

I wonder if that's in reference to the PF problems? Still running about 25% sick puppies when I don't get them nuked before they start. Same policy towards rb units.

Current puppy has over an hour with no checkpoint, and I want to reboot the machine, so I've already queued some "safe" tasks and will nuke that one before shutting down (unless it managed to checkpoint itself while I'm writing this message).

During the recent task shortage I actually switched to a different project. I noticed that most of their tasks are on the order of 2 to 4 hours now. If the goal of longer work units is to save bandwidth, it certainly doesn't seem to be working in my case with all the nuking of likely sick puppies and other problematic work units that's going on.

It was about the past PF problems. I've checked all my machines and I have no errors at all related to the current batch of PF jobs, even though I definitely had the same issues as you last time. I do have some errors, but I think they're more related to my overclock - so, all about me, not the tasks.

All my currently running PF tasks on this machine have checkpointed within the last 11 mins (1 task), 4 mins (1 task) and under 2 mins (6 tasks).

shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0
Message 90036 - Posted: 19 Dec 2018, 23:11:09 UTC - in response to Message 89799.

Thanks for the data and sorry I haven't been checking in more frequently. Well, not really sorry, since that mostly means there are no problems that seem worth worrying about. Or back to the sorry side again, maybe not visiting just reflects a loss of hope of making things better...

Latest peculiarities:

(1) Tasks that terminate themselves en masse when the computer wakes up. Presumably there is another (possibly new) completion criterion related to wall clock time, and when the computer wakes up many of the tasks discover that they are now regarded as completed. Not bad as a sanity check of some sort.

(2) Sick puppies from new projects, but nothing as prevalent and annoying as the previous ones.

Still seeing about 20% of the rb tasks behaving badly, but mostly ignoring that problem except for the 3-day tasks (which still get nuked whenever I spot them in time) and for the one machine with the limited run time.

Today's visit was actually provoked by another out-of-tasks condition, so off to look for relevant posts...

#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

Sid Celery Joined: 11 Feb 08 Posts: 2417 Credit: 46,339,640 RAC: 28,820
Message 90040 - Posted: 20 Dec 2018, 10:29:12 UTC - in response to Message 90036.

Today's visit was actually provoked by another out-of-tasks condition, so off to look for relevant posts...

Yup, try the top pinned thread. No tasks of any type currently available, 5 days before Christmas....

shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0
Message 90136 - Posted: 3 Jan 2019, 20:51:37 UTC - in response to Message 90040.

Not sure where you were referencing, but if you mean the top thread in the "Number crunching" forum, then it's rarely useful. Currently it's 10 days old.

This one is mostly for checkpointing problems, which seem less severe than before. They have spread to some of the new subprojects, however.

#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

Sid Celery Joined: 11 Feb 08 Posts: 2417 Credit: 46,339,640 RAC: 28,820
Message 90156 - Posted: 7 Jan 2019, 4:00:28 UTC - in response to Message 90136.

Not sure where you were referencing, but if you mean the top thread in the "Number crunching" forum, then it's rarely useful. Currently it's 10 days old.

and the message you replied to was 13 days old...

shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0
Message 90888 - Posted: 4 Jul 2019, 4:14:31 UTC

More sick puppies to report. Names start with "Cx_" where I have noticed x values from 3 to 5. Especially annoying in that the tasks claim to be checkpointing properly, but are lying about it. If you look at the Properties, it will say there was a recent checkpoint, perhaps a minute ago, but if you then reboot the computer, the task typically loses 20% of its progress, representing about two hours of work. The elapsed time is conserved. In today's example, the task had over 7 hours in the Elapsed column and Remaining was under an hour, but after rebooting the computer, Elapsed was still over 7, but Progress had fallen to 60% and Remaining was over 3 hours.

Usually I spot these things on a computer that only runs for a few hours at a time. However this time I actually noticed it during the major OS upgrades last month. Just confirmed it on the short-running computer.

On your [the project management's] side it should probably show as a series of peaks in completion times. At least on the evidence I've noticed, the 2-hour loss seems to be consistent, so there would be one peak around 8 hours for uninterrupted tasks, a second around 10 hours for once-interrupted tasks, and smaller and smaller peaks each two hours after that for more and more interruptions.

The rb sick puppies remain around 20% of all rb tasks. In their defense, at least they tell the truth about never completing a checkpoint. They seemed to be getting worse lately, often running from zero without a single checkpoint, so I'm back to scrubbing them from the short-running machine before they get a chance to start.

#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
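
The peaks predicted in the post above follow a simple pattern: roughly 8 hours for an uninterrupted task, plus about 2 hours for every interruption. A minimal sketch of that arithmetic, assuming the loss per restart really is a constant 2 hours as the post suggests; `completion_hours` and its default values are illustrative names and figures taken from the post, not measured project data.

```python
def completion_hours(interruptions: int,
                     base_hours: float = 8.0,
                     loss_hours: float = 2.0) -> float:
    """Expected total run time if every restart throws away `loss_hours` of work."""
    return base_hours + interruptions * loss_hours


# Expected positions of the first few peaks in a histogram of completion times.
peaks = [completion_hours(k) for k in range(4)]
print(peaks)  # [8.0, 10.0, 12.0, 14.0]
```

If the real completion times cluster at these values, that would support the idea that the reported checkpoints are not actually being used on restart; a smeared distribution would point to a variable loss instead.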