Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,137,231
RAC: 4,613
Message 75444 - Posted: 24 Apr 2013, 11:05:28 UTC - in response to Message 75437.  

I wonder if the Rosetta staff even knows there's a problem.

Also, I have these HUGE WUs that, after about 6 hours or so, are still on Model 1, Step 0. The graphics show just one big "sinusoidal" line in the "Searching..." graph, and the rest of the graphs are blank.

EDIT: After 6-7 hours of running, they DO NOT checkpoint. I'm aborting all tasks and running SETI in the meantime.


YES they know! ModSense said he was emailing the good doctor to ensure he knew; nothing being done about it by now IS a problem, though!!

I am STILL getting the cryo units sent to me, which OF COURSE are failing. I am turning Rosetta off on all my remaining machines, at least for now! NNT for me!!! If the Project doesn't give a damn, WHY should I?!!!
ID: 75444 · Rating: 0
JRP2706
Joined: 19 Mar 13
Posts: 2
Credit: 161,938
RAC: 420
Message 75445 - Posted: 24 Apr 2013, 11:09:02 UTC

Just experienced the CASP9....... uploading problem.

With all the issues being reported I am also suspending work here until some progress is made to resolve them or at least some formal acknowledgement is made by the Moderators/Support staff.

So Long, and Thanks for all the fish,

jrp2706
ID: 75445 · Rating: 0
morgan
Joined: 30 Jun 06
Posts: 3
Credit: 387,964
RAC: 0
Message 75446 - Posted: 24 Apr 2013, 11:39:06 UTC - in response to Message 75445.  

Just experienced the CASP9....... uploading problem.

jrp2706


CASP9 and ActCys waiting for upload here.
ID: 75446 · Rating: 0
TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 75447 - Posted: 24 Apr 2013, 11:54:46 UTC
Last modified: 24 Apr 2013, 11:55:16 UTC

I have several other types that won't upload. The problems are big, but as usual there is no news. My computers will run other projects with medical programs this week.
Greetings,
TJ.
ID: 75447 · Rating: 0
Josh and Amanda
Joined: 20 Oct 11
Posts: 1
Credit: 8,591,642
RAC: 0
Message 75449 - Posted: 24 Apr 2013, 14:21:13 UTC

As a result of the recent myriad of issues with this project, compounded by little or no action by Team Baker, I will be indefinitely suspending my project time on Rosetta. When your team can respect my time and resources and monitor the program effectively, I may consider allowing new tasks; for now, project Collatz thanks you for my additional CPU cycles...
ID: 75449 · Rating: 0
BarryAZ
Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 75451 - Posted: 24 Apr 2013, 16:41:26 UTC
Last modified: 24 Apr 2013, 16:44:30 UTC

As the problems at the project persist without anything but a general note that there is an awareness of 'problems' (and no specifics), I think a natural pattern, or sequence, of responses suggests itself.

1) Bring to the attention of the project (via this site) each of the several specific problems encountered.

2) Mitigate the problem at the workstation (kill off known 'bad boy' work units).

3) Verify the project has acknowledged the specific issue (not done yet)

4) Lacking acknowledgement of a problem - move to a 'no new work' posture, while awaiting project acknowledgement, resolution and explanation.

5) Should other problems surface (failed uploads, failed downloads, delayed validation) note these problems as they occur.

6) Verify the project has acknowledged the specific issues (not done yet).

7) Lacking acknowledgement of multiple specific problems, let alone a resolution of any of them (and ideally an explanation) -- move to a 'suspend project processing' posture while awaiting project acknowledgement, resolution and explanation.

8) Allow some time to pass (days) for the project to do what it should be doing to resolve problems (assuming the data being generated is still of value to the project).

9) After the passage of time with no specific acknowledgement and no time frame for resolution, consider detaching from the project (killing off possibly good work units, which get sent back into the queue -- perhaps with new due dates).

Currently, I submit we are in stage 8 here....

I'd note that aside from this thread, and one on computational errors, there has been nothing in the way of a response or acknowledgement of the reported issues by actual project people (what little there is has come from volunteers who note the project 'is aware'), and nothing on the home page to alert folks who are less proactive that the project is bumping into issues.
ID: 75451 · Rating: 0
BarryAZ
Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 75456 - Posted: 24 Apr 2013, 21:48:32 UTC - in response to Message 75451.  

Update -- so the project took the server fully offline (no notice -- no surprise). The presumption was they were working to correct things. Then again, no notice of what they did. Server now reports all green. Great. Uploads don't go through. Not so great.

I get the feeling that there may be an element of random going on in the troubleshooting efforts. At least the explanations are not random, they are instead nonexistent.

Awaiting developments.
ID: 75456 · Rating: 0
TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 75460 - Posted: 24 Apr 2013, 22:30:04 UTC
Last modified: 24 Apr 2013, 22:31:07 UTC

A long time ago I read somewhere in these fora that they work with students. I guess these students are still learning computer science, and if there is no other help, fixing things becomes a process of trial and error.

I have seen this with other projects too: when a postgraduate student earns a PhD and leaves the project, the expertise is gone.
Perhaps that is the case here as well.
And are communication courses off the curriculum in the US...???

Docking@home is getting my cpu cycles.
Greetings,
TJ.
ID: 75460 · Rating: 0
TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 75461 - Posted: 24 Apr 2013, 22:37:07 UTC
Last modified: 24 Apr 2013, 22:38:08 UTC

Impatient as I am, I did one more "retry now" for the files waiting to upload, and guess what, yes, they flowed through the fiber and copper wires immediately.
It is working again?

The "pending" are still there pending...
So not all is working again...
Greetings,
TJ.
ID: 75461 · Rating: 0
BarryAZ
Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 75463 - Posted: 24 Apr 2013, 22:52:22 UTC - in response to Message 75461.  
Last modified: 24 Apr 2013, 22:54:11 UTC

Thanks for that -- uploads are going through now. That lets me mark the CASP and Cryo units as bad, stay with no new work and let other work units process.

I'd note that unlike the Cryo units, the CASP units might be OK on Windows 7 systems and when they fail they do fail quickly (unlike the Cryo units).

As to pendings -- looks like they just cleared as well.

Now as to the Cryo and CASP workunits....

It is all supposition though as we remain in something of an informational black hole...
ID: 75463 · Rating: 0
TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 75464 - Posted: 24 Apr 2013, 23:00:03 UTC

Indeed the pendings are gone. Once the Docking WUs are finished I will start Rosie again. Let's see if I can get 1 million before the weekend.

Happy crunching for the good cause.
Greetings,
TJ.
ID: 75464 · Rating: 0
BarryAZ
Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 75465 - Posted: 24 Apr 2013, 23:08:57 UTC - in response to Message 75464.  

So, from the update now on the home page (thanks for that), the network specific issues (which showed up as pendings and upload/download problems) were 'resolved' via a reboot.

No comment about the Cryo and Casp work unit issues. I've elected for no new work, and have killed off any Cryo units. As to the Casp work units -- it seems most complete properly and from what I could see, when they fail they fail in under 5 minutes so I'll let them process.

But until I see an assessment of both the Cryo and Casp work units at the project level, I think I'll hang back with no new work and simply clear queues.
ID: 75465 · Rating: 0
Yifan Song
Volunteer moderator
Project developer
Project scientist
Joined: 26 May 09
Posts: 62
Credit: 7,322
RAC: 0
Message 75466 - Posted: 25 Apr 2013, 0:48:59 UTC

Hi, I'm sorry for causing all the trouble with my cryo work units. The crashes are related to using electron density data. I'm updating r@h with bug fixes that should make these jobs more stable.
Yifan
ID: 75466 · Rating: 0
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 75467 - Posted: 25 Apr 2013, 1:42:00 UTC

All good now. Switching back to Rosetta.
ID: 75467 · Rating: 0
BarryAZ
Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 75468 - Posted: 25 Apr 2013, 4:00:55 UTC - in response to Message 75466.  

Thanks for the message -- more than anything the angst is a function of information flow.

Hang around so you can catch updated reports over the next few days regarding the updated cryo work units.

To confirm, the updates are in place and any cryo units that are received should be ok -- correct??
ID: 75468 · Rating: 0
TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 75470 - Posted: 25 Apr 2013, 9:21:26 UTC - in response to Message 75468.  

To confirm, the updates are in place and any cryo units that are received should be ok -- correct??


I think not: I got one cryo and it errored out quickly.
But it could be an old one, so for now I will not immediately abort any cryos.
Greetings,
TJ.
ID: 75470 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,137,231
RAC: 4,613
Message 75471 - Posted: 25 Apr 2013, 11:25:34 UTC - in response to Message 75470.  

I aborted all units in my cache (I only keep a one-day cache) and have moved all my PCs elsewhere. I will come back, since I still have a goal to reach, but not until after Rosetta figures things out. There are waaay too many other places to crunch for to waste time crunching for a project that takes DAYS to say or do anything after we, the crunchers, find problems! Dr. Yifan Song came on and said 'sorry', but people are STILL having problems! Why wasn't EVERY cryo unit pulled and sent over to Albert for re-testing? Oh I know...they want us to crunch through and have problems with all the 'old' units just in case one or two will work!! NO, NO, NO!!! That is NOT the way to engender confidence in your worker bees!!! Send them crap hoping that maybe they either won't care or will find a gem or two in the ton of crap you send out!! AT LEAST the good Doctor came on and SAID SOMETHING, but obviously the problems STILL exist!! Units get sent to people and then get aborted by the project ALL THE TIME, usually over deadline issues, but for other reasons too! WHY wasn't EVERY cryo unit aborted, updated to the new criteria and then sent for Beta testing? Saying 'sorry' doesn't mean a darned thing IF NOTHING CHANGES!!!
ID: 75471 · Rating: 0
Brian Priebe
Joined: 27 Nov 09
Posts: 16
Credit: 33,020,247
RAC: 0
Message 75475 - Posted: 25 Apr 2013, 19:03:01 UTC - in response to Message 75471.  

Saying 'sorry' doesn't mean a darned thing IF NOTHING CHANGES!!!

I have to second this sentiment. About 35% of all work units I've returned in the last week have been aborted due to 'out of memory' errors. If this appalling record doesn't soon change, ROSETTA is history for me.
ID: 75475 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,269,631
RAC: 2,123
Message 75477 - Posted: 25 Apr 2013, 19:59:30 UTC - in response to Message 75475.  
Last modified: 25 Apr 2013, 20:09:00 UTC

I've recently sent some comments to boinc_dev about a problem with the way BOINC keeps track of the amount of memory in use, especially under Windows Vista. For 32-bit workunits, it does not count the SYSWOW64 modules needed to run those workunits under 64-bit Windows.

A few possible ways to handle this, at least partially:

Wait for a future version of BOINC that does count them, and offers separate memory limits for 32-bit memory space and for the entire 64-bit memory space BOINC uses for all workunits.

Set each of your computers to subscribe only to BOINC projects that offer only 32-bit workunits, or only 64-bit workunits, but not both on the same computer.

Upgrade your Windows Vista computers to Windows 7, where the SYSWOW64 modules are much smaller. I don't know how this applies to 64-bit Windows XP or 64-bit Windows 8. Windows Vista uses roughly the same amount of memory for the 32-bit workunits and for the SYSWOW64 modules needed to run them.

Persuade all BOINC projects to either offer a true 64-bit version of each of their applications (even if it won't run any faster than the 32-bit version), or double the estimates of required memory for all 32-bit workunits sent to 64-bit versions of BOINC. 64-bit applications don't use any SYSWOW64 modules when they run, and therefore don't need any memory space to load them.
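To make the last suggestion concrete, here is a minimal Python sketch of the "double the estimate" idea. It assumes, as the post states for Vista, that the WOW64 (SysWOW64) layer needs roughly as much memory as the 32-bit task itself; the function names and the flat 2x factor are hypothetical illustrations, not BOINC's actual scheduling code.

```python
"""Hypothetical sketch: padding memory estimates for 32-bit tasks on 64-bit clients.

Not BOINC code; the names and the 2x factor are illustrative assumptions.
"""

# Assumption from the post: on 64-bit Vista the WOW64 modules need roughly
# as much memory as the 32-bit task, so the real footprint is about 2x.
WOW64_OVERHEAD_FACTOR = 2.0


def padded_memory_estimate(estimate_mb: float, app_is_32bit: bool,
                           client_is_64bit: bool) -> float:
    """Return the per-task memory (MB) to budget for one workunit.

    The estimate is doubled only when a 32-bit application runs under a
    64-bit client, where the post says the WOW64 layer goes uncounted.
    """
    if estimate_mb <= 0:
        raise ValueError("estimate_mb must be positive")
    if app_is_32bit and client_is_64bit:
        return estimate_mb * WOW64_OVERHEAD_FACTOR
    return estimate_mb


def tasks_that_fit(free_mb: float, estimate_mb: float, app_is_32bit: bool,
                   client_is_64bit: bool) -> int:
    """Count how many identical tasks fit in free_mb using the padded estimate."""
    per_task = padded_memory_estimate(estimate_mb, app_is_32bit, client_is_64bit)
    return int(free_mb // per_task)


if __name__ == "__main__":
    # 2048 MB free, 400 MB estimated per task.
    print(tasks_that_fit(2048, 400, app_is_32bit=True, client_is_64bit=True))   # prints 2
    print(tasks_that_fit(2048, 400, app_is_32bit=False, client_is_64bit=True))  # prints 5
```

With 2048 MB free and a 400 MB estimate, an unpadded budget admits five tasks, while the padded one admits only two 32-bit tasks on a 64-bit client. On the post's hypothesis, that gap is where the unexpected 'out of memory' errors would come from.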
ID: 75477 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,137,231
RAC: 4,613
Message 75482 - Posted: 26 Apr 2013, 10:51:48 UTC - in response to Message 75477.  

I am using 64-bit Win7 Ultimate on all of my Rosetta machines, so that isn't really an issue for me, and I have still never crunched a cryo unit successfully!
ID: 75482 · Rating: 0
©2024 University of Washington
https://www.bakerlab.org