could not open file cs_frags.9mers.gz

Message boards : Number crunching : could not open file cs_frags.9mers.gz

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 70081 - Posted: 20 Apr 2011, 20:26:56 UTC

This error message caused a whole bunch of tasks to crash. Here is the list of URLs (not hyperlinked or listed by job). The wingman also crashed and burned with the same error.

1)https://boinc.bakerlab.org/rosetta/result.php?resultid=415518819
2)https://boinc.bakerlab.org/rosetta/result.php?resultid=415512595
3)https://boinc.bakerlab.org/rosetta/result.php?resultid=415512067
4)https://boinc.bakerlab.org/rosetta/result.php?resultid=415511095
5)https://boinc.bakerlab.org/rosetta/result.php?resultid=415509091
6)https://boinc.bakerlab.org/rosetta/result.php?resultid=415455002
7)https://boinc.bakerlab.org/rosetta/result.php?resultid=415312639
8)https://boinc.bakerlab.org/rosetta/result.php?resultid=415300586
9)https://boinc.bakerlab.org/rosetta/result.php?resultid=416241939
10)https://boinc.bakerlab.org/rosetta/result.php?resultid=416241783
11)https://boinc.bakerlab.org/rosetta/result.php?resultid=416236453
12)https://boinc.bakerlab.org/rosetta/result.php?resultid=416236308
13)https://boinc.bakerlab.org/rosetta/result.php?resultid=416235755
14)https://boinc.bakerlab.org/rosetta/result.php?resultid=416224401
15)https://boinc.bakerlab.org/rosetta/result.php?resultid=415950174
16)https://boinc.bakerlab.org/rosetta/result.php?resultid=415900620
17)https://boinc.bakerlab.org/rosetta/result.php?resultid=415538992
18)https://boinc.bakerlab.org/rosetta/result.php?resultid=415526487
19)https://boinc.bakerlab.org/rosetta/result.php?resultid=416389060
20)https://boinc.bakerlab.org/rosetta/result.php?resultid=416290151
21)https://boinc.bakerlab.org/rosetta/result.php?resultid=416289039
22)https://boinc.bakerlab.org/rosetta/result.php?resultid=416274392
23)https://boinc.bakerlab.org/rosetta/result.php?resultid=416264884

That's pretty bad!! 23 tasks!! Who dropped the ball this time?
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 70082 - Posted: 20 Apr 2011, 22:34:25 UTC
Last modified: 20 Apr 2011, 22:35:11 UTC

Who dropped the ball? I was hoping that you were going to step up and take the blame <sarcastic grin>

People have been complaining for over two weeks now about this and a few other "wingman included" errors in both the "Compute Error" and the "Minirosetta 2.17" threads. The issue has not been resolved, and we have not received even an acknowledgement from any of the developers that there is an issue.

Further, this is not just a case of having a few jobs polluting the system and just having to wait until they are worked off the queue - as of today (20 April) these tasks are still being generated.

If you take a look at the front page you will see that the "estimated teraflops" for the project is down under 110, where just a few short months ago it was up around 150. And this is with the recent addition of the two mega-computers run by the Microsoft Windows Azure group and the Russian "2e" group, each with a RAC of well over 100K.

I wonder what the project's "teraflops" would be without these two groups? I don't wonder why so many seem to have left the ranks of active participation.
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 70088 - Posted: 22 Apr 2011, 21:22:53 UTC

3 more from today!!!!!!
https://boinc.bakerlab.org/rosetta/result.php?resultid=416978220
https://boinc.bakerlab.org/rosetta/result.php?resultid=416799986
https://boinc.bakerlab.org/rosetta/result.php?resultid=416794737
You guys are killing me! 26 tasks to one user that no one checked?
3 out of 4 cores go to waste with this problem.
Maybe I should dedicate my resources to another project for a month and come back and see if anything changes. (Doubt it.)

And the lack of communication by the project leaders on here is just astounding.
You can't even take the time to tell us what the tasks are? (You used to.)
You can't acknowledge that there is a problem? (A small gaming company in Czechoslovakia does a better job of this than you guys are doing.)

I joined this project for 2 reasons:
1) It is Seattle based (my old home).
2) There was talk about looking for cures for cancer (which my mother-in-law here in Belgium died from years ago).

An added bonus was that there seemed to be good interaction between the science/technology team and the people donating their CPUs. This part of things has died and gone away.

We have asked you guys before to come talk to us, to acknowledge and look into what the problems are and pull bad tasks or fix the related file. You don't do that any more.
James Thompson

Joined: 13 Oct 05
Posts: 46
Credit: 186,109
RAC: 0
Message 70132 - Posted: 26 Apr 2011, 22:18:18 UTC - in response to Message 70088.  

Hi everyone,

This job is my fault. Mod.Sense e-mailed me over the weekend, and the last vestiges of the jobs should have run their course. There's no excuse for letting this problem go on for so long without being caught, and I'd like to apologize to anyone who feels that their time has been wasted. I'd also like to explain the problem in more detail below and mention what actions we're taking to prevent this in the future.

The fundamental problem with these jobs is that the .zip files sent out to everyone's computers were missing a file. This means that Rosetta failed instantly as soon as the jobs tried to start. These jobs were part of a very large batch, and only some of the jobs were failing. As many of the jobs were successful, I didn't realize that this was going on until Mod.Sense e-mailed me. Even worse, I originally misdiagnosed which job was causing the problem, and removed jobs from the queue that were actually succeeding. Now that we have found the true culprit, the job success rates should return to previous levels.
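
As an illustration of the kind of pre-submission check that could catch a missing input file, here is a minimal Python sketch. It assumes each job ships as a .zip archive and that the submitter knows which file names the job needs. The function name `missing_files`, the archive layout, and every file name except `cs_frags.9mers.gz` (taken from the error message in this thread) are hypothetical and not part of Rosetta or BOINC.

```python
import io
import zipfile


def missing_files(archive, required):
    """Return the required names that are absent from the zip archive.

    `archive` may be a path or a file-like object. Names are compared
    by their final path component, so a file inside a subdirectory of
    the archive still counts as present.
    """
    with zipfile.ZipFile(archive) as zf:
        present = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    return sorted(set(required) - present)


# Build a small in-memory archive that lacks the 9-mer fragment file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("job/cs_frags.3mers.gz", b"placeholder")  # hypothetical name
    zf.writestr("job/flags", b"placeholder")               # hypothetical name

# Hypothetical list of inputs the job expects to find.
required = ["cs_frags.3mers.gz", "cs_frags.9mers.gz", "flags"]
buf.seek(0)
print(missing_files(buf, required))  # prints ['cs_frags.9mers.gz']
```

A check like this would run before a batch is queued, so an incomplete archive is rejected on the submitter's side instead of failing on volunteers' machines.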

The work that these jobs are doing is actually very important and exciting, and I'll explain it in a separate post very soon. We're currently involved in a worldwide competition where people try to determine protein structures with limited experimental data, and our preliminary results are very promising.

In order to prevent this happening in the future, I'll no longer be submitting jobs in such large batches, so that jobs causing errors will be more obvious. I'm very sorry for the mistake, and even more sorry that I've managed to upset some of you. We're testing new and experimental methods all of the time with Rosetta, which makes it very unique and exciting, but mistakes like this are simply not acceptable even in testing. My sincerest apologies for the mistake, and I hope that you'll continue giving us your interest and your time.
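
To see why smaller batches make a broken job easier to spot, here is a rough Python sketch with made-up numbers (none of these figures come from the project). The same set of failing tasks barely moves the failure rate of a very large batch, but dominates a small one.

```python
def failure_rate(failed, total):
    """Fraction of tasks in a batch that failed."""
    return failed / total


# Hypothetical numbers, for illustration only.
broken_tasks = 500      # tasks from the one job with the missing file
large_batch = 50_000    # everything submitted as one batch
small_batch = 1_000     # the same broken job inside a smaller batch

print(f"large batch: {failure_rate(broken_tasks, large_batch):.1%}")
print(f"small batch: {failure_rate(broken_tasks, small_batch):.1%}")
```

With these illustrative figures, the same 500 failures are 1% of the large batch but 50% of the small one.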
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 70134 - Posted: 26 Apr 2011, 23:45:58 UTC

Thanks for your reply James.
James Thompson

Joined: 13 Oct 05
Posts: 46
Credit: 186,109
RAC: 0
Message 70135 - Posted: 27 Apr 2011, 0:16:28 UTC - in response to Message 70134.  

Thanks for your reply James.


You're very welcome, it's the least that I can do. I mean that literally, because we're trying to make this kind of mistake difficult to make in the future.

We're currently discussing automated options for picking up and notifying developers of this problem, as the human component (me in this case) failed here, and we do not want this to happen in the future. Once we decide on a solution I'll post it on the forum.

Once more, I'm very sorry for wasting your time. I'm going to try and encourage my colleagues to post around here and let you know what we're doing. We really appreciate all of your efforts and will try to communicate better.
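
One possible shape for such an automated check, sketched in Python under stated assumptions: results arrive as (job name, succeeded) pairs, and a job is flagged once enough results are in and its failure rate crosses a threshold. The function name, job names, thresholds, and data format are all hypothetical; this is not how the BOINC server or Rosetta@home actually reports results.

```python
from collections import Counter


def flag_failing_jobs(results, min_results=20, max_failure_rate=0.5):
    """Return {job_name: failure_rate} for jobs that look broken.

    `results` is an iterable of (job_name, succeeded) pairs. Jobs with
    fewer than `min_results` results are skipped to avoid false alarms.
    """
    totals, failures = Counter(), Counter()
    for job, succeeded in results:
        totals[job] += 1
        if not succeeded:
            failures[job] += 1

    flagged = {}
    for job, total in totals.items():
        rate = failures[job] / total
        if total >= min_results and rate > max_failure_rate:
            flagged[job] = rate
    return flagged


# Hypothetical data: job_a mostly succeeds, job_b always fails.
results = (
    [("job_a", True)] * 95 + [("job_a", False)] * 5
    + [("job_b", False)] * 30
)
for job, rate in flag_failing_jobs(results).items():
    # A real system would notify the submitter here instead of printing.
    print(f"ALERT {job}: {rate:.0%} of results failed")
```

Grouping by job rather than by batch is the design choice that matters here: a job that fails every time stands out even when the batch as a whole mostly succeeds.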

Sincerely,

James
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 70140 - Posted: 27 Apr 2011, 9:39:58 UTC - in response to Message 70135.  

Thanks for your reply James.


You're very welcome, it's the least that I can do. I mean that literally, because we're trying to make this kind of mistake difficult to make in the future.

We're currently discussing automated options for picking up and notifying developers of this problem, as the human component (me in this case) failed here, and we do not want this to happen in the future. Once we decide on a solution I'll post it on the forum.

Once more, I'm very sorry for wasting your time. I'm going to try and encourage my colleagues to post around here and let you know what we're doing. We really appreciate all of your efforts and will try to communicate better.

Sincerely,

James


That would be nice; we have been asking for more communication from you guys.
Looking forward to more updates or job descriptions.
©2024 University of Washington
https://www.bakerlab.org