Message boards : Number crunching : could not open file cs_frags.9mers.gz
Author | Message |
---|---|
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
This error message caused a whole bunch of tasks to crash. Here is the list of URLs (not hyperlinked or listed by job). The wingman also crashed and burned with the same error. 1) https://boinc.bakerlab.org/rosetta/result.php?resultid=415518819 2) https://boinc.bakerlab.org/rosetta/result.php?resultid=415512595 3) https://boinc.bakerlab.org/rosetta/result.php?resultid=415512067 4) https://boinc.bakerlab.org/rosetta/result.php?resultid=415511095 5) https://boinc.bakerlab.org/rosetta/result.php?resultid=415509091 6) https://boinc.bakerlab.org/rosetta/result.php?resultid=415455002 7) https://boinc.bakerlab.org/rosetta/result.php?resultid=415312639 8) https://boinc.bakerlab.org/rosetta/result.php?resultid=415300586 9) https://boinc.bakerlab.org/rosetta/result.php?resultid=416241939 10) https://boinc.bakerlab.org/rosetta/result.php?resultid=416241783 11) https://boinc.bakerlab.org/rosetta/result.php?resultid=416236453 12) https://boinc.bakerlab.org/rosetta/result.php?resultid=416236308 13) https://boinc.bakerlab.org/rosetta/result.php?resultid=416235755 14) https://boinc.bakerlab.org/rosetta/result.php?resultid=416224401 15) https://boinc.bakerlab.org/rosetta/result.php?resultid=415950174 16) https://boinc.bakerlab.org/rosetta/result.php?resultid=415900620 17) https://boinc.bakerlab.org/rosetta/result.php?resultid=415538992 18) https://boinc.bakerlab.org/rosetta/result.php?resultid=415526487 19) https://boinc.bakerlab.org/rosetta/result.php?resultid=416389060 20) https://boinc.bakerlab.org/rosetta/result.php?resultid=416290151 21) https://boinc.bakerlab.org/rosetta/result.php?resultid=416289039 22) https://boinc.bakerlab.org/rosetta/result.php?resultid=416274392 23) https://boinc.bakerlab.org/rosetta/result.php?resultid=416264884 That's pretty bad!! 23 tasks!! Who dropped the ball this time? |
Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Who dropped the ball? I was hoping that you were going to step up and take the blame <sarcastic grin> People have been complaining for over two weeks now about this and a few other "wingman included" errors in both the "Compute Error" and the "Minirosetta 2.17" threads. The issue has not been resolved, and we have not received even an acknowledgement from any of the developers that there is an issue. Further, this is not just a case of having a few jobs polluting the system and just having to wait until they are worked off the queue - as of today (20 April) these tasks are still being generated. If you take a look at the front page you will see that the "estimated teraflops" for the project is down under 110 - where just a few short months ago it was up around 150 - and this is with the recent addition of the two mega-computers run by the Microsoft Windows Azure group and the Russian "2e" group - each with a RAC of well over 100K. I wonder what the project's "teraflops" would be without these two groups? I don't wonder why so many seem to have left the ranks of active participation. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
3 more from today!!!!!! https://boinc.bakerlab.org/rosetta/result.php?resultid=416978220 https://boinc.bakerlab.org/rosetta/result.php?resultid=416799986 https://boinc.bakerlab.org/rosetta/result.php?resultid=416794737 You guys are killing me! 26 tasks to one user that no one checked? 3 out of 4 cores go to waste with this problem. Maybe I should dedicate my resources to another project for a month and come back and see if anything changes (doubt it). And the lack of communication by the project leaders on here is just astounding. You can't even take the time to tell us what the tasks are? (You used to.) You can't acknowledge that there is a problem? (A small gaming company in Czechoslovakia does a better job of this than you guys are doing.) I joined this project for 2 reasons: 1) It is Seattle based (my old home). 2) There was talk about looking for cures for cancer (which my mother-in-law here in Belgium died from years ago). An added bonus was that there seemed to be good interaction between the science team/technology team and the people donating their CPUs. This part of things has died and gone away. We have asked you guys before to come talk to us, to acknowledge and look into what the problems are, and to pull bad tasks or fix the related file. You don't do that any more. |
James Thompson Joined: 13 Oct 05 Posts: 46 Credit: 186,109 RAC: 0 |
Hi everyone, This job is my fault. Mod.Sense e-mailed me over the weekend, and the last vestiges of the jobs should have run their course. While there's no excuse for letting this problem go on for so long without being caught, I'd like to offer an apology to anyone who feels like their time is being wasted, explain the problem in more detail below, and mention what actions we're taking to prevent this in the future. The fundamental problem with these jobs is that the .zip files sent out to everyone's computers were missing a file. This means that Rosetta failed instantly as soon as the jobs tried to start. These jobs were part of a very large batch, and only some of the jobs were failing. As many of the jobs were successful, I didn't realize that this was going on until Mod.Sense e-mailed me. Even worse, I originally misdiagnosed which job was causing the problem, and removed jobs from the queue that were actually succeeding. Now that we have the true culprit, the job success rates should return to previous levels. The work that these jobs are doing is actually very important and exciting, and I'll explain it in a separate post very soon. We're currently involved in a worldwide competition where people try to determine protein structures with limited experimental data, and our preliminary results are very promising. To prevent this from happening in the future, I'll no longer be submitting jobs in such large batches, so that jobs causing errors will be more obvious. I'm very sorry for the mistake, and even more sorry that I've managed to upset some of you. We're testing new and experimental methods all of the time with Rosetta, which makes it unique and exciting, but mistakes like this are simply not acceptable even in testing. My sincerest apologies for the mistake, and I hope that you'll continue giving us your interest and your time. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Thanks for your reply James. |
James Thompson Joined: 13 Oct 05 Posts: 46 Credit: 186,109 RAC: 0 |
Thanks for your reply James. You're very welcome, it's the least that I can do. I mean that literally, because we're trying to make this kind of mistake difficult to make in the future. We're currently discussing automated options for detecting this kind of problem and notifying developers, as the human component (me in this case) failed here, and we do not want this to happen again. Once we decide on a solution I'll post it on the forum. Once more, I'm very sorry for wasting your time. I'm going to try to encourage my colleagues to post around here and let you know what we're doing. We really appreciate all of your efforts and will try to communicate better. Sincerely, James |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Thanks for your reply James. That would be nice; we have been asking for more communication from you guys. Looking forward to more updates or job descriptions. |
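
Editor's sketch: James traces the failures to .zip files that were sent out with a file missing. One way to catch that before a batch goes out might be a check along the following lines. This is only an illustration, not the project's actual tooling: the function name `missing_members` and the `REQUIRED_FILES` list are hypothetical, and `cs_frags.9mers.gz` is the only file name taken from this thread.

```python
import zipfile
from pathlib import PurePosixPath

# Hypothetical list of files every job archive must contain.
# Only cs_frags.9mers.gz is named in this thread; a real job may need more.
REQUIRED_FILES = ["cs_frags.9mers.gz"]


def missing_members(archive_path, required=REQUIRED_FILES):
    """Return the required file names that are absent from the zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        # Zip member names use forward slashes; compare base names so that
        # files stored inside subdirectories still count as present.
        present = {PurePosixPath(name).name for name in archive.namelist()}
    return [name for name in required if name not in present]
```

Calling `missing_members("job_inputs.zip")` (a made-up archive name) returns an empty list when nothing is missing, so a submission script could refuse to queue any archive for which the list is non-empty.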
©2024 University of Washington
https://www.bakerlab.org