Message boards : Number crunching : Work done - no "pay" for it
Author | Message |
---|---|
Raistmer Joined: 7 Apr 20 Posts: 49 Credit: 797,293 RAC: 0 |
Name: Mini_Protein_binds_IL6R_COVID-19_1p9m_2_SAVE_ALL_OUT_IGNORE_THE_REST_2fu9wb9d_924136_4_0 Two issues in one result: 1) The task was aborted by the project, so it was not a client fault. The client did part of the work (elapsed and CPU times are not zero) but got zero credit. IMO this is incorrect use of the credit system. 2) The task exited via an exception, which is not the best way to exit. |
magiceye04 Joined: 11 May 11 Posts: 11 Credit: 1,702,178 RAC: 0 |
I also had about 70 project aborted WUs last night. Many of them were partly computed, some also fully computed. I would really recommend to test these beta-WUs on the Ralph-project. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1684 Credit: 17,941,043 RAC: 23,092 |
"I also had about 70 project aborted WUs last night." Or at least pay Credit for the work that was done before they were cancelled. Grant Darwin NT |
Tomcat雄猫 Joined: 20 Dec 14 Posts: 180 Credit: 5,386,173 RAC: 0 |
"I also had about 70 project aborted WUs last night." "Or at least pay Credit for the work that was done before they were cancelled." Valid point. I still think we need more testing on Ralph before releasing WUs here. Users here expect stable stuff, since this is not the test project (well, to be fair, Ralph has a pretty paltry user base compared to the main project, so bug testing will take a lot longer and problems may still slip through). Furthermore, if a task was completed before it got cancelled, it seems fair to award credit (if BOINC even allows that type of thing, that is). |
Raistmer Joined: 7 Apr 20 Posts: 49 Credit: 797,293 RAC: 0 |
"(if BOINC even allows that type of thing, that is)" And if not, that is a hint about what to implement in the next release. |
Terrible T Joined: 29 Dec 16 Posts: 4 Credit: 1,333,030 RAC: 0 |
After losing a good 200,000 seconds of computing time to cancelled and errored tasks (the errored ones were also cancelled), it would indeed be better to have more testing before work units are released. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,631,095 RAC: 7,054 |
"I would really recommend to test these beta-WUs on the Ralph-project." +1 |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,631,095 RAC: 7,054 |
Not only tonight. Now another 4 WUs have been aborted by the server. |
magiceye04 Joined: 11 May 11 Posts: 11 Credit: 1,702,178 RAC: 0 |
Today I got defective WUs without checkpointing, and the computer had to restart for an external reason. AGAIN many hours of wasted computing, with everything starting from zero. Maybe I'll try a non-beta project in the next few days... |
bcov Volunteer moderator Project developer Project scientist Joined: 8 Nov 16 Posts: 12 Credit: 11,348 RAC: 0 |
Hey everyone, Sorry about the cancelled jobs. You're seeing the growing pains as we transition over to more design-focused projects on R@H. I'll give you guys the full story so you can put what happened here in perspective.
1. We finally figured out how to do protein design on R@H.
2. We started doing monomer design on R@H (these are future protein binders).
3. We worked hard and got an update out to allow Protein Interface Design on R@H.
4. We started submitting interface designs to do massive sampling using R@H.
5. These runs were too successful and blew up the servers on our end. We decided to remedy this by filtering the R@H jobs: only some of the outputs get stored on our servers, and the rest are discarded as they are received.
6. This new freedom allowed for even larger jobs to be submitted. Absolutely incredible designs are coming out the other side. This increase in compute power is equivalent to about 5 years of methods development.
7. These jobs have an early filter and a late filter. The early filter still takes time, but ensures that the protein is worth spending compute on. The job runs, and then the late filter decides whether or not we'll keep the job.
8. Some computers are really fast and burn through tons and tons of early filters. Since we keep the output in order for users to get credit for their computation, this resulted in tons and tons of data.
9. This data is still sent back to the server, where it will be discarded. But some users were noticing data transfers of the maximum size of 500MB being sent back.
10. This was the point where the decision to cancel the jobs was made. We decided that people's internet bandwidth took priority over lost cycles.
What we are doing to remedy this: We are working out a system whereby these early filtered jobs will either not be transmitted or be greatly reduced in size.
We first and foremost want to make sure that users are credited for their work, but we also need to balance this with internet bandwidth. On the topic of the beta server: We did run these on the beta server, and the jobs were successful. But this was a case of rare events. The beta server gave us mixed results here, but they were not severe red flags. The jobs in question got unlucky with the set of proteins they were set to design and were also run on fast computers with long job lengths. Only about 1% of the jobs caused this data explosion, and then it was only seen for a few specific configurations. We now know what this looks like, though, and will make sur |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I just want to point out that when bcov said "We finally figured out how to do protein design on R@H", he is distinguishing "design" from the "structure predictions" that have been using R@h for years. Designing a protein from scratch is very different than predicting the shape that a given sequence of amino acids will take. Rosetta Moderator: Mod.Sense |
Admin Project administrator Joined: 1 Jul 05 Posts: 4805 Credit: 0 RAC: 0 |
We have made designs using R@h in the past but not using the new protocols and sampling strategies that bcov et al are running for COVID-19 and large scale scaffold design and interface design in general. These new strategies that bcov helped develop with massive sampling are producing the largest number of good designs as judged by various metrics we use in the lab. |
Raistmer Joined: 7 Apr 20 Posts: 49 Credit: 797,293 RAC: 0 |
Could you explain this in more detail? Why is the server required for this? Why can't it be done on the clients? Does the server compare the results received from many clients at this stage?
And what prevents accepting the result, paying the credit, comparing it with another (as in 5), and then discarding it? Why is credit payment tied to keeping the result?
Perhaps it means some additional "filtering" should be done on the client side. That is, if many models are generated, report back only a fixed number of the best ones (this implies one can compare those models on the client, of course). |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1684 Credit: 17,941,043 RAC: 23,092 |
I understand that in this project even Tasks that have a Computation error can provide useful data for the project, but if a Task isn't a computation error, it should be considered Valid, and get Credit for whatever work it has done. So any Tasks that had been started and are aborted by the server should count as Valid (if they have produced Valid work of course), that way they will get Credit for any work done prior to being aborted. Unstarted tasks won't get Credit, but they will still count as a Valid result, not as an Error. Grant Darwin NT |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1684 Credit: 17,941,043 RAC: 23,092 |
"Hey everyone," Thanks for filling us in. Along with consistent Credit (and comparable amounts with other projects), it will go a long way to help the project retain much of its recently acquired computing resources even after Covid-19 is no longer in the news. I would suggest that similar posts, as new things are rolled out and existing ones are tweaked as results are returned, will have the greatest benefit for retaining crunchers. Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
"Sorry about the cancelled jobs. You're seeing the growing pains as we transition over to more design focused projects on R@H." Just quoting the bits I think are important. This is the good news. The only comment I'd make is about ensuring checkpoints are made at key stages. It sounds like running jobs being cut short is annoying but relatively trivial, especially if the lessons are learned so it won't recur. If that's the case, I wouldn't waste much effort finding a way to compensate users. Annoying, yes, but bygones. Great post. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,631,095 RAC: 7,054 |
"Hey everyone," I like your work and I will continue to crunch!! |
magiceye04 Joined: 11 May 11 Posts: 11 Credit: 1,702,178 RAC: 0 |
Thank you for the explanation! Today I also had to abort some WUs. They consumed about 1.8 GB per WU and froze the PC. I only allowed about 12 WUs, but it was still too much for 16 GB of RAM. Maybe these WUs could be sent only to PCs with at least 4 GB per CPU core... |
Jonathan Joined: 4 Oct 17 Posts: 43 Credit: 1,337,472 RAC: 0 |
What do you have set for your computing preferences? How much RAM are you allowing to be used? The tasks should have just gone into 'waiting for memory'. |
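
The memory figures in the last two posts can be checked with a little arithmetic. The sketch below is only an illustration: the function name `max_concurrent_tasks` and the `reserved_mb` parameter are hypothetical and are not part of BOINC or Rosetta@home, and it assumes decimal megabytes (16 GB = 16000 MB, 1.8 GB = 1800 MB) so the numbers stay exact.

```python
def max_concurrent_tasks(total_ram_mb: int, per_task_mb: int, reserved_mb: int = 0) -> int:
    """Return how many tasks of a given size fit in RAM at once.

    reserved_mb is a hypothetical allowance for the OS and other programs.
    """
    if per_task_mb <= 0:
        raise ValueError("per_task_mb must be positive")
    usable_mb = max(total_ram_mb - reserved_mb, 0)
    return usable_mb // per_task_mb


TOTAL_RAM_MB = 16000   # 16 GB, as reported in the post
PER_TASK_MB = 1800     # about 1.8 GB per WU, as reported in the post
ALLOWED_TASKS = 12     # tasks the poster allowed to run at once

# Memory the 12 allowed tasks would need together.
print(ALLOWED_TASKS * PER_TASK_MB)                         # 21600

# Tasks that fit with no memory held back for anything else.
print(max_concurrent_tasks(TOTAL_RAM_MB, PER_TASK_MB))     # 8

# Tasks that fit if 2000 MB is held back (an assumed figure).
print(max_concurrent_tasks(TOTAL_RAM_MB, PER_TASK_MB, reserved_mb=2000))  # 7

# Tasks that fit under the suggested 4 GB-per-core rule.
print(max_concurrent_tasks(TOTAL_RAM_MB, 4000))            # 4
```

With the reported numbers, 12 such tasks would need about 21.6 GB, while 16 GB holds at most 8 of them before anything is set aside for the operating system.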
©2024 University of Washington
https://www.bakerlab.org