Work done - no "pay" for it

Message boards : Number crunching : Work done - no "pay" for it

Raistmer

Joined: 7 Apr 20
Posts: 49
Credit: 797,293
RAC: 0
Message 95178 - Posted: 23 Apr 2020, 6:54:06 UTC
Last modified: 23 Apr 2020, 6:55:19 UTC

Name Mini_Protein_binds_IL6R_COVID-19_1p9m_2_SAVE_ALL_OUT_IGNORE_THE_REST_2fu9wb9d_924136_4_0
Workunit 1041653356
Created 22 Apr 2020, 21:53:15 UTC
Sent 22 Apr 2020, 22:47:25 UTC
Report deadline 25 Apr 2020, 22:47:25 UTC
Received 23 Apr 2020, 3:17:44 UTC
Server state Over
Outcome Computation error
Client state Cancelled by server
Exit status 202 (0x000000CA) EXIT_ABORTED_BY_PROJECT
Computer ID 4186879
Run time 2 hours 0 min 19 sec
CPU time 1 hour 58 min 27 sec

Validate state Invalid
Credit 0.00
Device peak FLOPS 2.49 GFLOPS
Application version Rosetta v4.15
windows_intelx86
Peak working set size 613.73 MB
Peak swap size 596.34 MB
Peak disk usage 978.98 MB
Stderr output
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
aborted by project - no longer usable</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_intelx86.exe -run:protocol jd2_scripting -parser:protocol predictor_v11_boinc--fuse--il1r_design_boinc_v1.xml @flags_il6r -in:file:silent Mini_Protein_binds_IL6R_COVID-19_1p9m_2_SAVE_ALL_OUT_IGNORE_THE_REST_2fu9wb9d.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip Mini_Protein_binds_IL6R_COVID-19_1p9m_2_SAVE_ALL_OUT_IGNORE_THE_REST_2fu9wb9d.zip @Mini_Protein_binds_IL6R_COVID-19_1p9m_2_SAVE_ALL_OUT_IGNORE_THE_REST_2fu9wb9d.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 5000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1048673
Starting watchdog...
Watchdog active.


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x778E86FE

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 7.9.0


Two issues in one result:

1) The task was aborted by the project, so it is not the client's fault. The client did part of the work (elapsed and CPU times are non-zero) but got zero credit. IMO this is an incorrect use of the credit system.

2) The task exited via an exception, which is not the best way to exit.
ID: 95178 · Rating: 0
magiceye04

Joined: 11 May 11
Posts: 11
Credit: 1,702,178
RAC: 0
Message 95183 - Posted: 23 Apr 2020, 7:25:21 UTC
Last modified: 23 Apr 2020, 7:25:33 UTC

I also had about 70 project aborted WUs last night.
Many of them were partly computed, some also fully computed.

I would really recommend testing these beta WUs on the Ralph project.
ID: 95183 · Rating: 0
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1681
Credit: 17,854,150
RAC: 20,118
Message 95184 - Posted: 23 Apr 2020, 7:35:20 UTC - in response to Message 95183.  

I also had about 70 project aborted WUs last night.
Many of them were partly computed, some also fully computed.

I would really recommend testing these beta WUs on the Ralph project.
Or at least pay Credit for the work that was done before they were cancelled.
Grant
Darwin NT
ID: 95184 · Rating: 0
Tomcat雄猫

Joined: 20 Dec 14
Posts: 180
Credit: 5,386,173
RAC: 0
Message 95189 - Posted: 23 Apr 2020, 8:01:20 UTC - in response to Message 95184.  
Last modified: 23 Apr 2020, 8:03:51 UTC

I also had about 70 project aborted WUs last night.
Many of them were partly computed, some also fully computed.

I would really recommend testing these beta WUs on the Ralph project.
Or at least pay Credit for the work that was done before they were cancelled.


Valid point.
I still think we need more testing on Ralph before releasing WUs here. Users here expect stable stuff, since this is not the test project (well, to be fair, Ralph has a pretty paltry user base compared to the main project, so bug testing will take a lot longer and problems may still slip through).
Furthermore, if a task was completed before it got cancelled, it seems fair to award credit (if BOINC even allows that type of thing, that is).
ID: 95189 · Rating: 0
Raistmer

Joined: 7 Apr 20
Posts: 49
Credit: 797,293
RAC: 0
Message 95202 - Posted: 23 Apr 2020, 10:57:14 UTC - in response to Message 95189.  

(if BOINC even allows that type of thing, that is)

And if not, that is a hint about what to implement in the next release.
ID: 95202 · Rating: 0
Terrible T

Joined: 29 Dec 16
Posts: 4
Credit: 1,333,030
RAC: 0
Message 95205 - Posted: 23 Apr 2020, 11:40:03 UTC

After losing a good 200,000 sec of computing time to cancelled and errored tasks (also cancelled), it would indeed be better to have more testing before work units are released.
ID: 95205 · Rating: 0
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 8,387
Message 95208 - Posted: 23 Apr 2020, 12:15:44 UTC - in response to Message 95183.  

I would really recommend testing these beta WUs on the Ralph project.

+1
ID: 95208 · Rating: 0
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 8,387
Message 95213 - Posted: 23 Apr 2020, 15:39:41 UTC

Not only last night.
Now another 4 WUs have been aborted by the server.
ID: 95213 · Rating: 0
magiceye04

Joined: 11 May 11
Posts: 11
Credit: 1,702,178
RAC: 0
Message 95218 - Posted: 23 Apr 2020, 17:30:33 UTC

Today I got defective WUs without checkpointing.
But the computer had to restart for an external reason.
AGAIN many hours of wasted computing - everything starts from zero.
Maybe I'll try a non-beta project for the next few days...
ID: 95218 · Rating: 0
bcov
Volunteer moderator
Project developer
Project scientist

Joined: 8 Nov 16
Posts: 12
Credit: 11,348
RAC: 0
Message 95229 - Posted: 23 Apr 2020, 19:47:19 UTC

Hey everyone,

Sorry about the cancelled jobs. You're seeing the growing pains as we transition over to more design focused projects on R@H.

I'll give you guys the full story so you can put what happened here in perspective.

1. We finally figured out how to do protein design on R@H
2. We started doing monomer design on R@H (these are future protein binders)
3. We worked hard and got an update out to allow Protein Interface Design on R@H
4. We started submitting interface designs to do massive sampling using R@H
5. These runs were too successful and it blew up the servers on our ends
-- We decided to remedy this by using filtering on the R@H jobs in this way. Only some of the outputs get stored on our servers and the rest are discarded as they are received
6. This new freedom allowed for even larger jobs to be submitted. Absolutely incredible designs are coming out the other side. This increase in compute power is equivalent to about 5 years of methods development.
7. These jobs have an early filter and a late filter. The early filter still takes time, but ensures that the protein is worth spending compute on. The job runs and then the late filter decides whether or not we'll keep the job
8. Some computers are really fast and burn through tons and tons of early filters. Since we keep the output in order for users to get credit for their computation, this resulted in tons and tons of data.
9. This data is still sent back to the server where it will be discarded. But some users were noticing data transfers of the maximum size of 500MB being sent back.
10. This was the point where the decision to cancel the jobs was made; we decided that people's internet bandwidth took priority over lost cycles.

What are we doing to remedy this:

We are working out a system whereby these early filtered jobs will either not be transmitted or be greatly reduced in size. We first and foremost want to make sure that users are credited for their work, but we also need to balance this with internet bandwidth.
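
The remedy described above can be pictured with a minimal Python sketch. This is not Rosetta code: the names `JobReport` and `run_job`, the toy filters, and the idea of reporting rejected designs as plain counters are all assumptions made for illustration. It only shows how early-filtered candidates could cost a few bytes of upload instead of a full output record.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class JobReport:
    """What a client would send back (hypothetical layout)."""
    kept_designs: List[str] = field(default_factory=list)  # full output, passed both filters
    early_rejected: int = 0  # only a count is reported, no design data
    late_rejected: int = 0   # likewise an assumption: count only


def run_job(candidates: Iterable[str],
            early_filter: Callable[[str], bool],
            design: Callable[[str], str],
            late_filter: Callable[[str], bool]) -> JobReport:
    """Run the early filter, the expensive design step, then the late filter."""
    report = JobReport()
    for candidate in candidates:
        if not early_filter(candidate):
            # Rejected early: remember that work was done, but store no output.
            report.early_rejected += 1
            continue
        result = design(candidate)
        if late_filter(result):
            report.kept_designs.append(result)
        else:
            report.late_rejected += 1
    return report


# Toy stand-ins for the real filters and design step.
report = run_job(
    candidates=["aa", "bbbb", "cccccc", "d"],
    early_filter=lambda c: len(c) >= 2,
    design=lambda c: c.upper(),
    late_filter=lambda r: len(r) <= 4,
)
print(report)
```

The counters still record how much filtering work a client performed, which is the information a credit calculation would need, while the upload size depends only on the designs that survive.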


On the topic of the beta server:

We did run these on the beta server, and the jobs were successful. But this was a case of rare events.

The beta server gave us mixed results here, but they were not severe red flags. The jobs in question got unlucky with the set of proteins they were set to design and were also run on fast computers with long job lengths. Only about 1% of the jobs caused this data explosion and then it was only seen for a few specific configurations.

We now know what this looks like now though and will make sur
ID: 95229 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 95230 - Posted: 23 Apr 2020, 19:56:42 UTC

I just want to point out that when bcov said "We finally figured out how to do protein design on R@H", he was distinguishing "design" from the "structure predictions" that have been using R@h for years. Designing a protein from scratch is very different from predicting the shape that a given sequence of amino acids will take.
Rosetta Moderator: Mod.Sense
ID: 95230 · Rating: 0
Admin
Project administrator

Joined: 1 Jul 05
Posts: 4805
Credit: 0
RAC: 0
Message 95231 - Posted: 23 Apr 2020, 20:11:27 UTC

We have made designs using R@h in the past but not using the new protocols and sampling strategies that bcov et al are running for COVID-19 and large scale scaffold design and interface design in general. These new strategies that bcov helped develop with massive sampling are producing the largest number of good designs as judged by various metrics we use in the lab.
ID: 95231 · Rating: 0
Raistmer

Joined: 7 Apr 20
Posts: 49
Credit: 797,293
RAC: 0
Message 95241 - Posted: 23 Apr 2020, 22:14:42 UTC - in response to Message 95229.  
Last modified: 23 Apr 2020, 22:15:19 UTC



5. These runs were too successful and it blew up the servers on our ends
-- We decided to remedy this by using filtering on the R@H jobs in this way. Only some of the outputs get stored on our servers and the rest are discarded as they are received

Could you explain this in more detail? Why is the server required for this? Why can't it be done on the clients? Does the server compare results received from many clients at this stage?


8. Some computers are really fast and burn through tons and tons of early filters. Since we keep the output in order for users to get credit for their computation, this resulted in tons and tons of data.

And what prevents you from accepting the result, paying the credit, comparing it with another (as in 5), and then discarding it? Why is credit payment tied to keeping the result?


9. This data is still sent back to the server where it will be discarded. But some users were noticing data transfers of the maximum size of 500MB being sent back.

Perhaps this means some additional "filtering" should be done on the client side. That is, if many models are generated, report back only a fixed number of the best ones (this implies that the models can be compared on the client, of course).
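
A minimal sketch of that idea in Python, under stated assumptions: each model carries a single numeric score, a lower score is better, and the names `Model` and `best_models` are invented for this example. Nothing here reflects how the Rosetta client actually ranks or reports models.

```python
import heapq
from typing import List, Tuple

# (score, model id); a lower score is assumed to be better in this sketch.
Model = Tuple[float, str]


def best_models(models: List[Model], keep: int) -> List[Model]:
    """Return the `keep` lowest-scoring models, best first."""
    return heapq.nsmallest(keep, models, key=lambda m: m[0])


# Made-up scores, purely for illustration.
models = [(-210.5, "m1"), (-198.2, "m2"), (-305.0, "m3"), (-150.7, "m4")]
top = best_models(models, keep=2)
print(top)
```

With a fixed `keep`, the amount of data reported back is bounded no matter how many models a fast machine generates.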
ID: 95241 · Rating: 0
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1681
Credit: 17,854,150
RAC: 20,118
Message 95251 - Posted: 23 Apr 2020, 22:58:28 UTC

I understand that in this project even Tasks that have a Computation error can provide useful data for the project, but if a Task isn't a computation error, it should be considered Valid, and get Credit for whatever work it has done.
So any Tasks that had been started and are aborted by the server should count as Valid (if they have produced Valid work of course), that way they will get Credit for any work done prior to being aborted. Unstarted tasks won't get Credit, but they will still count as a Valid result, not as an Error.
Grant
Darwin NT
ID: 95251 · Rating: 0
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1681
Credit: 17,854,150
RAC: 20,118
Message 95254 - Posted: 23 Apr 2020, 23:05:09 UTC - in response to Message 95229.  

Hey everyone,

Sorry about the cancelled jobs. You're seeing the growing pains as we transition over to more design focused projects on R@H.

I'll give you guys the full story so you can put what happened here in perspective.

...
Thanks for filling us in.
Along with consistent Credit (and comparable amounts with other projects), it will go a long way to help the project retain much of its recently acquired computing resources even after Covid-19 is no longer in the news.
I would suggest that similar posts, as new things are rolled out and existing ones are tweaked in response to returned results, will have the greatest benefit for retaining crunchers.
Grant
Darwin NT
ID: 95254 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 9,701
Message 95258 - Posted: 23 Apr 2020, 23:43:08 UTC - in response to Message 95229.  

Sorry about the cancelled jobs. You're seeing the growing pains as we transition over to more design focused projects on R@H.

I'll give you guys the full story so you can put what happened here in perspective.

5. These runs were too successful and it blew up the servers on our ends
6. This new freedom allowed for even larger jobs to be submitted. Absolutely incredible designs are coming out the other side. This increase in compute power is equivalent to about 5 years of methods development.
7. These jobs have an early filter and a late filter. The early filter still takes time, but ensures that the protein is worth spending compute on. The job runs and then the late filter decides whether or not we'll keep the job
8. Some computers are really fast and burn through tons and tons of early filters. Since we keep the output in order for users to get credit for their computation, this resulted in tons and tons of data.

What are we doing to remedy this:

We are working out a system whereby these early filtered jobs will either not be transmitted or be greatly reduced in size. We first and foremost want to make sure that users are credited for their work, but we also need to balance this with internet bandwidth.

On the topic of the beta server:

We did run these on the beta server, and the jobs were successful. But this was a case of rare events.

The beta server gave us mixed results here, but they were not severe red flags. The jobs in question got unlucky with the set of proteins they were set to design and were also run on fast computers with long job lengths. Only about 1% of the jobs caused this data explosion and then it was only seen for a few specific configurations.

Just quoting the bits I think are important. This is the good news.
The only comment I'd make is about ensuring checkpoints are made at key stages.
It sounds like having running jobs cut short is annoying but relatively trivial, especially if the lessons are learned so it won't recur.
If that's the case I wouldn't waste much effort finding a way to compensate users. Annoying yes, but bygones.

Great post.
ID: 95258 · Rating: 0
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 8,387
Message 95278 - Posted: 24 Apr 2020, 7:33:58 UTC - in response to Message 95229.  

Hey everyone,
Sorry about the cancelled jobs. You're seeing the growing pains as we transition over to more design focused projects on R@H.
I'll give you guys the full story so you can put what happened here in perspective.

I like your work and I will continue to crunch!!
ID: 95278 · Rating: 0
magiceye04

Joined: 11 May 11
Posts: 11
Credit: 1,702,178
RAC: 0
Message 95326 - Posted: 24 Apr 2020, 21:50:41 UTC

Thank you for the explanation!

Today I also had to abort some WUs. They consumed about 1.8 GB per WU and froze the PC. I only allowed about 12 WUs, but it was still too much for 16 GB of RAM.
Maybe these WUs could be sent only to PCs with at least 4 GB per CPU core...
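
As a rough check of the numbers in this post, the short Python calculation below uses only the figures quoted above (12 WUs, about 1.8 GB each, 16 GB of RAM). It ignores memory used by the operating system and other programs, so the real limit would be lower still.

```python
import math

ram_gb = 16.0        # installed RAM, from the post
per_task_gb = 1.8    # approximate memory per WU, from the post
running_tasks = 12   # WUs allowed to run at once, from the post

# Total memory the running WUs would ask for.
needed_gb = running_tasks * per_task_gb

# Largest number of such WUs that fits in RAM, ignoring everything else.
max_tasks = math.floor(ram_gb / per_task_gb)

print(f"{running_tasks} tasks need about {needed_gb:.1f} GB of {ram_gb:.0f} GB")
print(f"At most {max_tasks} such tasks fit in RAM")
```

Twelve tasks of this size ask for roughly 21.6 GB, well over the 16 GB available, and no more than 8 of them could fit even on an otherwise idle machine.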
ID: 95326 · Rating: 0
Jonathan

Joined: 4 Oct 17
Posts: 43
Credit: 1,337,472
RAC: 0
Message 95332 - Posted: 25 Apr 2020, 4:19:19 UTC - in response to Message 95326.  

What do you have set for your computing preferences? How much RAM are you allowing to be used? The tasks should have just gone into 'waiting for memory'.
ID: 95332 · Rating: 0




©2024 University of Washington
https://www.bakerlab.org