Message boards : Number crunching : Discussion of the merits and challenges of using GPUs
Author | Message |
---|---|
Seneca Joined: 9 Apr 07 Posts: 1 Credit: 86,244 RAC: 0 |
Nice news ... glad to be able to assist in fighting Corona ... By the way: any chance that we will get GPU applications, too? It's hard to see my CPU struggle at 89°C while my NVIDIA RTX 2080 Super is on a half-astronomic diet, munching away MilkyWay@home workunits in 2'30" each while crunching Folding@home in parallel. Seneca 0=0 |
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,402 |
"Nice news ... glad to be able to assist in fighting Corona ..." The previous attempt to add a GPU application for Rosetta@Home gave a result that ran at a speed similar to that of the CPU application: a little faster on some machines, a little slower on others. This usually indicates that the algorithm has too few places where enough of the many cores on a GPU can be used at once. I've seen no sign that this has been tried again with newer versions of the applications. That GPU should be well suited to run GPUGRID. Their work is related to medical research, but they aren't saying whether their current work will help COVID-19 research or not - apparently they don't know yet. |
Michal Gust Joined: 25 Mar 20 Posts: 1 Credit: 31,331 RAC: 0 |
"Nice news ... glad to be able to assist in fighting Corona ..." I had the same question on my mind... But I'm really surprised by the answer. Folding@home does the same thing from my point of view, but it benefits a lot from GPU power. Could someone explain why it can benefit but R@h can't?
Given the mention of inefficiency with a small L3 cache, I assume there is little or no optimization of the algorithm for the specifics of the CPU/GPU architecture. I have no clue what the issue is in this specific case, because I don't know anything about the algorithm or its code.
For those who are curious, here is a quite simple example of how the way code is written can significantly affect performance (by a factor of two in this example), and of what I'm talking about. The optimization in the example is mainly about execution queues. Modern CPUs do this in hardware to a limited extent, but I chose it because the problem has been well known for years and is easy for laymen to picture. Similar concepts should be applied in programming at a larger scale, where hardware optimization cannot do the job for the programmer: what data can I reuse, what causes a bottleneck, and how should I process the data to prevent it? E.g. we have lots of SIMD instructions that can even multiply two arrays of FP numbers, but if I perform several subsequent operations on those arrays, it can be more efficient to do all of them on one row or column, then on the second, etc. All the operations can then be done at much higher speed in the cache and only the final results are flushed to memory, instead of loading and flushing the whole array at each step, which simply overflows the caches.
So let's take the simple code IF (A-B)<0 THEN Z=C+D ELSE Z=A-B+C+D, knowing the condition is true in only 1% of cases.
Take an ancient CPU with no execution queue (always 1 cycle per command), another with a 2-stage queue (a jump or immediate use of the previous result takes 2 cycles, otherwise 1), and a last one with 3+3 stages (a jump takes 6 cycles, and use of a previous result takes up to 3 cycles depending on how many cycles earlier it was executed), and show how to optimize the code for each one. All are old CPUs, so none has out-of-order or speculative execution, and each CPU runs as many cycles per nanosecond as it has stages (1, 2 and 6). I won't use assembler commands, just symbolically written two-operand equations and Pascal/BASIC-style conditional commands, to keep it easy to understand.
a) I can be dumb and simply write exactly what the line of code says. It looks like this:
1) X=A-B
2) IF X>=0 THEN GO TO 5)
3) Z=C+D
4) GO TO 8)
5) Z=A-B
6) Z=Z+C
7) Z=Z+D
No queue: 99% 5 cycles, 1% 4 cycles; weighted average 4.99 cycles; average execution time 4.99 ns
2 stages: 99% 9 cycles, 1% 5 cycles; avg. 8.96 cycles; 4.48 ns
3:3 stages: 99% 16 cycles, 1% 6 cycles; avg. 15.9 cycles; 2.65 ns
b) General optimization by reusing a result or using a cleared register. Queuing is not considered, so the CPUs with queues wait cycles at 2) and 5) for results, and after the conditional jump.
1) X=A-B
2) IF X>=0 THEN GO TO 4)
3) X=0
4) Z=C+D
5) Z=Z+X
No queue: 99% 4 cyc., 1% 5 cyc.; avg. 4.01 cyc.; 4.01 ns; diff. to the previous -19.6%; to the worst -19.6%
2 stages: 100% 7 cyc.; avg. 7 cyc.; 3.5 ns; -21.9%; -21.9%
3:3 stages: 99% 13 cyc., 1% 9 cyc.; avg. 12.96 cyc.; 2.16 ns; -18.5%; -18.5%
c) Basic queuing optimization: computing C+D in advance saves cycles at 3) and 5). On modern CPUs this could be done by out-of-order execution (depending on its range and on speculative execution).
1) X=A-B
2) Z=C+D
3) IF X>=0 THEN GO TO 5)
4) X=0
5) Z=Z+X
No queue: 99% 4 cyc., 1% 5 cyc.; avg. 4.01 cyc.; 4.01 ns; 0%; -19.6%
2 stages: 99% 5 cyc., 1% 6 cyc.; avg. 5.01 cyc.; 2.51 ns; -28.3%; -44%
3:3 stages: 99% 10 cyc., 1% 8 cyc.; avg. 9.98 cyc.; 1.66 ns; -23.1%; -37.4%
d) Advanced queuing optimization: the probability of the result is taken into account, saving a lot of cycles after the conditional jump in the most probable case. It pays off only with larger execution queues (and in real CPUs they are even larger), because it costs some extra cycles. On modern CPUs a similar job is done, with some success rate, by speculative execution.
1) X=A-B
2) Y=C+D
3) Z=0
4) IF X<0 THEN GO TO 6)
5) Z=X
6) Z=Z+Y
No queue: 99% 6 cyc., 1% 5 cyc.; avg. 5.99 cyc.; 5.99 ns
2 stages: 99% 7 cyc., 1% 6 cyc.; avg. 6.99 cyc.; 3.5 ns
3:3 stages: 99% 8 cyc., 1% 10 cyc.; avg. 8.02 cyc.; 1.33 ns; -19.9%; -49.8% |
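For readers who prefer a compilable form, the listings above can be mirrored roughly in C++. This is only a sketch: the function names `z_branchy`, `z_branchless` and `variants_agree` are made up for the illustration, and a modern optimizing compiler may well produce the same machine code for both forms, so it shows the idea of variants b) and c) (always compute C+D, then add A-B only when it is not negative) rather than a measured speed-up.

```cpp
#include <algorithm>
#include <cassert>

// Variant a): written exactly as the original condition reads.
// IF (A-B)<0 THEN Z=C+D ELSE Z=A-B+C+D
int z_branchy(int a, int b, int c, int d) {
    if (a - b < 0) {
        return c + d;
    }
    return a - b + c + d;
}

// Variants b)/c): compute C+D unconditionally and clamp A-B at zero.
// The two intermediate results do not depend on each other, which is
// what gives a pipelined CPU something to overlap.
int z_branchless(int a, int b, int c, int d) {
    const int x = std::max(a - b, 0);  // X = A-B, or 0 when A-B is negative
    const int z = c + d;               // independent of x
    return z + x;
}

// Small self-check: both forms must give the same Z for the same inputs.
bool variants_agree(int a, int b, int c, int d) {
    const bool same = z_branchy(a, b, c, d) == z_branchless(a, b, c, d);
    assert(same);
    return same;
}
```

Calling `variants_agree(5, 3, 1, 2)` exercises the common case (A-B is not negative) and `variants_agree(3, 5, 1, 2)` the rare one.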
The red spirit Joined: 22 Nov 15 Posts: 10 Credit: 214,036 RAC: 78 |
I wonder if there will be some GPU tasks due to the emergency. It could really speed everything up and help reach the project's goals faster and more efficiently. |
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,402 |
The previous attempt at a GPU version gave one that ran at about the SAME speed as the CPU version - a little slower on some computers, and a little faster on others. This was not considered fast enough to make further development worthwhile.
An idea for the next attempt at a GPU version of Rosetta: Since the application does not have enough sections that will run in parallel for the usual method of conversion, have the GPU run sections in parallel for what would ordinarily be at least 10 separate tasks. Have all 10 of these use the same largest input file, so that it doesn't increase the download time very much. More than one shared input file would be even better, but there must be some input for each task that is not shared, so that it does more than give at least 10 copies of the same answer.
Separate the read-only and read-write portions of the GPU memory, so that only one copy of the read-only portion is needed, plus a separate copy of the read-write portion for each ordinary task. Depending on how big these portions are, the graphics board may or may not need much more memory than most graphics boards have. The CPU section of the app also needs one copy of the read-only portion and a separate read-write portion for each ordinary task.
The rather large number of GPU cores needs to be divided into groups; for Nvidia GPUs, the smallest such group is called a warp and usually contains 32 GPU cores. At any time, all of the cores within a warp must be doing the same operation, with exceptions only if some are doing nothing. This strongly restricts branching within the program.
Graphics boards typically have a clock rate which is about a quarter of the CPU clock rate. This means that there have to be at least 4 warps for each ordinary task to finish the task at the same speed as the CPU version. Start with the number of GPU cores (say 3200), divide by 32 to get the number of warps (100), then divide by 4 to get the number of independent groups (25). So if I've guessed all of these numbers right, you can run about 25 ordinary tasks in parallel IF THERE IS ENOUGH GRAPHICS MEMORY.
I've studied CUDA but not OpenCL, so I might be able to convert sections of the code into CUDA but not into OpenCL. This would give a GPU version for Nvidia GPUs, but not for AMD or Intel GPUs, and only for running under Windows. Rosetta is rather restrictive about sharing their source code, so I don't expect them to let me try this. |
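The back-of-the-envelope estimate in the post can be written down as a small C++ sketch. The struct `GpuGuess` and both function names are hypothetical, the default values are the guesses quoted in the post (not measured figures), and the memory sizes passed to `tasks_fitting_in_memory` would have to come from the real application, which the thread does not give.

```cpp
#include <cassert>

// All defaults are the guesses from the post, not measured figures.
struct GpuGuess {
    int cores = 3200;        // total GPU cores (example figure from the post)
    int warp_size = 32;      // cores per warp on Nvidia hardware
    int warps_per_task = 4;  // warps per task, assuming a ~4x slower GPU clock
};

// Compute bound: cores -> warps -> independent groups of warps.
int concurrent_tasks(const GpuGuess& g) {
    assert(g.warp_size > 0 && g.warps_per_task > 0);
    const int warps = g.cores / g.warp_size;  // 3200 / 32 = 100
    return warps / g.warps_per_task;          // 100 / 4 = 25
}

// Memory bound: one shared read-only copy plus one read-write copy per task.
// The sizes are placeholders supplied by the caller.
int tasks_fitting_in_memory(long total_mb, long read_only_mb, long read_write_mb) {
    assert(read_write_mb > 0);
    if (total_mb <= read_only_mb) {
        return 0;  // not even the shared portion fits
    }
    return static_cast<int>((total_mb - read_only_mb) / read_write_mb);
}
```

The number of tasks that can really run side by side would be the smaller of the two results, which is the point of the "IF THERE IS ENOUGH GRAPHICS MEMORY" caveat.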
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,625,551 RAC: 6,845 |
"The previous attempt at a GPU version gave one that ran at about the SAME speed as the CPU version - a little slower on some computers, and a little faster on others. This was not considered fast enough to make further development worthwhile." The only previous attempt that I know of was over 5 years ago, and a lot of things have changed since (hw, sw, etc.). But I know there are a lot of problems in porting R@H to GPUs. "I've studied CUDA but not OpenCL, so I might be able to convert sections of the code into CUDA but not into OpenCL, for running only under Windows." Today that is not a problem with, for example, SYCL and other tools (hipSYCL, DPC++, etc.). |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
i'd guess using gpu would take reworking the 'architecture' , some codes are probably difficult to port over to a 'gpu' setup and the same codes may run poorly on a gpu. but gpu may still be useful for various tasks. in terms of gpu my thoughts are that r@h could start a different 'project' or rather a different 'app' that is. e.g. those convolutional neural network stuff these days runs off gpu as 'default' setup. a trouble with 'distributed training' is that it would take breaking the domain into 'shards' of data that can be separated cleanly. i'd guess it is one reason it isn't done. i think google is doing 'alpha fold' along those lines https://deepmind.com/blog/article/AlphaFold-Using-AI-for-scientific-discovery |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,625,551 RAC: 6,845 |
"in terms of gpu my thoughts are that r@h could start a different 'project' or rather a different 'app' that is. e.g. those convolutional neural network stuff these days runs off gpu as 'default' setup. a trouble with 'distributed training' is that it would take breaking the domain into 'shards' of data that can be separated cleanly. i'd guess it is one reason it isn't done. i think google is doing 'alpha fold' along those lines" Seems you're right: "We plan to add the latest protocols that will be tested in this coming CASP to Robetta in the near future and open it up to the public. This will not only improve the prediction quality for most targets, but will also significantly reduce the cpu computing requirements. We also are looking into the possibility of running these ML models and minimization modeling strategies to R@h which will make use of GPUs. This may require the ability to run Python, TensorFlow, and PyRosetta on BOINC clients" |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
"We plan to add the latest protocols that will be tested in this coming CASP to Robetta in the near future and open it up to the public. This will not only improve the prediction quality for most targets, but will also significantly reduce the cpu computing requirements. We also are looking into the possibility of running these ML models and minimization modeling strategies to R@h which will make use of GPUs. This may require the ability to run Python, TensorFlow, and PyRosetta on BOINC clients" my guess is these challenges may attract the attention of those 'bored' with the 'easy to run' r@h and who have powerful hardware like GTX1080/1070 or RTX2080/2070, so having them set up tensorflow, pyrosetta & gpu may be exciting for those who are eager to try those 'cutting edge' work units lol
tensorflow cnn and all those 'deep learning' stuff are a huge resource hog in terms of requiring high cpu and *gpu* power and huge amounts of memory on both the pc (like 32 GB) and gpu (e.g. 8 GB min), training can run for days at full power on the gpu for the 'deep' models and the data set is no less huge, running into hundreds of gigabytes. so 'data shards' would likely be necessary
oh btw, i noticed recently that even our plain old rosetta@home boinc network is crunching at petaflops of computing power, on par with the super computers lol |
Admin Project administrator Joined: 1 Jul 05 Posts: 4805 Credit: 0 RAC: 0 |
It would be used mainly to run the models and not train. |
Sesson Joined: 23 Mar 20 Posts: 2 Credit: 513,889 RAC: 0 |
Many science applications have a structure that makes it easier to expand their features but harder to optimize, such as object-oriented programming and iteration written without SIMD in mind. Let's say I have an application that looks like this:
    #include <cstddef>
    #include <iostream>
    #include <vector>
    // base class; the virtual destructor makes delete through a base pointer safe
    class metaclass { public: virtual ~metaclass() {} virtual int number() { return 0; } };
    class obj1 : public metaclass { public: int number() override { return 1; } };
    class obj2 : public metaclass { public: int number() override { return 2; } };
    int main(void) {
        std::vector<metaclass*> inputdata;
        std::size_t i;
        int result = 0;
        for (i = 0; i < 500; ++i) { inputdata.push_back(new obj1()); }
        for (i = 0; i < 500; ++i) { inputdata.push_back(new obj2()); }
        // one virtual call and one pointer dereference per element
        for (i = 0; i < inputdata.size(); ++i) { result += inputdata[i]->number(); delete inputdata[i]; }
        std::cout << result;
        return 0;
    }
I believe a structure like this is quite common in scientific applications. We know that this program can be rewritten to take advantage of SIMD instructions and GPU computing, but how? Using SIMD and GPUs means that any indirection, wrapper functions or even conditionals have to be removed, which makes it very hard to add new features to the program. And scientists usually optimize their applications by using better algorithms rather than by taking advantage of SIMD extensions or GPUs, unless they are not going to change the underlying algorithm at all, as with the FFT algorithm for example. |
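One possible answer to the "but how?" for this particular toy program is to store the numbers themselves in a contiguous array instead of storing objects behind pointers. The sketch below is an assumption about what such a rewrite could look like, not anything taken from Rosetta; `make_input` and `sum_values` are hypothetical names.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Instead of 1000 heap objects with a virtual number(), keep the values
// in one contiguous array of ints.
std::vector<int> make_input() {
    std::vector<int> values(500, 1);  // what the 500 obj1 instances returned
    values.resize(1000, 2);           // append what the 500 obj2 instances returned
    assert(values.size() == 1000);
    return values;
}

// A plain reduction over contiguous data: no pointer chasing and no
// virtual call, so the compiler is free to vectorize the loop.
int sum_values(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0);
}
```

The price is the one the post describes: adding a new kind of object now means changing how the array is filled, rather than deriving one more class.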
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,625,551 RAC: 6,845 |
"tensorflow cnn and all those 'deep learning' stuff are a huge resource hog in terms of requiring high cpu and *gpu* power and huge amounts of memory on both the pc (like 32 GB) and gpu (e.g. 8 GB min), training can run for days full power on the gpu for the 'deep' models and the data set is no less huge running into hundreds of gigabytes." Oh, well, sometimes the advancement of science requires sacrifices :-P "oh btw, i noticed recently that even our plain old rosetta@home boinc network are crunching at petaflops of computing power on par with the super computers lol" Great!! |
Raistmer Joined: 7 Apr 20 Posts: 49 Credit: 797,293 RAC: 0 |
To make use of a GPU, one should keep compute-intensive data structures in GPU memory all the time, or at least long enough to compensate for the transfer overhead. And then parallelization comes into play. Of course there should be a _huge_ number (not just a few as for SIMD, but tens of thousands) of identical operations performed on data arrays that can be done simultaneously (the input of one doesn't depend on the output of another). So the benefits of a GPU strongly depend on the app's algorithm. |
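The difference between operations that can be done simultaneously and operations that cannot is easy to show in plain C++. This is only a CPU-side sketch with made-up function names (`scale_all`, `running_sum`); it runs sequentially and merely marks which loop has independent iterations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent: out[i] depends only on in[i], so every element could be
// handled by a different GPU thread or SIMD lane.
std::vector<float> scale_all(const std::vector<float>& in, float factor) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * factor;
    }
    assert(out.size() == in.size());
    return out;
}

// Dependent: each out[i] needs the total up to i-1, so written this way
// the iterations have to run one after another.
std::vector<float> running_sum(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    float total = 0.0f;
    for (std::size_t i = 0; i < in.size(); ++i) {
        total += in[i];
        out[i] = total;
    }
    return out;
}
```

An algorithm dominated by loops of the first kind, over very large arrays that stay in GPU memory, is the case described above where a GPU pays off.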
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,625,551 RAC: 6,845 |
And, with OpenCL 3.0, OpenCL is dead |
Raistmer Joined: 7 Apr 20 Posts: 49 Credit: 797,293 RAC: 0 |
"And, with OpenCL 3.0, OpenCL is dead" Why? OCL 2 features are cool, but one can live w/o them. Are there real apps, especially real BOINC project apps, that make use of OpenCL 2.0? |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,625,551 RAC: 6,845 |
"And, with OpenCL 3.0, OpenCL is dead" 'Cause it is a simple rebrand of OpenCL 1.2. They abandoned OpenCL 2.x to its fate. Simply put: OpenCL 3.0 would be great... if it had been released 5 years ago. The only ray of light (a small one) is C++ for OpenCL. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,625,551 RAC: 6,845 |
A small excerpt from this recent publication about RosettaCommons: "Although we have addressed many problems over the last 20 years, there are still several outstanding issues with respect to the three main areas explored in this paper: (1) Technical: The tension between best practices in software development and rapid scientific progress means we must continually provide incentives for prioritizing maintenance, usability, and reproducibility. Additionally, we are currently reconsidering the basic abstractions and data structures that we have relied on for over ten years to make use of massively parallel hardware architectures (e.g. GPUs)." :-O |
Philip C Swift [Gridcoin] Joined: 23 Dec 18 Posts: 1 Credit: 1,109,582 RAC: 6 |
Other flavours of GPU projects are available. You don't have to stick to Milky Way tasks. |
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,402 |
"Other flavours of GPU projects are available. You don't have to stick to Milky Way tasks." For those interested only in projects related to medical research, the only choice now appears to be Folding@home, which wasn't set up to be compatible with BOINC projects. It's possible, but difficult, to run it on a computer that has BOINC running at the same time. Their forums currently aren't working. GPUGRID is BOINC-compatible and uses GPUs, but is out of workunits. They appear to be unlikely to get a new researcher who can create useful new workunits until the COVID-19 travel restrictions go away. In the meantime, it is a good place to discuss problems with running both Folding@home and BOINC at the same time. Folding@home is thinking of creating a BOINC-compatible version of their software, probably AFTER the COVID-19 travel restrictions go away. The Open Pandemics subproject at World Community Grid currently does COVID-19 work, using CPUs only, but is thinking of creating a GPU version of their software. If you are interested in other types of GPU projects, note that Asteroids@home currently has disk space problems interfering with uploads. |
kksplace Joined: 12 May 19 Posts: 7 Credit: 5,303,601 RAC: 0 |
GPUGrid again has workunits available. |
©2024 University of Washington
https://www.bakerlab.org