Will there be a 64-bit client in the near future?

Author	Message
Christian L. Joined: 12 Aug 07 Posts: 3 Credit: 203,369 RAC: 0	Message 65192 - Posted: 3 Feb 2010, 19:59:45 UTC
The title says it all.

.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0	Message 65193 - Posted: 3 Feb 2010, 21:40:47 UTC
So may I take it that none of the versions on the Applications page will do? Can you explain your problem a bit more?

Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0	Message 65214 - Posted: 5 Feb 2010, 18:57:34 UTC - in response to Message 65193.
    So may i take it that none of them on the Applications page will do ?? Can you explain your problem a bit more.
Those are wrappers: they run as if the CPU were 32-bit and don't take advantage of the 64-bit capabilities. And to the OP: it seems the work required to make a 64-bit version outweighs the speed (TFLOPS) increase, if any, of a 64-bit version over the 32-bit one.

.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0	Message 65217 - Posted: 5 Feb 2010, 21:31:29 UTC
Aha, a native 64-bit client. Some projects do have them, and as you say, if there is little to gain from going 64-bit, they will have more important things on their list.

Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0	Message 65228 - Posted: 7 Feb 2010, 23:59:40 UTC - in response to Message 65217.
    A ha, A native 64 bit client, there are some projects that have them. and as you say if there is little to gain from using 64, they will have more important things to do on the list.
There is little gain for Rosetta specifically. There are a bunch of threads that explain 64-bit clients in more detail, but the bottom line is that they would rather work on other things than develop 64-bit clients. It's far more advantageous to develop a GPU client instead anyway.

DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,255,054 RAC: 0	Message 65234 - Posted: 8 Feb 2010, 19:36:10 UTC
Well-written code shouldn't need much work, if any, to recompile with an x86_64 compiler. Just compile for multiple architectures (x86 and x86_64) from the same source. True, there are no 64-bit-specific enhancements, but you'll still use the 64-bit CPU feature set available from the compiler. (It will probably help Windows clients more than Linux clients.)

ghost Joined: 16 Oct 16 Posts: 2 Credit: 6,402,194 RAC: 0	Message 80747 - Posted: 17 Oct 2016, 7:03:49 UTC
Just discovered this thread via a search. When I run Rosetta, the name shown in Task Manager is `minirosetta_3.73_windows_x86_64.exe (32 bits)`, which is kind of weird...

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2121 Credit: 12,390,943 RAC: 179	Message 80748 - Posted: 17 Oct 2016, 18:31:34 UTC
6 years ago.....

Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0	Message 80756 - Posted: 19 Oct 2016, 7:49:35 UTC - in response to Message 80747.
    When I run Rosetta, the name shows on task manager is `minirosetta_3.73_windows_x86_64.exe (32 bits)`, which is a kind of wired...
It is a 32-bit application sent as 64-bit to keep the 64-bit BOINC client happy...

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 80761 - Posted: 20 Oct 2016, 7:06:18 UTC - Last modified: 20 Oct 2016, 7:14:03 UTC
My guess would be that the code was compiled with a commercial compiler and linked against various optimised commercial libraries that are 32-bit only and whose source code is not available. That in itself would limit the ability to go 64-bit, as 64-bit code cannot be linked against 32-bit libraries. (It wouldn't make sense to do that anyway; it would possibly be just as 'slow', given that those libraries may be doing quite a bit of the math.)

However, on Intel (x86_64) / AMD64 platforms, Win32 code runs just fine. There are lots of 32-bit apps around, and among other things they are possibly (much) less memory hungry than 64-bit apps.

The unfortunate downside is that 64-bit code tends to do double precision maths faster (possibly, say, 20-30%) than 32-bit code. The notion is that 64-bit code could possibly execute various 64-bit FP instructions in, say, 1 clock cycle (or fewer clock cycles), whereas 32-bit code doing 64-bit (double precision) FP might need perhaps double the clock cycles.

Just my 2c.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2121 Credit: 12,390,943 RAC: 179	Message 80767 - Posted: 21 Oct 2016, 8:31:16 UTC - in response to Message 80761.
    my guess would be that the codes could have been compiled using a commercial compiler linked against various optimised commercial libraries which is in 32 bits, in which source codes are not available. This in itself would limit the ability to go 64bit as 64 bits codes cannot be linked against 32 bit libraries
This may be an answer, but... are they using COMMERCIAL libraries? I thought they had THEIR OWN libraries, and that they were using open compilers (like gcc).

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 80817 - Posted: 1 Nov 2016, 8:33:09 UTC - Last modified: 1 Nov 2016, 8:39:07 UTC
Actually, I support the notion that R@h should go 64-bit, especially on the Windows platform, as the binaries today, even as of 3.73, are still 32-bit (if I'm right about it). Aside from trying other 'esoteric' optimizations such as SSE/AVX/AVX2/FMA or even GPU, just going 64-bit would most likely bring immediate gains on Windows x86_64 platforms, especially for double precision floating point maths. Some of the data processing that takes several 32-bit steps/instructions may simply be done in fewer steps, say in a single instruction, as double precision variables fit nicely in 64 bits. This could accelerate the math in the most deeply nested loops, which do the bulk of the computation.

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 80819 - Posted: 1 Nov 2016, 10:09:50 UTC - Last modified: 1 Nov 2016, 11:09:29 UTC
Got very curious about 32-bit vs 64-bit double precision maths, so I decided to do some tests. I got the plain old code for linpack and whetstone from here:

http://www.netlib.org/benchmark/
http://www.netlib.org/benchmark/linpackc.new
http://www.netlib.org/benchmark/whetstone.c

Compile them and run:

> gcc -o linpack32 -m32 -O2 linpack.c -lm
> gcc -o linpack64 -O2 linpack.c -lm

> ./linpack32
Enter array size (q to quit) [200]: 1000
Memory required: 7824K.

LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Average rolled and unrolled performance:

    Reps Time(s)  DGEFA    DGESL  OVERHEAD     KFLOPS
----------------------------------------------------
      16   0.78  96.24%   0.58%   3.18%  3566560.131
      32   1.56  96.22%   0.59%   3.19%  3559853.802
      64   3.13  96.20%   0.63%   3.17%  3535126.088
     128   6.24  96.22%   0.59%   3.19%  3549975.930
     256  12.49  96.22%   0.59%   3.19%  3550935.104

> ./linpack64
Enter array size (q to quit) [200]: 1000
Memory required: 7824K.

LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Average rolled and unrolled performance:

    Reps Time(s)  DGEFA    DGESL  OVERHEAD     KFLOPS
----------------------------------------------------
      16   0.76  95.14%   0.56%   4.30%  3704148.619
      32   1.51  95.10%   0.57%   4.32%  3702957.305
      64   3.04  95.12%   0.57%   4.32%  3694656.050
     128   6.08  95.13%   0.56%   4.31%  3688275.421
     256  12.21  95.08%   0.60%   4.32%  3674843.370

That works out to a gain of roughly 4% for linpack on a 1000x1000 matrix.

For whetstone, I made some changes to store the results in static variables, to prevent GCC from optimizing the code away.

> gcc -o whetstone32 -m32 -O2 whetstone.c -lm
> gcc -o whetstone64 -O2 whetstone.c -lm

> ./whetstone32 -c 100000
Loops: 100000, Iterations: 1, Duration: 2 sec.
C Converted Double Precision Whetstones: 5000.0 MIPS
Loops: 100000, Iterations: 1, Duration: 2 sec.
C Converted Double Precision Whetstones: 5000.0 MIPS
Loops: 100000, Iterations: 1, Duration: 2 sec.
C Converted Double Precision Whetstones: 5000.0 MIPS
Loops: 100000, Iterations: 1, Duration: 1 sec.
C Converted Double Precision Whetstones: 10000.0 MIPS

> ./whetstone64 -c 100000
Loops: 100000, Iterations: 1, Duration: 3 sec.
C Converted Double Precision Whetstones: 3333.3 MIPS
Loops: 100000, Iterations: 1, Duration: 3 sec.
C Converted Double Precision Whetstones: 3333.3 MIPS
Loops: 100000, Iterations: 1, Duration: 2 sec.
C Converted Double Precision Whetstones: 5000.0 MIPS
Loops: 100000, Iterations: 1, Duration: 3 sec.
C Converted Double Precision Whetstones: 3333.3 MIPS

The results defy intuition: 32-bit is faster, lol. The reason could be this:

https://software.intel.com/en-us/forums/intel-isa-extensions/topic/393951

    On Intel processors there are the following floating point instruction sets: FPU (8087 emulation), SSE and AVX. All three have access to an internal, very fast, internal floating point processor (engine). The FPU supports 4-byte, 8-byte, and 10-byte floating point formats as single elements (scalars). The SSE and AVX support 4-byte, 8-byte floating point formats as scalars (single variable) or small vectors (2 or more elements). Ignoring the multiple element formats in SSE and AVX, the latency of a floating point multiply is on the order of 4 clock cycles (this will extend for memory references). Throughput can be as little as 1-2 clock cycles. When the problem involves a large degree of RAM reads and writes, the program is waiting for the memory as opposed to waiting for the floating point operations. Note, when small vectors can be used, the computation time can be significantly reduced (1/2, 1/4. 1/8) memory subsystem overhead can be reduced per floating operation, but the demands on memory subsystem may increase.
    Jim Dempsey

Storing data from CPU registers to memory 'killed' the whetstone benchmark: it makes the benchmark dependent on the bottleneck of moving data to memory, so it no longer simply measures floating point calculation speed. It is stalled by memory moves. But without the code that stores to variables (i.e. RAM), GCC is too 'smart' and would optimise away (remove) the calculations, simply because the results are not used anywhere.

Hence the double precision speedup or slowdown between 32-bit and 64-bit could depend on the processor and manufacturer, e.g. AMD vs Intel. I'd guess the differences may even be noticeable between different CPU releases / generations.

This has a lot of implications, considering that BOINC awards points based on this very legacy whetstone benchmark. It implies that the whetstone benchmark measures CPU-memory bottlenecks rather than computational prowess. That is, the GFLOPS figure is not how fast the CPU calculates but rather how fast those numbers can be moved between CPU registers and memory, lol.

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 80820 - Posted: 1 Nov 2016, 15:45:34 UTC - Last modified: 1 Nov 2016, 16:36:50 UTC
As it turns out, floating point on x86 is as old as the Intel x87 FPU, and there it is always 80 bits. The x87 treats single precision and double precision the same way. This is unlike many recent GPUs, where there is a drastic difference between single precision and double precision throughput; for GPUs the single precision : double precision speed ratio can be as much as 32:1, e.g. https://en.wikipedia.org/wiki/GeForce_10_series

On the Intel x87 FPU, by contrast, everything is 80 bits. There is no 'single precision'; everything is 'double extended precision' on x87, all the way to the newest Intel/AMD CPUs (Haswell, Skylake, etc.).

http://home.agh.edu.pl/~amrozek/x87.pdf

    8.2. X87 FPU DATA TYPES
    The x87 FPU recognizes and operates on the following seven data types: single-precision floating point, double-precision floating point, double extended-precision floating point, signed word integer, signed doubleword integer, signed quadword integer, and packed BCD decimal integers. With the exception of the 80-bit double extended-precision floating-point format, all of these data types exist in memory only. When they are loaded into x87 FPU data registers, they are converted into double extended-precision floating-point format and operated on in that format.

The magic of floating point on x86, including x86_64, is all in the FPU, where everything is 80 bits. This could explain why 32-bit vs 64-bit code didn't matter. In fact the 32-bit code is 'faster'; this is likely an artifact / accident of the CPU cache, i.e. the 32-bit whetstone benchmark's instructions and possibly its various local data (variables) may 'live' completely inside the CPU cache due to the smaller code size and data layout. This may explain, to an extent, why the 32-bit whetstone code 'runs faster' and gives more GFLOPS.

If this is true, you could possibly run the 32-bit version of the BOINC client (even on a 64-bit OS, including Windows) and get higher GFLOPS on the BOINC whetstone benchmark, and possibly higher credits as well. Not just R@h, everything under BOINC, lol.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2121 Credit: 12,390,943 RAC: 179	Message 80823 - Posted: 4 Nov 2016, 8:49:18 UTC - Last modified: 4 Nov 2016, 8:57:09 UTC
64 bits, obviously, are the future. All OSes are abandoning the 32-bit architecture (including Android). It's just a matter of time.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2121 Credit: 12,390,943 RAC: 179	Message 80824 - Posted: 4 Nov 2016, 8:53:33 UTC - in response to Message 80817.
    some of the data processing which takes several 32 bits steps/instructions may simply be done in less steps say in a single instruction as double precision variables fit nicely in 64 bits. this could benefit / accelerate math / codes in those deepest nested parts of the loops which does bulk of computations
I think it's not only about accelerating the math. The admins said, in the past, that they simulate "little" proteins because memory usage is high. With 64 bits that problem is solved.

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 80825 - Posted: 4 Nov 2016, 19:11:21 UTC - in response to Message 80824. Last modified: 4 Nov 2016, 19:25:20 UTC
        some of the data processing which takes several 32 bits steps/instructions may simply be done in less steps say in a single instruction as double precision variables fit nicely in 64 bits. this could benefit / accelerate math / codes in those deepest nested parts of the loops which does bulk of computations
    I think it's not only "accelerate math". Admins said, in the past, that they simulate "little" proteins because the memory usage is high. With 64 bits the problem is solved.
Yup, that's quite true. With 64 bits it is easier to access more than 4GB of memory; with 32 bits, a somewhat more 'complicated' setup, PAE, is needed for that. https://en.wikipedia.org/wiki/Physical_Address_Extension

32 bits does have the advantage that in various scenarios it is less 'wasteful' of memory, though. E.g. if small integers are mostly stored in 32 bits, that's 4 bytes each, while in 64 bits 8 bytes are used; multiply that by a million items in an array and that becomes 8 megs vs 4 megs. But of course memory is sort of 'cheap' these days, lol.

If I'm not wrong, one of the big advantages of going 64-bit is that in 32-bit mode there are 8 AVX SIMD registers available, while 64-bit mode makes that 16 AVX SIMD registers. That makes it much more flexible (and possibly faster) for vectorised SIMD code that runs on AVX. https://en.wikipedia.org/wiki/Advanced_Vector_Extensions

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 80826 - Posted: 4 Nov 2016, 19:46:25 UTC - Last modified: 4 Nov 2016, 20:01:40 UTC
Repeated the previous linpack test (http://www.netlib.org/benchmark/linpackc.new), but this time with compiler optimizations for AVX/AVX2/FMA turned on:

> gcc -o linpack32 -m32 -O3 -mavx -mavx2 -mfma linpack.c -lm
> gcc -o linpack64 -O3 -mavx -mavx2 -mfma linpack.c -lm

> ./linpack32
Enter array size (q to quit) [200]: 1000
Memory required: 7824K.

LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Average rolled and unrolled performance:

    Reps Time(s)  DGEFA    DGESL  OVERHEAD     KFLOPS
----------------------------------------------------
      16   0.65  92.87%   0.52%   6.61%  4408256.717
      32   1.30  92.87%   0.53%   6.60%  4407119.733
      64   2.61  92.87%   0.53%   6.60%  4408167.983
     128   5.21  92.90%   0.53%   6.57%  4407866.492
     256  10.42  92.90%   0.53%   6.58%  4408174.773

> ./linpack64
Enter array size (q to quit) [200]: 1000
Memory required: 7824K.

LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Average rolled and unrolled performance:

    Reps Time(s)  DGEFA    DGESL  OVERHEAD     KFLOPS
----------------------------------------------------
      16   0.58  94.71%   0.53%   4.76%  4823499.938
      32   1.17  94.71%   0.52%   4.76%  4823708.094
      64   2.34  94.71%   0.52%   4.76%  4823911.930
     128   4.67  94.71%   0.52%   4.76%  4825073.475
     256   9.31  94.70%   0.52%   4.78%  4840441.868
     512  18.66  94.71%   0.52%   4.77%  4829583.147

Note that these are all single-core figures.

This time round the 64-bit linpack is almost 10 (9.6) percent faster than the similarly optimised (AVX) 32-bit linpack. (This is not yet the fastest possible; it is simply compiler optimised.)

And if the 64-bit AVX/AVX2/FMA linpack is compared to the original 32-bit linpack (without the AVX/AVX2/FMA optimizations), it is a much better 35.4 percent improvement. Note that the original 32-bit app was built with -O2, which also means optimised code.

64-bit + AVX/AVX2/FMA optimizations is a clear winner here.

If pushed (highly optimised) to the metal, the now 'old' Haswell i7-4770k CPU manages a whopping 177 GFLOPS of multi-core vectorized double precision floating point performance. https://www.pugetsystems.com/labs/hpc/Haswell-Floating-Point-Performance-493

As most GPUs have rather poor double precision floating point performance, that makes today's Intel i7 Haswell/Skylake/Kaby Lake comparable to the higher performance (expensive) GPUs sold on the market today in terms of double precision floating point performance. And for that matter, the Intel i7 Haswell/Skylake/Kaby Lake CPUs use considerably less power, giving a much better performance per watt (much better energy efficiency) score than the GPUs.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2121 Credit: 12,390,943 RAC: 179	Message 80827 - Posted: 7 Nov 2016, 9:32:49 UTC - in response to Message 80826.
    as most of the GPUs has rather poor double precision floating point performance, that makes today's intel i7 haswell/skylake/kabilake comparable to the higher performance (expensive) GPUs sold on the market today in terms of double precision floating point performance.
DP is for HPC GPUs like Nvidia Tesla or AMD FirePro. Home GPUs are great with SP. But these are only academic discussions (very interesting, but theoretical); the fact is that this thread was opened 6 years ago!! Do we have to wait another 6 years to get some results?

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 80836 - Posted: 11 Nov 2016, 14:57:05 UTC - in response to Message 80825.
            some of the data processing which takes several 32 bits steps/instructions may simply be done in less steps say in a single instruction as double precision variables fit nicely in 64 bits. this could benefit / accelerate math / codes in those deepest nested parts of the loops which does bulk of computations
        I think it's not only "accelerate math". Admins said, in the past, that they simulate "little" proteins because the memory usage is high. With 64 bits the problem is solved.
    yup that's quite true, with 64 bits it is easier to access more than 4GB memory, with 32bits, a somewhat more 'complicated' setup PAE is needed for that. https://en.wikipedia.org/wiki/Physical_Address_Extension 32 bits do have an advantage that in various scenarios it is less 'wasteful' of memory though, e.g. that if for most time small integers are used in 32 bits that's 4 bytes, while in 64 bits 8 bytes are used, multiply that by a million items in an array that becomes 8 megs vs 4 megs. but of course memory is sort of 'cheap' these days lol if i'm not wrong, among them one of the big advantages of going 64 bits is that in 32 bits mode there are 8 AVX SIMD registers available, while going to 64 bits makes that 16 AVX SIMD registers, that makes it much more flexible (and possibly faster) for vectorised SIMD codes that runs on AVX https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
There is little if any code that would currently benefit from 64-bit integers. I originally thought that the larger number of registers available in 64-bit mode would help, but the increased code and data size of 64-bit code did more damage to the caches than the register SPILL/FILLs necessary in 32-bit mode (caused by fewer registers). That is what I measured when I actually recompiled the code as 32-bit AND 64-bit.

Rosetta spends a large chunk of its time computing "relationships" between 2 points in 3 dimensions (using floating point math).
Rosetta makes an X-dimension 64-bit floating point calculation.
Rosetta makes a Y-dimension 64-bit floating point calculation.
Rosetta makes a Z-dimension 64-bit floating point calculation.

You can change the TYPE DEFINITION of that "point" description to just add a 4th "dummy" dimension, which will allow the compiler to do a SIMD vector load of all 4 dimensions, perform the operation on all 4 dimensions, and then do a SIMD vector store. The compiler will change the 3 sequential SCALAR operations on 3 dimensions into a SINGLE PARALLEL operation on 4 dimensions. If you add the 4th dimension in the TYPEDEF, you do not need to make ANY other source code changes for the compilers to automatically generate the low-level code to perform the parallel LOAD-OPERATION-STORE. VERY low-hanging fruit.

The Rosetta developers said they were already "familiar" with this technique when I pointed it out last year. It WOULD be their first, easy step to take if "low" performance was a problem for them. 32-bit integer versus 64-bit integer code really makes no difference unless the Rosetta code undergoes major changes.