GPU FLOPS: Theory vs Reality

Wiggo
Joined: 24 Jan 00
Posts: 36659
Credit: 261,360,520
RAC: 489
Australia
Message 1808456 - Posted: 11 Aug 2016, 6:47:03 UTC - in response to Message 1808426.  

Wow, I was hoping for more from the 1060s; they are still behind the 1070 in credit per watt-hour. Looks like we'll have to wait till they (hopefully) release the 1050Ti and see if that can take the crown back from the 1070, especially if its price can eventually be competitive with the 750, though that will probably take a while. I am just finishing setting up my new dual-1060 system, so it'll be interesting to see how they do as a pair. I still need to read up on the best way to configure them; it seems to take a lot of .xml editing, beyond the standard Lunatics install, to get the most out of them.

Just remember that the 1060 is still much younger than the 1070, so there aren't anywhere near enough of them out there yet to be a good indicator. In actual fact they are making a good impact so far and could still wind up being the next 750Ti.

Just be a little more patient until the numbers even out a bit more. ;-)

Cheers.
ID: 1808456
Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1808461 - Posted: 11 Aug 2016, 7:40:37 UTC - in response to Message 1808456.  

Wow, I was hoping for more from the 1060s; they are still behind the 1070 in credit per watt-hour. Looks like we'll have to wait till they (hopefully) release the 1050Ti and see if that can take the crown back from the 1070, especially if its price can eventually be competitive with the 750, though that will probably take a while. I am just finishing setting up my new dual-1060 system, so it'll be interesting to see how they do as a pair. I still need to read up on the best way to configure them; it seems to take a lot of .xml editing, beyond the standard Lunatics install, to get the most out of them.

Just remember that the 1060 is still much younger than the 1070, so there aren't anywhere near enough of them out there yet to be a good indicator. In actual fact they are making a good impact so far and could still wind up being the next 750Ti.
Just be a little more patient until the numbers even out a bit more. ;-)
Cheers.

I agree ... judging by how the error bars (distributions) for the 1060 and 1070 overlap by more than a bit.
The 1070 is in the lead, but it might be a photo finish some day.

But for me, these are only stock stats. They just show what "set & forget" volunteers get with stock apps and whatever mix of nonVLARs & guppis the scheduler sends them.

I'd be much more interested in optimized, identical rigs (Lunatics + rescheduler + command lines) running different GPU apps or slightly different GPUs.
Otherwise we're only comparing generic horses and not Kentucky Derby stallions!
(sorry Shaggie, your graphs are still amazing but now I want more but different!!! lol)

Cheers,
RobG the victim
ID: 1808461
Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1808468 - Posted: 11 Aug 2016, 8:33:18 UTC - in response to Message 1807174.  
Last modified: 11 Aug 2016, 8:33:38 UTC

Today I tried some hacks to automatically identify how many concurrent tasks were being executed (I used the "Time reported" field with the run-time fields to identify active task intervals and checked them for overlaps with other tasks).
Tragically this doesn't work, because the time-stamp is from the upload, not the completion; even for computers that upload "immediately" there's a five-minute cool-down, and you can get a bunch of shorties in that time that look concurrent. I could try to ignore shorties, but it would still break for people who only allow network usage on a schedule.
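(For reference, the overlap check being described boils down to something like the sketch below; it is purely illustrative, with made-up field names, and is not the actual script.)

import itertools

def active_interval(task):
    # Approximate each task's active window as [reported - run_time, reported].
    return (task["reported"] - task["run_time"], task["reported"])

def looks_concurrent(tasks):
    # Report any pair of tasks whose approximate intervals overlap.
    pairs = []
    for a, b in itertools.combinations(tasks, 2):
        a_start, a_end = active_interval(a)
        b_start, b_end = active_interval(b)
        if a_start < b_end and b_start < a_end:
            pairs.append((a["name"], b["name"]))
    return pairs

# The flaw: "reported" is the upload time, so a batch of shorties uploaded in
# the same five-minute window appears to overlap even if they ran one after another.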

That's unfortunate... and I've been thinking of possible alternatives for a few days now. Here's my "better" idea:

For hosts with "anonymous platform" (and a very low total of "Invalid" tasks), how about concentrating on the last 12-hour window?
You wouldn't be able to compare using credits, but you should be able to tally the returned tasks (excluding shorties) that are: In progress, Validation pending, Validation inconclusive, and Valid.

So a host with an i5-4590 and a GTX 1080 could then be compared with a similar host with (let's say) an i5-4590 and a GTX 1070.

If that works and provides interesting data, you could then split it by CPU and GPU, and then only concentrate on the GPUs (at first).
You might even be able to do a graph of: nonShorties/WattHour.
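A rough sketch of what that tally could look like (purely illustrative; it assumes the host's task list has already been scraped into records with a state, a sent/reported time and a run time, and none of these field names come from Shaggie's actual script):

from datetime import datetime, timedelta

COUNTED_STATES = {"In progress", "Validation pending",
                  "Validation inconclusive", "Valid"}
SHORTY_CUTOFF = 600  # seconds; an assumed run-time threshold for "shorties"

def tally_last_12_hours(tasks, now=None):
    # Count non-shorty tasks in the last 12 hours, broken down by state.
    now = now or datetime.utcnow()
    window_start = now - timedelta(hours=12)
    counts = {state: 0 for state in COUNTED_STATES}
    for task in tasks:
        if task["state"] not in COUNTED_STATES:
            continue
        # "In progress" tasks have no reported time yet, so fall back to sent time.
        when = task.get("reported") or task["sent"]
        if when < window_start:
            continue
        if task.get("run_time") and task["run_time"] < SHORTY_CUTOFF:
            continue  # exclude shorties
        counts[task["state"]] += 1
    return counts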

Does that seem feasible?
RobG :-)
ID: 1808468
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22508
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1808478 - Posted: 11 Aug 2016, 10:35:37 UTC - in response to Message 1808422.  

Interesting - it looks as if the GTX 750 & GTX 750 Ti have been knocked off the top of the credits-per-watt chart - not something I expected so soon after the launch of the GTX 10xx series!!!

A question - what are the sample sizes for the top 10?
(As you are no doubt aware, a single event in a small sample will have a bigger impact on the error bars than the same event in a much larger sample.)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1808478
Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1808492 - Posted: 11 Aug 2016, 12:42:07 UTC - in response to Message 1808432.  

Hmmm... I wonder where CPUs would end up on that graph. Would you mind running that wonderful script on my CPU-only machine? ID 8034200
The Xeon E3-1230 v3 TDP is 80 W, whilst I am measuring 63 W off the wall; I am only running 7 tasks at once.


Host, API, Device, Credit, Seconds, Credit/Hour, Work Units
8034200, cpu, Intel Xeon E3-1230 v3 @ 3.30GHz, 4514.9, 58438.03875, 278.134590887515, 49

However, that hourly rate assumes 8 concurrent tasks, because the script is blind to concurrency and estimates based on core count.
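Assuming the figure is simply a per-core rate scaled by the core count (which is my reading of "estimates based on core count"), it can be rescaled for the number of tasks actually running at once; a minimal sketch, not the script's actual code:

def adjust_for_actual_concurrency(reported_rate, assumed_tasks, actual_tasks):
    # If the reported credit/hour was scaled by an assumed task count,
    # rescale it by the number of tasks actually running concurrently.
    return reported_rate * actual_tasks / assumed_tasks

# e.g. adjust_for_actual_concurrency(278.1, 8, 7) is roughly 243 credit/hour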
ID: 1808492
Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1808494 - Posted: 11 Aug 2016, 12:43:35 UTC - in response to Message 1808446.  

I am also a bit surprised that the difference between the GTX 1080 and GTX 1070 is so small, since the GTX 1080 has 33% more compute units (2560 vs 1920 shaders) and 25% faster memory (10 GHz vs 8 GHz), and it's even clocked a bit higher, so is something holding the GTX 1080 back? Even NVIDIA was advertising the GTX 1080 as 8.9 TFLOPS and the GTX 1070 as 6.5 TFLOPS.

I would guess that the current application implementation is not using the extra resources well... Maybe it's time for a new, optimized application?


One theory suggested by reviewers is that it's bandwidth-bound; the performance difference between the 1070 and the 1080 is probably pretty close to the bandwidth difference.
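As a quick sanity check of that theory, using the figures quoted above (a back-of-the-envelope comparison, nothing more):

# GTX 1080 vs GTX 1070, from the numbers quoted in this thread
flops_ratio = 8.9 / 6.5        # about 1.37, i.e. ~37% more raw compute
bandwidth_ratio = 10.0 / 8.0   # exactly 1.25, i.e. 25% more memory bandwidth
# If the observed credit/hour gap is closer to 25% than 37%,
# the bandwidth-bound explanation fits.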
ID: 1808494
Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1808495 - Posted: 11 Aug 2016, 12:47:59 UTC - in response to Message 1808478.  

Interesting - it looks as if the GTX 750 & GTX 750 Ti have been knocked off the top of the credits-per-watt chart - not something I expected so soon after the launch of the GTX 10xx series!!!

A question - what are the sample sizes for the top 10?
(As you are no doubt aware, a single event in a small sample will have a bigger impact on the error bars than the same event in a much larger sample.)


I'm not at the machine that has the data on it, but I can get that later. I was actually interested in trying to visualize the (stock) CUDA vs OpenCL data, because it's pretty damning.
ID: 1808495
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22508
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1808501 - Posted: 11 Aug 2016, 13:10:22 UTC

That's OK, no hurry; I'm just curious, as the lengths of the error bars aren't quite what I would expect.

Very good, effective graphics - I think you're the only person trying to produce a visual description of the performance of GPUs.

On your question about timings and getting "good" data: if you could get access to the stderr files for a decent sample of the tasks, you would be able to find the actual elapsed time for each task. There are several problems with the idea, not the least of which is having to sort through thousands of files and pick out the real elapsed time, which has a slightly different format for each application I've looked at in the past :-( I did it a few months ago on a couple of hundred tasks, and it took many hours to work out all the formats and then write a very simple Excel macro to extract them reliably (once I'd downloaded the ones I wanted...).
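A minimal sketch of that kind of multi-format extraction (the patterns here are only illustrative; the real stderr layouts differ by application and would have to be checked against actual files):

import re

# Each application reports its elapsed time slightly differently; these
# patterns are examples of the sort of variation involved, not the real ones.
ELAPSED_PATTERNS = [
    re.compile(r"Elapsed time[:=]\s*([\d.]+)\s*sec", re.IGNORECASE),
    re.compile(r"run\s*time[:=]\s*([\d.]+)", re.IGNORECASE),
    re.compile(r"([\d.]+)\s*secs?\s+elapsed", re.IGNORECASE),
]

def extract_elapsed_seconds(stderr_text):
    # Return the first elapsed-time figure any known pattern matches, else None.
    for pattern in ELAPSED_PATTERNS:
        match = pattern.search(stderr_text)
        if match:
            return float(match.group(1))
    return None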
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1808501
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1808511 - Posted: 11 Aug 2016, 13:28:10 UTC - in response to Message 1808494.  
Last modified: 11 Aug 2016, 13:29:59 UTC

I am also a bit surprised that the difference between the GTX 1080 and GTX 1070 is so small, since the GTX 1080 has 33% more compute units (2560 vs 1920 shaders) and 25% faster memory (10 GHz vs 8 GHz), and it's even clocked a bit higher, so is something holding the GTX 1080 back? Even NVIDIA was advertising the GTX 1080 as 8.9 TFLOPS and the GTX 1070 as 6.5 TFLOPS.

I would guess that the current application implementation is not using the extra resources well... Maybe it's time for a new, optimized application?


One theory suggested by reviewers is that it's bandwidth-bound; the performance difference between the 1070 and the 1080 is probably pretty close to the bandwidth difference.


With the current CUDA application, guppi tasks (mostly) use just one SMX in pulse finding. The difference between cards comes from GPU clock, memory speed and the number of cores per SMX. There are 20 SMX units in a 1080, 16 in a 980 and 12 in a 780; all but one are completely idle during VLAR and guppi tasks. The same may apply to the OpenCL version too.

The difference between cards will reflect true performance when the new CUDA app is ready. Pulse finding can be made to utilize the GPU at near 100%.
The same applies partially to all other parts of the process - Gaussians, autocorrelation, spikes, triplets and chirping - although they are already more parallelized than pulse finding.

All parts of the process can be distributed to GPU queues and the GPU will distribute the load automatically.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1808511
Kiska
Volunteer tester
Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1808512 - Posted: 11 Aug 2016, 13:44:20 UTC - in response to Message 1808511.  

From what I have observed, Raistmer's app uses all SMXs, or close to it, depending on the tuning and settings used.
ID: 1808512
Kiska
Volunteer tester
Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1808513 - Posted: 11 Aug 2016, 13:45:27 UTC - in response to Message 1808492.  
Last modified: 11 Aug 2016, 13:45:50 UTC

Thanks for running that. It looks like it is getting credit on a level with a GTX 950 and AMD Fiji.
ID: 1808513
Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1808524 - Posted: 11 Aug 2016, 14:40:27 UTC - in response to Message 1808513.  

Thanks for running that. It looks like it is getting credit on a level with a GTX 950 and AMD Fiji.

Hey Kiska,
I don't know if I'm missing an extrapolation in your analysis.
I'm looking at the left graph, and it looks like it is about the same as a GTX 750 Ti at ~300 cr/hr.

8034200, cpu, Intel Xeon E3-1230 v3 @ 3.30GHz ===>> Credit/Hour: 278.1

Cheers,
RobG
ID: 1808524
Kiska
Volunteer tester
Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1808550 - Posted: 11 Aug 2016, 18:19:10 UTC - in response to Message 1808524.  
Last modified: 11 Aug 2016, 18:20:57 UTC

I was going off credit/W-hr (the right graph). I know the Intel sheet says 80 W TDP, but I am measuring 63 W from the wall.
ID: 1808550
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22508
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1808554 - Posted: 11 Aug 2016, 18:26:39 UTC

Two things:
Power from the wall is the power going into the PSU, not the power being used by the CPU, so it will be greater than what the CPU is actually drawing.

Second, TDP is the maximum power that the chip's thermal management system can cope with. Hopefully we will never reach it in normal circumstances, as doing so can have deleterious effects on the life expectancy of the chip. (A life lesson from a previous job.)

Sounds as if your CPU is running nicely in the "Live long and prosper" region of its power capabilities :-)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1808554
Kiska
Volunteer tester
Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1808559 - Posted: 11 Aug 2016, 18:30:31 UTC - in response to Message 1808554.  

Then I hope you can stay away from mobile CPUs. My i5-4210U regularly hits its 15 W TDP :p
From HWMonitor it's saying 61 W usage :) Not too much lost to PSU inefficiencies and other things using power.
ID: 1808559
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22508
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1808560 - Posted: 11 Aug 2016, 18:45:49 UTC

A lot of power is lost in laptops in the various power conversion and battery control functions; they can be frighteningly inefficient - those little "wall wart" power supplies rarely manage 50% in real life. I know some of them claim 80 or 90%, but that is under ideal test conditions.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1808560
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1808561 - Posted: 11 Aug 2016, 18:49:52 UTC - in response to Message 1808512.  

From what I have observed, Raistmer's app uses all SMXs, or close to it, depending on the tuning and settings used.


I observed a slightly modified (added streams) CUDA app giving high GPU usage (94% running one task at a time) but low power consumption. I modified the pulse-find code to use more SMXs (lots of blocks with 32 or 64 threads each); GPU usage stayed the same, but power consumption jumped and runtimes roughly halved.

The generic GPU usage figure reflects only the first SMX's usage.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1808561
M_M
Joined: 20 May 04
Posts: 76
Credit: 45,752,966
RAC: 8
Serbia
Message 1808569 - Posted: 11 Aug 2016, 19:30:34 UTC - in response to Message 1808561.  
Last modified: 11 Aug 2016, 19:33:19 UTC

The generic GPU usage figure reflects only the first SMX's usage.

So this is the catch: most developers probably rely entirely on the generic GPU usage indication when optimizing their GPU code and judging whether it is squeezing the maximum from the GPU, but that is misleading - they should actually pay more attention to power consumption if they want to optimize code for maximum efficiency and performance...
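For what it's worth, both figures are easy to log side by side; a rough sketch (it assumes nvidia-smi is on the PATH and that the driver supports these query fields, which recent ones do):

import subprocess
import time

def sample_gpu(interval=1.0, samples=10):
    # Log utilization and power draw together: utilization alone can read
    # high while most SMs sit idle, so the power figure is a useful cross-check.
    for _ in range(samples):
        out = subprocess.check_output([
            "nvidia-smi",
            "--query-gpu=utilization.gpu,power.draw",
            "--format=csv,noheader,nounits",
        ]).decode().strip()
        print(out)  # e.g. "94, 87.50"  ->  94 % utilization, 87.5 W
        time.sleep(interval)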
ID: 1808569
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1808574 - Posted: 11 Aug 2016, 19:49:59 UTC - in response to Message 1808569.  
Last modified: 11 Aug 2016, 19:50:35 UTC

The generic GPU usage figure reflects only the first SMX's usage.

So this is the catch: most developers probably rely entirely on the generic GPU usage indication when optimizing their GPU code and judging whether it is squeezing the maximum from the GPU, but that is misleading - they should actually pay more attention to power consumption if they want to optimize code for maximum efficiency and performance...


Yeah, there are other problems (beyond the fact that it reads activity on the first SM only) with relying on that utilisation figure - mostly its timescale compared to real launches. Also, there are only 2 copy engines, and HyperQ is only on Big Kepler or higher compute capabilities, which limits streaming. I have come across developers using the generic utilisation figure that way, which is pretty rough given that I can spin the GPU to read 100% while doing nothing (which I have done in test pieces to see what would happen, and to measure memory/transfer performance).

Probably what we'll be doing is looking at Petri's streaming code in unit-test portions under the microscope (i.e. NVIDIA Nsight). Who knows, we could even squeeze some of the more useful refinements into some of the lesser models, even though there are hardware limitations on the number of concurrent streams etc.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1808574
Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1808630 - Posted: 12 Aug 2016, 1:32:36 UTC

I was asked about the sample size: here's the data from the last scan -- I collect data from random hosts and, depending on how many there are, I average the top quarter or top half of them to get the mean (to try to eliminate hosts with low scores that might be running multiple tasks at once or perhaps be shared machines).
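Roughly, that trimming step looks like this (a sketch of the idea only, not the actual script):

def trimmed_mean(rates, fraction=0.25):
    # Average only the top `fraction` of per-host rates, to screen out hosts
    # that run several tasks per GPU or share the card with other work.
    if not rates:
        return 0.0
    ordered = sorted(rates, reverse=True)
    keep = max(1, int(len(ordered) * fraction))
    top = ordered[:keep]
    return sum(top) / len(top)

# fraction would be 0.25 or 0.5 depending on how many hosts were sampled.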

ID: 1808630