Message boards : Number crunching : OpenCL vs CUDA (Stock)
Shaggie76 (Joined: 9 Oct 09, Posts: 282, Credit: 271,858,118, RAC: 196)
On request, using the same data I collected for my most recent GPU rankings, I parsed out data for tasks running the stock CUDA app and generated a comparison with OpenCL. As you can see, the CUDA app generates less credit per hour on modern GPUs.
jason_gee (Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0)
Hmmm, I actually expected OpenCL to be further ahead with a single instance on stock app/settings. In the Cuda app's case there are presently hard scaling limits (likely to be lifted later, at optional CPU cost). Would it be easy to extract the CPU usage [time] per task, as an average % of elapsed time I suppose, for the same models/app? It might give an idea of how many Cuda streams we could feed while staying on a single core (for the next gen).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
M_M (Joined: 20 May 04, Posts: 76, Credit: 45,752,966, RAC: 8)
I somehow expected this: the latest GPUs are not used well by the now-old Cuda 5.0, while OpenCL is somewhat higher-level programming and the OpenCL driver itself does a better job of optimizing the task for the actual higher-end GPU hardware. Sure, well-written Cuda 7.5/8.0 code would probably be better still, but that requires some additional human effort to be put in...
jason_gee (Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0)
> I somehow expected this: the latest GPUs are not used well by the now-old Cuda 5.0, while OpenCL is somewhat higher-level programming and the OpenCL driver itself does a better job of optimizing the task for the actual higher-end GPU hardware.
The biggest issue there is the switch to tasks the Cuda app was never designed to run efficiently, plus the Cuda app's preference for low CPU use, which underfeeds the newest/largest GPUs with them. After some tests I'm very hopeful Petri's contributed code will go a long way. I am, though, requesting more information on the CPU usage here, because this is a fairly critical CPU/GPU load-balancing issue when it comes to figuring out a way to scale across the range without locking up hosts/tasks.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13746, Credit: 208,696,464, RAC: 304)
> After some tests I'm very hopeful Petri's contributed code will go a long way. I am, though, requesting more information on the CPU usage here, because this is a fairly critical CPU/GPU load-balancing issue when it comes to figuring out a way to scale across the range without locking up hosts/tasks.
The advantage of Petri's code is that it makes as much use of the GPU as possible when running only a single WU, so only 1 core per GPU will be necessary for maximum performance. At present, particularly with high-end hardware, you need 3-5 cores per GPU to get maximum performance, because multiple WUs must be run to make use of the available GPU hardware. 1 CPU core per GPU will only be an issue for the oldest, most basic systems.
The default stock installation should be set to have minimal impact on system usability. For those with 1 or more GPUs and 4 or more CPU cores, give the option to reserve a CPU core and use a more tuned setting for their GPU. Possibly add a third option for dedicated crunchers: the most tuned configuration for their hardware, irrespective of the effect on system usability; alternatively, that third option is for them to use Lunatics to select the application and configuration of their choosing.
Grant
Darwin NT
jason_gee (Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0)
> After some tests I'm very hopeful Petri's contributed code will go a long way. I am, though, requesting more information on the CPU usage here, because this is a fairly critical CPU/GPU load-balancing issue when it comes to figuring out a way to scale across the range without locking up hosts/tasks.
Yeah, I'd agree with the conservative stock sentiment. Where the complexity comes into the choices/options is the moving target: scale to use a full CPU core to fill a new GPU now, and the problem manifests again when Volta surfaces next year (or so). More than likely, as more of this information becomes better known, better self-scaling can be employed to some degree, and flexibility in choices for those who want it will be the way to go. Probably, at least on the Cuda Xbranch side, that will continue to appear as conservative defaults, with the addition of optional tools for configuration/tweaking. Side issues arise from the build systems, cross-platform concerns, and the limits of shoehorning performance code into the existing design. The OpenCL lead is being taken as an opportunity to take Cuda into dry dock for a complete redesign and refit, as opposed to patching what has been a relatively solid design but is now quite dated.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Raistmer (Joined: 16 Jun 01, Posts: 6325, Credit: 106,370,077, RAC: 121)
According to the site statistics (Windows/x86 8.00 (cuda23), 22 Jan 2016, 0:38:52 UTC: 1,076 GigaFLOPS), both SAH and SoG overall outperform CUDAxx. But there is a non-zero share for the SAH app relative to SoG too (34:89). It would be interesting to establish which cards prefer SAH over SoG, and why. It could be a card-type-based choice (that is, model XYZ prefers SAH over SoG on almost all hosts where that model is installed) or just a random choice by the server due to close-enough performance; in the latter case, any particular XYZ model will crunch SAH on some hosts and SoG on others. Such a survey would be useful for increasing project performance, because I plan to omit the SAH build from the next release (provided no particular GPU model strongly prefers it over SoG).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13746, Credit: 208,696,464, RAC: 304)
> It would be interesting to establish which cards prefer SAH over SoG, and why. It could be a card-type-based choice (that is, model XYZ prefers SAH over SoG on almost all hosts where that model is installed) or just a random choice by the server due to close-enough performance.
In my case on Beta, the slowest application was chosen because of the work types being allocated at the time. SoG and Cuda50 got almost all Guppie work, SaH got mostly Guppie with some Arecibo work, and Cuda42 got almost all Arecibo work, so Cuda42 had the best processing rate and was picked as the best application, even though it was actually the slowest.
Grant
Darwin NT
TBar (Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768)
Here's an interesting comparison: my GTX 950 running the latest Special CUDA versus a GTX 780 running the stock OpenCL. http://setiathome.berkeley.edu/workunit.php?wuid=2237103948
Task 5097350994 (computer 6909215): sent 14 Aug 2016, 22:56:00 UTC; reported 14 Aug 2016, 23:12:16 UTC; Completed and validated; run time 367.83 s; CPU time 367.83 s; credit 56.24; SETI@home v8 v8.12 (opencl_nvidia_SoG) windows_intelx86
Task 5097350995 (computer 7199204): sent 14 Aug 2016, 22:55:50 UTC; reported 14 Aug 2016, 23:06:08 UTC; Completed and validated; run time 375.86 s; CPU time 353.63 s; credit 56.24; SETI@home v8 Anonymous platform (NVIDIA GPU)
We'll have to wait until the afternoon to see if the latest version is any better with the GUPPI Pulse count; about half the GUPPI tasks are off by one Pulse count in the older versions. Once that one little Pulse count is solved, the App will be viable. This is a typical GUPPI task with the new App: http://setiathome.berkeley.edu/result.php?resultid=5097390926
blc5_2bit_guppi_57451_66989_HIP117473_OFF_0016.13400.831.18.27.179.vlar_2
Run time: 15 min 22 sec; CPU time: 15 min 1 sec; Validate state: Valid
_heinz (Joined: 25 Feb 05, Posts: 744, Credit: 5,539,270, RAC: 0)
Nice, you compiled your own app, x41p_zi3c. Where can I download it to try on my V8 Xeon? Please compile it with CUDA 8 too. You can download CUDA 8 from the "Developer area" of NVIDIA; I have it installed as well.
_heinz
TBar (Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768)
I've tried it with CUDA 8. You can find some results by looking through the Inconclusive list: http://setiathome.berkeley.edu/result.php?resultid=5064040827
My results show CUDA 8 isn't any better than 7.5, and CUDA 8 only works in Darwin 15.x, not to mention the driver is broken in Darwin 15.6. If you want to run the CUDA 8 driver, do not update to Darwin 15.6. Fortunately I have a backup system that is still running Darwin 15.4. Check out those results, CUDA versus stock OSX OpenCL:
Task 5064040826 (computer 6274868): sent 28 Jul 2016, 3:18:55 UTC; reported 6 Aug 2016, 12:07:27 UTC; Completed, validation inconclusive; run time 7,346.61 s; CPU time 195.25 s; credit pending; SETI@home v8 v8.00 (opencl_nvidia_mac) x86_64-apple-darwin
Task 5064040827 (computer 6796479): sent 28 Jul 2016, 3:19:05 UTC; reported 28 Jul 2016, 12:44:11 UTC; Completed, validation inconclusive; run time 390.14 s; CPU time 381.89 s; credit pending; SETI@home v8 Anonymous platform (NVIDIA GPU)
Task 5082539421 (computer 8028973): sent 6 Aug 2016, 18:30:44 UTC; deadline 30 Sep 2016, 1:54:56 UTC; in progress; SETI@home v8 v8.00 x86_64-pc-linux-gnu
The source code is here: https://setisvn.ssl.berkeley.edu/trac/browser/branches/sah_v7_opt/Xbranch/client/alpha/PetriR_raw3
From my experience the CUDA version doesn't matter; it's the 'Special' work accomplished by Petri using the new CUDA Streams code.
_heinz (Joined: 25 Feb 05, Posts: 744, Credit: 5,539,270, RAC: 0)
Thanks, TBar, I will have a look at it and give it a try.
Stubbles (Joined: 29 Nov 99, Posts: 358, Credit: 5,909,255, RAC: 0)
> On request, using the same data I collected for my most recent GPU rankings, I parsed out data for tasks running the stock CUDA app and generated a comparison with OpenCL. As you can see, the CUDA app generates less credit per hour on modern GPUs.
Shaggie, is your graph comparing with only 1 task/GPU? If so, from my experience NV_SoG should be crushing Cuda50 on a GTX 750 Ti, but it looks like there's only a ~20% lead for NV_SoG. Maybe I've been playing too much with Mr Kevvy's script and can't remember the good old days of running guppis on the GPU! lol
RobG ;-}
Shaggie76 (Joined: 9 Oct 09, Posts: 282, Credit: 271,858,118, RAC: 196)
> Would it be easy to extract the CPU usage [time] per task, as an average % of elapsed time I suppose, for the same models/app? It might give an idea of how many Cuda streams we could feed while staying on a single core (for the next gen).
Sure, I can fiddle with that. I'll need to rework the front end because I don't have that data in the digestion, but I'll see what I can do.
Shaggie76 (Joined: 9 Oct 09, Posts: 282, Credit: 271,858,118, RAC: 196)
> Is your graph comparing with only 1 task/GPU?
Yes, "stock."
Stubbles (Joined: 29 Nov 99, Posts: 358, Credit: 5,909,255, RAC: 0)
I only asked because I found out yesterday that some people run multiple tasks per GPU with stock; I can't remember the thread though. I'm assuming it's those who use just the GPU to crunch S@h and don't want to reinstall Lunatics whenever there's a new update to the apps. I'm also assuming it's much rarer than Anonymous Platform, so it shouldn't affect your stats... much. Maybe others have a better idea, especially those running multiple projects: S@h on GPU and other project(s) on CPU. Just another random thought!
Zalster (Joined: 27 May 99, Posts: 5517, Credit: 528,817,460, RAC: 242)
You can run more than 1 task per GPU with stock by just adding an app_config.xml. I'm pretty sure it won't show up as anonymous, but it's been a long time since I've done that.
petri33 (Joined: 6 Jun 02, Posts: 1668, Credit: 623,086,772, RAC: 156)
> I somehow expected this: the latest GPUs are not used well by the now-old Cuda 5.0, while OpenCL is somewhat higher-level programming and the OpenCL driver itself does a better job of optimizing the task for the actual higher-end GPU hardware.
The CPU usage on my system depends on the value I choose for the nanosleep that I use to replace Yield() via LD_PRELOAD on my Linux box. For some unknown reason my system uses 100% CPU if I use the cuda blocking-sync option that is meant to free the CPU, so I use Yield() and override it before program launch. With 5000 nanoseconds (5 us) the CPU usage is 19-39% per GPU task. If I increase the sleep value, the CPU and GPU usage drop and I could run more simultaneous apps, but I do not need to. I get 84-100% GPU usage (96% avg), and the power consumption of one GTX 1080 is 130-168 W; unmodified code used 79 W doing guppi/vlar tasks. Without the nanosleep, shorties would take 46 seconds of both CPU and GPU; now they use 52 seconds GPU and 20 seconds CPU. I'm taking the nanosleep off now for a few hours; my CPU time should increase.
To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
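For the curious, the trick being described can be sketched roughly as follows: a tiny shared library that replaces sched_yield() (the call behind Yield() when the runtime is in yield mode) with a short nanosleep, then preloaded at launch. This is an illustration based on the post, not petri's actual code; the 5000 ns value is the one he quotes.

```c
/* sleepyield.c -- hypothetical LD_PRELOAD shim, not petri's actual code.
 * Build: gcc -shared -fPIC -O2 -o libsleepyield.so sleepyield.c
 * Run:   LD_PRELOAD=./libsleepyield.so ./the_cuda_app
 */
#define _GNU_SOURCE
#include <time.h>

/* In yield mode the CUDA runtime spins on sched_yield() while waiting
 * for the GPU; preloading this symbol turns every yield into a short
 * sleep, freeing the CPU core between polls. */
int sched_yield(void)
{
    struct timespec ts = { 0, 5000 };  /* 5000 ns = 5 us, per the post */
    nanosleep(&ts, NULL);
    return 0;
}
```

Larger sleep values cut CPU usage further at the cost of GPU utilisation, which matches the trade-off described above.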
rob smith (Joined: 7 Mar 03, Posts: 22219, Credit: 416,307,556, RAC: 380)
> Unmodified code used 79 W doing guppi/vlar tasks.
CUDA or SoG? (I'm guessing CUDA, but would like to be sure.)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
petri33 (Joined: 6 Jun 02, Posts: 1668, Credit: 623,086,772, RAC: 156)
> Unmodified code used 79 W doing guppi/vlar tasks.
Your guess was right. I'm doing CUDA.
To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones