OpenCL vs CUDA (Stock)

Message boards : Number crunching : OpenCL vs CUDA (Stock)

AuthorMessage
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1809341 - Posted: 15 Aug 2016, 14:38:55 UTC - in response to Message 1809334.  

I somehow expected this; the latest GPUs are not used well by the now-old Cuda 5.0, while OpenCL is somewhat higher-level programming and the OpenCL driver itself does a better job of optimizing the task for the actual higher-end GPU hardware.

Sure, well-written code in Cuda 7.5/8.0 would probably be even better, but that requires some additional human effort to be put in...


The biggest issue there is the switch to tasks the Cuda app was never designed to run efficiently, and the Cuda app's preference for low CPU use, which underfeeds the newest/largest GPUs with them.

After some tests I'm very hopeful that Petri's contributed code will go a long way. I am, though, requesting more information on the CPU usage here, because this is a fairly critical CPU/GPU load-balancing issue when it comes to figuring out a way to scale across the range without locking up hosts/tasks.



The CPU usage on my system depends on the choice I give for the nanosleep that I use to replace Yield() via LD_PRELOAD on my Linux. For some unknown reason my system uses 100% if I use the cuda blocking sync option that is meant to free the CPU. So I use Yield() and override that before program launch.

With 5000 nanoseconds (5 us) the CPU usage is 19-39% per GPU task. If I increase the sleep value, the CPU and GPU usage drop and I could run more simultaneous apps. But I do not need to. I get 84-100% GPU usage (96% avg) and the power consumption of one GTX 1080 is 130-168 W. Unmodified code used 79 W doing guppi/vlar tasks.

Without nanosleep, shorties would take 46 seconds on both CPU and GPU. Now they use 52 seconds GPU and 20 seconds CPU. I'm taking off the nanosleep now for a few hours. My CPU time should increase.



And I put back the nanosleep. Run times were 2-3 seconds better in total, but the CPU times jumped way up. I like the CPU not overheating.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1809341 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1809342 - Posted: 15 Aug 2016, 14:40:51 UTC - in response to Message 1809334.  

The CPU usage on my system depends on the choice I give for the nanosleep that I use to replace Yield() via LD_PRELOAD on my Linux. For some unknown reason my system uses 100% if I use the cuda blocking sync option that is meant to free the CPU. So I use Yield() and override that before program launch.

With 5000 nanoseconds (5 us) the CPU usage is 19-39% per GPU task. If I increase the sleep value, the CPU and GPU usage drop and I could run more simultaneous apps. But I do not need to. I get 84-100% GPU usage (96% avg) and the power consumption of one GTX 1080 is 130-168 W. Unmodified code used 79 W doing guppi/vlar tasks.

Without nanosleep, shorties would take 46 seconds on both CPU and GPU. Now they use 52 seconds GPU and 20 seconds CPU. I'm taking off the nanosleep now for a few hours. My CPU time should increase.

I have the same results. If I use any setting other than "cudaDeviceScheduleYield" the CPU use will go over 100%. How the thing can run at 105-110% CPU is a mystery, but it happens. So I use Yield and it uses slightly less than 100%. I know nothing about setting it up to use nanosleep.
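For reference, the device-flag setting being discussed would look roughly like this. This is a minimal sketch of the standard CUDA runtime call, not the SETI app's actual initialization code:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // Must be called before the runtime creates a context on the device.
    // cudaDeviceScheduleYield: the runtime spins but calls sched_yield()
    //   between polls (the call the LD_PRELOAD libsleep.so trick intercepts).
    // cudaDeviceScheduleBlockingSync: the CPU thread blocks on a sync
    //   primitive, which in theory frees the CPU while the GPU works.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleYield);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Swapping the flag to cudaDeviceScheduleBlockingSync is the variant that, per the posts above, unexpectedly pushed CPU use to 100% on these systems.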
ID: 1809342 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22220
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1809343 - Posted: 15 Aug 2016, 14:42:28 UTC

Does it (try to) use a second core????
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1809343 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1809346 - Posted: 15 Aug 2016, 14:54:23 UTC - in response to Message 1809342.  

The CPU usage on my system depends on the choice I give for the nanosleep that I use to replace Yield() via LD_PRELOAD on my Linux. For some unknown reason my system uses 100% if I use the cuda blocking sync option that is meant to free the CPU. So I use Yield() and override that before program launch.

With 5000 nanoseconds (5 us) the CPU usage is 19-39% per GPU task. If I increase the sleep value, the CPU and GPU usage drop and I could run more simultaneous apps. But I do not need to. I get 84-100% GPU usage (96% avg) and the power consumption of one GTX 1080 is 130-168 W. Unmodified code used 79 W doing guppi/vlar tasks.

Without nanosleep, shorties would take 46 seconds on both CPU and GPU. Now they use 52 seconds GPU and 20 seconds CPU. I'm taking off the nanosleep now for a few hours. My CPU time should increase.

I have the same results. If I use any setting other than "cudaDeviceScheduleYield" the CPU use will go over 100%. How the thing can run at 105-110% CPU is a mystery, but it happens. So I use Yield and it uses slightly less than 100%. I know nothing about setting it up to use nanosleep.


For your Linux box you can try this.

To reduce the CPU usage with AP, save this as libsleep.c.

Copy the resulting libsleep.so somewhere, but not to the boinc seti directory, since it will be deleted from there.

This will replace the yield with a sleep and give other threads, including lower-priority ones, a chance to run. Yield would give a slice only to threads of the same or higher priority.


#include <time.h>
#include <errno.h>

/*
 * To compile:
 *   gcc -O2 -fPIC -shared -Wl,-soname,libsleep.so -o libsleep.so libsleep.c
 *
 * To use:
 *   export LD_PRELOAD=/path/to/libsleep.so
 * before launching BOINC.
 */

/* Override sched_yield() with a short nanosleep, resuming if interrupted. */
int sched_yield(void)
{
  struct timespec t;
  struct timespec r;

  t.tv_sec  = 0;
  t.tv_nsec = 1000000; // 1000000 ns is 1 ms, 5000 is 5 us, 1 is one ns.

  while (nanosleep(&t, &r) == -1 && errno == EINTR)
    {
      t.tv_sec  = r.tv_sec;
      t.tv_nsec = r.tv_nsec;
    }

  return 0;
}


To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1809346 · Report as offensive
Profile Stubbles
Volunteer tester
Avatar

Send message
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1809358 - Posted: 15 Aug 2016, 15:29:34 UTC - in response to Message 1809194.  

Petri mentioned the following on SG WoW:
petri33: @stubbles: I'm not doing Queue Optimisation, even though I run out of work every Tuesday throughout the year. That's about 52 weeks/yr * 2 hr/week * 170,000 cr/24 hr ≈ 737,000 cr/yr. Posted: Monday Aug. 15, 2016 at 17:00.

and I just wanted to capture it in some thread and thought this one was more appropriate.

Looking into the future, that means all GTX 1080s running his app will starve for ~2 hrs on Tuesdays.
Since VR should be a HUGE thing at Xmas, possibly hundreds of new 1080s, 1070s and 1060s will be crunching for S@h.

It will be important to keep our eyes on the 1070 and 1060 to see if they might also starve on Tuesdays, since I expect there will be more of those on the street come 2017 than the 1080.

Ideally, if we could figure out a way to keep starvation to a minimum, it should reduce the flow to other projects of those who get easily frustrated/annoyed by the 100-task/device limit.
ID: 1809358 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1809360 - Posted: 15 Aug 2016, 15:33:34 UTC - in response to Message 1809358.  

Starvation of the GPU isn't anything new.

The higher-end 900s were already starving during the weekly outage.
ID: 1809360 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1809363 - Posted: 15 Aug 2016, 15:42:09 UTC - in response to Message 1809360.  

And the bigger and faster the GPUs become, the longer the outage will become, as the number of database transactions to be compacted and re-indexed every week increases.
ID: 1809363 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1809386 - Posted: 15 Aug 2016, 16:49:11 UTC - in response to Message 1809363.  

Dammit Scottie, I need more Power!

lol

ID: 1809386 · Report as offensive
Profile Stubbles
Volunteer tester
Avatar

Send message
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1809389 - Posted: 15 Aug 2016, 16:55:28 UTC - in response to Message 1809363.  

And the bigger and faster the GPUs become, the longer the outage will become, as the number of database transactions to be compacted and re-indexed every week increases.

Lovely! I hadn't even thought about that variable. Thanks :-D
The next 3 Tuesdays should be a good indication of loads to come in early 2017.

With the 4-bit files being tested on Beta, is there any talk of making the tasks longer?
ID: 1809389 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1809390 - Posted: 15 Aug 2016, 16:57:54 UTC - in response to Message 1809389.  

With the 4-bit files being tested on Beta, is there any talk of making the tasks longer?


What do you mean by longer?

The increase in work unit size didn't adversely affect time to complete by much.
ID: 1809390 · Report as offensive
Profile Stubbles
Volunteer tester
Avatar

Send message
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1809397 - Posted: 15 Aug 2016, 17:13:17 UTC - in response to Message 1809390.  
Last modified: 15 Aug 2016, 17:14:25 UTC

With the 4-bit files being tested on Beta, is there any talk of making the tasks longer?

What do you mean by longer?
The increase in work unit size didn't adversely affect time to complete by much.

Oops, that should have read: last longer, e.g. 1.5× longer in time by having more data to analyse.

PS: Love the new avatar!!!
ID: 1809397 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1809401 - Posted: 15 Aug 2016, 17:26:09 UTC - in response to Message 1809397.  
Last modified: 15 Aug 2016, 17:26:27 UTC

PS: Love the new avatar!!!


+1

ID: 1809401 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1809450 - Posted: 15 Aug 2016, 19:56:48 UTC - in response to Message 1809334.  
Last modified: 15 Aug 2016, 20:28:29 UTC

... For some unknown reason my system uses 100% if I use cuda blocking sync option that is meant to free the CPU.


Yes, that's correct operation with blocking sync, when using pure streams with enough work enqueued to keep the whole GPU busy.

The stream launches return immediately to CPU code, which eventually makes it to a point where it spins on the events (instead of blocking sync). Yield will behave similarly, except it should yield after a controlled sleep (which you're adjusting).

In probing test builds I then restored blocking sync, by adding hard stream-event waits where blocking sync could be added. By allowing some streams to run in parallel I was able to keep the times low and utilisation up, but drop CPU use. This is a question of load balancing, and it will have to be made configurable/tunable/automated for wider consumption.
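A hard per-stream event wait of the kind described above might look roughly like this. This is an illustrative sketch using standard CUDA runtime calls; the function and variable names are not the app's actual identifiers:

```cuda
#include <cuda_runtime.h>

// Illustrative only: wait for one stream's enqueued work to finish
// without spinning the CPU.
void wait_for_stream(cudaStream_t stream)
{
    cudaEvent_t done;
    // cudaEventBlockingSync makes cudaEventSynchronize() block the calling
    // thread instead of busy-waiting; cudaEventDisableTiming skips the
    // timestamping overhead since we only care about completion.
    cudaEventCreateWithFlags(&done,
                             cudaEventDisableTiming | cudaEventBlockingSync);
    cudaEventRecord(done, stream);   // marker enqueued after the stream's work
    cudaEventSynchronize(done);      // CPU sleeps until the GPU reaches it
    cudaEventDestroy(done);
}
```

Issuing work on several streams and only then waiting on each stream's event is what lets utilisation stay up while the CPU idles, which is the load-balancing trade-off discussed here.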

Since you found an issue with the pulsefinding to fix, I'll probably be able to tweak and run another test (after work). If validation is improved as hoped, then an extended wider alpha could potentially take place while the new infrastructure build continues.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1809450 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1809457 - Posted: 15 Aug 2016, 20:41:40 UTC - in response to Message 1809342.  
Last modified: 15 Aug 2016, 20:43:04 UTC

If I use any setting other than "cudaDeviceScheduleYield" the CPU use will go over 100%. How the thing can run at 105-110% CPU is a mystery, but it happens. So I use Yield and it uses slightly less than 100%.


Yes, also correct operation, since the Cuda runtime is internally multithreaded (helper threads). Yielding on the launching thread at least gives the helper threads some room to finish, with less tendency to overcommit. That same mechanism is what runs afoul of BoincApi's shutdown mechanisms when closing down the device properly, whether or not BoincApi's critical sections are used. (BoincApi assumes single-threaded applications only, which can't be forced on the Cuda runtime.)

During Alpha we'll probably make the sync and shutdown behaviour more controlled/configurable, though for now sapping CPU and leaving the device not properly shut down will be good enough to see whether validation improves and performance stays high.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1809457 · Report as offensive
Profile Stubbles
Volunteer tester
Avatar

Send message
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1809461 - Posted: 15 Aug 2016, 21:16:16 UTC - in response to Message 1809450.  

...extended wider alpha...

Where's the sign-up sheet?!?
ID: 1809461 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1809462 - Posted: 15 Aug 2016, 21:25:52 UTC - in response to Message 1809461.  
Last modified: 15 Aug 2016, 21:26:50 UTC

...extended wider alpha...

Where's sign up sheet?!?


LoL, fingers crossed that after work I can get sufficient beer and try things out. If I see improved validation (to be determined, a big if :)) and can apply a light leash on the CPU use, then I would probably post it as an open alpha with a swathe of warnings: 'Alpha - advanced user - no support - limited GPU suitability - will probably break', etc.

If it works out that way, that would allow vetting of a limited subset of targets/use-cases, while the bigger task of new infrastructure proceeds at a more leisurely pace, aimed at something more generally applicable and robust to project change.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1809462 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1809468 - Posted: 15 Aug 2016, 22:07:18 UTC - in response to Message 1809342.  

The CPU usage on my system depends on the choice I give for the nanosleep that I use to replace Yield() via LD_PRELOAD on my Linux. For some unknown reason my system uses 100% if I use the cuda blocking sync option that is meant to free the CPU. So I use Yield() and override that before program launch.

With 5000 nanoseconds (5 us) the CPU usage is 19-39% per GPU task. If I increase the sleep value, the CPU and GPU usage drop and I could run more simultaneous apps. But I do not need to. I get 84-100% GPU usage (96% avg) and the power consumption of one GTX 1080 is 130-168 W. Unmodified code used 79 W doing guppi/vlar tasks.

Without nanosleep, shorties would take 46 seconds on both CPU and GPU. Now they use 52 seconds GPU and 20 seconds CPU. I'm taking off the nanosleep now for a few hours. My CPU time should increase.

I have the same results. If I use any setting other than "cudaDeviceScheduleYield" the CPU use will go over 100%. How the thing can run at 105-110% CPU is a mystery, but it happens. So I use Yield and it uses slightly less than 100%. I know nothing about setting it up to use nanosleep.


@TBar

A new thing to try: when creating events in cudaAcceleration.cu, use the new flag pair
cudaEventDisableTiming|cudaEventBlockingSync
instead of the old cudaEventDisableTiming alone.

1) Apply this at least to pulseDoneEvent. Not the ones with a number at the end.
2) And probably to gaussDoneEvent, tripletsDoneEvent, autocorrelationDoneEvent, and maybe summaxDoneEvent. Not the ones with a number at the end.

It will drop CPU usage but may slow things down. The GPU usage drops too, but if you have enough RAM you can try running 2 instances at a time. Watch out for the system going into constant swapping (running out of available RAM).
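In code terms, the suggested change would presumably look something like this. This is a sketch only; pulseDoneEvent is one of the app's events named above, and the actual call sites in cudaAcceleration.cu should be checked:

```cuda
#include <cuda_runtime.h>

cudaEvent_t pulseDoneEvent;  // one of the app's completion events

void create_pulse_event(void)
{
    // Old flags: cudaEventDisableTiming alone -> waiting on the event
    // can spin the CPU.
    // New flag pair: a thread waiting on the event blocks instead of
    // spinning, trading a little latency for much lower CPU usage.
    cudaEventCreateWithFlags(&pulseDoneEvent,
                             cudaEventDisableTiming | cudaEventBlockingSync);
}
```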
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1809468 · Report as offensive
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.