OpenCL vs CUDA (Stock)

Message boards : Number crunching : OpenCL vs CUDA (Stock)

AuthorMessage
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1809341 - Posted: 15 Aug 2016, 14:38:55 UTC - in response to Message 1809334.  

I somehow expected this; the latest GPUs are not used well by the now-old Cuda 5.0, while OpenCL is somewhat higher-level programming and the OpenCL driver itself does a better job of optimizing the task for the actual higher-end GPU hardware.

Sure, well-written code in Cuda 7.5/8.0 would probably be even better, but that requires some additional human effort to be put in...


The biggest issue there is the switch to tasks the Cuda app was never designed to run efficiently, and the Cuda app's preference for low CPU use, which underfeeds the newest/largest GPUs with them.

After some tests I'm very hopeful that Petri's contributed code will go a long way. I am, though, requesting more information on the CPU usage here, because this is a fairly critical CPU/GPU load-balancing issue when it comes to figuring out a way to scale across the range without locking up hosts/tasks.



The CPU usage on my system depends on the choice I give for the nanosleep that I use to replace Yield() via LD_PRELOAD on my Linux. For some unknown reason my system uses 100% if I use the cuda blocking sync option that is meant to free the CPU. So I use Yield() and override that before program launch.

With 5000 nanoseconds (5 us) the CPU usage is 19-39% per GPU task. If I increase the sleep value, the CPU and GPU usage drop and I could run more simultaneous apps. But I do not need to. I get 84-100% GPU usage (96% avg) and the power consumption of one GTX 1080 is 130-168 W. Unmodified code used 79 W doing guppi/vlar tasks.

Without nanosleep, shorties would take 46 seconds on both CPU and GPU. Now they use 52 seconds GPU and 20 seconds CPU. I'm taking off the nanosleep now for a few hours. My CPU time should increase.



And I put back the nanosleep. Run times were 2-3 seconds better in total, but the CPU times jumped way up. I like the CPU not overheating.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1809341 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1809342 - Posted: 15 Aug 2016, 14:40:51 UTC - in response to Message 1809334.  

The CPU usage on my system depends on the choice I give for the nanosleep that I use to replace Yield() via LD_PRELOAD on my Linux. For some unknown reason my system uses 100% if I use the cuda blocking sync option that is meant to free the CPU. So I use Yield() and override that before program launch.

With 5000 nanoseconds (5 us) the CPU usage is 19-39% per GPU task. If I increase the sleep value, the CPU and GPU usage drop and I could run more simultaneous apps. But I do not need to. I get 84-100% GPU usage (96% avg) and the power consumption of one GTX 1080 is 130-168 W. Unmodified code used 79 W doing guppi/vlar tasks.

Without nanosleep, shorties would take 46 seconds on both CPU and GPU. Now they use 52 seconds GPU and 20 seconds CPU. I'm taking off the nanosleep now for a few hours. My CPU time should increase.

I have the same results. If I use any setting other than "cudaDeviceScheduleYield" the CPU use will go over 100%. How the thing can run at 105-110% CPU is a mystery, but it happens. So I use Yield and it uses slightly less than 100%. I know nothing about setting it up to use nanosleep.
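For reference, the device-flag setting being discussed would look roughly like this. This is a minimal sketch of the standard CUDA runtime call, not the SETI app's actual initialization code:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // Must be called before the runtime creates a context on the device.
    // cudaDeviceScheduleYield: the runtime spins but calls sched_yield()
    //   between polls (the call the LD_PRELOAD libsleep.so trick intercepts).
    // cudaDeviceScheduleBlockingSync: the CPU thread blocks on a sync
    //   primitive, which in theory frees the CPU while the GPU works.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleYield);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Swapping the flag to cudaDeviceScheduleBlockingSync is the variant that, per the posts above, unexpectedly pushed CPU use to 100% on these systems.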
ID: 1809342 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22220
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1809343 - Posted: 15 Aug 2016, 14:42:28 UTC

Does it (try to) use a second core????
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1809343 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1809346 - Posted: 15 Aug 2016, 14:54:23 UTC - in response to Message 1809342.  

The CPU usage on my system depends on the choice I give for the nanosleep that I use to replace Yield() via LD_PRELOAD on my Linux. For some unknown reason my system uses 100% if I use the cuda blocking sync option that is meant to free the CPU. So I use Yield() and override that before program launch.

With 5000 nanoseconds (5 us) the CPU usage is 19-39% per GPU task. If I increase the sleep value, the CPU and GPU usage drop and I could run more simultaneous apps. But I do not need to. I get 84-100% GPU usage (96% avg) and the power consumption of one GTX 1080 is 130-168 W. Unmodified code used 79 W doing guppi/vlar tasks.

Without nanosleep, shorties would take 46 seconds on both CPU and GPU. Now they use 52 seconds GPU and 20 seconds CPU. I'm taking off the nanosleep now for a few hours. My CPU time should increase.

I have the same results. If I use any setting other than "cudaDeviceScheduleYield" the CPU use will go over 100%. How the thing can run at 105-110% CPU is a mystery, but it happens. So I use Yield and it uses slightly less than 100%. I know nothing about setting it up to use nanosleep.


For your Linux box you can try this.

To reduce the CPU usage with AP, save this as libsleep.c.

Copy the resulting libsleep.so somewhere, but not to the boinc seti directory, since it will be deleted from there.

This will replace the yield with a sleep and give other threads, including lower-priority ones, a chance to run. Yield would give a slice only to threads of the same or higher priority.


#include <time.h>
#include <errno.h>

/*
 * To compile:
 *   gcc -O2 -fPIC -shared -Wl,-soname,libsleep.so -o libsleep.so libsleep.c
 *
 * To use:
 *   export LD_PRELOAD=/path/to/libsleep.so
 * before launching BOINC.
 */

/* Override sched_yield() with a short nanosleep, resuming if interrupted. */
int sched_yield(void)
{
  struct timespec t;
  struct timespec r;

  t.tv_sec  = 0;
  t.tv_nsec = 1000000; // 1000000 ns is 1 ms, 5000 is 5 us, 1 is one ns.

  while (nanosleep(&t, &r) == -1 && errno == EINTR)
    {
      t.tv_sec  = r.tv_sec;
      t.tv_nsec = r.tv_nsec;
    }

  return 0;
}


To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1809346 · Report as offensive
Profile Stubbles
Volunteer tester
Avatar

Send message
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1809358 - Posted: 15 Aug 2016, 15:29:34 UTC - in response to Message 1809194.  

Petri mentioned the following on SG WoW:
petri33: @stubbles: I'm not doing Queue Optimisation, even though I run out of work every Tuesday throughout the year. That's about 52 weeks/yr * 2 hr/week * 170,000 cr/24 hr ≈ 737,000 cr/yr. Posted: Monday Aug. 15, 2016 at 17:00.

and I just wanted to capture it in some thread and thought this one was more appropriate.

Looking into the future, that means all GTX 1080s running his app will starve for ~2 hrs on Tuesdays.
Since VR should be a HUGE thing at Xmas, possibly hundreds of new 1080s, 1070s and 1060s will be crunching for S@h.

It will be important to keep our eyes on the 1070 and 1060 to see if they might also starve on Tuesdays, since I expect there will be more of those on the street come 2017 than the 1080.

Ideally, if we could figure out a way to keep starvation to a minimum, it should reduce the flow to other projects of those who get easily frustrated/annoyed by the 100-task/device limit.
ID: 1809358 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1809360 - Posted: 15 Aug 2016, 15:33:34 UTC - in response to Message 1809358.  

Starvation of the GPU isn't anything new.

The higher-end 900s were already starving during the weekly outage.
ID: 1809360 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1809363 - Posted: 15 Aug 2016, 15:42:09 UTC - in response to Message 1809360.  

And the bigger and faster the GPUs become, the longer the outage will become, as the number of database transactions to be compacted and re-indexed every week increases.
ID: 1809363 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1809386 - Posted: 15 Aug 2016, 16:49:11 UTC - in response to Message 1809363.  

Dammit Scottie, I need more Power!

lol

ID: 1809386 · Report as offensive
Profile Stubbles
Volunteer tester
Avatar

Send message
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1809389 - Posted: 15 Aug 2016, 16:55:28 UTC - in response to Message 1809363.  

And the bigger and faster the GPUs become, the longer the outage will become, as the number of database transactions to be compacted and re-indexed every week increases.

Lovely! I hadn't even thought about that variable. Thanks :-D
The next 3 Tuesdays should be a good indication of loads to come in early 2017.

With the 4-bit files being tested on Beta, is there any talk of making the tasks longer?
ID: 1809389 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1809390 - Posted: 15 Aug 2016, 16:57:54 UTC - in response to Message 1809389.  

With the 4-bit files being tested on Beta, is there any talk of making the tasks longer?


What do you mean by longer?

The increase in work unit size didn't adversely affect time to complete by much.
ID: 1809390 · Report as offensive
Profile Stubbles
Volunteer tester
Avatar

Send message
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1809397 - Posted: 15 Aug 2016, 17:13:17 UTC - in response to Message 1809390.  
Last modified: 15 Aug 2016, 17:14:25 UTC

With the 4-bit files being tested on Beta, is there any talk of making the tasks longer?

What do you mean by longer?
The increase in work unit size didn't adversely affect time to complete by much.

Oops, that should have read: last longer, e.g. 1.5× longer in time by having more data to analyse.

PS: Love the new avatar!!!
ID: 1809397 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1809401 - Posted: 15 Aug 2016, 17:26:09 UTC - in response to Message 1809397.  
Last modified: 15 Aug 2016, 17:26:27 UTC

PS: Love the new avatar!!!


+1

ID: 1809401 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1809450 - Posted: 15 Aug 2016, 19:56:48 UTC - in response to Message 1809334.  
Last modified: 15 Aug 2016, 20:28:29 UTC

... For some unknown reason my system uses 100% if I use cuda blocking sync option that is meant to free the CPU.


Yes, that's correct operation with blocking sync, when using pure streams with enough work enqueued to keep the whole GPU busy.

The stream launches return immediately to CPU code, which eventually makes it to a point where it spins on the events (instead of blocking sync). Yield will behave similarly, except it should yield after a controlled sleep (which you're adjusting).

In probing test builds I then restored blocking sync, by adding hard stream-event waits where blocking sync could be added. By allowing some streams to run in parallel I was able to keep the times low and utilisation up, but drop CPU use. This is a question of load balancing, and it will have to be made configurable/tunable/automated for wider consumption.
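A hard per-stream event wait of the kind described above might look roughly like this. This is an illustrative sketch using standard CUDA runtime calls; the function and variable names are not the app's actual identifiers:

```cuda
#include <cuda_runtime.h>

// Illustrative only: wait for one stream's enqueued work to finish
// without spinning the CPU.
void wait_for_stream(cudaStream_t stream)
{
    cudaEvent_t done;
    // cudaEventBlockingSync makes cudaEventSynchronize() block the calling
    // thread instead of busy-waiting; cudaEventDisableTiming skips the
    // timestamping overhead since we only care about completion.
    cudaEventCreateWithFlags(&done,
                             cudaEventDisableTiming | cudaEventBlockingSync);
    cudaEventRecord(done, stream);   // marker enqueued after the stream's work
    cudaEventSynchronize(done);      // CPU sleeps until the GPU reaches it
    cudaEventDestroy(done);
}
```

Issuing work on several streams and only then waiting on each stream's event is what lets utilisation stay up while the CPU idles, which is the load-balancing trade-off discussed here.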

Since you found an issue with the pulsefinding to fix, I'll probably be able to tweak and run another test (after work). If validation is improved as hoped, then an extended wider alpha could potentially take place while the new infrastructure build continues.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1809450 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1809457 - Posted: 15 Aug 2016, 20:41:40 UTC - in response to Message 1809342.  
Last modified: 15 Aug 2016, 20:43:04 UTC

If I use any setting other than "cudaDeviceScheduleYield" the CPU use will go over 100%. How the thing can run at 105-110% CPU is a mystery, but it happens. So I use Yield and it uses slightly less than 100%.


Yes, also correct operation, since the Cuda runtime is internally multithreaded (helper threads). Yielding on the launching thread at least gives the helper threads some room to finish, with less tendency to overcommit. That same mechanism is what runs afoul of BoincApi's shutdown mechanisms when closing down the device properly, whether or not BoincApi's critical sections are used. (BoincApi assumes single-threaded applications only, which can't be forced on the Cuda runtime.)

During Alpha we'll probably make the sync and shutdown behaviour more controlled/configurable, though for now sapping CPU and leaving the device not properly shut down will be good enough to see whether validation improves and performance stays high.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1809457 · Report as offensive
Profile Stubbles
Volunteer tester
Avatar

Send message
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1809461 - Posted: 15 Aug 2016, 21:16:16 UTC - in response to Message 1809450.  

...extended wider alpha...

Where's the sign-up sheet?!?
ID: 1809461 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1809462 - Posted: 15 Aug 2016, 21:25:52 UTC - in response to Message 1809461.  
Last modified: 15 Aug 2016, 21:26:50 UTC

...extended wider alpha...

Where's sign up sheet?!?


LoL, fingers crossed that after work I can get sufficient beer and try things out. If I see improved validation (to be determined, a big if :)) and can apply a light leash on the CPU use, then I would probably post it as an open alpha with a swathe of warnings: 'Alpha - advanced user - no support - limited GPU suitability - will probably break', etc.

If it works out that way, that would allow vetting of a limited subset of targets/use-cases, while the bigger task of new infrastructure proceeds at a more leisurely pace, aimed at something more generally applicable and robust to project change.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1809462 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1809468 - Posted: 15 Aug 2016, 22:07:18 UTC - in response to Message 1809342.  

The CPU usage on my system depends on the choice I give for the nanosleep that I use to replace Yield() via LD_PRELOAD on my Linux. For some unknown reason my system uses 100% if I use the cuda blocking sync option that is meant to free the CPU. So I use Yield() and override that before program launch.

With 5000 nanoseconds (5 us) the CPU usage is 19-39% per GPU task. If I increase the sleep value, the CPU and GPU usage drop and I could run more simultaneous apps. But I do not need to. I get 84-100% GPU usage (96% avg) and the power consumption of one GTX 1080 is 130-168 W. Unmodified code used 79 W doing guppi/vlar tasks.

Without nanosleep, shorties would take 46 seconds on both CPU and GPU. Now they use 52 seconds GPU and 20 seconds CPU. I'm taking off the nanosleep now for a few hours. My CPU time should increase.

I have the same results. If I use any setting other than "cudaDeviceScheduleYield" the CPU use will go over 100%. How the thing can run at 105-110% CPU is a mystery, but it happens. So I use Yield and it uses slightly less than 100%. I know nothing about setting it up to use nanosleep.


@TBar

A new thing to try: when creating events in cudaAcceleration.cu, use the new flag pair
cudaEventDisableTiming|cudaEventBlockingSync
instead of the old cudaEventDisableTiming alone.

1) Apply this at least to pulseDoneEvent. Not the ones with a number at the end.
2) And probably to gaussDoneEvent, tripletsDoneEvent, autocorrelationDoneEvent, and maybe summaxDoneEvent. Not the ones with a number at the end.

It will drop CPU usage but may slow things down. The GPU usage drops too, but if you have enough RAM you can try running 2 instances at a time. Watch out for the system going into constant swapping (running out of available RAM).
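In code terms, the suggested change would presumably look something like this. This is a sketch only; pulseDoneEvent is one of the app's events named above, and the actual call sites in cudaAcceleration.cu should be checked:

```cuda
#include <cuda_runtime.h>

cudaEvent_t pulseDoneEvent;  // one of the app's completion events

void create_pulse_event(void)
{
    // Old flags: cudaEventDisableTiming alone -> waiting on the event
    // can spin the CPU.
    // New flag pair: a thread waiting on the event blocks instead of
    // spinning, trading a little latency for much lower CPU usage.
    cudaEventCreateWithFlags(&pulseDoneEvent,
                             cudaEventDisableTiming | cudaEventBlockingSync);
}
```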
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1809468 · Report as offensive
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.