Message boards :
Number crunching :
OpenCL vs CUDA (Stock)
Author | Message |
---|---|
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
I somehow expected this; the latest GPUs are not used well by the now-old CUDA 5.0, while OpenCL is a somewhat higher-level programming model and the OpenCL driver itself does a better job of optimizing the task for the actual higher-end GPU hardware. And I put back the nanosleep. Run times were 2-3 seconds better in total, but the CPU times jumped high. I like the CPU not overheating. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The CPU usage on my system depends on the choice I give for the nanosleep that I use to replace Yield() via LD_PRELOAD on my Linux. For some unknown reason my system uses 100% CPU if I use the CUDA blocking-sync option that is meant to free the CPU. So I use Yield() and override that before program launch. I have the same results. If I use any setting other than "cudaDeviceScheduleYield" the CPU use will go over 100%. How the thing can run at 105-110% CPU is a mystery, but it happens. So, I use Yield and it uses slightly less than 100%. I know nothing about setting it up to use nanosleep. |
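For reference, the host-side wait behaviour TBar is choosing between is set with cudaSetDeviceFlags() before the CUDA context is created. A minimal sketch, assuming the standard CUDA runtime API (this is not taken from the app's actual source):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    /* Must run before any CUDA context exists on this thread. */
    cudaSetDeviceFlags(cudaDeviceScheduleYield);
    /* Alternatives:
     *   cudaDeviceScheduleSpin         - busy-wait; lowest latency, one core at 100%
     *   cudaDeviceScheduleBlockingSync - sleep on an OS primitive; lowest CPU use
     *   cudaDeviceScheduleAuto         - runtime heuristic (the default)
     */
    return 0;
}
```

This is also why the LD_PRELOAD trick works: with ScheduleYield the runtime's wait loop calls sched_yield(), which is exactly the symbol the interposed library replaces with nanosleep().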
rob smith Send message Joined: 7 Mar 03 Posts: 22220 Credit: 416,307,556 RAC: 380 |
Does it (try to) use a second core? Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
The CPU usage on my system depends on the choice I give for the nanosleep that I use to replace Yield() via LD_PRELOAD on my Linux. For some unknown reason my system uses 100% if I use the CUDA blocking-sync option that is meant to free the CPU. So I use Yield() and override that before program launch.

For your Linux box you can try this. In Linux, to reduce the CPU usage with AP, save this as libsleep.c. Copy the resulting libsleep.so somewhere, but not to the BOINC SETI directory, since it will be deleted from there. This replaces the yield with a sleep and gives other threads, including lower-priority ones, a chance to run. Yield would give a slice only to threads of the same or higher priority.

```c
#include <time.h>
#include <errno.h>

/*
 * To compile:
 *   gcc -O2 -fPIC -shared -Wl,-soname,libsleep.so -o libsleep.so libsleep.c
 *
 * To use, before launching BOINC:
 *   export LD_PRELOAD=/path/to/libsleep.so
 */
int sched_yield(void)
{
    struct timespec t;
    struct timespec r;

    t.tv_sec  = 0;
    t.tv_nsec = 1000000; /* 1000000 ns is 1 ms; 5000 is 5 us; 1 is one ns. */

    /* If interrupted by a signal, resume with the remaining time. */
    while (nanosleep(&t, &r) == -1 && errno == EINTR) {
        t.tv_sec  = r.tv_sec;
        t.tv_nsec = r.tv_nsec;
    }
    return 0;
}
```

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Stubbles Send message Joined: 29 Nov 99 Posts: 358 Credit: 5,909,255 RAC: 0 |
Petri mentioned the following on SG WoW: petri33: @stubbles: I'm not doing Queue Optimisation. Even though I run out of work every Tuesday throughout the year. That's about 52 weeks/yr * 2 hr/week * 170,000 cr/24hr = 737,000 cr/yr. Posted: Monday Aug. 15, 2016 at 17:00. I just wanted to capture it in some thread and thought this one was the most appropriate. Looking into the future, that means all GTX 1080s running his app will starve for ~2 hrs on Tuesdays. Since VR should be a HUGE thing at Christmas, that means possibly hundreds of new 1080s, 1070s and 1060s crunching for S@h. It will be important to keep our eyes on the 1070 and 1060 to see if they might also starve on Tuesdays, since I expect there will be more of those on the street come 2017 than the 1080. Ideally, if we could figure out a way to keep starvation to a minimum, it should reduce the flow to other projects of those who get easily frustrated/annoyed by a 100-task/device limit. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Starvation of the GPU isn't anything new. The higher-end 900s were already starving during the weekly outage. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
And the bigger and faster the GPUs become, the longer the outage will become, as the number of database transactions to be compacted and re-indexed every week increases. |
Al Send message Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482 |
Dammit Scottie, I need more Power! lol |
Stubbles Send message Joined: 29 Nov 99 Posts: 358 Credit: 5,909,255 RAC: 0 |
And the bigger and faster the GPUs become, the longer the outage will become, as the number of database transactions to be compacted and re-indexed every week increases. Lovely! I hadn't even thought about that variable. Thanks :-D The next 3 Tuesdays should be a good indication of loads to come in early 2017. After the 4bit files being tested on Beta, is there any talk of making the tasks longer? |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
After the 4bit files being tested on Beta, is there any talk of making the tasks longer? What do you mean by longer? The increase in work unit size didn't adversely affect time to complete by much. |
Stubbles Send message Joined: 29 Nov 99 Posts: 358 Credit: 5,909,255 RAC: 0 |
After the 4bit files being tested on Beta, is there any talk of making the tasks longer? Oops, that should have read: last longer, such as 1.5× longer in time by having more data to analyse. PS: Love the new avatar!!! |
Al Send message Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482 |
PS: Love the new avatar!!! +1 |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
... For some unknown reason my system uses 100% if I use cuda blocking sync option that is meant to free the CPU. Yes, that's correct operation with blocking sync when using pure streams with enough work enqueued to keep the whole GPU busy. The stream launches return immediately to CPU code, which eventually makes it to a point where it spins on the events (instead of blocking sync). Yield will behave similarly, except it should yield after a controlled sleep (which you're adjusting). In probing test builds I restored blocking sync by adding hard stream event waits where blocking sync could be added. By allowing some streams to run in parallel I was able to keep the times low and utilisation up, but drop CPU use. This is a question of load balancing, and it will have to be made configurable/tunable/automated for wider consumption. Since you found an issue with the pulsefinding to fix, I'll probably be able to tweak and run another test (after work). If validation is improved as hoped, then an extended wider alpha could potentially take place while the new infrastructure build continues. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
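The "hard stream event waits" Jason describes can be sketched roughly like this, assuming the standard CUDA runtime API (the function and event names here are illustrative, not from the actual build):

```cuda
#include <cuda_runtime.h>

/* Block the host on everything enqueued so far in one stream,
 * sleeping instead of spin-polling while it waits. */
void wait_for_stream(cudaStream_t stream)
{
    cudaEvent_t done;
    /* cudaEventBlockingSync makes cudaEventSynchronize() block the host
     * thread on an OS primitive rather than busy-waiting. */
    cudaEventCreateWithFlags(&done,
                             cudaEventDisableTiming | cudaEventBlockingSync);
    cudaEventRecord(done, stream);  /* marks the work enqueued so far */
    cudaEventSynchronize(done);     /* host sleeps until the GPU gets there */
    cudaEventDestroy(done);
}
```

Other streams keep executing on the GPU while the host waits on this one, which is the load-balancing lever between CPU use, run time, and utilisation that he mentions.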
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
If I use any setting other than "cudaDeviceScheduleYield" the CPU use will go over 100%. How the thing can run at 105-110% CPU is a mystery, but it happens. So, I use Yield and it uses slightly less than 100%. Yes, also correct operation, since the CUDA runtime is internally multithreaded (helper threads). Yielding on the launching thread at least gives the helper threads some room to finish, with less tendency to overcommit. That same mechanism is what runs afoul of BoincApi's shutdown mechanisms when closing down the device properly, whether or not using BoincApi's critical sections. (BoincApi assumes only single-threaded applications, which can't be forced on the CUDA runtime.) During Alpha we'll probably make the sync and shutdown behaviour more controlled/configurable, though for now sapping CPU and leaving the device not properly shut down will be good enough to see if validation improves and performance stays high. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Stubbles Send message Joined: 29 Nov 99 Posts: 358 Credit: 5,909,255 RAC: 0 |
...extended wider alpha... Where's the sign-up sheet?!? |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
...extended wider alpha... LoL, fingers crossed that after work I can get sufficient beer and try things out. If I see improved validation (to be determined, a big if :)) and apply some light leash on the CPU use, then probably I would post as an open alpha with a swathe of warnings. 'Alpha - advanced user - no support - limited GPU suitability - will probably break' etc. If it works out that way, what that would do is allow vetting of a limited subset of targets/use-cases, while the bigger task of new infrastructure can proceed at a more leisurely pace aimed at something more generally applicable and robust to project change. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
The CPU usage on my system depends on the choice I give for the nanosleep that I use to replace Yield() via LD_PRELOAD on my Linux. For some unknown reason my system uses 100% if I use the CUDA blocking-sync option that is meant to free the CPU. So I use Yield() and override that before program launch. @TBar A new thing to try: when creating events in cudaAcceleration.cu, use the new flag pair cudaEventDisableTiming|cudaEventBlockingSync instead of the old cudaEventDisableTiming alone. 1) Apply this at least to pulseDoneEvent (not the ones with a number at the end). 2) And probably to gaussDoneEvent, tripletsDoneEvent, autocorrelationDoneEvent and maybe summaxDoneEvent (not the ones with a number at the end). It will drop CPU usage but may slow things down. The GPU usage drops too, but if you have enough RAM you can try running two instances at a time. Watch out for the system going into a constant swap state (running out of available RAM). To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
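As a sketch, the change petri33 describes in cudaAcceleration.cu would look something like this; the event name is taken from his post, but the surrounding code is assumed, not quoted from the actual source:

```cuda
#include <cuda_runtime.h>

cudaEvent_t pulseDoneEvent;

void create_events(void)
{
    /* Old: the host spin-waits when synchronizing on the event.
     *   cudaEventCreateWithFlags(&pulseDoneEvent, cudaEventDisableTiming);
     * New: adding cudaEventBlockingSync lets the host thread sleep inside
     * cudaEventSynchronize() instead, trading a little latency for CPU. */
    cudaEventCreateWithFlags(&pulseDoneEvent,
                             cudaEventDisableTiming | cudaEventBlockingSync);
}
```

The same one-line change would apply to the other events he lists (gaussDoneEvent, tripletsDoneEvent, autocorrelationDoneEvent, summaxDoneEvent), excluding the numbered variants as he notes.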
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.