guppie on NVIDIA cards

Message boards : Number crunching : guppie on NVIDIA cards


Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1802123 - Posted: 11 Jul 2016, 18:31:21 UTC

With all the talk about rescheduling WUs and not having guppies run on NVIDIA cards, I would just like to share my experience.
A guppie seems to take about 30 minutes on my 750 Ti.
A non-guppie takes about 20 minutes.

A guppie on my CPU takes about 3 hours.

Why would I even want to run them on my CPU? Or am I not understanding something here?

I think I have found that running only 1 SoG WU at a time on the GPU is best, but that remains to be seen, as it seems to take about a month for the RAC to really stabilize. When I see something that looks like a steady output I will test running 2 at a time again.

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1802123 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1802128 - Posted: 11 Jul 2016, 18:48:36 UTC - in response to Message 1802123.  

I agree with you, Bill.

I'd rather have the GUPPIs on my GPU than on the CPU.

The GPU can do them faster and allows for better throughput than the CPU.

I think what is bothering some people is the amount of time it takes on their GPUs.

People got used to the rapid crunching of Arecibo work units, and now, with the vastly different Breakthrough Listen data, they aren't happy.

Unfortunately, CreditNew only made that worse.

But that is my opinion.

If people want to swap out their work units, so be it.

I take whatever the server wants to give me.
ID: 1802128 · Report as offensive
The_Matrix
Volunteer tester

Joined: 17 Nov 03
Posts: 414
Credit: 5,827,850
RAC: 0
Germany
Message 1802131 - Posted: 11 Jul 2016, 18:57:47 UTC

At the same time I decided to stop CPU crunching on SETI. No random.
ID: 1802131 · Report as offensive
Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1802134 - Posted: 11 Jul 2016, 19:23:22 UTC - in response to Message 1802128.  

I agree with you, Bill.

I'd rather have the GUPPIs on my GPU than on the CPU.

The GPU can do them faster and allows for better throughput than the CPU.

/edit/

I take whatever the server wants to give me.


Thanks, that is what I have always done. I do actually miss the credit, but you can't have everything.

APs, they are like a dust storm: here for a short time and then gone...

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1802134 · Report as offensive
Profile Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1802144 - Posted: 11 Jul 2016, 21:36:38 UTC

Gents,
I will respectfully disagree...and here is my perspective below.
If I am missing data or jumping to the wrong conclusion, please let me know as I will happily change my view based on a solid perspective with supporting data.

Fortunately, with Bill's #s we can compare oranges-to-oranges since my 2 primary rigs have a GTX 750 Ti each.
But, before I get into the stats I've compiled to support my perspective, here is my preliminary conclusion specific to:
rigs with an 8-core CPU and only-1 GTX 750 Ti:
With the current ratio of guppis to nonVLAR sent by S@H server during the last week, processing:
- nonVLARs on Cuda50 (with 3 WU at a time); and
- VLARs (blc-guppi & 2010-vlar) on 7 CPU cores
seems to be the best approach to maximize throughput of WUs (and indirectly, this likely results in the highest RAC possible).
But the SoG app for NV cards is starting to be competitive, and since it is doing so with only 1 WU at a time, it is likely to become the stock app in the near future (unless Petri's future apps change the field dramatically).

Before I get into my #s, keep in mind that I have written a small batch file that I use to transfer non-VLARs from the CPU-assigned tasks to the GPU queue (hoping soon to use it to test Raistmer's latest GPU apps with similar-looking batches of tasks).
Lately though, I've been trying to optimize for WU throughput (not RAC) by mostly focusing on CPU-to-GPU time ratios at the current server-sent ratios of VLAR to non-VLAR.
After a few days of running my batch file a few times/day to move non-VLARs from CPU to GPU, I end up processing only VLARs on the CPU and all non-VLARs on the GPU. Under this scenario, my GPU processes about 3x more tasks/day than the CPU.
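The idea behind that kind of rescheduling batch file can be sketched in a few lines. This is an illustrative sketch only, not the actual batch file: it assumes tasks are moved by rewriting the <plan_class> of <result> entries in client_state.xml, which must only ever be done while the BOINC client is fully stopped.

```python
# Illustrative sketch only, not the poster's actual batch file: move
# CPU-assigned non-VLAR results onto the cuda50 plan class by rewriting
# client_state.xml. Run only while the BOINC client is fully stopped,
# and keep a backup -- a malformed client_state.xml can trash the cache.
import re

def reschedule_non_vlars(xml_text: str) -> str:
    """Give every non-VLAR <result> with an empty plan class to cuda50."""
    def fix(match):
        block = match.group(0)
        name = re.search(r"<name>([^<]+)</name>", block)
        if name and "vlar" not in name.group(1):  # keep VLARs on the CPU
            block = block.replace("<plan_class></plan_class>",
                                  "<plan_class>cuda50</plan_class>")
        return block
    return re.sub(r"<result>.*?</result>", fix, xml_text, flags=re.DOTALL)
```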
On the 1st rig presented below, the cache is about 100 for the CPU & GPU. But for the 2nd rig, the situation is quite different (100 to 400) and I can't yet explain why...other than the luck-of-the-draw for VLAR to nonVLAR ratios since my SoG rig seems to be almost as productive as my Cuda50 rig.

Also: for the GTX 750 Ti, I have been told that 2-3WU on Cuda50 is still currently the "Gold Standard" for max throughput (without reassigning tasks between GPU & CPU), and that is currently the main purpose for the 1st rig presented.

So here are my stats...finally! lol
On 8010413, I am running with the Lunatics setup with Cuda50 and I just so happen to have changed from 2WU to 3WU (in parallel on GPU) with the following times for non-VLARs:
 2WU: ~29-30mins/each = ~14.5 mins/task throughput
 3WU: ~39mins /each = ~13 mins/task throughput
The CPU times per core (Xeon W3550 with 7 cores running MB_win_x64_SSE3_VS2008_r3330) for the different types of tasks are:
 guppi-vlar:              2hrs to 2:30, usually ~130mins
 non-VLAR:                usually almost 3hrs (~170mins)
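As a sanity check on the arithmetic above: per-task throughput is just the wall-clock time per WU divided by the number of WUs running in parallel. Using the figures from this post:

```python
def throughput_min_per_task(wall_minutes: float, concurrent: int) -> float:
    """Effective minutes per finished task when WUs run in parallel."""
    return wall_minutes / concurrent

# Figures quoted above for the GTX 750 Ti under Cuda50:
assert throughput_min_per_task(29.0, 2) == 14.5  # 2 WUs at ~29 min each
assert throughput_min_per_task(39.0, 3) == 13.0  # 3 WUs at ~39 min each
```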

On 7996377, I am running with the Lunatics v0.45 beta3 setup but have changed the GPU app to Raistmer's version: MB_win_x86_SSE3_OpenCL_NV_SoG_r3484.exe
 1WU: ~14.5 to 16.5 mins
The CPU times per core (Xeon W3550 with 6 cores running MB_win_x64_SSE3_VS2008_r3330) for the different types of tasks are:
 guppi-vlar:              2hrs to 2:30
 Arecibo2010...vlar:      2:25 (145mins) to 3:07 (187mins)
 non-VLAR:                usually almost 3hrs

Compared to my SoG rig and Bill's #s with SoG (version: ?), I am starting to conclude that Cuda50 still seems (to me) to be the Gold Standard for a rig running 1 or 2 GTX 750 Ti GPUs (with or w/o task transfer between CPU & GPU).
There are a few data points missing above but those are currently unavailable since they were collected ad-hoc prior to installing BoincTasks 2 weeks ago to collect the history.

Let me know if I should include specific data points in the future to come to a full-picture conclusion.
Cheers, Rob :-)
PS: on my SoG rig, I noticed 12hrs ago that many tasks have time estimates of 2.5-3mins that are completing in <10mins. Are these a sub-group of nonVLARs that I should be compiling data for separately? Unfortunately, my Cuda50 rig is at another location so I can't compare until later today.
ID: 1802144 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1802146 - Posted: 11 Jul 2016, 21:54:10 UTC
Last modified: 11 Jul 2016, 21:55:33 UTC

What I can see is that your GPU takes double the time running guppies with CUDA as with SoG.
Considering you didn't use any command line values with SoG, your conclusion isn't telling us anything.


With each crime and every kindness we birth our future.
ID: 1802146 · Report as offensive
Profile Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1802152 - Posted: 11 Jul 2016, 22:42:51 UTC - in response to Message 1802146.  

Hey Mike!
What I can see is that your GPU takes double the time running guppies with CUDA as with SoG.
Under Cuda50, it takes 3WU: ~39mins = throughput: ~13 mins/WU
Under SoG, 1WU: ~14.5 to 16.5 mins = 3 WU: ~45mins
The way I've been looking at those values is that Cuda50 is still ~15% better (= 2mins diff / 13mins)

Considering you didn't use any command line values with SoG, your conclusion isn't telling us anything.
I haven't used a command line YET! lol
...but I assumed that Raistmer's latest releases included the best generic command line for most cards.
Also, since NV_SoG mostly has a lag issue on some older cards ( < GTX 750 Ti ), I also assumed a command line wouldn't make a huge difference.
I don't think my perspective should be dismissed yet, since I don't think many Lunatics host setups use a command line (I might be wrong), and Bill's OP didn't mention one.
FYI, I am still learning a lot, and I am focusing on upcoming optimized generic setups (that's why I was using one of Raistmer's latest apps).

Do you have a command line to suggest? I had tried Brent Norman's before, but it was disastrous with guppis.
If you don't, I could try his again in a few days since there was an improvement for nonVLARs.
Cheers, Rob
ID: 1802152 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1802160 - Posted: 12 Jul 2016, 0:01:06 UTC
Last modified: 12 Jul 2016, 0:03:19 UTC

That's exactly the command line I suggest in the readme.

For mid-range cards like the 750 Ti I would try

-use_sleep -sbs 256 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 32 -oclfft_tune_cw 32

Also, changing -sbs 256 to -sbs 384 and -spike_fft_thresh 4096 to -spike_fft_thresh 2048 is worth a try for those GPUs.


With each crime and every kindness we birth our future.
ID: 1802160 · Report as offensive
Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1802163 - Posted: 12 Jul 2016, 0:51:43 UTC - in response to Message 1802160.  
Last modified: 12 Jul 2016, 0:52:32 UTC

I do not run any command lines, as I do not understand them. I am running what I would consider a stock SoG download.

The only thing I can say is that, watching the GPU usage, it sometimes goes down to 0 for a short bit on the graph. I am always willing to run any new program to see how it works, even if I do not know a lot about it.

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1802163 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1802185 - Posted: 12 Jul 2016, 5:29:00 UTC - in response to Message 1802123.  

With all the talk about rescheduling WUs and not having guppies run on NVIDIA cards, I would just like to share my experience.
A guppie seems to take about 30 minutes on my 750 Ti.
A non-guppie takes about 20 minutes.

A guppie on my CPU takes about 3 hours.

Why would I even want to run them on my CPU? Or am I not understanding something here?

Well, I do think you're missing the point a bit. Certainly a VLAR runs faster on a GPU than on a CPU, as do non-VLARs and APs (unless you've got a really slow GPU paired with a really fast CPU). The issue is really which task type runs most efficiently on which device, assuming that you are trying to maximize the total output of your hardware and the money you're shelling out for your electricity (which is certainly my goal).

If you go back and look at the run times for various comparable tasks that I posted in Message 1799300 from my early VLAR rescheduling attempts, you can see that while VLARs run consistently slower than "normal" AR tasks on my GPUs (whether looking at the GT 630 or the GTX 670), they run consistently faster than the "normal" AR tasks on either of the CPUs shown. Therefore, if I can swap a VLAR originally assigned to a GPU with a non-VLAR originally assigned to a CPU, it's a win-win. The VLAR will finish faster than the non-VLAR would have on the CPU, while the non-VLAR will finish faster than the VLAR would have on the GPU. The end result is a significant increase in total throughput.
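The win-win arithmetic above can be checked with the rough run times quoted in this thread (~30 min for a GPU VLAR, ~20 min for a GPU non-VLAR, ~130 min for a CPU VLAR, ~170 min for a CPU non-VLAR):

```python
# Rough run times (minutes) quoted in this thread for a GTX 750 Ti and
# one CPU core; exact values vary by host, but the ordering is what
# makes the swap pay off.
gpu = {"vlar": 30, "non_vlar": 20}
cpu = {"vlar": 130, "non_vlar": 170}

before = gpu["vlar"] + cpu["non_vlar"]  # VLAR on GPU, non-VLAR on CPU
after = gpu["non_vlar"] + cpu["vlar"]   # the two tasks swapped

assert gpu["non_vlar"] < gpu["vlar"]  # GPU finishes its new task sooner
assert cpu["vlar"] < cpu["non_vlar"]  # so does the CPU: a win on both
assert after < before                 # 150 vs 200 device-minutes total
```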

Now, I don't normally put much stock in the whole credit system as a reliable measurement technique but, at least in broad terms, it's probably the simplest way to see the effect of VLAR rescheduling on my boxes. I've been doing VLAR rescheduling for just about 3 weeks now, and taking a weekly RAC snapshot for my account shows:
Monday, June 20: RAC = 61,275.91
Monday, June 27: RAC = 65,050.05
Monday, July 4: RAC = 66,102.52
Monday, July 11: RAC = 70,191.34

Even allowing for the vagaries of the credit system, I think that increase in RAC shows that I'm getting a good bit more productivity out of my hardware and my electricity than I was when I was just letting the VLARs fall where they may.
ID: 1802185 · Report as offensive
Profile Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1802186 - Posted: 12 Jul 2016, 5:29:57 UTC - in response to Message 1802163.  

I do not run any command lines, as I do not understand them. I am running what I would consider a stock SoG download.
The only thing I can say is that, watching the GPU usage, it sometimes goes down to 0 for a short bit on the graph. I am always willing to run any new program to see how it works, even if I do not know a lot about it.

Hey Bill,
I had a look at your 3 rigs with 2 GTX 750 Ti each.
I noticed you have "anonymous platform" setups on those (see the bottom of each page for 6969420, 7226971, and 7965534).
For me, 'stock' means: whatever apps the server sends the host, which in most scenarios means: not "anonymous platform".
"SETI@home v8 8.00 windows_intelx86 (cuda50)" is considered stock, since it is sent by the server when someone only installs BOINC and adds the project SETI@home,
but currently the NV_SoG version installed by Lunatics v0.45 beta3 is not stock.

Since those 3 PCs are fairly identical (AMD FX-8xxx with 8-cores and 2 GTX 750 Ti each), you could consider running your own comparisons.
I would recommend the following:
- PC x: Lunatics v0.44 (or 0.45 beta3) with cuda50 and 2 WU/GPU;
- PC y: Lunatics v0.44 (or 0.45 beta3) with cuda50 and 3 WU/GPU;
- PC z: Lunatics v0.45 beta3 with MB_win_x64_NV_SoG_r3472 (default: 1 WU/GPU);

Because RAC is not a measure of a specific day's output, I am currently just counting the # of tasks processed during an exact 24-hr period for both the CPU & GPU(s). I am aware this is far from a great metric, but it is much better than even the PC's "credit/day" as visible on BoincStats (see yours).
The eFMer BoincTasks software allows me to do this count with the History tab.
Soon I plan to use the history.csv file it generates in order to compare it to the points I am actually getting for each task...but that will be a much bigger task than counting CPU & GPU completed tasks over a 24-hr period.
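A per-device 24-hour count from a history CSV could look roughly like this. The 'completed' and 'device' column names here are assumptions for illustration only; BoincTasks' real history.csv layout differs.

```python
import csv
from datetime import datetime, timedelta

def tasks_per_device(path: str, since_hours: int = 24) -> dict:
    """Count completed tasks per device over the last `since_hours` hours.

    Assumes a CSV with a 'completed' column (ISO timestamp) and a
    'device' column ('CPU'/'GPU'); the real BoincTasks history.csv
    uses a different layout, so adapt the column handling to match.
    """
    cutoff = datetime.now() - timedelta(hours=since_hours)
    counts = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            done = datetime.fromisoformat(row["completed"])
            if done >= cutoff:
                counts[row["device"]] = counts.get(row["device"], 0) + 1
    return counts
```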

Let me know if you are interested in doing different setups for your 3 PCs, or if you have any Qs or concerns about doing so.
Cheers, Rob :-)
ID: 1802186 · Report as offensive
Profile Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1802187 - Posted: 12 Jul 2016, 5:51:34 UTC - in response to Message 1802160.  

That's exactly the command line I suggest in the readme.
For mid-range cards like the 750 Ti I would try
-use_sleep -sbs 256 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 32 -oclfft_tune_cw 32
Also, changing -sbs 256 to -sbs 384 and -spike_fft_thresh 4096 to -spike_fft_thresh 2048 is worth a try for those GPUs.

Thanks! I'll give that a try in a few days.

Is there a command line to optimize the cuda50 app on the GTX 750 Ti?
...cuz if I am going to use a command line for the latest NV_SoG running on the 750 Ti to see if it can outperform cuda50, I should consider the same for the cuda50 app.
(Maybe my Q does not apply, as I do not remember seeing any posts during the last 2 months relating cuda and command lines.)
Looking forward to your reply Mike.
Rob :-)
ID: 1802187 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1802192 - Posted: 12 Jul 2016, 7:16:52 UTC - in response to Message 1802187.  

For Cuda there is a file called mbcuda.cfg

I was using (with 750Ti)
processpriority = abovenormal
pfblockspersm = 8
pfperiodsperlaunch = 200

I forget what the defaults are.
ID: 1802192 · Report as offensive
I3APR

Joined: 23 Apr 16
Posts: 99
Credit: 70,717,488
RAC: 0
Italy
Message 1802201 - Posted: 12 Jul 2016, 10:45:08 UTC

My experience, based on a couple of days of statistics:

[image: average run times by WU type for the CPU, 780 Ti, and 1080]

On my PC guppies seem to be processed slower by the CPU (i7 4790K @ 4 GHz),
but not by far...

Also notice that some classes of WU are crunched faster by the 780 Ti than the 1080...

A.

P.S.
The CPU average in seconds is based on 9 fields, since there are no 09my/09dc WUs being crunched by the CPU.
ID: 1802201 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22200
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1802203 - Posted: 12 Jul 2016, 11:21:14 UTC

Without knowing what the AR of the data is, particularly that from Arecibo, your comparison may not be as valid as you think. For Arecibo data, even within a given data set (ddyy) there can be some very substantial differences in AR, and thus run time.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1802203 · Report as offensive
I3APR

Joined: 23 Apr 16
Posts: 99
Credit: 70,717,488
RAC: 0
Italy
Message 1802204 - Posted: 12 Jul 2016, 11:38:04 UTC - in response to Message 1802203.  

Without knowing what the AR of the data is, particularly that from Arecibo, your comparison may not be as valid as you think. For Arecibo data, even within a given data set (ddyy) there can be some very substantial differences in AR, and thus run time.

Rob, that's why I posted the number of samples each statistic is based upon.

For the "blc4", with more than 20/40 samples, the AR should be "equalized" across the GPUs and the CPU as well... unless some mechanism is distributing them "low to the CPU" and "high to the GPU", or vice versa...

A.
ID: 1802204 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1802214 - Posted: 12 Jul 2016, 13:34:17 UTC - in response to Message 1802187.  

That's exactly the command line I suggest in the readme.
For mid-range cards like the 750 Ti I would try
-use_sleep -sbs 256 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 32 -oclfft_tune_cw 32
Also, changing -sbs 256 to -sbs 384 and -spike_fft_thresh 4096 to -spike_fft_thresh 2048 is worth a try for those GPUs.

Thanks! I'll give that a try in a few days.

Is there a command line to optimize the cuda50 app on the GTX 750 Ti?
...cuz if I am going to use a command line for the latest NV_SoG running on the 750 Ti to see if it can outperform cuda50, I should consider the same for the cuda50 app.
(Maybe my Q does not apply, as I do not remember seeing any posts during the last 2 months relating cuda and command lines.)
Looking forward to your reply Mike.
Rob :-)

You can simulate the OpenCL app with CUDA by adding the -poll command line option.
Using this option will speed up the CUDA app but use a full CPU core in the process.
Add it to the app_info.xml like this:
<plan_class>cuda75</plan_class>
        <avg_ncpus>0.1</avg_ncpus>
        <max_ncpus>0.1</max_ncpus>
        <cmdline>-poll</cmdline>
        <coproc>
            <type>CUDA</type>
            <count>1</count>
        </coproc>

Each CUDA instance will use a full CPU core.
ID: 1802214 · Report as offensive
spitfire_mk_2
Joined: 14 Apr 00
Posts: 563
Credit: 27,306,885
RAC: 0
United States
Message 1802215 - Posted: 12 Jul 2016, 13:40:32 UTC - in response to Message 1802160.  


For mid range cards like the 750TI`s i would try

-use_sleep -sbs 256 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 32 -oclfft_tune_cw 32


I put that into mb_cmdline-8.12_windows_intel__opencl_nvidia_SoG.txt
It is magic!
ID: 1802215 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1802217 - Posted: 12 Jul 2016, 13:53:43 UTC

One of my rigs has three 950s and one 750 Ti in it; do you think that adding this to that system would yield substantial improvements?

ID: 1802217 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1802219 - Posted: 12 Jul 2016, 14:01:17 UTC - in response to Message 1802217.  

The readme I have shows...

Mid-range cards x50 x60 x70
-sbs 192 -spike_fft_thresh 2048 -tune 1 64 1 4

That may sacrifice the 750 a bit, but might help the 950s.
ID: 1802219 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.