guppie on NVIDIA cards

Message boards : Number crunching : guppie on NVIDIA cards


Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1802123 - Posted: 11 Jul 2016, 18:31:21 UTC

With all the talk about rescheduling WUs and not having guppies run on NVIDIA cards, I would just like to share my experience.
A guppie seems to take about 30 minutes on my 750 Ti.
A non-guppie takes about 20 minutes.

A guppie on my CPU takes about 3 hours.

Why would I even want to run them on my CPU? Or am I not understanding something here?

I think I have found that running only 1 SoG WU at a time on the GPU is best, but that remains to be seen, as it seems to take about a month for the RAC to really stabilize. When I see something that looks like a steady output I will test running 2 at a time again.

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1802123 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1802128 - Posted: 11 Jul 2016, 18:48:36 UTC - in response to Message 1802123.  

I agree with you, Bill.

I'd rather have the GUPPIs on my GPU than on the CPU.

The GPU can do them faster and allows for better throughput than the CPU.

I think what is bothering some people is the amount of time it takes on their GPUs.

People got used to the rapid crunching of Arecibo work units, and now, with the vastly different Breakthrough Listen data, they aren't happy.

Unfortunately, CreditNew only made that worse.

But that is my opinion.

If people want to swap out their work units, so be it.

I take whatever the server wants to give me.
ID: 1802128 · Report as offensive
The_Matrix
Volunteer tester

Joined: 17 Nov 03
Posts: 414
Credit: 5,827,850
RAC: 0
Germany
Message 1802131 - Posted: 11 Jul 2016, 18:57:47 UTC

At the same time I decided to stop CPU crunching on SETI. No random.
ID: 1802131 · Report as offensive
Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1802134 - Posted: 11 Jul 2016, 19:23:22 UTC - in response to Message 1802128.  

I agree with you, Bill.

I'd rather have the GUPPIs on my GPU than on the CPU.

The GPU can do them faster and allows for better throughput than the CPU.

/edit/

I take whatever the server wants to give me.


Thanks, that is what I have always done. I do actually miss the credit, but you can't have everything.

APs, they are like a dust storm: here for a short time and then gone...

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1802134 · Report as offensive
Profile Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1802144 - Posted: 11 Jul 2016, 21:36:38 UTC

Gents,
I will respectfully disagree...and here is my perspective below.
If I am missing data or jumping to the wrong conclusion, please let me know as I will happily change my view based on a solid perspective with supporting data.

Fortunately, with Bill's #s we can compare oranges-to-oranges since my 2 primary rigs have a GTX 750 Ti each.
But, before I get into the stats I've compiled to support my perspective, here is my preliminary conclusion specific to:
rigs with an 8-core CPU and only-1 GTX 750 Ti:
With the current ratio of guppis to nonVLAR sent by S@H server during the last week, processing:
- nonVLARs on Cuda50 (with 3 WU at a time); and
- VLARs (blc-guppi & 2010-vlar) on 7 CPU cores
seems to be the best approach to maximize throughput of WUs (and indirectly, this likely results in the highest RAC possible).
But the SoG app for NV cards is starting to be competitive, and since it is doing so with only 1 WU at a time, it is likely to become the stock app in the near future (unless Petri's future apps change the field dramatically).

Before I get into my #s, keep in mind that I have written a small batch file that I use to transfer non-VLARs from the CPU-assigned tasks to the GPU queue (hoping soon to use it to test Raistmer's latest GPU apps with similar-looking batches of tasks).
Lately though, I've been trying to optimize for WU throughput (not RAC) by mostly focusing on CPU-to-GPU time ratios at the current server-sent ratios of VLAR to non-VLAR.
After a few days of running my batch file a few times/day to move non-VLARs from CPU to GPU, I end up processing only VLARs on the CPU and all non-VLARs on the GPU. Under this scenario, my GPU processes about 3x more tasks/day than the CPU.
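The idea behind that kind of rescheduling batch file can be sketched in a few lines. This is an illustrative sketch only, not the actual batch file: it assumes tasks are moved by rewriting the <plan_class> of <result> entries in client_state.xml, which must only ever be done while the BOINC client is fully stopped.

```python
# Illustrative sketch only, not the poster's actual batch file: move
# CPU-assigned non-VLAR results onto the cuda50 plan class by rewriting
# client_state.xml. Run only while the BOINC client is fully stopped,
# and keep a backup -- a malformed client_state.xml can trash the cache.
import re

def reschedule_non_vlars(xml_text: str) -> str:
    """Give every non-VLAR <result> with an empty plan class to cuda50."""
    def fix(match):
        block = match.group(0)
        name = re.search(r"<name>([^<]+)</name>", block)
        if name and "vlar" not in name.group(1):  # keep VLARs on the CPU
            block = block.replace("<plan_class></plan_class>",
                                  "<plan_class>cuda50</plan_class>")
        return block
    return re.sub(r"<result>.*?</result>", fix, xml_text, flags=re.DOTALL)
```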
On the 1st rig presented below, the cache is about 100 for the CPU & GPU. But for the 2nd rig, the situation is quite different (100 to 400) and I can't yet explain why...other than the luck-of-the-draw for VLAR to nonVLAR ratios since my SoG rig seems to be almost as productive as my Cuda50 rig.

Also: for the GTX 750 Ti, I have been told that 2-3WU on Cuda50 is still currently the "Gold Standard" for max throughput (without reassigning tasks between GPU & CPU), and that is currently the main purpose for the 1st rig presented.

So here are my stats...finally! lol
On 8010413, I am running with the Lunatics setup with Cuda50 and I just so happen to have changed from 2WU to 3WU (in parallel on GPU) with the following times for non-VLARs:
 2WU: ~29-30mins/each = ~14.5 mins/task throughput
 3WU: ~39mins /each = ~13 mins/task throughput
The CPU times per core (Xeon W3550 with 7 cores running MB_win_x64_SSE3_VS2008_r3330) for the different types of tasks are:
 guppi-vlar:              2hrs to 2:30, usually ~130mins
 non-VLAR:                usually almost 3hrs (~170mins)
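As a sanity check on the arithmetic above: per-task throughput is just the wall-clock time per WU divided by the number of WUs running in parallel. Using the figures from this post:

```python
def throughput_min_per_task(wall_minutes: float, concurrent: int) -> float:
    """Effective minutes per finished task when WUs run in parallel."""
    return wall_minutes / concurrent

# Figures quoted above for the GTX 750 Ti under Cuda50:
assert throughput_min_per_task(29.0, 2) == 14.5  # 2 WUs at ~29 min each
assert throughput_min_per_task(39.0, 3) == 13.0  # 3 WUs at ~39 min each
```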

On 7996377, I am running with the Lunatics v0.45 beta3 setup but have changed the GPU app to Raistmer's version: MB_win_x86_SSE3_OpenCL_NV_SoG_r3484.exe
 1WU: ~14.5 to 16.5 mins
The CPU times per core (Xeon W3550 with 6 cores running MB_win_x64_SSE3_VS2008_r3330) for the different types of tasks are:
 guppi-vlar:              2hrs to 2:30
 Arecibo2010...vlar:      2:25 (145mins) to 3:07 (187mins)
 non-VLAR:                usually almost 3hrs

Compared to my SoG rig and Bill's #s with SoG (version: ?), I am starting to conclude that Cuda50 still seems (to me) to be the Gold Standard for a rig running 1 or 2 GTX 750 Ti GPUs (with or w/o task transfer between CPU & GPU).
There are a few data points missing above but those are currently unavailable since they were collected ad-hoc prior to installing BoincTasks 2 weeks ago to collect the history.

Let me know if I should include specific data points in the future to come to a full-picture conclusion.
Cheers, Rob :-)
PS: on my SoG rig, I noticed 12hrs ago that many tasks have time estimates of 2.5-3mins that are completing in <10mins. Are these a sub-group of nonVLARs that I should be compiling data for separately? Unfortunately, my Cuda50 rig is at another location so I can't compare until later today.
ID: 1802144 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1802146 - Posted: 11 Jul 2016, 21:54:10 UTC
Last modified: 11 Jul 2016, 21:55:33 UTC

What I can see is that your GPU takes double the time running guppies with CUDA as with SoG.
Considering you didn't use any command line values with SoG, your conclusion isn't telling us anything.


With each crime and every kindness we birth our future.
ID: 1802146 · Report as offensive
Profile Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1802152 - Posted: 11 Jul 2016, 22:42:51 UTC - in response to Message 1802146.  

Hey Mike!
What I can see is that your GPU takes double the time running guppies with CUDA as with SoG.
Under Cuda50, it takes 3WU: ~39mins = throughput: ~13 mins/WU
Under SoG, 1WU: ~14.5 to 16.5 mins = 3 WU: ~45mins
The way I've been looking at those values is that Cuda50 is still ~15% better (= 2mins diff / 13mins)

Considering you didn't use any command line values with SoG, your conclusion isn't telling us anything.
I haven't used a command line YET! lol
...but I assumed that Raistmer's latest releases included the best generic command line for most cards.
Also, since NV_SoG mostly has a lag issue on some older cards ( < GTX 750 Ti ), I also assumed a command line wouldn't make a huge difference.
I don't think my perspective should be dismissed yet, since I don't think many Lunatics host setups use a command line (I might be wrong), and Bill's OP didn't mention one.
FYI, I am still learning a lot, and I am focusing on upcoming optimized generic setups (that's why I was using one of Raistmer's latest apps).

Do you have a command line to suggest? I had tried Brent Norman's before, but it was disastrous with guppis.
If you don't, I could try his again in a few days since there was an improvement for nonVLARs.
Cheers, Rob
ID: 1802152 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1802160 - Posted: 12 Jul 2016, 0:01:06 UTC
Last modified: 12 Jul 2016, 0:03:19 UTC

That's exactly the command line I suggest in the readme.

For mid-range cards like the 750 Ti I would try

-use_sleep -sbs 256 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 32 -oclfft_tune_cw 32

Also, changing -sbs 256 to -sbs 384 and -spike_fft_thresh 4096 to -spike_fft_thresh 2048 is worth a try for those GPUs.


With each crime and every kindness we birth our future.
ID: 1802160 · Report as offensive
Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1802163 - Posted: 12 Jul 2016, 0:51:43 UTC - in response to Message 1802160.  
Last modified: 12 Jul 2016, 0:52:32 UTC

I do not run any command lines, as I do not understand them. I am running what I would consider a stock SoG download.

The only thing I can say is that, watching the GPU usage, it sometimes goes down to 0 for a short bit on the graph. I am always willing to run any new program to see how it works, even if I do not know a lot about it.

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1802163 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1802185 - Posted: 12 Jul 2016, 5:29:00 UTC - in response to Message 1802123.  

With all the talk about rescheduling WUs and not having guppies run on NVIDIA cards, I would just like to share my experience.
A guppie seems to take about 30 minutes on my 750 Ti.
A non-guppie takes about 20 minutes.

A guppie on my CPU takes about 3 hours.

Why would I even want to run them on my CPU? Or am I not understanding something here?

Well, I do think you're missing the point a bit. Certainly a VLAR runs faster on a GPU than on a CPU, as do non-VLARs and APs (unless you've got a really slow GPU paired with a really fast CPU). The issue is really which task type runs most efficiently on which device, assuming that you are trying to maximize the total output of your hardware and the money you're shelling out for your electricity (which is certainly my goal).

If you go back and look at the run times for various comparable tasks that I posted in Message 1799300 from my early VLAR rescheduling attempts, you can see that while VLARs run consistently slower than "normal" AR tasks on my GPUs (whether looking at the GT 630 or the GTX 670), they run consistently faster than the "normal" AR tasks on either of the CPUs shown. Therefore, if I can swap a VLAR originally assigned to a GPU with a non-VLAR originally assigned to a CPU, it's a win-win. The VLAR will finish faster than the non-VLAR would have on the CPU, while the non-VLAR will finish faster than the VLAR would have on the GPU. The end result is a significant increase in total throughput.
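The win-win arithmetic above can be checked with the rough run times quoted in this thread (~30 min for a GPU VLAR, ~20 min for a GPU non-VLAR, ~130 min for a CPU VLAR, ~170 min for a CPU non-VLAR):

```python
# Rough run times (minutes) quoted in this thread for a GTX 750 Ti and
# one CPU core; exact values vary by host, but the ordering is what
# makes the swap pay off.
gpu = {"vlar": 30, "non_vlar": 20}
cpu = {"vlar": 130, "non_vlar": 170}

before = gpu["vlar"] + cpu["non_vlar"]  # VLAR on GPU, non-VLAR on CPU
after = gpu["non_vlar"] + cpu["vlar"]   # the two tasks swapped

assert gpu["non_vlar"] < gpu["vlar"]  # GPU finishes its new task sooner
assert cpu["vlar"] < cpu["non_vlar"]  # so does the CPU: a win on both
assert after < before                 # 150 vs 200 device-minutes total
```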

Now, I don't normally put much stock in the whole credit system as a reliable measurement technique but, at least in broad terms, it's probably the simplest way to see the effect of VLAR rescheduling on my boxes. I've been doing VLAR rescheduling for just about 3 weeks now, and taking a weekly RAC snapshot for my account shows:
Monday, June 20: RAC = 61,275.91
Monday, June 27: RAC = 65,050.05
Monday, July 4: RAC = 66,102.52
Monday, July 11: RAC = 70,191.34

Even allowing for the vagaries of the credit system, I think that increase in RAC shows that I'm getting a good bit more productivity out of my hardware and my electricity than I was when I was just letting the VLARs fall where they may.
ID: 1802185 · Report as offensive
Profile Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1802186 - Posted: 12 Jul 2016, 5:29:57 UTC - in response to Message 1802163.  

I do not run any command lines, as I do not understand them. I am running what I would consider a stock SoG download.
The only thing I can say is that, watching the GPU usage, it sometimes goes down to 0 for a short bit on the graph. I am always willing to run any new program to see how it works, even if I do not know a lot about it.

Hey Bill,
I had a look at your 3 rigs with 2 GTX 750 Ti each.
I noticed you have "anonymous platform" setups on those (see the bottom of each page for 6969420, 7226971, and 7965534).
For me, 'stock' means: whatever apps the server sends the host, which in most scenarios means: not "anonymous platform".
"SETI@home v8 8.00 windows_intelx86 (cuda50)" is considered stock, since it is sent by the server when someone only installs BOINC and adds the project SETI@home,
but currently the NV_SoG version installed by Lunatics v0.45 beta3 is not stock.

Since those 3 PCs are fairly identical (AMD FX-8xxx with 8-cores and 2 GTX 750 Ti each), you could consider running your own comparisons.
I would recommend the following:
- PC x: Lunatics v0.44 (or 0.45 beta3) with cuda50 and 2 WU/GPU;
- PC y: Lunatics v0.44 (or 0.45 beta3) with cuda50 and 3 WU/GPU;
- PC z: Lunatics v0.45 beta3 with MB_win_x64_NV_SoG_r3472 (default: 1 WU/GPU);

Because RAC is not a measure of a specific day's output, I am currently just counting the # of tasks processed during an exact 24-hr period for both the CPU & GPU(s). I am aware this is far from a great metric, but it is much better than even the PC's "credit/day" as visible on BoincStats (see yours).
The eFMer BoincTasks software allows me to do this count with the History tab.
Soon I plan to use the history.csv file it generates in order to compare it to the points I am actually getting for each task...but that will be a much bigger task than counting CPU & GPU completed tasks over a 24-hr period.
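A per-device 24-hour count from a history CSV could look roughly like this. The 'completed' and 'device' column names here are assumptions for illustration only; BoincTasks' real history.csv layout differs.

```python
import csv
from datetime import datetime, timedelta

def tasks_per_device(path: str, since_hours: int = 24) -> dict:
    """Count completed tasks per device over the last `since_hours` hours.

    Assumes a CSV with a 'completed' column (ISO timestamp) and a
    'device' column ('CPU'/'GPU'); the real BoincTasks history.csv
    uses a different layout, so adapt the column handling to match.
    """
    cutoff = datetime.now() - timedelta(hours=since_hours)
    counts = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            done = datetime.fromisoformat(row["completed"])
            if done >= cutoff:
                counts[row["device"]] = counts.get(row["device"], 0) + 1
    return counts
```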

Let me know if you are interested in doing different setups for your 3 PCs, or if you have any Qs or concerns about doing so.
Cheers, Rob :-)
ID: 1802186 · Report as offensive
Profile Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1802187 - Posted: 12 Jul 2016, 5:51:34 UTC - in response to Message 1802160.  

That's exactly the command line I suggest in the readme.
For mid-range cards like the 750 Ti I would try
-use_sleep -sbs 256 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 32 -oclfft_tune_cw 32
Also, changing -sbs 256 to -sbs 384 and -spike_fft_thresh 4096 to -spike_fft_thresh 2048 is worth a try for those GPUs.

Thanks! I'll give that a try in a few days.

Is there a command line to optimize the cuda50 app on the GTX 750 Ti?
...cuz if I am going to use a command line for the latest NV_SoG running on the 750 Ti to see if it can outperform cuda50, I should consider the same for the cuda50 app.
(Maybe my Q does not apply, as I do not remember seeing any posts during the last 2 months relating cuda and command lines.)
Looking forward to your reply Mike.
Rob :-)
ID: 1802187 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1802192 - Posted: 12 Jul 2016, 7:16:52 UTC - in response to Message 1802187.  

For Cuda there is a file called mbcuda.cfg

I was using (with 750Ti)
processpriority = abovenormal
pfblockspersm = 8
pfperiodsperlaunch = 200

I forget what the defaults are.
ID: 1802192 · Report as offensive
I3APR

Joined: 23 Apr 16
Posts: 99
Credit: 70,717,488
RAC: 0
Italy
Message 1802201 - Posted: 12 Jul 2016, 10:45:08 UTC

My experience, based on a couple of days of statistics:

[image: average run times by WU type for the CPU, 780 Ti, and 1080]

On my PC guppies seem to be processed slower by the CPU (i7 4790K @ 4 GHz),
but not by far...

Also notice that some classes of WU are crunched faster by the 780 Ti than the 1080...

A.

P.S.
The CPU average in seconds is based on 9 fields, since there are no 09my/09dc WUs being crunched by the CPU.
ID: 1802201 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22200
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1802203 - Posted: 12 Jul 2016, 11:21:14 UTC

Without knowing what the AR of the data is, particularly that from Arecibo, your comparison may not be as valid as you think. For Arecibo data, even within a given data set (ddyy) there can be some very substantial differences in AR, and thus run time.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1802203 · Report as offensive
I3APR

Joined: 23 Apr 16
Posts: 99
Credit: 70,717,488
RAC: 0
Italy
Message 1802204 - Posted: 12 Jul 2016, 11:38:04 UTC - in response to Message 1802203.  

Without knowing what the AR of the data is, particularly that from Arecibo, your comparison may not be as valid as you think. For Arecibo data, even within a given data set (ddyy) there can be some very substantial differences in AR, and thus run time.

Rob, that's why I posted the number of samples each statistic is based upon.

For the "blc4", with more than 20/40 samples, the AR should be "equalized" across the GPUs and the CPU as well... unless some mechanism is distributing them "low to the CPU" and "high to the GPU", or vice versa...

A.
ID: 1802204 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1802214 - Posted: 12 Jul 2016, 13:34:17 UTC - in response to Message 1802187.  

That's exactly the command line I suggest in the readme.
For mid-range cards like the 750 Ti I would try
-use_sleep -sbs 256 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 32 -oclfft_tune_cw 32
Also, changing -sbs 256 to -sbs 384 and -spike_fft_thresh 4096 to -spike_fft_thresh 2048 is worth a try for those GPUs.

Thanks! I'll give that a try in a few days.

Is there a command line to optimize the cuda50 app on the GTX 750 Ti?
...cuz if I am going to use a command line for the latest NV_SoG running on the 750 Ti to see if it can outperform cuda50, I should consider the same for the cuda50 app.
(Maybe my Q does not apply, as I do not remember seeing any posts during the last 2 months relating cuda and command lines.)
Looking forward to your reply Mike.
Rob :-)

You can simulate the OpenCL app with CUDA by adding the -poll command line option.
Using this option will speed up the CUDA app but use a full CPU core in the process.
Add it to the app_info.xml like this:
<plan_class>cuda75</plan_class>
        <avg_ncpus>0.1</avg_ncpus>
        <max_ncpus>0.1</max_ncpus>
        <cmdline>-poll</cmdline>
        <coproc>
            <type>CUDA</type>
            <count>1</count>
        </coproc>

Each CUDA instance will use a full CPU core.
ID: 1802214 · Report as offensive
spitfire_mk_2
Joined: 14 Apr 00
Posts: 563
Credit: 27,306,885
RAC: 0
United States
Message 1802215 - Posted: 12 Jul 2016, 13:40:32 UTC - in response to Message 1802160.  


For mid range cards like the 750TI`s i would try

-use_sleep -sbs 256 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 32 -oclfft_tune_cw 32


I put that into mb_cmdline-8.12_windows_intel__opencl_nvidia_SoG.txt
It is magic!
ID: 1802215 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1802217 - Posted: 12 Jul 2016, 13:53:43 UTC

One of my rigs has three 950s and one 750 Ti in it; do you think that adding this to that system would yield substantial improvements?

ID: 1802217 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1802219 - Posted: 12 Jul 2016, 14:01:17 UTC - in response to Message 1802217.  

The readme I have shows...

Mid-range cards x50 x60 x70
-sbs 192 -spike_fft_thresh 2048 -tune 1 64 1 4

That may sacrifice the 750 a bit, but might help the 950s.
ID: 1802219 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.