Tests of new scheduler features.

Message boards : News : Tests of new scheduler features.
Richard Haselgrove
Volunteer tester

Send message
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 46072 - Posted: 25 May 2013, 13:35:52 UTC - in response to Message 46071.  

... that's one of the first to complete under '4 VLARs at once'. 5,286 seconds - almost an hour and a half - works out at an equivalent APR just below 35, compared to ~150 - 160 for the general 'non-VLAR' mix of work.


These VLARs take some 4 hours each on a 3.0GHz Core 2 (1 core). Verifying 4 at once in 1.5 hours? Is that 2 per card on 2 cards? Otherwise that sounds quite a bit more efficient somehow than my 680 (faster bus & CPU perhaps...) - 1 in ~44 mins, multiples not tested.

Yes, exactly - 2 cards (identical - NB factory overclock), 2 tasks per card, four tasks in total, each task takes ~90 minutes. So she would spit out one task every 22 minutes, give or take. CPU is overclocked i7-3770K, hyperthreaded - running six threads of BOINC (non-SETI) tasks, the balance of CPU power available to support the GPUs.

The tasks had a staggered start as the supply of non-VLAR tasks ran out. I remember the original CUDA apps ran with horrible lag at the beginning of each task, and again for a short segment towards the end (was it around the 75% mark? I've forgotten the details), but better at other stages. So another test would be to send off all four with a synchronised start, but I'm not sure I want to go there...
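For anyone checking the arithmetic above: a minimal sketch of the 'equivalent APR' figure, assuming APR is simply the workunit's estimated flops (rsc_fpops_est) divided by elapsed time, in GFLOPS. The ~1.85e14 flops estimate here is back-computed from the quoted numbers, not taken from the server.

```python
def equivalent_apr(rsc_fpops_est, elapsed_s):
    """Equivalent APR in GFLOPS: estimated floating-point ops / elapsed seconds."""
    return rsc_fpops_est / elapsed_s / 1e9

# Back-computed estimate (assumption): ~1.85e14 fpops for this VLAR workunit.
vlar_apr = equivalent_apr(1.85e14, 5286)   # the 5,286-second task quoted above
print(round(vlar_apr))  # 35 - vs ~150-160 for the non-VLAR mix
```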
ID: 46072 · Report as offensive
jason_gee
Volunteer tester

Send message
Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 46073 - Posted: 25 May 2013, 13:47:47 UTC - in response to Message 46072.  

Yes, exactly - 2 cards (identical - NB factory overclock), 2 tasks per card, four tasks in total, each task takes ~90 minutes. So she would spit out one task every 22 minutes, give or take. CPU is overclocked i7-3770K, hyperthreaded - running six threads of BOINC (non-SETI) tasks, the balance of CPU power available to support the GPUs.

The tasks had a staggered start as the supply of non-VLAR tasks ran out. I remember the original CUDA apps ran with horrible lag at the beginning of each task, and again for a short segment towards the end (was it around the 75% mark? I've forgotten the details), but better at other stages. So another test would be to send off all four with a synchronised start, but I'm not sure I want to go there...


Yep, alright, matching up with my vague recollections. Yeah taking a while to get a handle on the characteristics here, and the old Cuda apps were too long ago :).

I've been running at defaults, but will jack up the settings now in the hope of clear Cuda50 performance (still at single instance). Running modded Boinc here, I shouldn't run into problems from excessive runtime aborts. I hope the APR swing stays around, or preferably under, a factor of 5. If that looks OK I'll upgrade Boinc for a multiple test run.

ID: 46073 · Report as offensive
Profile Raistmer
Volunteer tester
Avatar

Send message
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 46074 - Posted: 25 May 2013, 14:22:27 UTC

These lags corresponded to PulseFind processing of max length and FFT size of 8. That is, max work per thread, with the GPU heavily underloaded (because there are too few independent threads for the GPU to process). Indeed, such a PulseFind config will occur at the very beginning of the task + perhaps somewhere near the end.
ID: 46074 · Report as offensive
Richard Haselgrove
Volunteer tester

Send message
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 46078 - Posted: 25 May 2013, 16:42:58 UTC - in response to Message 46074.  

These lags corresponded to PulseFind processing of max length and FFT size of 8. That is, max work per thread, with the GPU heavily underloaded (because there are too few independent threads for the GPU to process). Indeed, such a PulseFind config will occur at the very beginning of the task + perhaps somewhere near the end.

Would the relative positioning of those pulsefind kernels have been changed with the addition of Autocorrelations to the workload? Just asking, so I know when and where to look if I ever try that 'synchronised start' test.

Even with powerful GPUs, I didn't see much sign of underutilisation:


[image: GPU usage graph (direct link)]

The GPU usage traces for the two cards - lines 5 and 6 - show relatively high, though variable, usage throughout the display - left-to-right is about 25 minutes, or over 25% of the WUs. Even with one task just recently started, and the other just over 80% done, I see GPU usage mostly in the high 80%s, only occasionally spiking down into the teen%s - and they really are narrow spikes.

But very different from the steady 98% - 99% loading of the single GPUGrid task on the other GPU currently.
ID: 46078 · Report as offensive
Profile Raistmer
Volunteer tester
Avatar

Send message
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 46079 - Posted: 25 May 2013, 17:00:27 UTC - in response to Message 46078.  

No change in relative positioning. The longest PulseFind occurs before the first autocorr.
ID: 46079 · Report as offensive
jason_gee
Volunteer tester

Send message
Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 46080 - Posted: 25 May 2013, 17:26:05 UTC - in response to Message 46078.  
Last modified: 25 May 2013, 17:49:17 UTC

These lags corresponded to PulseFind processing of max length and FFT size of 8. That is, max work per thread, with the GPU heavily underloaded (because there are too few independent threads for the GPU to process). Indeed, such a PulseFind config will occur at the very beginning of the task + perhaps somewhere near the end.

Would the relative positioning of those pulsefind kernels have been changed with the addition of Autocorrelations to the workload? Just asking, so I know when and where to look if I ever try that 'synchronised start' test.

Even with powerful GPUs, I didn't see much sign of underutilisation:


[image: GPU usage graph (direct link)]

The GPU usage traces for the two cards - lines 5 and 6 - show relatively high, though variable, usage throughout the display - left-to-right is about 25 minutes, or over 25% of the WUs. Even with one task just recently started, and the other just over 80% done, I see GPU usage mostly in the high 80%s, only occasionally spiking down into the teen%s - and they really are narrow spikes.

But very different from the steady 98% - 99% loading of the single GPUGrid task on the other GPU currently.


For detail: while my autocorrelations are still using the VRAM- & bus-bound 4NFFT approach, a crude baseline implementation, loading with a single task won't fill larger units. It can 'look like' it does because of the way utilisation is measured (on the first SMX, by duration % in the sample period). This will [tend to] look like flat troughs where there are ACs in progress & peaks elsewhere. For the remainder of processing, they fill the GPU fairly well, though there are hidden latencies involved. Optionally, jacking up the priority & pulsefind settings from defaults smooths these.

[Edit:] With aggressive settings like mine, you should begin to see hints of long pulsefind-related lag at the familiar points [ & narrow dips disappear... ]:
[mbcuda]
processpriority = abovenormal
pfblockspersm = 15
pfperiodsperlaunch = 200
ID: 46080 · Report as offensive
Josef W. Segur
Volunteer tester

Send message
Joined: 14 Oct 05
Posts: 1137
Credit: 1,848,733
RAC: 0
United States
Message 46081 - Posted: 25 May 2013, 19:44:59 UTC - in response to Message 46070.  

OK, now we have VLARs - I've got about 80 of them. Let the fun begin.
...
The bad news - they still run very, very slowly. Note that apart from the 'two tasks at once' (via app_config.xml), I'm running the host exactly as stock: I've no doubt performance could be increased via application parameter tuning, but that's not what I'm reporting on here.

The most recent completed task is WU 5348248 - that's one of the first to complete under '4 VLARs at once'. 5,286 seconds - almost an hour and a half - works out at an equivalent APR just below 35, compared to ~150 - 160 for the general 'non-VLAR' mix of work. It will be interesting to see what APR is recorded for cuda50 at the end of this streak - I'll run all 80 consecutively. The APR might not end up as bad as those figures suggest, because I have over 400 tasks pending and their validation (as wingmates trickle in) will push APR back up towards normal levels.
...

There's some related information from host 5619046 at the main project. That dual 560 Ti system did a lot of setiathome_enhanced VLARs on GPU with x41zc, Cuda 4.2. The user posted several times about it in the "Please rise the limits... just a little..." thread, March 13, starting with message 1345975. I looked through its task list at the time and judged that for that configuration VLARs took about 6 times as long as midrange ARs with similar estimates. The host only has AP v6 in progress now, probably a reaction to the 100 limit. Its last remaining VLAR on GPU was validated earlier today, but here's the timing comparison for that VLAR and 4 midrange AR tasks:

Task name                              Run time   CPU time
28my12ab.11019.110604.3.11.149.vlar_0  5,689.26   53.66
03mr13ad.23566.11110.6.11.140_0          869.53   47.56
03mr13ad.23566.11110.6.11.122_0          866.73   44.10
02mr13ad.4401.7850.16.11.39_0            897.47   48.22
02mr13ad.4401.7850.16.11.33_1            899.64   50.01
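Joe's "about 6 times as long" figure can be checked directly from the table; a quick sketch using the run times above:

```python
# Run times (seconds) from the task list above.
vlar_runtime = 5689.26
midrange_runtimes = [869.53, 866.73, 897.47, 899.64]

# VLAR elapsed time relative to the average midrange AR on that dual 560 Ti host.
slowdown = vlar_runtime / (sum(midrange_runtimes) / len(midrange_runtimes))
print(round(slowdown, 1))  # 6.4
```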


                                                                  Joe
ID: 46081 · Report as offensive
jason_gee
Volunteer tester

Send message
Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 46082 - Posted: 25 May 2013, 20:51:22 UTC - in response to Message 46081.  
Last modified: 25 May 2013, 20:56:28 UTC

Hmmm, yeah 560ti (compute cap 2.1), like on the other PC here, probably sits either square on, or below, performance levels where significant problems with APR etc might occur. It's a tough call.

Target market for this card was the newly created 'midrange enthusiast' or 'performance - price' bracket, if you like. This suggests to me that pushing VLARs to these could go badly, because of the target market expectations, and the geometry being more or less maxxed out for the memory subsystem.

Chances are there could be initial backlash with only Kepler Class receiving VLARs, at the more palatable 4x elapsed. That's all prior to adjustable multithreading the long pulsefinds in x42, ala paralleled V13 experiment, after serialisation of the result reductions. I would describe Fermi-class utilisation as solid. That'd make not much room to move before hybridisation & higher level algorithmic changes, while Keplers still do have some 'legs' yet, before major developments.

Jason
ID: 46082 · Report as offensive
Josef W. Segur
Volunteer tester

Send message
Joined: 14 Oct 05
Posts: 1137
Credit: 1,848,733
RAC: 0
United States
Message 46083 - Posted: 26 May 2013, 0:17:43 UTC - in response to Message 46082.  

Hmmm, yeah 560ti (compute cap 2.1), like on the other PC here, probably sits either square on, or below, performance levels where significant problems with APR etc might occur. It's a tough call.

Target market for this card was the newly created 'midrange enthusiast' or 'performance - price' bracket, if you like. This suggests to me that pushing VLARs to these could go badly, because of the target market expectations, and the geometry being more or less maxxed out for the memory subsystem.

Chances are there could be initial backlash with only Kepler Class receiving VLARs, at the more palatable 4x elapsed. That's all prior to adjustable multithreading the long pulsefinds in x42, ala paralleled V13 experiment, after serialisation of the result reductions. I would describe Fermi-class utilisation as solid. That'd make not much room to move before hybridisation & higher level algorithmic changes, while Keplers still do have some 'legs' yet, before major developments.

Jason

Autocorrelation processing would somewhat reduce the ratio on that dual 560 Ti host, but I agree the mismatch between estimates and actual performance is a problem area. What might be best overall is an additional preference so those crunching with GPUs could opt in or out of doing VLARs. I'm very much in favor of giving users control of their own systems.
                                                                  Joe
ID: 46083 · Report as offensive
Profile Raistmer
Volunteer tester
Avatar

Send message
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 46084 - Posted: 26 May 2013, 7:54:46 UTC - in response to Message 46083.  

I agree the mismatch between estimates and actual performance is a problem area. What might be best overall is an additional preference so those crunching with GPUs could opt in or out of doing VLARs. I'm very much in favor of giving users control of their own systems.
                                                                  Joe


A "must have" option. Or we would see mass VLAR abortions on main. Even with such an option we will see some of them, just because some people are unaware that there are other ways. The CPU AstroPulse experience confirms this...
ID: 46084 · Report as offensive
Richard Haselgrove
Volunteer tester

Send message
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 46085 - Posted: 26 May 2013, 8:39:56 UTC
Last modified: 26 May 2013, 8:52:53 UTC

Got my first Kepler-Kepler validation - WU 5374993

The wingmate is interesting - host 62827 - dual Titan. He has a number of EXIT_TIME_LIMIT_EXCEEDED for VLAR, but also a validation with cuda32 in not much longer than his cuda42 tasks. He seems to be getting very few cuda50, I note.

Edit - perhaps I should add that the VLARs on the system I'm watching arrived with an estimate of 18 minutes 24 seconds - so the rsc_fpops_bound time limit would be over three hours. None of the tasks has yet exceeded an hour and a half, so I'm well inside the safety zone, and should remain so unless I have a GPU downclock.
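A rough sketch of the safety margin described in that edit, assuming the usual BOINC convention that rsc_fpops_bound is about 10x rsc_fpops_est (an assumption here; the post only says the limit is "over three hours"):

```python
estimate_s = 18 * 60 + 24      # 18 min 24 s initial estimate -> 1104 s
bound_factor = 10              # assumption: rsc_fpops_bound ~ 10x rsc_fpops_est
abort_limit_s = estimate_s * bound_factor   # 11040 s, just over 3 hours

longest_vlar_s = 90 * 60       # no task has yet exceeded ~1.5 hours
print(abort_limit_s / 3600, longest_vlar_s < abort_limit_s)
```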
ID: 46085 · Report as offensive
Profile Raistmer
Volunteer tester
Avatar

Send message
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 46086 - Posted: 26 May 2013, 9:02:02 UTC - in response to Message 46069.  

Two more tasks wasted for nothing... and because it's Brook AP it cost much more than an aborted MB task...

14027050 5343580 23 May 2013, 21:19:57 UTC 25 May 2013, 9:22:36 UTC Computation error 24,032.81 22,418.23 --- AstroPulse v6 v6.05 (cal_ati)
14027020 5311626 23 May 2013, 21:19:57 UTC 25 May 2013, 9:16:52 UTC Computation error 24,032.78 23,839.50 --- AstroPulse v6 v6.05 (cal_ati)

BOINC estimated the execution time as 40 min while each task takes 8 hours to complete. So, 2 tasks were aborted.

What would prevent BOINC from aborting the next 2? The estimate remains the same.
I will suspend Brook+ AP work on this host until something is changed to prevent such needless abortions!


And back to the BOINC forced-abortions theme. Is there any chance that after updating the project I will get increased estimates for already-downloaded tasks? Or is the only way to make it work to manually abort all already-downloaded CAL tasks and hope that newly downloaded ones will have bigger estimates?
ID: 46086 · Report as offensive
Profile Raistmer
Volunteer tester
Avatar

Send message
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 46087 - Posted: 26 May 2013, 9:08:23 UTC

And another possible issue:

SETI@home v7 7.00 windows_intelx86 (cuda23)
Number of tasks completed 76
Max tasks per day 109
Number of tasks today 0
Consecutive valid tasks 76
Average processing rate 45.331657046529
Average turnaround time 2.91 days
SETI@home v7 7.00 windows_intelx86 (cuda32)
Number of tasks completed 93
Max tasks per day 119
Number of tasks today 70
Consecutive valid tasks 94
Average processing rate 51.162516650229
Average turnaround time 1.79 days


It's the GSO9600 host with the AMD Trinity APU, so now it runs both NV and ATI tasks.
And it looks like the APR for NV tasks is skewed. So far, all pre-Fermi GPUs I tested favored cuda23, not cuda32, tasks. But with the addition of the ATI GPU (it was disabled initially), the execution times have perhaps changed. And it looks like the initial bias in APR will never be healed. The host continues to receive only cuda32 tasks with the higher APR, which leaves no chance for cuda23. In short, the random part of the version selection is too small to be usable, just as I feared.
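The lock-in effect described here is easy to demonstrate: if the scheduler mostly picks the version with the highest APR, a version that has fallen behind almost never gets new completions to correct its average. A toy sketch, not the actual scheduler code (the 1% probe rate is an assumed illustration):

```python
import random

# APRs from the host stats above.
aprs = {"cuda23": 45.331657, "cuda32": 51.162517}
PROBE_RATE = 0.01   # assumed small chance of probing a non-best version

def pick_version(rng):
    """Mostly pick the highest-APR version; rarely probe one at random."""
    if rng.random() < PROBE_RATE:
        return rng.choice(list(aprs))
    return max(aprs, key=aprs.get)

rng = random.Random(1)
picks = [pick_version(rng) for _ in range(1000)]
# cuda23 is chosen only a handful of times, so its APR can barely update.
print(picks.count("cuda23"))
```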
ID: 46087 · Report as offensive
Claggy
Volunteer tester

Send message
Joined: 29 May 06
Posts: 1037
Credit: 8,440,339
RAC: 0
United Kingdom
Message 46089 - Posted: 26 May 2013, 9:31:35 UTC - in response to Message 46087.  
Last modified: 26 May 2013, 9:58:38 UTC

And another possible issue:

SETI@home v7 7.00 windows_intelx86 (cuda23)
Number of tasks completed 76
Max tasks per day 109
Number of tasks today 0
Consecutive valid tasks 76
Average processing rate 45.331657046529
Average turnaround time 2.91 days
SETI@home v7 7.00 windows_intelx86 (cuda32)
Number of tasks completed 93
Max tasks per day 119
Number of tasks today 70
Consecutive valid tasks 94
Average processing rate 51.162516650229
Average turnaround time 1.79 days


It's the GSO9600 host with the AMD Trinity APU, so now it runs both NV and ATI tasks.
And it looks like the APR for NV tasks is skewed. So far, all pre-Fermi GPUs I tested favored cuda23, not cuda32, tasks. But with the addition of the ATI GPU (it was disabled initially), the execution times have perhaps changed. And it looks like the initial bias in APR will never be healed. The host continues to receive only cuda32 tasks with the higher APR, which leaves no chance for cuda23. In short, the random part of the version selection is too small to be usable, just as I feared.

That is, unless that host does a bunch of VHAR tasks with the Cuda32 app; that'll have the effect of driving the APR down, and hopefully bring the Cuda23 app into play.

On my GTX460 I know the Cuda42 app is fastest, with the Cuda5 and Cuda32 apps being only slightly slower (from my bench testing).
As of 15 minutes ago all my Nvidia WUs were Cuda42 WUs (the Cuda42 APR must have been top three days ago when I received that work);
now when I unset NNT I get a mixture of normal and VHAR Cuda5 WUs. I foresee that which app version is preferred will switch every time we have a shortie storm at the Main project (as long as enough WUs are done).

SETI@home v7 7.00 windows_intelx86 (cuda32)

Number of tasks completed  323

Max tasks per day          187

Number of tasks today      0

Consecutive valid tasks    154

Average processing rate    161.44290806292

Average turnaround time    2.48 days


SETI@home v7 7.00 windows_intelx86 (cuda42)

Number of tasks completed  494

Max tasks per day          535

Number of tasks today      0

Consecutive valid tasks    502

Average processing rate    184.4685013102

Average turnaround time    2.86 days


SETI@home v7 7.00 windows_intelx86 (cuda50)

Number of tasks completed  428

Max tasks per day          375

Number of tasks today      37

Consecutive valid tasks    342

Average processing rate    190.79465847299

Average turnaround time    2.19 days


Claggy
ID: 46089 · Report as offensive
Richard Haselgrove
Volunteer tester

Send message
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 46091 - Posted: 26 May 2013, 21:29:46 UTC

Cache is beginning to run a little low - the 'in progress' number on the task list includes some non-VLAR I'm holding suspended - so I allowed some work fetch.

With the cuda50 APR now down to 133 under the influence of the VLARs, cuda42 becomes the popular choice...
ID: 46091 · Report as offensive
Richard Haselgrove
Volunteer tester

Send message
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 46097 - Posted: 27 May 2013, 17:55:19 UTC - in response to Message 46091.  

OK, VLAR test completed. cuda50 APR driven down to 121 - I think it went transiently even lower than that. Done a couple of cuda42 VLARs - seemed a trifle quicker that the cuda50 norm, but too small a sample for that to be significant. I'll finish off the holdbacks, then return to normal running.
ID: 46097 · Report as offensive
Profile Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 46098 - Posted: 28 May 2013, 4:27:19 UTC - in response to Message 46091.  
Last modified: 28 May 2013, 4:30:37 UTC

I expect that once cuda42 picks up some VLARs its APR will drop accordingly and cuda50 will again be your main choice. Because of the better workunit mix on the main project, I expect this to happen more smoothly there than it does in beta, where there are big runs of only VLAR.

As far as cal_ati goes, I'll put back the flag on the main project so it won't get sent to opencl capable cards (except it won't get sent in cases of cards reporting opencl capability but not having it. We'll see how common that is in the main project.)

I'm also planning to add a "reset app version statistics for this host" button to the app versions detail page, for people who believe that they are getting the wrong app versions consistently.
ID: 46098 · Report as offensive
Richard Haselgrove
Volunteer tester

Send message
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 46102 - Posted: 28 May 2013, 9:24:00 UTC - in response to Message 46098.  

I expect that once cuda42 picks up some VLARs its APR will drop accordingly and cuda50 will again be your main choice. Because of the better workunit mix on the main project, I expect this to happen more smoothly there than it does in beta, where there are big runs of only VLAR.

cuda50 APR is still dropping as wingmates trickle in - now 111. Getting closer to the 96 of Cuda32....

I agree that a similar consecutive run of cuda42 VLARs could theoretically drive that APR even lower, but that's just a race to the bottom. I'll be more interested to watch how quickly the random probes bring cuda50 back up to the top: if they probe too often, people will complain that slower, inefficient (for their rig) apps will be too prominent in the mix: if they probe too rarely, a distorted APR will remain trapped under an inversion for too long.

It sounds like this might provide an answer for the question being asked at Main: will we still need to use anonymous platform once the stock applications are the same as third-party apps? Anonymous platform will be one way (the only way?) to 'lock-in' a host to use one version consistently.
ID: 46102 · Report as offensive
William
Volunteer tester
Avatar

Send message
Joined: 14 Feb 13
Posts: 606
Credit: 588,843
RAC: 0
Message 46103 - Posted: 28 May 2013, 11:31:12 UTC - in response to Message 46102.  

It sounds like this might provide an answer for the question being asked at Main: will we still need to use anonymous platform once the stock applications are the same as third-party apps? Anonymous platform will be one way (the only way?) to 'lock-in' a host to use one version consistently.

Another scenario where you need Anonymous Platform is when you want to restrict a type of application to a type of device (e.g. AP on CPU only)

A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 46103 · Report as offensive
Profile Mike
Volunteer tester
Avatar

Send message
Joined: 16 Jun 05
Posts: 2531
Credit: 1,074,556
RAC: 0
Germany
Message 46104 - Posted: 28 May 2013, 12:00:23 UTC - in response to Message 46103.  

It sounds like this might provide an answer for the question being asked at Main: will we still need to use anonymous platform once the stock applications are the same as third-party apps? Anonymous platform will be one way (the only way?) to 'lock-in' a host to use one version consistently.

Another scenario where you need Anonymous Platform is when you want to restrict a type of application to a type of device (e.g. AP on CPU only)


I totally agree on that.

With each crime and every kindness we birth our future.
ID: 46104 · Report as offensive



 
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.