Tests of new scheduler features.

Message boards : News : Tests of new scheduler features.
Richard Haselgrove
Volunteer tester

Send message
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 46072 - Posted: 25 May 2013, 13:35:52 UTC - in response to Message 46071.  

... that's one of the first to complete under '4 VLARs at once'. 5,286 seconds - almost an hour and a half - works out at an equivalent APR just below 35, compared to ~150 - 160 for the general 'non-VLAR' mix of work.


These VLARs take some 4 hours each on a 3.0GHz Core 2 (1 core). Verifying 4 at once in 1.5 hours? Is that 2 per card on 2 cards? Otherwise that sounds quite a bit more efficient somehow than my 680 (faster bus & CPU perhaps...) - 1 in ~44 mins, multiples not tested.

Yes, exactly - 2 cards (identical - NB factory overclock), 2 tasks per card, four tasks in total, each task takes ~90 minutes. So she would spit out one task every 22 minutes, give or take. CPU is overclocked i7-3770K, hyperthreaded - running six threads of BOINC (non-SETI) tasks, the balance of CPU power available to support the GPUs.

The tasks had a staggered start as the supply of non-VLAR tasks ran out. I remember the original CUDA apps ran with horrible lag at the beginning of each task, and again for a short segment towards the end (was it around the 75% mark? I've forgotten the details), but better at other stages. So another test would be to send off all four with a synchronised start, but I'm not sure I want to go there...
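For anyone checking the arithmetic above: a minimal sketch of the 'equivalent APR' figure, assuming APR is simply the workunit's estimated flops (rsc_fpops_est) divided by elapsed time, in GFLOPS. The ~1.85e14 flops estimate here is back-computed from the quoted numbers, not taken from the server.

```python
def equivalent_apr(rsc_fpops_est, elapsed_s):
    """Equivalent APR in GFLOPS: estimated floating-point ops / elapsed seconds."""
    return rsc_fpops_est / elapsed_s / 1e9

# Back-computed estimate (assumption): ~1.85e14 fpops for this VLAR workunit.
vlar_apr = equivalent_apr(1.85e14, 5286)   # the 5,286-second task quoted above
print(round(vlar_apr))  # 35 - vs ~150-160 for the non-VLAR mix
```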
ID: 46072 · Report as offensive
jason_gee
Volunteer tester

Send message
Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 46073 - Posted: 25 May 2013, 13:47:47 UTC - in response to Message 46072.  

Yes, exactly - 2 cards (identical - NB factory overclock), 2 tasks per card, four tasks in total, each task takes ~90 minutes. So she would spit out one task every 22 minutes, give or take. CPU is overclocked i7-3770K, hyperthreaded - running six threads of BOINC (non-SETI) tasks, the balance of CPU power available to support the GPUs.

The tasks had a staggered start as the supply of non-VLAR tasks ran out. I remember the original CUDA apps ran with horrible lag at the beginning of each task, and again for a short segment towards the end (was it around the 75% mark? I've forgotten the details), but better at other stages. So another test would be to send off all four with a synchronised start, but I'm not sure I want to go there...


Yep, alright, matching up with my vague recollections. Yeah taking a while to get a handle on the characteristics here, and the old Cuda apps were too long ago :).

I've been running at defaults, but will jack up the settings now in the hope of clear Cuda50 performance (still at single instance). Running modded Boinc here, I shouldn't run into problems from excessive runtime aborts. I hope the APR swing stays around, or preferably under, a factor of 5. If that looks OK I'll upgrade Boinc for a multiple test run.

ID: 46073 · Report as offensive
Profile Raistmer
Volunteer tester
Avatar

Send message
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 46074 - Posted: 25 May 2013, 14:22:27 UTC

These lags corresponded to PulseFind processing of max length and FFT size of 8. That is, max work per thread, with the GPU heavily underloaded (because there are too few independent threads for the GPU to process). Indeed, such a PulseFind config will occur at the very beginning of the task + perhaps somewhere near the end.
ID: 46074 · Report as offensive
Richard Haselgrove
Volunteer tester

Send message
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 46078 - Posted: 25 May 2013, 16:42:58 UTC - in response to Message 46074.  

These lags corresponded to PulseFind processing of max length and FFT size of 8. That is, max work per thread, with the GPU heavily underloaded (because there are too few independent threads for the GPU to process). Indeed, such a PulseFind config will occur at the very beginning of the task + perhaps somewhere near the end.

Would the relative positioning of those pulsefind kernels have been changed with the addition of Autocorrelations to the workload? Just asking, so I know when and where to look if I ever try that 'synchronised start' test.

Even with powerful GPUs, I didn't see much sign of underutilisation:


[image: GPU usage graph (direct link)]

The GPU usage traces for the two cards - lines 5 and 6 - show relatively high, though variable, usage throughout the display - left-to-right is about 25 minutes, or over 25% of the WUs. Even with one task just recently started, and the other just over 80% done, I see GPU usage mostly in the high 80%s, only occasionally spiking down into the teen%s - and they really are narrow spikes.

But very different from the steady 98% - 99% loading of the single GPUGrid task on the other GPU currently.
ID: 46078 · Report as offensive
Profile Raistmer
Volunteer tester
Avatar

Send message
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 46079 - Posted: 25 May 2013, 17:00:27 UTC - in response to Message 46078.  

No change in relative positioning. The longest PulseFind occurs before the first autocorr.
ID: 46079 · Report as offensive
jason_gee
Volunteer tester

Send message
Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 46080 - Posted: 25 May 2013, 17:26:05 UTC - in response to Message 46078.  
Last modified: 25 May 2013, 17:49:17 UTC

These lags corresponded to PulseFind processing of max length and FFT size of 8. That is, max work per thread, with the GPU heavily underloaded (because there are too few independent threads for the GPU to process). Indeed, such a PulseFind config will occur at the very beginning of the task + perhaps somewhere near the end.

Would the relative positioning of those pulsefind kernels have been changed with the addition of Autocorrelations to the workload? Just asking, so I know when and where to look if I ever try that 'synchronised start' test.

Even with powerful GPUs, I didn't see much sign of underutilisation:


[image: GPU usage graph (direct link)]

The GPU usage traces for the two cards - lines 5 and 6 - show relatively high, though variable, usage throughout the display - left-to-right is about 25 minutes, or over 25% of the WUs. Even with one task just recently started, and the other just over 80% done, I see GPU usage mostly in the high 80%s, only occasionally spiking down into the teen%s - and they really are narrow spikes.

But very different from the steady 98% - 99% loading of the single GPUGrid task on the other GPU currently.


For detail: while my autocorrelations are still using the VRAM- & bus-bound 4NFFT approach, a crude baseline implementation, loading with a single task won't fill larger units. It can 'look like' it does because of the way utilisation is measured (on the first SMX, by duration % in the sample period). This will [tend to] look like flat troughs where there are ACs in progress & peaks elsewhere. For the remainder of processing, they fill the GPU fairly well, though there are hidden latencies involved. Optionally, jacking up the priority & pulsefind settings from defaults smooths these.

[Edit:] With aggressive settings like mine, you should begin to see hints of long pulsefind-related lag at the familiar points [ & narrow dips disappear... ]:
[mbcuda]
processpriority = abovenormal
pfblockspersm = 15
pfperiodsperlaunch = 200
ID: 46080 · Report as offensive
Josef W. Segur
Volunteer tester

Send message
Joined: 14 Oct 05
Posts: 1137
Credit: 1,848,733
RAC: 0
United States
Message 46081 - Posted: 25 May 2013, 19:44:59 UTC - in response to Message 46070.  

OK, now we have VLARs - I've got about 80 of them. Let the fun begin.
...
The bad news - they still run very, very slowly. Note that apart from the 'two tasks at once' (via app_config.xml), I'm running the host exactly as stock: I've no doubt performance could be increased via application parameter tuning, but that's not what I'm reporting on here.

The most recent completed task is WU 5348248 - that's one of the first to complete under '4 VLARs at once'. 5,286 seconds - almost an hour and a half - works out at an equivalent APR just below 35, compared to ~150 - 160 for the general 'non-VLAR' mix of work. It will be interesting to see what APR is recorded for cuda50 at the end of this streak - I'll run all 80 consecutively. The APR might not end up as bad as those figures suggest, because I have over 400 tasks pending and their validation (as wingmates trickle in) will push APR back up towards normal levels.
...

There's some related information from host 5619046 at the main project. That dual 560 Ti system did a lot of setiathome_enhanced VLARs on GPU with x41zc, Cuda 4.2. The user posted several times about it in the "Please rise the limits... just a little..." thread, March 13, starting with message 1345975. I looked through its task list at the time and judged that for that configuration VLARs took about 6 times as long as midrange ARs with similar estimates. The host only has AP v6 in progress now, probably a reaction to the 100 limit. Its last remaining VLAR on GPU was validated earlier today, but here's the timing comparison for that VLAR and 4 midrange AR tasks:

Task name                              Run time   CPU time
28my12ab.11019.110604.3.11.149.vlar_0  5,689.26   53.66
03mr13ad.23566.11110.6.11.140_0          869.53   47.56
03mr13ad.23566.11110.6.11.122_0          866.73   44.10
02mr13ad.4401.7850.16.11.39_0            897.47   48.22
02mr13ad.4401.7850.16.11.33_1            899.64   50.01
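Joe's "about 6 times as long" figure can be checked directly from the table; a quick sketch using the run times above:

```python
# Run times (seconds) from the task list above.
vlar_runtime = 5689.26
midrange_runtimes = [869.53, 866.73, 897.47, 899.64]

# VLAR elapsed time relative to the average midrange AR on that dual 560 Ti host.
slowdown = vlar_runtime / (sum(midrange_runtimes) / len(midrange_runtimes))
print(round(slowdown, 1))  # 6.4
```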


                                                                  Joe
ID: 46081 · Report as offensive
jason_gee
Volunteer tester

Send message
Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 46082 - Posted: 25 May 2013, 20:51:22 UTC - in response to Message 46081.  
Last modified: 25 May 2013, 20:56:28 UTC

Hmmm, yeah 560ti (compute cap 2.1), like on the other PC here, probably sits either square on, or below, performance levels where significant problems with APR etc might occur. It's a tough call.

Target market for this card was the newly created 'midrange enthusiast' or 'performance - price' bracket, if you like. This suggests to me that pushing VLARs to these could go badly, because of the target market expectations, and the geometry being more or less maxxed out for the memory subsystem.

Chances are there could be initial backlash with only Kepler Class receiving VLARs, at the more palatable 4x elapsed. That's all prior to adjustable multithreading the long pulsefinds in x42, ala paralleled V13 experiment, after serialisation of the result reductions. I would describe Fermi-class utilisation as solid. That'd make not much room to move before hybridisation & higher level algorithmic changes, while Keplers still do have some 'legs' yet, before major developments.

Jason
ID: 46082 · Report as offensive
Josef W. Segur
Volunteer tester

Send message
Joined: 14 Oct 05
Posts: 1137
Credit: 1,848,733
RAC: 0
United States
Message 46083 - Posted: 26 May 2013, 0:17:43 UTC - in response to Message 46082.  

Hmmm, yeah 560ti (compute cap 2.1), like on the other PC here, probably sits either square on, or below, performance levels where significant problems with APR etc might occur. It's a tough call.

Target market for this card was the newly created 'midrange enthusiast' or 'performance - price' bracket, if you like. This suggests to me that pushing VLARs to these could go badly, because of the target market expectations, and the geometry being more or less maxxed out for the memory subsystem.

Chances are there could be initial backlash with only Kepler Class receiving VLARs, at the more palatable 4x elapsed. That's all prior to adjustable multithreading the long pulsefinds in x42, ala paralleled V13 experiment, after serialisation of the result reductions. I would describe Fermi-class utilisation as solid. That'd make not much room to move before hybridisation & higher level algorithmic changes, while Keplers still do have some 'legs' yet, before major developments.

Jason

Autocorrelation processing would somewhat reduce the ratio on that dual 560 Ti host, but I agree the mismatch between estimates and actual performance is a problem area. What might be best overall is an additional preference so those crunching with GPUs could opt in or out of doing VLARs. I'm very much in favor of giving users control of their own systems.
                                                                  Joe
ID: 46083 · Report as offensive
Profile Raistmer
Volunteer tester
Avatar

Send message
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 46084 - Posted: 26 May 2013, 7:54:46 UTC - in response to Message 46083.  

I agree the mismatch between estimates and actual performance is a problem area. What might be best overall is an additional preference so those crunching with GPUs could opt in or out of doing VLARs. I'm very much in favor of giving users control of their own systems.
                                                                  Joe


A "must have" option. Or we would see mass VLAR abortions on main. Even with such an option we will see some of them, just because some people are unaware that there are other ways. The CPU AstroPulse experience confirms this...
ID: 46084 · Report as offensive
Richard Haselgrove
Volunteer tester

Send message
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 46085 - Posted: 26 May 2013, 8:39:56 UTC
Last modified: 26 May 2013, 8:52:53 UTC

Got my first Kepler-Kepler validation - WU 5374993

The wingmate is interesting - host 62827 - dual Titan. He has a number of EXIT_TIME_LIMIT_EXCEEDED for VLAR, but also a validation with cuda32 in not much longer than his cuda42 tasks. He seems to be getting very few cuda50, I note.

Edit - perhaps I should add that the VLARs on the system I'm watching arrived with an estimate of 18 minutes 24 seconds - so the rsc_fpops_bound time limit would be over three hours. None of the tasks has yet exceeded an hour and a half, so I'm well inside the safety zone, and should remain so unless I have a GPU downclock.
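A rough sketch of the safety margin described in that edit, assuming the usual BOINC convention that rsc_fpops_bound is about 10x rsc_fpops_est (an assumption here; the post only says the limit is "over three hours"):

```python
estimate_s = 18 * 60 + 24      # 18 min 24 s initial estimate -> 1104 s
bound_factor = 10              # assumption: rsc_fpops_bound ~ 10x rsc_fpops_est
abort_limit_s = estimate_s * bound_factor   # 11040 s, just over 3 hours

longest_vlar_s = 90 * 60       # no task has yet exceeded ~1.5 hours
print(abort_limit_s / 3600, longest_vlar_s < abort_limit_s)
```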
ID: 46085 · Report as offensive
Profile Raistmer
Volunteer tester
Avatar

Send message
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 46086 - Posted: 26 May 2013, 9:02:02 UTC - in response to Message 46069.  

Two more tasks wasted for nothing... and because it's Brook AP it cost much more than an aborted MB task...

14027050 5343580 23 May 2013, 21:19:57 UTC 25 May 2013, 9:22:36 UTC Computation error 24,032.81 22,418.23 --- AstroPulse v6 v6.05 (cal_ati)
14027020 5311626 23 May 2013, 21:19:57 UTC 25 May 2013, 9:16:52 UTC Computation error 24,032.78 23,839.50 --- AstroPulse v6 v6.05 (cal_ati)

BOINC estimated the execution time as 40 min while each task takes 8 hours to complete. So, 2 tasks were aborted.

What would prevent BOINC from aborting the next 2? The estimate remains the same.
I will suspend Brook+ AP work on this host until something is changed to prevent such needless abortions!


And back to the BOINC forced-abortions theme. Is there any chance that after updating the project I will get increased estimates for already-downloaded tasks? Or is the only way to make it work to manually abort all already-downloaded CAL tasks and hope that newly downloaded ones will have bigger estimates?
ID: 46086 · Report as offensive
Profile Raistmer
Volunteer tester
Avatar

Send message
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 46087 - Posted: 26 May 2013, 9:08:23 UTC

And another possible issue:

SETI@home v7 7.00 windows_intelx86 (cuda23)
Number of tasks completed 76
Max tasks per day 109
Number of tasks today 0
Consecutive valid tasks 76
Average processing rate 45.331657046529
Average turnaround time 2.91 days
SETI@home v7 7.00 windows_intelx86 (cuda32)
Number of tasks completed 93
Max tasks per day 119
Number of tasks today 70
Consecutive valid tasks 94
Average processing rate 51.162516650229
Average turnaround time 1.79 days


It's the GSO9600 host with the AMD Trinity APU, so now it runs both NV and ATI tasks.
And it looks like the APR for NV tasks is skewed. So far, all pre-Fermi GPUs I tested favored cuda23, not cuda32, tasks. But with the addition of the ATI GPU (it was disabled initially), the execution times have perhaps changed. And it looks like the initial bias in APR will never be healed. The host continues to receive only cuda32 tasks with the higher APR, which leaves no chance for cuda23. In short, the random part of the version selection is too small to be usable, just as I feared.
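The lock-in effect described here is easy to demonstrate: if the scheduler mostly picks the version with the highest APR, a version that has fallen behind almost never gets new completions to correct its average. A toy sketch, not the actual scheduler code (the 1% probe rate is an assumed illustration):

```python
import random

# APRs from the host stats above.
aprs = {"cuda23": 45.331657, "cuda32": 51.162517}
PROBE_RATE = 0.01   # assumed small chance of probing a non-best version

def pick_version(rng):
    """Mostly pick the highest-APR version; rarely probe one at random."""
    if rng.random() < PROBE_RATE:
        return rng.choice(list(aprs))
    return max(aprs, key=aprs.get)

rng = random.Random(1)
picks = [pick_version(rng) for _ in range(1000)]
# cuda23 is chosen only a handful of times, so its APR can barely update.
print(picks.count("cuda23"))
```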
ID: 46087 · Report as offensive
Claggy
Volunteer tester

Send message
Joined: 29 May 06
Posts: 1037
Credit: 8,440,339
RAC: 0
United Kingdom
Message 46089 - Posted: 26 May 2013, 9:31:35 UTC - in response to Message 46087.  
Last modified: 26 May 2013, 9:58:38 UTC

And another possible issue:

SETI@home v7 7.00 windows_intelx86 (cuda23)
Number of tasks completed 76
Max tasks per day 109
Number of tasks today 0
Consecutive valid tasks 76
Average processing rate 45.331657046529
Average turnaround time 2.91 days
SETI@home v7 7.00 windows_intelx86 (cuda32)
Number of tasks completed 93
Max tasks per day 119
Number of tasks today 70
Consecutive valid tasks 94
Average processing rate 51.162516650229
Average turnaround time 1.79 days


It's the GSO9600 host with the AMD Trinity APU, so now it runs both NV and ATI tasks.
And it looks like the APR for NV tasks is skewed. So far, all pre-Fermi GPUs I tested favored cuda23, not cuda32, tasks. But with the addition of the ATI GPU (it was disabled initially), the execution times have perhaps changed. And it looks like the initial bias in APR will never be healed. The host continues to receive only cuda32 tasks with the higher APR, which leaves no chance for cuda23. In short, the random part of the version selection is too small to be usable, just as I feared.

That is, unless that host does a bunch of VHAR tasks with the Cuda32 app; that'll have the effect of driving the APR down, and hopefully bring the Cuda23 app into play.

On my GTX460 I know the Cuda42 app is fastest, with the Cuda5 and Cuda32 apps being only slightly slower (from my bench testing).
As of 15 minutes ago all my Nvidia WUs were Cuda42 WUs (the Cuda42 APR must have been top three days ago when I received that work);
now when I unset NNT I get a mixture of normal and VHAR Cuda5 WUs. I foresee that which app version is preferred will switch every time we have a shortie storm at the Main project (as long as enough WUs are done).

SETI@home v7 7.00 windows_intelx86 (cuda32)

Number of tasks completed  323

Max tasks per day          187

Number of tasks today      0

Consecutive valid tasks    154

Average processing rate    161.44290806292

Average turnaround time    2.48 days


SETI@home v7 7.00 windows_intelx86 (cuda42)

Number of tasks completed  494

Max tasks per day          535

Number of tasks today      0

Consecutive valid tasks    502

Average processing rate    184.4685013102

Average turnaround time    2.86 days


SETI@home v7 7.00 windows_intelx86 (cuda50)

Number of tasks completed  428

Max tasks per day          375

Number of tasks today      37

Consecutive valid tasks    342

Average processing rate    190.79465847299

Average turnaround time    2.19 days


Claggy
ID: 46089 · Report as offensive
Richard Haselgrove
Volunteer tester

Send message
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 46091 - Posted: 26 May 2013, 21:29:46 UTC

Cache is beginning to run a little low - the 'in progress' number on the task list includes some non-VLAR I'm holding suspended - so I allowed some work fetch.

With the cuda50 APR now down to 133 under the influence of the VLARs, cuda42 becomes the popular choice...
ID: 46091 · Report as offensive
Richard Haselgrove
Volunteer tester

Send message
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 46097 - Posted: 27 May 2013, 17:55:19 UTC - in response to Message 46091.  

OK, VLAR test completed. cuda50 APR driven down to 121 - I think it went transiently even lower than that. Done a couple of cuda42 VLARs - seemed a trifle quicker that the cuda50 norm, but too small a sample for that to be significant. I'll finish off the holdbacks, then return to normal running.
ID: 46097 · Report as offensive
Profile Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 46098 - Posted: 28 May 2013, 4:27:19 UTC - in response to Message 46091.  
Last modified: 28 May 2013, 4:30:37 UTC

I expect that once cuda42 picks up some VLARs its APR will drop accordingly and cuda50 will again be your main choice. Because of the better workunit mix on the main project, I expect this to happen more smoothly there than it does in beta, where there are big runs of only VLAR.

As far as cal_ati goes, I'll put back the flag on the main project so it won't get sent to opencl capable cards (except it won't get sent in cases of cards reporting opencl capability but not having it. We'll see how common that is in the main project.)

I'm also planning to add a "reset app version statistics for this host" button to the app versions detail page, for people who believe that they are getting the wrong app versions consistently.
ID: 46098 · Report as offensive
Richard Haselgrove
Volunteer tester

Send message
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 46102 - Posted: 28 May 2013, 9:24:00 UTC - in response to Message 46098.  

I expect that once cuda42 picks up some VLARs its APR will drop accordingly and cuda50 will again be your main choice. Because of the better workunit mix on the main project, I expect this to happen more smoothly there than it does in beta, where there are big runs of only VLAR.

cuda50 APR is still dropping as wingmates trickle in - now 111. Getting closer to the 96 of Cuda32....

I agree that a similar consecutive run of cuda42 VLARs could theoretically drive that APR even lower, but that's just a race to the bottom. I'll be more interested to watch how quickly the random probes bring cuda50 back up to the top: if they probe too often, people will complain that slower, inefficient (for their rig) apps will be too prominent in the mix: if they probe too rarely, a distorted APR will remain trapped under an inversion for too long.

It sounds like this might provide an answer for the question being asked at Main: will we still need to use anonymous platform once the stock applications are the same as third-party apps? Anonymous platform will be one way (the only way?) to 'lock-in' a host to use one version consistently.
ID: 46102 · Report as offensive
William
Volunteer tester
Avatar

Send message
Joined: 14 Feb 13
Posts: 606
Credit: 588,843
RAC: 0
Message 46103 - Posted: 28 May 2013, 11:31:12 UTC - in response to Message 46102.  

It sounds like this might provide an answer for the question being asked at Main: will we still need to use anonymous platform once the stock applications are the same as third-party apps? Anonymous platform will be one way (the only way?) to 'lock-in' a host to use one version consistently.

Another scenario where you need Anonymous Platform is when you want to restrict a type of application to a type of device (e.g. AP on CPU only)

A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 46103 · Report as offensive
Profile Mike
Volunteer tester
Avatar

Send message
Joined: 16 Jun 05
Posts: 2531
Credit: 1,074,556
RAC: 0
Germany
Message 46104 - Posted: 28 May 2013, 12:00:23 UTC - in response to Message 46103.  

It sounds like this might provide an answer for the question being asked at Main: will we still need to use anonymous platform once the stock applications are the same as third-party apps? Anonymous platform will be one way (the only way?) to 'lock-in' a host to use one version consistently.

Another scenario where you need Anonymous Platform is when you want to restrict a type of application to a type of device (e.g. AP on CPU only)


I totally agree on that.

With each crime and every kindness we birth our future.
ID: 46104 · Report as offensive



 
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.