Message boards : News : Tests of new scheduler features.
TRuEQ & TuVaLu
Volunteer tester
Joined: 28 Jan 11
Posts: 619
Credit: 2,580,051
RAC: 0
Sweden
Message 45709 - Posted: 4 May 2013, 18:57:20 UTC - in response to Message 45708.  

Meanwhile, every second workunit comes back with low credit.
For example: http://setiweb.ssl.berkeley.edu/beta/workunit.php?wuid=5209949
Granted credit of 0.23, validating ATI OpenCL against ATI OpenCL.
Gimme the whole credit, please! ;-)

We don't need no steenking credit.


Yes we do.
ID: 45709 · Report as offensive
Josef W. Segur
Volunteer tester

Joined: 14 Oct 05
Posts: 1137
Credit: 1,848,733
RAC: 0
United States
Message 45710 - Posted: 4 May 2013, 21:07:30 UTC - in response to Message 45709.  

We don't need no steenking credit.

Yes we do.

The projects do need credits to keep some participants interested, and they are a convenient rough yardstick of performance for all. However, I'm glad there aren't more detailed statistics like sports fans keep for their favorite teams or players.

Back to topic -

Assuming the scheduler is using the hav->pfc basis for guessing host speed, for a modest CPU system it works out very close to the old method. With 16 results in the averages, the latest <flops> sent to my host 10490 is 3.881207e09 but would have been 3.880348e09 based on the seconds per FLOP elapsed time average.
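
Just to make that comparison concrete, here is a minimal sketch of the elapsed-time calculation (my own illustration, with made-up struct and field names, not the actual server code): average the seconds-per-estimated-FLOP over the recent valid results and invert it, giving a FLOPS figure comparable to the <flops> value the scheduler sends. The pfc-based estimate works analogously, just from the host_app_version averages.

#include <cstdio>
#include <vector>

// One validated result (illustrative fields only, not BOINC's real structs).
struct ResultSample {
    double elapsed_seconds;   // wall-clock run time of the task
    double rsc_fpops_est;     // the workunit's estimated FLOP count
};

// "Seconds per FLOP" elapsed-time average, inverted to a FLOPS estimate.
double flops_from_elapsed_time(const std::vector<ResultSample>& samples) {
    if (samples.empty()) return 0;
    double sec_per_flop_sum = 0;
    for (const auto& s : samples) {
        sec_per_flop_sum += s.elapsed_seconds / s.rsc_fpops_est;
    }
    return samples.size() / sec_per_flop_sum;   // 1 / average seconds-per-FLOP
}

int main() {
    // Hypothetical: 16 results, each ~4e13 estimated FLOPs in ~10306 seconds,
    // which works out to roughly 3.88e9 FLOPS as in the figures above.
    std::vector<ResultSample> samples(16, {10306.0, 4e13});
    std::printf("projected flops: %e\n", flops_from_elapsed_time(samples));
    return 0;
}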
                                                                   Joe
ID: 45710 · Report as offensive
Profile Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 45771 - Posted: 10 May 2013, 11:57:24 UTC

I've enabled beta work distribution for my NV host again (with the CUDA 22, 23 and 32 apps) to test whether BOINC can figure out which is faster now.

So far all types still appear among the downloaded tasks. 2 of 3 have more than 10 eligible validations.

What estimate can be used for when BOINC should react? When should I expect its reaction? (IMO it should almost stop sending cuda22 tasks to that host and send mostly cuda23 ones.)
This host works in unattended mode, so it reports and asks for new work constantly, not in large chunks as my ATi host does.
ID: 45771 · Report as offensive
Profile Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 45775 - Posted: 10 May 2013, 18:06:20 UTC - in response to Message 45771.  

It's a little more complicated than that. When the version selection is done, an app version's processing rate is multiplied by (1+f*r/n), where f is a project-specific factor (0.316 for beta), r is a normally distributed random number, and n is the number of non-outlier results done. So if two versions are equally fast you'll typically get half and half. If a version is twice as fast and both have 10 results, it's unlikely you'll get any from the slow version. As time goes on you should get fewer and fewer from the slow versions. If the speed difference is only 1% it'll take longer to see a difference than if the difference is 10%.

All this is modified by the variable quotas. If a version is really fast but it errors out on half the results, it will only get one result per day, and you'll get additional ones from slower but better versions. That's good. But if you have normal quotas and you fill the quota of the fast version, your client will get some of the slow version. That's not so good, but since the quotas increase as you return valid results it should be a temporary situation.
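
For anyone following along in code, here is a rough sketch of that selection rule (an illustration only, with invented struct and field names rather than the real scheduler source): each candidate version's processing rate is scaled by 1 + f*r/n with f = 0.316, r drawn from a standard normal distribution and n the number of non-outlier results, then the highest scaled rate wins, skipping versions that are over their daily quota.

#include <algorithm>
#include <random>
#include <vector>

struct AppVersion {
    const char* plan_class;    // e.g. "cuda22"
    double processing_rate;    // FLOPS estimate for this host and version
    int n_results;             // non-outlier results completed so far
    int jobs_today;            // jobs already sent today
    int max_jobs_per_day;      // scaled daily quota
};

// Pick the "best" version: highest randomized processing rate, quota permitting.
const AppVersion* choose_version(const std::vector<AppVersion>& versions,
                                 std::mt19937& rng) {
    const double f = 0.316;                             // project-specific factor
    std::normal_distribution<double> normal(0.0, 1.0);  // r ~ N(0, 1)
    const AppVersion* best = nullptr;
    double best_rate = 0;
    for (const auto& av : versions) {
        if (av.jobs_today >= av.max_jobs_per_day) continue;   // quota exceeded
        double r = normal(rng);
        int n = std::max(av.n_results, 1);
        double scaled = av.processing_rate * (1.0 + f * r / n);
        if (!best || scaled > best_rate) {
            best = &av;
            best_rate = scaled;
        }
    }
    return best;   // nullptr if every version is over quota
}

Note how a small n makes the random term swing widely, so a version with few completed results can occasionally beat a genuinely faster one; that is the effect discussed further down the thread.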



ID: 45775 · Report as offensive
Profile Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 45776 - Posted: 10 May 2013, 18:19:07 UTC - in response to Message 45775.  

The quota part is not so good indeed, but gradually diminishing the allocation for the slower app - that's what is really needed. For example, on my host both apps have processed a few dozen tasks already... but all of the same AR. It's quite possible that at a different AR their relative speed changes (and this is what we actually see in offline tests between HD5 and non-HD5 builds on some GPUs). So it's good to keep receiving new tasks for the slower app if the speed difference is small.

For cuda22 versus the other cuda apps the speed difference is huge, almost 2x. So I expect not to get cuda22 soon.
We'll see what happens in reality.
ID: 45776 · Report as offensive
Profile Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 45784 - Posted: 11 May 2013, 10:45:15 UTC
Last modified: 11 May 2013, 10:55:37 UTC

Not a good result so far...

Current host state:

SETI@home v7 7.00 windows_intelx86, per plan class:

                                     cuda22     cuda23     cuda32
Number of tasks completed            31         30         99
Max tasks per day                    65         63         132
Number of tasks today                60         40         100
Consecutive valid tasks              32         30         99
Average processing rate (GFLOPS)     58.02      133.19     123.60
Average turnaround time (days)       0.75       0.87       0.88


As one can see, cuda22 is more than twice as slow. But today the host got a whole pack of cuda22 tasks. Isn't it time for BOINC to figure out how bad cuda22 is for this particular host and stop allocating cuda22 work to it?
We are well past 10 results for all types, and yesterday almost no cuda22 was allocated, so this is a very recent allocation...
Eric, could you please check this host's logs (http://setiweb.ssl.berkeley.edu/beta/host_app_versions.php?hostid=18439) and decide whether it's OK to still be getting many cuda22 allocations, or whether there is something wrong with the BOINC server?

EDIT: I'm afraid that with BOINC behaving like this we can't release cuda22 and Brook+ into free competition with the other builds. It would be too big a slowdown for the project, with no gain...

Can it be that "Average turnaround time" is used anywhere in the algorithm? It should not be, because it does not depend directly on app performance!
(cuda22 has the smallest value there, so...)

EDIT2: Quite funny - the fastest app still has the smallest number of completed results... Well done, BOINC "optimization" :D
ID: 45784 · Report as offensive
Profile Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 45785 - Posted: 11 May 2013, 11:05:39 UTC - in response to Message 45775.  
Last modified: 11 May 2013, 11:07:21 UTC

an app version's processing rate is multiplied by (1+f*r/n), where f is a project-specific factor (0.316 for beta), r is a normally distributed random number, and n is the number of non-outlier results done.

Around what value is r distributed? <r> = ?, D(r) = ?
ID: 45785 · Report as offensive
Profile Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 45788 - Posted: 11 May 2013, 17:24:26 UTC - in response to Message 45785.  

Normal distribution around r=0 with standard deviation of 1. So it also has the possibility of making an app seem slower.
ID: 45788 · Report as offensive
Profile Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 45789 - Posted: 11 May 2013, 17:46:14 UTC

Yep, there's still a problem. If it had been the random factor that did it, there would have been a message in the log. There isn't. Time to add more debugging output.

The app_versions below, 364 through 368, are cuda22, cuda23, cuda32, cuda42, and cuda50.

2013-05-11 00:53:43.5332 [PID=2515 ]    [send] [HOST#18439] app version 364 is reliable
2013-05-11 00:53:43.5333 [PID=2515 ]    [send] [HOST#18439] app version 365 is reliable
2013-05-11 00:53:43.5333 [PID=2515 ]    [send] [HOST#18439] app version 366 is reliable
2013-05-11 00:53:43.5334 [PID=2515 ]    [quota] effective ncpus 2 ngpus 1
2013-05-11 00:53:43.5334 [PID=2515 ]    [quota] max jobs per RPC: 20
2013-05-11 00:53:43.5335 [PID=2515 ]    [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2013-05-11 00:53:43.5335 [PID=2515 ]    [send] NVIDIA GPU: req 432013.21 sec, 0.00 instances; est delay 0.00
2013-05-11 00:53:43.5781 [PID=2515 ]    [version] looking for version of setiathome_v7
2013-05-11 00:53:43.5782 [PID=2515 ]    [version] [AV#370] Skipping CPU version - user prefs say no CPU
2013-05-11 00:53:43.5782 [PID=2515 ]    [version] Checking plan class 'cuda22'
2013-05-11 00:53:43.5788 [PID=2515 ]    [version] reading plan classes from file '../plan_class_spec.xml'
2013-05-11 00:53:43.5788 [PID=2515 ]    [version] plan_class_spec: host_flops: 2.021320e+09,    scale: 1.00,    projected_flops: 3.808069e+10,  peak_flops: 4.035504e+10
2013-05-11 00:53:43.5788 [PID=2515 ]    [quota] [AV#364] scaled max jobs per day: 61
2013-05-11 00:53:43.5789 [PID=2515 ]    [version] Checking plan class 'cuda23'
2013-05-11 00:53:43.5789 [PID=2515 ]    [version] plan_class_spec: host_flops: 2.021320e+09,    scale: 1.00,    projected_flops: 3.808069e+10,  peak_flops: 4.035504e+10
2013-05-11 00:53:43.5789 [PID=2515 ]    [quota] [AV#365] scaled max jobs per day: 51
2013-05-11 00:53:43.5789 [PID=2515 ]    [version] Checking plan class 'cuda32'
2013-05-11 00:53:43.5789 [PID=2515 ]    [version] plan_class_spec: host_flops: 2.021320e+09,    scale: 1.00,    projected_flops: 3.808069e+10,  peak_flops: 4.035504e+10
2013-05-11 00:53:43.5789 [PID=2515 ]    [quota] [AV#366] scaled max jobs per day: 95
2013-05-11 00:53:43.5789 [PID=2515 ]    [quota] [AV#366] daily quota exceeded: 100 >= 95
2013-05-11 00:53:43.5789 [PID=2515 ]    [version] [AV#366] daily quota exceeded
2013-05-11 00:53:43.5789 [PID=2515 ]    [version] Checking plan class 'cuda42'
2013-05-11 00:53:43.5789 [PID=2515 ]    [version] plan_class_spec: CUDA version required min: 4020, supplied: 3020
2013-05-11 00:53:43.5789 [PID=2515 ]    [version] [AV#367] app_plan() returned false
2013-05-11 00:53:43.5789 [PID=2515 ]    [version] Checking plan class 'cuda50'
2013-05-11 00:53:43.5789 [PID=2515 ]    [version] plan_class_spec: CUDA version required min: 5000, supplied: 3020
2013-05-11 00:53:43.5790 [PID=2515 ]    [version] [AV#368] app_plan() returned false
2013-05-11 00:53:43.5790 [PID=2515 ]    [version] [AV#364] (cuda22) setting projected flops based on host_app_version pfc: 58.01G
2013-05-11 00:53:43.5790 [PID=2515 ]    [version] [AV#364] (cuda22) comparison pfc: 58.01G  et: 58.01G
2013-05-11 00:53:43.5790 [PID=2515 ]    [version] Best version of app setiathome_v7 is [AV#364] (58.01 GFLOPS)



ID: 45789 · Report as offensive
Profile Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 45790 - Posted: 11 May 2013, 17:56:12 UTC - in response to Message 45789.  
Last modified: 11 May 2013, 17:58:06 UTC

366 was ruled out because of quota. OK, that's understandable.
But why was 364 preferred over 365? That looks strange.

EDIT: cuda22 (364) has a bigger quota available - can that influence BOINC's choice?
ID: 45790 · Report as offensive
Profile Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 45791 - Posted: 11 May 2013, 17:58:36 UTC - in response to Message 45790.  

I'll check on that.
ID: 45791 · Report as offensive
Profile Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 45794 - Posted: 12 May 2013, 7:12:22 UTC - in response to Message 45791.  
Last modified: 12 May 2013, 8:01:23 UTC

I'll check on that.

Any progress on that? Please keep us informed :)


EDIT: and regarding the distribution of the r parameter - isn't an SD of 1 too small for this purpose? It already makes, say, r == 3 quite improbable, and since n >= 10 after 10 eligible validations, 0.316 * 3/10 gives only about a 10% change in APR from the random factor. And that's an upper bound; the usual shift will be even smaller...
ID: 45794 · Report as offensive
Urs Echternacht
Volunteer tester
Joined: 18 Jan 06
Posts: 1038
Credit: 18,734,730
RAC: 0
Germany
Message 45804 - Posted: 12 May 2013, 16:09:03 UTC

I have a host where the estimates settled some time ago, but both app versions, opencl_ati_cat132 and opencl_ati5_cat132, are still being sent to it. The opencl_ati_cat132 version runs the currently distributed workunits about 5 minutes faster than opencl_ati5_cat132.

Should it really take thousands of WUs to settle?
_\|/_
U r s
ID: 45804 · Report as offensive
Josef W. Segur
Volunteer tester

Joined: 14 Oct 05
Posts: 1137
Credit: 1,848,733
RAC: 0
United States
Message 45805 - Posted: 12 May 2013, 19:52:24 UTC - in response to Message 45804.  

Have a host where the estimates have settled since, but both app versions, opencl_ati_cat132 and opencl_ati5_cat132, are still sent to the host. The opencl_ati_cat132 runs the currently distributed workunits ca. 5 minutes faster than the opencl_ati5_cat132.

Should it really take thousands of wus to settle ?

No, of course it shouldn't, and I'm sure Eric can pin down the reason it does.

Actually, I think it never settles with the current code. Note in the debug_version_select log messages for Raistmer's host that CUDA22, CUDA23, and CUDA32 get the same projected_flops of 3.808069e+10 from the plan_class_spec logic. The complete loop choosing the "best" version is based on that projection as modified by the random factor. The "setting projected flops based on host_app_version pfc:" doesn't happen until after the choice has been made. Under those circumstances, each of the CUDAxx plans has an equal chance to be chosen as "best" for a specific work request.
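
Paraphrased as code (a hypothetical sketch with invented names, not the actual scheduler source), the flow described above would look like this: the comparison uses the plan-class projected_flops, which is identical for all the CUDA versions, and the per-host APR is only substituted in after the winner has been chosen, so it can never influence the choice.

#include <algorithm>
#include <random>
#include <vector>

struct Candidate {
    const char* plan_class;
    double plan_class_flops;   // projected_flops from plan_class_spec logic
                               // (the same 3.808069e+10 for cuda22/23/32 here)
    double host_apr_flops;     // per-host average processing rate (hav->pfc)
    int n_results;             // non-outlier results for this host/version
};

// The flow described above: the host-specific APR is ignored during the
// comparison and only copied in after the "best" version has been picked.
Candidate* choose_best(std::vector<Candidate>& cands, std::mt19937& rng) {
    const double f = 0.316;
    std::normal_distribution<double> normal(0.0, 1.0);
    Candidate* best = nullptr;
    double best_score = 0;
    for (auto& c : cands) {
        // Identical base speeds, so only the random term differentiates them.
        double score = c.plan_class_flops *
                       (1.0 + f * normal(rng) / std::max(c.n_results, 1));
        if (!best || score > best_score) {
            best = &c;
            best_score = score;
        }
    }
    if (best) {
        // "setting projected flops based on host_app_version pfc" - but this
        // happens too late to affect which version was selected.
        best->plan_class_flops = best->host_apr_flops;
    }
    return best;
}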
                                                                 Joe
ID: 45805 · Report as offensive
Profile Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 45806 - Posted: 12 May 2013, 20:34:34 UTC - in response to Message 45805.  

So the different APRs are simply ignored until an app has already been chosen??? Clear bug, then.
ID: 45806 · Report as offensive
Profile Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 45810 - Posted: 13 May 2013, 13:41:59 UTC
Last modified: 13 May 2013, 13:42:54 UTC

I've stopped asking for CUDA work for now, because some server-side changes are definitely required. I'll allow work fetch again when there is something new to test.
[The CUDA app itself is already well proven; it has been on beta for months...]
ID: 45810 · Report as offensive
Profile Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 45816 - Posted: 13 May 2013, 16:41:17 UTC

There has to be some path around the correct logic, a short circuit in the application choice, but I haven't found it yet. I'm hacking at it today.
ID: 45816 · Report as offensive
Profile Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 45822 - Posted: 13 May 2013, 19:26:24 UTC - in response to Message 45816.  

I put up a new scheduler that generates about 10x the debugging output. Could you start taking new work so I can see a failure? Or point out a host that got the wrong work after the time this message was posted?


ID: 45822 · Report as offensive
Richard Haselgrove
Volunteer tester

Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 45823 - Posted: 13 May 2013, 19:47:58 UTC - in response to Message 45822.  

I put up a new scheduler that generates about 10x the debugging output. Could you start taking new work so I can see a failure? Or point out a host that got the wrong work after the time this message was posted?

Fired up 63280. First fetch on restart was cuda42; cuda50 would have been a (marginally) better choice. Subsequent fetches were cuda32 (a bad, but viable, choice), then cuda50, then back to cuda32 again.
ID: 45823 · Report as offensive
Profile Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 45824 - Posted: 13 May 2013, 19:55:41 UTC - in response to Message 45822.  

I found the problem. There are apparently two different methods for computing speed... One is based on the predicted speed of the GPU, and that is what is used to determine which version is faster. When the random factor is added to that, you're most likely to get the version that has computed the fewest results so far.

This is contrary to the behaviour David has said the scheduler should have, so I will fix it.
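
A quick Monte Carlo sketch of that effect (illustrative numbers only): if the comparison treats every version as having the same base speed, the 1 + 0.316*r/n multiplier swings widest for the version with the fewest results, so that version gets picked most often.

#include <array>
#include <cstdio>
#include <random>

int main() {
    // Three versions with equal assumed base speed but different result counts.
    const std::array<int, 3> n_results = {10, 30, 100};
    std::array<int, 3> wins = {0, 0, 0};
    const double f = 0.316;
    const int trials = 100000;

    std::mt19937 rng(12345);
    std::normal_distribution<double> normal(0.0, 1.0);

    for (int t = 0; t < trials; ++t) {
        int best = 0;
        double best_score = -1.0;
        for (int i = 0; i < 3; ++i) {
            double score = 1.0 + f * normal(rng) / n_results[i];
            if (score > best_score) {
                best_score = score;
                best = i;
            }
        }
        ++wins[best];
    }
    for (int i = 0; i < 3; ++i) {
        std::printf("n = %3d: selected %.1f%% of the time\n",
                    n_results[i], 100.0 * wins[i] / trials);
    }
    // The n = 10 version wins noticeably more often than the other two,
    // even though it has no real speed advantage.
    return 0;
}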



ID: 45824 · Report as offensive