Message boards :
News :
Tests of new scheduler features.
Joined: 28 Jan 11 · Posts: 619 · Credit: 2,580,051 · RAC: 0
Meanwhile it's every second workunit with low credit. Yes we do.
Joined: 14 Oct 05 · Posts: 1137 · Credit: 1,848,733 · RAC: 0
We don't need no steenking credit. The projects do need credits to keep some participants interested, and they are a convenient rough yardstick of performance for all. However, I'm glad there aren't more detailed statistics like sports fans keep for their favorite teams or players. Back on topic - assuming the scheduler is using the hav->pfc basis for guessing host speed, for a modest CPU system it works out very close to the old method. With 16 results in the averages, the latest <flops> sent to my host 10490 is 3.881207e09, but it would have been 3.880348e09 based on the seconds-per-FLOP elapsed time average. Joe
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
I've enabled beta work distribution for my NV host again (with CUDA 22, 23 and 32 apps) to test whether BOINC can figure out which is faster now. So far all types still show up in downloaded tasks. 2 of 3 have more than 10 eligible validations. What's a reasonable estimate of when BOINC should react? (IMO it should almost stop sending cuda22 tasks to that host and send mostly cuda23 ones.) This host works in unattended mode, so it reports and asks for new work constantly, not in large chunks as my ATi host does.
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
It's a little more complicated than that. When the version selection is done, an app version's processing rate is multiplied by (1+f*r/n), where f is a project-specific factor (0.316 for beta), r is a normally distributed random number, and n is the number of non-outlier results done. So if two versions are equally fast you'll typically get half and half. If one version is twice as fast and both have 10 results, it's unlikely you'll get any from the slow version. As time goes on you should get fewer and fewer from the slow versions. If the speed difference is only 1% it'll take longer to see a difference than if the difference is 10%. All this is modified by the variable quotas. If a version is really fast, but it errors out on half the results, it will only get one result per day and you'll get additional ones from slower but better versions. That's good. But if you have normal quotas and you fill the quota of the fast version, your client will get some of the slow version. That's not so good, but since the quotas increase as you return valid results it should be a temporary situation.
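The selection rule Eric describes can be sketched in a few lines of Python. This is a minimal simulation, not BOINC code; the function and variable names are mine. It shows the two behaviours he predicts: equal speeds split roughly half and half, while a 2x speed gap makes the slow version essentially never win.

```python
import random

def pick_version(versions, f=0.316):
    """Choose the 'best' app version: each version's processing rate is
    perturbed by (1 + f*r/n), with r ~ N(0, 1) and n the number of
    non-outlier results done, then the highest perturbed rate wins."""
    best, best_score = None, -1.0
    for name, flops, n in versions:
        score = flops * (1.0 + f * random.gauss(0.0, 1.0) / n)
        if score > best_score:
            best, best_score = name, score
    return best

random.seed(1)
trials = 10000
# Two equally fast versions, 10 results each: expect roughly half and half.
equal = sum(pick_version([("a", 1e10, 10), ("b", 1e10, 10)]) == "a"
            for _ in range(trials))
# One version twice as fast: the slow one should almost never be chosen.
fast = sum(pick_version([("slow", 1e10, 10), ("fast", 2e10, 10)]) == "fast"
           for _ in range(trials))
print(equal / trials)  # close to 0.5
print(fast / trials)   # close to 1.0
```

With n = 10 the perturbation's standard deviation is only f/n ≈ 3.2% of the rate, so a 2x gap is far outside the random factor's reach, while a 1% gap takes many more draws to separate.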
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
The quota part is not so good indeed, but diminishing the allocation for the slower app gradually - that's what's really needed. For example, on my host both apps have processed a few dozen tasks already... but all of the same AR. It's quite possible that at a different AR their relative speed changes (and this is what we actually see in offline tests for HD5 and non-HD5 builds on some GPUs). So it's good to continue to receive new tasks for the slower app if the speed difference is small. For cuda22 versus the other cuda builds the speed difference is huge, almost 2x. So I expect not to get cuda22 soon. We'll see what happens in reality.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
Not too good a result so far... current host state: SETI@home v7 7.00 windows_intelx86 (cuda22). As one can see, cuda22 is more than twice as slow. But today the host got a whole pack of cuda22 tasks. Isn't it time for BOINC to figure out how bad cuda22 is for this particular host and to stop allocating cuda22 work to it? It's well past 10 results for all types; yesterday almost no cuda22 was allocated, so this is a very recent allocation... Eric, could you check this host's logs (http://setiweb.ssl.berkeley.edu/beta/host_app_versions.php?hostid=18439), please, and decide whether it's OK that it still gets many cuda22 allocations, or whether something is wrong with the BOINC server? EDIT: I'm afraid that with BOINC behaving like this we can't release cuda22 and Brook+ in free competition with the other builds. It would be too big a slowdown for the project, with no gain... Could it be that "Average turnaround time" is used anywhere in the algorithm? It shouldn't be, because it doesn't depend directly on app performance! (cuda22 has the smallest value there, so...) EDIT2: quite funny - the fastest app still has the smallest number of completed results... Well done, BOINC "optimization" :D
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
an app version's processing rate is multiplied by (1+f*r/n), where f is a project-specific factor (0.316 for beta), r is a normally distributed random number, and n is the number of non-outlier results done. Around what value is r distributed? <r> = ?, Dr = ?
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
It's a normal distribution around r=0 with a standard deviation of 1. So it also has the possibility of making an app seem slower.
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
Yep, there's still a problem. If it had been the random factor that did it, there would have been a message in the log. There isn't. Time to add more debugging output. App versions 364-368 below are cuda22, cuda23, cuda32, cuda42, and cuda50.

2013-05-11 00:53:43.5332 [PID=2515 ] [send] [HOST#18439] app version 364 is reliable
2013-05-11 00:53:43.5333 [PID=2515 ] [send] [HOST#18439] app version 365 is reliable
2013-05-11 00:53:43.5333 [PID=2515 ] [send] [HOST#18439] app version 366 is reliable
2013-05-11 00:53:43.5334 [PID=2515 ] [quota] effective ncpus 2 ngpus 1
2013-05-11 00:53:43.5334 [PID=2515 ] [quota] max jobs per RPC: 20
2013-05-11 00:53:43.5335 [PID=2515 ] [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2013-05-11 00:53:43.5335 [PID=2515 ] [send] NVIDIA GPU: req 432013.21 sec, 0.00 instances; est delay 0.00
2013-05-11 00:53:43.5781 [PID=2515 ] [version] looking for version of setiathome_v7
2013-05-11 00:53:43.5782 [PID=2515 ] [version] [AV#370] Skipping CPU version - user prefs say no CPU
2013-05-11 00:53:43.5782 [PID=2515 ] [version] Checking plan class 'cuda22'
2013-05-11 00:53:43.5788 [PID=2515 ] [version] reading plan classes from file '../plan_class_spec.xml'
2013-05-11 00:53:43.5788 [PID=2515 ] [version] plan_class_spec: host_flops: 2.021320e+09, scale: 1.00, projected_flops: 3.808069e+10, peak_flops: 4.035504e+10
2013-05-11 00:53:43.5788 [PID=2515 ] [quota] [AV#364] scaled max jobs per day: 61
2013-05-11 00:53:43.5789 [PID=2515 ] [version] Checking plan class 'cuda23'
2013-05-11 00:53:43.5789 [PID=2515 ] [version] plan_class_spec: host_flops: 2.021320e+09, scale: 1.00, projected_flops: 3.808069e+10, peak_flops: 4.035504e+10
2013-05-11 00:53:43.5789 [PID=2515 ] [quota] [AV#365] scaled max jobs per day: 51
2013-05-11 00:53:43.5789 [PID=2515 ] [version] Checking plan class 'cuda32'
2013-05-11 00:53:43.5789 [PID=2515 ] [version] plan_class_spec: host_flops: 2.021320e+09, scale: 1.00, projected_flops: 3.808069e+10, peak_flops: 4.035504e+10
2013-05-11 00:53:43.5789 [PID=2515 ] [quota] [AV#366] scaled max jobs per day: 95
2013-05-11 00:53:43.5789 [PID=2515 ] [quota] [AV#366] daily quota exceeded: 100 >= 95
2013-05-11 00:53:43.5789 [PID=2515 ] [version] [AV#366] daily quota exceeded
2013-05-11 00:53:43.5789 [PID=2515 ] [version] Checking plan class 'cuda42'
2013-05-11 00:53:43.5789 [PID=2515 ] [version] plan_class_spec: CUDA version required min: 4020, supplied: 3020
2013-05-11 00:53:43.5789 [PID=2515 ] [version] [AV#367] app_plan() returned false
2013-05-11 00:53:43.5789 [PID=2515 ] [version] Checking plan class 'cuda50'
2013-05-11 00:53:43.5789 [PID=2515 ] [version] plan_class_spec: CUDA version required min: 5000, supplied: 3020
2013-05-11 00:53:43.5790 [PID=2515 ] [version] [AV#368] app_plan() returned false
2013-05-11 00:53:43.5790 [PID=2515 ] [version] [AV#364] (cuda22) setting projected flops based on host_app_version pfc: 58.01G
2013-05-11 00:53:43.5790 [PID=2515 ] [version] [AV#364] (cuda22) comparison pfc: 58.01G et: 58.01G
2013-05-11 00:53:43.5790 [PID=2515 ] [version] Best version of app setiathome_v7 is [AV#364] (58.01 GFLOPS)
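The [quota] lines in the log above amount to a simple threshold check. This is a sketch of that check only (`quota_allows` is my name for it, not BOINC's): a version is ruled out once today's job count reaches its scaled daily maximum.

```python
def quota_allows(jobs_sent_today, scaled_max_jobs_per_day):
    """Mirror of the [quota] check in the log: a version is skipped once
    today's count reaches its scaled daily maximum."""
    return jobs_sent_today < scaled_max_jobs_per_day

# AV#366 (cuda32) in the log: "daily quota exceeded: 100 >= 95"
print(quota_allows(100, 95))  # False -> cuda32 ruled out
print(quota_allows(50, 61))   # True  -> AV#364 (cuda22) stays eligible
```

The "50" above is illustrative; the log only shows cuda22's limit of 61, not its current count.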
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
366 was ruled out because of quota. OK, that's understandable. But why 364 was preferred over 365 looks strange. EDIT: cuda22 (364) has a bigger quota available - could this influence BOINC's choice?
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
I'll check on that.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
I'll check on that. Any progress on that? Please keep us informed :) EDIT: and regarding the r parameter's distribution - isn't an SD of 1 too small for this purpose? It already makes, say, r==3 quite improbable. And since n>=10 after 10 eligible validations, 3/10*0.316 gives only ~10% change in APR due to the random factor. And that's an upper bound; the usual shift will be even smaller...
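Raistmer's back-of-the-envelope bound checks out. Plugging the beta factor f = 0.316 and n = 10 into (1+f*r/n) with an already-improbable 3-sigma draw r = 3:

```python
f = 0.316   # project-specific factor for beta (from Eric's post)
n = 10      # non-outlier results after 10 eligible validations
r = 3.0     # a ~3-sigma draw from N(0, 1), already quite improbable
shift = f * r / n         # fractional change in the projected rate
print(round(shift, 4))    # 0.0948, i.e. just under 10%
```

So even a 3-sigma draw moves the projected rate by under 10%, and typical draws (|r| around 1) move it by only ~3%.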
Joined: 18 Jan 06 · Posts: 1038 · Credit: 18,734,730 · RAC: 0
I have a host where the estimates have settled, but both app versions, opencl_ati_cat132 and opencl_ati5_cat132, are still being sent to it. opencl_ati_cat132 runs the currently distributed workunits ca. 5 minutes faster than opencl_ati5_cat132. Should it really take thousands of WUs to settle? _\|/_ U r s
Joined: 14 Oct 05 · Posts: 1137 · Credit: 1,848,733 · RAC: 0
Have a host where the estimates have settled since, but both app versions, opencl_ati_cat132 and opencl_ati5_cat132, are still sent to the host. The opencl_ati_cat132 runs the currently distributed workunits ca. 5 minutes faster than the opencl_ati5_cat132. No, of course it shouldn't, and I'm sure Eric can pin down the reason it does. Actually, I think it never settles with the current code. Note in the debug_version_select log messages for Raistmer's host that CUDA22, CUDA23, and CUDA32 get the same projected_flops of 3.808069e+10 from the plan_class_spec logic. The complete loop choosing the "best" version is based on that projection as modified by the random factor. The "setting projected flops based on host_app_version pfc:" doesn't happen until after the choice has been made. Under those circumstances, each of the CUDAxx plans has an equal chance to be chosen as "best" for a specific work request. Joe
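The ordering Joe describes can be sketched to show why it never settles. This is a toy model, not BOINC code (names are mine, and cuda32's measured rate here is a hypothetical value): because every CUDA version enters the comparison with the identical plan-class projected_flops, only the random factor decides, and the measured host_app_version pfc is applied too late to matter.

```python
import random

def buggy_best_version(versions, f=0.316):
    """Sketch of the broken ordering: the comparison uses the plan-class
    projected_flops (identical for every CUDA version), so the random
    factor alone decides; the measured per-version pfc is applied only
    after the winner is already chosen."""
    best, best_score = None, -1.0
    for v in versions:
        score = v["plan_class_flops"] * (1 + f * random.gauss(0, 1) / v["n"])
        if score > best_score:
            best, best_score = v, score
    best["projected_flops"] = best["pfc_flops"]  # too late to affect the choice
    return best

random.seed(0)
versions = [
    {"name": "cuda22", "plan_class_flops": 3.808069e10, "pfc_flops": 58.01e9, "n": 20},
    {"name": "cuda32", "plan_class_flops": 3.808069e10, "pfc_flops": 116.0e9, "n": 20},
]
wins = sum(buggy_best_version(versions)["name"] == "cuda22" for _ in range(10000))
print(wins / 10000)  # close to 0.5 despite cuda32 being ~2x faster
```

Each version keeps a roughly equal chance per work request, matching what Raistmer and Urs observe on their hosts.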
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
So different APRs are just ignored until an app has already been chosen??? A clear bug, then.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
I'll stop asking for CUDA work for now, because some server-side changes are definitely required. I'll allow work fetch again when there's something new to test. [The CUDA app itself is quite well proven already; months have passed with it on beta...]
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
There has to be some path around the correct logic, a short circuit in the application choice, but I haven't found it yet. I'm hacking at it today.
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
I've put up a new scheduler that generates about 10x the debugging output. Could you start taking new work so I can see a failure? Or point out a host that got the wrong work after the time this message was posted?
Joined: 3 Jan 07 · Posts: 1451 · Credit: 3,272,268 · RAC: 0
I put a new scheduler that generates about 10x the debugging output. Could you start taking new work so I can see a failure? Or point out a host that got the wrong work after the time this message was posted? Fired up 63280. The first fetch on restart was cuda42; cuda50 would have been a (marginally) better choice. Subsequent fetches were cuda32 (a bad, but viable, choice), then cuda50, then back to cuda32 again.
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
I found the problem. There are apparently two different methods for computing speed... One is based on the predicted speed of the GPU, and it is what's used to determine which version is faster. When the random factor is added to that, you're most likely to get the version that has computed the fewest results so far. This is contrary to the behaviour David has said the scheduler should have, so I will fix it.
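For contrast with the broken ordering, here is a sketch of what the fix amounts to (again a toy model, not Eric's actual patch; cuda32's measured rate is a hypothetical value): once the comparison itself uses each version's measured per-host rate (the host_app_version pfc), the same random factor can no longer mask a real 2x speed difference.

```python
import random

def fixed_best_version(versions, f=0.316):
    """Sketch of the corrected ordering: versions are compared on their
    measured per-host rate (host_app_version pfc), perturbed by the same
    (1 + f*r/n) random factor."""
    best, best_score = None, -1.0
    for v in versions:
        score = v["pfc_flops"] * (1 + f * random.gauss(0, 1) / v["n"])
        if score > best_score:
            best, best_score = v, score
    return best

random.seed(0)
versions = [
    {"name": "cuda22", "pfc_flops": 58.01e9, "n": 20},   # measured: slow (from the log)
    {"name": "cuda32", "pfc_flops": 116.0e9, "n": 20},   # hypothetical: ~2x faster
]
wins = sum(fixed_best_version(versions)["name"] == "cuda32" for _ in range(10000))
print(wins / 10000)  # ~1.0: the slow version almost never wins
```

With a 2x measured gap and n = 20, the perturbation is far too small to flip the outcome, which is the "fewer and fewer tasks for the slow version" behaviour the thread was expecting all along.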
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.