DCF when the GPUs are different speeds
Author | Message |
---|---|
red-ray Send message Joined: 24 Jun 99 Posts: 308 Credit: 9,029,848 RAC: 0 |
I have been trying to get a stable DCF on http://setiathome.berkeley.edu/show_host_detail.php?hostid=6379672 and have managed to improve things quite a lot by adding flops values to app_info.xml, but I expect I have got it as good as it's going to get. The problem is that I have 4 GPUs of different speeds, and when one of the slow GPUs finishes a task I typically get:

16/03/2012 15:20:28 | SETI@home | [dcf] DCF: 0.373319->1.174715, raw_ratio 1.174715, adj_ratio 3.146681

To get things to work vaguely sensibly I have used flops values such that the CPUs and fast GPUs typically have a DCF of 0.4, which means I can get the current 400/50 WU limits and I don't get timeouts on the slow GPUs. The actual GPU configuration is:

16/03/2012 12:32:59 | | NVIDIA GPU 0: GeForce GTX 460 (driver version 28562, CUDA version 4010, compute capability 2.1, 1024MB, 684 GFLOPS peak)

Given that BOINC reports this, it must know the relative speed of the GPUs. To my thinking, BOINC should clearly be taking the relative speed of the GPUs into account when calculating the DCF for a given WU. Further, I suspect it could even work out the speed of each GPU relative to the CPU and thereby totally remove the need for flops entries in app_info.xml. Were a future release of BOINC to do this, maybe some of the Luddites running old versions of BOINC would finally update!
red-ray Send message Joined: 24 Jun 99 Posts: 308 Credit: 9,029,848 RAC: 0 |
Will this ever be fixed? Currently I get a lot of the following, which triggers a load of high-priority running.

19/03/2012 13:08:51 | SETI@home | [dcf] DCF: 0.691461->2.026834, raw_ratio 2.026834, adj_ratio 2.931233
red-ray Send message Joined: 24 Jun 99 Posts: 308 Credit: 9,029,848 RAC: 0 |
Far too often it jumps way too high and takes far too long to recover.

20/03/2012 15:24:03 | SETI@home | [dcf] DCF: 0.591590->4.139177, raw_ratio 4.139177, adj_ratio 6.996701
John McLeod VII Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
> Far too often it jumps way too high and takes far too long to recover.

The DCF is designed to prevent far too much work from being downloaded to a host. It assumes that the estimates for each application from a project will be off in a similar manner. The fix is to have DCF be per application for CPU scheduling. This will not, however, work for work fetch, as work fetch is per project.

BOINC WIKI
red-ray Send message Joined: 24 Jun 99 Posts: 308 Credit: 9,029,848 RAC: 0 |
> The DCF is designed to prevent far too much work from being downloaded to a host. It assumes that the estimates for each application from a project will be off in a similar manner. The fix is to have DCF be per application for CPU scheduling. This will not, however, work for work fetch as the work fetch is per project.

I wonder, do you understand what I am asking for? How can a DCF per application address GPUs of different speeds? As I said initially, there needs to be per-device DCF adjustment. Further, the current code that allows the DCF to jump from 0.591590->4.139177 is inappropriate and needs fixing. It allows the DCF to instantly jump so high that the (adj_ratio < 0.1) branch applies when it should not, and it then takes forever and a day for the DCF to return to what it should be.

```cpp
void PROJECT::update_duration_correction_factor(ACTIVE_TASK* atp) {
    RESULT* rp = atp->result;
    double raw_ratio = atp->elapsed_time/rp->estimated_duration_uncorrected();
    double adj_ratio = atp->elapsed_time/rp->estimated_duration();
    double old_dcf = duration_correction_factor;

    // it's OK to overestimate completion time,
    // but bad to underestimate it.
    // So make it easy for the factor to increase,
    // but decrease it with caution
    //
    if (adj_ratio > 1.1) {
        duration_correction_factor = raw_ratio;
    } else {
        // in particular, don't give much weight to results
        // that completed a lot earlier than expected
        //
        if (adj_ratio < 0.1) {
            duration_correction_factor = duration_correction_factor*0.99 + 0.01*raw_ratio;
        } else {
            duration_correction_factor = duration_correction_factor*0.9 + 0.1*raw_ratio;
        }
    }

    // limit to [.01 .. 100]
    //
    if (duration_correction_factor > 100) duration_correction_factor = 100;
    if (duration_correction_factor < 0.01) duration_correction_factor = 0.01;

    if (log_flags.dcf_debug) {
        msg_printf(this, MSG_INFO,
            "[dcf] DCF: %f->%f, raw_ratio %f, adj_ratio %f",
            old_dcf, duration_correction_factor, raw_ratio, adj_ratio
        );
    }
}
```
John McLeod VII Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
> The DCF is designed to prevent far too much work from being downloaded to a host. It assumes that the estimates for each application from a project will be off in a similar manner. The fix is to have DCF be per application for CPU scheduling. This will not, however, work for work fetch as the work fetch is per project.

Actually, a DCF per device is not necessarily a requirement. After all, it is the difference between the actual and expected times, but the servers do not specify the time; they specify the FLoating Point OPeration count. So the speed of the processor enters the equation when calculating the original estimated time to compute. Then the actual time is divided by the original estimate to get a duration correction factor for that task.

I believe that BOINC only maintains one speed for all GPUs and one speed for all CPUs. It is this number that needs to be replicated for each GPU type, rather than the DCF.

BOINC WIKI
red-ray Send message Joined: 24 Jun 99 Posts: 308 Credit: 9,029,848 RAC: 0 |
> I believe that BOINC only maintains one speed for all GPUs and one speed for all CPUs. It is this number that needs to be replicated for each GPU type rather than the DCF.

Yes, that is what I meant by "there needs to be per device DCF adjustment". In my initial post I also said "clearly BOINC should be taking relative speed of the GPUs into account when calculating the DCF". When will a BOINC that does this, or resolves my issue via some other regime, be released?

You have not commented on the issues I have with the current DCF jumping way too high. When will that code be fixed or expunged?
red-ray Send message Joined: 24 Jun 99 Posts: 308 Credit: 9,029,848 RAC: 0 |
> Actually a DCF per device is not necessarily a requirement. After all, it is the difference between the actual and expected times, but the servers do not specify the time, they specify the FLoating Point OPerations count. So the speed of the processor is entered into the equation when calculating the original estimated time to compute. Then the actual time is divided by the original time to get a duration correction factor for that task.

Actually, I have never asked for "a DCF per device". To me it has always been obvious this would not be a good solution to the issue I have.
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
> When will the BOINC that does this be released?

As far as I know, never, since CreditNew will take over the function of TDCF and then it all happens on the server. The server will maintain host_app_version.et, the statistics (mean and variance) of job runtimes (normalized by wu.fpops_est) per host and application version.

Source: Job runtime estimates.
red-ray Send message Joined: 24 Jun 99 Posts: 308 Credit: 9,029,848 RAC: 0 |
> When will the BOINC that does this be released?

Thank you for the link, which I have just read. I can't see an explicit reference to how GPUs with different speeds are catered for, though. Have I missed it?

Which time will be displayed in the Remaining column for GPU tasks that are not running when a system has GPUs of different speeds? Which version of BOINC has this?
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
I answered before your edit, on the notion of "which BOINC will do a per-device DCF adjustment". And that's that no BOINC will do that, nor per application. As far as I understand from David, DCF is going away and isn't in use in CreditNew, and therefore not in use on projects that use CreditNew. Seti is one of the projects that uses CreditNew.
red-ray Send message Joined: 24 Jun 99 Posts: 308 Credit: 9,029,848 RAC: 0 |
> I answered before your edit, on the notion of "which BOINC will do a per device DCF adjustment". And that's that no BOINC will do that. Also not for per application. As far as I understand from David, DCF is going away and isn't in use in CreditNew and therefore not in use on projects that use CreditNew. Seti is one of the projects that uses CreditNew.

Once I gathered DCF was going away, I made the edit to make the request general. The real issue now is: will the new regime address GPUs of different speeds in the same system? Thus far I can't find any information that says it will.

Will the new regime address CPUs with Hyper-Threading Technology, where the CPU speed depends on whether one or two threads are active on each core?
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
> The real issue now is will the new regime address GPUs of different speeds being in the same system? Thus far I can't find any information that says it will.

These are things that you shouldn't ask in the Seti forums, as they're a BOINC thing. So best ask them on the BOINC development email list. This list requires registration.
red-ray Send message Joined: 24 Jun 99 Posts: 308 Credit: 9,029,848 RAC: 0 |
> The real issue now is will the new regime address GPUs of different speeds being in the same system? Thus far I can't find any information that says it will.

It would be better to use a PM rather than posting on this thread, but as you have, I have to reply here. I do not wish to join the BOINC development email list, as I suspect I would get a large number of emails. Given this, what other option do I have but to post the issue here?

I feel there should be a "Developer" thread, requiring approval before you are allowed to post, on which these types of concerns could be raised. I feel approval is needed to keep the signal-to-noise ratio high.
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
> I do not wish to join the BOINC development email list as I suspect I would get a large number of emails. Given this what other option do I have but to post the issue here?

You could of course unsubscribe from that list after you've had your question(s) answered, but if you really do not feel like joining it, you can always try to email David personally. Be nice and eloquent, though, and make sure to explain your problem in detail.

> I feel there should be a "Developer" thread that requires approval before you are allowed to post on which these types of concerns could be raised. I feel approval is needed to keep the Signal to Noise ratio high.

But then you'd need such a thread on every one of the 50+ projects, and someone going through those projects on a daily basis to gather information. It's easier to have forums for that, which we do... but even then, the developers will only check in there when they're pointed to such-and-such a thread and what's in it (by me, mostly). They're too busy with all other things BOINC, non-BOINC and personal life to also read and answer forums three times a day. The email lists arrive directly in their inboxes, which is why I point those out first. That is also where the other volunteer developers (such as John McLeod) will read what you have to say or ask, and answer if they know about the subject.
dads Send message Joined: 14 Jan 12 Posts: 4 Credit: 158,397 RAC: 0 |
This is what gets my goat: they send me 25 603 Enhanced non-CUDA tasks, and when those are done I'm left with 83 CUDA tasks. My quad core does nothing until I get most of the units done and they send me more work. I need more work for my CPU x 4.
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.