Tests of new scheduler features.

5px solid LightGreen" > berkeley.edu/beta/view_profile.php?userid=2"> Profile

Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist

Send message
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0

Author	Message
Eric J Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 15 Mar 05 Posts: 1547 Credit: 27,183,456 RAC: 0	Message 45855 - Posted: 14 May 2013, 21:21:51 UTC - in response to Message 45854. Arh, Jeff move old logs offline during the reorg. I'll need to go find them. ID: 45855 ·

Richard Haselgrove Volunteer tester Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0	Message 45856 - Posted: 14 May 2013, 22:35:26 UTC - in response to Message 45854. Sorry for not warning you. I needed to reset stats and cancel previous results. Your versions should be random again until you get back to 10 results per. I was clearing down the run on host 63280, and was left with just two tasks to report when the server closed for maintenance at approx 15:30 UTC. I haven't even fetched any new tasks, let alone reported any except those two, since then. But according to Application details for host 63280, we're practically up to non-random timings on all three apps already. I'm assuming that the 347 invalid tasks in All tasks for computer 63280 - all except five of which were 'Validation pending' seven hours ago - are slowly being semi-validated as 'work in progress' tasks are reported by wingmates. Is that a helpful leg-up for your next stats-gathering run, or might you need to nuke the database even more completely between runs? ID: 45856 ·

Eric J Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 15 Mar 05 Posts: 1547 Credit: 27,183,456 RAC: 0	Message 45857 - Posted: 14 May 2013, 22:45:52 UTC - in response to Message 45856. So far it appears that things will settle out pretty quickly. We're approaching the "100 normal results" for all the GPU setiathomes and the Windows CPU is there already. My major worry is that all the available versions won't go out to some hosts, that some hosts might continue to get the wrong versions even after they've done a couple hundred, and that credit granting might go haywire. I expect to see the usual "15 second deadline" problems, but they should go away quickly. ID: 45857 ·

Richard Haselgrove Volunteer tester Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0	Message 45858 - Posted: 14 May 2013, 22:54:09 UTC - in response to Message 45857. I'll leave that host on autopilot (other projects - not SETI Beta) overnight, and then try another semi-managed run tomorrow when I'm around to keep an eye on it. See what happens then. ID: 45858 ·

Raistmer Volunteer tester Send message Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0	Message 45859 - Posted: 15 May 2013, 3:17:21 UTC - in response to Message 45857. I expect to see the usual "15 second deadline" problems, but they should go away quickly. And no fix possible? Some hardwired defaults until mean values are formed at least... ID: 45859 ·

Raistmer Volunteer tester Send message Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0	Message 45860 - Posted: 15 May 2013, 3:23:18 UTC Last modified: 15 May 2013, 3:31:25 UTC My main ATi host recived all possible mix of tasks: CAL + OpenCL for AP and HD5 and non-HD5 for MB. So at least it has chance to form relative speed comparisons. EDIT: and what is not good: initial estimate for CAL AP 4 minutes shorter (!) than for OpenCL AP. But it's good known that CAL version much slower... EDIT2: and currently wrong APR relation between HD5 and non-HD5 for that host. ID: 45860 ·

Raistmer Volunteer tester Send message Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0	Message 45861 - Posted: 15 May 2013, 3:45:58 UTC - in response to Message 45855. Last modified: 15 May 2013, 3:48:20 UTC Arh, Jeff move old logs offline during the reorg. I'll need to go find them. No need to look old logs, look new ones. That host asked for work already and recived 97 cuda22 task. Again, only cuda22... http://setiweb.ssl.berkeley.edu/beta/host_app_versions.php?hostid=63368 Such number of slowest possible tasks makes me worry about work fetch algorithm sanity, at least on initial phase. ID: 45861 ·

Eric J Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 15 Mar 05 Posts: 1547 Credit: 27,183,456 RAC: 0	Message 45862 - Posted: 15 May 2013, 4:59:15 UTC - in response to Message 45859. I haven't figured out where this one comes from yet, and apparently nobody on the BOINC team cares all that much because it's a temporary problem. I guess I can understand why. On my list of current server problems it's #5 in priority. I expect to see the usual "15 second deadline" problems, but they should go away quickly. And no fix possible? Some hardwired defaults until mean values are formed at least... ID: 45862 ·

Eric J Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 15 Mar 05 Posts: 1547 Credit: 27,183,456 RAC: 0	Message 45863 - Posted: 15 May 2013, 5:00:32 UTC - in response to Message 45861. No need to look old logs, look new ones. That host asked for work already and recived 97 cuda22 task. Again, only cuda22... http://setiweb.ssl.berkeley.edu/beta/host_app_versions.php?hostid=63368 Such number of slowest possible tasks makes me worry about work fetch algorithm sanity, at least on initial phase. I'll check them out first thing in the morning. My priorities right now are to make Angela go to bed, and then go to bed myself. ID: 45863 ·

Eric J Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 15 Mar 05 Posts: 1547 Credit: 27,183,456 RAC: 0	Message 45864 - Posted: 15 May 2013, 5:05:32 UTC - in response to Message 45860. Last modified: 15 May 2013, 5:06:04 UTC Oh, and I should point out that I purposely increased the "cal_ati" computation speed estimates in order to check that things eventually get back to normal. The way things were, nobody with OpenCL was getting the cal_ati application and that was preventing me from testing the server logic. ID: 45864 ·

Richard Haselgrove Volunteer tester Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0	Message 45869 - Posted: 15 May 2013, 10:17:31 UTC - in response to Message 45858. ... try another semi-managed run tomorrow when I'm around to keep an eye on it. See what happens then. Opened up for work fetch - first new work since the project reset Tuesday 14 May. The good news - got cuda50 at the first attempt (correct choice) The bad news - estimates are screwy. Application details for host 63280 showed: Number of tasks completed 29 Average processing rate 158.29961718573 (from wingmates reporting against cancelled pending tasks) But the app_version came out with <flops>474687872384.139160</flops> - three times the APR speed, with a commensurate under-estimate of runtimes. ID: 45869 ·

Richard Haselgrove Volunteer tester Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0	Message 45871 - Posted: 15 May 2013, 13:23:31 UTC - in response to Message 45869. <flops>474687872384.139160</flops> - three times the APR speed, with a commensurate under-estimate of runtimes. The first of those triple-estimated WUs has validated (5313350) ... ... and been awarded triple credit. ID: 45871 ·

Raistmer Volunteer tester Send message Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0	Message 45872 - Posted: 15 May 2013, 15:48:32 UTC I foresee biased APRs when few tasks per GPU used. For example on my ATi host currently CAL and non-HD5 are paired. Obviously, non-HD5 got nice boost in speed, its elapsed time became lower and its APR will rise. IF cla task will be finished until HD5 task comes in play its APR will be lower than APR for slower app. But currently have no solution how to avoid this. GPU downclock provides another bias in APR and possible wrong app allocations.... All this can be solved only by sending slower app time to time to probe if best app chosen right... but such probes can be expensive performance-wise on properly configured hosts... So, some AI required ;) (for active users there will be nothing better than good old app_info with carefully chosen app and its params :) ) ID: 45872 ·

Eric J Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 15 Mar 05 Posts: 1547 Credit: 27,183,456 RAC: 0	Message 45873 - Posted: 15 May 2013, 17:04:53 UTC - in response to Message 45863. Last modified: 15 May 2013, 17:11:00 UTC I 2013-05-14 00:40:23.2153 [PID=7483 ] 2013-05-14 00:40:23.2153 [PID=7483 ] 2013-05-14 00:40:23.2158 [PID=7483 ] 2013-05-14 00:40:23.2159 [PID=7483 ] 2013-05-14 00:40:23.2159 [PID=7483 ] 2013-05-14 00:40:23.2159 [PID=7483 ] 2013-05-14 00:40:23.2159 [PID=7483 ] 2013-05-14 00:40:23.2159 [PID=7483 ] 2013-05-14 00:40:23.2159 [PID=7483 ] 2013-05-14 00:40:23.2159 [PID=7483 ] 2013-05-14 00:40:23.2159 [PID=7483 ] 2013-05-14 00:40:23.2159 [PID=7483 ] 2013-05-14 00:40:23.2159 [PID=7483 ] 2013-05-14 00:40:23.2160 [PID=7483 ] 2013-05-14 00:40:23.2160 [PID=7483 ] 2013-05-14 00:40:23.2160 [PID=7483 ] 2013-05-14 00:40:23.2160 [PID=7483 ] 2013-05-14 00:40:23.2160 [PID=7483 ] 2013-05-14 00:40:23.2160 [PID=7483 ] It's possible we've overestimated ID: 45873 ·

found out why 63368 is only getting cuda22. It's reporting 360MB of available GPU RAM, but cuda23-cuda50 require 384MB. [version] [AV#370] Skipping CPU version - user prefs say no CPU [version] Checking plan class 'cuda22' [version] reading plan classes from file '../plan_class_spec.xml' [version] plan_class_spec: host_flops: 2.749583e+09, scale: 1.00, projected_flops: 2.391303e+10, peak_flops: 2.455174e+10 [quota] [AV#364] scaled max jobs per day: 33 [version] [AV#364] (cuda22) adjusting projected flops based on PFC avg: 27.72G [version] Best app version is now AV364 (18.34 GFLOP) [version] Checking plan class 'cuda23' [version] plan_class_spec: GPU RAM required min: 402653184.000000, supplied: 377028608.000000 [version] [AV#365] app_plan() returned false [version] Checking plan class 'cuda32' [version] plan_class_spec: GPU RAM required min: 402653184.000000, supplied: 377028608.000000 [version] [AV#366] app_plan() returned false [version] Checking plan class 'cuda42' [version] plan_class_spec: CUDA version required min: 4020, supplied: 3020 [version] [AV#367] app_plan() returned false [version] Checking plan class 'cuda50' [version] plan_class_spec: CUDA version required min: 5000, supplied: 3020 [version] [AV#368] app_plan() returned false the memory requirements of the cuda apps. Anyone have a better estimate of what is required?
Message 45874 - Posted: 15 May 2013, 17:24:23 UTC - in response to Message 45869. The bad news - estimates are screwy. I'm not surprised at this. Hopefully the estimates trend toward better values quickly. The problem is that any time I change the initial estimates, it makes all the measured APR values incorrect by the same amount and screws up the credit calculation for workunits in process and the APR corrections that result from those calculation (hence the reason I need to cancel all workunits in process when I do this.) I think I've learned enough that I can do reasonable initial values when we move to the main project. ID: 45874 ·

Richard Haselgrove Volunteer tester Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0	Message 45875 - Posted: 15 May 2013, 17:40:14 UTC - in response to Message 45874. The bad news - estimates are screwy. I'm not surprised at this. Hopefully the estimates trend toward better values quickly. The problem is that any time I change the initial estimates, it makes all the measured APR values incorrect by the same amount and screws up the credit calculation for workunits in process and the APR corrections that result from those calculation (hence the reason I need to cancel all workunits in process when I do this.) I think I've learned enough that I can do reasonable initial values when we move to the main project. It's beginning to shift. Latest estimates are 3:00, instead of 2:30 at the time I posted - but still short of the actual 7:30 for the current configuration. ID: 45875 ·

William Volunteer tester Send message Joined: 14 Feb 13 Posts: 606 Credit: 588,843 RAC: 0	Message 45876 - Posted: 15 May 2013, 17:56:11 UTC - in response to Message 45873. I found out why 63368 is only getting cuda22. It's reporting 360MB of available GPU RAM, but cuda23-cuda50 require 384MB. It's possible we've overestimated the memory requirements of the cuda apps. Anyone have a better estimate of what is required? 'less' It's a bit difficult. All x41zc varieties from 2.2 to 5.0 will squeeze into a 256MiB card with 237 MiB available and run without problems (e.g. host 62763. I don't know if BOINC correctly reports free VRAM back to the scheduler for use. What does it say for that host? If there is more memory available and you run more than one at a time and you have a Fermi or Kepler class card, it will use more memory per instance and requirements may not be linear with number of instances either. We can try and check the memory footprint of the different versions on available alpha tester cards. A person who won't read has no advantage over one who can't read. (Mark Twain) ID: 45876 ·

Eric J Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 15 Mar 05 Posts: 1547 Credit: 27,183,456 RAC: 0	Message 45877 - Posted: 15 May 2013, 18:20:12 UTC - in response to Message 45876. Last modified: 15 May 2013, 18:34:30 UTC I'll drop it to 237 and will check for failures. Host 62763 reports 256MB on the card, but the logs don't show what it's reporting as available. We'll see if its greater than 237MB next time is contacts the scheduler. ID: 45877 ·

Raistmer Volunteer tester Send message Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0	Message 45878 - Posted: 15 May 2013, 19:16:25 UTC - in response to Message 45873. Last modified: 15 May 2013, 19:25:41 UTC I found out why 63368 is only getting cuda22. It's reporting 360MB of available GPU RAM, but cuda23-cuda50 require 384MB. 2013-05-14 00:40:23.2153 [PID=7483 ] [version] [AV#370] Skipping CPU version - user prefs say no CPU 2013-05-14 00:40:23.2153 [PID=7483 ] [version] Checking plan class 'cuda22' 2013-05-14 00:40:23.2158 [PID=7483 ] [version] reading plan classes from file '../plan_class_spec.xml' 2013-05-14 00:40:23.2159 [PID=7483 ] [version] plan_class_spec: host_flops: 2.749583e+09, scale: 1.00, projected_flops: 2.391303e+10, peak_flops: 2.455174e+10 2013-05-14 00:40:23.2159 [PID=7483 ] [quota] [AV#364] scaled max jobs per day: 33 2013-05-14 00:40:23.2159 [PID=7483 ] [version] [AV#364] (cuda22) adjusting projected flops based on PFC avg: 27.72G 2013-05-14 00:40:23.2159 [PID=7483 ] [version] Best app version is now AV364 (18.34 GFLOP) 2013-05-14 00:40:23.2159 [PID=7483 ] [version] Checking plan class 'cuda23' 2013-05-14 00:40:23.2159 [PID=7483 ] [version] plan_class_spec: GPU RAM required min: 402653184.000000, supplied: 377028608.000000 2013-05-14 00:40:23.2159 [PID=7483 ] [version] [AV#365] app_plan() returned false 2013-05-14 00:40:23.2159 [PID=7483 ] [version] Checking plan class 'cuda32' 2013-05-14 00:40:23.2159 [PID=7483 ] [version] plan_class_spec: GPU RAM required min: 402653184.000000, supplied: 377028608.000000 2013-05-14 00:40:23.2159 [PID=7483 ] [version] [AV#366] app_plan() returned false 2013-05-14 00:40:23.2160 [PID=7483 ] [version] Checking plan class 'cuda42' 2013-05-14 00:40:23.2160 [PID=7483 ] [version] plan_class_spec: CUDA version required min: 4020, supplied: 3020 2013-05-14 00:40:23.2160 [PID=7483 ] [version] [AV#367] app_plan() returned false 2013-05-14 00:40:23.2160 [PID=7483 ] [version] Checking plan class 'cuda50' 2013-05-14 00:40:23.2160 [PID=7483 ] [version] plan_class_spec: CUDA version required min: 5000, supplied: 3020 2013-05-14 00:40:23.2160 [PID=7483 ] [version] [AV#368] app_plan() returned false It's possible we've overestimated the memory requirements of the cuda apps. Anyone have a better estimate of what is required? FYI card has 384 MB of dedicated GPU RAM (so don't know where BOINC took 360 MB) and fully capable of doing any CUDA tasks we have till now. EDIT: checked what BOINC reports locally: 360MB of GPU RAM and 330MB free GPU RAM. Total reported RAM amount definitely wrong one. ID: 45878 ·

TRuEQ & TuVaLu Volunteer tester Send message Joined: 28 Jan 11 Posts: 619 Credit: 2,580,051 RAC: 0	Message 45879 - Posted: 15 May 2013, 19:23:09 UTC Last modified: 15 May 2013, 19:23:24 UTC I had a 4850 card with 512MB Ram and only 480 of thoose MB's where for apps. Dunno if that is info relevant here though... ID: 45879 ·