Tests of new scheduler features.
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
I found a typo in the ati_opencl_100 plan class that was causing some ATI machines without OpenCL to get OpenCL work. I fixed it.
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
That was part of it. I can see why the reason isn't being transmitted: there's a separate reason for each app version that's considered. You didn't get work for some versions because of your driver revision. If BOINC printed out the reason for every app version it would get confusing, unless it were done very carefully. I'll think about it.
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
And yes, the Astropulse results from the last two tapes have been mostly outliers (98.6%, to be precise). I've put on a new tape and hope for better luck.
Joined: 10 Feb 12 · Posts: 107 · Credit: 305,151 · RAC: 0
However did you manage to get my driver version to show!? It's been "unknown" for over 2 and a half years! :) Awesome...
Joined: 3 Jan 07 · Posts: 1451 · Credit: 3,272,268 · RAC: 0
It was: http://setiweb.ssl.berkeley.edu/beta/show_host_detail.php?hostid=18439 and I bet because of full quota.

For Beta at least, it would be hugely useful if you could enable click-through from the host listing page to the most recent server log for each host, as Einstein have done. I'm sure it's a big server load, but with the recent new hardware, I think it might be worth a try. Then the - relatively few - active testers here could look at the - version by version - decision-making process followed by each scheduler, and pick out anomalies without pestering you to extract and post each server event that comes under scrutiny. I'm sure Bernd would lend a hand with implementation.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
It was: http://setiweb.ssl.berkeley.edu/beta/show_host_detail.php?hostid=18439 and I bet because of full quota.

Yeah, good idea! That way we would also get a chance to learn more about the server side of the project :) Of course it should be done ONLY on beta; it's not needed on the main project.

And regarding the reason given for no work - yes, I had the same thought that BOINC's answer was "need driver upgrade". And BOINC is right: if I upgraded the driver, more app versions would be available and the host wouldn't have reached its quota there... but from the user's point of view it is still the wrong answer, because such an answer doesn't distinguish between a fulfilled and a refused request - the host gets a driver upgrade recommendation on each and every request. We need something like a "quota reached for the app versions available to this particular host" answer, similar to what BOINC says when there is work for MB but not for AP, for example: a hint to the user about what the current problem is (no work for AP) and a hint about what the user can do (enable MB work or, in my case, upgrade drivers to receive CUDA50).
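A minimal, hypothetical sketch (not actual BOINC scheduler code) of how the per-app-version reasons mentioned earlier in the thread could be collapsed into a single user-facing line along the lines suggested above; the app version names and reason strings are invented for illustration:

```python
# Hypothetical sketch: collapse per-app-version refusal reasons into one message.
# None of these names come from the real scheduler; they are placeholders.

reasons = {
    "setiathome_v7 (cuda50)": "driver too old",
    "setiathome_v7 (cuda32)": "daily quota reached",
    "astropulse_v6 (cpu)":    "daily quota reached",
}

quota_hit = [v for v, r in reasons.items() if r == "daily quota reached"]
other     = {v: r for v, r in reasons.items() if r != "daily quota reached"}

if quota_hit and not other:
    # Every eligible version was refused for the same reason: say so once.
    print("No work sent: daily quota reached for all app versions available to this host.")
elif quota_hit:
    # Quota exhausted for the usable versions; hint at what would unlock the rest.
    hints = "; ".join(f"{v}: {r}" for v, r in other.items())
    print(f"No work sent: daily quota reached for the usable app versions ({hints} for the others).")
else:
    # Otherwise fall back to one line per version.
    for version, reason in reasons.items():
        print(f"{version}: {reason}")
```

The point is simply to report one aggregate line when every usable version was refused for the same reason, instead of repeating the "upgrade your driver" hint on every request, fulfilled or not.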
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
Eric, after more consideration I think the temporary effect of limited quota on app version allocation is a positive feature, not a negative one, and should not be "fixed" at all. It binds even very fast hosts to a real-world time scale, and we can spot and fix issues with misbehaving hosts only on a real-world time scale, no matter how fast the host is. A very fast host left in a broken state (for example, a GPU downclock) for the same amount of real-world time will complete many tasks with distorted timings. So allocating tasks from all apps (even slower ones) when the quota shrinks is good - it gives BOINC a chance to probe the host under the new conditions instead of getting stuck in the wrong state.
Joined: 18 Jan 06 · Posts: 1038 · Credit: 18,734,730 · RAC: 0
Two of my hosts are getting too much work assigned now. Local cache settings on both are Min 0.1 days + Max additional 0.5 days, but host 51991 has already got three times that amount (450+ WUs) and host 50380 got even more (1500+ WUs). I stopped work fetch on both hosts manually. Shouldn't there be a limitation because of the cache settings?

_\|/_ U r s
Joined: 14 Feb 13 · Posts: 606 · Credit: 588,843 · RAC: 0
Two of my hosts are getting too much work assigned now.

Not if the flops values, and therefore the estimates, are hopelessly wrong. If the estimates come out at 1/3 of reality - e.g. the APR is 6 but you're getting a flops of 18e9 - you'll get three times the amount of work you actually need. Compare the estimates on the host (taking DCF into account for BOINC 6) with known actual runtimes, or look up the flops received in the scheduler reply or client_state.xml. And of course once APR estimates kick in, BOINC starts to panic :D

A person who won't read has no advantage over one who can't read. (Mark Twain)
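To make that arithmetic concrete, here is a small worked sketch; the task size of 3.0e13 FLOPs is invented for illustration, while the 18e9 flops value and the ~6 GFLOPS APR are the figures from the example above:

```python
# Rough sketch of how an inflated <flops> value leads to over-fetching.
rsc_fpops_est = 3.0e13   # assumed task size in FLOPs (illustrative only)
flops_sent    = 18e9     # speed the scheduler claimed for this app version
actual_speed  = 6e9      # what the host really does (its APR, ~6 GFLOPS)

est_runtime    = rsc_fpops_est / flops_sent    # client's belief: ~1,667 s per task
actual_runtime = rsc_fpops_est / actual_speed  # reality:         ~5,000 s per task

buffer_seconds = (0.1 + 0.5) * 86400           # the 0.6-day cache, per device
print(buffer_seconds / est_runtime)            # ~31 tasks fetched to "fill" the cache
print(buffer_seconds / actual_runtime)         # ~10 tasks that actually fit it
```

The over-fetch factor is exactly the flops error, so a 3x optimistic flops value turns a 0.6-day cache into roughly 1.8 days of work.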
Joined: 3 Jan 07 · Posts: 1451 · Credit: 3,272,268 · RAC: 0
I find it helpful to enable the <sched_op_debug> logging flag, and compare the values (numbers of seconds) between what the client requested and what the client estimated the server's response to be.
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
There should have been a limitation because of the app version max results per day, unless you're asking for 7+ days' worth of work.
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
Two of my hosts are getting too much work assigned now.

Last time host 50380 got work, it asked for 0.6 days of GPU work. The time before that it asked for 2 days of CPU work. It's not obeying your cache settings for some reason, or your computer thinks that the work you have is going to take zero time.

2013-05-16 22:00:23.2456 [PID=25843] Sending reply to [HOST#50380]: 3 results, delay req 7.00
2013-05-16 22:01:29.3086 [PID=28539] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-16 22:02:35.7364 [PID=29746] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-16 22:03:38.0991 [PID=29875] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-16 22:04:57.1270 [PID=30032] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-16 22:06:33.9883 [PID=30862] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-16 22:07:52.8436 [PID=31073] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-16 22:09:09.4220 [PID=31263] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-16 22:10:26.5875 [PID=31427] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-16 22:11:58.2080 [PID=426 ] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-16 22:13:40.9137 [PID=635 ] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-16 22:14:58.2322 [PID=901 ] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-16 22:15:54.1178 [PID=1646 ] Sending reply to [HOST#50380]: 5 results, delay req 7.00
2013-05-16 22:34:02.4328 [PID=9818 ] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-16 22:54:10.1618 [PID=19954] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-16 23:06:09.5097 [PID=25762] Sending reply to [HOST#50380]: 11 results, delay req 7.00
2013-05-16 23:57:46.2671 [PID=17817] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-17 01:17:54.1977 [PID=23106] Sending reply to [HOST#50380]: 9 results, delay req 7.00
2013-05-17 01:49:07.7491 [PID=5520 ] Sending reply to [HOST#50380]: 11 results, delay req 7.00
2013-05-17 02:10:24.7012 [PID=17028] Sending reply to [HOST#50380]: 9 results, delay req 7.00
Joined: 18 Jan 06 · Posts: 1038 · Credit: 18,734,730 · RAC: 0
Lol, computers and thinking, lol! I will try setting some flags in cc_config and then try enabling work fetch on 50380 again.

_\|/_ U r s
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
Lol, with such frequency and CU numbers your computer will not only think, it will write books and teach us new theories ;D
Joined: 18 Jan 06 · Posts: 1038 · Credit: 18,734,730 · RAC: 0
The question is: why such numbers? Another host on main has even higher numbers and is not getting too much work, using the same BOINC version 7.0.65. Here is the debug output of work_fetch:

Fr 17 Mai 2013 22:00:49 CEST [work_fetch] work fetch start
Fr 17 Mai 2013 22:00:49 CEST [work_fetch] choose_project() for ATI: buffer_low: no; sim_excluded_instances 0
Fr 17 Mai 2013 22:00:49 CEST [work_fetch] choose_project() for CPU: buffer_low: no; sim_excluded_instances 0
Fr 17 Mai 2013 22:00:49 CEST [work_fetch] ------- start work fetch state -------
Fr 17 Mai 2013 22:00:49 CEST [work_fetch] target work buffer: 8640.00 + 43200.00 sec
Fr 17 Mai 2013 22:00:49 CEST [work_fetch] --- project states ---
Fr 17 Mai 2013 22:00:49 CEST SETI@home Beta Test [work_fetch] REC 151466.771 prio -2.914135 can't req work: "no new tasks" requested via Manager
Fr 17 Mai 2013 22:00:49 CEST [work_fetch] --- state for CPU ---
Fr 17 Mai 2013 22:00:49 CEST [work_fetch] shortfall 119324.77 nidle 0.00 saturated 16471.96 busy 0.00
Fr 17 Mai 2013 22:00:49 CEST SETI@home Beta Test [work_fetch] fetch share 0.000
Fr 17 Mai 2013 22:00:49 CEST [work_fetch] --- state for ATI ---
Fr 17 Mai 2013 22:00:49 CEST [work_fetch] shortfall 0.00 nidle 0.00 saturated 1361018.69 busy 0.00
Fr 17 Mai 2013 22:00:49 CEST SETI@home Beta Test [work_fetch] fetch share 0.000
Fr 17 Mai 2013 22:00:49 CEST [work_fetch] ------- end work fetch state -------
Fr 17 Mai 2013 22:00:49 CEST [work_fetch] No project chosen for work fetch

_\|/_ U r s
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
I think I see what the issue is. The target work buffer is per core and per GPU. Since it's 4 processors and 2 GPUs, it wants a buffer of 207360 seconds of CPU work and 103680 seconds of GPU work. That should still only be about 44 S@H results on your GPU, so it looks like it was estimating that your GPUs could do a result in 20 seconds. I'll back up further in the logs to see what the server-side estimates were.
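As a quick cross-check of those figures (assuming the 0.1 + 0.5 day cache setting quoted earlier, and a realistic GPU task runtime on the order of 2,400 s, which is roughly what "~44 results" implies):

```python
# Back-of-the-envelope check of the per-device work buffer described above.
per_device = (0.1 + 0.5) * 86400   # 51,840 s of work requested per device

print(per_device * 4)              # 4 CPU cores -> 207,360 s of CPU work
print(per_device * 2)              # 2 GPUs      -> 103,680 s of GPU work

print(per_device * 2 / 2400)       # ~43 tasks at a realistic ~2,400 s each
print(per_device * 2 / 20)         # 5,184 tasks at the bogus 20 s estimate
```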
Joined: 18 Jan 06 · Posts: 1038 · Credit: 18,734,730 · RAC: 0
I think I see what the issue is. The target work buffer is per core and per GPU. Since it's 4 processors and 2 GPUs it wants a buffer of 207360 seconds of CPU work and 103680 seconds of GPU work.

No idea why BOINC counts a single Opteron processor with 2 dual cores (4 siblings) as 4 processors. That's something one might wonder about when looking at some stats sites, but you'd never think it could have such a negative side effect. 20 seconds per task would be 120 times less than the real world shows (with the shorties that are around). Good that fetching was stopped manually. If I remember correctly, the first estimates from the server were 7:13 minutes per task.

_\|/_ U r s
Joined: 3 Jan 07 · Posts: 1451 · Credit: 3,272,268 · RAC: 0
The question is: why such numbers? Another host on main has even higher numbers and is not getting too much work, using the same BOINC version 7.0.65.

Urs, could you try <sched_op_debug>, please?

18/05/2013 00:03:05 | SETI@home Beta Test | [sched_op] NVIDIA work request: 6576.98 seconds; 0.00 devices
18/05/2013 00:03:08 | SETI@home Beta Test | Scheduler request completed: got 17 new tasks
18/05/2013 00:03:08 | SETI@home Beta Test | [sched_op] estimated total NVIDIA task duration: 6662 seconds

That is actually more helpful than <work_fetch> for this sort of checking.

@ Eric: It's quite subtle, and it needs to be checked carefully which values are 'wall time' and which are 'device time' - especially where multiple devices are in play. Urs' "shortfall 119324.77" would have been a request for 1 day 9 hours of device-time work (if work fetch hadn't been disabled), even though the cache setting was for 2.4 + 12 hours of wall-time. I presume that host can crunch at least three CPU tasks in parallel, so there are three (or more) device-hours (CPU-core-hours) in every wall-hour.
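A rough illustration of that wall-time vs. device-time distinction, assuming the four CPU cores discussed above:

```python
# Wall-clock cache setting vs. device-time shortfall (illustrative only).
cache_wall_hours = 2.4 + 12            # 0.1 + 0.5 days, in wall-clock hours
cpu_cores        = 4                   # cores crunching in parallel (assumed)

# Each wall-clock hour of cache absorbs one hour of work per core, so the
# CPU shortfall is counted in core-seconds rather than wall-seconds.
print(cache_wall_hours * 3600 * cpu_cores)   # 207,360 core-seconds of capacity

shortfall = 119324.77                        # from the work_fetch log above
print(shortfall / 3600)                      # ~33 device-hours, i.e. "1 day 9 hours"
print(shortfall / 3600 / cpu_cores)          # ~8.3 wall-clock hours spread over 4 cores
```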
Joined: 18 Jan 06 · Posts: 1038 · Credit: 18,734,730 · RAC: 0
Urs, could you try <sched_op_debug>, please?

It was already active, but the host didn't ask for more GPU work anymore, so I had to hit the button (see below). Estimates for opencl_ati_sah are now near the real duration of the WUs, but no sign of "high priority" from BOINC.

Sa 18 Mai 2013 03:16:04 CEST SETI@home Beta Test update requested by user

_\|/_ U r s
Joined: 16 Jun 05 · Posts: 2531 · Credit: 1,074,556 · RAC: 0
I'm wondering why this host is getting AP 6.04 units: http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=60626
The driver in use is 186.18. If I'm not mistaken, the first NV OpenCL driver was 195.55.

With each crime and every kindness we birth our future.