Message boards : News : Tests of new scheduler features
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
Is it possible to increase the amount of time BOINC waits before aborting a task with 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED? I think the project loses more computing power to this "protective" feature than to the stuck tasks themselves. It's especially relevant for GPU tasks, where the total execution time is short but the computational environment can change very strongly. Example: http://setiweb.ssl.berkeley.edu/beta/result.php?resultid=13913803 was aborted after ~1h of execution. This host has a well-established running time now, but sometimes tasks take longer. Reason: a game was running at the time. There were no slowdowns in the game itself, but GPU computing was obviously slowed down. This task had every chance of being completed in time; there was no deadline pressure at all. Moreover, it would have regained its usual speed right after the game ended... but instead it was aborted by BOINC. That's not "protective", it's abuse!
Joined: 29 May 06 · Posts: 1037 · Credit: 8,440,339 · RAC: 0
I'm wondering why this host is getting AP 6.04 units. The first driver that the NV OpenCL app produces good work on is 263.xx; 260.99 isn't good enough. I did have a test-case host (on the Main project) that was producing inconclusive work, but they've upgraded the driver recently. Edit: and it's still producing inconclusive/invalid work, probably because it did its compilations under 260.99: All AstroPulse v6 tasks for computer 4819000 Claggy
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
So, time to check app_plan driver limits, right?
Joined: 3 Jan 07 · Posts: 1451 · Credit: 3,272,268 · RAC: 0
So, time to check app_plan driver limits, right? We have a saying in the English political-legal system: "Hard cases make bad law". In other words - don't rush to change the law just because one special case had difficulty slotting into the current framework. Claggy found one case, of one host, run by a volunteer who has never posted on the message boards. He assumed that the bad results were caused by 'driver version too low': I'd think that the subsequent continuing errors under a newer driver raise doubts about that assumption. Either way, I suggest that changing the driver limits on the basis of that one case, with no corroborating evidence, is contrary to the "scientific method". But by all means check, re-check, and propose an alternative based on thorough research (with citations). Similarly, one case of a task which over-ran its time limit because the user forgot to set the 'exclusive app' tag to pause it while gaming? Again, I'm unconvinced that one case is enough to justify a change in project policy. 'EXIT_TIME_LIMIT_EXCEEDED' is triggered when a task exceeds its <rsc_fpops_bound> limit, which for this project (both Main and Beta) has been set at 10x <rsc_fpops_est> since time immemorial. It would be trivial to re-set the workunit generators (splitters, in local parlance) to use a 20x, or 100x, or any other time limit: but that would impinge on all other over-running tasks, too. What about the case when an application - according to BOINC - is active, but in reality has stalled, using no CPU time and making zero progress? That task's exit will be delayed, too, resulting in longer pendings for wingmates, tasks lingering longer in the BOINC database, data files staying longer in the server fanout before being deleted, etc. Again, think through all the consequences throughout the interconnected client-server system before proposing changes.
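The arithmetic behind the bound described above can be sketched as follows. This is a hedged illustration only: the function, the numbers, and the way the limit is derived from device speed are assumptions for the sake of the example, not BOINC's actual client code.

```python
# Illustrative sketch: how a flops-based bound can turn into an
# elapsed-time limit that a temporarily slowed GPU task trips over.
# All numbers and names here are hypothetical.

def elapsed_time_limit(rsc_fpops_bound, device_flops):
    """Maximum allowed elapsed seconds: the bound on floating-point
    operations divided by the speed assumed for the device."""
    return rsc_fpops_bound / device_flops

rsc_fpops_est = 2e13                    # estimated ops for the task
rsc_fpops_bound = 10 * rsc_fpops_est    # the 10x multiplier discussed above
gpu_flops = 1e11                        # assumed device speed (100 GFLOPS)

limit_s = elapsed_time_limit(rsc_fpops_bound, gpu_flops)  # 2000 s

# If a game throttles the science app to 5% of the assumed speed,
# the task needs 20x the estimated time and exceeds the 10x bound,
# even though it is still making (slow) progress.
actual_time_s = rsc_fpops_est / (0.05 * gpu_flops)        # 4000 s
would_be_aborted = actual_time_s > limit_s                # True
```

Raising the multiplier to 20x in this sketch would double `limit_s`, which illustrates Richard's point: it delays every abort, including those of genuinely stalled tasks.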
Joined: 29 May 06 · Posts: 1037 · Credit: 8,440,339 · RAC: 0
So, time to check app_plan driver limits, right? Except I know it never worked on 260.99: it didn't three years ago when I tested the initial Nvidia Alpha build on my 9800GTX+. Raistmer confirmed it worked on 263.06; I confirmed it worked on 266.35 (r505 OpenCL AstroPulse for ATI GPUs, alpha for NV GPUs). Raistmer even posted: "@Claggy We never tried to fix the app for 260.99, and admittedly we haven't gone and tested the latest app with that driver. I see no reason why it would work now." Claggy
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
What about the case when an application - according to BOINC - is active, but in reality has stalled, using no CPU time and making zero progress? That is the keystone of the issue: BOINC should detect when no progress is being made. With slowed processing there will still be progress, just slow. With a stuck app the progress will be zero - not a small number, but virtually zero. And elapsed time is a bad measure of progress.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
I would generalize the issue this way: there are two time scales. There is the computer time scale, which scales with the PC's performance, and the real-world time scale, which never scales (until humans evolve into some other species). When a human interacts with a PC (running an app on it, for example), he/she always interacts on the real-world time scale. When he/she notices some slowdowns or other issues and fixes them, he/she again acts on the real-world time scale - no matter how fast or slow the particular PC is. But all flops-based time scaling is on the computer time scale, not the real-world time scale. Hence, when these scales differ too much (the PC is too fast), we see issues such as downloading thousands of tasks before the user can stop it, or aborting tasks just because the user spent an hour of rest playing a game, and so on. BOINC should take the real-world time scale into account too. It already does so where some timeouts are concerned (as in network communications); it should take the real-world time scale into account in this part as well. Also, there were "heartbeats" before. Are they gone completely? The time since the last checkpoint could be used instead, for example... And it would be good if this option were real-world-time based and configurable. For example: abort a task if it hasn't checkpointed for more than 1 hour (configurable by the user), with some adequate default. Or, maybe even better, a configurable multiplier. With a multiplier we could account for the different task lengths of different projects.
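The configurable real-world-time option proposed above could look roughly like the sketch below. Everything here (the class name, the default, the multiplier policy) is hypothetical; this was not an existing BOINC feature, just an illustration of the proposal.

```python
# Hypothetical sketch of the proposal above: abort a task only when it
# has not checkpointed within a configurable real-world interval.
# A merely slow app still checkpoints and is never flagged as stuck.
import time

class CheckpointWatchdog:
    def __init__(self, base_timeout_s=3600.0, multiplier=1.0):
        # base_timeout_s: user-configurable "no checkpoint" limit (default 1 hour)
        # multiplier: scales the limit for projects with longer tasks
        self.timeout_s = base_timeout_s * multiplier
        self.last_checkpoint = time.monotonic()

    def on_checkpoint(self):
        # Called whenever the science app writes a checkpoint.
        self.last_checkpoint = time.monotonic()

    def is_stuck(self, now=None):
        # True only when no checkpoint arrived within the window.
        now = time.monotonic() if now is None else now
        return (now - self.last_checkpoint) > self.timeout_s

wd = CheckpointWatchdog(base_timeout_s=3600.0, multiplier=2.0)
wd.on_checkpoint()
slow_but_alive = wd.is_stuck(now=wd.last_checkpoint + 600)   # False: checkpointed recently
truly_stuck = wd.is_stuck(now=wd.last_checkpoint + 8000)     # True: silent past the window
```

The multiplier makes the same watchdog usable for both minutes-long GPU tasks and day-long CPU tasks, which is the point Raistmer raises about different projects.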
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
Similarly, one case of a task which over-ran its time limit because the user forgot to set the 'exclusive app' tag to pause it while gaming? Again, I'm unconvinced that one case is enough to justify a change in project policy. On the EXIT_TIME_LIMIT_EXCEEDED issue I would disagree. 'Exclusive app' is a band-aid in this case. There is an infinite number of programs whose running will slow down the GPU. For example, this particular game ran for the first time on that host. It's not possible to remember to add each and every app that might consume the GPU to the exclusive-applications list. Moreover, even for this particular app it would be a workaround! The game was not slowed down at all. The app did not fail and did not produce an incorrect result. So what is "exclusive app" for? Just to mask BOINC's failure to discriminate "running slower" from "stuck"? I would strongly disagree that this is the right approach. There is a design flaw here, in the very approach to the issue. I agree that x20 or x100 will not SOLVE the issue; it will just delay it until new, even faster GPUs arrive. For CPU tasks we have real-world time and CPU time, and BOINC will not kill a task if it takes too much real-world time (only if the deadline has passed) while using the same CPU time. Here the app takes the same GPU time, but BOINC simply isn't able to determine that the app is using a fraction of the GPU, while it is able (via system calls) to determine that an app is using a fraction of the CPU. So an approach that is _maybe_ viable for CPU tasks is definitely not viable for GPU tasks, where a x100 slowdown due to GPU load is absolutely normal and can't be used as an indication of a stuck app. Hence, the use of the wrong method leads to bad results: processing time is lost, not saved, by this "precaution measure". And just as a current workaround, until a more sophisticated approach is devised, I still propose to increase the limit. x10 is too low for GPUs, even for a current midrange GPU. With tasks taking a few minutes of real-world time, even one hour of heavy GPU usage will result in unneeded task abortions. My example just demonstrated that.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
One more proposal: let's consider what measures we have now against each possible reason for a "stuck" state. 1) Deadline: the ultimate measure against a host that fails to report results back, for whatever reason, including a flooded home with the PC in it, a nuke strike, and so on and so forth. In general, the deadline is a measure against an unknown reason for a task being stuck. 2) EXIT_TIME_LIMIT_EXCEEDED. And here things become more interesting. If we already have a measure against unknown reasons, why have another measure against an "unknown" reason? Not needed. So there should be some "known reasons" that allow an optimization: aborting a stuck (known to be "stuck"!) task BEFORE it reaches the deadline. And we should take care that this new feature does not become a de-optimization. Killing a partially processed task without real need is an example of such de-optimization. Even a past-deadline, partially processed task is not killed immediately (or never killed?). So for what reasons can the EXIT_TIME_LIMIT_EXCEEDED "optimization" be successfully used? a) an unlimited wait for some resource (like a file descriptor, and so on); b) a GPU driver crash (actually, most probably that is case a) too, but with the waiting in the runtime part, not in the app's own part). In both these cases the app will stop calling the BOINC API (for example, the time-to-checkpoint request). So this fact can be used for stuck detection. Note that a slowed app will continue to call the BOINC API. What other reasons for a stuck state can one foresee?
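The slow-versus-stuck distinction described above can be sketched like this. The monitor below is purely illustrative (the class and method names are made up, and real API instrumentation would live inside the BOINC runtime); the only point it encodes is that a slow app keeps calling the API while a blocked one goes silent.

```python
# Illustrative sketch, not BOINC code: classify a task as "alive" or
# "stuck" by when it last called any BOINC API function. A task slowed
# by GPU contention still calls the API; one blocked forever on a file
# descriptor, mutex, or crashed driver stops calling it entirely.

class ApiLivenessMonitor:
    def __init__(self, silence_limit_s=1800.0):
        self.silence_limit_s = silence_limit_s
        self.last_api_call_s = 0.0

    def record_api_call(self, t_s):
        # Would be invoked by every BOINC API entry point
        # (checkpoint query, fraction-done report, etc.).
        self.last_api_call_s = t_s

    def classify(self, now_s):
        # "stuck": no API traffic for longer than the silence limit.
        # "alive": still calling the API, however slowly it computes.
        if now_s - self.last_api_call_s > self.silence_limit_s:
            return "stuck"
        return "alive"

mon = ApiLivenessMonitor(silence_limit_s=1800.0)
mon.record_api_call(t_s=100.0)
state_slow = mon.classify(now_s=1000.0)   # slowed by a game, but alive
state_hung = mon.classify(now_s=5000.0)   # silent too long: stuck
```

Unlike an elapsed-time bound, this test is insensitive to how fast the host happens to be computing, which is exactly the property the proposal asks for.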
Joined: 3 Jan 07 · Posts: 1451 · Credit: 3,272,268 · RAC: 0
Right, now we're beginning to think. But before we do, remember the question that started this discussion: Is it possible to increase amount of time BOINC wait before issue task abort due to 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED ? Who was it directed to? Eric, I presume - since you asked it in this thread, on this board. The lever Eric has at his disposal to pull - right now, with minimal programming - is the ratio between <rsc_fpops_est> and <rsc_fpops_bound>. I can't think of another one - can you? Otherwise, any solution is likely to involve substantial changes to at least two out of three from the BOINC server code, BOINC client code, and the BOINC application API code. Are we going to delay the v7 roll-across to the Main project while all that is done? If so, tell David that, rather than Eric. On this message board, shouldn't we concentrate on getting all the new apps tested, documented, and certified for deployment? BTW, on the subject of your 'stall' case (b) "waiting for file descriptor", I presume you've read the patch from Steffen Möller this lunchtime? It needs evaluation and testing before rushing into deployment, of course, but surely eliminating the file descriptor leak at source will be the quickest and most effective way of dealing with this case? Otherwise, I'd suggest that we simply allow the fundamental fault-tolerant design of BOINC to cope with the 197 exits - after all, since they only occur when a task takes 10x (or whatever) the estimated time, by definition you don't burn through a large number of tasks that way. Then, once deployment is complete, would be the time to re-schedule your gaming hours into a re-write of BOINC, as you've suggested on another message board :P ;)
Joined: 3 Jan 07 · Posts: 1451 · Credit: 3,272,268 · RAC: 0
Except I know it never worked on 260.99, it didn't three years ago when I tested the initial Nvidia Alpha build on my 9800GTX+, Raistmer confirmed it worked on 263.06, I confirmed it worked on 266.35 Well, I've looked back through the driver release notes (WHQL only) for your system (9800GTX+ on Vista x64), and found: OpenCL Support (found in http://uk.download.nvidia.com/Windows/197.13/197.13_Win7_WinVista_Desktop_Release_Notes.pdf) - and there were no bugfixes or OpenCL upgrades recorded in the release notes for either 260.99 or 266.58. Your testing was carried out - well, reported - on 17 Jan 2011. Please remind me whether that was before or after I discovered the "uninitialised VRAM causes inaccurate results" bug? I've forgotten when that was myself - I'll go back and look. Edit - it was 02 May 2012. That thread - AP 6.01 inconclusives/invalids - organising offline testing - started with: each time I provide new AP NV build: NV driver generate silent failure damaging GPU memory buffers and no OpenCL error message accompanies this It might be wise to re-test with a build dating from after the true cause was found?
Joined: 29 May 06 · Posts: 1037 · Credit: 8,440,339 · RAC: 0
Except I know it never worked on 260.99, it didn't three years ago when I tested the initial Nvidia Alpha build on my 9800GTX+, Raistmer confirmed it worked on 263.06, I confirmed it worked on 266.35 I'm not surprised, Nvidia and AMD rarely say anything about Cuda or OpenCL changes. I don't think Nvidia even mentioned the Cuda sleeping-monitor bug being fixed. Your testing was carried out - well, reported - on 17 Jan 2011. Please remind me whether that was before or after I discovered the "uninitialised VRAM causes inaccurate results" bug? I've forgotten when that was myself - I'll go back and look. I'll put 260.99 on my 9800GTX+ host next week, and do a bench of the stock NV AP app and the latest version too. Claggy
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
I'm wondering why this host is getting AP 6.04 units. In theory the cuda_opencl_100 plan class is checking for an NVIDIA driver of 197.13 or later. But I don't see any indications that this check ever fails for any CUDA applications. I guess I have more debugging to do.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
Right. If it is possible to do such a fast fix for a single project, it would be a good move.
Actually not. A coincidence, perhaps (in this particular file-descriptor example). It could be a mutex or anything else; it doesn't matter. The basic idea: the app will not call BOINC API functions when stuck like this. In general I agree that this issue should not delay the v7 release on main. Of course. But that doesn't mean the issue doesn't exist, and if it could at least be masked with an increased "tolerance time", even that would be good. Each aborted task resets the quota. Such a reset will in most cases lead to the allocation of tasks from a slower app... and this also means a performance drop. It's new; before, we had no such performance-degradation possibility. Think about that too. HD5 vs non-HD5 doesn't matter much, but cuda23 vs cuda22 means a x2 or even slightly greater performance drop... So, along with the v7 release, it would be good to think about this too. So it belongs in the "new scheduler features" thread, at least a little ;)
Joined: 14 Oct 05 · Posts: 1137 · Credit: 1,848,733 · RAC: 0
It's certainly possible, the question is how to evaluate the protective function against the tendency to kill tasks which would have finished OK. My gut feeling is that there is enough slop in the estimates and known issues like GPUs downclocking to justify some increase. Even for CPUs the trend is for systems to be very protective, and most participants will probably leave such protections in place. At least for the rollout of new applications, I'd be in favor of an increase of the bound to about 20X the estimate. Of course BOINC by default stops all crunching when there's any serious other system usage, so the typical non-enthusiast wouldn't run into extended runtime due to gaming, etc. Still, enthusiasm doesn't necessarily pair with sufficient knowledge to judge the risk/reward ratio of deviating from defaults. Joe
Joined: 16 Jun 05 · Posts: 2531 · Credit: 1,074,556 · RAC: 0
I agree, Joe. Especially for users with an AMD CPU and multiple GPUs. If no CPU core is freed on such a system and a highly blanked AstroPulse is running on one card, the next WU start on the other card will take ages because of the lack of CPU power. With each crime and every kindness we birth our future.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
And now my NV host gets 0 tasks on each request. Quotas are not reached; 0 tasks were allocated today for each of the 3 apps. The other 2 are not possible for this host due to driver constraints. Can this be the reason? Maybe BOINC attempts to get statistics on cuda42 and cuda50 and refuses to allocate tasks for the already well-known cuda22, 23 and 32?... 19/05/2013 14:26:56 SETI@home Beta Test [sched_op_debug] Starting scheduler request
Joined: 29 May 06 · Posts: 1037 · Credit: 8,440,339 · RAC: 0
And now my NV host gets 0 tasks on each request. Ask for CPU work: you'll find all the work received is VLARs, and VLARs aren't sent to GPUs, hence you get no GPU work: All tasks for computer 45274 Claggy
Joined: 3 Jan 07 · Posts: 1451 · Credit: 3,272,268 · RAC: 0
Ask for CPU work, you'll find all work received are VLARs, and VLARs aren't sent to GPUs, hence you get no GPU work: Agreed - All tasks for computer 62652. Being BOINC v6, that host can and will fetch CPU and GPU work independently, but has received no new (i.e. not resent) non-VLAR work for 36 hours now.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
Thanks. I have no intention of processing beta CPU tasks there, so please let me know when GPU tasks become available again. Perhaps the testing that belongs to this thread should be suspended until then.
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.