Message boards : News : Tests of new scheduler features
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
Is it possible to increase the amount of time BOINC waits before aborting a task with 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED? I think the project loses more computing power to this "protective" feature than to the stuck tasks themselves. It's especially relevant for GPU tasks, where the total execution time is short but the computational environment can change very strongly. Example: http://setiweb.ssl.berkeley.edu/beta/result.php?resultid=13913803 was aborted after ~1h of execution. This host has a well-established running time now, but sometimes tasks take longer. Reason: a game was running at the time. There were no slowdowns in the game itself, but GPU computing was obviously slowed down. This task had every chance of being completed in time; there was no deadline pressure at all. Moreover, it would have regained its usual speed right after the game ended... but instead it was aborted by BOINC. That's not "protective", it's abuse!
Joined: 29 May 06 · Posts: 1037 · Credit: 8,440,339 · RAC: 0
I'm wondering why this host is getting AP 6.04 units. The first driver that the NV OpenCL app produces good work on is 263.xx; 260.99 isn't good enough. I did have a test-case host (on the Main project) that was producing inconclusive work, but they've upgraded the driver recently. Edit: and it's still producing inconclusive/invalid work, probably because it did its compilations under 260.99: All AstroPulse v6 tasks for computer 4819000 Claggy
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
So, time to check app_plan driver limits, right?
Joined: 3 Jan 07 · Posts: 1451 · Credit: 3,272,268 · RAC: 0
So, time to check app_plan driver limits, right? We have a saying in the English political-legal system: "Hard cases make bad law". In other words - don't rush to change the law just because one special case had difficulty slotting into the current framework. Claggy found one case, of one host, run by a volunteer who has never posted on the message boards. He assumed that the bad results were caused by 'driver version too low': I'd think that the subsequent continuing errors under a newer driver raise doubts about that assumption. Either way, I suggest that changing the driver limits on the basis of that one case, with no corroborating evidence, is contrary to the "scientific method". But by all means check, re-check, and propose an alternative based on thorough research (with citations). Similarly, one case of a task which over-ran its time limit because the user forgot to set the 'exclusive app' tag to pause it while gaming? Again, I'm unconvinced that one case is enough to justify a change in project policy. 'EXIT_TIME_LIMIT_EXCEEDED' is triggered when a task exceeds its <rsc_fpops_bound> limit, which for this project (both Main and Beta) has been set at 10x <rsc_fpops_est> since time immemorial. It would be trivial to re-set the workunit generators (splitters, in local parlance) to use a 20x, or 100x, or any other time limit: but that would impinge on all other over-running tasks, too. What about the case when an application - according to BOINC - is active, but in reality has stalled, using no CPU time and making zero progress? That task's exit will be delayed, too, resulting in longer pendings for wingmates, tasks lingering longer in the BOINC database, data files staying longer in the server fanout before being deleted, etc. Again, think through all the consequences throughout the interconnected client-server system before proposing changes.
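The arithmetic behind the bound described above can be sketched as follows. This is a hedged illustration only: the function, the numbers, and the way the limit is derived from device speed are assumptions for the sake of the example, not BOINC's actual client code.

```python
# Illustrative sketch: how a flops-based bound can turn into an
# elapsed-time limit that a temporarily slowed GPU task trips over.
# All numbers and names here are hypothetical.

def elapsed_time_limit(rsc_fpops_bound, device_flops):
    """Maximum allowed elapsed seconds: the bound on floating-point
    operations divided by the speed assumed for the device."""
    return rsc_fpops_bound / device_flops

rsc_fpops_est = 2e13                    # estimated ops for the task
rsc_fpops_bound = 10 * rsc_fpops_est    # the 10x multiplier discussed above
gpu_flops = 1e11                        # assumed device speed (100 GFLOPS)

limit_s = elapsed_time_limit(rsc_fpops_bound, gpu_flops)  # 2000 s

# If a game throttles the science app to 5% of the assumed speed,
# the task needs 20x the estimated time and exceeds the 10x bound,
# even though it is still making (slow) progress.
actual_time_s = rsc_fpops_est / (0.05 * gpu_flops)        # 4000 s
would_be_aborted = actual_time_s > limit_s                # True
```

Raising the multiplier to 20x in this sketch would double `limit_s`, which illustrates Richard's point: it delays every abort, including those of genuinely stalled tasks.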
Joined: 29 May 06 · Posts: 1037 · Credit: 8,440,339 · RAC: 0
So, time to check app_plan driver limits, right? Except I know it never worked on 260.99: it didn't three years ago when I tested the initial Nvidia Alpha build on my 9800GTX+. Raistmer confirmed it worked on 263.06; I confirmed it worked on 266.35 (r505 OpenCL AstroPulse for ATI GPUs, alpha for NV GPUs). Raistmer even posted: "@Claggy We never tried to fix the app for 260.99, and admittedly we haven't gone and tested the latest app with that driver. I see no reason why it would work now." Claggy
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
What about the case when an application - according to BOINC - is active, but in reality has stalled, using no CPU time and making zero progress? That is the keystone of the issue: BOINC should detect when no progress is being made. With slowed processing there will still be progress, just slow. With a stuck app the progress will be zero - not a small number, but virtually zero. And elapsed time is a bad measure of progress.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
I would generalize the issue this way: there are two time scales. There is the computer time scale, which scales with the PC's performance, and the real-world time scale, which never scales (until humans evolve into some other species). When a human interacts with a PC (running an app on it, for example), he/she always interacts on the real-world time scale. When he/she notices some slowdowns or other issues and fixes them, he/she again acts on the real-world time scale - no matter how fast or slow the particular PC is. But all flops-based time scaling is on the computer time scale, not the real-world time scale. Hence, when these scales differ too much (the PC is too fast), we see issues such as downloading thousands of tasks before the user can stop it, or aborting tasks just because the user spent an hour of rest playing a game, and so on. BOINC should take the real-world time scale into account too. It already does so where some timeouts are concerned (as in network communications); it should take the real-world time scale into account in this part as well. Also, there were "heartbeats" before. Are they gone completely? The time since the last checkpoint could be used instead, for example... And it would be good if this option were real-world-time based and configurable. For example: abort a task if it hasn't checkpointed for more than 1 hour (configurable by the user), with some adequate default. Or, maybe even better, a configurable multiplier. With a multiplier we could account for the different task lengths of different projects.
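The configurable real-world-time option proposed above could look roughly like the sketch below. Everything here (the class name, the default, the multiplier policy) is hypothetical; this was not an existing BOINC feature, just an illustration of the proposal.

```python
# Hypothetical sketch of the proposal above: abort a task only when it
# has not checkpointed within a configurable real-world interval.
# A merely slow app still checkpoints and is never flagged as stuck.
import time

class CheckpointWatchdog:
    def __init__(self, base_timeout_s=3600.0, multiplier=1.0):
        # base_timeout_s: user-configurable "no checkpoint" limit (default 1 hour)
        # multiplier: scales the limit for projects with longer tasks
        self.timeout_s = base_timeout_s * multiplier
        self.last_checkpoint = time.monotonic()

    def on_checkpoint(self):
        # Called whenever the science app writes a checkpoint.
        self.last_checkpoint = time.monotonic()

    def is_stuck(self, now=None):
        # True only when no checkpoint arrived within the window.
        now = time.monotonic() if now is None else now
        return (now - self.last_checkpoint) > self.timeout_s

wd = CheckpointWatchdog(base_timeout_s=3600.0, multiplier=2.0)
wd.on_checkpoint()
slow_but_alive = wd.is_stuck(now=wd.last_checkpoint + 600)   # False: checkpointed recently
truly_stuck = wd.is_stuck(now=wd.last_checkpoint + 8000)     # True: silent past the window
```

The multiplier makes the same watchdog usable for both minutes-long GPU tasks and day-long CPU tasks, which is the point Raistmer raises about different projects.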
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
Similarly, one case of a task which over-ran its time limit because the user forgot to set the 'exclusive app' tag to pause it while gaming? Again, I'm unconvinced that one case is enough to justify a change in project policy. On the EXIT_TIME_LIMIT_EXCEEDED issue I would disagree. 'Exclusive app' is a band-aid in this case. There is an infinite number of programs whose running will slow down the GPU. For example, this particular game ran for the first time on that host. It's not possible to remember to add each and every app that might consume the GPU to the exclusive-applications list. Moreover, even for this particular app it would be a workaround! The game was not slowed down at all. The app did not fail and did not produce an incorrect result. So what is "exclusive app" for? Just to mask BOINC's failure to discriminate "running slower" from "stuck"? I would strongly disagree that this is the right approach. There is a design flaw here, in the very approach to the issue. I agree that x20 or x100 will not SOLVE the issue; it will just delay it until new, even faster GPUs arrive. For CPU tasks we have real-world time and CPU time, and BOINC will not kill a task if it takes too much real-world time (only if the deadline has passed) while using the same CPU time. Here the app takes the same GPU time, but BOINC simply isn't able to determine that the app is using a fraction of the GPU, while it is able (via system calls) to determine that an app is using a fraction of the CPU. So an approach that is _maybe_ viable for CPU tasks is definitely not viable for GPU tasks, where a x100 slowdown due to GPU load is absolutely normal and can't be used as an indication of a stuck app. Hence, the use of the wrong method leads to bad results: processing time is lost, not saved, by this "precaution measure". And just as a current workaround, until a more sophisticated approach is devised, I still propose to increase the limit. x10 is too low for GPUs, even for a current midrange GPU. With tasks taking a few minutes of real-world time, even one hour of heavy GPU usage will result in unneeded task abortions. My example just demonstrated that.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
One more proposal: let's consider what measures we have now against each possible reason for a "stuck" state. 1) Deadline: the ultimate measure against a host that fails to report results back, for whatever reason, including a flooded home with the PC in it, a nuke strike, and so on and so forth. In general, the deadline is a measure against an unknown reason for a task being stuck. 2) EXIT_TIME_LIMIT_EXCEEDED. And here things become more interesting. If we already have a measure against unknown reasons, why have another measure against an "unknown" reason? Not needed. So there should be some "known reasons" that allow an optimization: aborting a stuck (known to be "stuck"!) task BEFORE it reaches the deadline. And we should take care that this new feature does not become a de-optimization. Killing a partially processed task without real need is an example of such de-optimization. Even a past-deadline, partially processed task is not killed immediately (or never killed?). So for what reasons can the EXIT_TIME_LIMIT_EXCEEDED "optimization" be successfully used? a) an unlimited wait for some resource (like a file descriptor, and so on); b) a GPU driver crash (actually, most probably that is case a) too, but with the waiting in the runtime part, not in the app's own part). In both these cases the app will stop calling the BOINC API (for example, the time-to-checkpoint request). So this fact can be used for stuck detection. Note that a slowed app will continue to call the BOINC API. What other reasons for a stuck state can one foresee?
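The slow-versus-stuck distinction described above can be sketched like this. The monitor below is purely illustrative (the class and method names are made up, and real API instrumentation would live inside the BOINC runtime); the only point it encodes is that a slow app keeps calling the API while a blocked one goes silent.

```python
# Illustrative sketch, not BOINC code: classify a task as "alive" or
# "stuck" by when it last called any BOINC API function. A task slowed
# by GPU contention still calls the API; one blocked forever on a file
# descriptor, mutex, or crashed driver stops calling it entirely.

class ApiLivenessMonitor:
    def __init__(self, silence_limit_s=1800.0):
        self.silence_limit_s = silence_limit_s
        self.last_api_call_s = 0.0

    def record_api_call(self, t_s):
        # Would be invoked by every BOINC API entry point
        # (checkpoint query, fraction-done report, etc.).
        self.last_api_call_s = t_s

    def classify(self, now_s):
        # "stuck": no API traffic for longer than the silence limit.
        # "alive": still calling the API, however slowly it computes.
        if now_s - self.last_api_call_s > self.silence_limit_s:
            return "stuck"
        return "alive"

mon = ApiLivenessMonitor(silence_limit_s=1800.0)
mon.record_api_call(t_s=100.0)
state_slow = mon.classify(now_s=1000.0)   # slowed by a game, but alive
state_hung = mon.classify(now_s=5000.0)   # silent too long: stuck
```

Unlike an elapsed-time bound, this test is insensitive to how fast the host happens to be computing, which is exactly the property the proposal asks for.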
Joined: 3 Jan 07 · Posts: 1451 · Credit: 3,272,268 · RAC: 0
Right, now we're beginning to think. But before we do, remember the question that started this discussion: Is it possible to increase amount of time BOINC wait before issue task abort due to 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED ? Who was it directed to? Eric, I presume - since you asked it in this thread, on this board. The lever Eric has at his disposal to pull - right now, with minimal programming - is the ratio between <rsc_fpops_est> and <rsc_fpops_bound>. I can't think of another one - can you? Otherwise, any solution is likely to involve substantial changes to at least two out of three from the BOINC server code, BOINC client code, and the BOINC application API code. Are we going to delay the v7 roll-across to the Main project while all that is done? If so, tell David that, rather than Eric. On this message board, shouldn't we concentrate on getting all the new apps tested, documented, and certified for deployment? BTW, on the subject of your 'stall' case (b) "waiting for file descriptor", I presume you've read the patch from Steffen Möller this lunchtime? It needs evaluation and testing before rushing into deployment, of course, but surely eliminating the file descriptor leak at source will be the quickest and most effective way of dealing with this case? Otherwise, I'd suggest that we simply allow the fundamental fault-tolerant design of BOINC to cope with the 197 exits - after all, since they only occur when a task takes 10x (or whatever) the estimated time, by definition you don't burn through a large number of tasks that way. Then, once deployment is complete, would be the time to re-schedule your gaming hours into a re-write of BOINC, as you've suggested on another message board :P ;)
Joined: 3 Jan 07 · Posts: 1451 · Credit: 3,272,268 · RAC: 0
Except I know it never worked on 260.99, it didn't three years ago when I tested the initial Nvidia Alpha build on my 9800GTX+, Raistmer confirmed it worked on 263.06, I confirmed it worked on 266.35 Well, I've looked back through the driver release notes (WHQL only) for your system (9800GTX+ on Vista x64), and found: OpenCL Support (found in http://uk.download.nvidia.com/Windows/197.13/197.13_Win7_WinVista_Desktop_Release_Notes.pdf) - and there were no bugfixes or OpenCL upgrades recorded in the release notes for either 260.99 or 266.58. Your testing was carried out - well, reported - on 17 Jan 2011. Please remind me whether that was before or after I discovered the "uninitialised VRAM causes inaccurate results" bug? I've forgotten when that was myself - I'll go back and look. Edit - it was 02 May 2012. That thread - AP 6.01 inconclusives/invalids - organising offline testing - started with: each time I provide new AP NV build: NV driver generate silent failure damaging GPU memory buffers and no OpenCL error message accompanies this It might be wise to re-test with a build dating from after the true cause was found?
Joined: 29 May 06 · Posts: 1037 · Credit: 8,440,339 · RAC: 0
Except I know it never worked on 260.99, it didn't three years ago when I tested the initial Nvidia Alpha build on my 9800GTX+, Raistmer confirmed it worked on 263.06, I confirmed it worked on 266.35 I'm not surprised, Nvidia and AMD rarely say anything about Cuda or OpenCL changes. I don't think Nvidia even mentioned the Cuda sleeping-monitor bug being fixed. Your testing was carried out - well, reported - on 17 Jan 2011. Please remind me whether that was before or after I discovered the "uninitialised VRAM causes inaccurate results" bug? I've forgotten when that was myself - I'll go back and look. I'll put 260.99 on my 9800GTX+ host next week, and do a bench of the stock NV AP app and the latest version too. Claggy
Joined: 15 Mar 05 · Posts: 1547 · Credit: 27,183,456 · RAC: 0
I'm wondering why this host is getting AP 6.04 units. In theory the cuda_opencl_100 plan class is checking for an NVIDIA driver of 197.13 or later. But I don't see any indications that this check ever fails for any CUDA applications. I guess I have more debugging to do.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
Right. If it is possible to do such a fast fix for a single project, it would be a good move.
Actually not. A coincidence, perhaps (in this particular file-descriptor example). It could be a mutex or anything else; it doesn't matter. The basic idea: the app will not call BOINC API functions when stuck like this. In general I agree that this issue should not delay the v7 release on main. Of course. But that doesn't mean the issue doesn't exist, and if it could at least be masked with an increased "tolerance time", even that would be good. Each aborted task resets the quota. Such a reset will in most cases lead to the allocation of tasks from a slower app... and this also means a performance drop. It's new; before, we had no such performance-degradation possibility. Think about that too. HD5 vs non-HD5 doesn't matter much, but cuda23 vs cuda22 means a x2 or even slightly greater performance drop... So, along with the v7 release, it would be good to think about this too. So it belongs in the "new scheduler features" thread, at least a little ;)
Joined: 14 Oct 05 · Posts: 1137 · Credit: 1,848,733 · RAC: 0
It's certainly possible, the question is how to evaluate the protective function against the tendency to kill tasks which would have finished OK. My gut feeling is that there is enough slop in the estimates and known issues like GPUs downclocking to justify some increase. Even for CPUs the trend is for systems to be very protective, and most participants will probably leave such protections in place. At least for the rollout of new applications, I'd be in favor of an increase of the bound to about 20X the estimate. Of course BOINC by default stops all crunching when there's any serious other system usage, so the typical non-enthusiast wouldn't run into extended runtime due to gaming, etc. Still, enthusiasm doesn't necessarily pair with sufficient knowledge to judge the risk/reward ratio of deviating from defaults. Joe
Joined: 16 Jun 05 · Posts: 2531 · Credit: 1,074,556 · RAC: 0
I agree, Joe. Especially for users with an AMD CPU and multiple GPUs. If no CPU core is freed on such a system and a highly blanked AstroPulse is running on one card, the next WU start on the other card will take ages because of the lack of CPU power. With each crime and every kindness we birth our future.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
And now my NV host gets 0 tasks on each request. Quotas are not reached; 0 tasks were allocated today for each of the 3 apps. The other 2 are not possible for this host due to driver constraints. Can this be the reason? Maybe BOINC attempts to get statistics on cuda42 and cuda50 and refuses to allocate tasks for the already well-known cuda22, 23 and 32?... 19/05/2013 14:26:56 SETI@home Beta Test [sched_op_debug] Starting scheduler request
Joined: 29 May 06 · Posts: 1037 · Credit: 8,440,339 · RAC: 0
And now my NV host gets 0 tasks on each request. Ask for CPU work: you'll find all the work received is VLARs, and VLARs aren't sent to GPUs, hence you get no GPU work: All tasks for computer 45274 Claggy
Joined: 3 Jan 07 · Posts: 1451 · Credit: 3,272,268 · RAC: 0
Ask for CPU work, you'll find all work received are VLARs, and VLARs aren't sent to GPUs, hence you get no GPU work: Agreed - All tasks for computer 62652. Being BOINC v6, that host can and will fetch CPU and GPU work independently, but has received no new (i.e. not resent) non-VLAR work for 36 hours now.
Joined: 18 Aug 05 · Posts: 2423 · Credit: 15,878,738 · RAC: 0
Thanks. I have no intention of processing beta CPU tasks there, so please let me know when GPU tasks become available again. Perhaps the testing that belongs to this thread should be suspended until then.
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.