Blue Screen of Death occurring on unique tasks
Author | Message |
---|---|
Bill Joined: 30 Nov 05 Posts: 282 Credit: 6,916,194 RAC: 60 |
Forgive me for posting about this from the main Boinc forum, but this may be a unique Seti problem. In an attempt to help someone else with a problem, I coincidentally developed my own problem, starting here. A few weeks ago I noticed that my computer suffered a blue screen of death. The BSOD was a Video_TDR_Failure related to atikmpag.sys. Long story short, I would get the BSOD every time I started up Boinc and allowed the GPU to run; CPU crunching was just fine. I checked over the hardware, ran DDU, and upgraded and downgraded drivers, but nothing worked long term. Technically I got it working again after running DDU, but it only worked for about a day before suffering the same thing. However, I discovered that the BSOD was occurring only when two unique tasks were attempting to crunch, tasks 7735441575 and 7735333648. I have suspended those tasks, and I have been crunching on the GPU just fine for a few days now. I could just abort these tasks and move on, but before doing so I was wondering if anyone had any other thoughts. Specifically, is there something anyone can do to see if there is a problem with the task/workunit itself? I'm not sure if there is anything else to attempt to fix on my computer. Considering all other tasks appear to be crunching just fine, I don't know that I want to be spinning my wheels over nothing. Seti@home classic: 1,456 results, 1.613 years CPU time |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14680 Credit: 200,643,578 RAC: 874 |
I doubt there's anything really unique about the tasks - one BLC26, one Arecibo. Both tasks have a completed run from a wingmate, waiting for yours to return and validate - and both wingmates used GPUs. Of course, in one sense, every single workunit run by SETI is unique (that's the whole point of what we do), but if you worked out exactly what configuration setting or data point was the problem and changed it, it wouldn't be 'that task' any more, and it would probably fail to validate. Better to tackle the problem at source - thank you for including the bugcheck value. Here are two Microsoft articles: "Bug Check 0x116: VIDEO_TDR_ERROR" (background - for programmers only) and "Timeout Detection and Recovery (TDR) Registry Keys" (settings you can tweak to work round the problem). In a discussion from 2016, the developer's response was "I would recommend just to disable that damned watchdog in Windows registry" - i.e., set TdrLevel to 0. I would do that as a temporary workaround to clear those two tasks, and hold it in reserve in case the problem returns. But the problem suggests there may be some marginal hardware fault in that machine, which you could investigate later if you have access to any diagnostic tools. |
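The TdrLevel workaround described above can be prepared as a .reg file instead of being typed into regedit by hand. What follows is a minimal Python sketch, assuming TdrLevel is a REG_DWORD under the GraphicsDrivers key as described in the Microsoft TDR registry keys article. The function name `tdr_reg_file` is made up for illustration, and the generated text should be reviewed before it is imported.

```python
# Sketch: build the text of a .reg file that sets TdrLevel.
# Assumption: TdrLevel is a REG_DWORD under the GraphicsDrivers key,
# as described in Microsoft's TDR registry keys article.
# tdr_reg_file is a hypothetical helper name, not part of any tool.

GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)


def tdr_reg_file(level: int = 0) -> str:
    """Return .reg file text that sets TdrLevel (0 turns detection off)."""
    if level not in (0, 1, 2, 3):
        raise ValueError("TdrLevel must be 0, 1, 2 or 3")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{GRAPHICS_DRIVERS_KEY}]",
        f'"TdrLevel"=dword:{level:08x}',
        "",
    ]
    # .reg files conventionally use Windows line endings.
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(tdr_reg_file(0))
```

Saving the output as, say, `tdr_off.reg` and generating a second file with `tdr_reg_file(3)` gives a matching pair for switching the watchdog off and back to its default. As far as I know, TDR registry changes only take effect after a restart.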
Tom M Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
Bill wrote: "Forgive me for posting about this from the main Boinc forum, but this may be a unique Seti problem. In an attempt to help someone else with a problem, I coincidentally developed my own problem, starting here." I was having the TDR failure on my Ryzen 5 2400G until I stopped using a very aggressive -tt 1500 in the command line. The problem went away as soon as I dropped the parameter. Tom A proud member of the OFA (Old Farts Association). |
Kissagogo27 Joined: 6 Nov 99 Posts: 716 Credit: 8,032,827 RAC: 62 |
IMHO -tt 1500 is too low with period iteration set to 1. I can set it to 1800 for BLC, but I have to set it higher for Arecibo WUs. I'm now at -tt 5000 without problems on different hardware: a separate GPU, not a Ryzen :D |
Tom M Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
Kissagogo27 wrote: "IMHO -tt 1500 is too low with period iteration set to 1. I can set it to 1800 for BLC, but I have to set it higher for Arecibo WUs. I'm now at -tt 5000 without problems on different hardware: a separate GPU, not a Ryzen :D" I am confused. I thought that the number after -tt was the length of time that a GPU task could be dispatched before there would be a task switch. Since this meant more time crunching for each "time slice" that was started, it was "supposed" to run faster, and it usually does. So the -tt value you mention using is, as far as I know, out of bounds. I thought I read that -tt 1500 was the largest number you could use. "Supervisor Call": would somebody correct me? TY. Tom A proud member of the OFA (Old Farts Association). |
Bill Joined: 30 Nov 05 Posts: 282 Credit: 6,916,194 RAC: 60 |
Okay, I think I have solved the problem, at least partially. Richard Haselgrove wrote: "I doubt there's anything really unique about the tasks - one BLC26, one Arecibo. Both tasks have a completed run from a wingmate, waiting for yours to return and validate - and both wingmates used GPUs. Of course, in one sense, every single workunit run by SETI is unique (that's the whole point of what we do), but if you worked out exactly what configuration setting or data point was the problem and changed it, it wouldn't be 'that task' any more, and it would probably fail to validate." Thanks, Richard. I did see the first article before. I'm not sure what to do with that one at this point. I know it involves running debug tools for Windows, which I am sure I can do, but I don't think that I have the knowledge yet to figure out what to do with the information. I'll shelve that one for the time being. The second article, and the deep cut on the message board, did provide something interesting. I cannot find TdrLevel at HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers, nor anywhere else in my registry. Is that a sign of something else? I would assume that if the value is not present, the default setting is TdrLevelRecover? I tried adding the TdrLevel key to my registry, and it didn't fix the BSOD. Perhaps I didn't enter it correctly. No matter, because I think I found the actual problem. Tom M wrote: "I was having the TDR failure on my Ryzen 5 2400G until I stopped using a very aggressive -tt 1500 in the command line. Problem went right away since I dropped the parm." No, I don't have -tt in my command line, but that reminded me that I had entered command line text for the first time a few weeks ago. I hadn't even thought about the command line modifications. I just deleted all command line options from the ati files, and as I've been typing this the one task crunched to completion and the second is chugging along nicely. So, something in that command line I was using wasn't working right.
Perhaps I should go back and research what each of the options means instead of blindly adding them. I think I have it solved for now. Thanks, Tom, Richard, and Jord for all the help! Seti@home classic: 1,456 results, 1.613 years CPU time |
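On the question of a missing TdrLevel value: Microsoft's TDR registry keys article lists TdrLevelRecover (3) as the default, so an absent value should behave like 3. The Python sketch below encodes that rule. The helper names are hypothetical, the level names are taken from that article, and the `winreg` read only does anything on Windows.

```python
# Sketch: which TdrLevel is in effect when the value may be missing?
# Level names and the default follow Microsoft's TDR registry keys article;
# the function names here are illustrative only.
import sys

TDR_LEVELS = {
    0: "TdrLevelOff",
    1: "TdrLevelBugcheck",
    2: "TdrLevelRecoverVGA",
    3: "TdrLevelRecover",
}
DEFAULT_TDR_LEVEL = 3


def effective_tdr_level(stored):
    """Map a stored DWORD (or None when absent) to the level in effect."""
    level = DEFAULT_TDR_LEVEL if stored is None else stored
    if level not in TDR_LEVELS:
        raise ValueError(f"unexpected TdrLevel: {level}")
    return level, TDR_LEVELS[level]


def read_stored_tdr_level():
    """Read TdrLevel from the registry; None if absent or not on Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # standard library, Windows only

    path = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
            value, _value_type = winreg.QueryValueEx(key, "TdrLevel")
            return value
    except FileNotFoundError:
        # The key or the value does not exist, so the default applies.
        return None


if __name__ == "__main__":
    print(effective_tdr_level(read_stored_tdr_level()))
```

Running it before and after editing the registry is a quick way to confirm whether the value was actually entered where Windows looks for it.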
Kissagogo27 Joined: 6 Nov 99 Posts: 716 Credit: 8,032,827 RAC: 62 |
Kissagogo27 wrote: "IMHO -tt 1500 is too low with period iteration set to 1. I can set it to 1800 for BLC, but I have to set it higher for Arecibo WUs. I'm now at -tt 5000 without problems on different hardware: a separate GPU, not a Ryzen :D" It depends on the GPU used. The kernel time depends on the GPU speed and the number of compute units, so a slower GPU takes more time to complete an FFT sequence. If I set only 1500, not all the FFTs are done in just 3 passes, as they are here in my results.
The longest ones are the first lines here, FFT lengths 32, 64 and 128; with my -tt 5000 I can speed these up. If I set a lower -tt, FFT lengths 32, 64 and perhaps 128 take more than 3 passes to complete, and you will find further lines for passes 4 and 5. Looking at your results with the Vega 11 GPU integrated into your Ryzen CPU, we can see that:
Fftlength=32,pass=3:Tune: sum=90392.8(ms); min=7.523(ms); max=269.3(ms); mean=140.6(ms); s_mean=86.47; sleep=75(ms); delta=451; N=643; usual
Not all are done in 3 passes; some are done in 4 passes and others in 5, so this is not optimized.
Fftlength=64,pass=3:Tune: sum=77912.6(ms); min=2.961(ms); max=108.7(ms); mean=62.33(ms); s_mean=55.73; sleep=45(ms); delta=347; N=1250; usual
Same behavior here.
Fftlength=128,pass=3:Tune: sum=84120.4(ms); min=1.53(ms); max=97.76(ms); mean=59.37(ms); s_mean=62.68; sleep=60(ms); delta=343; N=1417; usual
Here too.
Fftlength=2048,pass=3:Tune: sum=235284(ms); min=15.18(ms); max=586.9(ms); mean=43.07(ms); s_mean=50.13; sleep=45(ms); delta=1; N=5463; high_perf
But these are optimized with the default -tt 60 and done in only 3 passes :D That's how it works ;) |
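Reading these Tune lines by eye gets tedious when comparing several -tt settings, so here is a small Python sketch that parses them into dictionaries. The line format is inferred only from the four lines quoted above, not from any documentation of the application, and the function names are hypothetical. The "optimized" test simply follows the reading given in this thread, where `high_perf` lines are treated as tuned and `usual` lines as not. In all four quoted lines, mean equals sum divided by N to the displayed precision, which the sketch uses as a sanity check.

```python
# Sketch: parse "Tune" lines from a task's stderr output.
# The format is inferred from the four lines quoted in this thread only.
import re

TUNE_RE = re.compile(r"Fftlength=(\d+),pass=(\d+):Tune:\s*(.*);\s*(\w+)\s*$")
FIELD_RE = re.compile(r"(\w+)=([-+0-9.eE]+)")


def parse_tune_line(line):
    """Return a dict of the fields in one Tune line, or None if no match."""
    match = TUNE_RE.search(line.strip())
    if match is None:
        return None
    record = {
        "fftlength": int(match.group(1)),
        "pass": int(match.group(2)),
        "mode": match.group(4),  # e.g. "usual" or "high_perf"
    }
    # Numeric fields look like "sum=90392.8(ms)"; the unit is dropped.
    for key, value in FIELD_RE.findall(match.group(3)):
        record[key] = float(value)
    return record


def looks_optimized(record):
    """Thread's reading: high_perf lines are tuned, usual lines are not."""
    return record["mode"] == "high_perf"


def mean_is_consistent(record, tolerance=0.05):
    """Check that mean is close to sum / N, as in the quoted lines."""
    return abs(record["sum"] / record["N"] - record["mean"]) <= tolerance


if __name__ == "__main__":
    sample = (
        "Fftlength=2048,pass=3:Tune: sum=235284(ms); min=15.18(ms); "
        "max=586.9(ms); mean=43.07(ms); s_mean=50.13; sleep=45(ms); "
        "delta=1; N=5463; high_perf"
    )
    print(parse_tune_line(sample))
```

Feeding every line of a stderr file through `parse_tune_line` and counting the records for which `looks_optimized` is true gives one number to compare between runs with different settings.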
Kissagogo27 Joined: 6 Nov 99 Posts: 716 Credit: 8,032,827 RAC: 62 |
If you want to stay at -tt 1500, you have to set -period iteration to more than 1. |
Tom M Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
Kissagogo27 wrote: "If you want to stay at -tt 1500, you have to set -period iteration to more than 1." I am still a bit bewildered. However, if you would take a shot at "the best" combo of -period_iteration and -tt for my 2400G, I will try them out. As usual, I am trying to lower my wall clock time. The CPU time listed is way below the wall clock time, so it is the wall clock time I am interested in minimizing. If you want to be really fancy (and even more helpful), take a shot at the "best" command line for a Ryzen 5 2400G. In any case, thanks for explaining yet something else I didn't get (and probably still don't really get) about the reports we see on Seti. Tom A proud member of the OFA (Old Farts Association). |
Kissagogo27 Joined: 6 Nov 99 Posts: 716 Credit: 8,032,827 RAC: 62 |
Some mysterious explanations are here, from the Lunatics forum ;) After that, you have to try different -tt values to get the maximum number of high_perf results at the end of stderr.txt. My first step was to understand the explanation with the help of Google Translate, and then to try different parameters with period_iteration=1. You will get a lot of lag during the first minutes of a WU ;) |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Good analysis and observations. The link to Raistmer's document at Lunatics explains it all. It will take a few readings to comprehend the interplay between all the parameters. The goal is to reduce the number of passes, reaching delta=1 and the highest N with the means at a minimum. That is easy to achieve with high-powered hardware but a lot harder with lesser hardware like an APU or iGPU. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |