@Pre-FERMI nVidia GPU users: Important warning
Wiggo · Joined: 24 Jan 00 · Posts: 34896 · Credit: 261,360,520 · RAC: 489
Doctor's prescription, take at least 2 beers and chill out before you give yourself a heart attack or stroke. ;-) Cheers.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> Doctor's prescription, take at least 2 beers and chill out before you give yourself a heart attack or stroke. ;-)
FWIW, numbers sometimes help too. The current top host's Astropulse v6 inconclusive-to-pending ratio (a holistic indicator of host, app and project health) is currently ~4.9%, which is about twice as good or better than it used to be (well over ~10%). I'd have to guess that this apparently low impact is partly because a lower proportion of pre-Fermis tend to run AP anyway, along with other factors like app improvements and better floating point support in the newer remaining cases. Not forgetting that pre-Fermi throughput is a lot lower to start with.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions.
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3
> Attention, if you have a nVidia card that's 4 years old or older, and have updated to Driver 340.xx
People usually do know what kind of card they have, or can look that up. So then you point out to go to https://developer.nvidia.com/cuda-gpus. Any GPU with compute capability 1.0, 1.1, 1.2 or 1.3 will be affected.
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
> Doctor's prescription, take at least 2 beers and chill out before you give yourself a heart attack or stroke. ;-)
I suppose I don't need to remind you of what the science community thinks about ignoring known faulty data. I just looked at some of my inconclusives, and this one stands out: Validation inconclusive (105). That's one host; here's another. The driver was just released a few weeks ago, and the numbers are multiplying by the day...
HAL9000 · Joined: 11 Sep 99 · Posts: 6534 · Credit: 196,805,888 · RAC: 57
> Attention, if you have a nVidia card that's 4 years old or older, and have updated to Driver 340.xx
More importantly, most users never even look at the message boards.
SETI@home classic workunits: 93,865 · CPU time: 863,447 hours · Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3
> More importantly, most users never even look at the message boards.
Maybe they don't read these boards, and maybe they won't ask for help here, but I have found people asking for help in the weirdest of locations. You just need some info out there; once Google/Bing has picked it up, others in those weird locations will be able to find this info and help those people out. The info I just posted, I posted on the first of September on the BOINC forums. Goes to show that people here don't look there either, or make use of a search engine to look up information. Of course, all you then need to know is what CUDA version we're talking about, in this case 6.5. So go on, fill in "CUDA 6.5 BOINC" without quotes in your preferred search engine... Even Goodsearch has the thread as the 3rd possibility. The only thing that may throw you is that it's a thread about Mac OS X and CUDA 6.5, but just consider that Nvidia's drivers are essentially the same for whichever operating system is out there.
Bernie Vine · Joined: 26 May 99 · Posts: 9954 · Credit: 103,452,613 · RAC: 328
There have been complaints of personal attacks and off-topic posts here. To hide all the posts concerned would rob the thread of useful info, as the posts have been "quoted" a lot. Please can I ask for a bit more control: this is Number Crunching, not Politics, and this is a serious topic. Thanks.
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
@TBar First of all, thanks for attempting to attract attention to this issue. And I can assure you that this issue is not ignored by the "scientific community"; just keep in mind that donating computation resources to the project does not automatically imply having the right training and views regarding how that donated time should be used and how results should be verified (so no need to take the reaction to this issue from some other participants, who express exclusively their own point of view, too close to heart). FYI, the issue has been reported to nVidia's bug tracking system, a test case has been supplied, and the issue has been confirmed/reproduced by nVidia specialists. A fix is in progress, I hope.
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
> If the task doesn't have any single pulses it will validate. I've even seen cases where the WingPerson found 1 single pulse and the affected card that didn't find the single pulse still validated.
I was just looking at some of the Valid AP tasks for host 7339909, which you referenced in Message 1579054. I do see several where its single pulse count of 0 validated against other non-zero single pulse counts. In one case, the other hosts actually found 9. Fortunately, what I've also seen is that the canonical result always seems to go to one of the hosts with the non-zero count, even if 7339909's was the _0 task. So, even if the offending host does get credit, its results aren't actually getting into the science database. On the other hand, I would expect that there are cases where two of the old-card, new-driver hosts validate against each other (much like those ATI hosts with the 30 Autocorr overflows that irritate me), and their result will end up in the science database without even an opportunity for another host to crunch the WU and possibly report a non-zero result.
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
A feature added to the AP Validator in late 2003 makes it ignore single pulses which aren't at least 1% above threshold (THRESHOLD_FUDGE is set to 1.01). That was obviously an attempt to avoid the old problem of a signal very close to threshold being reported by one host but not the wingmate, even though the calculations produced very nearly equal values. The implementation simply ignores those lower signals completely, however, so in effect it moved the critical-level problem server-side. In relation to the failure to find/report any single pulses being discussed here, it means that for some of the good results with single pulses reported there wouldn't be any single pulse comparisons anyhow, so there is no way to eliminate the results with no single pulses. That's an additional way that the faulty results may become canonical.
I worked up an improvement to that logic which provides a one-way comparison if only one of the signal reports is above the fudge level, and reduces that level to 1.001 (just enough to match the allowed tolerance for peak_power). Eric intends to try it out at Beta during the upcoming work week. He also has in mind a statistical method of checking for significant differences between reported signals; that would be a further improvement if/when he actually has time to code it. The shoestring budget tends to delay such.
Meanwhile, recent Lunatics builds (including some being used as stock) provide enough detail about the reported signals to usually judge which single pulses are included in validation and which aren't. Here's a little table I made for my own quick reference when looking at the peak_power of single pulses in those stderr sections:

             Single pulse   fudge            stderr
             threshold      (thresh*1.01)    critical
             ------------   --------------   --------
   scale=0   29.107864      29.39894264      29.4
   scale=1   31.568300      31.88398300      31.88
   scale=2   37.925224      38.30447624      38.3
   scale=3   49.470177      49.96487877      49.96
   scale=4   61.368561      61.98224661      61.98
   scale=5   86.990051      87.85995151      87.86
   scale=6   128.857971     130.14655071     130.1
   scale=7   212.633087     214.75941787     214.8
   scale=8   361.960083     365.57968383     365.6
   scale=9   648.259888     654.74248688     654.7

For now, those peak_power values in stderr are rounded to 4 significant digits, which leaves uncertainty when the value is at the level shown in the "stderr critical" column. Raistmer has checked in changes so future Lunatics builds will show those more precisely.
Joe
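[To make the fudge rule above concrete, here is a rough sketch in C; THRESHOLD_FUDGE matches Joe's description, but the function name is invented and this is not the actual AP validator source.]

    #define THRESHOLD_FUDGE 1.01   /* Joe's patch would lower this to 1.001 */

    /* Current behaviour: a reported single pulse only takes part in
       validation if its peak power clears the fudged threshold; weaker
       pulses are ignored outright on both sides, so two results can
       validate with no single pulse comparison happening at all. */
    static int pulse_is_comparable(double peak_power, double threshold)
    {
        return peak_power > threshold * THRESHOLD_FUDGE;
    }

[His improvement would additionally do a one-way check when only one wingmate's pulse clears the lowered level, rather than dropping the comparison entirely.]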
Wiggo · Joined: 24 Jan 00 · Posts: 34896 · Credit: 261,360,520 · RAC: 489
You also can't say "400 series" cards are safe either, as I know of at least 1 card in that series that is still based on the older GTxxx core: the GT405, which is found in a lot of cheap slimline mATX OEM PCs from that era. ;-) Cheers.
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
> If the task doesn't have any single pulses it will validate. I've even seen cases where the WingPerson found 1 single pulse and the affected card that didn't find the single pulse still validated.
It appears host 7339909 has dropped back to driver 337.88. Now his cards are once again finding single pulses: http://setiathome.berkeley.edu/result.php?resultid=3755432810. One down, how many to go? I'm surprised there still isn't a sticky post warning nVidia owners about driver 340.xx. If people are not informed about the problem, they won't have any reason not to update their driver. Seems a sticky post would be the least SETI could do...
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
Unfortunately, this issue can be completely solved only by totally banning the 340.52 driver, for all types of NV GPUs. This has to be done because BOINC can't differentiate between FERMI and non-FERMI GPUs client-side once a task has been received. Hence hosts with mixed NV GPUs can receive a task under the FERMI plan class and then produce an invalid result on a pre-FERMI GPU. Sad, but it's a BOINC design flaw.
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
FYI, nVidia refused to fix this issue. Draw your own conclusions about the future of OpenCL on this platform, and about the future of pre-FERMI cards as a whole... EDIT: Please make this thread sticky, because it's the only way to help anonymous platform users with this issue.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
Not sure this will help in this situation, but the way I've handled pre-Fermis being unsupported in Cuda 6.5 altogether is to skip those with compute capability < 2.0 (Fermi) in device enumeration; then the project can later restrict the device minimum at leisure. Since you use a dedicated plan class for NVOpenCL, you can link in a Cuda driver-api call & compare the compute capability & driver version. 'Non-ideal', as is the only slightly less complicated Cuda situation with such devices, but probably better than relying on users not to update to the transitional broken driver, or waiting to figure out a more ideal solution once the full picture is clearer.
#if CUDART_VERSION >= 6050
Having no usable device at all would fall through multiple temporary exit retries (further in the surrounding logic), and eventually hard error when Boinc decides enough is enough (fingers crossed anyway). It's done that complex way for Cuda initialisation because devices may disappear and come back, depending on user switching, so the interaction with temporary exits is complex.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions.
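[The #if CUDART_VERSION >= 6050 fragment above lost its body in extraction. A minimal sketch of the enumeration skip jason_gee describes, using the CUDA runtime API, might look like the following; the helper name and logging are assumptions, not his actual code.]

    #include <cstdio>
    #include <cuda_runtime.h>

    // Count devices worth using, skipping pre-Fermi GPUs (compute
    // capability < 2.0). A hypothetical helper, not the real app code.
    static int count_usable_devices() {
        int n = 0, usable = 0;
        if (cudaGetDeviceCount(&n) != cudaSuccess) return 0;
        for (int i = 0; i < n; ++i) {
            cudaDeviceProp prop;
            if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
            if (prop.major < 2) {
                // Pre-Fermi: unsupported in Cuda 6.5, leave it out of enumeration.
                fprintf(stderr, "Skipping device %d (%s, CC %d.%d)\n",
                        i, prop.name, prop.major, prop.minor);
                continue;
            }
            ++usable;
        }
        return usable;
    }

[A zero return would then feed the temporary-exit retry logic he mentions, rather than erroring out immediately.]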
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
Agreed, I will block all such devices from the next release (it can perhaps be done via the OpenCL runtime itself). But because BOINC now assigns the execution device, just not enumerating them internally seems not enough. boinc_temporary_exit with a user notification about the reason for the exit would perhaps be safer. Or the logic can be more complex: don't enumerate 1.x devices in case a 2.0 one is present (there is something to run on, though overcommitted), and exit cleanly in case only 1.x are available (nothing good to run on at all); see the sketch below.
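[A minimal sketch of that check, assuming NVIDIA's cl_nv_device_attribute_query OpenCL extension; boinc_temporary_exit() is the real BOINC API call Raistmer names, but the helper names and the 300-second delay are illustrative.]

    #include <CL/cl.h>
    #include "boinc_api.h"   // boinc_temporary_exit()

    #ifndef CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV
    #define CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV 0x4000  // cl_nv_device_attribute_query
    #endif

    // Hypothetical helper: true if the NV device is Fermi (CC >= 2.0) or newer.
    static bool device_is_fermi_or_later(cl_device_id dev) {
        cl_uint major = 0;
        if (clGetDeviceInfo(dev, CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV,
                            sizeof(major), &major, NULL) != CL_SUCCESS)
            return false;  // extension absent: be conservative and refuse
        return major >= 2;
    }

    // Called with the device BOINC assigned; defer with a visible reason
    // instead of computing garbage on a pre-Fermi card.
    static void refuse_pre_fermi(cl_device_id dev) {
        if (!device_is_fermi_or_later(dev)) {
            boinc_temporary_exit(300,
                "Pre-Fermi GPU on a broken 340.xx driver; task deferred", true);
        }
    }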
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> Agreed, I will block all such devices from the next release (it can perhaps be done via the OpenCL runtime itself). But because BOINC now assigns the execution device, just not enumerating them internally seems not enough. boinc_temporary_exit with a user notification about the reason for the exit would perhaps be safer.
Yep, I was editing to add that; because of user switching, there is surrounding logic with temp exits. It's been working as desired here, so I avoided fiddling with it more than I had to in order to add the device skip to its complete enumeration. (I didn't want to break what appeared to be working.)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions.
David S · Joined: 4 Oct 99 · Posts: 18352 · Credit: 27,761,924 · RAC: 12
> Attention, if you have a nVidia card that's 4 years old or older, and have updated to Driver 340.xx
I do not know, off the top of my head, whether my card is a Fermi or how old it is. I just barely have enough interest to look up my 440 on the given link. I am grateful that I don't need to, because TBar posted that big, bold statement that 400s are okay. It could easily be amended to note "(except 405)" and would be a lot more useful to people who have less interest than I. Make it a News item so it will appear on the project home page. Maybe with the title "Certain older video cards produce bad science with latest Nvidia driver", and the first line (which will also show up on the home page) can say "300 series and earlier cards, plus others listed here, are affected." Then list cards known to be bad and give the link to look up others to see if they might be.
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.
Borgholio · Joined: 2 Aug 99 · Posts: 654 · Credit: 18,623,738 · RAC: 45
I am running a 9800GT and the Nvidia 340.52 drivers. I noticed three AP tasks that quickly errored out when using the GPU. My normal S@H tasks seem to be working fine, so it seems I have the problem that is described in this thread. Trouble is, I have never had anything but pain when it comes to downgrading video card drivers, and I do not want to downgrade just for this. Is there a way to stop using the GPU for AP tasks while still allowing it to be used for S@H?
You will be assimilated...bunghole!
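[For readers with the same question: BOINC's cc_config.xml supports an <exclude_gpu> option that excludes a GPU for a single application. A minimal sketch follows; the short app name astropulse_v6 is an assumption, so verify it against the app names in client_state.xml before relying on it.]

    <cc_config>
      <options>
        <exclude_gpu>
          <url>http://setiathome.berkeley.edu/</url>
          <app>astropulse_v6</app>  <!-- assumed short name; check client_state.xml -->
          <type>NVIDIA</type>
        </exclude_gpu>
      </options>
    </cc_config>

[Place this in cc_config.xml in the BOINC data directory, then restart the client or use "Read config files" for it to take effect.]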