Update on Linux 64 -Nividia-V8-MB ?????

KLiK Volunteer tester Send message Joined: 31 Mar 14 Posts: 1304 Credit: 22,994,597 RAC: 60	Message 1772359 - Posted: 18 Mar 2016, 6:35:25 UTC - in response to Message 1772094. All of which is fascinating, in a slow-motion-car-crash sort of way, but doesn't really address KLiK's question about not getting Linux 64 -Nividia-V8-MB tasks (because of the app not being ready yet), or how many he will run when it is ready.
well, we (those of us running Linux also) have to run BETAs to develop an app which works in v8... ;) but, don't know why the BOINC didn't pick any of v7 AP?! :/
host ID? If there is a suitable app, have you checked your preferences that you allow AP?
it's on Default profile, which takes all but CPU apps...here's a Linux: https://setiathome.berkeley.edu/show_host_detail.php?hostid=7784676
non-profit org. Play4Life in Zagreb, Croatia, EU ID: 1772359 ·

OTS Volunteer tester Send message Joined: 6 Jan 08 Posts: 369 Credit: 20,533,537 RAC: 0	Message 1773922 - Posted: 25 Mar 2016, 14:51:25 UTC After running the x41zi app for a period of time after it first came out, I found I was getting more invalids from -9 overflow results than I had seen in the past. At least it seemed that way. Based on this impression, I posted in this thread seeking help and two members were kind enough to suggest I try drivers 352.41 and 358.16. I did try both those drivers in turn but they did not seem to be much of an improvement. As a result, I decided to be a little more scientific in my approach and start with the newest driver and work my way back until I found a driver that would not produce invalid results.
The first one I tried was 361.28. I noted the date and time I started using it and for the first week the only invalids I saw were on WUs that had completed before the change to the newest driver. Since that time, or at least until yesterday, I was feeling pretty good about having only seen a couple of invalids and I thought I had found a winner with my first choice. Alas, yesterday I saw five invalids listed as the result of a -9 overflow.
It gets worse. While looking through this thread again yesterday after seeing the invalids, I found where another kind soul had posted the recommendation to try driver 337.25 so last night I tried using that. This morning after half a day of using it I found that I now have 51 WUs that have "errored" out with an "an illegal instruction was encountered" notation of one sort or another in the results output.
This all brings up several questions. The first is why would there be five invalids using driver 361.28 in less than 24 hours when previously I was seeing virtually none over several weeks. Is there something different between WUs that would explain this?
Another thing I noticed when I went looking at the valid results of my wingmen was that many of them were using Nvidia cards. One I noticed was even using the same card, a 750ti, and the same app, x41zi, but using the 361.43 driver with Windows 10. Seeing that, I looked at his tasks page and he is currently showing no invalids for any of his 750ti results. Unfortunately there is no 361.43 driver for Linux so I cannot try to determine if it would help my situation or not. So the question remains, why did he receive a valid result while I received an invalid result on the same WU with the same card, and the same app? Is it the difference in the driver, the OS, or does the app compile differently for each OS? Or is there something else I am not considering?
As to the 337.25 driver, it is obviously not compatible with the x41zi app and I have gone back to the 361.28 driver while I contemplate my next move, or perhaps I will make the 361.28 my driver permanently and live with a string of invalids every now and then. If I continue with my plan however, I should try 352.79 next. ID: 1773922 ·
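The bookkeeping described in this post (note when each driver went in, then judge each invalid by when its WU completed) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and every timestamp below is made up for the example rather than taken from the host's task list.

```python
from bisect import bisect_right
from collections import Counter
from datetime import datetime


def driver_at(when, switches):
    """Return the driver version that was installed at time `when`.

    `switches` is a list of (start_time, driver_version) tuples sorted by
    start_time. Returns None for results completed before the first
    recorded switch.
    """
    starts = [start for start, _ in switches]
    idx = bisect_right(starts, when) - 1
    return switches[idx][1] if idx >= 0 else None


def invalids_per_driver(invalid_times, switches):
    """Count invalid results per driver, keyed by completion time."""
    return Counter(driver_at(t, switches) for t in invalid_times)


# Hypothetical log of driver changes (times invented for illustration).
switches = [
    (datetime(2016, 3, 1, 12, 0), "361.28"),
    (datetime(2016, 3, 24, 22, 0), "337.25"),
    (datetime(2016, 3, 25, 10, 0), "361.28"),
]

# Hypothetical completion times of WUs that were later marked invalid.
invalid_times = [
    datetime(2016, 3, 1, 9, 30),    # finished before the first switch
    datetime(2016, 3, 24, 8, 0),
    datetime(2016, 3, 24, 23, 15),
]

counts = invalids_per_driver(invalid_times, switches)
for driver, n in counts.items():
    print(driver, n)
```

Keying on completion time rather than report time matters here, because a WU crunched under the old driver can be validated days after the switch.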

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1773931 - Posted: 25 Mar 2016, 15:23:14 UTC - in response to Message 1773922. Last modified: 25 Mar 2016, 15:35:00 UTC One possibility I noticed with my Linux machine many moons ago, running a GTX680, is the drivers seem to have a very non-aggressive fan curve under this environment. Assuming you have the nvidia-settings control panel applet installed, I would recommend to give a trial manually setting the GPU fan to 80%, and see if that makes a difference. In theory if these devices start getting warm or unstable enough, then the GPU Boost algorithm is supposed to compensate by reducing the boost clock, upping the fans/voltages etc. This, however, may not apply to factory overclocked cards in the same way, where clocks/fan-profiles/voltages/ are set by the factory often with some acceptable number of graphical artifacts over time (i.e. glitches) for gaming. Therefore I suspect if the nvidia-settings application is available to you, and you can up the fan, then that may be enough to drop the invalids. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1773931 ·

OTS Volunteer tester Send message Joined: 6 Jan 08 Posts: 369 Credit: 20,533,537 RAC: 0	Message 1773933 - Posted: 25 Mar 2016, 15:33:44 UTC - in response to Message 1773931. Last modified: 25 Mar 2016, 15:41:34 UTC One possibility I noticed with my Linux machine many moons ago, running a GTX680, is the drivers seem to have a very non-aggressive fan curve under this environment. Assuming you have the nvidia-settings control panel applet installed, I would recommend to give a trial manually setting the GPU fan to 80%, and see if that makes a difference. In theory if these devices start getting warm or unstable enough, then the GPU Boost algorithm is supposed to compensate by reducing the boost clock, upping the fans/voltages etc. This, however, may not apply to factory overclocked cards in the same way, where clocks/fan-profiles/voltages/ are set by the factory often with some acceptable number of graphical artifacts over time (i.e. glitches) for gaming. Therefore I suspect if the nvidia-settings application is available to you, and you can up the fan, then that may be enough to drop the invalids.
nvidia-smi reports the temperature is currently 53C with 95% utilization using 31W. I do not think I have ever seen 56C. I will however try to figure out how to increase the fan speed from the command line as this is on a headless server without a GUI or X. You never know, it may help. Thanks for offering a suggestion.
Edit: Forgot to include fan speed. Currently it is at 31% as reported by nvidia-smi. ID: 1773933 ·
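For a headless box, the four readings mentioned in this post can be collected without X by asking nvidia-smi for CSV output. The sketch below assumes the `--query-gpu` fields `temperature.gpu`, `utilization.gpu`, `power.draw` and `fan.speed` are supported by the installed driver; the helper names are hypothetical, and the sample line is hand-written to mirror the figures quoted above, not captured output.

```python
import subprocess

QUERY_FIELDS = ["temperature.gpu", "utilization.gpu", "power.draw", "fan.speed"]


def parse_gpu_status(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU.

    Values are kept as strings because nvidia-smi may print placeholders
    such as "[Not Supported]" instead of a number.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def read_gpu_status():
    """Run nvidia-smi once and return the parsed readings (needs a GPU)."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_status(out)


# Hand-written sample mirroring the post: 53C, 95% utilization, 31W, 31% fan.
sample = "53, 95, 31.00, 31\n"
status = parse_gpu_status(sample)
print(status[0])
```

Calling `read_gpu_status()` from cron every minute and appending to a log would show whether the temperature ever spikes between manual checks, which a single reading cannot rule out.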

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1773937 - Posted: 25 Mar 2016, 15:36:44 UTC - in response to Message 1773933. Was about to add: [Edit:] as far as the GPU code goes, well they are all compiled using NVCC. If this were a code issue in the build or drivers somehow, we would need to identify other Linux machines running the same build, with the same or different GPUs, in order to isolate more precisely what is going on (compare similarities and differences). Driver or compiler bugs are not out of the realm of possibility, though the nature of this particular event looks more like factory overclock (a la 560ti), and the GPU code being run fairly well time proven. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1773937 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1773944 - Posted: 25 Mar 2016, 15:57:17 UTC - in response to Message 1773933. Last modified: 25 Mar 2016, 16:02:27 UTC Edit: Forgot to include fan speed. Currently it is at 31% as reported by nvidia-smi.
In addition to the Fan suggestion, looking through your errors, I'm seeing a pattern of 'illegal instruction' around video-memory/PCIe-bus intensive operations in assorted locations. As it happens these points are inside calls into nvidia's libraries/drivers, and probably point to some deeper system problem (rather than a problem with the libraries/drivers themselves). Yes, stress from these kinds of operations would vary with task angle-range. If the fan suggestion doesn't help, you may need to diagnose potential PCIe driver, power state, or system memory issues (BIOS, Drivers, many more). The pattern appears to be when the GPU is trying to talk to the Host system. That could be the GPU (OC-Fan-Clocks), PCIe Bus, or Host CPU/Memory, so unfortunately there is a pretty wide variety of possible causes. [Reseating the GPU card in its slot can sometimes clear problems like this, for example]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1773944 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1773952 - Posted: 25 Mar 2016, 16:28:55 UTC - in response to Message 1773944. ...you may need to diagnose potential PCIe driver, power state, or system memory issues (BIOS, Drivers, many more)... Just another relatively easy check, A quick survey of the processor errata datasheet for that family of CPU, suggests your motherboard manufacturer might have a BIOS update for multiple PCIe & power state issues. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1773952 ·

OTS Volunteer tester Send message Joined: 6 Jan 08 Posts: 369 Credit: 20,533,537 RAC: 0	Message 1774044 - Posted: 25 Mar 2016, 23:57:41 UTC - in response to Message 1773952. The suggestions from your last two posts could certainly be valid but I think it is strange that I have had the same motherboard/750Ti combination in use for 27 months and never had a problem with errors related to "illegal instructions" until I tried the 337.25 driver. Looking in the directory where I store Nvidia drivers, I find I have used 9 different drivers since I installed the card and I have used them on MB7s, MB8s, and AP7s with various apps. The 337.25 is the only one that has given "errors" and they have gone away completely since going back to the 361.28 driver.
Considering the fact that there were 51 errors in less than 12 hours with the 337.25 driver and none before or after when the 337.25 driver was not being used, I would think that driver is the most likely problem and the simplest solution would be not to use it. If after reading this, you still feel that I could have a hardware problem, I will look into it as you know more about this than I do, but I would really hate to try to fix something that is not really broken on a working server. ID: 1774044 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1774054 - Posted: 26 Mar 2016, 0:58:55 UTC - in response to Message 1774044. Last modified: 26 Mar 2016, 1:06:46 UTC For me, since we have few clues, it's more a matter of eliminating the easy things, low hanging fruit as it were. Checking the motherboard manufacturer's update list is just one easy thing to eliminate as a suspect, although these can be notoriously non-descript about what they are updating (CPU microcode updates are sometimes present, as some weaknesses are discovered years later when the software changes).
There are other relatively easy things to eliminate, such as locating the nVidia kernel compute cache and clearing it with no Cuda applications running. On Linux the default location is: ~/.nv/ComputeCache
That cache is built and maintained by your driver, so doing so after installing a 'known good' one would probably be the way to go. It's about the only route apart from hardware or firmware failure I can think of at this time, that would be capable of injecting illegal instructions.
Side thought: Are there minimum kernels specified on the Linux Drivers ?
[Edit:] Note that if boinc is running the applications as some other user, then compute caches in other home folders might be similarly polluted. This situation would possibly manifest in a functional application in standalone tests under a user account, and failure otherwise.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1774054 ·
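Because the cache may exist under more than one home folder, a small helper that walks a list of candidate homes is a convenient way to clear them all in one pass. This is a minimal sketch assuming the default `.nv/ComputeCache` layout named above; the function name is hypothetical, and which home directories to pass (your own, root's, the account BOINC runs under) depends on the system. The demonstration runs against a throwaway directory so nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path


def clear_compute_caches(home_dirs):
    """Delete <home>/.nv/ComputeCache under each given home directory.

    Returns the list of cache directories that were actually removed.
    Only run this while no CUDA application (BOINC tasks included) is
    active; the driver rebuilds the cache on the next CUDA launch.
    """
    removed = []
    for home in home_dirs:
        cache = Path(home) / ".nv" / "ComputeCache"
        if cache.is_dir():
            shutil.rmtree(cache)
            removed.append(cache)
    return removed


# Demonstration against a temporary fake home, not a real account.
with tempfile.TemporaryDirectory() as fake_home:
    cache = Path(fake_home) / ".nv" / "ComputeCache"
    cache.mkdir(parents=True)
    (cache / "index").write_text("placeholder")
    removed = clear_compute_caches([fake_home, Path(fake_home) / "missing"])
    cache_still_exists = cache.exists()

print(len(removed), cache_still_exists)
```

Only the `ComputeCache` subdirectory is removed, so anything else stored under `.nv` is left alone.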

OTS Volunteer tester Send message Joined: 6 Jan 08 Posts: 369 Credit: 20,533,537 RAC: 0	Message 1774194 - Posted: 26 Mar 2016, 17:16:55 UTC - in response to Message 1774054. Last modified: 26 Mar 2016, 17:25:43 UTC For me, since we have few clues, it's more a matter of eliminating the easy things, low hanging fruit as it were. Checking the motherboard manufacturer's update list is just one easy thing to eliminate as a suspect, although these can be notoriously non-descript about what they are updating (CPU microcode updates are sometimes present, as some weaknesses are discovered years later when the software changes) There are other relatively easy things to eliminate, such as locating the nVidia kernel compute cache, clearing it with no Cuda applications running, On Linux default location: ~/.nv/ComputeCache That cache is built and maintained by your driver, so doing so after installing a 'known good' one would probably be the way to go. It's about the only route apart from hardware or firmware failure I can think of at this time, that would be capable of injecting illegal instructions. Side thought: Are there minimum kernels specified on the Linux Drivers ? [Edit:] Note that if boinc is running the applications as some other user, then compute caches in other home folders might be similarly polluted. This situation would possibly manifest in a functional application in standalone tests under a user account, and failure otherwise.
After reading your last three posts I want to be sure we are on the same page. Based on what I think I have read, all your suggestions so far, with the possible exception of speeding up the fan, are to solve the problem of the "errors" and not the "invalids". Is that correct? If it is, I am at a loss as to why I would pursue this when not using one particular early version of the driver seems to cure the problem.
The server is handling web and mail services for two domains as well as locally being a time server and a file server for backups and shared areas on mirrored drives. Everything is working fine as near as I can tell and I am reluctant to do anything that has the potential to upset the apple cart so to speak. Even if they were all suggestions to solve the "invalid" problem I would have to think long and hard about what to do and not do. SETI work is really secondary to the other functions. A BIOS update falls in that area of something that I am reluctant to do if there is no problem with the non-SETI functions of the server. With today's new boards it is relatively risk free to update the BIOS, particularly with the roll back many boards have, but there is still the potential to brick it even if it is only a remote chance.
Please do not think I am ungrateful because I am grateful. I appreciate all you have done to help me, not only now but in the past. I also know you have put a tremendous amount of time in developing apps for the benefit of everyone. In fact, if you are asking me to do some of these things because it helps you solve some problems on your end with the apps, that would be different. I would certainly consider taking more of a risk if that were the case.
I did find a /.nv and a /root/.nv directory and after killing boinc, I deleted both of them. After I restarted boinc only the /.nv directory was recreated, and I have seen no changes so far. That however is not surprising since I am not using the 337.25 driver that created the invalids.
I have also tried to increase the fan speed but with no luck so far. "nvidia-settings" apparently only works with X and "nvidia-smi" doesn't seem to have a way to manipulate the fan speed, only report it. I did find a script on the web for headless machines that I played with a bit last night but without success. The only thing I succeeded in doing is slowing down the 750Ti which I didn't catch until this morning. Reinstalling the driver seems to have brought that back to normal.
To be honest, I think too high a GPU temperature as being a source of the invalids is not likely. A steady state temperature around 50C should not be a problem and even an unreported spike of 20 degrees more would probably not create a problem. I will however continue to look for a way to increase the fan speed. It certainly cannot hurt to bring the fan speed up and it might help.
The current kernel is "vmlinuz-huge-3.14.33" with a release date of 11 Feb 2015 so I do not think that would be a problem with the 337.25 driver released on 30 May 2014.
I will certainly reseat the card the next time I have the server down. Thanks again for all your help. ID: 1774194 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1774208 - Posted: 26 Mar 2016, 18:11:14 UTC - in response to Message 1774194. No Problem, yes certainly if not a dedicated crunching host I would avoid drastic actions also. For diagnostics of any possible app or Cuda library problem, first step would have been putting the system in the clear (by just looking at the manufacturer's site before deciding whether to take any action). I'll wait to see if any similar symptoms appear. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1774208 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1774296 - Posted: 26 Mar 2016, 23:32:16 UTC - in response to Message 1774194. Hmmm, I just ran the CUDA 60 App again on a different machine using Driver 337.25. Same as the other machine, No problems. So, I have now run it on two different machines with 5 different cards, and three different versions of Ubuntu. The only problem is it runs a bit slower on the Pre-Fermi cards. Just as with the old CUDA 5.0 and 5.5 Mac CUDA Apps on Pre-Fermi cards, it runs slower but doesn't produce any Errors or Inconclusives. Don't know why you would be having that problem, all the CUDA 60 tasks were run with Driver 337.25, then I upgraded to 352 to run the OpenCL tasks; https://setiweb.ssl.berkeley.edu/beta/results.php?hostid=72013&offset=100 I have run many others with 337.25 besides those on Beta... ID: 1774296 ·

OTS Volunteer tester Send message Joined: 6 Jan 08 Posts: 369 Credit: 20,533,537 RAC: 0	Message 1774306 - Posted: 26 Mar 2016, 23:52:17 UTC - in response to Message 1774296. Hmmm, I just ran the CUDA 60 App again on a different machine using Driver 337.25. Same as the other machine, No problems. So, I have now run it on two different machines with 5 different cards, and three different versions of Ubuntu. The only problem is it runs a bit slower on the Pre-Fermi cards. Just as with the old CUDA 5.0 and 5.5 Mac CUDA Apps on Pre-Fermi cards, it runs slower but doesn't produce any Errors or Inconclusives. Don't know why you would be having that problem, all the CUDA 60 tasks were run with Driver 337.25, then I upgraded to 352 to run the OpenCL tasks; https://setiweb.ssl.berkeley.edu/beta/results.php?hostid=72013&offset=100 I have run many others with 337.25 besides those on Beta...
337.25 is not one of the ones that Nvidia lists even under the archived drivers but it may have been just dropped off the end of the list because of how old it is. In any case, I am now running the oldest driver that Nvidia does currently list, 352.21. I have not seen any errors yet but it appears there may be one or two invalids. I am going to let it run a while and see what happens, but I am beginning to suspect that I will have to eventually take the server off line in the wee hours of the morning and try some of the more drastic measures that Jason outlined if I want it to continue to do SETI.
Thanks for letting me know that it has worked so well for you. ID: 1774306 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1774316 - Posted: 27 Mar 2016, 0:11:53 UTC - in response to Message 1774306. 337.25 is still in the archive, http://www.nvidia.com/object/linux-amd64-display-archive.html That driver is Special, it's the one you use to run a Pre-Fermi card with a GTX 750Ti. Beings that way, I have run it for days on my 750Ti paired with the GTS 250. Never had a problem. It's the driver I was using when I discovered the Pre-Fermi cards run slower with the CUDA 60 App. Since then, I have tested a few different scenarios to make sure it was the CUDA 60 App slowing down the Pre-Fermi cards. In all the tests I never had an Error or Invalid using CUDA 60. All the tests were in various flavors of Ubuntu. ID: 1774316 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1774363 - Posted: 27 Mar 2016, 5:10:19 UTC - in response to Message 1774306. Last modified: 27 Mar 2016, 5:53:36 UTC 337.25 is not one of the ones that Nvidia lists even under the archived drivers but it may have been just dropped off the end of the list because of how old it is. In any case, I am now running the oldest driver that Nvidia does currently list, 352.21. I have not seen any errors yet but it appears there may be one or two invalids. I am going to let it run a while and see what happens, but I am beginning to suspect that I will have to eventually take the server off line in the wee hours of the morning and try some of the more drastic measures that Jason outlined if I want it to continue to do SETI. Thanks for letting me know that it has worked so well for you.
Let's try to clarify what I'm asking again, since we're clearly not on the same page. All I've requested so far is to look at the website of the motherboard manufacturer. There may well be no updates at all, which would dismiss a whole class of possibilities without anything being taken down or updated. Could I have the model code of the motherboard please ? (On the other hand, there may be a raft of updates listed with words like 'compatibility' and 'PCI Express', which would tell us a lot)
Xeon E3-1200 v3 Product Family, Specification Update, Jan 2016 (PDF): the errata look scarier than they are (a pretty normal-sized list for a CPU that is a couple of years old). The 750ti was released about a year after. That may not say anything either, but would yield clues.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1774363 ·

OTS Volunteer tester Send message Joined: 6 Jan 08 Posts: 369 Credit: 20,533,537 RAC: 0	Message 1774419 - Posted: 27 Mar 2016, 12:02:40 UTC - in response to Message 1774363. Last modified: 27 Mar 2016, 12:12:40 UTC 337.25 is not one of the ones that Nvidia lists even under the archived drivers but it may have been just dropped off the end of the list because of how old it is. In any case, I am now running the oldest driver that Nvidia does currently list, 352.21. I have not seen any errors yet but it appears there may be one or two invalids. I am going to let it run a while and see what happens, but I am beginning to suspect that I will have to eventually take the server off line in the wee hours of the morning and try some of the more drastic measures that Jason outlined if I want it to continue to do SETI. Thanks for letting me know that it has worked so well for you.
Let's try clarify what I'm asking again, since we're clearly not on the same page. All I've requested so far is to look at a Website of the motherboard manufacturer. There may well be no updates at all, which would dismiss a whole class of possibilities without anything being taken down or updated. Could I have the model code of the motherboard please ? (On the other hand, there may be a raft of updates listed with words like 'compatibility' and 'PCI Express', which would tell us a lot) Xeon E3-1200 v3 Product Family, Specification update, Jan 2016 (PDF), Errata look scarier than they are (Pretty normal sized list for a couple of year old CPU). 750ti was released about a year after. That may not say anything either, but would yield clues
BIOS Information
    Vendor: American Megatrends Inc.
    Version: 1.00
    Release Date: 05/03/2013
    Address: 0xF0000
    Runtime Size: 64 kB
    ROM Size: 16384 kB
    Characteristics:
        PCI is supported
        BIOS is upgradeable
        BIOS shadowing is allowed
        Boot from CD is supported
        Selectable boot is supported
        BIOS ROM is socketed
        EDD is supported
        5.25"/1.2 MB floppy services are supported (int 13h)
        3.5"/720 kB floppy services are supported (int 13h)
        3.5"/2.88 MB floppy services are supported (int 13h)
        Print screen service is supported (int 5h)
        8042 keyboard services are supported (int 9h)
        Serial services are supported (int 14h)
        Printer services are supported (int 17h)
        ACPI is supported
        USB legacy is supported
        BIOS boot specification is supported
        Targeted content distribution is supported
        UEFI is supported
    BIOS Revision: 4.6
Handle 0x0001, DMI type 1, 27 bytes
System Information
    Manufacturer: Supermicro
    Product Name: X10SAE
    Version: 0123456789
    Serial Number: 0123456789
    UUID: 00000000-0000-0000-0000-00259086B9EE
    Wake-up Type: Power Switch
    SKU Number: To be filled by O.E.M.
    Family: To be filled by O.E.M.
Processor Information
    Socket Designation: SOCKET 0
    Type: Central Processor
    Family: Xeon
    Manufacturer: Intel
    ID: C3 06 03 00 FF FB EB BF
    Signature: Type 0, Family 6, Model 60, Stepping 3
    Flags:
        FPU (Floating-point unit on-chip)
        VME (Virtual mode extension)
        DE (Debugging extension)
        PSE (Page size extension)
        TSC (Time stamp counter)
        MSR (Model specific registers)
        PAE (Physical address extension)
        MCE (Machine check exception)
        CX8 (CMPXCHG8 instruction supported)
        APIC (On-chip APIC hardware supported)
        SEP (Fast system call)
        MTRR (Memory type range registers)
        PGE (Page global enable)
        MCA (Machine check architecture)
        CMOV (Conditional move instruction supported)
        PAT (Page attribute table)
        PSE-36 (36-bit page size extension)
        CLFSH (CLFLUSH instruction supported)
        DS (Debug store)
        ACPI (ACPI supported)
        MMX (MMX technology supported)
        FXSR (FXSAVE and FXSTOR instructions supported)
        SSE (Streaming SIMD extensions)
        SSE2 (Streaming SIMD extensions 2)
        SS (Self-snoop)
        HTT (Multi-threading)
        TM (Thermal monitor supported)
        PBE (Pending break enabled)
    Version: Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz
    Voltage: 1.2 V
    External Clock: 100 MHz
    Max Speed: 3800 MHz
    Current Speed: 3300 MHz
    Status: Populated, Enabled
    Upgrade: <OUT OF SPEC>
    L1 Cache Handle: 0x004C
    L2 Cache Handle: 0x004B
    L3 Cache Handle: 0x004D
    Serial Number: Not Specified
    Asset Tag: Fill By OEM
    Part Number: Fill By OEM
    Core Count: 4
    Core Enabled: 4
    Thread Count: 8
    Characteristics:
        64-bit capable
ID: 1774419 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1774422 - Posted: 27 Mar 2016, 12:19:10 UTC - in response to Message 1774419. No, not upset; I just get frustrated when my communication skills fail me, which isn't your fault :) Thanks for the detail. Since the BIOS version and manufacturer information is all there, it allows some pretty detailed examination without touching the system. My initial theory was simply that, under pressure from the GPU, one of the CPU's several PCIe-related quirks (see the previously linked spec update) is unpatched. I am now checking whether there is any change history on the manufacturer's site or in the zip, to see if there are any notes on what the updates actually address. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1774422 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1774429 - Posted: 27 Mar 2016, 13:02:06 UTC - in response to Message 1774422. Last modified: 27 Mar 2016, 13:39:29 UTC After looking: the [update] BIOS revision date is 20th May last year, with no change log. As per your concerns, Supermicro stress 'Do not update unless you have to' and don't provide any changelog online or in the download that I can find, which makes it rather difficult to know whether you need to or not. The way to determine that will be to contact Supermicro yourself and ask what has changed (e.g. anything that affects your CPU operation, PCI-E stability in particular).
Your CPU was released in Q2 2013, the motherboard BIOS v1 is dated 05/03/2013 (presumably American date format), and the Errata document starts at June 2013. So every single erratum (~6 pages in the Summary Listing) should be assumed to be applicable and uncorrected, about a dozen or more of them being directly PCIe related. (The Intel guideline in the nomenclature section describes how Errata should be treated by hardware and software developers.)
So I would definitely recommend contacting Supermicro support, so as to determine how many updates there have been, what they address, and if the risk and downtime are worth it for crunching via PCIe3 on that machine (which has a job). My feeling is that they should be able to look up if there were microcode updates and PCIe related workarounds/fixes included, though anything's possible if it's just a call centre and the policy is not to reveal the information.
The risks are totally yours to take (or not), and Supermicro may just say something like "Yeah we fixed a bunch of stuff from revision X onwards, that probably apply to your CPU". Though if it were me I'd still be gathering information and assessing risks for some time, if the machine is mission critical.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1774429 ·
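The date argument in this post holds whichever way 05/03/2013 is read, which a short check makes explicit. This is only a worked illustration: the variable names are hypothetical, and taking "June 2013" to mean the 1st of June is an assumption, since the post gives only the month.

```python
from datetime import date

# The BIOS release date string from dmidecode is "05/03/2013".
bios_if_us_format = date(2013, 5, 3)      # read as MM/DD/YYYY: 3 May 2013
bios_if_intl_format = date(2013, 3, 5)    # read as DD/MM/YYYY: 5 March 2013

# The post says the errata document "starts at June 2013"; the day of the
# month is an assumption made only so the comparison can be written down.
errata_start = date(2013, 6, 1)


def predates(bios_date, first_errata_date):
    """True if the BIOS was built before the first published erratum."""
    return bios_date < first_errata_date


both_predate = all(
    predates(d, errata_start) for d in (bios_if_us_format, bios_if_intl_format)
)
print(both_predate)
```

Either reading puts the v1.00 BIOS ahead of the first errata revision, so it cannot contain workarounds written in response to that document.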

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1774472 - Posted: 27 Mar 2016, 17:05:21 UTC - in response to Message 1774429. Later: a possible workaround to Errata HSW122 "PCIe Link may Incorrectly Train to 8.0GT/s", Symptoms are described as link failure or performance problems. Setting your slot to PCIe v2 mode might avoid this and some of the other errata. (in BIOS, I do think the SuperMicro has such a setting, though only scanned the manual briefly) Could save the effort of checking with supermicro, and poking at updates, while the system can keep doing its job. (if it helps) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1774472 ·
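Whether the slot actually drops from 8.0GT/s to PCIe v2 speed after a BIOS change can be confirmed from the `LnkSta` line that `lspci -vv` prints for the GPU. The sketch below parses that line; the function name is hypothetical, and the sample string is a hand-written illustration of the usual format, not output from this server. Newer lspci builds append markers such as "(ok)" after the speed, which the pattern tolerates.

```python
import re

# Matches e.g. "LnkSta: Speed 8GT/s, Width x16" and "Speed 8GT/s (ok), Width x16 (ok)".
LNKSTA_RE = re.compile(
    r"LnkSta:\s*Speed\s+([\d.]+)GT/s(?:\s*\([^)]*\))?,\s*Width\s+x(\d+)"
)

# Transfer rate per lane (GT/s) mapped to PCIe generation.
GEN_BY_SPEED = {2.5: 1, 5.0: 2, 8.0: 3}


def parse_link_status(lspci_text):
    """Extract negotiated link speed, width and generation from lspci -vv text.

    Returns None if no LnkSta line is present (lspci usually needs root to
    show the capability block).
    """
    match = LNKSTA_RE.search(lspci_text)
    if match is None:
        return None
    speed = float(match.group(1))
    return {
        "speed_gts": speed,
        "width": int(match.group(2)),
        "gen": GEN_BY_SPEED.get(speed),
    }


# Hand-written sample in the usual lspci -vv layout (illustrative only).
sample = "\t\tLnkSta:\tSpeed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive-"
link = parse_link_status(sample)
print(link)
```

It is worth reading the value while a task is running, since the card may negotiate a slower link at idle to save power and give a misleadingly low figure.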

OTS Volunteer tester Send message Joined: 6 Jan 08 Posts: 369 Credit: 20,533,537 RAC: 0	Message 1774572 - Posted: 27 Mar 2016, 23:17:28 UTC - in response to Message 1774472. Later: a possible workaround to Errata HSW122 "PCIe Link may Incorrectly Train to 8.0GT/s", Symptoms are described as link failure or performance problems. Setting your slot to PCIe v2 mode might avoid this and some of the other errata. (in BIOS, I do think the SuperMicro has such a setting, though only scanned the manual briefly) Could save the effort of checking with supermicro, and poking at updates, while the system can keep doing its job. (if it helps)
I have not been ignoring you. I have simply been gone most of the day. I will look closely at the manual and if I can change to PCIe V2 mode I will try that when I can take the server down. I will also contact Supermicro via email regardless if that works or not. It can't hurt and any information gained is a benefit. I will probably however wait until Tuesday. I have a feeling a Monday is not the best time to contact them if you want a meaningful response.
Thanks for all the effort you have put in on this. ID: 1774572 ·
