Seti@home hangs on dual-Opteron w Fedora Core 2

Darren Frith Joined: 14 Jan 02 Posts: 4 Credit: 7,252 RAC: 0	Message 42060 - Posted: 2 Nov 2004, 0:39:15 UTC
Hi all,
I have a dual Opteron 242 system (Tyan S2885 K8W mobo) running Fedora Core 2 (2.6.8-1.521smp). Every time I run seti@home the machine hangs after about 5-15 minutes. No exceptions, no kernel panics, just a hard hang, and the machine must be reset.
It first happened with the seti@home i686 classic client. Then I downloaded the i686 version of boinc, and the same thing happened. Yesterday I downloaded and built boinc-4.11.tar.gz and seti_boinc-client-cvs-2004-10-31.tar.gz. It compiled and ran for a while, but then hung again.
All but one of the temperature sensors are within range, and the one that is out of range has read much higher without boinc running, with no hang, than it does with boinc.
I don't know how to proceed from here. I have stress tested the CPUs without a hang, so I suspect seti/boinc is to blame. I can't explain why 3 versions cause the same result. Any insight or debugging tips would be appreciated.
Regards, Darren
ID: 42060

Darren Frith Joined: 14 Jan 02 Posts: 4 Credit: 7,252 RAC: 0	Message 42084 - Posted: 2 Nov 2004, 2:06:23 UTC
I just ran two instances of cpuburn-in (http://users.bigpond.net.au/cpuburn/) for 60 minutes. Highest temperatures (degrees Celsius) recorded were CPU0: 54, CPU1: 42, north/south bridge: 50 (40 high). System remained stable and responsive.
ID: 42084

Ned Slider Joined: 12 Oct 01 Posts: 668 Credit: 4,375,315 RAC: 0	Message 42088 - Posted: 2 Nov 2004, 2:12:17 UTC
Hi,
I have Fedora Core 2 (kernel 2.6.8-1.521) running boinc/seti with no problems on a single processor machine (well, 2 machines actually). I would say that boinc/seti is exposing some weakness in your hardware/software combination. As you know, seti really does stress your hardware, unlike many other stress tests. In particular, seti works a lot harder during the first 10 mins of a work unit (just monitor your temps and voltages to see this), so it's not uncommon for seti to hang a machine in the first 15 mins of a work unit.
What are your temps like? Also, what are your cpu voltage rails like? I've seen mine drop when running seti (i.e. stressing the cpu), and they drop even further during the first 15 mins of any work unit, which is further evidence that seti works the cpu harder at the start of a work unit. This may be just enough to tip the balance from stability to a hard lockup. My overclocked Windows machine would do this all the time until I got a better PSU for it.
The first thing I'd look at is the PSU and voltage rails, especially as it's a dual processor machine. Running 2 instances of seti, one on each processor, is really going to suck some juice out of your PSU. Try running just one instance and see if that's stable. Then start up a second instance once the first has run for a couple of hours.
Again, I really don't think boinc is to blame, just the exceptional load that 2 instances put on your hardware, particularly in the first 15 mins of the work unit.
Ned
* My Guide to Compiling Optimised BOINC and SETI Clients * * Download Optimised BOINC and SETI Clients for Linux Here *
ID: 42088

Ned Slider Joined: 12 Oct 01 Posts: 668 Credit: 4,375,315 RAC: 0	Message 42098 - Posted: 2 Nov 2004, 2:28:19 UTC - in response to Message 42084.
> I just ran two instances of cpuburn-in (http://users.bigpond.net.au/cpuburn/)
> for 60 minutes. Highest temperatures (degrees Celsius) recorded were CPU0: 54,
> CPU1: 42, north/south bridge: 50 (40 high).
>
> System remained stable and responsive.
>
Well, there's cpu stress testing and there's cpu stress testing. You have no idea whether the cpu burn-in test and seti are comparable. As I've previously shown (below), the start of a work unit exerts a far heavier load on your processor (1000 Mflops) than the rest of the work unit (450 Mflops), which also correlates with an increase in power usage and a rise in temperature, yet the whole time it's running at 100% cpu usage. Clearly not all 100% cpu usage is equal, which is why it doesn't surprise me that your cpu burn-in tests are stable but seti isn't.
See this thread for a more detailed explanation of what I was trying to say above: http://forums.pcper.com/showthread.php?t=206967
Quote from that thread:
"Furthermore, I can correlate the increase in processor temp at the start of a WU with the power output. For example, if our reference point at idle is normalised to 0W, 34C, then for SETI running we have +10W or +3C (to 37C; 450Mflops), and at the very start of a WU we have +18W or +5C (to 39C; 1000Mflops)."
ID: 42098

Darren Frith Joined: 14 Jan 02 Posts: 4 Credit: 7,252 RAC: 0	Message 42111 - Posted: 2 Nov 2004, 3:07:43 UTC - in response to Message 42088.
> Hi,
>
> ...
> Also, what are your cpu voltage rails like? I've seen mine drop when running
> seti (ie stressing the cpu) and they drop even further during the first 15
> mins of any work unit - further evidence that seti works the cpu harder at the
> start of a work unit. This may be just enough to tip the balance from
> stability to a hard lockup. My overclocked Windows machine would do this all
> the time until I got a better PSU for it.
> ...
> Ned
Ned, I've taken your advice and looked at the voltage rails. I ran the following just before starting boinc:
    while date; do sensors; done > temp.log
Results before running boinc:
    Tue Nov 2 13:12:46 CST 2004
    adt7463-i2c-2-2e
    Adapter: SMBus AMD756 adapter at 50e0
    ERROR: Can't get alarm mask data!
    CPU0 DDR 2.5: +2.526 V (min = +2.37 V, max = +2.63 V)
    CPU0 DDR VTT: +1.254 V (min = +1.18 V, max = +1.31 V)
    3.3VSB: +3.334 V (min = +3.13 V, max = +3.47 V)
    +5V: +5.130 V (min = +4.74 V, max = +5.26 V)
    +12V: +12.188 V (min = +11.38 V, max = +12.62 V)
    CPU1 Temp:+39.00°C (low = -127°C, high = +127°C)
    System Temp: +35.00°C (low = -127°C, high = +127°C)
    CPU0 Temp:+52.00°C (low = -127°C, high = +127°C)
    vid: +1.850 V (VRM Version 0.1)
    w83627hf-isa-0290
    Adapter: ISA adapter
    CPU0 Vcore: +1.57 V (min = +1.42 V, max = +1.57 V) ALARM
    CPU1 Vcore: +1.57 V (min = +1.42 V, max = +1.57 V) ALARM
    +3.3V: +3.36 V (min = +3.14 V, max = +3.47 V)
    CPU1 DDR VTT: +1.28 V (min = +1.18 V, max = +1.31 V)
    CPU1 DDR 2.5: +2.54 V (min = +2.37 V, max = +2.62 V)
    VDD_CPU0: +1.17 V (min = +1.14 V, max = +1.26 V)
    VBat: +0.00 V (min = +3.14 V, max = +3.47 V)
    rear temp: +43°C (high = +40°C, hyst = +37°C) sensor = thermistor
    I/O panel temp: +32.0°C (high = +52°C, hyst = +47°C) sensor = thermistor
    vid: +2.050 V
    alarms:
    beep_enable: Sound alarm disabled
And the last recorded output before the hang:
    Tue Nov 2 13:13:11 CST 2004
    adt7463-i2c-2-2e
    Adapter: SMBus AMD756 adapter at 50e0
    ERROR: Can't get alarm mask data!
    CPU0 DDR 2.5: +2.513 V (min = +2.37 V, max = +2.63 V)
    CPU0 DDR VTT: +1.254 V (min = +1.18 V, max = +1.31 V)
    3.3VSB: +3.334 V (min = +3.13 V, max = +3.47 V)
    +5V: +5.130 V (min = +4.74 V, max = +5.26 V)
    +12V: +12.188 V (min = +11.38 V, max = +12.62 V)
    CPU1 Temp:+39.00°C (low = -127°C, high = +127°C)
    System Temp: +35.00°C (low = -127°C, high = +127°C)
    CPU0 Temp:+52.00°C (low = -127°C, high = +127°C)
    vid: +1.850 V (VRM Version 0.1)
    w83627hf-isa-0290
    Adapter: ISA adapter
    CPU0 Vcore: +1.55 V (min = +1.42 V, max = +1.57 V) ALARM
    CPU1 Vcore: +1.57 V (min = +1.42 V, max = +1.57 V) ALARM
    +3.3V: +3.36 V (min = +3.14 V, max = +3.47 V)
    CPU1 DDR VTT: +1.28 V (min = +1.18 V, max = +1.31 V)
    CPU1 DDR 2.5: +2.53 V (min = +2.37 V, max = +2.62 V)
    VDD_CPU0: +1.18 V (min = +1.14 V, max = +1.26 V)
    VBat: +0.00 V (min = +3.14 V, max = +3.47 V)
    rear temp: +43°C (high = +40°C, hyst = +37°C) sensor = thermistor
    I/O panel temp: +32.0°C (high = +52°C, hyst = +47°C) sensor = thermistor
    vid: +2.050 V
    alarms:
    beep_enable: Sound alarm disabled
All voltages appear within range. I have a 460W power supply so I don't think it is a voltage problem. Two things put doubt in my mind: (1) the last sensors message may not be in the file due to HDD sync, and (2) lm_sensors may not be correct, although sensors.conf is from the mobo manufacturer's website and the output corresponds well with the values from the BIOS.
As you can see, the system ran for about 25 seconds before hanging. Usually it's not that fast. Will try the single-CPU kernel next.
ID: 42111
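One way to narrow the first doubt raised above (the last `sensors` sample not reaching the disk before the hang) is to flush to disk after every sample. The following is only a minimal sketch, assuming a POSIX shell and the standard `sync` utility; `log_samples` and its arguments are hypothetical names chosen for illustration, not part of lm_sensors or boinc. A hard hang can still lose the sample being written at that instant, so this shrinks the window rather than closing it.

```shell
#!/bin/sh
# log_samples LOGFILE COUNT INTERVAL CMD [ARGS...]   (hypothetical helper)
# Appends a timestamp plus the output of CMD to LOGFILE, COUNT times,
# and calls sync after each sample so the data is pushed out to disk.
log_samples() {
    logfile=$1
    count=$2
    interval=$3
    shift 3
    i=0
    while [ "$i" -lt "$count" ]; do
        {
            date
            "$@"
        } >> "$logfile" 2>&1
        sync                 # flush filesystem buffers before the next sample
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example (not executed here): one sample per second for ten minutes
#   log_samples temp.log 600 1 sensors
```

The one-second interval also keeps the log small enough to compare the final samples by eye.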

Ned Slider Joined: 12 Oct 01 Posts: 668 Credit: 4,375,315 RAC: 0	Message 42116 - Posted: 2 Nov 2004, 3:34:09 UTC - in response to Message 42111.
> All voltages appear within range. I have a 460W power supply so I don't think
> it is a voltage problem. Two things put doubt in my mind, (1) the last sensors
> message may not be in the file due to HDD sync, (2) lm_sensors may not be
> correct, although sensors.conf is from the mobo manufacturer website and
> output corresponds well with values from BIOS.
>
> As you can see, the system ran for about 25 seconds before hanging. Usually
> not that fast. Will try single-CPU kernel next.
>
I wouldn't worry about whether the absolute sensor values are correct (motherboard voltage readings are notoriously inaccurate, but it doesn't much matter). What you're looking for is a drop in voltage under load. From your data above, it looks OK. The +12V rail supplying the cpu doesn't appear to drop at all between the two sets of data, indicating you have sufficient current. Just out of interest, can you post the make/model of your power supply together with the specs for the individual voltage rails?
Also, is this with two copies of seti running, one on each processor? If so, do you still get lockups with just one copy running?
Ned
ID: 42116
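Looking for a drop under load, rather than at absolute values, amounts to subtracting one `sensors` snapshot from another. A rough sketch of that comparison, assuming each snapshot has been saved to its own file with one reading per line in the `label: +value V (...)` form that `sensors` prints; `rail_delta` is a hypothetical helper name, and labels that occur twice in one snapshot (such as `vid`) are not handled specially.

```shell
#!/bin/sh
# rail_delta BEFORE AFTER   (hypothetical helper)
# Prints the change in every voltage reading between two saved snapshots.
rail_delta() {
    awk -F: '
        # Keep only voltage lines, e.g. "+12V:   +12.188 V  (min = ...)";
        # temperature lines and adapter headers do not match.
        $2 ~ /^ *[+-]?[0-9.]+ V/ {
            split($2, f, " ")                 # f[1] is the reading, e.g. "+12.188"
            if (NR == FNR)
                before[$1] = f[1] + 0         # first file: remember the baseline
            else if ($1 in before)
                printf "%s: %+.3f V\n", $1, (f[1] + 0) - before[$1]
        }
    ' "$1" "$2"
}

# Example (not executed here):
#   rail_delta idle.log load.log
```

Applied to the two snapshots posted in Message 42111, this would report no change on +12V and a 0.02 V drop on CPU0 Vcore.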

Darren Frith Joined: 14 Jan 02 Posts: 4 Credit: 7,252 RAC: 0	Message 42128 - Posted: 2 Nov 2004, 5:20:41 UTC - in response to Message 42116.
> I wouldn't worry if the absolute sensor values are correct (motherboard
> voltage readings are notoriously inaccurate but it doesn't much matter), what
> you're looking for is a drop in voltage under load. From your data above, it
> looks OK. The +12V rail supplying the cpu doesn't appear to drop at all
> between the two sets of data indicating you have sufficient current. Just out
> of interest, can you post the make/model of your power supply together with
> the specs for the individual voltage rails.
>
> Also, is this with two copies of seti running, one on each processor? If so,
> do you still get lockups with just one copy running?
>
> Ned
>
Ned,
The power supply is a Zippy HP2-6460P (http://www.zippy.com.tw/P_product_detail.asp?lv_rfnbr=2&pp_rfnbr=1010&pcp_rfnbr=3&pp_name=&pcp_name=PS2/PS2+%20single&pcpw_rfnbr=4&pp_code=HP2-6460P)
Individual rail specs are:
    Output    Output Current  Output Current  Regulation  Regulation  Output Ripple &
    Voltage   Min. (A)        Max. (A)        Load        Line        Noise Max. [P-P]
    5V        2.5             40.00           ±5%         ±1%         60mV
    12V       1               32.00           ±5%         ±1%         100mV
    -5V       0               0.80            ±5%         ±1%         100mV
    -12V      0               1.00            ±5%         ±1%         100mV
    3.3V      1               30              +5, -5%     ±1%         60mV
    +5VSB     0.1             2               ±5%         ±1%         60mV
The earlier tests were run on a dual processor kernel, but with a single instance of boinc. (This is a bit puzzling in itself. I only see boinc/slots/0, whereas I have boinc/slots/0 and boinc/slots/1 on my dual-P3 machine.)
I rebooted into the single processor kernel and ran boinc so it would definitely be a single instance. It hung just the same as before. It seems that if it is a stress issue, then even one instance is too much.
ID: 42128
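The per-rail figures can be turned into a maximum wattage per rail by multiplying voltage by maximum current. A small sketch of that arithmetic, assuming the current columns are in amperes; `rail_watts` is a hypothetical helper name. The per-rail maxima add up to well over the 460W rating, so the supply's combined limit applies and the rails cannot all be loaded to their individual maxima at once.

```shell
#!/bin/sh
# rail_watts   (hypothetical helper)
# Reads lines of "<rail voltage in V> <max current in A>" on stdin and
# prints the maximum power each rail can deliver on its own.
rail_watts() {
    awk '{
        v = ($1 < 0) ? -$1 : $1          # use the magnitude for negative rails
        printf "%s V rail: %.1f W max\n", $1, v * $2
    }'
}

# Main rails from the table above
printf '%s\n' '5 40' '12 32' '3.3 30' | rail_watts
# 5 V rail: 200.0 W max
# 12 V rail: 384.0 W max
# 3.3 V rail: 99.0 W max
```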
