Seti@home hangs on dual-Opteron w Fedora Core 2

Darren Frith

Message 42060 - Posted: 2 Nov 2004, 0:39:15 UTC

Hi all,

I have a dual Opteron 242 system (Tyan S2885 K8W mobo) running Fedora Core 2 (2.6.8-1.521smp). Every time I run seti@home the machine hangs after about 5-15 minutes. No exceptions, no kernel panics, just a hard hang, and the machine must be reset.

It first happened with the seti@home i686 classic client. Then I downloaded the i686 version of boinc, and the same thing happened. Yesterday I downloaded and built boinc-4.11.tar.gz and seti_boinc-client-cvs-2004-10-31.tar.gz. I got it compiled and it ran for a while, but then it hung again.

All but one temperature sensor are within range, and the one that is out of range has read much higher without boinc running, with no hang, than it does with boinc.

I don't know how to proceed from here. I have stress tested the CPUs and not had it hang, so I suspect seti/boinc is to blame. I can't explain why 3 versions cause the same results.

Any insight, or debugging tips would be appreciated.

Regards,

Darren
Darren Frith

Message 42084 - Posted: 2 Nov 2004, 2:06:23 UTC

I just ran two instances of cpuburn-in (http://users.bigpond.net.au/cpuburn/) for 60 minutes. Highest temperatures (degrees Celsius) recorded were CPU0: 54, CPU1: 42, north/south bridge: 50 (40 high).

System remained stable and responsive.
Ned Slider

Message 42088 - Posted: 2 Nov 2004, 2:12:17 UTC

Hi,

I have fedora core 2 (kernel 2.6.8-1.521) running boinc/seti with no problems on a single processor machine (well, 2 machines actually).

I would say that boinc/seti is exposing some weakness in your hardware/software combination. As you know, seti really does stress your hardware unlike many other stress tests. In particular, seti works a lot harder during the first 10 mins of a work unit (just monitor your temps and voltages to see this) so it's not uncommon for seti to hang a machine in the first 15 mins of a work unit.

What are your temps like?

Also, what are your cpu voltage rails like? I've seen mine drop when running seti (i.e. stressing the cpu) and they drop even further during the first 15 mins of any work unit, which is further evidence that seti works the cpu harder at the start of a work unit. This may be just enough to tip the balance from stability to a hard lockup. My overclocked Windows machine would do this all the time until I got a better PSU for it.

First thing I'd look at is the PSU and voltage rails, especially as it's a dual processor machine. Running 2 instances of seti, one on each processor, is really going to suck some juice out of your PSU. Try just running one instance and see if that's stable. Then start up a second instance once the first has run for a couple of hours.

Again, I really don't think it's boinc to blame, just the exceptional load that 2 instances is putting on your hardware, particularly in the first 15 mins of the work unit.

Ned


*** My Guide to Compiling Optimised BOINC and SETI Clients ***
*** Download Optimised BOINC and SETI Clients for Linux Here ***
Ned Slider

Message 42098 - Posted: 2 Nov 2004, 2:28:19 UTC - in response to Message 42084.  

> I just ran two instances of cpuburn-in (http://users.bigpond.net.au/cpuburn/)
> for 60 minutes. Highest temperatures (degrees Celsius) recorded were CPU0: 54,
> CPU1: 42, north/south bridge: 50 (40 high).
>
> System remained stable and responsive.
>

Well, there's cpu stress testing and there's cpu stress testing. You have no idea if the cpu burnin test and seti are comparable. As I've previously shown (below), the start of a work unit exerts a far heavier load on your processor (1000 Mflops) than the rest of the work unit (450 Mflops), which also correlates with an increase in power usage and temperature, yet the whole time it's running at 100% cpu usage. Clearly not all 100% cpu usage is equal, which is why it doesn't surprise me that your cpu burnin tests are stable but seti isn't.

See this thread for a more detailed explanation of what I was trying to say above:

http://forums.pcper.com/showthread.php?t=206967

quote from this:

"Furthermore, I can correlate the increase in processor temp at the start of a WU with the power output. For example, if our reference point at idle is normalised to 0W, 34C, then for SETI running we have +10W or +3C (to 37C; 450Mflops), and at the very start of a WU we have +18W or +5C (to 39C; 1000Mflops)."
Darren Frith

Message 42111 - Posted: 2 Nov 2004, 3:07:43 UTC - in response to Message 42088.  

> Hi,
>
> ...
> Also, what are your cpu voltage rails like? I've seen mine drop when running
> seti (ie stressing the cpu) and they drop even further during the first 15
> mins of any work unit - further evidence that seti works the cpu harder at the
> start of a work unit. This may be just enough to tip the balance from
> stability to a hard lockup. My overclocked Windows machine would do this all
> the time until I got a better PSU for it.
> ...
> Ned

Ned, I've taken your advice and looked at the voltage rails.

I just ran the following before starting boinc:
while date; do sensors; done > temp.log

Results before running boinc:

Tue Nov 2 13:12:46 CST 2004

adt7463-i2c-2-2e
Adapter: SMBus AMD756 adapter at 50e0
ERROR: Can't get alarm mask data!
CPU0 DDR 2.5:
+2.526 V (min = +2.37 V, max = +2.63 V)
CPU0 DDR VTT:
+1.254 V (min = +1.18 V, max = +1.31 V)
3.3VSB: +3.334 V (min = +3.13 V, max = +3.47 V)
+5V: +5.130 V (min = +4.74 V, max = +5.26 V)
+12V: +12.188 V (min = +11.38 V, max = +12.62 V)
CPU1 Temp:+39.00°C (low = -127°C, high = +127°C)
System Temp:
+35.00°C (low = -127°C, high = +127°C)
CPU0 Temp:+52.00°C (low = -127°C, high = +127°C)
vid: +1.850 V (VRM Version 0.1)

w83627hf-isa-0290
Adapter: ISA adapter
CPU0 Vcore:
+1.57 V (min = +1.42 V, max = +1.57 V) ALARM
CPU1 Vcore:
+1.57 V (min = +1.42 V, max = +1.57 V) ALARM
+3.3V: +3.36 V (min = +3.14 V, max = +3.47 V)
CPU1 DDR VTT:
+1.28 V (min = +1.18 V, max = +1.31 V)
CPU1 DDR 2.5:
+2.54 V (min = +2.37 V, max = +2.62 V)
VDD_CPU0: +1.17 V (min = +1.14 V, max = +1.26 V)
VBat: +0.00 V (min = +3.14 V, max = +3.47 V)
rear temp: +43°C (high = +40°C, hyst = +37°C) sensor = thermistor
I/O panel temp:
+32.0°C (high = +52°C, hyst = +47°C) sensor = thermistor
vid: +2.050 V
alarms:
beep_enable:
Sound alarm disabled

And the last recorded output before the hang:

Tue Nov 2 13:13:11 CST 2004
adt7463-i2c-2-2e
Adapter: SMBus AMD756 adapter at 50e0
ERROR: Can't get alarm mask data!
CPU0 DDR 2.5:
+2.513 V (min = +2.37 V, max = +2.63 V)
CPU0 DDR VTT:
+1.254 V (min = +1.18 V, max = +1.31 V)
3.3VSB: +3.334 V (min = +3.13 V, max = +3.47 V)
+5V: +5.130 V (min = +4.74 V, max = +5.26 V)
+12V: +12.188 V (min = +11.38 V, max = +12.62 V)
CPU1 Temp:+39.00°C (low = -127°C, high = +127°C)
System Temp:
+35.00°C (low = -127°C, high = +127°C)
CPU0 Temp:+52.00°C (low = -127°C, high = +127°C)
vid: +1.850 V (VRM Version 0.1)

w83627hf-isa-0290
Adapter: ISA adapter
CPU0 Vcore:
+1.55 V (min = +1.42 V, max = +1.57 V) ALARM
CPU1 Vcore:
+1.57 V (min = +1.42 V, max = +1.57 V) ALARM
+3.3V: +3.36 V (min = +3.14 V, max = +3.47 V)
CPU1 DDR VTT:
+1.28 V (min = +1.18 V, max = +1.31 V)
CPU1 DDR 2.5:
+2.53 V (min = +2.37 V, max = +2.62 V)
VDD_CPU0: +1.18 V (min = +1.14 V, max = +1.26 V)
VBat: +0.00 V (min = +3.14 V, max = +3.47 V)
rear temp: +43°C (high = +40°C, hyst = +37°C) sensor = thermistor
I/O panel temp:
+32.0°C (high = +52°C, hyst = +47°C) sensor = thermistor
vid: +2.050 V
alarms:
beep_enable:
Sound alarm disabled

All voltages appear within range. I have a 460W power supply so I don't think it is a voltage problem. Two things put doubt in my mind, (1) the last sensors message may not be in the file due to HDD sync, (2) lm_sensors may not be correct, although sensors.conf is from the mobo manufacturer website and output corresponds well with values from BIOS.
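
On doubt (1): one way to narrow it down would be to flush the log to disk after every sample, so that the last reading before a hard hang is more likely to survive. This is a minimal sketch, not the command used for the results above. The function name log_samples, its arguments and the one-second default interval are assumptions for illustration; only date, sensors, sync and sleep are real commands.

```shell
#!/bin/sh
# log_samples [command] [logfile] [count] [interval]
# Appends a timestamp plus the command's output to the log, then calls sync
# so the kernel writes the buffered data out before the next sample is taken.
# All four arguments and their defaults are hypothetical choices.
log_samples() {
    sample_cmd=${1:-sensors}   # command that prints the readings (lm_sensors by default)
    logfile=${2:-temp.log}     # where to append the samples
    count=${3:-0}              # number of samples; 0 means run until interrupted
    interval=${4:-1}           # seconds to wait between samples
    n=0
    while [ "$count" -eq 0 ] || [ "$n" -lt "$count" ]; do
        {
            date
            $sample_cmd        # left unquoted on purpose so "cmd arg" is split into words
        } >> "$logfile"
        sync                   # flush filesystem buffers to disk
        n=$((n + 1))
        sleep "$interval"
    done
}
```

Called as `log_samples sensors temp.log`, it runs until interrupted. sync asks the kernel to write its buffered data out, but a drive's own write cache could still lose the final sample, so this narrows the gap rather than closing it.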

As you can see, the system ran for about 25 seconds before hanging. Usually not that fast. Will try single-CPU kernel next.

Ned Slider

Message 42116 - Posted: 2 Nov 2004, 3:34:09 UTC - in response to Message 42111.  

>
> All voltages appear within range. I have a 460W power supply so I don't think
> it is a voltage problem. Two things put doubt in my mind, (1) the last sensors
> message may not be in the file due to HDD sync, (2) lm_sensors may not be
> correct, although sensors.conf is from the mobo manufacturer website and
> output corresponds well with values from BIOS.
>
> As you can see, the system ran for about 25 seconds before hanging. Usually
> not that fast. Will try single-CPU kernel next.
>
>

I wouldn't worry about whether the absolute sensor values are correct (motherboard voltage readings are notoriously inaccurate, but it doesn't much matter); what you're looking for is a drop in voltage under load. From your data above, it looks OK. The +12V rail supplying the cpu doesn't appear to drop at all between the two sets of data, indicating you have sufficient current. Just out of interest, can you post the make/model of your power supply together with the specs for the individual voltage rails?
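
To make that comparison less error-prone than reading two dumps by eye, the per-rail change can be computed from two saved sensors snapshots. This is a rough sketch, assuming each snapshot is in its own file and laid out like the output posted above (chip name on a line by itself, label before a colon, reading sometimes wrapped onto the next line). The function name voltage_delta and the file names are made up for illustration.

```shell
#!/bin/sh
# voltage_delta IDLE_FILE LOAD_FILE
# Prints "chip/label<TAB>delta" for every voltage reading present in both
# snapshots, where delta = loaded reading minus idle reading, in volts.
voltage_delta() {
    awk '
        FNR == 1 { file++; chip = ""; label = "" }    # a new snapshot starts
        /^[^ :]+$/ { chip = $0; next }                 # chip header, e.g. w83627hf-isa-0290
        {
            line = $0
            if (match(line, /^[^:]+:/)) {              # text before the first colon is the label
                label = substr(line, 1, RLENGTH - 1)
                line = substr(line, RLENGTH + 1)
            }
            # The first "<sign><number> V" after the label is the reading;
            # the min/max limits come later on the same line.
            if (match(line, /[+-][0-9]+\.[0-9]+ V/)) {
                key = chip "/" label
                volts = substr(line, RSTART, RLENGTH - 2) + 0
                if (file == 1)
                    idle[key] = volts
                else if (key in idle)
                    printf "%s\t%+.3f\n", key, volts - idle[key]
            }
        }
    ' "$1" "$2"
}
```

`voltage_delta idle.txt load.txt` prints one line per rail. For the two dumps above, the largest changes are -0.02 V on CPU0 Vcore and -0.013 V on CPU0 DDR 2.5, with +12V unchanged.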

Also, is this with two copies of seti running, one on each processor? If so, do you still get lockups with just one copy running?

Ned

Darren Frith

Message 42128 - Posted: 2 Nov 2004, 5:20:41 UTC - in response to Message 42116.  

> I wouldn't worry if the absolute sensor values are correct (motherboard
> voltage readings are notoriously inaccurate but it doesn't much matter), what
> you're looking for is a drop in voltage under load. From your data above, it
> looks OK. The +12V rail supplying the cpu doesn't appear to drop at all
> between the two sets of data indicating you have sufficient current. Just out
> of interest, can you post the make/model of your power supply together with
> the specs for the individual voltage rails.
>
> Also, is this with two copies of seti running, one on each processor? If so,
> do you still get lockups with just one copy running?
>
> Ned
>
>

Ned,

The power supply is a Zippy HP2-6460P
(http://www.zippy.com.tw/P_product_detail.asp?lv_rfnbr=2&pp_rfnbr=1010&pcp_rfnbr=3&pp_name=&pcp_name=PS2/PS2+%20single&pcpw_rfnbr=4&pp_code=HP2-6460P)

Individual rail specs are:

Output    Current    Current    Load         Line         Ripple & Noise
Voltage   Min. (A)   Max. (A)   Regulation   Regulation   Max. [P-P]
5V        2.5        40.00      ± 5%         ± 1%         60mV
12V       1          32.00      ± 5%         ± 1%         100mV
-5V       0          0.80       ± 5%         ± 1%         100mV
-12V      0          1.00       ± 5%         ± 1%         100mV
3.3V      1          30         +5, -5%      ± 1%         60mV
+5VSB     0.1        2          ± 5%         ± 1%         60mV
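
As a rough cross-check on those figures, multiplying each rail's nominal voltage by its maximum current gives a per-rail wattage ceiling. This is a small sketch, assuming the current columns are in amps; the function name rail_watts is just for illustration.

```shell
#!/bin/sh
# rail_watts: reads "<nominal volts> <max amps>" pairs on stdin and prints
# the wattage ceiling of each rail, followed by the sum of all ceilings.
rail_watts() {
    awk '{
        v = ($1 < 0 ? -$1 : $1)      # use the magnitude for negative rails
        w = v * $2
        total += w
        printf "%sV\t%.1f W\n", $1, w
    }
    END { printf "sum\t%.1f W\n", total }'
}

# Maximum currents copied from the rail specs above (the last line is +5VSB).
rail_watts <<'EOF'
5 40
12 32
-5 0.8
-12 1
3.3 30
5 2
EOF
```

The per-rail ceilings add up to roughly 709 W, well above the 460 W rating, so the rails cannot all be loaded to their maximum at the same time. The +12V line on its own allows up to 384 W.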

The earlier tests were run on a dual processor kernel, but with a single instance of boinc. (This is a bit puzzling in itself. I only see boinc/slots/0, whereas I have boinc/slots/0 and boinc/slots/1 on my dual-P3 machine.)

I rebooted into the single processor kernel and ran boinc so it would definitely be a single instance. It hung just the same as before. It seems that if it is a stress issue then even one instance is too much.

