Two Fried Solaris Servers in less than a week.

Celtic Wolf
Volunteer tester
Joined: 3 Apr 99
Posts: 3278
Credit: 595,676
RAC: 0
United States
Message 112645 - Posted: 19 May 2005, 0:31:52 UTC

OK, I am hoping this is just a coincidence, but having two Solaris servers crash in less than a week at different locations has me wondering.

Both were running the BOINC 4.25 client. One was running Solaris 8 (E-220-R, 2 GB RAM, dual processors); the other, Solaris 9 (Ultra 10, 512 MB RAM, single processor).

It appears in the case of the Ultra 10 to be a power supply issue. The reason for the E-220-R dying is still up in the air.

Has anyone run into a similar issue with a Solaris System just dying? Both of these systems are heavily monitored and until they died there were no indications of failure.

I have shut down BOINC on my other Solaris systems until I can rule out BOINC being the issue.

Before anyone asks, both systems are or were well cooled!!

Ned Slider
Joined: 12 Oct 01
Posts: 668
Credit: 4,375,315
RAC: 0
United Kingdom
Message 112779 - Posted: 19 May 2005, 10:48:36 UTC - in response to Message 112645.  


It appears in the case of the Ultra 10 to be a power supply issue. The reason for the E-220-R dying is still up in the air.

Before anyone asks, both systems are or were well cooled!!



Well, if they were well cooled, it's not that.

Running SETI does put extra load on the CPU, which in turn draws extra current from the supply. On PC-based systems I've observed this first hand, both in my UPS logs and by simply sticking a voltmeter on the supply lines and watching the voltages drop slightly. A quality supply will handle this without a problem, but a flaky supply could theoretically be pushed past its limits and fail.

Ned


*** My Guide to Compiling Optimised BOINC and SETI Clients ***
*** Download Optimised BOINC and SETI Clients for Linux Here ***

MikeSW17
Volunteer tester
Joined: 3 Apr 99
Posts: 1603
Credit: 2,700,523
RAC: 0
United Kingdom
Message 112792 - Posted: 19 May 2005, 11:31:26 UTC - in response to Message 112779.  

Bear in mind, though, that BOINC apps are not the only software that can use a CPU and memory fully, up to their designed limits. BOINC does not overclock a system or push it beyond what it was designed or built for.
If the fault was caused by full load conditions, BOINC just happened to be there at the time, and shouldn't get the blame.

Ned Slider
Joined: 12 Oct 01
Posts: 668
Credit: 4,375,315
RAC: 0
United Kingdom
Message 112798 - Posted: 19 May 2005, 11:35:45 UTC - in response to Message 112792.  


If the fault was caused by full load conditions, BOINC just happened to be there at the time, and shouldn't get the blame.


Quite right, Mike. I was just trying to put forward a possible explanation as to why the machines may have failed. Presumably, if the original poster is able to run BOINC on them, then they weren't originally being run with a constantly full CPU load.

Ned

Celtic Wolf
Volunteer tester
Joined: 3 Apr 99
Posts: 3278
Credit: 595,676
RAC: 0
United States
Message 112803 - Posted: 19 May 2005, 11:44:12 UTC - in response to Message 112798.  

Neither system exceeded 3% use. It's just strange that two systems whose only commonality was BOINC failed in the same manner.

It is not yet proven that the power supplies are definitely at fault. If the CPU fried, the symptoms would be similar to a power supply output failure.


Celtic Wolf
Volunteer tester
Joined: 3 Apr 99
Posts: 3278
Credit: 595,676
RAC: 0
United States
Message 112965 - Posted: 19 May 2005, 22:41:02 UTC

Well, for those who care: I have determined the cause of failure, and it was indeed coincidental that they both occurred at the same time.

The primary drive on my E-220-R is failing, and the power supply failed on the Ultra 10. Unrelated events.

I am once again BOINC'in on the other servers..

Kathy
Joined: 5 Jan 03
Posts: 338
Credit: 27,877,436
RAC: 0
United States
Message 112988 - Posted: 20 May 2005, 0:13:53 UTC

Sorry that happened to ya, CW. Glad to read that you located the problems and are back to crunching!

Kathy

