Power - the Reunion Tour (Jun 11 2012)

Message boards : Technical News : Power - the Reunion Tour (Jun 11 2012)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1244754 - Posted: 11 Jun 2012, 22:17:30 UTC

Kind of a bumpy weekend. So we moved that database (which handles the seti.berkeley.edu website) from Dan's new but oddly crashy desktop onto my new desktop. Then over the weekend MY new desktop started crashing at random. You'd think this would now clearly be related to the database, but Dan's desktop continued to crash after we moved the MySQL database off of it. And upon further inspection, both systems sometimes crash before the OS is even loaded.

So this looks like a hardware problem after all. Funny how both of these new systems are failing in the same manner. We think it has to do with the power outages from a couple weeks ago sending some jolts into these perhaps more sensitive systems.

But speaking of outages, completely separate from those previous power issues (which have since been fixed), there was a brand new problem affecting just this building (and all the projects within it, including SETI@home/BOINC). This one was worse: it started in the middle of the night, and by the time anybody could do anything the power had gone up and down several times, with some outlets delivering half power, etc.

The repairs were much faster, and we were stable again around noon, but upon turning everything back on we found we completely lost thinman, the main web server. Totally dead. However, quite luckily, we happened to have a spare old frankenstein machine kicking around, and I was able to do a "brain transplant" i.e. swap the drives from thinman to this other machine. Now this other machine thinks it is thinman and is working quite well as a web server. Dodged a major bullet there.

I also happened to have my old desktop nearby, so I'm using that as I diagnose the new crashy one. Not sure who is responsible for all these damages and lost time, but it definitely shouldn't be us.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1244754
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1244764 - Posted: 11 Jun 2012, 22:28:48 UTC - in response to Message 1244754.  

Thanks for the update Matt,

Claggy
ID: 1244764
Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30608
Credit: 53,134,872
RAC: 32
United States
Message 1244769 - Posted: 11 Jun 2012, 22:41:08 UTC

Thanks for the update, and let us know if you need a petition drive to hold the powers that be responsible for the damage.

Actually, it wouldn't surprise me if the first outage stressed something in the building and it went.


ID: 1244769
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 1244805 - Posted: 12 Jun 2012, 0:01:31 UTC - in response to Message 1244754.  

Now would be a great time to get the funds for those whole-closet UPS devices. How much could that possibly cost the school? ;)
ID: 1244805
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1244824 - Posted: 12 Jun 2012, 1:11:20 UTC - in response to Message 1244805.  
Last modified: 12 Jun 2012, 1:12:51 UTC

Now would be a great time to get the funds for those whole-closet UPS devices. How much could that possibly cost the school? ;)

Or at the very least, some line conditioners, which are usually built into UPS units. Line conditioners will clean up noisy power, and most of the time they also handle very strong surges just fine. That may help keep weird power scenarios from taking out machines. Or dirty/noisy power may be what is causing those strange and random crashes.

One of my long-since retired crunchers continues to do other things for me around the house, and it was acting weird and would randomly crash. Sometimes it would go weeks between crashes; other times it would crash repeatedly for an hour or so. I ran memtest on it and discovered the RAM needed more voltage. Instead of the 2.6 V that it wanted, I already had the board set for 2.8 V, so I had to crank it to 2.9 V, and that fixed it.

Might just be a power issue, either internal or external.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1244824
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1244871 - Posted: 12 Jun 2012, 3:08:23 UTC - in response to Message 1244754.  
Last modified: 12 Jun 2012, 3:10:24 UTC

... Dan's new but oddly crashy desktop on my new desktop. Then over the weekend MY new desktop started crashing at random. You'd think this is now clearly related to the database, but Dan's desktop continued to crash after moving the mysql database off of it. And upon further inspection both systems sometimes crash before the OS is even loaded.

So this looks like a hardware problem after all. Funny how both of these new systems are failing in the same manner. ...


One relatively new possibility, in addition to the usual checks, is quick & easy to eliminate. There's been a general trend lately to supply XMP-profile (or other high-frequency, tight-latency) memory that by default runs at 'normal' voltages, below what the profile calls for.

After a typical burn-in period of 14 hours or so, the crashy symptoms appear & gradually worsen over time. Heavy RAM usage patterns in particular then throw either the controller or the RAM modules over the edge, while memtests often come back clean.

The quick check is to make sure the DIMM voltage matches the XMP profile spec, and that VID (memory controller in the CPU) is set to about 70% of that (which is for impedance matching purposes, maximising signal integrity & stopping the memory controller sinking excessive current).
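
A minimal sketch of that quick check in Python. The 70% ratio is the rule of thumb given above; the function names, the 0.02 V tolerance, and the example voltages are illustrative assumptions, not values read from the lab's machines.

```python
def dimm_voltage_matches(configured_v, xmp_spec_v, tolerance_v=0.02):
    """True if the configured DIMM voltage is within tolerance of the XMP spec.

    tolerance_v is an assumed margin, not a figure from any vendor spec.
    """
    return abs(configured_v - xmp_spec_v) <= tolerance_v


def target_controller_voltage(dimm_v, ratio=0.70):
    """Memory-controller voltage suggested by the ~70% rule of thumb."""
    return round(dimm_v * ratio, 3)


# Hypothetical example: the profile asks for 1.65 V, the board is set to 1.50 V.
xmp_spec_v = 1.65
configured_v = 1.50

if not dimm_voltage_matches(configured_v, xmp_spec_v):
    print(f"DIMM voltage {configured_v} V does not match the profile ({xmp_spec_v} V)")

print(f"Controller target: about {target_controller_voltage(xmp_spec_v)} V")
```

Treat the result as a starting point for the BIOS settings only; the right values depend on the specific board, CPU and modules.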

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1244871
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1244955 - Posted: 12 Jun 2012, 7:05:42 UTC

I have wondered this out loud before, but doesn't the campus have some kind of comprehensive insurance coverage that might cover the loss of equipment in cases like this?
I find it hard to believe that lab and computer equipment might not be covered.
Even most basic homeowner's insurance covers this kind of thing for example, in the case of a lightning strike.

It might be worthwhile to ask some serious questions of the proper authorities.....

Just sayin'.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1244955
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1244988 - Posted: 12 Jun 2012, 10:02:56 UTC - in response to Message 1244978.  

No UPS will last for a 5 or 6 hour outage,

They can, but it takes big batteries.
The main use for UPSs is protection from surges, brownouts & power failures. If the failure is long enough, the UPS allows the hardware to be shut down normally.
Larger UPS units are designed to keep systems up till such time as a backup generator can come online, and then keep things up when that shuts down & the system switches back to mains power.
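
As a rough back-of-the-envelope sketch of why a multi-hour outage takes big batteries: runtime is approximately the usable battery energy divided by the load. The function names, the 90% inverter efficiency, and the example load below are assumptions for illustration only; real runtime also depends on battery age and discharge rate.

```python
def ups_runtime_hours(battery_wh, load_w, inverter_efficiency=0.9):
    """Approximate runtime in hours for a given battery capacity and load."""
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    return battery_wh * inverter_efficiency / load_w


def battery_wh_needed(load_w, hours, inverter_efficiency=0.9):
    """Approximate battery capacity (Wh) needed to carry a load for some hours."""
    if inverter_efficiency <= 0:
        raise ValueError("inverter_efficiency must be positive")
    return load_w * hours / inverter_efficiency


# Hypothetical server closet drawing 2000 W.
load_w = 2000
print(f"6 h outage needs roughly {battery_wh_needed(load_w, 6):.0f} Wh of battery")
print(f"10 min shutdown window needs roughly {battery_wh_needed(load_w, 10 / 60):.0f} Wh")
```

The gap between the two figures is the difference between riding out a long outage and just buying time for a clean shutdown or a generator start.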
Grant
Darwin NT
ID: 1244988
Cheopis
Joined: 17 Sep 00
Posts: 156
Credit: 18,451,329
RAC: 0
United States
Message 1244997 - Posted: 12 Jun 2012, 10:28:33 UTC
Last modified: 12 Jun 2012, 10:30:31 UTC

I do not think it is reasonable to try to get a UPS system that will do more than protect the machines, and allow them enough time to gracefully power off after a short timeframe running with no power. Maybe 10 minutes.

Power conditioning and voltage regulation, if they are not already a part of the lab's UPS system, should be considered. Every time you have an outage like this one (especially in an older building), some other part of the electrical system gets stressed. You might have cascading problems every few weeks for the next year before everything is all ironed out.
ID: 1244997
Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1245056 - Posted: 12 Jun 2012, 14:17:38 UTC - in response to Message 1244997.  

We've floated the idea of power stabilizing hardware to the lab. I'll let everyone know if they decide they'd like some of the same.

It's heartbreaking that our two new workstations got crippled, but given the past few weeks it's understandable. We'll replace the damaged components ASAP once Matt et al. figure out the issues.


Executive Director GPU Users Group Inc. -
brad@gpuug.org
ID: 1245056
edjcox
Joined: 20 May 99
Posts: 96
Credit: 5,878,353
RAC: 0
United States
Message 1245745 - Posted: 14 Jun 2012, 5:53:17 UTC

Even some small UPS equipment for the PCs would help keep the power gremlins from disturbing circuitry and such and shortening lifespan. I have all my gear at home on UPS for graceful shutdown and power conditioning at all times...

Find out who your campus engineer is and raise hell ... Let people know they are destroying equipment with their shenanigans. This should be upchanneled as much as possible to let management know this is costing them money, time, equipment...
Never engage stupid people at their level, they then have the home court advantage.....
ID: 1245745

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.