Outage notice

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13752
Credit: 208,696,464
RAC: 304
Australia
Message 63025 - Posted: 11 Jan 2005, 11:10:56 UTC - in response to Message 62935.  


> It is easy to forgive the Seti@Home project for not having diesel-powered
> generators behind their UPS'es. As long as every mains outage is announced or
> lasts shorter than their UPS endurance, or there is staff at call well within
> the same time limit, everything can be shut down in a controlled manner (like
> this time).

Even cheap UPSs have software so that when the battery reaches its critical level, it can shut the connected computer(s) down.
Even better UPSs have software that allows some outputs to be shut down just after the initial power failure, others to be powered down after 30 min, 45 min (or whatever is appropriate) & the critical ones to be shut down only when the batteries have reached their end.
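A minimal sketch of that tiered-shutdown idea (the tier times and load names below are made-up examples, not SETI@home's actual configuration; real UPS packages such as apcupsd or NUT hook this kind of logic into their own event systems):

    # Sketch of a tiered UPS shutdown policy: map "minutes on battery"
    # to the loads that may be dropped at that point.
    TIERS = [
        (0,  ["workstations", "printers"]),   # shed immediately on power failure
        (30, ["web-frontends"]),              # after 30 min on battery
        (45, ["file-servers"]),               # after 45 min
        # critical loads (e.g. the database) only go down at battery exhaustion
    ]

    def loads_to_shed(minutes_on_battery):
        shed = []
        for threshold, loads in TIERS:
            if minutes_on_battery >= threshold:
                shed.extend(loads)
        return shed

    print(loads_to_shed(35))   # ['workstations', 'printers', 'web-frontends']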


> But a disk failure causing trouble upon booting afterwards? Does this mean
> that the almost impossible happened, that two (or more) disks in the same
> array failed at the same time?

Almost impossible to happen?
More like almost impossible not to happen.
The more drives you have, and the more you have of the same age & same number of operating hours, the more likely multiple failures are.
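To put rough numbers on that (assuming, purely for illustration, independent failures at a 3% annual rate per drive; drives of the same age and hours fail in a correlated way, which only makes the odds worse):

    # Probability of at least one, and at least two, drive failures in a
    # year, for arrays of various sizes. The 3% annual failure rate is an
    # assumed illustrative figure, and independence is the optimistic case.
    p = 0.03
    for n in (8, 24, 48):
        p_none = (1 - p) ** n
        p_one = n * p * (1 - p) ** (n - 1)
        print(f"{n} drives: P(>=1 failure) = {1 - p_none:.1%}, "
              f"P(>=2 failures) = {1 - p_none - p_one:.1%}")

With 48 drives, the chance of at least one failure in a year is already around 77%, and of two or more around 42%.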


> Hard disks are relatively inexpensive and known
> to fail from time to time, surely they must have fault-tolerance throughout
> all of their storage. Don't they?

Pretty sure that up until recently they've been relying on software RAID - hence some very severe bottlenecks.
I believe they now have hardware RAID, but I don't know if it's been implemented yet, & I don't know what type of RAID is being used.
Grant
Darwin NT
ID: 63025
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13752
Credit: 208,696,464
RAC: 304
Australia
Message 63027 - Posted: 11 Jan 2005, 11:13:13 UTC - in response to Message 62939.  

> All it would take is one carbon trail, falling wire,
> inadvertantly thrown switch to feed the generator voltage back out to where
> the men are working.

Hence the use of physical isolation, correct safety labelling & the following of safety procedures.
Grant
Darwin NT
ID: 63027
kinnison
Joined: 23 Oct 02
Posts: 107
Credit: 7,406,815
RAC: 7
United Kingdom
Message 63029 - Posted: 11 Jan 2005, 11:16:19 UTC

I used to work for a major bank, looking after their mainframe servers. We avoided powering down wherever possible, but occasionally we had to for OS updates and the like. I've never once seen both halves of a mirrored disk go down! I suppose that's the equivalent of RAID.
Of course our development servers didn't have mirrored disks; we ran backups of most files to tape once a day.
I'm guessing Borkeley doesn't mirror their disks.
ID: 63029
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13752
Credit: 208,696,464
RAC: 304
Australia
Message 63032 - Posted: 11 Jan 2005, 11:20:16 UTC - in response to Message 63029.  


> I've never once seen both halves of a mirrored disk go down! I suppose that's
> the equivalent of RAID.

RAID 1 = mirroring.

> I'm guessing Borkeley doesn't mirror their disks.

Pretty sure they're using SCSI drives, & there's nothing cheap about even entry-level SCSI drives (compared to IDE ones), so full mirroring would be horrendously expensive.
Grant
Darwin NT
ID: 63032
John Hunt
Volunteer tester
Joined: 3 Apr 99
Posts: 514
Credit: 501,438
RAC: 0
United Kingdom
Message 63042 - Posted: 11 Jan 2005, 11:50:47 UTC - in response to Message 62979.  

> In reply to John Hunt's comment:
>
> " I've learnt my lesson the hard way - what seemed like a good machine when I
> first started has turned out to be a turkey! "
>
> Here is Jim's rule for buying a computer.
>
> On a scale of 1 - 10 with 10 being the best. You decide that you are going to
> buy/build one at around level 4. You can afford it with only minor pain. After
> looking around you decide that one at level 5 would suit your present and
> future needs better. You don't want to spend that much but you rationalize it
> out that 5 would be worth it, pain be darned, and besides you can probably
> slip a 5 past the wife if you work it right. When you have finally talked
> yourself into a 5, buy a 6 and you will be happy with it. It will be worth the
> pain, and the pain will fade... in time. :)
>
LOL at that good advice, Jim!
I'm going for a 6 this time... good news is that I'm still single (and sane!).
ID: 63042
Brickhead
Joined: 4 Oct 03
Posts: 26
Credit: 2,156,744
RAC: 0
Norway
Message 63089 - Posted: 11 Jan 2005, 16:18:02 UTC

@mmciastro
> I hope I can help on the generator issue. I have an electrical background, and I
> can say that the building in Berkeley probably has a step-down transformer
> which converts (transforms) a higher voltage of 2400, 4800, 8000, or 12 kV. A
> transformer works in both directions. It usually converts the higher voltage
> into (say) 480 VAC for building heating, cooling, motor loads, lighting, etc.
>
> If you attach a 480 V three-phase generator, then that same 480 V is converted back
> up to the exceedingly deadly voltage. As a general rule, AC voltage can arc
> (jump) through open air roughly 1 inch for each 1000 volts. Now the
> contractor attaching the new service to the new building isn't going to take
> any chances. He'll insist that any possible source of a back-fed voltage is
> nonexistent. All it would take is one carbon trail, falling wire, or
> inadvertently thrown switch to feed the generator voltage back out to where
> the men are working.

Good point!

That's one of the reasons why, where I come from, generator and intake lines are isolated by break-before-make switches. Voltage fed back from the generator is thus impossible unless you physically short those two circuits with wires, and even if you did, that would trigger the generator control to immediately kill the generator output. And then there are the intake breakers (on the supply side of all this) that the electricity supplier controls.

But as UC Berkeley won't go bankrupt if the SSL shuts down its machinery for a day, they can afford to provide an extra layer of safety.

@Grant (SSSF)
> Even cheap UPSs have software so that when the battery reaches its critical
> level, it can shut the connected computer(s) down.
> Even better UPSs have software that allows some outputs to be shut down
> just after the initial power failure, others to be powered down after 30 min,
> 45 min (or whatever is appropriate) & the critical ones to be shut down only
> when the batteries have reached their end.

Sorry, I didn't take those UPS types into account. The type I've grown accustomed to is one where two of them handle the load of a couple of hundred servers (normal operation), and one of them could do the job alone (UPS failure or maintenance). In this scenario, the software approach isn't all that feasible.

> Almost impossible to happen?
> More like almost impossible not to happen.
> The more drives you have, and the more you have of the same age & same
> number of operating hours, the more likely multiple failures are.

Are we still talking about double drive failure *in one single array*?

> Pretty sure they're using SCSI drives & there's nothing cheap about even
> entry level SCSI drives (compared to IDE ones) & so full mirroring would
> be horrendously expensive.

I actually hadn't considered the possibility of IDE or S-ATA drives being used. IMO, even top-of-the-line 15krpm SCSI320 or FC-AL disks are cheap compared to the rest of a (serious) server. As for high-capacity storage, I agree that mirroring wouldn't be the most economical form of fault tolerance. But with most databases (and often file storage too), disks are read from far more often than they are written to, so the write performance penalty of a RAID5 array would rarely warrant the added cost of mirroring anyway.
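A back-of-the-envelope sketch of that trade-off (the per-disk IOPS and array size are assumed figures, and this counts only small random writes that trigger RAID5's read-modify-write; full-stripe writes are cheaper):

    # Rough effective random-I/O capacity of an 8-disk array.
    # RAID1 write = 2 disk writes (one per mirror half); RAID5 small
    # write = 4 I/Os (read old data, read old parity, write both back).
    disk_iops = 150   # assumed figure for a 10-15k rpm SCSI drive
    n = 8             # assumed number of disks in the array

    read_iops        = n * disk_iops        # reads spread over all members
    raid1_write_iops = n * disk_iops / 2
    raid5_write_iops = n * disk_iops / 4

    print(f"reads ~{read_iops}, RAID1 writes ~{raid1_write_iops:.0f}, "
          f"RAID5 writes ~{raid5_write_iops:.0f} IOPS")

On a read-mostly database, the halved write throughput of RAID5 relative to RAID1 costs little, which is why it is usually the economical choice.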

In this case, I don't think service unavailability means disaster (actually, you made me realize that). But even so, how many extra disks could one buy for the time and cost of repairing a disk, rebuilding the data that was lost, and recovering what was not?

Cheers (and thanks for an interesting and educational discussion - both of you).
ID: 63089
AthlonRob
Volunteer developer
Joined: 18 May 99
Posts: 378
Credit: 7,041
RAC: 0
United States
Message 63093 - Posted: 11 Jan 2005, 16:47:47 UTC - in response to Message 62935.  

> It is easy to forgive the Seti@Home project for not having diesel-powered
> generators behind their UPS'es. As long as every mains outage is announced or
> lasts shorter than their UPS endurance, or there is staff at call well within
> the same time limit, everything can be shut down in a controlled manner (like
> this time). So diesels would rarely be needed and the expense probably
> unjustifiable.

You need to remember, none of these servers are running Windows. They're all Solaris or Linux, as far as I know.

As such, if you know the power is out (or is going to be out), you don't need to be on-site to shut them down cleanly. UNIX was designed to be administered remotely - just SSH into them. :-)
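A minimal sketch of that remote, scripted shutdown (the hostnames are hypothetical, and it assumes key-based SSH auth and appropriate sudo rights on each machine):

    import subprocess

    # Hypothetical server list; order matters - stop the machines holding
    # open database connections before the database server itself.
    SERVERS = ["scheduler", "upload1", "upload2", "db-server"]

    for host in SERVERS:
        # Ask each remote host to halt cleanly over SSH.
        subprocess.run(["ssh", host, "sudo", "shutdown", "-h", "now"], check=False)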

In this case, I was told the systems were cleanly shut down before the power went out *and unplugged* just to stay on the safe side of things. The addition of UPSs to the system is relatively new, by the way. I wouldn't be surprised if not all the servers were on UPSs... remember the outages last year due to quick power flickers and the resulting database corruption (during Beta)?

I would imagine the disk(s) went out simply because when you power something like a disk down and back up, it's more likely to die than when it's sitting there humming along.
Rob
ID: 63093
kinnison
Joined: 23 Oct 02
Posts: 107
Credit: 7,406,815
RAC: 7
United Kingdom
Message 63098 - Posted: 11 Jan 2005, 17:48:06 UTC

I was only talking from my own experience in a banking production environment. There, we have customer-critical information, so it's hardly surprising we mirror our disks to make sure the data is intact.
Actually, we had a disk go down every month or two - but an engineer can replace it within 12 hrs and it's "revived" within another 12 or so.
The chance of both halves of a mirrored disk going down within 24 hrs of each other is minuscule - I know the London Stock Exchange uses the same mainframe system that we do, and they live with that possibility too.
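The arithmetic backs that up. A rough sketch (the MTBF figure and repair window are assumptions, and it treats the two halves as failing independently, which correlated wear undermines somewhat):

    # Chance the surviving half of a mirror also dies during the ~24 h
    # replace-and-rebuild window, assuming an illustrative 500,000-hour
    # MTBF per drive and a constant-rate (exponential) failure model.
    mtbf_hours = 500_000
    window_hours = 24

    p_second_failure = window_hours / mtbf_hours   # first-order approximation
    print(f"P(partner dies during rebuild) ~ {p_second_failure:.4%}")
    # ~0.0048% per incident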

I suppose Borkeley don't exactly consider the SETI servers production equipment :-)

ID: 63098
Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 63103 - Posted: 11 Jan 2005, 17:59:26 UTC

Most electronics fails when turned on ... the "inrush" causes the problems. The drive was probably running with a "latent" defect that made it fail at the next power cycle.

The military went from part testing to line qualification for this very reason. Testing the parts to make sure that they were MIL-SPEC introduced latent defects that would, in fact, make the parts fail early. With line qualification, you test parts to destruction and use that to validate that the production process (production line) produces parts to the desired quality standard.

Personal experience demonstrated I was better off using Radio Shack parts instead of MIL-SPEC ones in many cases ...
ID: 63103
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13752
Credit: 208,696,464
RAC: 304
Australia
Message 63177 - Posted: 12 Jan 2005, 2:07:37 UTC - in response to Message 63089.  


> > Almost impossible to happen?
> > More like almost impossible not to happen.
> > The more drives you have, and the more you have of the same age & same
> > number of operating hours, the more likely multiple failures are.
>
> Are we still talking about double drive failure *in one single array*?

No, just the storage subsystem as a whole.
But in my experience, the more of an inconvenience something is likely to be, the more likely it is to occur...


> > Pretty sure they're using SCSI drives, & there's nothing cheap about even
> > entry-level SCSI drives (compared to IDE ones), so full mirroring would
> > be horrendously expensive.
>
> I actually hadn't considered the possibility of IDE or S-ATA drives being
> used. IMO, even top-of-the-line 15krpm SCSI320 or FC-AL disks are cheap
> compared to the rest of a (serious) server. As for high-capacity storage, I
> agree that mirroring wouldn't be the most economical form of fault tolerance.
> But with most databases (and often file storage too), disks are read from far
> more often than they are written to, so the write performance penalty of a
> RAID5 array would rarely warrant the added cost of mirroring anyway.

Pity that space is so limited, & that someone won't just donate a SAN for BOINC...


From here.
Grant
Darwin NT
ID: 63177