Strangely Normal (Sep 10 2007)

Message boards : Technical News : Strangely Normal (Sep 10 2007)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 638320 - Posted: 10 Sep 2007, 20:11:28 UTC
Last modified: 10 Sep 2007, 20:11:46 UTC

So it was a busy weekend, with our focus mostly on thumper (the science database server). There were actually two separate problems. First, three drives failed within four days, somewhat spuriously. We are fairly convinced at this point that they didn't actually fail: I took them out of RAID control this morning and am heavily exercising them without any errors. Why they seemed to fail is still a mystery. We are running an older version of Fedora Core on this system and therefore an older version of mdadm. Or is it drive controller issues? Or just error-level thresholds that need tweaking to be less hypersensitive to transient I/O issues? Second, perhaps due to all the above, an index in the database got corrupted and needed to be dropped/rebuilt, which took all of Thursday night to Friday afternoon to complete. Add all this up and we weren't able to create/assimilate new work for most of the weekend. I did get the assimilators going on Friday night, and when the smoke cleared Jeff got the splitters running on Saturday. So far so good.
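
To illustrate the "less hypersensitive thresholds" idea: below is a minimal sketch of a sliding-window error counter that only flags a drive when several I/O errors land close together, so isolated transient errors are ignored. This is not how mdadm or the kernel md driver decides to fail a RAID member; the class name, the threshold and the window length are all made up for illustration.

```python
from collections import deque


class ErrorWindow:
    """Flag a drive only if it logs too many I/O errors within a short window."""

    def __init__(self, max_errors=3, window_seconds=60.0):
        self.max_errors = max_errors          # hypothetical threshold
        self.window_seconds = window_seconds  # hypothetical window length
        self.timestamps = deque()             # times of recent errors, oldest first

    def record_error(self, now):
        """Record one I/O error at time `now` (in seconds).

        Returns True if the number of errors still inside the window
        exceeds the threshold, i.e. the drive should be flagged.
        """
        self.timestamps.append(now)
        # Forget errors that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_errors
```

With a setup like this, one error every few minutes never trips the flag, while a burst of errors within the window does.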

We were expecting more spurious disk failures, but so far nothing. In fact today has been strangely normal. Tomorrow we may try implementing a method of distributing workunits around our local network so we aren't so choked on that one NAS server which can only do so much. We need to get more headroom before we can try to win participants back. As it stands now given our current level of redundancy we can barely keep up with demand.
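
One simple way to spread workunits over several machines is to pick the storage host from a hash of the workunit name. The sketch below shows only that general idea; it is not the method the lab actually planned, and the host names and the function name are hypothetical.

```python
import hashlib

# Hypothetical storage hosts; the real server names are not given in this thread.
STORAGE_HOSTS = ["storage-a", "storage-b", "storage-c"]


def host_for_workunit(wu_name, hosts=STORAGE_HOSTS):
    """Pick a storage host deterministically from the workunit name."""
    digest = hashlib.sha256(wu_name.encode("utf-8")).digest()
    # Use the first 4 bytes of the hash as an integer, then wrap to the host count.
    index = int.from_bytes(digest[:4], "big") % len(hosts)
    return hosts[index]
```

Because the mapping depends only on the name, any machine can work out where a file lives without a lookup table. The downside is that changing the host list moves most files to a different host.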

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 638320
Wasabi Peanut
Joined: 14 Jul 99
Posts: 62
Credit: 32,646,911
RAC: 0
Switzerland
Message 638341 - Posted: 10 Sep 2007, 20:28:54 UTC

Thanx for the news, Matt! Things have moved towards smooth sailing a great deal on my boxes over the last 24 hours - excellent!!

But as you say, keeping up with demand appears to be a challenge for the current architecture. I'm curious to see the effects of distributed WU-storage in action. Speaking of it: ever thought about a SAN? Seems to me like it would give you a great deal of welcome flexibility beyond the much needed performance boost. Just a thought...

Kudos to everyone @ Berkeley for their hard work!

Cheers,

Ron
ID: 638341
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 638384 - Posted: 10 Sep 2007, 21:47:44 UTC - in response to Message 638341.  

Correct me if I'm wrong but isn't a SAN failure what Rosetta just recovered from?



PROUD MEMBER OF Team Starfire World BOINC
ID: 638384
Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 638466 - Posted: 10 Sep 2007, 23:11:18 UTC

Thank You Matt - and as Well to the others @ Berkeley

> note: RAC is finally startin' to count right ;)
ID: 638466
Agnostic Pope
Joined: 25 May 99
Posts: 20
Credit: 118,354
RAC: 0
United States
Message 638480 - Posted: 10 Sep 2007, 23:24:17 UTC - in response to Message 638384.  

Correct me if I'm wrong but isn't a SAN failure what Rosetta just recovered from?

Yup.

Be careful what you wish for ... you might get it!
ID: 638480
Howard
Joined: 20 Dec 01
Posts: 1
Credit: 171,270
RAC: 0
United Kingdom
Message 638493 - Posted: 10 Sep 2007, 23:33:08 UTC

Well, talking of getting "participants back", I'm one. Joined the project a long, long time ago. I stopped some time around when BOINC came along. Over the past 6 months or so I have been using Linux a lot more and have decided to help out again. Came to look at the forums when my client stopped working; nice to see the community still here. Sad to see that I'm still reasonably high on the stats for my join date even though I haven't been part of the project in over a year.
ID: 638493
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20359
Credit: 7,508,002
RAC: 20
United Kingdom
Message 638773 - Posted: 11 Sep 2007, 10:30:17 UTC - in response to Message 638493.  

Well, talking of getting "participants back", I'm one. Joined the project a long, long time ago. I stopped some time around when BOINC came along. ...

Welcome back to the great crunch!

We're now on very new data... Until Arecibo gets closed that is...

Happy crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 638773
[DPC]TeamGrazzie~MoMurdaSquad~Oet
Joined: 19 Mar 02
Posts: 3
Credit: 3,619,841
RAC: 0
Netherlands
Message 638995 - Posted: 11 Sep 2007, 22:06:20 UTC - in response to Message 638320.  


We were expecting more spurious disk failures, but so far nothing. In fact today has been strangely normal. Tomorrow we may try implementing a method of distributing workunits around our local network so we aren't so choked on that one NAS server which can only do so much. We need to get more headroom before we can try to win participants back. As it stands now given our current level of redundancy we can barely keep up with demand.

- Matt


Hi Matt,

Appreciate all the time and effort you guys put in this project, it's great :)

Meanwhile, it may help if you could make some kind of graphical representation of the current infrastructure (meaning the SETI servers, networking and data flow). There might be a handful of participants with enough knowledge who are willing to help you come up with an improved design?
It's just a thought though..
ID: 638995
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20359
Credit: 7,508,002
RAC: 20
United Kingdom
Message 639374 - Posted: 12 Sep 2007, 11:24:58 UTC - in response to Message 638995.  

Meanwhile, it may help if you could make some kind of graphical representation of the current infrastructure (meaning the SETI servers, networking and data flow)...

I think that'd need a webcam on their lab whiteboard!

Also, beware of too many 'chefs' causing confusion and spoiling the 'broth'...


I'm sure Matt et al. will make contact off-forum/off-list whenever there is a need.

Enjoy the ride!

Happy crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 639374
speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 639409 - Posted: 12 Sep 2007, 12:19:19 UTC


Meanwhile, it may help if you could make some kind of graphical representation of the current infrastructure (meaning the SETI servers, networking and data flow). There might be a handful of participants with enough knowledge who are willing to help you come up with an improved design?
It's just a thought though..


I did that once for my network - took me a whole day...

The problem at the lab is not the lack of knowledge or good ideas, but the lack of TIME and MONEY!

If Matt is stuck somewhere, I'm sure he will ask and we can all try to help.


mic.


ID: 639409
haddock29
Joined: 18 Sep 99
Posts: 36
Credit: 26,012,417
RAC: 0
France
Message 639583 - Posted: 12 Sep 2007, 19:03:47 UTC - in response to Message 638320.  

So it was a busy weekend, with our focus mostly on thumper (the science database server). There were actually two separate problems. Three drives within four days failed somewhat spuriously. We are fairly convinced at this point that they didn't actually fail - I actually took them out of RAID control this morning and am heavily exercising them without any errors. Why they seemed to fail is still a mystery. We are running an older version of Fedora Core on this system and therefore an older version of mdadm. Or is it drive controller issues? Or just error-level threshholds that need tweaking to be less hypersensitive to transient I/O issues? Meanwhile, perhaps due to all ...
- Matt


Some years ago I ran into a similar mystery with a 3ware ATA RAID card and 120 GB ST (maybe WD) disks. The vendor was unable to fix it. In fact the problem was documented, and we had to change the firmware of the disks; it was something related to vibrations. The symptom was random disk failures (after 2 years of continuous operation), and the disks tested alone were apparently good, as were the cards, the cables and so on. Everything has been OK since the firmware upgrade (there was a tool to do that on the 3ware website), which means 2 more years of continuous operation.
I'd be happy if that helps you.

Alain
ID: 639583
Hawksfollow
Joined: 23 Feb 03
Posts: 9
Credit: 19,348
RAC: 0
United States
Message 657310 - Posted: 10 Oct 2007, 14:05:56 UTC - in response to Message 638493.  

I'm also one who came back, and with a new PC: the other one caught fire, with smoke, sparks and flame. Have a dumb question, hope this is the right place to ask. I linked my account using my first email, which had only 14 WU. Can't figure out how to add or change to the email with over 500 WU. Any way to do this? Thanks in advance. Glad to be back running Seti again!
ID: 657310
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 657403 - Posted: 10 Oct 2007, 17:21:18 UTC - in response to Message 657310.  

You should ask this question again in the number crunching forums — you're more likely to get a reply...

Hello, from Albany, CA!...
ID: 657403

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.