Technical News |
![]() |
| log in |
|
The news items below address various issues requiring more technical detail than
would fit in the regular news section on our front page.
These news items are all posted first in the
Technical News discussion forum,
with additional comments/questions from our participants.
(available as an RSS feed.) |
|
12 Jan 2012 | 21:01:42 UTC Hello people. I'm actually about to head out again shortly (once again for a whole month) so let me get y'all caught up before I disappear. Let's see. There's been a lot of the usual hiccups over the past couple of weeks. Overloaded servers locking up and requiring a hard restart, drives failing and being replaced, bringing machines down on purpose to upgrade the OS, etc. No singular event was tragic or noteworthy, but the quantity of such events has been slightly higher than normal. Meanwhile various projects have been pushing along. After enough analysis, database tweaking, and data dumping/reloading, we finally created some test "small signal tables" containing the top 1% signals on which to do our final analysis. Turns out doing the same on the 100% full (and constantly growing) tables was a performance disaster. Basically we're now determining what our i/o needs and parameters are with much smaller cases, and then going from there. Right now the signal tables entirely fit in memory, but part of this equation is adding more spindles to the science database array to improve disk i/o as well. This is where the GPU User's Group-donated JBOD comes in. More on that below. Another project I've been working on is to get the splitters (the programs that make workunits out of raw data) to become sensitive to VGC (voltage gain control) values available in the raw data headers so that we can avoid splitting areas with low VGC values (and therefore loud noise). In layman's terms: we're trying to set everything up to automatically reject noisy workunits before sending them out. We know one or two beams (out of fourteen) are sometimes flaky, and keeping those workunits out of the pipeline will help reduce network competition for downloads. This should have been fairly straightforward, however during the course of testing we're finding more than one or two beams with various problems. More like 5 or 6. This may be for several different reasons, including bogus or misreported VGC values. This is on a front burner, with several parties involved here and at Arecibo. Speaking of network competition - yes, we're away that we are dropping all kinds of connections during uploads/downloads. This isn't because of our router (which was definitely the problem over the summer before we added RAM to it), but somewhere else further up the pipeline. Still figuring this out, but it's certainly load related. Hardware wise, we took an archive server out of the closet to make way for the JBOD mentioned above. The archive server will move into our secondary lab down the hall (where other servers currently reside). We were going to install the JBOD on Tuesday but the hole in the rack made for it isn't big enough to let us mess with internal cabling. Given that we hope to hook this up to at least two separate servers, we'll likely need to mess with internal cabling. So we're going to try to do our best with that while the JBOD is still on the table in our lab. Oh yeah.. our web server crashed due to overloading last Friday, likely due to an article Andrew published about recent Kepler analysis results. He didn't clearly enough state that these plots were radio frequency interference, and thus we got clobbered due to confused news reports that we found ET. The usual drill, basically. The text of the article was cleaned up. Eric suggested we put disclaimers on the top of every web page on our site that says, "everything we find is Radio Frequency Interference unless we specifically tell you otherwise." Okay. I should wrap this up. As a parting gift here are a couple random recent photos: Here's the new JBOD, as seen from behind, sitting on our lab table. There's only 21 (currently empty) drive bays back here, but on the front there are 24 full ones in front. Here's the current state of our server closet across the hall. Note the hole in the middle rack - that's where the JBOD is going. And for fun, last week I shot this photo, which is the entire Bay Area consumed in fog, which we at the lab (over 1000 above sea level) are enjoying lovely weather over said fog. Wow those pictures are blurry. Well, it's from my iPhone 3GS. Not exactly state of the art. So! I'm now official on the road. I'll be playing with my band MoeTar in Whittier, California on Saturday (opening up for the Allan Holdsworth Band), then I drive up to Seattle to meet some of the guys in Secret Chiefs 3, and then we all drive in the tour van to Denver, where we meet to remaining guys (flying in from NYC and Sydney, Australia). We'll rehearse two days, then tour for a few weeks all over the western US (with one stop in Vancouver), co-headlining with the awesome band Dengue Fever. Should be fun! Cheers, - Matt 20 Dec 2011 | 23:35:51 UTC Hello, world. Here's another random, non-comprehensive status update regarding our servers, quite possibly the last one before the end of the year. So what are we dealing with lately. Well, carolyn (the mysql server) seems to have some funky memory. Or maybe it's an overactive watchdog in the kernel. Hard to tell, but the warning messages we're getting aren't given us the warm fuzzies. Operations are more or less normal, so we're just keeping an eye on it for the moment (don't really want to do any surgery before the holidays). Meanwhile, it did have a standard issues CPU lock last week, requiring a hard reset and database recovery. However annoying (and it seems the modern day linux kernels are getting more and more prone to this sort of misbehavior) it's so far easy enough to recover from after hard power cycling the machine. We have to do this a lot on bruno (the upload/compute server) quite often, and oscar (the informix database server) the other day as well. Every time we eventually recover just fine. Also mysql-wise, we seem to be having performance issues that defy easy understanding and explanation. Maybe this is memory related (hope not), but probably just due to some black-box mysql internal bookkeeping. During some testing/tweaking I turned off the daily stats dump scripts, and (oops) forgot to turn them back on. So there was a period of 5-6 days without stats dumps. Sorry about that. Another thing we have to keep an eye on is server closet temperature. Seems like (without clear notification) we are already in "holiday energy curtailment" mode. With less people around, lab-wide environmental controls (which assist our server closet cooling) are ramped down to save energy. Makes sense, but that still means temperatures rise in our closet, which isn't happy-making. So far they only went up a degree or two on average. Just one more thing to worry about. Onto brighter news. The gang over at the GPU Users Group has been incredibly helpful to us thus far. They recently donated a 45 JBOD drive array, and as of today 28 2TB and 6TB drives (for this array and/or data transport to/from Arecibo). We'll use this, along with more drives to come and another whole server, to upgrade various parts of our current server backend in the near future: science database server storage, upload server, download server, and main BOINC admin/compute server... They are still collecting donations over at their site (see our donation page for a link to their paypal-based donation site) going towards this new hardware. Happy holidays, safe travels, and all that. See you in the new year... - Matt |
| Copyright © 2012 University of California |