Technical News
The news items below address various issues requiring more technical detail than
would fit in the regular news section on our front page.
These news items are all posted first in the
Technical News discussion forum,
with additional comments/questions from our participants.
(available as an RSS feed.)
2 Jul 2009 18:24:11 UTC
Looks like we're back in another noisy period, or at least the bandwidth is maxed out enough that it's constraining both downloads and uploads. Let's just try to ride this storm out - it should hopefully clear up on its own.

Regarding the videos I linked to yesterday, there were plans to get the PowerPoint images linked into the actual camera footage, but I guess that never panned out. That's fine. Or maybe that only happened on the live feed... Anyway, you get the basic gist of what we're trying to say from this footage. I was kind of rushing through my talk - how do you condense 10 years of effort into 20 minutes?

We were hoping to get the NTPCkr pages up this week but I'm finding that I really need to update the FAQ and other informational pages before making this live, lest we get flooded with common questions. Plus we have a little bit of feature creep, which is okay - better to rush and do these things now or they'll probably never get done.
- Matt

1 Jul 2009 19:38:40 UTC
Sorry about the forums (and other web site features) being shut off for over a day. These Tuesday outages are really taking forever. I guess we've been really busy, which means our tables get ridiculously fragmented throughout the week. Plus I noticed our database is easily 50% larger than it was about 2 months ago. And the replica lost a couple of its CPUs recently (it's a used/donated system and the CPUs were known to be flaky from the start). Anyway, since the normal recovery procedure was so painful last week I opted to keep all web page database lookups offline until the replica was caught up. Once again, I'm sorry for the inconvenience.

To make up for that, how about some videos from the SETI@home 10 year anniversary? I'll link these to the home page soon enough. Consider this a sneak preview for those who read these threads. Let me know if there are problems downloading/viewing these mpegs.

Data recorder-wise...
After all the effort to work with what we got, we're finally throwing in the towel on the current set of data drive enclosures. We have a plan B and plan C already in place - just a matter of deciding which one to enact. Meanwhile, I'm pulling old data off the archives at a pretty good clip - hopefully fast enough to keep up with demand.

Otherwise, I'm still working on NTPCkr and radar stuff. And I adjusted the stats scripts that generate the numbers on the server status page. The Astropulse numbers up until this morning reflected version 5 workunits/results; now they reflect version 5.05.
- Matt

29 Jun 2009 22:16:48 UTC
Another wacky weekend. Sounds like we were sending out a bunch of short workunits, which strains our bandwidth resources. Plus uploads were clogged for a while. The server was too busy and dropping connections, so the uploads weren't even reaching the server. On Saturday morning I did some TCP tweaking, which seemed to clear up that log jam for the time being.

This morning it came to my attention that we've been sending out workunits with the "application/x-troff-man" mime type. This was because files with numerical suffixes (like workunits) are assumed to be man pages. This may have been causing some blockages at firewalls. I changed the mime type to "text/plain."

The SERENDIP web page was updated for the first time in many years. There's a link on the front page about that.

We plan to get the public NTPCkr candidate lists online this week, ready or not. Trying to squeeze a couple more features in at the last minute, but I'm sure there will be bugs to work out and more features to add later on.
- Matt

25 Jun 2009 20:59:16 UTC
Fallout continues from the outage on Tuesday. Turns out the minor corruption in various MyISAM tables is messing up replication. Every so often a duplicate entry appears on the replica queue, which is easy to remove but requires human intervention. This is causing the replica to fall further and further behind.
I'm loath to give up on it, though, as that means being forced to point all queries, including non-essential ones, at the master. And that'll break everything.

We also had to fall back to using two download servers, but we did so using simple DNS round-robin load balancing. Obviously this wasn't working out so well. DNS rollout/caching is never balanced (we saw this several times before, especially during the feeder mod polarity issues a year or two ago). So this morning we fell all the way back to using "pound" - which forces exactly 50% of all incoming connections to go to the first server, and the rest to the second one. This immediately broke the current download log jam, though of course we're still maxed out bandwidth-wise as I write this paragraph.

Seems like there are a lot of frustrated people on these threads. There's no right or wrong way to feel about these outages. We're kind of a special case. At the core we're an academic project with no deadlines - normally nobody gets hurt if science is delayed a day or a month or a decade. On the other hand, we're forced to be "professional" since we're asking for various forms of support from many thousands of people, and you can't have that large a number of people involved without some sort of professional grade management and public relations. It's a daily puzzle marrying the two completely separate worlds.
- Matt

24 Jun 2009 19:56:44 UTC
Despite efforts to reduce the outage time yesterday, the database was bloated enough (for various reasons) to take all day compressing/backing up. The replica wasn't even close to being done by the time I left the lab, and still wasn't done before I went to bed last night. That meant all queries had to be aimed at the master, including all the read-only stuff that usually hits the replica - stats collection scripts, result state count scripts, the daily credit multiplier calculation (which is rather expensive), and lots of annoying web scraping queries.
All those excess things pretty much killed us throughout the evening. The replica was finally available in the morning, albeit fairly far behind the master. Nevertheless I was able to start cleaning up the mess. However, two other problems were revealed.

First, going to one download server wasn't a good thing. It seems impossible to me that apache can't handle all the downloads on one system - especially given the abundance of free resources. It drops connections regardless of how much network/httpd.conf tweaking I do. So we fell back to using two download servers, and that immediately solved everything. Of course, we've been offline for 24 hours, so there's gonna be lots of traffic for a while making it hard to upload/download anything.

Second, there was minor corruption in the MyISAM tables in the mysql database. Not sure what caused that but given the database was clogged all night all bets are off. The most notable effect of this was some weird behavior in the forums. Some simple "repair table" commands found the problems and claimed to have fixed them. Anyway... it's clear we still have much work to do cleaning up our current mysql situation. Sigh.

In better news, looks like me and Jeff are going to OSCON 2009 in San Jose in July - the O'Reilly open source convention. Maybe we'll get some hot tips about improving the linux/apache/mysql/php performance around here. Tim O'Reilly himself helped hook us up with free passes (he's been nice to us over the years).
- Matt

23 Jun 2009 23:09:29 UTC
Usual outage today (which happens every Tuesday for mysql database compression/backup). It went really long - I guess we've been busy inserting/deleting all last week. We went back to an older policy of doing simultaneous compression on both the master and replica, which should vastly speed up post-outage recovery. Until today we've been letting the compression commands (i.e. "alter table user type = innodb") pass from the master to the replica via the usual channels, but they wouldn't happen in parallel (as the loooong queries had to complete successfully on the master before the replica would start processing them). This caused the replica to be as many as four hours behind when the project started up again in the afternoon. The benefit of doing it that way was less work/management, and accidental updates/inserts during the outage wouldn't get lost. Going back to doing it in parallel, we have to stop the replica before we start and reset the master after we're done, thus increasing the chance of these lost queries, but so far we've had 0 such incidents during these weekly outages since we started using mysql years ago.

A weekly planned outage is usually a good time to take care of some offline chores. Today I cleaned up lots of unnecessary mounts in an effort to reduce our automounter maps as much as possible (so we don't have such a tangled web, which can be quite painful when one server disappears). I also made vader the sole download server, thus freeing bane to be whatever we want - which will be useful to handle certain services temporarily as we go around upgrading the out-of-date operating systems on lots of these machines. I think vader can handle the load alone.

I hear the presentations from the 10th anniversary celebration have all been converted to mpegs. It's a few gigs worth of stuff on a computer down on campus. A flash drive containing all that will appear up here at our lab sometime in the near future. Or it may be hosted on an interim server. We shall see.
- Matt

22 Jun 2009 20:53:56 UTC
It's fairly clear that the recent updates we made to the general mysql/state counts/splitter fold have vastly improved our recent weekend woes. There were still a couple dips here and there, but no wild swings like before. Except this morning one particular query - from the scheduler - was clogging the works.
We figured we'd just let it push through, i.e. let nature take its course. We assumed it was an expensive lookup, but after a couple hours of waiting I ran the same query on the replica and found there was only one (!) row in question. So what the heck is mysql doing? We killed the query and eventually the logjam cleared.

I'm finally scraping up enough space to pull a lot more work up from our archives, so Astropulse will be kicking in again, at least at some low level. This should also help reduce the demand on our limited resources since those workunits take longer to process, which means a lighter load on our database/download/upload/scheduling servers.
- Matt

18 Jun 2009 22:36:40 UTC
Some things got lost in the server reboot chaos/mayhem yesterday. One being that results were not being correctly stored on disk, despite all diagnostics showing otherwise (the incoming traffic looked normal, the upload apache servers were responding with "200" status, all the BOINC backend queues were nice and low). However, after rebooting the upload server yesterday the result RAID partition failed to mount. Actually this is a known quantity - there's something odd about this particular RAID partition that requires human intervention after every reboot to get going. Well, that human intervention didn't happen. Oops. Anyway, this was ultimately discovered thanks to various complaints from various parties, and fixed. Hopefully not too much headache/annoyance out there as the backlog of failed results clears out and corrects itself.

The new splitter method is now in production - where we're getting counts from a regularly updated table rather than each splitter process making the same redundant query over and over again. This would seem like a job for triggers, and we may go that route, but we already had the programming/plumbing in place to make this table (i.e.
the process that collects numbers for the server status page, which already displays those same counts) - so this was easier to implement. We'll see if we get fewer network dips over the next few days...
- Matt

17 Jun 2009 20:16:11 UTC
I've been busy. Almost too much to write about, none of it all that interesting in the grand scheme of things, so I'll just stick to recent stuff.

Our main problem lately has been the mysql database. Given the increased number of users, and the lack of astropulse work (which tends to "slow things down"), the result table in the mysql database is under constant heavy attack. Over the course of a week this table gets severely fragmented, thus resulting in more disk i/o to do the same selects/updates. This has always been a problem, which is why we "compress" the database every Tuesday. However, the increased use means a larger, more fragmented table, and it doesn't fit so easily into memory. This is really a problem when the splitter comes along every ten minutes and checks to see if there's enough work available to send out (thus asking the question: should I bother generating more work?). This is a simple count on the result table, but if we're in a "bad state" this count, which normally takes a second, could take literally hours, and stall all other queries, like the feeder, and therefore nobody can get any work. There are up to six splitters running at any given time, so multiply this problem by six.

We came up with several obvious solutions to this problem, all of which had non-obvious opposite results. Finally we had another thing to try, which was to make a tiny database table which contains these counts, and have a separate program run every so often to do these counts and populate that table. This way instead of six splitters doing a count query every ten minutes, one program does a single count query every hour (and against the replica database). We made the necessary changes and fired it off yesterday after the outage.
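For the curious, here's a minimal sketch of the counts-table idea described above. This is not our production code: it uses Python's built-in sqlite3 as a stand-in for mysql, one connection stands in for both master and replica, and the table names (result_counts), the function names, and the READY_TO_SEND state code are all made up for illustration.

```python
import sqlite3

READY_TO_SEND = 2  # assumed state code, for illustration only


def create_tables(conn):
    """Create a stand-in result table and the tiny counts table."""
    conn.execute(
        "CREATE TABLE result (id INTEGER PRIMARY KEY, server_state INTEGER)"
    )
    conn.execute(
        "CREATE TABLE result_counts (name TEXT PRIMARY KEY, value INTEGER)"
    )


def refresh_counts(conn):
    """The one expensive count; meant to run from a single periodic job."""
    (ready,) = conn.execute(
        "SELECT COUNT(*) FROM result WHERE server_state = ?", (READY_TO_SEND,)
    ).fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO result_counts (name, value) VALUES (?, ?)",
        ("ready_to_send", ready),
    )
    conn.commit()
    return ready


def splitter_needs_work(conn, low_water_mark):
    """Cheap single-row lookup a splitter does instead of counting itself."""
    row = conn.execute(
        "SELECT value FROM result_counts WHERE name = ?", ("ready_to_send",)
    ).fetchone()
    cached = row[0] if row is not None else 0
    return cached < low_water_mark
```

The tradeoff is staleness: a splitter only sees the queue size as of the last refresh, which is fine here since the ready-to-send count doesn't have to be exact.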
Of course it took forever to recover from the outage. When I checked in again at midnight last night I found the splitters finally got the call to generate more work... and were failing on science database inserts. I figured this was some kind of compile problem, so I fell back to the previous splitter version... but that one was failing reading the raw data files! Then I realized we were in a spate of raw data files that were deemed "questionable" so this wasn't a surprise. I let it go as it was late. As expected, nature took its course and a couple hours later the splitter finally found files it could read and got to work.

That is, until our main /home account server crashed! When that happens, it kills *everything*. Jeff got in early and was already recovering that system before I noticed. He pretty much had it booted up just when I arrived. However, all the systems were hanging on various other systems due to our web of cross-automounts. I had to reboot 5 or 6 of them and do all the cleanup following that. In one lucky case I was able to clean up the automounter maps without having to reboot.

So we're recovering from all that now. Hopefully we can figure out the new splitter problems and get that working as well or else we'll start hitting those bad mysql periods really soon.
- Matt

11 Jun 2009 21:52:12 UTC
Spent the morning clearing out my mail spool - something that could easily eat up a full day if I let it. It's amazing how these "this will only take 5 minutes, tops" tasks add up, especially when there are about 100 of them.

Bob found the mysql replica has been falling behind a bit more than he thought it should, and after some poking around I found iptables getting in the way. So I did some reconfiguration on that system, rebooted it, and now let's see if it is operating any faster... This wasn't the crux of our mysql woes, but it may help a little bit (less chance the stats queries will rely on the master if the replica is always caught up).
Actually as I write this I see we're in another difficult period. Eric was actually just up here and suggested a workaround for one of the queries that has been giving us the most headaches lately. We might implement that in the near future. We also should try throwing some of this new hardware at the problem (if we could ever get it working).

The dust is settling after the anniversary a bit - still haven't gotten any video from the students putting it all together. Dan, having spent some time in Arecibo recently, has new insight about the radar problems we've been having - so I may get yet another code rewrite on my plate in the near future. Hopefully this will be the final revision that will actually get completed and be used to clean up a huge backlog of dirty data (waiting to be processed). Jeff and I hope to also get some NTPCkr far enough along to present something to the public. I know I've been saying that a while.
- Matt

10 Jun 2009 22:12:33 UTC
Playing around installing the new Fedora Core on my desktop today. So far so good. It seems any time anybody in any context mentions a specific flavor of linux this inspires discussion, usually in an incredulous tone, about why in god's name would you even consider using version x instead of version y, etc. I understand the pros and cons, and we're not going to change anytime soon, if ever. Personally I'm waiting for the day when operating systems disappear and we can all get back to work.

Still haven't gotten any of the Intel systems up and running for various reasons. I'm abandoning all of them for now. Very frustrating - every time I solve one problem another takes its place. And the inability to collect data at Arecibo continues - the problem has been narrowed down to the (very old) EDT card working on a newer OS. The good folks at EDT are working on it (even though they don't even sell this card anymore, I don't think...).
- Matt

9 Jun 2009 22:27:28 UTC
Well I employed my database code adjustments yesterday afternoon... and they seem to have had a decidedly opposite effect from what we expected. So I reverted them last night. Back to the drawing board on that front - I think we'll basically move from understanding the problem to eventually adding more hardware so it isn't a problem.

The key to that is getting hardware to work. Eric figured out the issues we were having on one of the newer Intel servers (the RAID controller card had to have a jumper moved around, even though I checked the jumpers already and they matched a similar card in a similar system that is working just fine). Of course, it's a hardware RAID controller, and it won't let me do JBOD, so I was forced to make 8 individual RAID groups, each containing one drive. This is annoying enough, but the RAID BIOS contains primitive enough mouse drivers that each step of pointing the mouse and clicking on the appropriate button took anywhere from 5 to 60 seconds. So it took me about 90 minutes to configure the RAID. Of course, we could have just stuck with using hardware RAID but for benchmark purposes we're comparing this system to one with similar software RAID. So there ya go.

Our BOINC server - one system that handles the boinc.berkeley.edu website, all the alpha project stuff, etc. - has been having more and more problems as of late, all resulting in the CPU load spiralling out of control. We're in the process of getting another one of these new Intel servers up and running to replace this older server. Of course, we're hitting all kinds of other problems trying to boot the thing. More on that tomorrow if it's still offline.

Downloading FC11 today. All the mirrors are jammed.

And of course we had our weekly outage. No big developments there - Bob took care of all that. He did notice during our weekly science database backup that we had some corrupt database pages.
This may be because of something else he discovered - the disk space made available for Astropulse had filled up sooner than expected. So he added more disk space to those tables.
- Matt

8 Jun 2009 21:31:03 UTC
Dan and company are wrapping up their work at Arecibo and heading home today (I think). It was a painful weekend trying to get our data recorder working again (and installing the new SERENDIP V data recorder) but all is well, more or less. We even did some observations of the Crab Nebula (and its known pulse) which Josh then found in the data using Astropulse, providing a good end-to-end test. We'll send workunits using that data once we get that raw data up here. We ultimately found our SATA drive enclosures were a major part of the headache, and we're planning to replace those with USB enclosures... probably.

It was a painful weekend network-wise - the increased active user load (mixed with the lack of long Astropulse workunits to send out) means a lot more activity on the result table in the database, which means periods of mysql choking. We're adjusting some code to do "dirty reads" which may help conserve resources. For example, the count of the result table to determine the current size of the ready-to-send queue doesn't have to be 100% accurate, so locking the table to do such a query is overkill. We'll see if that works, or helps.

We hope to replace these database servers, or at least the mysql replica, with one of these new Intel servers. They have tons of CPUs and gobs of memory, but the disk controller doesn't work. Actually, that's unclear - we replaced the card with one we know works, and that wasn't behaving either. Until we can figure that out we're stuck with what we got.
- Matt

4 Jun 2009 22:27:11 UTC
A day full of troubleshooting. Still trying to get one of these Intel servers up - everything in the system works except the hardware RAID. We got the new drives in the mail today, but still can't get into the RAID BIOS.
We do have a card we know works in another Intel server which we'll swap in sometime but we're tabling this project in general for now...

That's because Dan and a bunch of the CASPAR students are down at Arecibo to install their new SERENDIP V data recorder. They'd like to test it while they are there, of course, which means comparing its functionality with our recorder, as well as doing some observations of the Crab Nebula to run through Astropulse, etc. What does that mean for us? That means we really need to get our SETI@home data recorder SATA drives/enclosures working. They have been offline for well over a month now, but now that we have our own people with immediate access to the machine it's speeding up the debugging process. Still, there are plenty of mysteries that seem impossible to figure out. Jeff's frustration with SATA/USB/drivers/linux is palpable coming all the way from the other side of the room. In fact I just heard him tell the gang down there to install a new OS on the system (the current OS is ancient, and quite possibly the source of our woes).

Meanwhile Jeff and I continue to tinker with NTPCkr stuff. I've been trying to optimize the NTPCkr page, finding that it spends most of its time parsing the XML of the zillion multiplets (groups of similar signals) within each candidate. So at this late hour we may change how we divide the multiplets up into "barycentric" (tight in frequency space) and "non-barycentric" and just score them according to frequency tightness. This may not only yield far fewer multiplets, but they may be ranked better as far as how interesting they are. There's gonna be more tweaking/testing on that front.
- Matt

3 Jun 2009 22:00:49 UTC
Today I started messing with one of the new Intel servers. We're still waiting on drives to ship before doing much with it, but at least it boots off of DVD. There are some other kinks to work out as well. I think we're going to call it "mork."
We hope to at least replace sidious with this machine, and if we get the other servers working, then replace others. In general we always wish to reduce the hardware we need to maintain - i.e. have fewer machines doing more stuff. However, we'd like to do so without increasing our single points of failure (redundancy is nice). And given we never buy anything we have to generally stick to a "work with what you got" philosophy.

A small note about the front page "weekly outage" status - that's a line at the top of our project_news data file which is commented out. Every Tuesday morning I uncomment it (if I remember to) so people can see it, and hopefully later that day (if I remember to) I comment it back into oblivion. Sometimes I forget, or recovery is slow enough that I keep that warning there so people can get some idea why they're having trouble connecting. In any case, it's human controlled and therefore prone to error.
- Matt

2 Jun 2009 23:29:04 UTC
Had the weekly outage today - the normal database/compression/cleanup stuff was by the book, however we took the time to address some other hardware issues. First and foremost, we replaced the failed drive on thumper. I was griping about this yesterday and how this means we'll have to reboot, which means we're forced to resync the root RAID devices. Well, that's happening now. I also upgraded the kernel on worf. That sort of went well - except upon coming back online one of the spare drives was marked as failed. We're dealing with that now.

Coming out of these weekly outages has gotten painful given our increased rate of traffic lately, and these web queries that continue to clobber us. I try to aim these at the replica, which helps, but right after outages the replica is effectively offline for many hours as it is still busy recreating the giant tables. So I have to temporarily aim those web queries at the master, which makes recovery even slower.
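To make the tradeoff concrete, here's a rough sketch of the routing decision described above. It is only an illustration, not code we run: the function name, the ReplicaStatus fields, and the one-hour lag threshold are all assumptions. The "offline" outcome mirrors the other option mentioned in these posts, which is to keep non-essential lookups switched off until the replica catches up.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    online: bool           # is the replica accepting queries at all?
    seconds_behind: float  # how far it lags the master


def pick_server(read_only, essential, replica, max_lag_seconds=3600.0):
    """Decide where a query should go: 'master', 'replica', or 'offline'."""
    if not read_only:
        # Writes always have to hit the master.
        return "master"
    if replica.online and replica.seconds_behind <= max_lag_seconds:
        # Normal case: read-only traffic stays off the master.
        return "replica"
    if essential:
        # Replica unusable, but this lookup can't wait.
        return "master"
    # Replica unusable and the lookup is optional: don't add load.
    return "offline"
```

For example, a stats page lookup right after an outage (replica still rebuilding) would come back as "offline" rather than piling onto the master.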
We gotta figure this all out, come up with a better weekly backup/reorg policy, or get that new replica server up and running sooner than later. We did order drives for it - should be here later in the week.
- Matt

1 Jun 2009 22:27:24 UTC
Lots to talk about today. Let's start with the weekend: we had the usual drill of running out of raw data files for the Astropulse splitters to chew on. Due to file transfer speeds up from our off-site archival storage (NERSC) we can only put a few files up a day, which Astropulse goes through in no time. This isn't a big deal, but in order to regulate this a little better we adjusted the weights of the two applications so that the feeder gives 97% of its slots to multibeam, and 3% to Astropulse. This shouldn't change the current regular behavior, but will help smooth out the peak periods I think. There are still some BOINC logic changes that have to happen to keep Astropulse from taking over too many systems.

Some good news: Intel once again came through with a slew of donations - five servers to be exact. These are mostly test/used systems so three require some TLC to bring online (a couple of those may be used as parts to boost up one of our current compute servers). However one of the remaining two will get our attention right away and become the new mysql replica server. I haven't confirmed the specs, but I've been told they each contain four 6-core CPUs and 64GB RAM. Intel would like us to do some benchmark tests right away, so expect a new server (or two) in the fold in the coming weeks. I guess I need to update the hardware donation page...

Of course, the release of Fedora Core 11 has slipped a couple times, but I hope to start a major wave of OS upgrades (or installs) next week as well.

The other big project is dealing with thumper - our science database server.
We're replacing a bad drive tomorrow, which means rebooting it, which in turn means it will go through some painful RAID resync upon coming back up (due to its drive naming issues). We know we can fix this resync problem by reinstalling the OS, which we'll do when FC11 is out and we've tested a similar install on bambi (the secondary science database server) first. Once that's working, we'd like to re-RAID the data drives (from RAID5 to RAID10) to vastly speed up throughput (necessary for NTPCkr performance). But to do that we need to get all the raw data off first. And to do that we need to first install a kernel update on worf (the NAS from Overland Storage which we are beta testing) so we can safely move all our raw data there. Oy. So many ducks to get in a row. Anyway... one step at a time...
- Matt

28 May 2009 20:37:47 UTC
Question: so what's up with the near time persistency checker (NTPCkr)? If the live web streaming were working last Thursday you would have seen the tail end of my and Jeff's talk where Jeff went into a little detail about the current status of things. Basically, we have some screws to tighten here and there, but the general thing is working. We're up against some database throughput issues which we hope to fix sooner than later, plus we are still tweaking the scoring algorithms. We hope to have a public page available soon where you can peer into the progress of things. Until then, here's version 0.0.1 of the NTPCkr FAQ.

It's becoming clearer that we need to adjust the weight of our applications so that we send out more SETI@home/multibeam workunits. We have things effectively set such that Astropulse work gets sent out as soon as it becomes available. This was partly to expedite getting as many Astropulse results back as possible (in the interest of getting that science done) but this is getting less and less possible given our resources and current participant demands. Things on this front may shift in the near future.
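As a rough illustration of what application weights mean in terms of feeder slots, here's a small sketch that divides a fixed number of slots in proportion to weights. To be clear, this is not the BOINC feeder's actual code: the function name, the 100-slot total, and the largest-remainder rounding are assumptions made for the example. The 97/3 weights are the ones mentioned in the 1 Jun item above.

```python
def allocate_slots(weights, total_slots):
    """Split total_slots among apps in proportion to their weights.

    Uses largest-remainder rounding so the slot counts always add up
    to total_slots exactly.
    """
    total_weight = sum(weights.values())
    exact = {app: total_slots * w / total_weight for app, w in weights.items()}
    slots = {app: int(share) for app, share in exact.items()}
    leftover = total_slots - sum(slots.values())
    # Hand any leftover slots to the apps with the largest fractional part.
    by_remainder = sorted(
        weights, key=lambda app: exact[app] - slots[app], reverse=True
    )
    for app in by_remainder[:leftover]:
        slots[app] += 1
    return slots


print(allocate_slots({"multibeam": 97, "astropulse": 3}, 100))
```

With equal weights the slots split evenly, so raising one application's weight is what shifts the mix of work that gets handed out.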
We've been near our bandwidth limit for the past day since unclogging the mysql database, providing more data for Astropulse to split, and our active user base going up about 15% over the past couple of weeks. This may account for recent upload/download difficulty. It looks like it's getting better, at least for the moment.
- Matt

27 May 2009 22:07:00 UTC
Had a few more bandwidth woes early in the morning. Turns out this was due to the replica recovery yesterday - a lot of long queries were still being aimed at the master. I turned the replica on, which immediately helped (though it is about 10-15 hours behind and slowly catching up so some stats may seem a little screwy).

Before we figured that out Jeff and I were a bit stumped as we thought this had to do with Astropulse work availability. In the process of looking for clues we discovered that for a long time Astropulse had an extra defunct project sitting in our applications table. This meant the feeder was saving a third of its slots for a project that will never have any work. I fixed that. I don't think that was causing any major problems lately, but it sure wouldn't help them, either.

This morning I dusted off some code - a program that would fix our doubly-precessed signals. I was hoping some changes Eric had since made to the (incredibly arcane) database code would have fixed some long standing problems, but they didn't. This isn't Eric's fault - it's some garbage in the esql libraries that won't let me do updates to rows with user-defined types. This normally isn't a problem as we can insert signals just fine. Updating them, however, is the problem, at least using esql. So I'll shelve this project once again - in the meantime we have a patch of signals that we cannot use to find candidates as their coordinates are slightly wrong.

Oh yeah - people were asking: I'm not sure when video of our anniversary talks will become available.
The students involved in the filming/editing are also working on SERENDIP V, and they're in a mad scramble to get that ready for deployment down at Arecibo next week.
- Matt

26 May 2009 22:32:21 UTC
We're back after the long holiday/anniversary weekend. Phew! That was fun, and now we can get back to work on some outstanding projects.

First off it should be noted the weekend had some issues. For some reason the "forum preferences" table broke again, which wouldn't be that big a deal, except this messes up replication. I kicked it every few hours over the past couple of days which didn't help very much. So we're reloading the replica from scratch yet again. This'll take some time, so the recovery from today's regular outage may be particularly painful.

Meanwhile a random drive on thumper failed. No surprise - there are 48 drives in that thing. It's RAIDed, we're getting a spare from Sun, no big deal. Still, this will exercise our problems with rebooting thumper at this point - so this bumps up in priority our need to reinstall the OS on the thing.

I'm still trying to move data from our archives up here for Astropulse as fast as I can. We have over 100 files yet to transfer. I hope we get the data recorder back in working order before we use up all these files.
- Matt

20 May 2009 21:47:21 UTC
Another short note just to check in. Good news is that I finally was able to get more than just 1 or 2 files up from HPSS for Astropulse to chew on. In fact, I got 4 files! Well, that's still not very much, but more are on the way. We'll really have to get crackin' on the data recorder issues once this week is through.

It also seems that we have a continuing problem with these difficult web queries clobbering us from time to time. I put a "hack" in place yesterday that I thought was helping, but Dave noted our problem may be from persistent mysql connections.
Since php is embedded in apache, whenever it starts up it opens a database connection and keeps it open through multiple page requests. While we put explicit code to use the replica on the result pages, apparently php won't flip from master to replica (or vice versa) during these persistent connections, so we need better logic to handle all that. In the meantime it seems like we're in another ugly long query phase clogging the pipes. Still very annoying. This is my last tech news item until next week, probably. Will be busy tomorrow with the big event and all. - Matt 19 May 2009 23:29:17 UTC It's Tuesday, that means outage time (for database backup/compression/etc.). Today's outage was by the book, and we're recovering from that now. We're still sloooowly getting more data back up here from our archives at NERSC, though the Astropulse splitters are tearing through those pretty fast. We were also having continuing issues with loooong queries on the mysql master database. We thought we fixed that yesterday. Looks like we didn't. Dave and I poking around with that for a while. Other than that, chipping away on NTPCkr stuff for Jeff, getting things in order for the big event on Thursday. Wow - I got exactly 48 hours from now to get my little talk straight. - Matt 18 May 2009 23:13:38 UTC Happy Anniversary! Though we're officially celebrating later this week it was actually ten years ago yesterday that we launched this thing. We didn't know what to expect, and our ftp server was immediately clobbered from thousands of people simultaneously attempting to download the client. I remember a blur of chaos as we procured other ftp servers (and a remote mirror) that day. I still joke that we've been trying to catch up ever since. The general workunit/result flow was a little weird lately. First, we ran out of data for Astropulse to process. 
The splitters kinda burned through a lot of these files - I'm wondering if there's something else going on - or maybe just data quality issues. We also updated some web code which broke our (temporary) master/replica code when looking up results via the web, so the database got clobbered again for a while. This morning Dave re-enacted these changes to use the replica and checked the code in. And once again we had a couple weird mounting issues - bruno was hung on bambi, lando was hung on thumper. This sudden rash of mounting problems is getting annoying if not worrisome. We had to reboot both bruno and lando, which I did this morning. I'm also pulling up some data from Arecibo to get Astropulse rolling again at least from time to time. - Matt 14 May 2009 20:40:07 UTC We are quite preoccupied with anniversary stuff so we've been doing the bare minimum amount of systems administration to get by until after the event. Still, it should be mentioned we continue to have SATA/driver issues on our data recorder at Arecibo, and haven't collected new data for about a month now. While we have a pile of data yet to crunch readily available on disk, I started pulling up unanalyzed data from our offsite archives. Before doing so I went through the whole data inventory rigamarole this morning. We have 1787 raw multi-beam data files (mostly all 50GB in size) archived, of which 338 haven't been split at all. However, a portion of these files were recorded before 2008, i.e. before we had a hardware radar blanking signal embedded in the data. So until we get my software radar blanker working (a project postponed until post-anniversary) we can't chew on these files without dealing with major radio frequency interference. This isn't a major problem: 1225 of the 1787 archived data files are from 2008 or later, and of these 249 have yet to be split. So we got plenty of numbers to crunch until we get the data recorder working again. 
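For anyone who wants to double-check the inventory arithmetic above, here is a rough sketch in Python. Only the four counts and the "mostly all 50GB" file size come from the post; the function name, the dictionary keys, and the flat 50 GB per file are assumptions made for illustration, so treat the volume figure as a ballpark rather than a measured number.

```python
# Sanity check of the archive inventory quoted in the post.
# The four counts are from the post; everything else is illustrative.

def inventory_breakdown(total, total_unsplit, recent, recent_unsplit, file_gb=50):
    """Split archive counts into pre-2008 and 2008-or-later groups.

    file_gb is an assumed uniform size; the post only says the files
    are "mostly all 50GB".
    """
    old = total - recent                        # files recorded before 2008
    old_unsplit = total_unsplit - recent_unsplit
    return {
        "pre2008_files": old,
        "pre2008_unsplit": old_unsplit,         # blocked on the software radar blanker
        "recent_files": recent,
        "recent_unsplit": recent_unsplit,       # usable right away
        "recent_unsplit_tb": recent_unsplit * file_gb / 1000.0,
    }


summary = inventory_breakdown(total=1787, total_unsplit=338,
                              recent=1225, recent_unsplit=249)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Under those assumptions, the unsplit files are either in the pre-2008 group (which has to wait for the software radar blanker) or in the 2008-or-later group that can be split immediately.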
- Matt 13 May 2009 19:24:37 UTC No real server news today, but I'll respond to a couple things mentioned in the previous thread. I said we have about 150 CPUs in our server fold. Of course, looking at the list of machines on the server status page you see about 40. First, this isn't a complete list - it only contains public facing or critical servers. We have a lot of other systems that are doing tangential tasks or behind-the-scenes stuff. We also have several appliances (like the NAS's) which contain multiple CPUs as well. Still, this number may be inflated a bit due to hyperthreading on some servers. I think the actual number of physical CPUs is still above 100 though. Plus, as I was calculating this just now I found that two of the CPUs on sidious have apparently died. This is no surprise - it's a used/experimental machine and had CPU issues since day one, which is why it is the replica mysql server and not the master. The talk (which happens next week) should be viewable over the net after it happens. I don't think we're going to do live streaming or anything like that. We're going to meet and discuss early next week what our options are. - Matt 12 May 2009 21:32:39 UTC Today's Tuesday, which means regular outage day for us. The project is already coming back to life as I write this sentence, though Bob still has some work to do to sync the beta replica database up again (a process which failed last week due to one of the tables unexpectedly needing repair). I got a funny call out of the blue yesterday from a person who works at a music production facility in LA. They do a lot of CPU intensive work there, and were surprised to find a bunch of BOINC clients running on their systems slowing things down. I'm guessing a former employee (or current employee afraid to speak up) planted them on as many CPUs as possible. 
Anyway, I'm not sure how he got my number, and even less why he chose to call me of all people, especially since the clients were all apparently running Einstein@home. Nevertheless, I gave him some uninstall tips, and that was that. Still working on the talk, which is slowly coming into shape. I'm trying to squeeze in 10 years' worth of digressions about work creation/distribution, databases, web sites, and networks, as well as back-end server war stories into about 20 minutes. It's been a trip down memory lane, and we're kind of kicking ourselves for not taking as many pictures back in the day of our puny little setup. I can't believe we got this thing off the ground with 3 Sun Ultra 10's (all doubling as desktops for me, Jeff, and Dan) and 2 IPC's. Our current server closet contains about 150 CPUs, 100 TB of disk, and 150 GB of RAM. - Matt 11 May 2009 21:08:02 UTC Over the weekend we hit a bit of a traffic "depression" - in other words we were sending out far less work than we should and so our outgoing bandwidth dropped. Why? Well, due to a single garbled astropulse file the astropulse assimilator was bailing, and so the queue was growing, and so workunits were staying on disk longer, and so we ran out of workunit storage, and so the splitters revved down. Eric kicked the assimilator in question yesterday, and we caught up more or less. This morning I found bruno (the upload/BOINC general admin server) was having similar mounting problems to the ones thumper was having at the end of last week - it was hanging on a mount to anakin (the scheduling server) of all things. This didn't affect anything major, but the server status page was stuck since yesterday. Anyway this time I cut to the chase and rebooted the system, which helped, but the drive arrays are configured in such a way that requires human intervention on boot to get fully working again. No big deal, but some result uploads were failing for a minute or two there. 
Jeff and I practiced the first rev of our anniversary talk this morning. We need to trim it down by 15 minutes. I guess there's a lot to talk about (nothing regular readers of these threads don't already know). - Matt 7 May 2009 22:03:43 UTC I came in this morning and went about my normal chores, including checking the raw data pipeline. We have automated scripts to do most of the work, including one called "splitter_janitor" which finds files ready for deletion, takes some action, and mails me/Jeff the results. Well, I didn't get any mail. So I looked at the system in question, thumper, and found the script was hung. Some poking around led me to discover that thumper was having trouble mounting directories on server ewen (Eric's hydrogen study server, which actually crashed yesterday but came up again just fine). Well, other machines were mounting ewen just fine. So what gives? Sometimes the automounter needs a kick, so I restarted that. No dice. I restarted nfs/nfslock to no avail either. Hunh. Around this time I noticed the primary master science database, also on thumper, had gotten wedged. Great. Eric/Jeff were brought into the fold but nobody had any great ideas as to what was wrong and therefore how to fix it. We started killing processes one by one, including the database engine itself, which could only be stopped with a kill -9 (which isn't optimal, but informix has always been perfect recovering from such ugly shutdowns). With an empty process queue we still had mounting problems. Normally one of the first things to try is a reboot as this is easy and usually works, but we were loath to reboot thumper since (as you might remember if you are an avid reader of these threads) its root RAID has some funkiness where, even if it's healthy, it will show up as degraded (and require a long resync) upon reboot. But we had no choice at this point, so we rebooted it, and sure enough the system booted just fine (and we could mount everything again). 
That's the good news, the bad news is that our fears were realized, and we're in the middle of another long painful root drive resync. The system is functional in the meantime, so really it's not that big a deal - it's just annoying, and perhaps a bit scary. Well, that ate up my whole morning. Then moved onto my Powerpoint/PHP tasks until Bob noticed the science database load was strangely low. This led to more snooping around, finally finding that our system vader (where the assimilators run) was having trouble mounting bruno's disks (where the result files are). So we weren't inserting results, which explains the bored science database. I rebooted vader, which is much easier than thumper, and that broke another dam. - Matt 6 May 2009 20:39:57 UTC We recovered fairly well after the outage, despite all the minor annoyances as of late. We still have to resync the beta database on the replica - turns out there was corruption in those tables that didn't get noticed until after we brought everything up again. Well, not so much corruption as a bit somewhere that told mysql to not bother dumping the beta database because it thinks there's corruption. So when I tried to rebuild the replica with the dump (when the beta project was back on line) and found the dump was zero length, I issued the proper repair statement and mysql responded "0 errors" but then was able to dump everything. Whatever. It's fine for now - and it is just the beta database, so we'll clean that up next week. As for fears of running out of data while we're waiting for the data recorder to get fixed: we still have plenty on line, and a few drives on the shelf full of data sent up from Arecibo as part of the last shipment they made before the SATA card went kaput. Plus we have a bunch (how much? not sure, but a lot) of data in our archives at HPSS which we haven't processed yet. So we're good for now, and maybe even a month or two. 
As for those network graphs talked about in the previous thread: that particular graph is for a router down on campus which handles the tunneled traffic to/from our lab and destined for our router at the PAIX (where we hook up with our ISP bandwidth). So yeah, green shows "incoming" from the lab, which is what we see as "outgoing" i.e. downloads. And vice versa for the uploads. Of course, there's a tiny tiny bit of noise due to scheduler traffic which also goes over that link. - Matt 5 May 2009 21:42:36 UTC There were indeed some weird lingering problems with the mysql database from this weekend. Some tables had bungled indexes. We think we cleaned that up during the usual weekly maintenance outage today. We also needed to regenerate the replica mysql database from scratch, so that'll be behind until later this evening (or tomorrow). The result pages may be out of whack until then. In fact, I just turned them off for now as they were eating too many resources. By the way, we're still unable to collect data at Arecibo due to problems with the data recorder being unable to see the drives. Turns out the card we bought, which was an exact replacement of the previous card, is having driver issues. Why? Well, unbeknownst to us we weren't actually using the previous card - we were using a totally different card (i.e. one we didn't buy) this whole time. It's a mystery why the original card was swapped out and replaced with this third one, but we're kinda back at square one again. Sigh. Due to time zone/scheduling conflicts each iteration on this front takes about 24 hours (the staff at Arecibo is providing support for free, after all). - Matt 4 May 2009 22:27:44 UTC The weekend was a little bumpy. The mysql database was showing signs of trouble Saturday. Eric was the only one paying attention at the time, so he restarted the database. Everything seemed fine, except he made some posts on the forum and then they all disappeared. 
This is still a mystery (the cause, the exact effects, and whether it is still a problem). Eric is trying to recreate and diagnose. But we were still getting web scraped to death. I played a gig Saturday night, getting home around 1:30am. I noticed the lingering problems at that point and blocked a couple more IP addresses and kicked off the long queries. Things more or less recovered on their own after that (except for the validators, which I fixed in the morning). So this is getting to be a regular problem, which I partially addressed this morning. I dug through the php code and quickly figured out how to get a couple of the offensive long queries to point at the replica database. This seemed to be quite helpful, but the replica is still behind due to the other problems mentioned above. So people are seeing about a day in the past when checking out their current results on our web site. It's confusing, but not the worst tragedy in the world, and it's a problem that will correct itself shortly. It'll all be caught up after the outage tomorrow. To keep things interesting, we seem to be in the middle of a spate of weird workunits - ones where the data isn't kosher and which are therefore returning quickly. Eric is also on top of that one. In the meantime, our outgoing traffic is a bit pegged. Less than three weeks until the anniversary. I'm getting my powerpoint together now. And I couldn't think of a worthy thread title theme this month, so how about apt titles for a change? - Matt 30 Apr 2009 21:21:40 UTC We're officially three weeks away from the 10th anniversary celebration - I think Dave just put the official announcement of such on the front page. Jeff and I are bashing out all the details we can beforehand. I guess I will finally learn how to use powerpoint (at least the openoffice version). So there were some splitters stuck after the outage so we ran out of work to send Tuesday night, but that got kicked back in line Wednesday morning. 
I wasn't involved with the outage and didn't notice until everything was better - I was taking the day off entertaining visiting family (which also explains the spotty nature of these current tech news items - sorry). There are still lingering problems trying to record data at Arecibo. We sent them a new SATA card, which worked, but even though the part # was the same as the old card the connectors were different (I instead of L). Jeez. So we sent them the right cables. Now the drivers won't load - the system recognizes the card, but not the drive. What a headache. Oh yeah. This is the last tech news item for the month, so after much anticipation (not) the thread title theme this month is revealed: names of cats I lived with throughout my life, some adored, some not so much. By far the best kitty ever was Normal (he and his littermates had Geek Love references as names). Our current cats (i.e. still alive and/or hanging around our house) are Olga (Alexei's sister) and Fner (Fnerina's feral half brother). Too bad our dog Laszlo - a purebred Doberman we recently rescued as an adult from the pound - still requires much effort in the ways of socialization, including reducing his desire to hunt down and eat smaller animals. We're working on it. 28 Apr 2009 22:35:46 UTC Busy busy busy, though not many fun adventures to report in the server realm. The weekend was fairly smooth, as was the regular database backup outage today. Bob went to the MySQL conference last week, so yesterday we discussed some plans for mysql upgrades, tweaks, etc. which we won't implement until the end of next month (i.e. after the anniversary). Of course, there was discussion about the Oracle buyout of Sun, and how that will affect the future of mysql. Apparently panic is unwarranted and we were reminded that the innodb engine, which is mostly what we use within mysql, was already partly an Oracle project. Anyway we shall see. 
Jeff and I are continuing to spend our time doing what we can to get the NTPCkr rolling before the anniversary, as well as scraping a talk to present together about the general data pipeline (which we hope to end with the "unveiling" of the NTPCkr). Jeff's been hitting some execution efficiency hurdles (mostly involving many long database queries), but we discovered some more significant optimizations (mostly involving getting around having to query the database in the first place). These speed-ups require some logic changes, which then means fresh code walkthroughs. Extreme programming time. - Matt 23 Apr 2009 23:07:53 UTC Today included more messing around with gnuplot and various web programming tasks. I also helped Dan format a pdflatex document. I'm kind of cursed with being really fast at working with these formatting markup languages, so such tasks get thrown onto the end of my work queue a lot. I noticed we were having a network dip in the afternoon and found once again our web site was being DOS'ed. Somebody (or some robot) was scraping our site, completely ignoring our robots.txt file, etc. Quite infuriating. I wonder if it is officially unethical to make public IP addresses which exhibit this kind of foul behavior. The worrisome part is this kind of activity clobbers mysql (and thus the whole project), and last time this happened everything seemed to recover, and then the database crashed twice over the weekend. We shall see, I guess. It's recovering now. - Matt 22 Apr 2009 22:33:18 UTC Looks like there were some beta project problems after the outage yesterday caused by a missing executable. That got replaced, and I think that everything should be okay now on that front. I heard rumors that regular users were seeing beta errors, but I'm hoping that was just confusion. I haven't heard anything since. Other than that today was more or less a day of system/web plumbing. The web stuff I'm working on is becoming a major kludge due to time constraints. 
It's actually a conglomeration of C code and perl, php, and C-shell scripts. You know, whatever works. I'm a big fan of getting things working as soon as possible, then making it pretty later. - Matt 21 Apr 2009 22:16:04 UTC Tuesday means weekly outage day for mysql database backup/compression. Since the replica got messed up during the duet of crashes over the weekend, we are using this backup today to recover the entire replica database from scratch right now. Should be ready to go in a few hours or so. I think the regular boinc stats xml dumps also broke over the weekend but those should be generating normally again now. The secondary science database is also suffering some kind of malaise. Not sure what the deal is, but it's slowing down my NTPCkr web site development. I thought it was excess disk activity on the system (caused by writing a primary database backup image to one of its spare drives) wreaking havoc, so I waited for that to end, but still no dice. Had to stop/restart the engine and even then it went through some phase of vague recovery before I could access it again. Finally got that replacement sata card for the datarecorder down at Arecibo. Jeff and I tested it in a system up here (mostly to make sure we didn't need to update its firmware) and I just put it in a box heading to Puerto Rico (along with a set of blank data drives). Hopefully it'll be a quick swap and we'll be back to recording data again. Jeff and I are really getting into the mode of programming/development. I think we found a way to speed up the NTPCkr a little bit more this morning, which is always a good thing. I'm still mostly working on internal visualization tools (with some simultaneous thought to what the first rev of the publicly available pages may look like). Don't get too excited yet - it's mostly just a table of numbers. - Matt 20 Apr 2009 23:04:44 UTC The mysql database crashed on Friday, then again on Saturday. 
The reasons are mysterious, though we've had similar crashes in the past - just not two in immediate succession like that. Most of the large, important tables (user, host, workunit, result) are using the innodb engine, while the many others (including team, forum preferences, posts, etc.) are using mysql's standard myisam engine. There's worry we may have lost a few rows in some of the myisam tables, though they seem to check out okay. The replica database, though, is in a confused state so we just shut it off for the time being. We're going to save any remaining cleanup for tomorrow's usual outage. As stated elsewhere, Jeff and I have adopted a policy of no-system-changes (except for emergencies) until after the anniversary. So as long as mysql continues to run well, we're not going to worry about this so much. I know I write all these missives and therefore I get the brunt of the accolades (or otherwise) but Jeff/Bob pretty much took care of the entire mess above. I did log in on Sunday and cleaned up the server status page and the validators (which for some reason *have* to start on the command line, as opposed to the usual cron job which restarts stopped processes), but that's the usual drill (we're always logging in on nights/weekends to kick one process or another). - Matt 16 Apr 2009 21:39:09 UTC Slow steady progress since the last tech news item. The science database continues to be massaged into shape from the past month of nastiness. It's working, but some indexes are still missing, and some queries are taking longer than we'd like. Sometime, probably next week, I'll turn the science status page updates back on - until then the numbers are old and/or flat out wrong. We're narrowing down the cause of our data recorder woes to either the SATA card or the system itself. We're trying the former first. A new one is on order and we'll have to get it configured remotely (which is a lot easier than configuring a whole new system remotely). 
We're also finding that we don't have the processing power we'd like. It seems like we lost a lot of active users over the past few months. I blame the recession. You could also blame Astropulse, I guess. In any case, we need more people. We're hoping the 10th anniversary buzz will help. And speaking of that, Jeff and I are putting all focus on the NTPCkr, just so we have something fun/new/interesting to present in time for any p.r. blitz. That means very little effort in systems/upgrades/etc. for the next 5-6 weeks. Simply don't have the time/manpower. Sorry about the lull in tech news items. I was on vacation visiting 23 relatives. Many are under 5 years old, which meant a lot of them have colds, which meant I got sick immediately upon my return, earlier in the week. - Matt 8 Apr 2009 20:00:28 UTC The science database choked last night. Nothing terrible - it was just unable to deal with the pulse index rebuilds as well as the usual outage recovery. So the assimilators got a little hung up for a while until the current index build was finished. It's still a mystery why this was as big an issue as it was - we've built indexes before on live, fully functional databases. Hmm. Apparently we have to be a little less cavalier about it. Turning off a server for good always has unintended consequences. Shutting down milkyway yesterday caused mail from the web server to fail. A couple red herrings later I found the problem - the milkyway mail server replacement (clarke) wasn't configured to allow relaying from the web server machine. Easy squeezy problem to fix. Now reset password requests, forum moderation notices, private message alerts, etc. are being sent. Spent way too much time hunting down the cause of a seg fault in my NTPCker web page code. It's kinda hard when it's a C program that's being executed within a c-shell script, which in turn is being called by a php script, and which is all running under apache. 
It's frustrating when everything works on a command line, but not within apache. Anyway I finally figured it out, or at least got it working. The irony is this code was to produce a tiny close-up waterfall plot around any given signal (to immediately spot symptoms of RFI), and once it was running Jeff and I realized our database query logic was slightly wrong, and the correct logic would take too long to be of any use in a dynamically generated plot on the web anyway. Sigh. Looks like we'll have to batch job it or something like that. - Matt 7 Apr 2009 23:15:25 UTC Outage day today. No big news there on the mysql backup/compression front. We're busy building indexes that were lost during the pulse table rebuild, so that's adding some load to the science database. That may slow splitters/assimilators down at points over the next few days. We shall see. I did shut down server milkyway for good today, which was our last solaris system still running. This makes me sad. In general, I still prefer solaris over linux, for what that's worth. And I definitely have had much better luck with Sun hardware than with anything else. Lost in radar/ntpckr coding, hence the short note today. Now I have to catch a bus... - Matt 6 Apr 2009 22:32:20 UTC Much progress over the weekend on the science database front. The pulse table has successfully been rebuilt, we started up the assimilators, and the queue drained to zero. With the influx of resources the splitters revved up and more workunits went out. All was well until the logical log on thumper filled up. This is a log of transactions which is necessary for database replication, and given all the pulse table activity it's no surprise it did get clogged up with extra transactions. When the log fills, the database engines have no choice but to hold still until there's log space again. Jeff noticed the dip in the traffic graph and got that all sorted out. 
Just now there was another dip in the traffic caused by some DOS'ing on our web site causing some mysql database overload. Damn robots skimming stats off our sites... I made a quick route rule to block the offending IP. This damaging effect was probably unintentional but still very annoying. - Matt 2 Apr 2009 22:44:31 UTC The science database issues slowly get better. The root drives are now all sync'ed up, but as I mentioned before this is only a temporary condition. This will fail again upon next bootup. That's fine because this forces the issue of reformatting the data RAIDs on the system which is something I've been wanting to do for a year now - might as well reformat the whole system, root, data, and all. The pulse table continues to get populated and assimilators remain off - at least for another day. We're about to run out of workunit disk storage (again) so expect another workunit shortage period in the very near future. My new rough estimate for the pulse load to finish is sometime tomorrow, and then we can turn the assimilators on, and we will be back to normal (whatever that means). One of the download servers (bane) has been having mounting issues the past few days, hence the locking-up of the server status page. I just rebooted the thing. Let's see if that holds. Once again today was mostly a coding day. I've been annoyed by the radar blanking stuff, being as how the design has changed underneath me thus rendering a week (or two) of my effort moot. The old understanding was that we should only be seeing one type of radar at a time, but my output was showing this to be far from the case. Nevertheless once I got a quick handle on the fftw routines I made quick work of the correlation code and am already spotting radar quicker and more effectively. However a lot of graphing/threshold tweaking is in order before I can really start locking on and blanking. - Matt 1 Apr 2009 22:01:27 UTC Let's see.. 
we're *still* waiting for the RAID resync's to finish and likewise the pulse table rebuild. Another day or two? Meanwhile, I cleared off enough space on the workunit machine such that we can keep producing/sending out work. We still can't assimilate very much until the pulse table rebuild is over, but at least the people can do science and get credit. I'm worried about mysql bloat with the large result table (over 2 million waiting for assimilation), but we've been here many times before and lived. Lost in the chaos of outage recovery yesterday was a bunch of "make science status page" processes piling up on top of each other, causing extra stress on the science database, and eventually making the splitters jam up. Oops. I killed all those this morning and that particular dam broke. Now that we're catching up on satisfying workunit demand I think we'll be maxed out traffic-wise for a while, which isn't the worst of problems (that means work *is* flowing as fast as we can send it). Lots of code walkthroughs with Jeff today regarding the NTPCker. It's getting to be a mature piece of code. Scoring mechanisms are almost all in place (though they still may need major tuning once we sift through enough real data). We're still concerned about our ability to actually keep it running "near time," i.e. will the database be able to handle the load? We shall see. A lot of database improvements to help this have unfortunately been blocked on the last couple of weeks' worth of problems with thumper. Happy April Fool's Day! Don't believe anything anybody says! Actually that's good advice regardless of the day of the year. 31 Mar 2009 22:48:04 UTC Another Tuesday, another planned outage. We did the usual database compression and backup but it still took a long time as we're bloated with 2 million extra results waiting to be assimilated. No big deal there, but of course we're still mired in the thumper projects. 
It's becoming a two-weeker (since the original crash the Friday before last). Remember we're fighting on two fronts: rebuilding the root drive RAID and rebuilding the pulse table. Starting with the former, all we (thought we) had left to do was install grub on one of the two bootable drives (even though the weird drive numbering causes grub to read the actual kernel image off a third, non-bootable drive). Before launching into that I rebooted the system just to make sure everything was working. This system has very large ext3 file systems, and so I used tune2fs a while back to prevent a long (6-8 hour) forced file system check every 180 days (the default). Unbeknownst to us, it would *also* force a check every N mounts. So I was very displeased to find the system going through a round of forced checks when all I wanted to do was quickly reboot the thing. I was just going to let it go, but after a half hour I got sufficiently annoyed to just halt the check (gracefully) and re-tune2fs'ed to prevent this from happening again. And upon coming up I was further displeased to find the only root drive (of the three) that appeared in the RAID was the one in the non-bootable slot. We're stumped as to why. Well, even though this RAID was seriously degraded, we powered down, did the planned drive swapping and brought the system up. Even though drives were swapped the only root drive this time in the RAID was the (new) one in the non-bootable slot. Fine. I'm pretty much of the opinion we need to reinstall the OS at this point to clean everything up, but until that happens we have some (oddly long) drive resyncs to un-degrade the RAID. Of course, this will all fail again upon next boot as far as I can tell. Meanwhile, the pulse table reload that started yesterday failed last night. Since we have redundant database servers now, the informix engine is sensitive to anything that may bring the primary/secondary systems out of whack. 
This includes really long queries, like the one we started yesterday to copy 500 million pulses from one table to another. Back to square one. Jeff wrote a script that breaks this one query up into many smaller ones, thus hopefully circumventing any "long query" issues. We estimate this will be done Thursday sometime. I did start up one assimilator - the trickery I mentioned yesterday (to let assimilation run alongside pulse table insertions) does work, however as the pulse table gets populated it eats up a lot of database locks, and the assimilator can barely get an insert in edgewise. In any case, I found a rich source of stuff to move off the workunit storage server, so at least that bottleneck will be temporarily alleviated. Oh, yeah - end of the month, so that's the end of the current thread title theme. I think the only person who came close to describing the theme was QuietDad yesterday (apologies if others got it earlier). Anyway, the official theme was: Apple II hackers/game programmers who, as a budding young programmer myself in the 70's/80's, I thought were super heroes such that I fondly honor their names (real or otherwise). It takes a real game programmer to do *everything* - not just the game logic but also the design, the graphics, the animation, the sound, the music... and do it all in machine language (and 6 colors, including black and white, in 280x192 "hi-res" graphics). - Matt 30 Mar 2009 21:58:54 UTC Monday, Monday. There was little done on the science database/pulse table problem over the weekend - we hit a couple snags so we tabled it until we were all here in the lab today. It looks like we're doing the big move successfully now (taking the 500 million pulses from the old table and inserting them into a new, better formatted table with more extent space). I was hoping that we'd be able to do some trickery to get assimilation flowing again simultaneously, but it looks like that isn't in the cards. 
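For readers wondering what "breaking one big copy into many smaller queries" looks like in practice, here is a minimal sketch. It is not Jeff's script: it uses Python's built-in sqlite3 as a stand-in for informix, and the table names (pulse_old, pulse_new), the integer id column, and the batch size are all made up for illustration. The idea is simply to walk the id range in fixed-size slices and commit after each one, so no single statement runs for very long.

```python
import sqlite3


def copy_in_batches(conn, src, dst, batch_size):
    """Copy all rows from src to dst in id-range slices.

    Each slice is its own short INSERT ... SELECT followed by a commit,
    instead of one enormous statement. Returns the number of slices run.
    Table names are trusted here (illustration only).
    """
    cur = conn.cursor()
    lo, hi = cur.execute(f"SELECT MIN(id), MAX(id) FROM {src}").fetchone()
    if lo is None:          # source table is empty
        return 0
    batches = 0
    start = lo
    while start <= hi:
        end = start + batch_size            # half-open range [start, end)
        cur.execute(
            f"INSERT INTO {dst} SELECT * FROM {src} WHERE id >= ? AND id < ?",
            (start, end),
        )
        conn.commit()                        # keep each transaction short
        batches += 1
        start = end
    return batches


def demo():
    """Build a tiny hypothetical pulse table and copy it in 10 slices."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pulse_old (id INTEGER PRIMARY KEY, power REAL)")
    conn.execute("CREATE TABLE pulse_new (id INTEGER PRIMARY KEY, power REAL)")
    conn.executemany(
        "INSERT INTO pulse_old (id, power) VALUES (?, ?)",
        [(i, i * 0.5) for i in range(1, 1001)],
    )
    conn.commit()
    batches = copy_in_batches(conn, "pulse_old", "pulse_new", 100)
    copied = conn.execute("SELECT COUNT(*) FROM pulse_new").fetchone()[0]
    return batches, copied
```

A side benefit of slicing by id is that a failed run can be resumed from the last committed slice rather than from square one.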
With the assimilator queue clogged we can't delete anything, which means we ran out of room to create new workunits, or at least enough to keep up with demand. Hang in there, folks. Work is on the way. - Matt 26 Mar 2009 20:25:02 UTC So the focus is still on thumper, the science database/raw data server. Last night we finished resyncing all the root drives (a three drive mirror). We still have to do some swapping to install grub on the third and final drive - we'll do this during the outage next week. Until then we're officially resuming normal operations, at least at the server level. Phew. I started up several raw data transfer jobs since that's been backed up for a week. Now we can turn our attention to the database. We're dumping the entire pulse table to a file so we can recreate the table in a larger set of db spaces. This is basically all you can do when you run out of extents - unload the table, then reload into new db spaces. I roughly estimate the unload will take at least 24 more hours. Since we couldn't insert pulses until we got more extents, the assimilator queue grew fairly large. So why stop now? There's really no reason not to split/create new multibeam workunits - we can still insert workunits into the science database. So I started a single multibeam splitter if only to satisfy some workunit demand until we can assimilate again. Of course, if we can't assimilate, we can't delete - and we've been running low on space to store workunits. But being that we've been running only astropulse for a day that actually helped push a lot of ap workunits/results through the validation/assimilation/deletion queues, which in turn cleared up a fair amount of storage. So we're good for the moment, at least storage-wise (seems like even the one splitter is sensitive to the current heavy load on thumper). Tomorrow is actually an official university holiday (the staff gets its one day of spring break). 
However, like always, Jeff, Eric, Bob and I will be poking and prodding at the servers remotely over the weekend. - Matt 25 Mar 2009 21:07:21 UTC Mmm-kay. So where are we at with the science database...? The morning today was much like yesterday: me, Eric, and Jeff shouting over the deafening noise of the server closet, taking turns hunched over a monitor attached directly to thumper (the kvm monitor was having separate issues). Lots of reboots and unexpected (and unpleasant) results. Lots of thinking we found the problem only to reboot and (five minutes later) finding we were wrong, then having to reboot again off of DVD (taking another five minutes). Basically our discussions were along the lines of: Why does the boot metadevice disappear when booting off of DVD? And why does the root metadevice disappear when coming up via grub? Didn't we resync these two drives yesterday? Oh look - the grub device map is referring to /dev/sdm, which was how the root drive was enumerated when there were only 24 drives in the system - it should be referring to /dev/sdy now that we have 48 - so this must be at least one of our problems! Nope. Changing that did nothing. Etc. etc. etc. etc. Well, whatever. It's been a two-day-long game like a demented version of Towers-of-Hanoi - swapping drives, installing/reinstalling grub, resyncing devices, reconfiguring mdadm, then going back to step one and trying a different permutation. In hindsight it probably would have been easier to just install a new OS from scratch (though we would have had to recreate a web of informix configuration which also exists on the root drives). Right now the system is actually up (finally) and resyncing one mirror (again) and will have to sync another once that's finished. So we're offline for another day, and we haven't even gotten to the pulse table problems yet. I will still try to get Astropulse running in some form later on today/tonight. 
Funny thing: Oliver and Bernd of Einstein@home have been visiting from Germany, collaborating with Dave on some general BOINC stuff. They left just a couple hours ago, but we did discuss how when SETI@home is having issues such as this, Einstein@home certainly gets a huge "bump" from the sudden influx of free CPU time. We joked how these thumper issues strangely coincided with their arrival last week. Meanwhile, I'm back on radar blanking detail. We're now trying cross-correlations to match radar patterns using fftw. - Matt 24 Mar 2009 20:27:33 UTC The good news is that our regular Tuesday maintenance outage today chugged along quickly, and without incident. The not-so-great news is that we are still fighting with thumper to get it running properly again. Jeff, Eric, and I whipped up a cookbook yesterday of the 7 or 8 steps to get thumper's root drive mirrored. As of this morning we had only one working drive with root/boot on it, but it's the spare drive sitting in the /dev/sda slot. According to the BIOS, the root/boot drives have to be in slots #0 and #1, but thanks to non-linear disk controller labels on the backplane these drives show up in linux-land as /dev/sdy and /dev/sdac. Of course, you can only install grub on /dev/sd[a-d] which means lots of disk swapping and rebooting and resyncing. However, we're still on step #2 right now, and it won't finish until later tonight. The three of us were huddled over thumper for almost three hours - a frustrating period of time starting with us rebooting thumper "just to make sure everything is working" and then it wouldn't mount the root drive because of underlying issues with the metadevice. This was all mysterious, and after poking this and that it got worse - we could only boot in recovery mode off of DVD, and we had to hack partition tables and change disk identifiers before we could see root again. 
That's where it's at now: we're syncing the one working drive with a new spare, a process that we thought would take less than an hour but will take five, apparently. To add insult to injury our pulse table in the science database on thumper ran out of extents last night, which basically means the tables are full even though we have disk space available. So as if the above ordeal wasn't enough, we'll need an additional day or two to recreate (or at least hack at) the pulse table to add more extents. Long story short, don't expect SETI@home to be generating any new work or assimilating anything for a week (unless we're lucky). We'll at least try to keep Astropulse working during this time, so computers that can run Astropulse will be kept busy. When it rains it pours, but we'll be back to normal again soon enough. - Matt 23 Mar 2009 19:30:51 UTC We had a crazy weekend in database-land. First and foremost, we had issues with one of the root drives on thumper (the primary science database server, among other things). We didn't completely lose the drive, but smartd has been issuing complaints recently about bad sectors, and then the whole system crashed Thursday sometime in the early evening. While I was able to get the machine back up and RAID resyncing from home that night, the timing was such that poor Jeff and Eric had to deal with the fallout the next day without me (I was in Carmel playing spy music at a corporate party - things like the theme from "Get Smart"). The drive arrangement on thumper is a little bizarre. There are 48 drives that sit in a 12x4 grid, with drive #0 in the lower left corner. However, due to the ordering of the six disk controllers on the system, the root drives (a mirrored pair) show up as /dev/sdy and /dev/sdac. This gave us a bit of a headache when installing linux on this the first time a while ago. The root mirror has a dedicated spare, which by some coincidence happens to appear as /dev/sda. 
Since we never really exercised an actual root drive failure on thumper, Eric and Jeff spent Friday lost in a maze of conundrums. For example, given that grub only recognizes the first four drives in a system (/dev/sd[a-d]) how were things working all along? After some head scratching and drive swapping they got thumper back on line. We still need to replace a drive or two, and those just arrived this morning. Another confusing game plan awaits us as we take what we learned and actually try to apply it. Short story: we need to make a three way mirror of the root drives, after installing grub on the spare by booting from DVD, etc. Honestly I still don't quite get it as I write this up but I'm hoping I will after we go through the whole procedure. And then yesterday jocelyn (the primary mysql server) had some issues. Eric restarted it, and things seemed to clear up without much ado in due time. To be safe we'll do some sweeping data integrity checks on all our databases, probably during the regular outage tomorrow. - Matt 19 Mar 2009 20:44:53 UTC Another work week is drawing to a close for me (I don't come in to the lab on Fridays - sometimes I work from home - sometimes not). The servers continue to hold on as long as we have the hardware/network resources available (when will they become unavailable? Hours? Days? Weeks? Months?). Yesterday I mostly worked on NTPCKr web programming - stuff for mostly internal use, but a "lite" version will be made public eventually. Why the "lite" version? It's not because we have something to hide - we just don't have the web server/database resources to handle the traffic. The hope is that the public version will at least have a regularly updated list (every hour?) of the current most interesting pixels on the sky, and you can click on them and see where they are in the sky, and get some sense of why they scored well (numbers of signals, they line up with stars/extrasolar planets, etc.). 
The internal version will have, among other things, additional clicks so we could pull a window of signals out of the database, plot them, and we can scan for RFI - you can see why this would add a big load on our servers. Nevertheless, we'll see what we can manage, and try to make as much information as possible available to everybody. Today I spent way too long dealing with confusing subversion/trac configuration. Annoying. I guess I should be getting back to radar blanking (sigh). - Matt 17 Mar 2009 21:37:38 UTC Hello again. Sorry about missing a couple days there. The end of last week I did write a tech news item that I neglected to post as I got suddenly very busy at the end of the day with random programming tasks, and yesterday I was lost in many meetings and other post-weekend catchup. So be it. Here I am now. The end of last week I was at a standstill with various projects, so I chipped away at neglected chores and other nagging annoyances. Like our new mail server's log filling up with cryptic automounter messages regarding a machine we haven't had on line in five years - I finally tracked this down to Eric's home-grown spam challenge script which made reference to this machine in its LD_LIBRARY_PATH. I also tried and failed to figure out why one of our systems, configured exactly like the others, refuses to acknowledge the lab-wide legato backup server. And I cleaned my keyboard for the first time ever (which was gross after years of eating at my desk, and this was probably not helping the lingering ant problem). Then I got lost in NTPCkr stub web page design. Yesterday there was much discussion about radar. Dan, recently back from Arecibo, confirmed some things and had news about others. The radar blanking code I took over and improved upon had faulty logic, caused by some early misunderstandings (not mine) about how the radar behaved. Most of the radar we see is from the airport, and that's all the hardware blanker thwarts. 
However, there are 5 other patterns we detect, including the aerostat balloon radar. So one problem is that at times we're seeing a jumble of various radars, making it very difficult to "lock on" and blank them. I'm working on that now. One other point is that the radar frequencies are all pretty much out of our band (typically around 1.3GHz - we're looking around 1.42GHz), but nevertheless are so loud they jam our receivers. However, sometimes if certain projects call for it the Arecibo operators turn on a high pass filter so that the radar frequencies under ~1.4GHz are completely silent. When this happens (about 20-25% of the time) our data are incredibly clean, even without hardware blanking. Of course, since we're piggybacking we can't control when the filter is on, but we do keep track of it in our data headers. We might prioritize this cleaner data for astropulse, which is far more sensitive to radar than SETI@home. Today had the usual outage for mysql database backup/compression. I took extra time while everything was quiet to move a lot of big files around the raw data storage server - that's mostly why we were slow to get out of the outage this time around, but at least now I can start emptying the latest shipment of drives from Arecibo. Speaking of drives, there was some discussion about that, too. We may start trying to partially send data over the net, if not completely. We thought this was impossible due to bandwidth constraints, but operators at Arecibo told us to give it a shot. This is low priority since, however annoying, the drives, their enclosures, and the shipping rigamarole works well enough right now. In general the public-facing servers continue to behave themselves. It's been a good couple of weeks. I don't believe in jinxes so I don't mind saying as much. I will say that the workunit storage server is filling up again - a factor of astropulse actually performing well, and workunits sitting around a long time waiting to validate. 
If it does fill up we'll have to deal with it. - Matt 11 Mar 2009 20:43:03 UTC Lots of machine rebooting today as Eric is getting his new hydrogen server online, and I'm finishing work on moving mail servers around. This shouldn't have affected the outside world. During all this Eric gave Jeff and me a quick tutorial on merged file systems. Wacky stuff. Radar-wise, I got some lengthy notes from Phil down at Arecibo. Turns out by far most of the radar we see is from the airport, which was news to me, and that's the only thing the hardware blanker checks for. Discussions will continue. Dan, while at Arecibo earlier this week, replaced our non-working raw data drive enclosure with one we've been using up here. It's unclear whether this helped or not. We're learning that SATA drives (and enclosures/backplanes) aren't necessarily meant for excessive hot-swapping, and will fail after N "mating cycles." This may be what we're coming up against. - Matt 10 Mar 2009 22:45:02 UTC Tuesday means weekly outage day. Nothing really interesting or scary today. The only sysadmin thing I did during the quiet time was moving mail service off one machine (which we plan to retire soon) onto another. Still have a couple steps to go on that front. I should mention that we upgraded our network connection from our auxiliary lab to the server closet from 100Mbit to 1Gbit. In practice this meant simply replacing an old cheap switch with a new cheap one. This was mostly for the benefit of Eric and his new compute server, but on the side helps vader (which handles half the downloads and all the assimilators) and our other compute servers maul and marvin (all of which still sit in the other lab, awaiting room in the closet). Finally stopped being sidetracked enough to work on radar blanking again today. I'm finding some data is very clean and would like to not enforce blanking if it seems unnecessary. E-mails were sent to the experts for advice. 
- Matt 9 Mar 2009 22:38:16 UTC Happy Monday, everybody. It was a pretty smooth weekend, so not much to report there. Today I mostly took care of chores and the less glamorous/interesting side of systems administration. Eric bought a new server for his hydrogen projects. We needed to put it somewhere, so we decided to put it in our current auxiliary rack, which is currently sitting in our other lab waiting to replace one of the smaller (and less useful) racks in the closet. One of the download servers (vader) is actually in this auxiliary rack already. Anyway, we discovered that yet again the rails for this server are ever-so-slightly too big given the current rail configuration. Annoyed but determined, Eric and I put forth the effort of taking vader out of this rack (which is why it was offline for an hour there) and adjusting the stupid rails. Now everything fits. Good. To answer PhonAcq's question ("Now what is on the agenda to improve things to the next level of performance??"), there is always some looking ahead to what we'll need soon. First up is more memory in our mysql server (jocelyn). When all is well it can easily handle a mixed bag of 2000 queries/sec, but during peaks or other crises it may start to page and cause massive disk i/o. Given the current memory configuration it'll be quite easy to add 4GB ram to the system, which will help. Of course we're simultaneously scanning different avenues of download/upload bandwidth increase. We still have yet to do the whole project of converting thumper's RAIDs from 5 to 10, which will boost science database (and likewise splitter/assimilator) performance. There's more, but that's a good start. - Matt 5 Mar 2009 22:01:59 UTC Once again not much hardware/server stuff to report. I guess the ap_validator "2" is failing due to seg faults. 
A fact that is obscured on the server status page (due to automatic parsing of configuration files) is that the ap_validator "2" does strictly astropulse_v5 workunits, while ap_validator "1" validates older astropulse workunits. In any case, I warned Josh, he's looking into it, etc. Probably a broken result file/database entry is causing it to seg fault and quit before doing very much. Today was mostly conceptualizing/programming again for me, though focused back on radar blanking stuff as I should really get this done. I'm getting bogged down with "ragged files" - where the chunks of data aren't nearly ordered, thus causing confusion about where the software/hardware blanking bits are. This usually isn't a problem, except when a particular raw data file is ragged at the top or bottom, and the chunk containing blanking information needed by adjacent chunks is actually at the end of the previous file, or at the beginning of the next, or nowhere to be found at all. - Matt 5 Mar 2009 0:26:41 UTC Don't really have much to report today, tech-wise. The replica problems I mentioned yesterday ended up not being problems at all. There was some network security stuff I got bogged down with yesterday afternoon and again this morning - campus is ultra paranoid, so when they see what they think is nefarious activity (false positive or otherwise), or even potential security holes that haven't yet been exploited, you have to pretty much drop everything and act on it, which is fair enough. I spent pretty much my entire day getting the ball rolling on the "visualization" of the NTPCkr output. Jeff has some code working which dumps out giant blobs of xml detailing the "current best" points on the sky. So I spent the day writing up some php which digests this xml and makes nice tables, plots, etc. It's all very basic so far, but it's a start. We're getting large bursts of network activity at midnight every day now. Not sure what's up with that. 
Somebody's got a cronjob somewhere doing something. - Matt 3 Mar 2009 23:11:39 UTC Usual outage day today (database backup/maintenance, etc.). Actually it would have been "usual" except that certain finagling by us in the background may have messed the replica up. That remains to be seen - if it needs intervention the fix would be trivial. Oh look. Somebody updated web code. Pretty colors. I think I overheard Dave talking to Rom about new forum features. I have no idea what they are. Helped Jeff walk through NTPCkr code this morning, tracking down bugs, etc. In essence the goal of this program is simple - to find groups of signals in our data that fall within a certain window of frequency/space but have been seen over multiple observations, and preferably near stars/planets. But it's actually quite complicated - there's a lot of set analysis/manipulation requiring chunks of dense code where bugs can hide if you're not careful. Plus there are always new "special cases" we find (or dream up before we find them) that we need to consider. In any case, we're pressing to get this thing rolling and producing non-zero results before the 10 year anniversary of the SETI@home launch in May. - Matt 2 Mar 2009 23:01:18 UTC Not much going on (SETI@home-wise) over the weekend. The fallout from those traffic woes over a week ago is pretty much all behind us (I think completely after we do the database compression tomorrow). The average temperature in the server closet has risen slightly, but we think this is mostly a function of current weather (it seems that during rainy/foggy periods the air conditioner is less efficient). I did get another server online - something donated by Intel a while ago but only now found the time to set it up, add more memory, etc. It's going to be mostly used as a compute server for Eric's hydrogen study project, which is good for SETI as that means his IDL processes won't be competing with our NTPCkr/radar blanking tests. 
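Since the NTPCkr keeps coming up, here is a toy illustration of the basic idea from the 3 Mar entry above: look for signals that land in the same patch of sky and the same frequency window, but in more than one observation. This is emphatically not the NTPCkr code. The tuple layout (pixel, frequency, observation id), the fixed frequency bins, and the function name are all assumptions made for the example, and real candidates near a bin edge would need smarter handling than this.

```python
from collections import defaultdict


def persistent_groups(signals, freq_window_hz, min_observations=2):
    """Group signals by (sky pixel, frequency bin) and keep repeat sightings.

    signals: iterable of (pixel, freq_hz, observation_id) tuples.
    Returns {(pixel, freq_bin): set_of_observation_ids} for every cell
    that was seen in at least min_observations distinct observations.
    """
    by_cell = defaultdict(set)
    for pixel, freq_hz, obs_id in signals:
        cell = (pixel, int(freq_hz // freq_window_hz))
        by_cell[cell].add(obs_id)
    return {
        cell: obs for cell, obs in by_cell.items()
        if len(obs) >= min_observations
    }


# Hypothetical data: pixel 7 is seen near the same frequency in two
# different observations; pixel 9 is seen only once.
example_signals = [
    (7, 1420000010.0, "obsA"),
    (7, 1420000030.0, "obsB"),
    (7, 1420000020.0, "obsA"),
    (9, 1420500000.0, "obsC"),
]
example_groups = persistent_groups(example_signals, freq_window_hz=100.0)
```

Even in this toy form you can see where the set manipulation comes from: each cell carries a set of observations, and everything interesting is a question about those sets.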
We continue to have raw data drive enclosure problems. This time the set down at Arecibo is getting funky. Very hard to debug remotely. - Matt 26 Feb 2009 19:46:29 UTC Random day today for me. Catching up on various documentation/sysadmin/data pipeline tasks. Not very glamorous. The question was raised: Why don't we compress workunits to save bandwidth? I forget the exact arguments, but I think it's a combination of small things that, when added together, make this a very low priority. First, the programming overhead to the splitters, clients, etc. - however minor it may be it's still labor and (even worse) testing. Second, the concern that binary data will freak out some incredibly protective proxies or ISPs (the traffic is all going over port 80). Third, the amount of bandwidth we'd gain by compressing workunits is relatively minor considering the possible effort of making it so. Fourth, this is really only a problem (so far) during client download phases - workunits alone don't really clobber the network except for short, infrequent events (like right after the weekly outage). We might be actually implementing better download logic to prevent coral cache from being a redirect, so that may solve this latter issue. Anyway.. this idea comes up from time to time within our group and we usually determine we have bigger fish to fry. Or lower hanging fruit. Oh - I guess that's the end of this month's thread title theme: names of lakes in or around the Sierras that I've been to. - Matt 25 Feb 2009 22:48:42 UTC It looked like we got beyond the current deluge without too much intervention. Good. Then our bandwidth spiked again. Bad. But then it recovered once more. Good. Oh well, whatever. We're still just in "wait and see if it gets better on its own" mode around here - if we hit our bandwidth limits (and we understand why) there's not much else we can do. Spent a chunk of the day tracking down current donation processing issues. What a pain. 
I really need to document the whole crazy donation system so other people around here can fix these problems when they arise. Maybe I'll do that later today. Other than that, just some data pipeline/sysadmin type stuff. A note about the server status page: Every 10 minutes a BOINC script runs which does several things including: 1. start/restart servers that aren't running but should be, and 2. run a bunch of "task" scripts, like the one that generates the server status page. Since this status page script runs once every ten minutes, it is only a snapshot in time - not a continuum. It also could take several minutes to run its course, as it is scanning many heavily loaded servers. So the data towards the top of the page is representative of a minute or two earlier than the data towards the bottom. And server processes, like ap_validator, hiccup from time to time and get restarted every 10 minutes, then maybe process a few hundred workunits, but fail again a second before the status page checks its status. So even though it was running the past couple of minutes it shows up as "Not Running." In short, don't trust anything on that page at first glance. - Matt 25 Feb 2009 0:16:11 UTC Had our weekly maintenance outage today, including the usual chores. I took the opportunity to replace a failed drive on one of our administrative file servers. I also issued the long-overdue final "shutdown" command on another administrative server, kang, which we no longer use. Many years ago, during the early days of SETI@home, several Sun representatives came by one day to discuss our progress. We thought it was just an informal touching-base kind of meeting, but they told us at the end they were going to donate a whole rack full of 6 state-of-the-art Sun servers and 2 disk arrays. Sun has always been nice to us, but this was completely unexpected. 
We eventually dubbed this the "k-rack" as we named every server after a sci-fi character starting with "k" (kang, kodos, kosh, klaatu, kryten, koloth). Well, kang was the last one to go - the end of an era. We're still using the rack itself, though - very useful. Network bandwidth woes continue, more so now that we're coming out of the weekly outage. Lots of discussion about this in the previous thread - let me see if I can wrap up all the major points quickly. There are three potential solutions to our bandwidth limitations that we are actively entertaining/researching with the related parties. They are: 1. get a full 1Gbit link up to our server closet (pros: zero migration, cons: time/cost - about $80K in parts/labor), 2. collocation on campus (pros: minimal cost/migration, cons: almost impossible nuisance having to administer from a distance), 3. have a third-party entity host/administer everything (pros: we can ditch sysadmin for once and get back to work, cons: major cost, major migration). Each of these solutions requires a major amount of "getting ducks in a row" (due to equipment policies, contract terms, general scheduling issues, etc.) - it's hardly just a money issue. Of course there are other options, too, like putting all efforts into final data analysis and shutting down SETI@home. One major issue is that our server closet (roughly 100 CPUs, 100 TB disk, 200 GB RAM) operates atomically - it's all or nothing. We can't just move one piece somewhere else. It's long and complicated - please don't make me explain why unless there's a free pitcher of beer involved. - Matt 23 Feb 2009 21:06:51 UTC Our outbound traffic has been pegged since Friday. This may seem like only a download problem, but it even affects uploads, as the basic syn/ack handshaking packets on the upload server get dropped along with the rest of the download packets that can't make it through the dam. After discussions with Eric and Jeff, here's what we gather is happening. 
We use coral cache to reduce our bandwidth needs. Coral cache is an easy-to-use, free, third-party system which does some nice distributed caching just by redirecting the right apache requests to their servers. For example, somebody wants to download the latest astropulse client, they go to our download server, and then they are redirected automatically to the coral cache server. The redirect is of the form such that, if the coral cache server hasn't done so already, it downloads the latest astropulse client from us, caches it, and then sends it to the requester. Once cached, it doesn't need to contact our servers again. So, in essence, all but one of the client download requests originate from sources outside our lab, thus saving us lots of bandwidth. That brings us to problem 1. Many ISPs don't like redirects to third-party IPs. This is understandable. What happens in this case is a client downloads a new application, but instead of getting the actual executable they get a blob of HTML saying "this ISP doesn't like third party redirects," etc. Obviously the checksum of this HTML blob won't match the executable checksum, resulting in an application download checksum error. This has been a known problem. So we've been only using coral cache during the first couple of weeks after a new application is made available to reduce the pain of the download rush. A small fraction of our users will be inconvenienced by those redirect errors, but they'll get their clients in due time when coral cache is turned off after the initial "wave." But then there's problem 2. An application download checksum error (a) doesn't cause exponential backoff and (b) causes all workunits also requested by this particular client to be errored out and resent. This is at least the behavior in older, yet still commonly used, boinc clients. Dave said most of that has been addressed, but if there are still bugs they'll be fixed. 
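To make the two problems above concrete, here is a small sketch. It is not BOINC code: the MD5 digest, the function names, and the backoff numbers (60 second base, one day cap) are all assumptions chosen for illustration. The first function shows why an HTML "redirect blocked" page fails the checksum test that the real executable passes; the second shows the kind of exponential backoff that, per the above, a checksum error was *not* triggering.

```python
import hashlib


def verify_download(payload, expected_md5):
    """Return True if the downloaded bytes hash to the expected digest."""
    return hashlib.md5(payload).hexdigest() == expected_md5


def backoff_delay(attempt, base_seconds=60, cap_seconds=86400):
    """Delay before retry number `attempt` (0-based): doubles each time,
    up to a cap. Both numbers are made up for this example."""
    return min(cap_seconds, base_seconds * 2 ** attempt)


# A stand-in "executable" and the digest the server would advertise for it.
executable = b"\x7fELF" + bytes(range(256))
advertised_md5 = hashlib.md5(executable).hexdigest()

# What a client behind a redirect-blocking ISP receives instead.
html_blob = b"<html><body>Third-party redirects are blocked.</body></html>"

good = verify_download(executable, advertised_md5)   # real file: matches
bad = verify_download(html_blob, advertised_md5)     # HTML blob: mismatch
```

Without something like backoff_delay in the loop, a client that gets the HTML blob just asks again right away, which is how a handful of busy hosts can keep the pipe full.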
In any case, what we saw this weekend was a confluence of these two problems. This may not have been an issue before due to lighter traffic patterns, but we sure fell off the deep end this time. Maybe there was a small set of heavily active clients this time around causing most of the pain. And once the network gets pegged, all hell breaks loose, and it takes a while to heal itself. Eric actually had most of this figured out before we arrived today, and already turned off coral cache. At least the broken redirects spiraling out of control would stop happening. He also adjusted the tcp settings on the upload server to help get those partially working again (instead of only 2% uploads getting through, now it's about 50%). The plan is to let this current state of indigestion pass on its own, and if needed change some BOINC settings (if not also BOINC code) so that future coral cache attempts will be direct links as opposed to apache redirects. - Matt 19 Feb 2009 20:41:57 UTC As we move toward the weekend we're sticking with the current raw data storage workarounds, which means servers are loaded heavier than we'd like, but at least data is still flowing. I wouldn't be surprised if there are network hiccups or if the assimilator queue swells during the weekend. So far this morning lots of chores. Bob and I got a shipment of empty data drives bundled up to be sent to Arecibo. I finished getting the new CPU server configured (now me, Eric, Josh, and Jeff are in less competition for cycles). I made more strides towards retiring the last two Solaris machines. Honestly, depending on the development/production environment I'd still probably prefer Solaris over linux. So I'm sad to see these systems go, but they are both very old Sparc machines that we simply don't need anymore. 
Late last week Eric, Jeff and I had a quick meeting to discuss current candidate scoring algorithms - we're pretty sure we'll have to tweak them as we go, but we're in enough agreement to get started implementing this part of the NTPCkr. Jeff's been all over that this week. I'm just now turning my focus back to actual development, too. My software radar blanker now agrees with the hardware blanker 90% of the time, which is a very good start. I can add an additional 5% just by adjusting thresholds, but the real test is to run software blanked data through the pipeline and see which workunits generate more RFI (the ones using hardware blanking or the ones using software blanking). - Matt 18 Feb 2009 23:39:35 UTC Still having ups and downs with the raw data storage. Possibly a second disk failure. We'll get to the bottom of it soon enough. Traffic may be a bit rocky at times, but hopefully not so much. We also just noticed a drive failed on our upload backup storage. That RAID pulled in a spare without anybody realizing what happened until Jeff and I saw the little orange light in the closet today. We really need better monitoring tools. Actually, we have the tools - we just need time to implement them. Still, it's not a super-critical logical drive (it contains backup data from a separate RAID device) so we're not panicked trying to procure a new spare... yet. I wish I had more positive things to report today. The details I'm failing to mention aren't all that fun either. Not my day today, I guess. - Matt 17 Feb 2009 23:42:50 UTC Over the long (President's Day) weekend one of our storage servers had a headache. Not a big deal, and we got to the bottom of it today (pretty much just a RAID drive failure). We were able to get a workaround in place so we could start generating/saving workunits again, and will slowly transition back to normal over the next day or two. 
It has been a bit rocky the last few days because the workaround involves a different RAID with far less I/O throughput. There's always a bright side during work transmission failures: we get to catch up on backlogged queues. So by the time we had our usual database compression/backup outage today the result table was relatively small, and therefore got packed down nice and tight. That's always helpful. Spent most of the day with the fallout of the above, while also getting a couple systems configured for new duty - mostly administrative/CPU servers that will replace some older clunkers. - Matt 12 Feb 2009 20:20:58 UTC Looks like "Astropulse V5" was finally released yesterday night. As far as I know so far, so good - work is going out, results are being validated. However, it seems like jocelyn (the master mysql database server) had a long period of mysterious pain overnight, and recovered on its own this morning. This happens from time to time on our mysql servers, perhaps due to its own nebulous data scrubbing, or perhaps due to lack of memory which is becoming more of a problem as the database continues to grow and less of it fits in RAM. Unless anybody out there has a couple Sun-qualified 2GB DIMMs that work in Sun v40z's kicking around, we're going to purchase a few. Currently the system has 28GB of RAM - 12 slots with 2GB DIMMs, the remaining 4 with 1GB. We hope to at least upgrade those four to 2GB. It is unclear whether or not our version of the v40z can take 4GB DIMMs (and go over 32GB total). As for radar blanking, let me clear up the general picture. Now that we are using the ALFA receiver (since 2006) we are susceptible to military radar, which causes many overflows in our SETI@home/astropulse analysis. The transmitter is aimed right at us approximately every 12 seconds, and then echoes bounce all over the mountains surrounding the telescope the rest of the time. Even the echoes cause us to overflow. 
The radar is fairly unpredictable - the military isn't very forthcoming about their transmission patterns, and when they are going to change to another pattern. Nevertheless, it is predictable enough: there are about 6 known "patterns" us civilians can lock on to. Luckily, Arecibo solved this problem for us. They have a hardware device that broadcasts a bit letting all projects at the observatory know when it thinks the radar is on (1 for on, 0 for off). This we call the "hardware blanker" - and we inject this bit into an unused channel in our raw data. This has been quite helpful: when the bit is "1" we'd randomize the data, thus squashing the overflows. At least in theory - there were still three problems. Problem 1: We only got the hardware blanker working sometime in 2007, so there is no such blanking information in the previous years' worth of data, thus rendering it fairly useless. Problem 2: The hardware blanker sometimes isn't on like it should be, or even worse is mis-locked onto a wrong pattern and going out of phase with the actual radar, which also renders data quite useless. Here's where my code comes in: The "software radar blanker." Actually, this is code/logic written by a summer student, Luke, and then I cleaned it up and (apparently so far) got it working. In short, the software radar blanker does a statistical analysis of the raw data - basically looking to see when we're blasted by radar and then trying to lock on to known patterns, and extrapolate from there. Luckily there's another free bit available in the raw data, so the ultimate plan is for raw data to come up here, go through the software radar blanker, and then process. The splitter will use the software and hardware radar blanker bits (exactly how is still up for discussion) to randomize the data. This brings us to... Problem 3: The randomization shouldn't be totally random. Initially we were injecting white noise into the data when we were blanking. 
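For the curious, the basic blanking step can be sketched in a few lines. Everything here is a simplified stand-in: samples are plain floats rather than our real raw data format, the function names are made up, and OR-ing the hardware and software bits is only one possible way to combine them (as noted above, the actual rule is still up for discussion).

```python
import random


def combine_bits(hw_bits, sw_bits):
    """Flag a sample when either blanker says the radar is on.

    Logical OR is an assumed policy, not a decided one.
    """
    if len(hw_bits) != len(sw_bits):
        raise ValueError("bit streams must be the same length")
    return [1 if (h or s) else 0 for h, s in zip(hw_bits, sw_bits)]


def blank_with_white_noise(samples, blanker_bits, sigma=1.0, rng=None):
    """Replace every flagged sample with Gaussian white noise.

    samples      -- sequence of floats (hypothetical stand-in for raw data)
    blanker_bits -- sequence of 0/1 flags, 1 meaning "radar is on"
    sigma        -- standard deviation of the injected noise (assumed value)
    """
    if len(samples) != len(blanker_bits):
        raise ValueError("need one blanker bit per sample")
    rng = rng or random.Random()
    return [rng.gauss(0.0, sigma) if bit else x
            for x, bit in zip(samples, blanker_bits)]
```

Unflagged samples pass through untouched; only the stretches where the bit is 1 get overwritten with plain white noise.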
Turns out this causes edge effects and other artifacts during the client analysis. This noise was eventually shaped to fall in line with noise we'd expect to see from a quiet Arecibo. The exact mathematical details of this are left to others who aren't me. I was out of this loop. All the above was taking too long, so Josh actually implemented code in the astropulse client to reduce some of these radar problems until they are completely solved. He isn't radar "blanking" (which happens during workunit creation) as much as having the clients find stuff that is probably radar and treating it accordingly. For what it's worth, one of the CASPER guys, Andrew, has been having the same exact military radar problems with the pulsar data they've been collecting at the ATA, so he's been simultaneously working on his own radar mitigation techniques. Man, the earth is noisy. In any case, I figure it'll be about a month of testing/tweaking before we're actually using the software blanker. - Matt 11 Feb 2009 23:01:41 UTC Before releasing the astropulse application Eric had to add a couple fields to the result tables in the science database that are now necessary. These are large fields, and it's taking informix forever to update the table. The job was started 24 hours ago and is still chugging along. I guess it doesn't help that the assimilator queue is still rather large (though it is draining). So the release is delayed until this job finishes. The radar blanking stuff I was whining about the other day has nothing to do with the astropulse release, in case there was some confusion about that. Josh and I are working on two completely separate and different forms of radar mitigation. Mine is to better clean up data before any splitting/analysis, Josh's is to deal with radar that squeaked through the first pass and made it all the way to the client. The good news is that I made significant progress on mine today. 
- Matt 10 Feb 2009 23:03:00 UTC Today's Tuesday - that means weekly outage. Outside of the normal database backup/compression drill I went through the rigamarole of changing the user id of mysql on the master database server (and updated the ownership of all its files), if only for administrative ease now that it matches the same user id as all other instances of mysql here in our group. I also decided to yum up several servers that were lagging behind since we have been getting ugly yet harmless kernel warning messages for a while now. Unfortunately, this general update included a buggy nfs package (which I knew was buggy months ago but assumed they must have fixed this by now) which then locked up one of our main file servers, thus grinding everything to a halt. It was an annoying hour or so trying to figure this all out, and ultimately the only solution was to fall back to an old version of nfs. Not sure why this nfs-utils package is *still* in the repositories. Josh is working on getting another astropulse client out into the world today, and is fighting with the code signing machine as I type this sentence. Here's another problem we've been having over the past couple weeks, and it doesn't seem to be getting better: ants. I typically don't take a lunch break, and just nosh all day by my computer during small cracks of time. Dave and Jeff are the same way, and have the next two desks adjacent to me. Even though we're on the third floor the ants finally found the motherload of crumbs and unwashed utensils left on our desktops. There's not enough of them to find their exact point of entry nor plot their general plan of action. So throughout the day I've been mashing the little buggers as I spot them. Hopefully they'll just give up and disappear - meanwhile my work space is smelling more and more like formic acid. 
- Matt 9 Feb 2009 22:48:18 UTC My mondays are generally spent (a) figuring out what went wrong over the weekend (if anything), (b) cleaning up the data pipeline which has been running on its own for three days, and (c) preparing for or sitting in meetings. Today wasn't so different. Between my radar blanking tests, Jeff's NTPCkr tests, Josh's astropulse development, and Eric's hydrogen studies we're suddenly finding ourselves woefully low on CPU/memory power. Sure, we have 100 CPUs in our closet, but I'm kind of a fuddy-duddy when it comes to running non-critical processes on our high-availability public facing machines. This is frustrating to others as these machines are the ones best suited for the testing/development we're doing. Luckily, we have one server, maul, which can never be a critical system as it has a test motherboard which would be fine except it intermittently loses contact with the keyboard. So this is our one CPU server which is now usually overloaded to the point of unusability. We do have two machines coming to the rescue: One from Intel, actually donated around the same time as maul. We haven't gotten around to installing an OS on it until today. Why? Well, that means also needing an IP address for it. The university charges us monthly per IP address we use, so to conserve funds we've been keen on only bringing systems online we actually intend to use, preferably to replace a current system. The second machine is a similarly powerful one that we received from a private donor last week.. but the motherboard was DOA. At least that's our theory. We'll get that replaced soon. Both systems will go a long way towards reducing our current development/testing constraints - something we haven't been worried about too much over the past decade because we've been mostly in a mode of data collection/reduction instead of final data analysis... in case you haven't noticed. I'm happy this is changing (or at least portending to change). 
- Matt 5 Feb 2009 23:57:23 UTC Spent a large chunk of the day actually programming, which is nice. It seems like the network bandwidth bottleneck part of our malaise over the past couple of weeks has finally gone away - we're back down to a floor of 60 Mbits/sec. However, the mysql database is still quite clogged up. Looks like as I type this sentence we're still having fits as the splitters/feeders/etc. can't get their queries through fast enough. I'm hoping the bandwidth drop means the excess results were all finally downloaded, which means in the next few days they'll return, and we can finally get them validated/assimilated/deleted and out of our hair. There was a sweeping change in web code brought on line this afternoon. This broke web account authentication, making it impossible for people to log in. Oops. Not my bad - don't kill the messenger. Anyway it was fixed quickly enough. - Matt 4 Feb 2009 22:01:23 UTC Moving on... We seem to have eventually recovered just fine from the replica resync, as well as the outage in general. Traffic is still very high, but at least just below the point of impossibility. The assimilator queue is indeed dropping, which is a good thing, as that means we're inching closer to removing all the excess workunits and results from the disk, as well as the database. We still seem to be dealing with the result indigestion I described two days ago, but this too is sloooowly getting better over time. We've been having some load issues on the web server (thinman). There were no obvious signs of being DOS'ed or over-spidered, if anything it seemed like apache developed a memory leak. I yum'ed in the latest kernel, rebooted the machine (in case anybody noticed a 5 minute outage earlier today), and it looks okay at this point. Maybe just a simple case of reboot-itis. Just found another potential problem with the radar blanking code. Sigh... (Don't worry - it's not a C++ issue). - Matt 4 Feb 2009 0:35:10 UTC So then. 
We had our weekly outage today. We knew it would be a long one - the result table is bloated for various reasons so it took forever to compress. This may help get past this period of "indigestion" I mentioned in the previous thread, but there's no sign of it getting much better any time soon. Expect continuing network pain. Plus Bob is resync'ing the mysql replica, so that'll be behind a bit in the near term. Quite often we recompile all the back-end servers with code thoroughly tested in beta and switch in these new versions in the public project during the outage. We did so today, and the splitters and assimilators all freaked out upon starting up this afternoon with library linking errors. What a hassle. It seems like our servers are slowly getting more and more out of sync, given some are 32-bit, some are 64-bit, some are running this rev of the OS, some are running that rev, some have this package installed, some don't, etc. and this is apparently becoming a problem. Like we have time to clean this all up. <obnoxious rant> I was having an offline discussion with a friend who insists that C++ is a vast improvement on C, and that C programmers who complain about C++'s major failings are living in the past or "just don't understand." I wouldn't mind the debate except C++ aficionados usually adopt a smug, condescending tone regarding C programmers that reminds me of republicans describing democrats. In any case there was a programming mystery today that ate up a man-hour of my and Jeff's time. If the object in question was just a struct it would have been painfully obvious. Instead the problem was obscured in vague assignment operator behavior. Does anybody have an actual, simple example of C++ code that is (a) easier to debug than analogous C code, (b) required less manpower to generate, and (c) will be forever useful and understood? I'm willing to be convinced, but it hasn't happened yet. 
Maybe it's just a different (and not necessarily better) kind of brain that loves C++, but I tend to think it stemmed from the evil part of our monkey mind that turns a blind eye toward unnecessary complication for everybody in the hope that things may be easier for ourself later on. Or the other evil part of our monkey mind that foists contorted methodology on others as some sort of sick competition (which may be fun but is hardly productive). K&R = 200 pages. Stroustrup = 1000 pages. Is C++ really 500% better that it requires 500% the pages to describe? Nope. Case closed. </obnoxious rant> - Matt 2 Feb 2009 21:54:21 UTC Happy Monday everybody. I guess I should move on from the January thread title theme (odd little towns/places/features in southern Utah which I've been to during many nearly-annual backpacking/hiking adventures in the area - easily one of the best parts of the U.S.). We did almost run out of data files to split (to generate workunits) over the weekend. This was due to (a) awaiting data drives to be shipped up from Arecibo and (b) HPSS (the offsite archival storage) was down for several days last week for an upgrade - so we couldn't download any unanalysed data from there until the weekend. Jeff got that transfer started once HPSS was back up. We also got the data drives, and I'm reading in some now. The Astropulse splitters have been deliberately off for several reasons, including to allow SETI@home to catch up. We also may increase the dispersion measure analysis range which will vastly increase the scientific output of Astropulse while having the beneficial side effect of taking longer to process (and thus helping to reduce our bandwidth constraint woes). However, word on the street is that some optimizations have been uncovered which may speed Astropulse back up again. We shall see how this all plays out. I'm all for optimized code, even if that means bandwidth headaches. Speaking of bandwidth, we seem to be either maxed out or at zero lately. 
This is mostly due to massive indigestion - a couple weeks ago a bug in the scheduler sent out a ton of excess work, largely to CUDA clients. It took forever for these clients to download the workunits but they eventually did, and now the results are coming back en masse. This means the queries/sec rate on mysql went up about 50% on average for the past several days, which in turn caused the database to start paging to the point where queries backed up for hours, hence the traffic dips (and some web site slowness). We all agreed this morning that this would pass eventually and it'll just be slightly painful until it does. Maybe the worst is behind us. - Matt 29 Jan 2009 23:25:26 UTC The replica mysql database on sidious recovered more or less just fine. It may be ever so slightly out of sync with the master database. This means we'll probably rebuild it during the next weekly outage just to be sure. The scheduling server was up and down yesterday afternoon and this morning. The scheduler CGIs have been segfaulting and adding core dumps caused the system to grind to a halt, needing a reboot. Turns out the problem wasn't in the CGI, but in apache itself (or the fastcgi module). This has been a problem in the past. We seem to have to tweak various apache parameters at random times, based on a chaotic, unpredictable equation involving current resources/demands, mysql health, network health, system health, various queue sizes, etc. Simply reducing the MaxClients to a much lower number caused the segfaults to disappear while still servicing all incoming requests. We're running low on data to send out, and we're in a murky period where the weekend is rapidly approaching and we are still awaiting the latest shipment of raw data drives from Arecibo. We could pull up as-yet-unanalysed data from our archives, but the offsite storage archive (HPSS) is undergoing several upgrades and has been offline for days. We'll see how this all pans out... 
- Matt 28 Jan 2009 23:24:18 UTC Last night sidious (mysql replica database server) rebooted itself. Yeah, we did just move this into the closet, so there's non-zero worry that something may have gotten injured in transit, or it's unhappy in its new home. On the flip side, our servers are rebooting themselves from time to time for no apparent reason except maybe high stress. I love all operating systems (this is sarcasm). Anyway, that meant mysql crashed ungracefully and has been recovering all day - however successful this recovery is remains to be seen. It is just the replica, so no big shakes, really. And this afternoon we ran out of work to send out. This was due to our science database getting "brain freeze" which is what I'm calling it these days. If you run a wrongly formatted query the whole engine silently grinds to a halt, effectively blocking all splitter and assimilator access. I found and killed the errant queries and the dam burst. So yet again we're recovering from an unexpected semi-outage this week. Regarding the setisvn server (from last thread)... I'm fully aware of the poor configuration of that virtual domain. Low on my priority list. - Matt 27 Jan 2009 22:40:49 UTC Last night, due to the high traffic I was grousing about yesterday, the workunit storage filled and therefore no new work could be generated, so we ran out of stuff to send to clients. This cleared up on its own this morning, but then we started the regular weekly database maintenance outage, so we'll be in a bit of connectivity pain for a while. During the outage I tested the stability of our secondary science database server (bambi). In other words: will it survive reboot without missing drives? It did. So that project is more or less done, and we'll start focusing on the primary science database server (thumper) next. Even more exciting is that Jeff and I added a couple more servers to the closet today: sidious and casper. 
The latter is a multi-purpose machine used by the tangentially related CASPER project. The former is the replica mysql database. We were happy to finally get it out of our "test lab" and into the closet because it's big, noisy, and there's a chance its particular network hangups will be solved by moving it physically closer to its friends (all talking over one switch, as opposed to traversing at least three). We have only one major server left to move into the closet: vader. This is all good news but we're kind of maxed out on power usage in the closet, and need to do some breaker tests before adding anything else. - Matt 26 Jan 2009 23:17:39 UTC Due to various bugs on the scheduler/client side of things some users have been getting far too much work to do. This results in excess workunit downloads which eats up our bandwidth and makes it generally difficult for anything to happen, then queues start backing up, etc. The scheduler fix has already been employed late last week, a client bug-fix is in the works. I have little to do with the above, and the problems should clear up on their own once traffic settles down. Today has been a catch-up-on-mundane-sys-admin tasks kind of day for me, which is fine once in a while. - Matt 22 Jan 2009 23:34:01 UTC We continue to have problems mounting our raw data drives (which we fill down at Arecibo and drain up here). The symptoms are random, the error messages are random, and where these messages actually appear is random. Jeff and I are pretty much giving up trying to figure it out. We'll most likely remove as many moving parts from the whole system and deal with continuing issues as they arise. Not sure who/what to blame. Linux? SATA? USB? The enclosures? The cables? The drives themselves? I actually got the software radar blanker working. Whether or not the output it generates is worth anything remains to be seen, but at first glance it looks pretty good. 
The proof is when I run this on a whole file and make some workunits, and then see if these workunits explode. - Matt 21 Jan 2009 22:18:50 UTC The secondary science database finally recovered. As we poke and prod at this new configuration we're still finding things we might have done differently, but we're planning to just seal it up and call this project done. Actual gains in speed/performance are to be tested. As many of you regular/avid readers know the last release of the cuda client got a little messed up - people were getting checksum errors meaning the files were corrupted. Bob did the code signing procedure this last time around from his desktop machine which has recently had problems with its memory DIMMs. This is our best, albeit vague and unsatisfying, theory as to why a small subset of files got corrupted when simply copying from one directory to another. Continuing progress on radar blanking and the NTPCkr. Jeff and I are anxious to get these projects done already. - Matt 20 Jan 2009 22:58:04 UTC Welcome back from another long weekend - we had MLK Day off yesterday, and the whole country has been running a little late this morning. Things went mostly well in server land. The astropulse validator was (still) choking on various results so the backlog grew and thus the workunit storage filled up again for a minute there. That means the splitters halted, and we ran low on work to send out for half a day. Other than that, no major events. Today we began the final stages of the secondary science database shuffle. We were a bit disappointed by the results at first, and did some more reconfiguration/testing before learning to not trust the output of iostat so much as the other evidence that shows we may have improved our peak science db throughput by 10x. Well maybe not so much - we'll see - if it's 2x I'll be psyched. More work tomorrow on that (the secondary is still catching up from being offline for 5+ days). 
A followup on a recent story about our Overland Storage servers. I recently mentioned we hit an unexpected 4 TB file system limit on our workunit storage server (gowron). Turns out we actually hit a physical extent limit, and this will be fixed in the latest OS release. This is really just an academic point - we could only grow to 4.25 TB max anyway, given the number of drives. Thanks again to Overland for continued support. - Matt 15 Jan 2009 21:45:17 UTC This morning moved on to the next phase of the bambi RAID shuffle - destroying all current volumes and building a series of RAID1 mirrors in their wake. The initial sync will take until tomorrow. Sigh. We'll continue then. Eric's server ewen (mostly used for studying interstellar hydrogen) crashed this morning. This should be a non-issue except due to various dependencies it hung some of our other servers. Upon restart it was having networking issues thanks to NetworkManager - something we try to uninstall on every system but apparently didn't on ewen. This is a piece of software that comes with linux distributions which, as far as I can tell, exists strictly to create random network problems to keep your workday interesting. In better news, Bob's desktop is working again. The problem was actually a bad internal SATA cable. Or at least things are working since removing it. The ap_validator is still offline, mostly. It restarts every 10 minutes, maybe gets a few results done, then segfaults. The astropulse people (not me) are working on it. I know nothing beyond that. - Matt 15 Jan 2009 0:09:47 UTC Today started the process of reconfiguring the underlying RAID devices on the secondary science database server (bambi). I was able to scrape together enough spare drives within the system to make temporary space so I could shuffle things around. Given the amount of data each shuffle takes a long long time. In fact, we're kinda stuck on this project until tomorrow. Anyway.. 
the database is sitting on three concatenated 6-drive RAID5's. Actually, given the way LVM is handling things it's mostly all on one 6-drive RAID5. Don't ask me why we set it up this way. The plan is to convert these 18 drives into a giant RAID10. More spindles, better striping, etc. and we can take the hit in usable storage. Other than that, and messing around with Bob's desktop (which seems to have gotten a weird case of OS rot), I'm still elbow deep in programming. I hate C++ so very much but I admit the standard template library is helpful once you wrap your brain around it all. - Matt 13 Jan 2009 22:58:50 UTC Typical weekly outage (for database cleanup/backup). During the outage Jeff and I did some more server closet reconfiguration - we consolidated all the Overland Storage stuff (servers gowron and worf, and their combined 16 TB of raw storage) into one rack, along with our router (that connects us to our private ISP separate from campus). This gave us enough room to (finally) add another UPS to the fold - which is good as older ones have been complaining/dying. Our UPS situation is far from optimal, but we're working with what we got. We also (finally) got server clarke into the closet, which will act as a much-needed build/compute server, among other things. Steady progress is being made on both NTPCkr and radar blanking fronts - in fact I should be working on the latter. Tomorrow I may tackle the RAID re-configuration project on our secondary science server, which may vastly reduce i/o and therefore increase NTPCkr throughput. - Matt 12 Jan 2009 23:58:40 UTC A rather quiet weekend, though the astropulse validator seems to have gotten locked up on something. Josh and Eric are looking into that. This morning was a little weird. 
An old UPS we were using as a glorified power strip just up and stopped working, thus removing power to various sundry items in our secondary lab which wouldn't have been a big deal but one of those items was a switch, so sidious and vader (and casper for that matter) disappeared from the network for a short while there. Nobody seemed to notice. In the afternoon Jeff and I plotted some physical server moves for tomorrow's outage. We'll see how much we get done - and as always we take small steps with these big projects. Various cuda-related items were discussed in our server meeting today. A bug that was causing the triplet overflows was found, and the blue screen of death issue with slower nVidia boards is getting a workaround. New client and application releases in the near future should clear some of this up. Back to work - which means plotting lots of radar data for me. - Matt 8 Jan 2009 22:26:13 UTC I actually should be programming all day, but when I dive head first into such activity I have to take frequent breaks to let the CPU in my head cool off as I draw odd diagrams on the dry-erase board to solidify the logic and pseudo-code tumbling around my brain. During these moments of respite I may tend to more enjoyable things, like messing around with the raw data pipeline, or figuring out why, all of a sudden, we're not sending out any work. The last thing was due to a problem we're seeing more and more around here. As we ramp up doing actual science we're hitting the science database with one-off queries that somewhere contain the phrase "order by." This seems to give informix fits when it's busy. Apparently we need to free up, or create, more resources so the db engine has more scratch space to do sorting. Otherwise it jams up in a slow, quiet manner, and nobody notices until we observe side effects - like the traffic graph dropping to zero. So we're looking into that general problem now. 
- Matt 7 Jan 2009 23:56:34 UTC Now it's Wednesday, which usually means my focus should shift towards programming tasks. This actually hasn't happened in a while due to holiday schedules and other crises, but the radar blanking code really needs to be hammered into shape already. See the plans page for more info on that. Lots of mental paging-in of C++ programming trickery. But this morning I was still busy with a bunch of things on my systems task list. Our informix replica server bambi was having fits with exporting/mounting so I had to go through the rigamarole of rebooting the system - which always seems to be the fastest way to fix things when things go awry. I also plugged away moving tons of data around our internal network for eventual filesystem rebuilding, tending to the raw data pipeline, etc. - the stuff I've been talking about for a while. I've been using an old "Solaris 8" software box (coupled with the shell of a long-defunct external SCSI hard drive enclosure) as a stand for my desktop monitor, unaware how over the years the box has been slowly morphing out of square and sinking towards the left thus slanting the screen more and more. That might explain the crick in my neck I've had the past six months. This unergonomic situation was finally pointed out today by fellow SSL sysadmin Robert. Anywho, I now have the monitor sitting onto my shuttle enclosure, and even though it's perfectly level it seems it's slanting to the right. Talk about accommodation - my brain really got used to the old lean. - Matt 7 Jan 2009 0:06:20 UTC It's Tuesday, so that means database maintenance outage - the usual drill. We are recovering from that now. During the downtime I added more space to the workunit storage - actually reaching an unexpected 4 terabyte logical limit on that volume. This isn't a big deal, and we converted the two drives we can't use on this volume into extra spares which are always welcome. 
I also rolled up my sleeves and drew up a brand new power map of the closet which was until now sorely out of date. After we get Dan to measure the current draw directly at the breakers we can start safely adding machines to the closet. Over the holiday break, at least since I last posted anything, there was only one real incident. Our scheduling server went kaput and required a reboot. Dan and Eric actually took care of that as I was happily making a chunk of change playing a New Year's Eve gig at the time. The surprise outage had the benefit of reducing demand on our resources so we could finally drain our back-end queues, and we recovered nicely once everything was back up and running. Jeff found the bug in the validator today that's been causing some confusion when comparing cuda vs. non-cuda processed workunits. He's working on the fallout/cleanup from all that while we're still trying to figure out why some cuda clients are overflowing on certain workunits. By the way, welcome to 2009. I'm only now just getting back into the lab (was out of town between New Year's Day and yesterday). I have hopes of progress regarding UC Berkeley's SETI project in general. - Matt 30 Dec 2008 23:16:29 UTC Yep, we had our usual Tuesday outage. Nothing special, except that the result table is vastly bloated due to the back-end queues being clogged for one reason or another. So the "compression" part of our outage took an extra hour (roughly). So be it. Hopefully the wheels were greased enough to continue letting these drain without much intervention on my or Jeff's part. In any case expect a slightly painful recovery as we continue to catch up. We're also pulling up a bunch more unanalyzed raw data to keep the splitters happy during the long weekend. Other than that today.. 
a lot of planning and preparing for various bigger projects to tackle once the holidays are over and we're all back in the lab - adding yet more workunit storage, reconfiguring database/raw data storage, adding more stuff to the closet, upgrading OSes, retiring older machines, bringing newer ones on line already. That's all well and good, except that Eric, Jeff, and I have three separate higher-priority tasks to tackle before anything else if possible. Those are (a) wrapping up all radar blanking efforts (we still get too many result overflows due to missed and therefore unblanked radar), (b) noise shaping (the noise we're injecting to reduce the effect of the radar is causing predictable and removable but nevertheless messy analysis artifacts), and (c) the NTPCkr (the real-time candidate finder/reporter - so we might have something positive to mention come our 10th year anniversary in May). That's it - the last tech news update (from me at least) for 2008. I'm already looking forward to 2009. Maybe we'll get some or all of the above done. - Matt 29 Dec 2008 23:56:24 UTC One short holiday week is behind us, now here comes another one. We did fairly well over the weekend, considering we were pretty much maxed out the whole time. The assimilator queue finally drained, thanks to splitters starting to chew on raw data files physically located on the new raw data storage server (as opposed to located on the same server as the science database), but also thanks to the validator queue falling behind. In times of low resources we do have some knobs to turn to help squeeze more juice out of our embattled servers. Sometimes you have to roll up your sleeves (or, in this case, pull out a calculator) and determine which processes need which resources, and which are claiming too much. 
After some investigation it was clear this time around we were giving httpd too much - and this is a tunable we have to adjust every so often, depending on how many people are connecting at any given time, and for how long - otherwise you have too many httpd listeners hanging out doing nothing eating up valuable memory/cpu. Anyway, long story short I reduced the number of validators from 6 to 4, moved the validator logs to a different filesystem (to reduce i/o contention), and vastly reduced the number of httpd listeners. So far so good - that queue is draining (and therefore the assimilator queue is inflating again). We will have the usual outage drill tomorrow, followed by another set of "days off." - Matt 24 Dec 2008 21:06:54 UTC We seem to have gotten beyond the current period of high demand and back into a realm of working within our limited resources. Queues are filling or draining in a positive direction, albeit slowly. I did finally write a script to compute how many results passing through our validation queue are CUDA processed - currently roughly 3%. And speaking of that, I am now aware of the CUDA validation problems mentioned in other threads and I passed them along with screenshots, info, etc. to the proper authorities (i.e. Eric and Jeff). At this time of year I do a lot of prep for upcoming server projects without enacting anything too crazy, lest I break anything that's currently working just fine. For example, I'm building more RAID mirror pairs on the workunit storage server, but won't actually add them until the new year. We added enough space yesterday to hold us over until then. I'm also cleaning up the lab, labelling spare parts, placing things in boxes, organizing dozens of O'Reilly books currently stored inefficiently in stacks, etc. We also tend to "store up for the winter" - at some point soon we'll pull up a bunch of data from HPSS to keep splitters happy until the new year. 
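For the curious, the CUDA-share calculation mentioned above boils down to a simple tally. What follows is only a rough sketch of the idea, not the actual script: the record layout and the `platform` field are made up for illustration, and the real numbers come out of the BOINC database rather than an in-memory list.

```python
from collections import Counter


def cuda_share(results):
    """Return the fraction of results whose 'platform' field is 'cuda'.

    'platform' is a hypothetical field name used for illustration only;
    it is not taken from the real result table.
    """
    counts = Counter(r["platform"] for r in results)
    total = sum(counts.values())
    # Guard against an empty validation queue.
    return counts["cuda"] / total if total else 0.0


# Toy data: 3 CUDA results out of 100 passing through validation.
sample = [{"platform": "cuda"}] * 3 + [{"platform": "cpu"}] * 97
print(f"{cuda_share(sample):.1%} of results were CUDA processed")
```
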
Thanks for all the holiday wishes/greetings, and please accept my likewise sentiments. For those thinking I'm going above and beyond the call of duty by working during vacation, don't give me too much credit. My vacation comes later. - Matt 23 Dec 2008 23:00:32 UTC Today we had our weekly outage for mysql database backup, maintenance, etc. This week we are recreating the replica database from scratch using the dump from the master. This is to ensure that the crash last week didn't leave any secret lingering corruption. That's all happening now as I type this and the project is revving back up to speed. Had a conference call with our Overland Storage connections to clean up a couple cosmetic issues with their new beta server. That's been working well and is already half full of raw data. Once the splitters start acting on those files the other raw data storage server will breathe a major sigh of relief. I was also set to (finally) bump up the workunit storage space yesterday using their new expansion unit - but waited until their procedure confirmation today lest I do anything silly and blow away millions of workunit files by accident. The good news is that I increased this storage by almost a terabyte today, with more to come. We have officially broken that dam. I also noticed this morning the high load on bruno (the upload server) may be partially due to an old, old cronjob that checks "last upload" time and alerts us accordingly. This process was mounting the upload directories over NFS and doing long directory listings, etc. which might have been slowing down that filesystem in general from time to time. I cleaned all that up - we'll see if it has any positive effect. Jeff's been hard at work on the NTPCkr. It's actually chewing on the beta database now in test mode. We did find that an "order by" clause in the code was causing the informix database engine to lock out all other queries. 
This may have been the problem we've been experiencing at random over the past months. Maybe informix needs more scratch space to do these sorts, and it locks the database in some kind of internal management panic if it can't find enough. Something to add to the list of "things to address in the new year." - Matt 22 Dec 2008 23:32:27 UTC Okay, well, it's not like we didn't see difficulties coming with the release of a client that could potentially improve our processing by 10x. But it hasn't been all that bad, either. Due to various reasons, mostly excessive i/o, the assimilator queue swelled, which caused the workunit storage to reach maximum capacity, which in turn constrained the splitters. This is still the case, more or less - though I am working to increase the workunit storage which will help break one of our dams. I already employed some of the Overland Storage for raw data images, which will eventually break another dam or two. There's still our network bandwidth limits, though... We're just crossing bridges as we get there. In any case, I did add a new photo album of our server closet for the nerds in our audience. Schedules will be erratic for the holidays, as you can imagine. - Matt 18 Dec 2008 22:41:17 UTC Moving onward and upward. More and more people are switching over to the GPU version of SETI@home and Dave (and others) are tackling bugs/issues as they arise. As predicted we're hitting various bottlenecks. For starters, increased workunit creation (and current general pipeline management since we have full raw data drives that need to be emptied ASAP) has consumed various i/o resources, filled up the workunit storage, etc. On this front I'm getting around to employing some of the new drives donated by Overland Storage. The first RAID1 mirror is syncing up - may take a while before that's done and we can concatenate it to the current array. Might not be usable until next week. 
Also, as many are complaining about on the forums, the upload server is blocked up pretty bad. This is strictly due to our 100Mbit limit, and there's really not much we can do about it at the moment. We're simply going to let this percolate and see if things clear up on their own (they may as I'm about to post this). Given the current state of wildly changing parameters it's not worth our time to fully understand specific issues until we get a better feel for what's going on. Nevertheless, I am working on using server "clarke" to configure/exercise bigger/faster result storage to put on bruno (the struggling upload server) perhaps next week. As for the mysql replica, it did finally finish its garbage cleanup around midnight last night, but then couldn't start the engine because the pid file location was unreachable (?!). Bob restarted the server again, which initiated another round of garbage cleanup. Sigh. That finished this morning, and with the pid file business corrected in the meantime it started up without much ado - it still has 1.5 days of backlogged queries to chew on, though. - Matt 17 Dec 2008 23:50:51 UTC So it's official: you can now run SETI@home on your NVIDIA GPU. Of course they're still working out the kinks, and it has yet to be seen what effects (immediate and long term) this will have on our servers and known bottlenecks. Such things are quite unpredictable, given the dizzying long list of variables. In order to keep our bandwidth from going bonkers due to all the new client downloads, we employ the use of Coral Cache. This is all well and good, except that some ISPs out there firewall http redirects, which means a tiny subset of users cannot download these new clients. This is unfortunate, as we have no choice because we can't handle the new client downloads ourselves. So these few users will suffer a bit until we can remove such caching. 
Our replica server never did recover from the outage yesterday, causing stats of various kinds to be jammed for the past day or so. This morning we found scary log messages and we couldn't even shut mysql down gracefully, so we had to kill the process and reboot the machine. It's been in really slow recovery mode all day. When finished there's a good chance it'll be out of sync from the master and will have to be rebuilt from scratch anyway. Sigh. In the meantime, I'm pointing all queries at the master, which is loading it down a bit and causing us some minor grief (running out of work to send, for example). - Matt 16 Dec 2008 23:43:25 UTC First and foremost, it's snowing outside. This doesn't happen very often around here. So today was an outage day - with one unexpected surprise: a visit from Court, systems administrator extraordinaire here in our lab a couple years back. Nice to see him again and catch up. The standard outage stuff was, well, standard. Allow me to remind our new readers: Weekly we "compress" the mysql databases (which bloat from continual inserts/deletes all week, much like disk fragmentation) and back them up. These databases contain all the user/host/team info, and who is working on which workunits - basically all the generic volunteer computing stuff. The science is all kept in a separate database (using an Informix engine) on a different server altogether. The latter doesn't suffer from the same bloat, so we can do simple no-frills backups to disk while the database is live, without much ado. In theory we could do the mysql dumps live as well, but we choose to take things down to ensure the master/replica databases are in sync, and allow us some regular downtime to take care of pending server tasks. For example... 
Today we finally turned off the old Network Appliance - a NAS server which worked fast and wonderfully, but (a) was only 3 Terabytes raw storage, (b) took up one third of our server closet, and (c) the individual disks have been failing at an increasing rate. We moved all of its functionality elsewhere already, so it was time to say goodbye. Jeff and I tore it apart shelf by shelf. Any sadness was lost in the joy of now having a completely empty rack full of completely useful shelves (we've had ridiculous problems finding racks/shelves that matched in the past). It's kind of funny the most useful part of that system at this point was its racks/shelves. We put all the recently donated Overland Storage servers into this now-empty rack (containing 10 Terabytes worth of storage), as well as anakin (the scheduling server), and there's still room for a lot more stuff. We still have to configure/employ all this new storage, but it's all plugged in and on line at least. Recovery from the outage is usually painful. Today seems a little worse. Part of that is our work-to-send queue is at zero and the splitters are waiting for some space to free up before creating new work. I also think server "bruno" is having result storage issues slowing things down (people are connecting okay, which they can push through the usual traffic jam). We might need to reconfigure/rebuild that RAID array sooner than later. I brought the mini video camera to make a quick video tour of our server closet, but the noise of all the fans is so loud it's basically worthless. I did take some low-quality still photos though - I'll get those up on the web someday. - Matt 16 Dec 2008 0:10:30 UTC Happy Monday, one and all. So let's see... things are progressing in a general positive direction. Our conversion from multi- to single-dimensional indexes on the result table in the BOINC/mysql database seems to have been a success, though I'm still not sure if it's helping all that much just yet. 
In any case, we may continue doing the same on other tables. We might get the whole database, indexes and all, fitting entirely in memory. We don't need to (we're doing just fine with whatever level of paging is currently happening), but it'd still be nice. In any case, at least we proved that we don't need to create extra unwieldy multi-dimensional indexes to do specific merges - mysql 5.x and up will figure out how to do the merges on its own. Jeff and I plan to do some big steps towards moving things in and out of the server closet tomorrow. I'll try real hard to remember to bring a camera. If all goes well we'll at least have (a) more free rack space, (b) more available power, and (c) more workunit storage on-line (one less bottleneck to worry about!). Thanks to those who've been beta testing the cuda version of the SETI@home client. Sorry if I confused people by vaguely mentioning this in my last missive. Once this is formally released I'm sure we're going to exercise new and old bottlenecks, but it will be a huge step in the world of volunteer computing. We may run out of work more often. Depending on your perspective this may be seen as a "good problem." And we did finally get the donation mass e-mail rolling out late last week. I really appreciate the generosity of the SETI@home community, especially in these dark economic times. - Matt 11 Dec 2008 0:21:20 UTC During the wee hours this morning our upload server (bruno) froze. We are still unsure why, but recovery was a comedy of errors. Jeff was already about to power cycle it (having little other choice given the unresponsive console) when I got in around 8am. After rebooting, bruno failed to mount its result storage drives due to some kind of mdadm mismanagement. This forced us into a read-only please-fsck-your-drives mode. The drives, outside of pointless resyncing due to hard power cycle, were fine - they didn't need to be fsck'ed. 
Still, since root (/) was read-only we couldn't edit /etc/fstab to prevent this from happening again upon every reboot. So I tried to get it into a real single user mode to make such an edit - all I wanted to do was comment out that one mount line. However, thus started a series of about 8 consecutive reboots, each taking about five minutes, and all wastes of time due to a typo or an unresponsive kvm. I ultimately gave up and booted from DVD in "rescue mode" where I could finally make the fstab edit. Finally all was well with the mount (which I did on the command line), but then I had all kinds of network errors with the system. More tweaks, more reboots... Long story short this server is being held together with figurative duct tape at the moment. We'll get it all sorted out later. Jeff and I also worked together to get the remaining pieces of the "donation drive" in place, such as it is. I'm sending test e-mails out now, and will probably start sending in earnest on Friday. Please send all questions/comments about our fundraising efforts to the principal investigators (Dan, Dave, Eric). I am simply implementing the technical aspects of this endeavor, though I would like to point out we finally updated the text on the plans page. By the way.. did anybody notice this? - Matt 10 Dec 2008 0:31:47 UTC Tuesday outage day (mysql database backup/maintenance). Today Bob took care of the final step of the "single vs. multi-dimensional indexes" exercise. That is, he dropped all the multi-dimensional indexes on the result table in the main project on the master database and we crossed our fingers. Looks like mysql is neatly, or smartly, parsing queries and merging single indexes as needed just fine. The whole point was to reduce the number of indexes we need, and thus keep a slightly smaller footprint in memory, which in turn helps performance. The raw data pipeline has been a major headache, if only because our hot-swap enclosures have been giving us grief. 
Jeff and I determined one of them is flat out broken, so that reduces our current maximum throughput by half until we get it replaced. This isn't a disaster, as we pretty much never reach half of our maximum throughput anyway, but still a slight inconvenience as we have to more rigorously schedule drive swaps. Gearing up for the donation drive, I discovered our mass mail server lost its DNS entry for some reason. The lab DNS master replaced it, but not before I had turned sendmail on an hour earlier and started my tests, thus causing all kinds of circular bounces that clogged the entire lab's mail queue with literally thousands of e-mails (maybe tens of thousands). It's still draining as I type this. Don't blame me - I didn't remove that DNS entry. We're another step closer to removing that NetApp box. In fact, it's out of the automounter maps, everything on it is sym-linked elsewhere or chmod'ed to 0, and I scoured all the other servers to remove sym-links to it. Part of this project meant resurrecting server "clarke" (donated many months ago) to be a CPU server (or otherwise internal use) as it will soon have room in the closet. It had a stale configuration at this point which needed refreshing. No news on the Overland boxes - though one question was: why not combine them into one big box? Well, we have two separate needs: workunit storage, and raw data storage. The former we already have, and it works great - we just need more room - so we'll plug in one of the new expansions and get that room. The latter we don't really have and would like to keep on separate volumes (as you read the raw data and write out workunits, so you don't want the I/O to compete as it would on shared drives). Also.. part of the deal is we're going to continue helping them beta test their latest OS, which they have on the second head unit they gave us. 
So in a sense we're obliged to have two separate entities - the raw data on the beta test head/expansion and the workunits on the known-reliable head and additional expansion. Other question: form factor - the heads are about 2U and the expansions are about 3U. We have 2 of the former and 3 of the latter now. We'll have room for them eventually. I will update closet photos when we do the next major move (next week, I hope?). - Matt 9 Dec 2008 0:45:19 UTC Happy Monday, folks. Things were sort of okay over the weekend. The replica mysql database got stuck on Sunday - the usual drill - I logged in and quickly restarted it. The science database, however, also choked. This happened on Friday. Jeff's been doing some NTPCkr testing that would have gone all through the weekend except the excess I/O ate up all the informix threads, thus causing the splitters/assimilators to slow down and run out of work to send. Luckily I caught this before bedtime that night and broke that dam. Jeff's looking into why that happened. In good news, Overland Storage (formerly Snap Appliance, or Adaptec), donated 10 Terabytes of NAS storage in the form of a new "head" and two expansion units. One of the expansion units we'll try to get on our current workunit storage server ASAP (so we stop running out of room to split new work), and the other stuff we'll make a new temporary (possibly permanent) raw data reserve so we can do the big shell game and convert all the science database devices from RAID5 to RAID10. Thanks, Overland! - Matt 5 Dec 2008 23:12:26 UTC Happy Friday! I don't really have much to add to the proceedings.. today was a lot like Wednesday when last I was here at the lab. Time spent on more filesystem shell games, compiling/running code, and working with Josh to figure out some weird discrepancies between beta/public Astropulse results. 
I should point out I added a couple more stats to the server status page, those being mysql queries/second, along with the number of seconds the replica is behind the master. Maybe this will help clarify when things go awry, though I know sometimes more information obscures the pertinent stuff. I foresee a couple dams breaking in the very near future, resulting in massive server closet updates/upgrades including, but not limited to: shutting down the incredibly solid (but physically large and logically small) NetApp rack to be replaced by a 3U system with twice the storage, thus making room to (finally) put vader and sidious in the closet, along with several UPSes, and another CPU server, clarke, which has been waiting for too long to be employed. Sometimes these things have to happen serially. Ducks in a row and all that. - Matt 3 Dec 2008 23:24:42 UTC Ah, Wednesday. It's usually today when Jeff and I swap our "focus." Early in the week I'm aimed at hardware/sysadmin and he's deep in software development, and then later in the week we switch. This is an attempt to make sure we both get some programming time as the other person is taking the helm. He's mostly working on the NTPCkr, and me on radar blanking stuff. Both projects are slow going. There are a lot of chores we both manage. Maintaining the raw data pipeline eats up an astonishing amount of time so we swap those duties as well. Simply "walking the beat," chasing down alerts, fixing hung processes and broken services, could easily eat up a whole day every day if we're not careful. Today a huge chunk of time was spent by me moving home accounts off the old server onto the new one (and cleaning up a bunch of old garbage in the process). Also lost an hour with Jeff trying to figure out why his subversion repository was out of sync in such a manner he couldn't check changes in. 
I did get a moment to get the latest version of the software radar blanking signal generator to compile - and I just started a test run. - Matt 2 Dec 2008 23:27:39 UTC Typical Tuesday outage day today (for database maintenance), and currently we're in the midst of smooth recovery from that, more or less. Things sometimes seem weirder on the server status page than they actually are, as the replica database (where we collect the stats) is too far behind the master. Sometime soon I'll add some stats to show this, hopefully thus reducing confusion (and fix the broken XML stuff while I'm at it). Major improvements during the outage: Jeff put in some freshly compiled servers that went into beta last week, Bob rebuilt an index that has been missing on result for some time (used for occasional statistics Eric checks by hand), and I changed data selection priority to match between both Astropulse and Multibeam splitters (so they chew on the same files at the same time - and make it easier to determine who's splitting faster). I've also been busy with other sysadmin-y tasks. Moving accounts around (still), kicking one of our internal diagnostic cronjobs that has been hanging on stale lock files in /var/lib/rpm, data pipeline management (including shipping empty drives to Arecibo), and messing around with FC10. - Matt 1 Dec 2008 21:29:48 UTC Welcome back from the holiday weekend, those who actually had a holiday weekend. Things were more or less calm around here. However thanks to our predictable nemesis autofs some things got a little murky yesterday. The mysql replica lost contact with the master - a regular occurrence - but we didn't get the warnings as mail was hung on a dead mount. Now that the replica has fallen behind (though it is catching up) the stats/server pages are a bit behind as well. This will clean itself up in due time. A few hours perhaps. Otherwise work/data seems to be flowing normally, or normal enough. 
Dave incorporated some new scheduler logic (not sure what offhand) that is being tested in beta, probably rolled out to the public tomorrow. I'm bouncing around between data management, radar blanking code, and OS upgrade projects today. - Matt 26 Nov 2008 21:30:53 UTC Oops. My web configuration changes yesterday afternoon seemed to work at first (I checked the logs, tested it myself, etc.) but something bad got exercised, probably at the next web log rotation (which quickly stops/starts the web server) which then made it impossible for people to see the home page for a couple hours. Instead they got a broken link to our subversion page (an interface to our freely available source code). My bad. I fixed this as soon as I noticed it later in the evening. Later on we had some weird behavior on the scheduling server (anakin) where it ran out of memory due to too many httpd/cgi processes running. It actually recovered on its own around midnight, then got choked up again. Nothing really changed, as far as our configuration nor our executables so we restarted it again this morning with the "ceiling" process limit values lower than before. However I noticed the fastcgi's were growing as they stuck around. A memory leak perhaps? Dave pointed out we have been doing client logging the past couple of weeks (which we usually don't do). Maybe that part of the code contains a leak - he's checking. Maybe that combined with the short period of mysql query logging slowing everything down caused the scheduler fastcgi processes to bloat. Not sure exactly, but we turned client logging off, and I added another flag to the fastcgis to force them to exit from time to time regardless of error just to make sure they don't bloat for too long and eat up RAM. I also finally bit the bullet and figured out our broken/wonky web log rotation system given all the above and fixed all that (I think). 
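To illustrate the "force them to exit from time to time" trick for the fastcgi processes: the idea is that each worker handles a bounded number of requests and then quits, so whatever memory it leaked goes away with it and a fresh process takes over. This is just a toy sketch of that pattern. The names `serve`, `supervise` and `max_requests` are invented for illustration; this is not the scheduler code, nor the actual flag we set.

```python
def serve(requests, handle, max_requests):
    """Handle requests until the input runs dry or the recycle limit is hit.

    Returns how many requests this worker handled before exiting.
    """
    handled = 0
    for req in requests:
        handle(req)
        handled += 1
        if handled >= max_requests:
            break  # exit on purpose, error or no error, so leaks can't pile up
    return handled


def supervise(requests, handle, max_requests):
    """Stand-in for the process manager: start a fresh worker whenever one exits."""
    stream = iter(requests)  # shared stream, so each worker resumes where the last stopped
    lifetimes = []
    while True:
        n = serve(stream, handle, max_requests)
        if n == 0:
            break  # nothing left to serve
        lifetimes.append(n)
    return lifetimes


# Ten requests with a recycle limit of four means three short-lived workers.
print(supervise(range(10), lambda req: None, max_requests=4))
```

The trade-off is a little process startup overhead in exchange for a hard ceiling on how long any one process can bloat.
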
Obviously I didn't get dinged with jury duty this time around, though last night the automated reporting instructions hotline told me to call again today at 11am for further instructions. So I did, but then the service kept saying it was "unavailable at this time." You know, I tried. Anyway.. Happy day of turkey. Actually I think we're having goose this year. Jeff and I will both be around and checking in from time to time (as usual). - Matt 25 Nov 2008 23:35:36 UTC Happy Tuesday. We had the usual outage rigamarole today and should be recovering from that in due time. Right after the backup was finished we restarted mysql with full query logging turned on. We knew this would choke the server a bit, and would just be on temporarily. After about a half hour we had over a million queries in the log, so we brought everything back down and turned logging off. We'll parse this log file, and perhaps others we generate over the next 24 hours, in order to find pesky unoptimized queries, anything that would die if we remove all multidimensional indexes, or queries running far more often than we expected. Also during the outage I moved some big directories around - more NAS shell games. Other than that I've been reconfiguring some more web server stuff (internal use pages) and trying to maximize the raw data pipeline plumbing to get as much work online as possible. It doesn't help that a lot of our raw data drives are showing weird signs of corruption. Don't worry - we do checksums at every important transfer to ensure the data are sound, and the splitters cannot operate on garbage (there are keyword strings occurring regularly throughout the files). Nevertheless, we're having to throw away some files, which is sad. My spider sense tells me this has to do with our general SATA enclosure mounting/unmounting woes. For example, we're finding drives that are 500GB thinking they are 750GB when mounted. 
Was this because a drive previously on that mount point was 750GB and some bookkeeping bits haven't been cleared? I dunno, but I'm sure this isn't good. In a couple hours I get to call a number where an automated voice will tell me if I have to attend jury duty tomorrow or not. I get dragged in for potential jury duty an astonishing amount (pretty much the legal maximum) considering I never actually get selected for trial, and never will. - Matt 25 Nov 2008 0:04:53 UTC Welcome back from the weekend, which was actually relatively painless except for the usual set of automounter issues. We're close to giving up on all that. Today was a day filled with lots of chores - including trying to maximize how much raw data we have on line for splitting over the long weekend. We did have a server hiccup today due to an administrative script corrupting an /etc/passwd file (thanks to aforementioned automounter problems). It's hard to maintain a server if the "root" user disappears from the passwd file. So I had to boot from DVD to fix the corrupt file. Just so happens this was the server I was having BIOS issues with last week, and they happened again! Without my consent it reset the boot drive sequence, causing a little bit of annoyance and grief. Eric and I are thinking there's a dead battery involved. Reminder: this is a "short week" for us, thanks to the turkey day. - Matt 21 Nov 2008 22:29:21 UTC Let's see. Do I actually have any news to report...? Among other things today I've been working on some web site configuration cleanup, the continual chore of raw data pipeline management, and discussing the general Astropulse game plan with Josh. I think when Jeff and Eric are in the lab we'll all figure out what our exact plans are, and what we need to do to enact these plans. Generally I keep myself out of as many loops as possible for my own sanity, but I have to ramp up on Astropulse sooner or later. 
It's no longer a "proof of concept" kind of project handled completely by Eric/Josh. Anyway this is the kind of day where I take care of a small subset of the little things that need fixing - it's been a long week and my brain is unable to deal with big projects, nor do I want to mess around with project critical stuff (especially as I am the only person on the "systems team" physically here at the lab right now and we're heading into the weekend). Oh yeah.. keep in mind we are entering holiday hijinx time. Next week will be "short" (even shorter if I get called into jury duty the day before Thanksgiving). - Matt 20 Nov 2008 0:38:26 UTC Today was a day mostly spent tracking down little problems involving BOINC, Astropulse beta, the Astropulse 5.00 release, the beta SETI@home splitter, raw data pipeline flow, data drives reporting wrong capacities to the OS... Lots of bizarre problem solving. As for Astropulse 5.00 and an "official" statement (which was requested in the last thread) I just have to step back a moment and tell everybody that these threads are for entertainment purposes only and nothing I say should be considered official. I just work here and happen to suffer from hypergraphia. I do understand this is the most dynamic form of news on this site and so I nag the others to add content here and elsewhere. They never do, and I end up looking like the de facto spokesperson. In practice, due to the incredible web of resource dependencies behind the scenes here, I have to keep tabs on pretty much every aspect of the whole BOINC/SETI@home/Astropulse family of projects since each program, server, budget constraint, etc. affects everything else. Jeff has to do the same. Nevertheless we can only keep track of so much, and what I believe is going to happen doesn't always necessarily happen. That said.. from what I know and understand Astropulse queues did drain last week and the new client was released on Friday or Saturday.
The vader choke hampered this a little bit, but shouldn't have affected progress on this front that much. Josh is a little puzzled about current results, or lack thereof. That was part of the problem solving today. I still have no real handle on the current Astropulse plans - just temporarily offering my mysql/BOINC expertise to the "Astropulse committee" (Josh and Eric) and then getting back to work on something else. - Matt 19 Nov 2008 1:59:09 UTC Had the usual outage today (weekly mysql database reorg/backup). I also took this opportunity to do what I mentioned yesterday: the remaining last bits of NAS-box shuffling. This included breaking a (currently unused) RAID5 array, putting in bigger drives, and rebuilding it all as a RAID10. However, I quickly came to find the command line utility doesn't allow me to delete single logical drives - it's all or nothing. Not wanting to destroy the root logical drive, I was forced to go into the RAID BIOS, which meant the server (and the web site) had to be brought down temporarily. Temporarily became a couple hours - after doing the reconfigure the regular BIOS surreptitiously changed the boot drive sequence. This meant the system wasn't booting after that, leading to much confusion and panic (and many long, slow reboots) until I discovered this tricky, pointless switcheroo. Anyway, everything was fine after that and I brought up the new partition and started moving things back to where they are supposed to be. This included the beta download directory, which uncovered a "bookkeeping" error on our part which meant beta downloads of the new client were broken for the past few days. Oops. That should be fixed now. We turned on query logging when bringing the project back up in order to do an inventory and determine any need for more/different indexes. I had to bounce the project again later in the afternoon to turn that logging back off (it eats up too much i/o to just leave on indefinitely).
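For readers curious what a query log inventory like this might look like, here is a minimal sketch. It assumes a simplified log with one SQL statement per line (a real mysql general query log also carries timestamps and connection ids, which this ignores). The function names, the normalization rules, and the sample queries are illustrative assumptions, not the scripts actually used here.

```python
import re
from collections import Counter

def normalize(query: str) -> str:
    """Collapse literals so queries differing only in constants group together."""
    q = query.strip().lower()
    q = re.sub(r"'[^']*'", "?", q)   # quoted strings -> ?
    q = re.sub(r"\b\d+\b", "?", q)   # bare integers -> ?
    q = re.sub(r"\s+", " ", q)       # squeeze whitespace
    return q

def top_queries(lines, n=3):
    """Return the n most frequent normalized query shapes with their counts."""
    counts = Counter(normalize(line) for line in lines if line.strip())
    return counts.most_common(n)

# Made-up sample lines standing in for a logged half hour of traffic.
sample = [
    "SELECT * FROM result WHERE id = 101",
    "SELECT * FROM result WHERE id = 202",
    "UPDATE workunit SET name = 'a' WHERE id = 7",
    "select * from result where id = 303",
]

for shape, count in top_queries(sample):
    print(count, shape)
```

Grouping by query shape rather than exact text is what makes "queries running far more often than we expected" stand out; deciding whether a given shape would still be served by the remaining indexes is a separate, manual step.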
I also spent a lot of time helping the CASPER gang reconfigure their main web server. I'm also supposed to be working on donation drive stuff. Oh well. I'll get to it tomorrow. - Matt 18 Nov 2008 0:12:43 UTC So vader went down again over the weekend. Actually just its ethernet connection went down. We're blaming Network Manager. Anyway, we remotely moved enough services around to get beyond vader missing from the server fold, and got everything working again this morning once back in the lab. I don't have much time to report on all the other mundane details that occupied the rest of my day. Tomorrow is a standard outage day, during which we hope to get a bulk of the remaining NAS-box shuffling done - one more step towards major server closet overhaul. - Matt 14 Nov 2008 21:49:19 UTC Happy Friday. After the Wednesday outage we had some splitter issues - Jeff incorporated new raw data reading logic that changed in our standard internal data handling libraries. This didn't break in tests, but broke in reality. Actually I'm not sure if it actually broke as much as misbehaved. In any case, the splitters tore through all our raw data and called the files "done" so we ran out of work to send for a moment there. I "un-did" these files and Jeff fell back to the old splitter. The project of debugging this is still open as far as I know. In the meantime, the Astropulse splitters are disabled for a reason - Eric and Josh want to fully drain all those workunits before releasing another client. Meanwhile since we still haven't gotten our shipment of the latest data drives we had to pull data up from our archives to ensure we have enough work to send over the weekend. These are raw files that were surplus at the time and therefore unsplit (and "saved for a rainy day," like today). We had some more automounter issues, though they are happening far less frequently than before. I added some alerts so Jeff and I will get more warning when such things go awry on any particular system.
I also cleaned up the server status page some more. Other than that most of my time has been spent on shuffling big bunches of data around like some shell game in preparation for optimizing file systems (probably early next week). This is mostly internal stuff and has little to do with public server performance. - Matt 12 Nov 2008 23:39:34 UTC Let's see.. we had our weekly Tuesday outage today, since yesterday was a holiday. This meant the database had an extra day to get more bloated, which is perhaps why several queues started falling behind. Actually that probably has to do with our workunit storage server filling up again, which causes general backend malaise. So we were low on downloads for a while there before the outage. Good news is that I found one reason why our apaches were randomly failing - kind of stupid, actually - there were two httpd log rotation scripts in occasional competition with each other. I think I cleared that up, but automounter/nfs problems are still creeping in there and wreaking havoc. I also finally employed new file_deleter logic to split the deletes between results and workunits, so they can run specific jobs on more appropriate machines. Hopefully that will help speed up the queue drainage. I also added a tiny bit more logic to the server status page to help make clear which data files are being acted upon by which application. On that front, we were expecting raw data from Arecibo to show up today. It didn't. However, I've been pulling up old raw data files from HPSS for Josh's pulsar testing, and found these haven't been chewed on by Astropulse yet, so I added those to the data queue. So you'll be gettin' Astropulse work soon enough. As for the project getting all of our stuff off the Network Appliance rack (to free up major amounts of space/power in our closet) we continue to make sloooow progress. Today we moved the boincadm account to this new machine, and so far so good - response times are still pretty snappy. Or snappy enough.
The web page server does a lot of random access reading/writing in this directory, so it would be obvious if there were an i/o problem. For what it's worth my schedule is changing a bit over the next month or so, and I'll be here on Fridays instead of Thursdays. This has nothing to do with anybody except those who expect my tech reports on specific days. - Matt 10 Nov 2008 23:24:13 UTC Reminder: Tuesday (tomorrow) is a national holiday. We'll be having our weekly outage on Wednesday. It's been a bit of a rocky weekend. A rocky week, actually - since the last regular Tuesday outage the CPU load on anakin (the scheduling server) has been kinda high. Like around 100. The obvious answer - turning on client logging for debugging on Tuesday - wasn't so obvious at first as we thought we vindicated that and moved on to finding other possible problems which were all red herrings. Eventually we were brought back to client logging and Dave made an optimization fix which we tested in beta this morning, and Jeff applied to the public project around noon. This brought the load back down to 1. I guess the fix worked. Our raw data pipeline management still needs work. Lots of bottlenecks make it impossible to keep a steady, automatic flow of work. In a perfect world it would be simple and serial - data drives sent up here, files brought online and acted upon by both multibeam and astropulse splitters, copied down to archival off-site storage, and then deleted. However each step takes a rather long time (hours per file if not days), and storage is limited, so we have to parallelize as much as possible. One possible effect of this, and one we're seeing now, is that we currently don't have raw data available for astropulse to split. We're loath to copy data back up from the archives unless we really have to. We still might do so, but we are expecting a new shipment of drives directly from Arecibo today or Wednesday so astropulse will at least be fed then.
The bright side is this is now very clear on the server status page now that I made some updates to finally split out database counts/rates and splitter activity per application. There's still more updates to be done, but now it's much easier to tell what's going on between the two. - Matt 5 Nov 2008 21:32:17 UTC At 7:30pm last night the scheduler apache server got hung up - probably from all the election night excitement. These apache servers need a kick fairly often. Unfortunately they die in various ways due to various things, so automating the checking of certain pulses doesn't always help - in fact such things usually make systems more complicated and unpredictable. In the case last night it failed during log rotation which issues an "httpd restart" - this time the head-in process didn't die, so port 80 got locked up. I had to log in and kill the zombie httpds by hand before restarting apache. Not a big deal, though it got missed for a couple hours as it was timed perfectly with the entire country busy watching the news. - Matt 4 Nov 2008 23:26:50 UTC I don't know if you heard but today is Election Day in the U.S.. Luckily I only had to wait in line an hour to cast my vote so the usual weekly maintenance outage wasn't delayed. However, I wanted to reboot jocelyn to pick up a new kernel, and had issues upon shutting down and coming up not unlike those I had with bruno a week or two ago. Namely - the server couldn't find its large storage partition and/or thought it was corrupt. Not sure why but the data storage partition, which is under LVM control, wasn't being activated. I had to go through the rigamarole of booting from CD, commenting out the mount point in /etc/fstab, rebooting, then typing "vgchange -a y" myself to finally see the partition. Then everything was kosher.
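The fstab step in that recovery can be sketched as below. This is only an illustration that works on a string, not on the real /etc/fstab; the device path and the `/data` mount point are hypothetical, and the `vgchange -a y` activation is a manual command that the sketch does not attempt to reproduce.

```python
def comment_out_mount(fstab_text: str, mount_point: str) -> str:
    """Prefix '#' to any active fstab entry whose second field is mount_point."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_active = bool(fields) and not line.lstrip().startswith("#")
        if is_active and len(fields) >= 2 and fields[1] == mount_point:
            out.append("#" + line)   # disable the entry so boot does not try to mount it
        else:
            out.append(line)         # leave everything else untouched
    return "\n".join(out)

# Hypothetical fstab contents; device names and mount points are made up.
example = "/dev/sda1 / ext3 defaults 1 1\n/dev/vg0/data /data xfs defaults 0 0"
print(comment_out_mount(example, "/data"))
```

Once the volume group is active again, the same line would be uncommented so the partition mounts normally at the next boot.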
So far the projects are coming up just fine, though slowed as the restarted database has to flood its memory caches before reaching maximum efficiency - this usually takes around 30-45 minutes, I think. Next Tuesday is Veteran's Day, so we'll probably have the weekly outage on Wednesday. - Matt 3 Nov 2008 23:46:19 UTC Yeesh - another rocky weekend, but nothing out of the ordinary. One download server got a headache, the schedule process felt sick for a while, the workunit storage filled up again thus blocking the splitters... At least we don't have those Astropulse download spikes anymore, but we're still at a loss to exactly explain why bruno is so overloaded - and therefore why the queues can't seem to drain as fast as they used to. Anecdotal evidence shows the mysql database may seem fine on the surface but is about to collapse any second, and all those extra milliseconds it takes to respond are causing bruno's processes to get all gummed up. In any case I put some effort into moving as many of these processes elsewhere. I also asked Dave for a BOINC feature request - a file_deleter command line option where you can state "only delete results" or "only delete workunits" so you can have file_deleters running on more appropriate systems. It's raining here in the Bay Area - and this wet weather is very much welcome given a ridiculously long summer of drought and fire, but rain also means our air conditioner isn't working as efficiently. So we got the server closet temperature to worry about on top of everything else. - Matt 30 Oct 2008 23:00:44 UTC Okay. So the assimilator memory leak wasn't a problem so much as an effect. Its consumption of resources still needs to be addressed, but it was only affecting itself, and being aggravated by the other problems around it.
Poring through logs I confirmed that the network bursts were indeed due to Astropulse downloads - during the "baseline" 2 out of 100 workunit downloads are Astropulse, but during the "burst" 40 out of 100 are Astropulse. The Astropulse workunits are much larger in size than SETI@home workunits, hence the bandwidth consumption. I also confirmed it wasn't a single (or few) clients hitting us at once - connections were randomly distributed over many IP addresses. It finally dawned on me, and now like most things is painfully obvious in hindsight. The SETI@home and Astropulse splitters have separate high water marks. For SETI@home, if we get above 50000 results ready to send, we temporarily halt splitting. For Astropulse, it is still set pretty low at 2500. Every so often a splitter process checks to see the size of the queue and if it should stop. Since there are many SETI@home splitters running at a time, and there is always a delay in transitioning state, thousands of workunits may be generated before the splitters actually realize they are above the high water mark. And then they go to sleep for a while - like an hour or so - until the queue drains enough and they wake up again and get back to work. The thing is, during SETI@home's "sleep until we're needed again" phase the Astropulse splitters continue to run since they haven't reached their high water mark even though it's much lower - those splitters are fewer in number and run slower. Now remember when workunits are created, the transitioners also create respective results to "send." New results are id'ed serially - i.e. they are tagged with a number in the database which increases automatically. So during these periods you'll get an area in database id space rich in Astropulse results. Moving on to the feeder.
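Before the feeder, the splitter behavior just described can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the BOINC splitter code: only the two high water marks (50000 and 2500) come from the post, while the per-tick rates, the nap length, and the function names are made up for illustration.

```python
SETI_HWM = 50000   # from the post: halt SETI@home splitting above this
AP_HWM = 2500      # from the post: Astropulse high water mark

def simulate_ids(ticks=300, seti_rate=980, ap_rate=20, send_rate=500, nap=30):
    """Return result types ('S' or 'A') in creation order, i.e. database id order."""
    created = []                 # one entry per result, index plays the role of the id
    sent = 0                     # number of results already sent from the front
    seti_ready = ap_ready = 0    # ready-to-send counts per application
    asleep_until = 0
    for t in range(ticks):
        # SETI@home splitters: nap for a while once over their high water mark
        if t >= asleep_until:
            if seti_ready > SETI_HWM:
                asleep_until = t + nap
            else:
                created.extend("S" * seti_rate)
                seti_ready += seti_rate
        # Astropulse splitters: slower, with their own separate mark
        if ap_ready <= AP_HWM:
            created.extend("A" * ap_rate)
            ap_ready += ap_rate
        # Oldest results go out first, which here means id order
        for kind in created[sent:sent + send_rate]:
            if kind == "S":
                seti_ready -= 1
            else:
                ap_ready -= 1
        sent = min(len(created), sent + send_rate)
    return created

def ap_fraction_by_window(created, size=1000):
    """Astropulse share in consecutive, non-overlapping windows of ids."""
    full = len(created) // size
    return [created[i * size:(i + 1) * size].count("A") / size for i in range(full)]

fractions = ap_fraction_by_window(simulate_ids())
print("baseline share:", fractions[0], "peak share:", max(fractions))
```

With these made-up rates the first window sits at the baseline mix, while the window written during the SETI@home nap is mostly Astropulse. That id-space cluster is what later reaches clients all at once.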
Since it's stupid regarding application types, it fills its own send queue with the oldest results ready to send regardless of application, and the way mysql works this tends to mean in database id order. Of course with the ready-to-send queue at 50000 or so, we have to send out 50000 results before we finally see the effects of what happened above - many hours, usually. Then suddenly - bam! - 20 times more Astropulse workunits than normal. That arbitrary time delay really confused matters. Anyway, one easy solution is to make the feeder smarter. It does have an "-allapps" flag to send to all applications equally. We were hesitant to use this before due to fear this will give too many shared memory slots to Astropulse - and it may very well cause periods of low work during peak periods as the feeder has half the memory for SETI@home workunits than it did. Nevertheless we turned this on today and it had an immediate, positive smoothing effect. Sweet. Other than that today... some data pipeline scripting, and continuing discussions amongst the gang regarding changing redundancy to zero - trying to wrap our brains around all the current bottlenecks and what will suffer depending on what we do. As it stands now, our servers most likely will not be able to support reducing redundancy all the way to zero *and* keeping up with current workunit demand. So we have to either improve our server i/o or figure out what other knobs to turn. - Matt 29 Oct 2008 22:55:54 UTC Well we haven't really gotten completely around the general problems with our raw data drives being unreadable via our tangled web of SATA enclosures and USB converters, etc. However I did find one thing this morning which helped. Turns out one enclosure just simply stopped working. Long story short, upon very careful inspection I found one of the drive bays had a tiny tiny piece of pink fluff wedged in the SATA power plug. The fluff was from our shipping containers to/from Arecibo. 
Bits of it get torn off from regular use, and it looks like some got stuck on a drive, which then got wedged into the power plug upon insertion into the enclosure. I dug it out, replaced the drives, and they were visible again. At least for now. I do appreciate the "modprobe" suggestion in the last thread, which may help other similar issues. Jeff and I were discussing a lot of stuff today, focused mainly on future planning and needs, i.e. what are our current bottlenecks, how do we fix them, and then what will our new bottlenecks be? We're resurrecting conversation with campus, possibly to have them research the current cost/feasibility of increasing our bandwidth. We're also internally discussing needs regarding a potential move towards less redundancy - which will pretty much double our load if we decide to keep up with demand, and can keep up. As well we were scratching our heads about these semi-regular bandwidth spikes that max out our current bandwidth and wreak general havoc for an hour or so at a time. As far as the last thing I found an important clue today. The assimilator code has a memory leak - it's had the leak for years now, but it's usually not a problem. It eventually reaches a limit, fails, then restarts within a few minutes. Today I found the assimilators have been dying quite often recently, and their failures are perfectly in tandem with upward bumps we see in upload traffic. No surprise, as the assimilators and uploads happen on the same machine (bruno) - so if bloated, resource-consuming assimilators suddenly disappear from the process queue, more resources are suddenly given to uploads. The story goes on from there, but I have to get back to work and will leave the conclusion until tomorrow. You see, I put in an "assimilator killer" cronjob today that runs every two hours to restart the assimilators regularly and prevent them from bloating too much.
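The idea behind such a periodic restart can be sketched as a small selection function. This is a hypothetical illustration, not the actual cronjob: the `Proc` record, the process names, and the optional memory ceiling are assumptions, and only the two-hour interval comes from the post.

```python
from dataclasses import dataclass

@dataclass
class Proc:
    pid: int
    name: str
    rss_mb: float   # resident memory, in MB
    age_s: float    # seconds since the process started

def pick_restarts(procs, name="assimilator", max_age_s=2 * 3600, max_rss_mb=None):
    """Return pids of matching processes that are due for a restart."""
    due = []
    for p in procs:
        if name not in p.name:
            continue                     # only touch the leaky daemon
        too_old = p.age_s >= max_age_s   # the regular, time-based restart
        too_big = max_rss_mb is not None and p.rss_mb >= max_rss_mb
        if too_old or too_big:
            due.append(p.pid)
    return due

# Made-up process table for illustration.
table = [
    Proc(101, "assimilator", rss_mb=900.0, age_s=7300),
    Proc(102, "assimilator", rss_mb=120.0, age_s=600),
    Proc(103, "httpd", rss_mb=50.0, age_s=90000),
]
print(pick_restarts(table))
```

Restarting on a fixed schedule keeps the leak from ever reaching the point where the process fails on its own, so resource use on the shared machine changes in small, predictable steps instead of sudden drops.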
I think observing the effects of that over the next 24 hours will inform what I think about other network problems we've been having... - Matt 28 Oct 2008 22:59:27 UTC Today's outage took a little longer than usual. This had mostly to do with the replica mysql database needing to be reloaded from scratch (since it fell behind over the weekend and would take days to catch up otherwise). Plus there was some more index manipulation, en route to a (slightly) more streamlined mysql database. I also replaced the drive that failed on bambi a week ago. So you can stop worrying about that. Jeff and I spent way too much time fighting with our current raw data pipeline. We get SATA drives up from Arecibo full of data. What happens to this data is a matter of priorities. Do we need to send empty drives back down to Arecibo as soon as possible? Is the splitter data queue low? Is the raw data storage full? Etc. etc. So at any given time we've been either (a) sending data to our offsite archival storage or (b) moving data over to the raw data storage, or (c) both of the above. We're not here 24/7 so to ensure continual data flow we have external SATA drive enclosures on a couple systems. However, due to various annoying mechanical/form factor reasons, very few of our systems can host these enclosures. Also the drives should be swappable (otherwise what's the point?) but we're finding that very frequently a drive is pulled, another is put in to be read, and the OS can't see the new drive until we reboot the system. This has been a problem with the enclosure directly connected to a SATA card, or via a SATA to USB converter. We're trying to automate this whole process, but with the drives/enclosures constantly disappearing for no good reason we're up against a wall on this. - Matt 27 Oct 2008 21:52:50 UTC Bit of a weird weekend. Towards the end of last week we had some science database issues - apparently informix "runs out of threads" and needs to be restarted every so often.
Around this time there were continuing mount problems on various servers. The usual drill. Then I headed to San Diego for a gig (only gone 28 hours) and Jeff went on a backpacking trip. Things were more or less working in our absence, but - as it happens sometimes - sendmail stopped working on bruno. This wouldn't be a tragedy except for the fact that bruno wasn't able to send us the usual complement of alerts. For example: "the mysql replica isn't running!" So we didn't realize the replica was clogged all weekend. The obvious effect of this is our stats pages have flatlined. It's catching up now, but we'll probably just reload it from scratch during the outage tomorrow. We also had more air conditioning problems last night. At least the repairguy returned today with replacement parts in tow. So that's being addressed, but not after Jeff got the alarm at midnight last night and Dan trudged up to the lab to open the closet doors and let things cool off. And the httpd process on bruno, once again, crapped out at random - meaning uploads weren't happening for a short while there. Jeff gave that a swift kick, too. On the bright side, we're discovering ways to tweak NFS which have been vastly improving efficiency/reliability here in the backend. This may help most of the chronic problems like the ones depicted above. - Matt 23 Oct 2008 20:55:56 UTC There's been some problems with the web server lately which are hard to track down. However, this morning I found we were being crawled fairly severely by a googlebot. I thought I took care of that ages ago with a proper robots.txt file! I then realized the bot was scanning all the beta result pages, and I only had "disallow" lines for the main project - so I had to add extra lines for beta-specific web pages. This may help. So we've been getting these frequent, scary, but ultimately harmless kernel warnings on bruno, our upload server. Research by Jeff showed a quick kernel upgrade would fix that. 
We brought the kernel in yesterday and rebooted this morning to pick it up. The new kernel was fine but exercised a set of other mysterious problems, mostly centered on our upload storage partition (which is software RAIDed). Lots of confusing/misleading fsck freakouts, mounting failures, disk label conflicts, etc. but eventually we were able to convince the system everything was okay, but not after a series of long, boring reboots. Speaking of RAID, I still haven't put in the new spare on bambi. It's late enough in the week to not mess around with any hardware, especially after dealing with the above. Plus the particular RAID array in question is now 1 drive away from degradation (no big deal), and 2 drives away from failure. Plus it's a replica of the science database - and the primary is in good shape, and is backed up weekly. So no need to panic - we'll get the drive in there early next week. Speaking of science database, I'm finding our signal tables (pulse, triplet, spike, gaussian) are sufficiently large that informix is automatically guessing that with certain "expensive" queries indexes aren't worth using, and is reverting to sequential scans which take forever. This has to be addressed sooner than later. - Matt 22 Oct 2008 21:00:31 UTC Really busy day for me, but not much on the public facing side of things. Jeff and I are revamping our current backend data pipeline in light of continual hardware and I/O headaches. I'm pulling a bunch of stuff out of the database for Josh so he can do some more "find the known pulsar and see if it looks like RFI" game in Astropulse. I enacted the "no redundancy" policy on beta - we're curious to see how well it works in practice, mostly for the sake of general BOINC testing. I had some updating/programming to do regarding our donations database - stuff that campus requested. Still no signs of the air conditioner being fixed (though it is running cooler in the closet than earlier in the week). 
And we haven't yet replaced the bad drive on bambi (though we have a spare sitting on the shelf). - Matt 21 Oct 2008 22:25:00 UTC Today had the weekly outage for the mysql backup/compression/etc. Bob did some index manipulation on the beta project while we were down - to see if we can perform as well with fewer indexes (now that mysql merges indexes if possible on its own). During the outage one of bambi's 24 drives failed, or at least seemed to. A spare has been pulled in and is rebuilding the array now. The forums were pretty slow yesterday - actually everything was. Queues were filling, storage was maxed out, servers and databases were slowed by all the above, causing all kinds of headaches. However overnight the dams finally broke through and everything more or less cleared up on its own. I like when that happens. About our bandwidth.. We do have *two* 100Mbit connections to the world. First is Hurricane Electric (HE), which is what SETI pays for, and the other is the link supplied by campus which is shared by the entire lab. The HE traffic is strictly result uploads and workunit downloads, with occasional archival transfers to offsite storage. Everything else - most of the archival transfers, the public web sites, etc. go over the very underutilized campus link. So if there are web site connectivity problems, it has nothing to do with a maxed out link - it's probably due to the database server being overloaded, or something else. - Matt 20 Oct 2008 23:09:08 UTC Hello. So the weekend was a bit "noisy" on the network backend. This was mostly due to these network bursts we've been getting. It's still confusing to me why these bursts are happening - every few hours we get a bunch of Astropulse workunit downloads in quick succession that max out our bandwidth and wreak general havoc. And over time our workunit storage server filled up again, so queues are filling up, the splitters can't create work fast enough.
Also the load on our upload server is unbearably high. I'm hoping during the usual weekly outage tomorrow we can give certain servers a rest to help clear the pipes. Until then, practice patience. We also have air conditioning problems in our server closet - apparently the temporary fix from last week is unfixing itself. It's not a disaster, but the temperatures are rising about a degree per day. I hope the facilities people will be checking it out tomorrow. - Matt 16 Oct 2008 16:46:38 UTC Early note as I'm leaving after lunch today. Looks like the translation code on the web broke sometime during yesterday evening. How embarrassing. Code was updated on this site (not by me!) which messed things up. The problem with the translation stuff is that it takes a while to "percolate" - you update the proper .inc files, you look at the web site, it looks fine, so you move onto something else - and don't notice when it breaks 10-15 minutes later. Normally I check in regularly on "off hours" to catch such problems but I was busy last night. Anyway, I don't want to apologize for this, especially as it wasn't my fault, and in fact I fixed it when I got in by falling back to older code. I believe everything else is more or less recovering from various mounting/network/reboot issues yesterday. Hope y'all are getting your workunits for the weekend! - Matt 15 Oct 2008 21:51:36 UTC This morning the other building in the Space Lab "complex" started having network issues on one of their subnets. For various reasons I shan't go into here, we have some servers on that subnet. Since some of these "foreign" servers were recently mounted, this reverberated into all kinds of NFS malaise on most of our local servers, some of which needed rebooting to break various network logjams (and then in one case fsck'ing after rebooting...). It's been that kind of day.
The good news is the mysql master/replica seemed to have survived the OS upgrade yesterday, though not after some confusion about unexpected behavior. - Matt 14 Oct 2008 23:21:23 UTC Well here we are. I just had a long day mostly occupied with upgrading the last server that required a long-overdue OS upgrade. This was our master mysql server. We started the outage early so we could compress/back up the database like we usually do, then allowing enough time in the afternoon for me to install the new OS and configure everything. It seems every time we install a new OS on a server, a completely random set of unexpected hurdles eats up a couple hours. Today was no different. Hurdle 1: This system has two hardware RAID devices, which the OS saw as /dev/sda and /dev/sdb - the former being the root drive, the latter being the data drive. The installer recognized both devices but swapped names - the root drive was /dev/sdb and the data drive was /dev/sda. Fair enough, but I had to be extra careful not to blow the data drive away. This would have been okay, except upon reboot the names were swapped yet again, and grub's device map was pointing to the wrong drive (it doesn't use partition UUIDs). This led to some confusion and having to edit grub config files in rescue mode, etc. Hurdle 2: Actually this happens every time I install an OS, but each time it is slightly different. That is, despite entering the proper network info during the install process, things just don't work right out of the box. This time it took 45 minutes of hair pulling before I gave up, swapped the ethernet jack from eth1 to eth0 (it was working just fine in eth1 before the upgrade) and then, inexplicably, I could see the world using the exact same "broken" configuration on eth0 that I used on eth1. Very annoying. Hurdle 3: I was able to get mysql to start up and see the data, but its master/replica configuration was messed up. I fixed it, but then the replica itself barfed for other reasons.
Problem is it was still lodged on trying to replicate "alter table" commands which we do each week to compress the data. So every time I try to reset values an errant "alter table" seems to run, thus locking the database for 60-70 minutes. Makes debugging/progress very slow. In fact, the replica is still off - I just started the project running entirely on the master. I might get the replica working today. Maybe not. - Matt 13 Oct 2008 21:30:21 UTC Busy day today. Jeff came in and found the server closet air conditioner went dead around 5am. So the entire closet was running pretty hot. Turns out there was another coolant leak (a problem we seem to deal with a lot). At any rate, this was fixed pretty quickly and everything cooled to up to 2 degrees colder than before this weekend. Problems over the weekend. The mysql replica lost its connection - a known, common problem (hopefully will be fixed once the replica is on the same switch as the master db). I discovered that and gave it a kick. Hours later the upload server needed a kick as well. Eric discovered that in the morning and got it working again. We're also fairly pegged at our network limit again, I think thanks to the workunit turnaround time being pretty low (i.e. fast). Plus I have to send extra raw data to our archive over the same link. Oh well. Expect data transfer headaches for the next while. I also am planning for our last OS upgrade tomorrow on jocelyn, the master mysql database server. This means, like when we upgraded bruno, an extra long outage tomorrow. - Matt 9 Oct 2008 20:26:27 UTC Let's see.. We had one of our download servers choke on NFS again, which caused its httpd server to die. I gave both autofs and httpd a swift kick on that machine (vader) and it's back up serving workunits again. Of course that means there's a backlog of clients trying to connect to it, and we'll be dropping various other connections while that queue clears out.
Our mysql research led us to discover we needn't upgrade our current mysql version after all to make use of automatic index merges. We haven't been seeing this logic being employed due to (a) low cardinality of certain indexes and (b) mysql refusing to use multi-dimensional indexes in their merges. Fair enough. We'll just have to change around our current constraints.sql (dropping some 2-dimensional indexes and making new single dimensional ones) and see what sticks. Other than that.. today I've been working on LVM/xfs snapshots and making slow but steady progress on radar blanking testing. - Matt 8 Oct 2008 22:01:36 UTC Some nagging network issues, mostly due to the known liabilities/usual suspects. Very often we are maxing out our 100Mbit private connection to the world, due to peak periods (catchup after an outage, a spate of "fast" workunits, new client downloads added to the usual set of workunit downloads) or sending our raw data to offsite archival storage. This is why download/upload rates are abysmally slow at times - if you can get through at all. One solution would be to increase our bandwidth - we do pay for a 1Gbit connection, but due to campus infrastructure can only use 100Mbit of that. Getting campus to improve this infrastructure is currently prohibitive due to cost (which includes new routers and dragging new cables up a mountainside to our lab), bureaucratic red tape, and the backlog of higher priority networking tasks campus wishes to tackle first. In other words as far as I can tell it ain't never gonna happen. Another solution would be to reduce our result redundancy, as already discussed in recent threads. We also had our science db/raw data storage server choke a bit today - perhaps because of the recent swarm of fast workunits and therefore increased demand on the splitters. We do try to randomize the data to prevent such swarms but you can't win all of the time. 
And our web log rotating script sometimes barfs for one reason or another and fails to restart one server or another. For a moment there both the scheduler and upload server were off - I caught it fairly quickly though and restarted them. To clarify, result uploads and workunit downloads go over the private SETI net, along with scheduler traffic. Web pages and other stuff go over the campus net (it's not that much - only a couple Mbits/sec at peak times). The archival storage (where we copy all our raw data offsite) sometimes goes over the campus net, sometimes over the private SETI net, sometimes both if we need to empty the disks as fast as possible to return to Arecibo. Other than all that.. I fixed the fonts of the status pages, and Jeff elsewhere posted a quick note about NTPCkr progress. - Matt 7 Oct 2008 22:30:02 UTC Had our weekly outage for mysql database backup/compression. Reminder: by "compression" I mean that the rather large tables in the database (notably "workunit" and "result" tables) stay stagnant in size if you go by number of rows. That is, workunits and results are created/deleted at about the same rate. However, when you delete a result you can't reclaim that space in the database again until either (a) a whole page of results is deleted (due to random nature of the project this rarely happens) or (b) we actively do this "compression." Why is this a problem? Well, imagine a city where, once you leave a parking space, nobody can ever park in that spot ever again unless all spaces in that neighborhood are vacated. This would make hunting for parking quite a chore. As time goes on, we see a similar effect on the database I/O. Seems silly that the database has this issue, but consider how many endeavors around the world, commercial or otherwise, require a database as large as ours in which a million rows get deleted and added every day? It's not a common problem, to say the least. At least at our scope. 
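
To put rough numbers on the parking-space analogy above, here is a toy simulation. It is only a sketch of the "a page is reclaimed only when every row on it is gone" rule as described in the post, not a model of mysql's actual storage engine; the page count, rows per page, and delete fraction are made-up values chosen for illustration.

```python
import random


def simulate_fragmentation(num_pages=1000, rows_per_page=50,
                           delete_fraction=0.5, seed=1):
    """Delete random rows, then count how much space is reclaimable
    if a page can only be reused once every row on it is deleted.

    All parameters are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    total_rows = num_pages * rows_per_page

    # live[p] = number of rows still present on page p
    live = [rows_per_page] * num_pages

    # Pick the rows to delete uniformly at random (no row deleted twice).
    deleted = rng.sample(range(total_rows), int(total_rows * delete_fraction))
    for row in deleted:
        live[row // rows_per_page] -= 1

    # Only completely empty pages give their space back.
    empty_pages = sum(1 for n in live if n == 0)
    reclaimable = empty_pages * rows_per_page
    remaining = total_rows - len(deleted)

    return {
        "deleted_rows": len(deleted),
        "reclaimable_rows": reclaimable,
        "stranded_rows": len(deleted) - reclaimable,
        # Pages needed if the table were rewritten compactly
        # (the weekly "compression"), using ceiling division.
        "pages_after_rebuild": -(-remaining // rows_per_page),
    }


stats = simulate_fragmentation()
print(stats)
```

With deletions scattered at random, essentially none of the freed slots sit on a fully vacated page, so the table keeps occupying all of its original pages until it is rebuilt. That matches the post's point that case (a) rarely happens on its own.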
People seem to be experiencing slowness uploading/downloading work. I know why: I've been pumping raw data over our network to our offsite archive (HPSS) over the same network link as the uploads/downloads. Usually we don't, and in fact after the current batch is done (later tonight) I'll archive over the campus network (which is what we usually do). - Matt 6 Oct 2008 23:15:18 UTC Let's see. No real major crises at the moment. We do have these network bursts which are entirely due to Astropulse workunits. Here's what happens, I think: an Astropulse splitter takes a long time to generate a set of workunits, and then dumps them on the "ready to send" pile. These get shipped to the next 256 clients looking for something to do, which in turn causes a sudden demand on our download servers as the average workunit size being requested goes from 375K to 8000K. We'll smooth this out at some point. Lots of systems projects, mostly focused on improving mysql performance (Bob is researching better index usage in newer versions) and improving disk I/O performance (I'm aiming to convert all our RAID5 systems to some form of RAID1). Also lots of software projects, mostly focused on radar blanking (the sooner we clean up the data the better). Unfortunately the needs of the software radar blanker required us to break open working I/O code - Jeff implemented some new logic and we walked through the code together today. Hopefully soon we can get back to the NTPCkr. Thanks for your input about the "zero redundancy" plan. Frankly I'm a bit surprised how many are against it, though the arguments are all sound. As I said we have no immediate need to enact this feature. I still personally think it's worth doing if only for the reduction in power consumption - though I'd feel a lot better if we could buff up the validation methods to ensure we're not getting garbage from wrongly trusted clients. - Matt 2 Oct 2008 21:22:11 UTC Not much to report, really. 
We had a couple blips or brownouts which were minor and easily corrected. Mostly spending my day working on R&D type stuff (mysql replication, radar blanking, etc.) and data pipeline management - this included boxing up freshly reformatted drives to ship to Arecibo. One thing in the works, maybe, is changing the workunit redundancy to effectively zero. There is already the mechanism in BOINC to "trust" hosts that continually return validated work. These hosts are then sent workunits that only they will have to process (not a redundant "wingman"). No validation is required (or actually possible) upon returning the result, and no waiting on others for credit, either. Of course, even trusted hosts will get occasional tests to prove they are still trustworthy. Plus there are quick tests we can do on the backend in lieu of "comparison validation." Other pros for doing this include using half the resources for the same amount of science (hooray!) and potentially getting through our backlog of data twice as fast. The cons are mostly concerns. If we try to keep up with current demand for work we'd have to run twice as many splitters, which is impossible given our current resources (we'd at least need more cpus, more disks, and better disk i/o). Or we could split at today's rate and regularly run out of work, which might upset some people. If we do increase our splitter production rate and burn through our data, we will even more likely run out of work on a regular basis (since we can't pad fresh data with old data if we used up the old data). Just some thoughts for now. We haven't really decided on anything yet. - Matt 1 Oct 2008 21:01:21 UTC Random day. Fixed more stuff on bruno (which got upgraded yesterday), most notably the update_stats process which needed to be recompiled to find newer libraries. Also dealt with lots of internal data pipeline management. And some subversion repository cleanup (in preparation to possibly improve web page translations). 
The big thing is that I finally got some time to reconfigure that one RAID5 system into RAID10 (effectively), and the write rates increased by over 16x. Now we're talking. As we get more disk space to work with, we'll pretty much convert all our RAID5's to something else to help get beyond several backend IO bottlenecks. I know this sounds like we only now just discovered the joys of non-parity-based RAID systems, but - like most things around here - we are always firmly aware of better solutions but lack the resources to enact them. Pretty much all our RAID5 systems were built grudgingly but we needed the extra storage at the time. - Matt 30 Sep 2008 23:28:47 UTC We had an extended outage today (more than the regular 3-4 hour database maintenance outage) to finally upgrade one of our core servers, bruno. Usually the OS upgrades are trivial, however this particular machine required a little extra TLC, due to its functional importance, as well as its unique (but admittedly not that unusual) hardware configuration. Regarding the latter, we basically put off upgrading this system until a modern day OS would automatically support its fibre channel card (as opposed to us having to compile drivers into the kernel, etc... blech...). Anywho... there were no major failures during the long procedure (which included backing everything up, reconfiguring root RAID devices (while trying not to destroy others), then resetting all the network/RAID/apache/etc. services). It still took longer than it should due to a steady stream of minor annoyances (installer crash on first attempt, missing sym links that had to be discovered/recreated, missing packages to be installed, having to recompile every BOINC service due to standard library changes). Doesn't matter - it's done. Or at least done enough - there are still some screws to tighten which I'll tackle later. So, we'll be catching up for a while. If at first you don't connect, let your client try again later. 
- Matt 29 Sep 2008 22:17:27 UTC Quick news for the beginning of the week. We chugged along nicely all weekend, though for server load reasons we were running fewer Astropulse splitters (and thus creating fewer Astropulse workunits) and so they've been "falling behind" SETI@home in the competition for processing power. I changed that this morning. Also we're going to attempt the bruno upgrade again tomorrow. We realized last week we'll need a lot of time to do everything we'd like, so the regular outage will start a bit early and possibly end later. - Matt 24 Sep 2008 21:12:42 UTC Something we've been lagging on is separating the database count totals on the server status page. Currently we're showing "totals" - for example, the "results ready to send" is a sum of both SETI@home and Astropulse results ready to send. For diagnostic purposes, it would be much better to split these into two separate columns. However, this isn't so easy, as such queries become suddenly very expensive if we're adding an additional "where appid = N" conditional (Astropulse and SETI@home are considered two different "applications" in the BOINC realm). I'm talking the difference being a 3 second query versus a 3600 second query. Yup. We've made joint indexes in the past for servers that needed them, but this hasn't been a priority for diagnostic stuff. We also don't really have the memory/resources to keep such extra indexes around. In any case, Bob pointed out that newer versions of mysql are smarter - doing the index joins automatically - so we may push on upgrading mysql sooner rather than later. Today I'm actually lost in mundane bureaucracy land. I also should be working on the new software radar blanking embedder code. Sigh. - Matt 23 Sep 2008 23:17:01 UTC We had the regular database maintenance outage today - no news there and we're recovering from that now. We have several backlogged data pipeline jobs adding much noise to our backend network, so progress is slower than normal. 
We also planned to do some OS upgrading today but were blocked waiting for some backup jobs to finish. The influx of free time led me to do some extensive testing regarding our general bottlenecks as of late. I'll cut to the chase. We can blame RAID5 for pretty much everything. No real shocker there, but I was surprised by the extent of RAID5's lousy performance. In one example, a large file copied from temp space to a directly attached RAID5 partition took two minutes, and the same file copied over NFS to a remote RAID10 device took 6 seconds (file caching had nothing to do with it, in case you're wondering). While some systems handle RAID5 (or RAID4) much better than others, we simply can't afford the performance hit on the writes no matter how fast the parity bits are computed. So why choose RAID5? Well, you get far more raw storage that way. But that's pretty much it as far as I care. Unfortunately in some cases (like our raw data storage buffer on thumper) we need every terabyte we can get. Seems kinda silly what with single terabyte drives readily available to the world, but spindle count is also quite important to us. In any case we have some convertin' to RAID10 ahead of us on several systems and the usual round of careful/paranoid testing. I don't think we have much of a choice in making some of thumper's partitions RAID10 as well, and that'll mean sometime in the future a planned outage of indeterminate length. - Matt 22 Sep 2008 21:19:03 UTC No big disasters over the weekend. However, turns out one of the download servers had its root partitions fill up yesterday due to faulty log rotation behaviour. I'm figuring that's why outbound traffic was spotty for a while. I had to clean that mess up this morning - I think we're out of the woods on that front but the traffic graphs still seem kinda weird to me. I plan to upgrade the OS on server bruno tomorrow, and with that being the "hub" computer for BOINC in a lot of ways, the outage may be longer than usual. 
Hopefully not too long. It is becoming clear that our hopes for the new NAS box we assembled aren't being realized - it's pretty slow. It is also clear that using thumper as both a raw data storage buffer and science database server isn't going to work out for much longer. The I/O on the machine is usually maxed out, and we need a better solution. Not sure exactly what that solution is yet. I'm going to be prioritizing helping to implement the new radar blanking code, as Astropulse is kinda blocked until it's ready. Jeff's been working pretty hard on that, as the program required some changes to core data management routines without breaking currently working software. Once we're over the hump on that he (or we) can turn our attention back to the NTPCkr. - Matt 18 Sep 2008 23:30:22 UTC Just checking in before the weekend. Not much super urgent to report. The mysql replica fell behind again as our alert scripts didn't exactly work as expected. When the replica loses connection to the master the "seconds behind master" diagnostic variable gets set to NULL, which my scripts interpreted as "zero" as in "zero seconds behind master" - which is usually optimal. Ha ha ha. Anyway, it didn't fall that far behind and is catching up now. Otherwise I've been doing some data pipeline scripting updates - for example you may have noticed that the server status page no longer gets cluttered with files that finished "in error" - as mentioned in a previous post these files are finishing fine except for some "raggedness" at the very end. Also some fighting with sendmail, and moving servers around. I moved a rather heavy desktop server downstairs into a new office - while carrying it the weight was enough to keep me distracted from the fact the sharp corner was digging two bleeding holes into my wrist. No big deal - but I showed my wife the wound later and she said it looked like a snake bite, which was amusing as the offending server's name is "snake." 
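
Going back to the replica alert bug mentioned above (NULL being read as "zero seconds behind master"), here is a minimal sketch of a check that treats NULL as its own state. This is not the actual alert script: the function names and the 300 second threshold are hypothetical. The one grounded piece is that mysql reports this value as the `Seconds_Behind_Master` field of `SHOW SLAVE STATUS`, and that the field is NULL when replication is not running.

```python
from typing import Optional


def parse_seconds_behind(raw: str) -> Optional[int]:
    """Turn the text of the Seconds_Behind_Master field into an int,
    or None when mysql reports NULL (replication not running).

    Hypothetical helper; the real script's parsing is not shown in the post.
    """
    text = raw.strip()
    if text == "" or text.upper() == "NULL":
        return None
    return int(text)


def replica_status(seconds_behind: Optional[int],
                   max_lag_seconds: int = 300) -> str:
    """Classify replica health. The threshold is an assumed value."""
    # NULL must be checked first: it means "not replicating at all",
    # which is the opposite of "zero seconds behind".
    if seconds_behind is None:
        return "BROKEN"
    if seconds_behind > max_lag_seconds:
        return "LAGGING"
    return "OK"


for raw in ["0", "4000", "NULL"]:
    print(raw, "->", replica_status(parse_seconds_behind(raw)))
```

The design point is simply that "no value" and "zero" are kept apart all the way through, so a disconnected replica can never pass as a perfectly caught-up one.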
We also walked through Luke's radar blanking code today - he's back to school so he was wrapping it up best he can this week and all our free resources were aimed at making this possible. His program is pretty much doing its job - in fact it's detecting the radar in our data better than the embedded hardware radar blanking signal we currently use! Well, we'll confirm this with more analysis. Thanks for the concerns/tips/suggestions regarding my previous post about the mysterious RAID controller card behaviour. Maybe I'll check jumpers/etc. next week. - Matt 16 Sep 2008 22:25:07 UTC Another week, another database maintenance outage. This one was short but busy. We actually had major upgrade plans for one server but feared this would take all day and lock out the servers so we postponed it until next week which may be less stressful. Eric cleared a bunch of space off the workunit storage so that bottleneck has been alleviated for now, i.e. we have elbow room to create enough workunits to keep up with demand. However this leads us to the first of two mysteries today. You see, he's moving all the beta workunits to our new homemade NAS box (ptolemy). While this move has already been helpful, it's taking forever to complete. Why are the disks pegged at 100% utilization? Lack of spindles? PCI bus traffic? Old/slow controller cards? RAID5 biting us again? We'll either sort that out or eventually give up on this machine as anything more than archival storage. The other mystery has been a known issue for some time, but with the down time we revisited the problem: our secondary science database server, bambi, works great except for the fact that upon reboot there's a random chance one or two (or three) drives simply don't show up on the 3ware controller, causing all kinds of RAID panics/rebuilds. It's never clear why this happens, or when it will happen, and when it does it's not always the same drives that disappear. However, a full power cycle always works. 
The only difference really is that the drives have to spin up on power cycle, but not on reboot. So we've been assuming there are some spin-up settings that need to be tweaked. There's been talk of making bambi the primary database server, so today we looked for those settings. Couldn't find them - nothing in the regular motherboard BIOS, and nothing useful in the 3ware BIOS - and the latter was moot because the drives would have already disappeared according to the 3ware BIOS, so all the spin-up problems are happening before the 3ware is aware. I find nothing about this in any documentation or on the web. It's not a showstopper, we can still use bambi as the backup that it is, but this pretty much means we'll never be able to fully trust bambi as a "main" server. Oh yeah.. other stuff. The mysql replica croaked this morning just before we arrived - a partition on the server filled up. Apparently when upgrading the OS we missed a sym link somewhere. So the replica is resync'ing yet again. Also messing around getting the CUDA development/testing server up and running. - Matt 15 Sep 2008 23:14:16 UTC Happy Monday, everybody. We've been in a holding pattern all weekend, more or less, dealing with the usual constraints (not enough space for workunits, mostly). This morning was weird - something tripped the "stop all daemons" trigger on our back end, so we weren't sending out work for a couple hours until I noticed. Even then restarting everything was blocked by the lack of space again. On the bright side, we've been getting this homemade NAS box up (for use as general backup of stuff we don't want to waste time/money backing up to tape, as well as administrative stuff, home accounts, etc.). So far so good, and there's a lot of extra space on it to move the less-active beta downloads there thus freeing up space to make SETI@home/Astropulse workunits to keep up with demand. Woo-hoo! That'll break the dam, at least temporarily. 
We're still looking for a cleaner long term solution - several things are in the works on that front. Other than that, spent a lot of today in meetings, installing high-end graphics cards (for CUDA development/testing), and writing scripts to kick the replica mysql database when it lags behind for no good reason. - Matt 11 Sep 2008 22:08:04 UTC So we hit that brick wall again with the science database - that is, when we try to create a new index it works fine on the primary server but then clogs up sending the new index pages to the secondary. This clog locks up the database, the splitters grind to a halt, the assimilators grind to a halt, i.e. fun for everybody! We thought we were out of the woods yesterday afternoon but checking in at 1am last night (this morning?) I saw this all happening again, so I gave things a swift kick and went to bed. This morning, once we were all here at the lab, we decided to just bite the bullet this time and shut down all the splitters/assimilators and let the clog work through naturally on its own, which it did. We also took the down time to do an "update statistics" on one signal table (this helps re-sort current indexes for speedier lookups) and add disk space for said indexes. I just turned things back on, we'll be catching up for a while, etc. I did do some qlogic card testing today which got us over my "information gathering and training" hurdle so we can upgrade the remaining two servers with old OS's in the coming weeks. We also got our homemade NAS configured so that we may get the old NetApp rack out of the closet maybe next week. It's still working quite reliably, but it's taking up a third of our closet space, a seventh of our power, but delivering only 2 TB of raw disk space. Not really efficient, and we have a *lot* of servers waiting to get into the closet already. - Matt 9 Sep 2008 22:36:13 UTC Tuesday means down time. 
Same drill that happens every week: projects go down for a few hours, mysql databases are washed, dried, and neatly folded, and then we're back on line sometime in the afternoon (Pacific Time). Some people don't like the scheduling of these outages, but as it happens NERSC (where we archive all our raw data off site) has their weekly maintenance outage at the exact same time. Something about Tuesday morning that makes it particularly good for maintenance downtime: it's not Monday, when we're catching up on weekend issues, but it's still early enough in the week to recover from potential problems should any arise. We tackled several other projects during the outage, as we always try to do. We upgraded the OS on sidious (mysql replica db server), which was long overdue. There's lots of configuration involved, but with extra care the software RAID partitions containing the database survived the ordeal. We also tested some 750GB drives in one storage server - we're still trying to figure out what we have and what we can use given our current storage needs (for workunits, results, or less interesting but equally important things kept on the NAS box which will soon disappear). I also finished getting a new desktop installed - replacing the old clunker which had been our "mass mail" server (for reminder e-mails and such). I'll wait until the current smoke has cleared before telling people to "please come back." There are always other work items too confusing to mention here. In fact I avoid a lot of happenings/details in these glib tech news posts as it will only raise more questions which I don't have the time to answer. 
Sometimes I'm cagey with my responses for political reasons - occasionally we have commercial vendors/anonymous donators/grant administrators involved in our decision making processes, occasionally I don't want to perpetuate the false impression I call the shots around here (I just work here - and post a lot because I happen to suffer from hypergraphia). I understand this vagueness is to the detriment of those who have a generally good understanding of the big picture and are keen to guess what our motivations and needs are, but without key bits of information people sometimes end up being a tad off base. Nevertheless it is amazing to me how much people glean from the scant amount of public relations material we barely manage to squeak out. - Matt 8 Sep 2008 23:02:11 UTC The triplet table in the science database has been a headache for over a week now. We've been trying to add some indexes to it, but this has been mysteriously filling up some kind of logical space (not physical space) such that new triplets couldn't be inserted. This has also been adversely affecting the science database replica. For now we're giving up on the indexes and letting triplet insertions continue, and allowing the replica to recover. Internal discussions continued today regarding what to do next as far as general storage. As mentioned often recently, we're low on workunit storage - the crux of most of our recent public server problems. We just got some disks in the mail today which were slated for our new home-made NAS box, but we might instead aim these at workunit storage somehow. Testing will commence tomorrow during the outage, as will several other server-related tests/upgrades. To clear up some confusion: a lot of raw data files depicted on the server status page are showing errors. This is somewhat misleading as these errors all happen at the very end of the particular file/channel. So it's not like we're losing half our data. Only about one tenth of a percent. 
What are the errors? At the very very end of the raw data files, some channels are missing the radar blanking signal, so it's impossible to remove the RFI. These channels exit in error, though there's nothing we can do about it. We have taken steps to try to reduce the number of files that exit this way. - Matt 4 Sep 2008 19:48:30 UTC The good news is that recent woes due to lack of workunit disk space have seemingly passed for now. We're still on the very edge of our capacity, but now that we're prioritizing the smaller regular workunits (as opposed to the big Astropulse workunits) we were able to build up a ready-to-send queue and network traffic stabilized overnight. The less-good news is that we still need to build some indexes on the science database. We're building one now, and it usually takes 12-24 hours. This adds a lot of CPU and disk I/O to the science database server, meaning the splitters can't add rows as fast, nor can the assimilators. So the ready-to-send queue drops, and the assimilator queue rises. As an added bonus, when the assimilator queue rises, that means the deleters slow down, which means the available workunit disk space reduces, and we're back to square one again. No big deal as long as people are patient. All the backend services are doing the best they can until the index build finishes, and then we should catch up again. - Matt 2 Sep 2008 22:16:36 UTC Currently as I write this we're recovering from the weekly outage (during which we take care of database backups and other sundry server details). It may take a while... This past Friday we overloaded our science database trying to create a new index. A database engine restart solved the problem, but not before choking the whole local network. As mentioned in many posts past, we're strangely sensitive to heavy network bandwidth (I think due to linux's imperfect handling of NFS dropouts), and such periods cause random unexpected events. 
This time, for example, the bottleneck from the primary science database server ultimately caused the BOINC/mysql replica server to disconnect from the master. So the replica fell behind all weekend. Sigh. Instead of actually letting it catch up we're just re-mirroring it from the master as we just backed it up this morning. Meanwhile, we're out of space again on the workunit server, and with no fast/easy way to add space. Eric's playing with the splitter mix to reduce the number of Astropulse workunits being generated (they are much larger than SETI@home workunits). Maybe that will help, but not immediately. This is what's mostly causing our headaches today as we can't create enough work to keep up with demand. - Matt 28 Aug 2008 22:51:58 UTC We have a lot of servers in play around here, and once in a while an operating system on one particular server falls far enough behind in spec that the best move is to do a clean reinstall of the latest OS version from DVD (as opposed to trying to do 3 or 4 separate upgrades over the net, one revision at a time). Such was the case with vader, and I bit the bullet yesterday and tackled that project. It mostly acts as a compute server and a redundant download server, so it wasn't really missed for the 24 hours it was offline. Only one annoying snag: we have a lot of systems already running this OS, but this was the first 64-bit clean install from DVD, and turns out there's a package dependency bug that caused the install to crash until I figured out the offending package and left it off the list. This morning I wrapped up work and it's back online. That's good, but I still have a few more servers needing similar upgrades. This summer we have a volunteer undergrad, Luke, working on radar blanking code. Background: our multibeam data is inundated with military radar noise of semi-predictable rate and frequency. 
Such data collected since early 2008 has a "blanking signal" embedded by Arecibo within the raw data, so we can easily tell when the radar is on or off and we can ignore the loud noise. What Luke's working on is a program that analyzes pre-2008 data to retroactively find the radar noise and recreate a similar "blanking signal" so we can clean it up. We (me, Jeff, Eric, and Luke) had a code walkthrough yesterday. So far, so good. In the process of making this program Luke also found phase issues, even with the Arecibo blanking signal, which is probably why we still get overflow workunits from time to time. So there's still a little work to be done. When we have an observatory on the dark side of the moon, this won't be a problem. Don't see that happening anytime soon, though... Still messing around with this new/old NAS system. It's becoming a real time sink. Lots of waiting through long reboots, then trying to figure out why X or Y isn't working as expected. I don't come into the lab on Fridays, and Monday is a national holiday. So signing off for a few days... - Matt 26 Aug 2008 22:53:45 UTC Ah, yes - here we go again - the regular Tuesday outage for mysql database backup/compression and other tasks better suited to happen during "quiescent" time. For example, this week we replaced the failed drive in the workunit storage server with a new drive. That was painless. We also spent a bunch of time experimenting with the new-ish RAID server. I say "new-ish" as it's new to us, but it is an old system. For example, it can't handle logical volumes greater than 2TB. Today, however, we confirmed it can handle (a) physical single drives at least 750GB in size, and (b) physical volumes greater than 2TB (i.e. putting three 750GB drives together to make a 1.5TB RAID5). We also tested that this system is keeping up pretty well doing a continual backup of our upload directory. 
That is, we're doing a constant rsync with the upload directory to keep a "hot backup" around on a separate system. We didn't have the bandwidth/storage capacity to do this ourselves before (and daily backups to tape were too expensive). Anyway.. the extended length of the outage today was mostly due to revamping the way we're doing the backups. We're working to include better query blocking (to ensure the database is totally update-free) and figure out the best way to maximize our time, thus ultimately shortening these outages. - Matt 25 Aug 2008 22:56:00 UTC I've been out for a couple weeks. I really need to get the others around here to chime in while I'm away, but it's hard to convince people who aren't as hypergraphic as I. Anyway, it seems like whatever happened most everybody survived. Another problem: what I end up blathering on about in these posts is hardly comprehensive, and given arbitrary priority based on whatever is on my mind at the given time. This can be confusing, I imagine. I might also just go ahead and start only posting here when I really need to (during *real* server issues) and post less important day-to-day type things in the blog. We'll see how that goes. It might help keeping specific issues contained to one meaningful thread. In any case, a brief rundown of the past two weeks: A drive failed on the workunit storage server. Usual drill there except it hung after the failure, however once rebooted it recovered just fine using a spare drive. Outside of that were more minor issues (another server hung requiring reboot, the mysql replica stopped for no apparent reason and took a few days to catch up, etc...) causing various queues to drain or fill too fast, bottlenecks were exercised, and we had a couple temporary complete/partial public server outages... all told nothing out of the ordinary. 
We are still running a bit "hot" due to the Astropulse release - by "hot" I mean we're using far more storage/network resources than we'd like, but we're otherwise okay. Going back to catching up from the absence... - Matt 7 Aug 2008 22:11:38 UTC Towards the end of the afternoon yesterday we put in a new scheduler to fix a bug with "anonymous platforms" and the way they handle Astropulse workunits. This is working fine as far as I know, but at first there were some brief issues with uploads in general (human error when installing the new scheduler). Today we got our new NAS machine into the closet. We're close to removing the old NetApp filer, which still works great after so many years, but the drives are too small, we can't afford support on this system, and buying new replacement drives is prohibitively expensive. Plus the thing is just physically huge - a whole rack taking up a third of our closet for only 3 TB raw space. We're replacing it with a 3U system that will ultimately have about 7 TB raw space. Getting that into the closet meant I was able to fire up another server-to-be today in our prep lab and get that configured. Traffic-wise we're still trying to get a feel for our demand and our bottlenecks. Eric wrote a script that is busy deleting antique workunits/results that exist on disk but not in the database (not sure why the antique deleter built into BOINC isn't working...). This will clear up additional much needed room but this is pretty much all we can do short of getting a whole new workunit storage server. Looks like web code was updated just now, breaking a thing or two. I think Dave's addressing that stuff. I've been mostly catching up on several behind-the-scenes programming projects today. - Matt 6 Aug 2008 21:11:48 UTC Generally speaking, the wealth of issues we've been experiencing were simply due to Astropulse adding about 10-20 more Mbits/sec to our general average. 
This was a little higher than we expected, hence the initial air of mystery, but still quite within our abilities given current infrastructure. This traffic might go down a bit once everybody requesting their first Astropulse workunit gets their single copy of the Astropulse client. So this explains the big rush once we released the first workunits and the longer "catching up" period, especially given the fact we were constrained all weekend due to lack of workunit storage space. Today I've been mostly working on build scripts and testing recent database code fixes. Getting back on the "development" train for a bit... We are also close to getting that new home-grown NAS into production. - Matt 5 Aug 2008 23:15:08 UTC Today was another one of them "outage days" where we shut everything down to do basic weekly maintenance (database backup and whatnot). We had a particularly large task list this time around. A lot of it was fairly mundane - like moving/compressing files to make more room on various storage systems. The sidious crash the other day did in fact break the mysql replica again. No big deal, but that meant recreating the database from the master - a seemingly weekly occurrence. It's easy to do, just adds extra time to the whole operation. Also, we tried to fix that broken index on the science database. We found the corruption was actually not on the RAID system we thought (the one that required a drive replacement). Huh. Anyway.. the index repair on the whole table was taking too long. We might just go ahead and drop/rebuild the specific index later now that we are more sure what's what. We brought all our backend services (feeder, transitioner, validator, etc.) up to spec on current BOINC code for the first time in a long time, so we carefully turned these on one at a time to observe the logs/results and make sure nothing got all screwy with the updated code. So we're back up, more or less. The current mystery is why we are using so much bandwidth. 
Too many factors at play to make a clear determination - lots of known network bottlenecks, lots of database bottlenecks, unknown Astropulse behavior, etc. We'll give this a closer look tomorrow after (hopefully) some of the traffic jams disappear. - Matt 4 Aug 2008 21:37:18 UTC Another wacky weekend for us. Astropulse is still ramping up - we're creating work, sending it out, receiving results back and assimilating them. However the validator stopped granting credit for these workunits - something we'll fix and we can also retroactively give people their credit. The workunit storage server ran low on room again, the bottleneck that's been giving everybody headaches over the weekend as the splitters could only create work as fast as workunits got deleted off disk. Right now things are generally running slow as I'm moving stuff off the workunit server to make room causing lots of excess internal i/o. As an added bonus the mysql database replica server crashed this morning - it ran out of memory. No harm done, but it looks like it'll take a while to catch up again (it's been lagging behind all weekend). I would like to try to split the numbers on the status page between the two different applications (SETI@home/Astropulse) but those extra "where" clauses make the queries run forever. In better news, looks like we got our new home-grown NAS/RAID box working as we'd like it, so we may start employing that sooner than later (thus freeing up lots of room/power in our server closet). Also all drive issues on our science database server over the past couple of weeks have been completely dealt with at this point. Well.. there's one lingering corrupted index which we'll try to rebuild tomorrow during the outage. I was actually out of the loop since Thursday as I went up to Seattle to play a gig on the main stage at the Microsoft Techready conference at Bell Harbor. Anybody around here attend that thing? 
Fun show/event, but the stage tent was completely inadequate and the entire band got soaked by rain and sea mist. I'm amazed none of us were electrocuted. - Matt 30 Jul 2008 20:10:28 UTC Looks like we're pretty much out of the woods regarding recent issues. Plus the stats dumps are working again (for the first time in days) so there was an artificially inflated bump in BOINC world-wide productivity for a moment there. Following on with the science database server stuff. I continue to play the RAID "shell game" to get the root filesystems back on the actual root drives (just for our own sanity, mostly). I also still have to drop/rebuild that one index which gave us trouble a couple weeks ago (apparently "checking" the index didn't fix it) - all very minor issues. Regarding our experience with drive failures... We see the obvious stuff - drives fail either (a) immediately, (b) after 2-4 years, or (c) never ever. I remind people that our original SETI@home data recorder contained drives that were already heavily used for about 5-6 years when we installed them down at Arecibo in 1998, and then they were reading/writing successfully until a couple years ago. They would still probably be working but we have since switched to the newer multibeam data recorder system. Anyway, we don't have enough data to prove that high temps or heavy loads kill drives faster. My gut feeling is they don't as much as you think. My gut feeling is also that more than half our "failures" are bogus - for example, we had a lot of fibre channel errors, or RAID card bugs, or smartd being oversensitive making it seem like perfectly good drives were unhappy. Many times we just remove and re-add the "broken" drive and it works just fine. In the current case we believe the drive replacement was necessary. Regarding linux OS re-installs... We've been using Fedora for a while now. 
Each OS rev has about 18 months of support, and we like to keep up to date for various compatibility/security/bug-fix reasons. It's easy to "yum upgrade" to the next OS rev, but after doing this a couple times you find configuration files get out of whack, and your system is littered with "rpmnew" files. Package conflicts arise. Plus every few years you learn enough that you might want to rethink your file systems/adjust partition sizes, etc. So a fresh install is more just "spring cleaning" than anything else. - Matt 29 Jul 2008 23:13:57 UTC Today we had our usual Tuesday outage which was a bit longer than usual as we had extra things to take care of (outside of the usual BOINC database table compression and backup to disk). I failed to mention yesterday (though many have noticed) that db_dump hasn't been working for days, which means our stats have flatlined all weekend. This was because our mysql replica failed (we run these expensive stats lookups on the replica so they don't affect the more important updates running on the master). So part of the outage today was to rebuild this replica from scratch via the dump from the master. It was easy - we do this regularly anyway - just takes a long time. Also, Jeff and I replaced a failed drive on thumper (the science database server). There are 48 drives on the thing so disk failures are common, and we get Sun support on this important system. We ask for a drive, they send one, we put it in and ship the old one back. Easy as pie. Unfortunately the software RAID on this system made some bogus complaints upon restart (unrelated to the device that required the new drive). I'm not sure why mdadm gets confused - for example I converted a couple spare drives to a new RAID device, which works fine, but upon reboot (many months later) mdadm freaks out that those spares are missing. Anyway, this was mostly harmless, and another warning we really need a fresh OS install on this system sooner than later (that'll be scary). 
We're running full bore now. It'll take a while to catch up, and we may temporarily run out of work again (still not a comfortable amount of free disk space on the workunit storage). But it'll all clear up eventually. - Matt 28 Jul 2008 21:27:00 UTC Wow. What a weird weekend. A lot of little minor things went wrong causing a bunch of "perfect storms" in succession. I have a technical term for this which I can't say in public. Anyway, I'll spell some of it out in no particular order and in varying amounts of detail. Our workunit storage server filled up again. We got the warnings too late, as mounting problems were keeping the server status scripts from running, which obscured a rather large assimilator queue backlog. When results stay on disk waiting to be assimilated, so does their respective workunit. Plus with Astropulse ramping up those giant workunits were filling up the storage faster than usual. Eric did already put in code for the splitter (which generates the workunits) to check for a full disk before attempting to write anything. Of course, this fix was only deployed in beta so far. The result, there are about 20000 workunits of zero length, which will cause annoying errors for all clients trying to download them, but they should pass through like kidney stones before too long. For a while I stopped the splitters to reduce the disk usage. Today we put the updated splitter in the main project. We've been having general scheduler problems over the last week as BOINC code updates were made in preparation for Astropulse. We haven't built a new scheduler process in a while which brought to light several problems, mostly due to our database schema being outdated and therefore out of sync with what the code expected. This didn't cause any data corruption, but caused random hosts to be unable to connect. 
For no real good reason a lot of hosts reporting problems were Macs which added to the difficulty of diagnosis - we thought it was an architecture dependent issue at first. In any case, we got beyond understanding those problems late last week and planned to clean it all up early this week. There was some miscommunication and the new "broken" scheduler was turned on again last Friday for about a day. On Sunday our bandwidth dropped to zero. At this point we threw up our hands and figured we'd sort this out when we're all in the lab together on Monday (today). Remember we do have a policy that it is perfectly okay for our project to be down for a day or two as this is BOINC and people can crunch on other projects in the meantime. Nevertheless, we don't want to be too cavalier about that as we know a lot of people just crunch SETI data. But still, given our meager resources our average uptime is quite good, so a day or two of occasional downtime is acceptable. But I digress... Turns out apache was the problem on this server (once again a problem obscured by alerts not running due to mounting issues) and we had to kick it a couple times (including a full system reboot due to messed up shared memory segments) to get it going again. Once going, both download servers choked. So I had to kick both of them as well. Then we ran out of work. Remember how I said we put a fix in the splitter to keep from writing if the workunit storage server was full? Well, it was being extra cautious and not writing if it said the storage server was over 90% full. So as I write this paragraph we're low on work to send out, but Eric gave me permission to turn file deletion on in beta so that'll clear up space soon enough and we'll generate fresh work. And oh yeah.. we were slashdotted again on Sunday. That's enough for today. We'll have the usual outage tomorrow (may be slightly longer than normal) and maybe start splitting some more Astropulse workunits to send out! 
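For readers curious what a "don't write if the storage is too full" guard like the splitter fix might look like, here is a minimal Python sketch. It is not the actual splitter code (which isn't shown in these posts); the function names, the `storage_root` argument, and the way the 90% cutoff is expressed are all assumptions made for illustration.

```python
import shutil


def storage_has_room(path, max_used_fraction=0.90):
    """Return True if the filesystem holding `path` is below the usage cutoff.

    `max_used_fraction` mirrors the 90% figure mentioned above; the real
    splitter's threshold logic is not known and is assumed here.
    """
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    return used_fraction < max_used_fraction


def write_workunit(path, data, storage_root, max_used_fraction=0.90):
    """Write `data` to `path` only if the storage has room.

    Returns False without creating any file when the storage is too full,
    which avoids leaving zero-length workunits behind.
    """
    if not storage_has_room(storage_root, max_used_fraction):
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True
```

The trade-off is the one described above: a conservative cutoff protects against zero-length files, but it also stops new work from being generated well before the disk is truly full.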
- Matt 24 Jul 2008 21:35:24 UTC Astropulse release progress has been slowed by various things. Some necessary updates were made to the generic BOINC scheduler which we then employed on Monday. After that we found several weird problems including computers being refused work because their hardware was wrongly deemed inappropriate. At first this seemed like a "Mac only" problem but as far as I could tell some Macs were still able to get work. In any case, we ultimately fell back to the "old" scheduler this morning. This improved things according to some rough, immediate analysis. It is still unclear the complete set of scheduler problems, their causes, and their solutions. We'll chip away at that as Dave works his way through a large e-mail backlog. Yesterday Dave, Jeff, and I had a "work stoppage" and went for a hardcore hike in the Desolation Wilderness (near Lake Tahoe) - something we've been talking about doing for way too long, as we are all avid hikers. We were joined by my wife and Daniel, a visiting BOINC developer from Spain. Since this is technical news, the technical details are thus: We took the Twin Bridges trailhead (at 6200') up to and beyond Horsetail Falls. This included some surprisingly dangerous boulder scrambling which sapped more energy than originally expected. Our plan to bag Ralston Peak (9200') was reduced to basic exploration up to (and ultimately into) Lake of the Woods (over 8000'). The boulder scrambling downward was even worse, but all knees/ankles survived intact. All told, about 7-8 miles of hiking/scrambling, almost 2000 vertical feet gained and lost, taking about 8 hours including lengthy breaks. I felt poorly acclimated, even though I easily conquered a similar hike in Yosemite (up to the top of Nevada Falls and back) six days earlier. Dave was acclimated but started the hike a bit exhausted as he did about 800 feet of rock climbing in upper Yosemite the previous day. 
- Matt 22 Jul 2008 21:16:02 UTC Yesterday afternoon we installed a new scheduler which included some updates necessary for the upcoming Astropulse rollout. However, our network performance took an immediate hit. After about 10 minutes trying to figure out what was causing this Jeff and I realized our scheduler switch perfectly coincided with several expensive credit-analysis queries Eric was running, also in regards to the Astropulse rollout. So it wasn't the scheduler - just the database getting overloaded. That got cleared up quickly. Last night I noticed people complaining about Mac computers being denied work. This is still an issue, probably with the new scheduler implementation, and we'll address it shortly. We had the regular weekly outage today during which I tackled some extra things. First off, due to continuing mysql database performance issues we completely dropped the credited_job table (before we just dropped the indexes). Reminder: this is the table that connects user ids in the mysql database to result ids in the science database, so we know who did what. This is also the only table in the mysql database that grows without bounds, and therefore has been the cause of many headaches as of late. Don't worry - we have all this data archived in three formats in three different locations, and will continue to collect this data in flat file format. I also checked the integrity of the database filesystem now that it was cleaner. No problems there. I started up the projects and mysql is currently handling well over 2000 queries/sec without breaking a sweat. - Matt 21 Jul 2008 18:49:42 UTC I was out of the lab since last Wednesday hence the dearth of tech news reports. Though not all that much to report. 
We had a couple of the usual/typical blips that required minor maintenance, most notably the db_purge process (the thing that keeps the result/workunit tables trim by actually deleting database rows from the BOINC database once the scientific data has been inserted into the science database) - this process hung for some unknown reason and the BOINC db grew greatly in size. A simple restart fixed that. As for that index corruption in the science database I mentioned last week, that index was rebuilt just fine, but only after we took one drive in the particular RAID holding these indexes off line - smartd was reporting a lot of errors so we think that drive was the culprit of the corruption. We'll try to replace it sooner or later (the system is now down to only 47 out of 48 500GB drives). I haven't fully caught up yet from being gone but I imagine there will be some AstroPulse ramping up to report sooner or later. I see scheduler updates have been made (and I think put into beta). I'll meet with Jeff/Eric later and discuss. Looks like there will be a campus network outage that affects us this upcoming Wednesday morning - it will last about a half hour, starting at 6:30am (Pacific Time). A couple router upgrades from what I can tell. - Matt 15 Jul 2008 22:42:09 UTC Had the typical weekly outage today - the results of which were much happier than last week. We were also hoping to fsck the mysql data drive that gave us grief last week to make sure it's okay, but the outage was taking too long so we'll do that later. We did fire off our weekly science database backup which quickly failed due to finding a corrupt page or two. This happens from time to time - and turns out this particular corruption is within an index that we can easily drop and recreate if the usual data-cleanup utility doesn't work. Also science database replication broke at some recent point, probably because the primary database catching up on backlogged inserts caused some kind of handshake timeout. 
No big deal - replication is catching up now. The campus network graphs are all out, which is how we confirm what our current bandwidth usage is. I hope this will get fixed soon. I feel like a doctor without a stethoscope. - Matt 14 Jul 2008 23:07:47 UTC So the second half of last week was spent trying to figure out why our database server was so painfully slow. Bob, Jeff, Eric, and I were scratching our heads, trying this and that to diagnose and fix this mysterious problem. Everything was fine before the Tuesday outage, nothing changed during the outage, but upon restarting the project we couldn't handle very much load. We were quick to blame mysql, as it has had random episodes in the past of secretive bookkeeping causing us grief. We ruled this out. We started blaming the "credited job" table which is growing infinitely. This is the table keeping track of which user did which workunit. We do nothing but insert into this table (no random access selects), so why would that be a problem? Nevertheless we turned off inserts (back to writing similar info to flat files for later parsing) to no avail. Maybe it was hardware? Did a disk fail? Is a disk about to fail? We ruled all that out as well, which brought the focus back on mysql with dozens of server tuneables that we tweaked for various reasons over the years. Did we go too far with some of those variables? We convinced ourselves that wasn't it. Of course in hindsight the ultimate solution seems obvious: the filesystem where all the data is kept. Just because the hardware seems okay, and I/O rates are normal, doesn't mean the filesystem is happy. And the focus was back on "credited job" as this table is constantly growing and therefore a big ol' file - much bigger than anything else. A file that is constantly growing during all other inserts and updates that happen as the project is running will likely become interleaved and fragmented to the nth degree. 
Without fearing data loss we dropped the credited job indexes and that alone broke the dam. Well, jeez. We're still catching up from the backlog, but mysql is performing incredibly well at this point. This is good, as we're hoping to release Astropulse before the end of the week. More on that later. Happy Bastille Day, by the way. - Matt 8 Jul 2008 23:19:59 UTC Weekly outage day (to compress/backup BOINC database). It lasted a little longer than usual due to some confusion - unbeknownst to me a recent web code update was made that broke the "stop_web" mechanism which keeps the database quiescent during the outage. It's also taking a long time to recover. Not sure why but we'll see if the clog pushes through. I took advantage of the outage to move server anakin into the closet. We also upgraded the RAID card BIOS to see if that fixes our minor issues with ptolemy's current hardware RAID setup. Well, its logical volume initialization is still way too slow, but maybe we'll live with that if all future resyncs are fast. Just wrapped up the scoring meeting I mentioned yesterday. The bottom line being our current scoring algorithms for individual signals (spikes, gaussians, pulses, triplets) are sound, the multiplet scores (interesting groups of signals of a single type) are 99.9% sound, and metacandidate scores (of single sky pixels containing "candidates" like individual signals, multiplets, or stuff observed from previous SETI projects, as well as interesting celestial objects) are still way up for debate as this is where individual philosophies differ, but we'll probably just go with the easiest solution (multiply all the candidate probabilities together) and see what that list looks like. Jeff will write all this up. Maybe we'll even have a science newsletter. Jeez... still having a hard time recovering... 
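To make the "multiply all the candidate probabilities together" idea from the scoring meeting concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the function name is hypothetical, the posts don't say what each probability represents or how it is computed, and summing logarithms instead of multiplying directly is a substitution made here to avoid floating-point underflow when a pixel has many candidates.

```python
import math


def metacandidate_log_score(probabilities):
    """Combine per-candidate probabilities for one sky pixel.

    Returns the natural log of the product of the probabilities. Summing
    logs gives the same ranking as the plain product but does not underflow
    to zero when many small values are combined.
    """
    if not probabilities:
        raise ValueError("need at least one candidate probability")
    total = 0.0
    for p in probabilities:
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must be in (0, 1]")
        total += math.log(p)
    return total
```

Because the logarithm is monotonic, sorting pixels by this value gives the same list order as sorting by the raw product.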
- Matt 7 Jul 2008 22:23:11 UTC Rather dull holiday weekend except for the fact I was up in Oregon and remotely dealing with several server issues hidden from the public - nothing really newsworthy. Various previously mentioned projects are continuing along: I'm installing an OS on ptolemy in the hopes we can flash upgrade the current RAID cards' software and see if that helps, otherwise we're buying new cards that we *know* work. I might do a bit of physical server shuffling during the weekly outage tomorrow - get some of the newer stuff into the closet - maybe. Looks like the big "scoring meeting" is also tomorrow where we will try to settle on our candidate scoring algorithms. Basically we need to pool together our scoring techniques from previous reobservation runs and apply them to the nitpicker which, unlike all prior data analysis, runs and updates in real time as signals flow in. It was easier before, at least in the candidate analysis I've done. You'd turn the crank, look at the results, adjust some variables and turn the crank again. Not so easy to be as casual and change algorithms when the crank is turning 24/7 and a million signals are added every day. Oh yeah - back to the "ALFA running" problem on the science status page. Turns out we need to recompile our program that peeks at the observatory status broadcasts for our own status pages. This hasn't been recompiled in ages, and much has changed in the meantime. An added complication is that this is running on a Solaris machine down in Puerto Rico, making recompiling old, stale code a challenge. Jeff is tackling that. - Matt 3 Jul 2008 21:11:53 UTC Crazy day getting ready for the long July 4th weekend. There was more testing on ptolemy with more depressing results (why isn't it picking up the hot spare when I pulled a drive out from an active array?!). 
I actually yanked the whole server out of the closet (which required me temporarily shutting down one of the download servers which was physically in the way - but nobody seemed to notice much). We opened it up and found the RAID is indeed on cards and not the motherboard, which is good as this means if we can't get this to ultimately work we can get some 3ware cards (or some such) instead. Meanwhile, with ptolemy pretty much gone we've been having mounting problems with servers still requesting its disks. No matter how hard you try there's always some dependencies that hide until too late. So it's been a morning full of killing automounter processes, cleaning up stale mounts, deleting bogus trigger files, restarting services, etc. This was mostly hidden from the public - except for several status pages being out of whack. Actually the assimilators all froze but this was hidden behind the stale server status page. Now the queue is pretty large, but it should drain out just fine. Eric and Jeff are still getting to the bottom of the database/esql interface woes, doing some extreme programming over by Jeff's desk. Converting lists with cryptic, undocumented size limits to blobs. One of the last major hurdles for the first rev of the nitpicker. Then it's doing all the scoring algorithms, which we'll discuss next week. - Matt 2 Jul 2008 22:29:10 UTC Working on ptolemy's conversion into a NAS box today, with the focus on putting bigger drives in it and testing out its onboard RAID controllers. We're finding the hardware RAID to be a bit outdated and not exactly everything we want. For example, it has a 2TB logical drive size limit, and we can't create logical drives using more than half the physical drives (they are split over two separate controllers). I guess we can deal. Some user web/user interfaces got broke over the past 24 hours. First, the credit certificates. Incomplete updates were made which were confusing. Dave cleaned that up. 
Second, the "special user" tags got reset by accident - this also got cleaned up but in the process we temporarily gave some users extra powers (the mysql table dumps were comma delimited so forum signatures containing commas offset the values, blah blah blah). Regarding the "ALFA running" bit on the science status page - I think I fixed this, but we haven't collected ALFA data since, and won't for a while, so I don't have truly positive confirmation yet. Not a big crisis either way, though I hope we get more ALFA time soon. - Matt 1 Jul 2008 22:09:19 UTC Today's Tuesday, which means we went through the usual database cleanup/backup outage. That went smoothly. As I may have already noted before, the replica mysql server has been regularly failing when actually writing the dump to disk. Our suspicion was that this server was having difficulty reaching the NAS via NFS - and mysql has been ultra-sensitive to any NFS issues. The master server doesn't have this problem, but maybe that's because it's attached to the NAS via a single switch (as opposed to the replica, which is going through at least three switches). Anyway.. we dumped the replica database locally and it worked fine. Our theory was strengthened, though not 100% confirmed. While the project was down we plucked out an old (and pretty much unused) serial console server from the closet. That saves us an IP address (we get charged per IP address per month as part of university overhead - which is another reason I try to keep our server pool lean and trim). I also cleaned up our current Hurricane Electric network IP address inventory and realized there were some old, dead entries in the DNS maps, which I cleaned up as well. Not sure if this is what has been causing lingering scheduler-connection problems. We shall see. As noted in the previous tech news thread, the science status page has been continually showing Alfa (the receiver from which we currently collect data) as "not running" for a while now. 
This was lost in the noise as Alfa actually hasn't been running much recently, but it still should have been shown as "running" every so often as data trickles in here and there. Looking back at the logs there has been a problem for some time now. We get the telescope specific data (pointing information, what receivers are on, etc.) every few seconds as they are broadcast to all the projects around the observatory. Perhaps the timing/format of these broadcasts has changed? In any case, I'm finding our script that reads these broadcasts is occasionally missing information, so I made it more insistent. We'll see if that helps. - Matt 30 Jun 2008 21:58:57 UTC A rather static weekend which is always welcome. This morning I found that, despite DNS changes made several days ago, many clients are still connecting to the old scheduling server. I find this particularly frustrating as there is no legitimate reason for anything to be caching bogus domain information for more than 5 days, especially if said domain had a 5 minute time to live. We need to get to work on this server, so I opened up a currently unused port on one of our non-public servers and gave it the old scheduler IP address to forward along to the new address, thereby acting as a "detour" so we can get to work. Hopefully over time clients will get wind of the correct IP address so we can turn off this detour as well. Eric's back in town. Overheard him and Jeff talking a bit about current nitpicker/database programming woes. Seems like an effective new strategy is being enacted. Other than that, no real news to report and nothing but chores and meetings all day today for me, pretty much. - Matt 26 Jun 2008 21:07:44 UTC The new scheduler continues to be handling its new duties just fine. Slowly but surely people are moving their connections over to this new server, but I'm not convinced the change rate is fast enough to do a wholesale cutover by next week. We shall see. 
Funny aside: while getting new-ish donated server "clarke" up yesterday I was annoyed to find that Fedora Core 9 was booting to run level 5 (where it loads the X windowing environment). We don't need X on these servers, so we typically set our servers to boot to run level 3 via a change in /etc/inittab. In doing so, I'd comment out the old line with a "#" and enter in a new line with the adjusted run level. It was still booting up in X. Why? Turns out the latest inittab parser (new with FC9, I guess) ignores "#" comments in inittab, and just looks for lines containing the string "initdefault" and parses the first one it finds. Since I left the old line in there commented out (or so I thought) it was superseding the line I wanted. So much for standards (and clear documentation stating when/how standards change). Nitpicker weirdness: While finally getting around to testing the few optimizations I made to Jeff's code I found that multiple runs of the nitpicker on the same pixel were producing slightly different results each time. We believe this is due to the order which the database pulls out rows - unless requested otherwise databases generally pull things out in random order, i.e. the order which requires the least I/O at that exact point in time (mostly due to page caching or where the many drive arms are currently located in our RAID set). Sorting query output adds significant (and usually unnecessary) overhead. But there are a lot of "fuzzy compares" in the nitpicker (due to floating point computations on different chips you can't expect decimal values to be "exactly exact"). When two items are close enough to be called "duplicates" you only need one, but which one you pick may cause different results down the road. So Jeff is elbow deep in this problem right now. Apropos of nothing, the entire northern half of state of California is on fire. The smoke ending up here in the Bay Area is intense. 
I feel like I'm smoking a couple packs a day just walking around outside. I can smell it sitting here at my desk. - Matt 25 Jun 2008 22:23:54 UTC This morning we turned off the scheduling server on ptolemy and started it up on anakin. This basically worked right out of the box. Pretty quickly we determined the lower traffic rates were due to DNS rollout. Despite having the TTL (time to live) on the download name (boinc2.ssl.berkeley.edu) set to 5 minutes, it sometimes takes weeks to fully convince the world the change has been made. This is due to various types of DNS caching I still don't fully understand (why don't they all obey the TTL?). Stopping/restarting the BOINC client sometimes resolves this. However, after an hour or so I decided to play nice and turn ptolemy back on, set up so that apache forwards all lagging scheduling requests over to anakin with a "permanently moved" warning. I guess I should have done this from the get-go, but better late than never. Immediately this seemed to help, but only the uploads. Download traffic still remained under some rather low ceiling. So I checked the two redundant download servers (bane and vader). Turns out bane wasn't serving any download requests. Was it even getting any? That part is a total mystery - nothing changed in any configurations pertaining to these servers. I double checked the DNS updates. No smoking guns there, either. Well, bane had weird dns/mounting/apache problems before that a quick reboot cleared up, so I rebooted it, after which it seemed to be "better" but not by much. Instead of 0 requests per second before reboot, it started serving 2 or 3 - vader is serving around 10. What's the deal, then? Perhaps this has to do with our "pound" load balancing utility recognizing bane was having trouble (strangely coincident but unrelated to the anakin switch) and favoring vader until bane got better. I filed this under "unrelated and currently harmless problem." Anyway.. 
I then noticed (in between doing other tasks, hence the lag) the upload traffic was increasing way beyond expectations. I assumed everything was okay as all the apache logs were reporting no errors, but indeed the requests forwarded from ptolemy to anakin were failing. Why? Because the http headers were missing variables, including the all-important "Content-Length." Why?!! This I have no idea, but apparently, somewhere between apache and/or the boinc client, redirected traffic ends up with different and less informative http headers. And so the schedulers on anakin were saying, "I don't know what you want - try again in 10 seconds." This got worse and worse as more clients wrapped up their current workunits and tried to connect. The solution to all that was to *not* do apache redirects (both 301 and 302 redirects had the same effect) but to use good ol' pound to simply shovel ptolemy's packets towards anakin. This helped all our DNS-lagging clients to finally connect again, but won't help to inform them that the scheduling server has indeed changed. Hopefully the clients will learn on their own in the coming days. We plan to turn off ptolemy outright early next week. Nitpicker progress has been slowed by database programming issues. Informix has undocumented limits on user-defined lists in certain contexts. We may have to work around all that using something other than lists. Jeff's been banging on this and other similar programming hurdles for a while, hence the lack of recent info. Plus we have yet to sit down and discuss candidate scoring algorithms which will only happen if we can manage to get the four parties involved (Dan, Eric, Jeff, and me) in the same room at the same time without greater problems hanging over our heads. This hasn't happened in, well, months. At least glacial speeds are non-zero speeds. - Matt 24 Jun 2008 21:50:01 UTC Had the usual outage today. No news there, and we're recovering normally at the moment. Continuing along the hardware vs. 
software RAID theme, we have vast experience getting bitten by both - in the early days of SETI@home we got burned by hardware RAID, hence our current general affinity towards software. However, today Jeff and I got over the (very small) hump of learning how to query the recently donated IBM Xseries on-board RAID from within linux and decided that we're going to learn to enjoy living with a zillion different kinds of RAID, each employed based on current needs and resources. Tomorrow we're going to attempt converting our scheduler to the new-used system "anakin" so we can then convert the current scheduler (ptolemy) into a NAS box (to ultimately replace the NAS taking up one third of our server closet). Expect funky DNS rollout issues. - Matt 23 Jun 2008 22:22:22 UTC Another weekend without much ado. Our assimilator queue is low but not exactly pegged at zero. What's causing it to not run as fast as all the other backend processes? Not entirely sure, but we know of several things that happen from time to time which may be the problem (i.e. cause extra load on the science database), or at least aggravate the problem. But for now, it's not even close to a tragedy, so we're just keeping our eye on it. I guess we did have a disk failure on thumper (the master science database server), or at least a disk complaint. It didn't cause any downtime or data loss, but it's getting us to reconsider our current stance on software vs. hardware RAID. We've been sticking with software RAID due to ease of use and quickness of warning, but we're finding it sometimes doesn't behave the exact way we expect, or sometimes not the best way. So this event inspired some additional R&D on that front. I just rebooted the main web server, so that was offline for a couple minutes. No big deal - just some mounting issues that needed to be cleared out. - Matt 19 Jun 2008 19:41:22 UTC We're still maintaining an assimilator queue, but it is indeed draining over time. 
Besides the nitpicker CPU consumption issues addressed yesterday, we're also doing several data transfers down to HPSS (our off-site storage) including a large science database backup, as well as several raw data files (we keep copies of all raw data down there). All these things - the backups, the raw data storage, the nitpicker, and the assimilation of new results - run on thumper (because that's where all the data are). So there's basic I/O contention at the moment. Other than that I have nothing to report - I've been mostly occupied by bureaucratic/policy tasks for the past while. I was also annoyed to find somebody threw away my plastic fork, which I admit has been sitting used and unwashed on my desk for days, but nevertheless I came to work expecting to eat my lunch with it. The lab kitchen is oddly devoid of utensils. I did find a pile of aged wooden coffee stirrers, out of which I fashioned a pair of makeshift chopsticks. There's a halo around the sun at the moment. Cool. - Matt 18 Jun 2008 23:16:03 UTC The assimilator queue grew again. The main culprit this time was the NTPCkr - from here on out I'll simply refer to it as the nitpicker - as a reminder this is the program that is pretty much the culmination of all our SETI@home data collection and analysis, i.e. it's the thing that'll find the aliens if they exist. All other analyses so far using SETI@home data were cursory by comparison. Anyway.. we're finding every so often that we have "deep" pixels containing tens of thousands of multiplets, each containing thousands of signals. When my "science status page updater" hits one of these it hangs on for quite a long time, causing a heavy CPU load on the database server as it tries to wade through this flood of signals gathering statistics. My optimizations (mentioned earlier in the week) helped, but not enough. We may devise/implement more. In any case, the heavy nitpicker load made the assimilators slow down. 
We killed those particular processes and I think we're catching up again. Slowly. So the donation processing suite had been choked for a couple weeks and nobody noticed. This was caused by a suddenly (and silently) more stringent firewall, and masked by several things. We've been getting the donations, just no confirmations. So there's quite a few missing green stars I imagine. Not exactly sure what to do about that just yet. - Matt 17 Jun 2008 20:44:23 UTC Ho hum weekend, which is good. The air conditioning people came up yesterday (Monday) and today to do follow-up inspection of our server closet system (which failed last week) and found a couple more leaks which have been repaired. We seem to really be pushing it beyond its limits. Had the usual database outage today. No big whoop there. Somebody noted earlier that their results were getting validated surprisingly quickly. We didn't change anything. This may have been due to a longer-than-usual period this past weekend of fast workunits - the average turnaround time was roughly 10 hours (about 20%) shorter than normal, meaning pairs were getting matched up that much faster. A lot of what's been going on the past couple of days has been post-vacation catchup (half the staff was out of town). While I have a zillion other things to do I discovered a couple ways to optimize the NTPCkr so I coded that up and I'm testing it now. Every little speedup on this front helps. Jeff's still working on the scoring part. We're getting there... - Matt 11 Jun 2008 21:25:25 UTC Some general BOINC code got updated on our servers this morning, which broke a couple things (some pages went blank, and the php "magic quotes" got messed up causing all kinds of backslashes to appear everywhere). I whined to Dave and he fixed it, which is usually how these particular problems sort themselves out. 
The problem with the web code is that it is being completely or partially used by all kinds of BOINC projects, so a "fix" for one project may end up unexpectedly being a "bug" for another, which is why this kind of thing happens from time to time. We try to keep SETI@home as up to date with the BOINC source tree as possible, even if that means we're on the "bleeding edge." Of course this is all web code, so problems like these are cosmetic and relatively minor in the grand scheme of things. We do more thorough alpha/beta testing of the important back-end functions - you know, the ones that update millions of database records every day. Other than that today has seen more OS installs/RAID manipulations on various donated servers that have been anxiously waiting their call to duty (I got beyond the issues I was having yesterday). Slowly but surely we'll get these up and running. I also got a bunch of data drives from Arecibo - it's been a while since we got a batch of fresh data up here, so I'm now lost in data pipeline management mode. - Matt 10 Jun 2008 22:20:19 UTC Normal Tuesday outage. Didn't really do anything special this time around. I did mess around with server "anakin" a bit (the presumptive replacement scheduling server) - for starters it keeps booting up in X (though the inittab says not to) and one of its drives got marked as "defunct" (the hardware RAID is rather confusing - I can't figure out how to "unfail" the drive). Both really minor issues. At least there was zero fallout from the air conditioner failure yesterday. Other than that I'm mostly working on mundane sys admin chores and catching up on some back-end diagnostic/analysis stuff. - Matt 9 Jun 2008 20:52:35 UTC Over the weekend the scheduler ceased operations on its own again. I was able to remotely fix this Saturday morning and recovery was swift. 
This was the same problem as earlier in the week but this time we had a smoking gun: the CGI output log file was maxed out at 2GB in size (this is running on a 32 bit system). Cleaning out the logs solved the problem. The thing is: We've been letting these logs grow to 2GB in size for months without any issue. So why is this a problem all of a sudden? However strange, I put a log rotation script in place to prevent this from happening again any time soon. Funny side note: I would have gotten the alerts faster but coincidentally the lab-wide mail servers conked out as well Saturday morning. Other than that, nothing much to report the past couple of days. Which brings us to today. Around 12:30 our server closet air conditioning unit died. Within 30 minutes all the servers warmed up over 5 degrees Celsius and I started getting alerts. This may be a significant problem (i.e. we may need more than just a coolant refill). So depending on how fast we can get the maintenance people up here I might have to shut down parts or all of the project to prevent server burnout. Meanwhile, I have the server closet doors open to help cool things down, much to the annoyance of all the projects on this floor (the fan noise is about 20-30 decibels louder with the doors open). The poor people across the hall from the closet are being deafened - my desk is a few doors down. - Matt 5 Jun 2008 21:24:59 UTC Another mild day in server land. Lots of minor apache issues. There was an annoying web scrape yesterday afternoon that gummed up the works for a moment. This morning I found a bug in the web log rotation script that prevented our public web server from restarting - so it's been running for weeks non-stop during which the httpd processes bloated in size (apparently there are small/tolerable memory leaks in php/apache/boinc code somewhere). Then later our scheduling server was suddenly unable to run the scheduler cgi. We were dropping connections so I got alerts right away about this. 
I had to stop/restart apache twice, though, to get it working again. Not sure why the first restart didn't take. Jeff's adding more star catalog data to our database. Bob worked on another alert script to better check our current database storage allocations (and prevent another minor mishap like earlier this week). Eric and I swapped drives between his hydrogen server "ewen" and ptolemy (for when the latter becomes a storage server) - ewen freaked out a little bit unexpectedly - we umounted the filesystems before pulling the drives, but an xfs daemon woke up and thought that particular partition should still be around, etc. No big deal - just a lot of alert e-mails that were scary at first. - Matt 4 Jun 2008 20:06:25 UTC Things are continuing to clear up nicely since the science database kerfuffle earlier this week. The assimilator queue is still large, but now that everything is more or less "caught up" it's draining at a pretty good clip. Nobody probably noticed but for a while there this morning (actually still as I type this sentence) we had two scheduling servers - ptolemy and anakin. I finally got anakin up and configured and made it a secondary scheduler to test it out. Once we're ready to convert ptolemy into something else, we now have another scheduling server in our back pocket. - Matt 3 Jun 2008 21:46:01 UTC Good news. The science database problems were far less severe than we thought. Short story: we ran out of space. Long story: due to a slightly confusing configuration we thought we ran out of extents for reasons unclear. Informix categorizes all usable storage space into dbspaces, fragments, chunks, extents... maybe more things I'm not sure. We've had problems in the past where we ran out of extents long before running out of actual disk space and we thought this is what happened again. The solution for such is painful - basically like rebuilding a RAID system (unload everything, recreate, and reload). 
Luckily we discovered we had some fragments/chunks misaligned (some fragments had more chunks than others) so all we had to do was add more chunks, and we had plenty of disk space for that. We added enough to get by for now, and will do more when we catch up from the queue draining/filling. We had our usual outage today (for BOINC database backup/compression, etc.). Between the usual recovery for that and the recovery for all the above it may be a bumpy ride for the next 24 hours or so. Yesterday afternoon server "bane" (one of the two download servers) was having mounting issues which required a reboot to clean up. I was home at the time and rebooted it remotely. Of course, like my desktop last week, a new kernel was yum'ed in during the recent past and messed up grub for some reason, so it wouldn't load the OS. I had to get Jeff, who was still at the lab, to deal with booting from the emergency DVD and boot from an older kernel. While bane was down half the downloads connections were failing, but usually retries were successful as we have the two redundant servers. Today I got server anakin more officially racked up (actually just sitting in a rack directly on top of a UPS) to ultimately become the new scheduler. It's a recently donated Dual Xeon (used) that is actually less powerful than our current scheduler, ptolemy, but should be able to handle the job just fine. We plan on making ptolemy, with its 16 mostly unused drive bays, a network storage server to replace our ageing Network Appliance server, which fell out of service long ago and its many drives are dying with regularity - infrequent but still worrisome. - Matt 2 Jun 2008 18:58:32 UTC Early Sunday morning I discovered the assimilators were all failing. Immediate analysis uncovered zero smoking guns. All the assimilators were choking on the same subset of results, and all while inserting pulses. Plus the actual processes were seg-faulting before they could produce any useful error codes. 
Checking the failing result files and database entries showed nothing obvious (all different sizes, submitted at different times, created by different clients, etc.). I did all I could do. I told the other guys (Bob, Jeff, Eric) - Bob's checking the database now for any subtle weird behaviour (once again I found no obvious problems yesterday) and Jeff's recompiling the assimilator code (perhaps a version that outputs useful error information). In the meantime, the assimilation queue grows, and our disk usage grows with it (as we haven't deleted anything in over a day) - sooner than later I'll have to stop the splitters to prevent storage disasters. I'll update this thread if we figure out what's up on that front. The only other real gripe right now is that our data recorder system at Arecibo is only seeing one of two data drives. Not a tragedy - we can still record data but this will put additional strain on the operators down there until we figure out why. - Matt 29 May 2008 22:40:14 UTC I spent the entire day so far (and will certainly continue after writing this missive) doing nothing anybody will ever care about - mostly revolving around php programming for an upcoming letter drive (more on that later). My desktop was getting funky X errors so I decided it was due for a reboot, and then it wouldn't come up again. This new Fedora Core 9 distro apparently yum'ed in something which broke the boot loader. An hour or two spent trying to suss that out and ultimately reinstalling the OS and I'm back in business. We did have a software meeting earlier - we're getting back on track with various stagnant analysis/database projects. Also discussed the Google Sky map stuff - they get their images from many different sources, so it's still unclear what epoch the coordinates are in. No simple official statements like, "Google Sky coordinates are entirely in J2000." 
So we're going to have this cosmetic issue where the image data on the science status page may not exactly line up with our reality (which is J2000). In any case, this is hardly a scientific issue as it doesn't affect our analysis - just what's in that neat little Google window. - Matt 28 May 2008 20:04:41 UTC People noticed there were short network "hiccups" during the course of the evening, ending this morning. All of it was quite mysterious - no database problems, no workunit storage server problems, and at first no obvious download server problems. Upon further examination I found the DNS configuration was "lopsided" towards one of the two download servers. We have load balancing software on both machines so they were sending equal numbers of workunits, but all initial requests hit only one of the two. This hasn't been a problem before, but apparently this week's outage caused enough strain on apache such that every few hours the load got fairly high and log rotation would take abnormally long (several minutes) and nothing could get through during that time. We are also at our highest active user level in over a year (about 10% higher than a couple months ago), so maybe that added to the apache/server stress level, and what we were seeing were outage "aftershocks." In any case, I fixed the DNS so perhaps this won't be so drastic next week (and hopefully for many weeks to come). Work on the NTPCkr continues - Jeff uploaded the Hipparcos Catalog to the database, so I added a star count on the science status page for the pixel we are currently observing. Of course, the more stars in a pixel the higher the score. However, there are only about 100,000 catalogued stars and 15,000,000 pixels. So odds are pretty high we are observing zero (known) stars at any given moment. Oh yeah the idle splitter processes - a couple were shirking their duties. I told them to stop slacking off and get back to work. 
Not that we needed them but it looks bad to have 'em sitting around doing nothing (in reality they were stuck on some stale trigger files). - Matt 27 May 2008 21:23:45 UTC Long holiday weekend (Memorial Day). On the actual day off (yesterday) the BOINC web/download server was misbehaving. In theory I should have been able to connect to the KVM from home but that wasn't working properly (couldn't access via the web due to incompatibilities with newer JRE versions, couldn't access via the standalone client since I ain't got no Windows machines and the client only works on Windows, etc.) so I had to drive up to the lab to kick it in person. No big deal - just a runaway job that clobbered the process queue. Had the usual database backup outage today. Not much news to report. To answer RHWhelan from my last thread: > ...it seems that most of the data we analyze gets dumped soon after we report. Not sure what you mean by dumped but nothing important is getting thrown out. Your SETI@home client reduces about 350K of raw data into a few signals which get plopped into a result file and uploaded to our server. Once these signals are verified and put into our master database the result file (and its sister row in the database) are deleted to make way for more. The signals themselves never get deleted. > It also appears that the real staff spends more time transferring, storing and manipulating data and hardware than actually analyzing the results. I don't mean to be critical, I am actually very devoted to the philosophy of SETI but I must admit it seems a bit futile. It appears that way because it's completely true. And there's nothing wrong with that. To be clear, the "real staff" running the entire show is me, Jeff, Eric, and Bob - all working part time (combined we're about 3 full time employees). Anyway... I understand the feelings of frustration due to perceived futility - science takes time, underfunded/understaffed science takes even more. 
We're only just now turning the corner on the analysis. Until final results start appearing, we're still productively collecting/reducing data - not as interesting, but still quite useful. I don't expect everybody to maintain interest until we have some real data products, and then I expect interest to jump. > Are there ever any "HITS" or even slightly suspicious data streams? There are hits and then there are HITS. We haven't really looked for the HITS yet as we've been unable to until very recently (that part is working now in beta). There are no data "streams" as data don't come to us in streams - the earth rotates so signals that persist over time that are actually originating from outer space will only last a few seconds as our beam passes over them. When I first started working on SETI in 1997 the group here (just Dan and Jeff at the time) was wrapping up final analysis on SERENDIP III. Didn't find anything really interesting. Then we started collecting data for SERENDIP IV. We were starting to dig into the final analysis of that data set (about 60GB) when SETI@home came into being and derailed that, though Jeff and I have been plotting to wrap that up sometime soon (once we get the SETI@home final analysis rolling). SERENDIP IV is actually interesting, even with 11 year old data - the analysis is hardly as deep as SETI@home, but much wider: the frequency range is about 35 times bigger than SETI@home. We are also doing Optical SETI, and pulsar searching... The point is that SETI@home isn't all we do, nor is our lab here at Berkeley the only SETI lab on the planet. Nevertheless we do have the biggest, bestest search going by far. - Matt 22 May 2008 22:35:37 UTC More database poking/prodding today. Tweaking different mysql variables (and even adding "noatime" and "nodiratime" to the mount options of the data partitions) didn't really help all that much in regards to the transaction committing stuff I was whining about yesterday. So be it. 
Bob and I also found this morning that our science database indexes were in need of rebuilding as well. Every few weeks we need to run an "update statistics" query to keep those indexes in line. Slowly working my way through the OS upgrade queue. We're getting FC9 installed on one of three recently donated servers (dual 2.80GHz Xeon / 4 GB RAM) so we can finally start getting these (and another equally powerful P4 server with more RAM, also recently donated) thrown into the fold. The use of these is still up for debate, though they all will be perfectly good general backup/redundant/compute servers. We are definitely missing some redundancy on the backend. I mean, we do have server "maul" sitting around which is quite powerful but being a test model donated by Intel it has an engineering motherboard with keyboard/mouse issues, so we don't want to trust it with anything that needs to have 24/7 uptime - instead it's up and running as a test/compute server, i.e. if it goes off line for any period of time we won't be sad. Anything else? Just some work on more internal data plots for data integrity checking, and the final bits and pieces of that proposal which is due tomorrow. - Matt 21 May 2008 22:16:59 UTC The BOINC mysql replica wrapped up its resync. This morning Bob did some testing to see if we can improve our failure/recovery situation. MySQL allows different levels of log commitments to disk: commit only when the buffer is full, commit at least once a second, or commit on every transaction. We've been sticking with the middle option, as that affords us the most protection without heavy disk I/O - the worst case is that we lose one second's worth of data. However, we've proven a couple times now that we do many updates per second (i.e. hundreds) and that's enough to bring the master/replica majorly out of sync if one crashes before being able to commit. 
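To make that trade-off concrete, here is a minimal sketch (not anything we actually run) that models a transaction log flushed either on a timer or on every commit, and counts both the number of disk flushes and the commits that would be lost in a crash. The function name `simulate_crash`, the load of one update every 4 ms, and the crash time are all made-up assumptions for illustration; real numbers depend on the workload and the storage.

```python
def simulate_crash(commit_times_ms, flush_interval_ms, crash_time_ms):
    """Return (flush_count, lost_commits) for a transaction log.

    flush_interval_ms == 0 models "flush on every commit"; any positive
    value models "flush the whole batch every N milliseconds".
    """
    committed = [t for t in commit_times_ms if t <= crash_time_ms]
    if flush_interval_ms == 0:
        # Every commit is flushed as it happens: nothing is lost,
        # but the disk sees one flush per transaction.
        return len(committed), 0
    flush_count = crash_time_ms // flush_interval_ms
    last_flush_ms = flush_count * flush_interval_ms
    # Anything committed after the last completed flush exists only in memory.
    lost = sum(1 for t in committed if t > last_flush_ms)
    return flush_count, lost


# Hypothetical load: one update every 4 ms (250 per second) for 10 seconds,
# with a crash just before the 10-second mark.
commits = range(0, 10_000, 4)
batched = simulate_crash(commits, flush_interval_ms=1000, crash_time_ms=9_999)
per_commit = simulate_crash(commits, flush_interval_ms=0, crash_time_ms=9_999)
print("flush once per second:", batched)
print("flush on every commit:", per_commit)
```

Under these assumed numbers the once-per-second policy costs only a handful of flushes but leaves a couple hundred commits exposed at crash time, while the per-commit policy loses nothing and pays with thousands of flushes. If the tables are InnoDB (an assumption; nothing above says so), the MySQL setting that governs this behavior is most likely `innodb_flush_log_at_trx_commit`.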
So today we tried the last option and expected an increase of disk I/O and sure enough this commit level brought the database to its knees almost instantaneously. We tried this first on the replica and thought it was its software RAID or low number of spindles causing the headache, but applying this to the heftier master had the same effect. So it's back to the drawing board on that front: we don't have the server capacity to commit on every transaction. Maybe there's other screws we can tighten to make this possible. Bob's looking into that. More tests to come, or we'll just put this on the back burner. Other than that... Got FC9 running on my desktop. So two computers are upgraded now, and I'm getting to understand all the gotchas. Also Jeff and I actually are discussing SERENDIP again. You ever hear of that? That's the project we were working on before SETI@home happened, and it's been in limbo for about 10 years. But as Dan continues to build SERENDIP-like spectrometer boards to help other SETI scientists around the world, these other projects may want to incorporate our data collection/analysis software, so we better dust that off sooner than later. In the process we can maybe throw the old SERENDIP IV data into the same database as SETI@home to buff up our sensitivity even more. That's the hope, anyway. - Matt 20 May 2008 20:44:57 UTC Today's weekly backup/compression outage was more or less normal, running the "recover replica from backup" drill without ado or incident. That's all continuing now behind the scenes as we already have the main project up and going through its usual quick recovery. In the previous thread Joker mentions some (broken) changes on the account page, etc. I see that a lot of php files were updated on our web site. We sync our web site from time to time with the most current versions in the BOINC html repository, and of course this may alter behavior of certain pages or break them altogether. The appropriate parties have been notified. 
- Matt 19 May 2008 23:11:32 UTC Fairly straightforward weekend, server-wise. We're still without our BOINC mysql replica database (see previous note) but we'll clean all that up tomorrow during the usual Tuesday outage. We'll also test some mysql configuration options which may protect us from such failures but at the expense of increasing disk I/O. Basically mysql could write every transaction immediately to disk as opposed to writing all queued transactions in a batch once per second - which doesn't sound like much but we can do hundreds of updates per second at times. Still fighting with Fedora Core 9 on the test system. Ultimately trying to yum up from FC6 failed, and trying an upgrade from DVD failed - I just couldn't get X to work. So I did a clean install and that fixed the X problem, but there are some surprising but minor issues I'm working around. For example, a bug (or feature) prevented the ifcfg-eth0 script from having a "GATEWAY=" line, so I had to add that by hand to get network connectivity. And autofs wasn't installed by default. I yum'ed it in and it isn't working. I'm debugging that now. Oh I see - "grpid" isn't a valid mount option anymore (?!). I did add yet more info of nonzero interest to the science status page - namely a link to a chart noting our entire SETI@home data distribution history. I made this chart for internal use originally, but decided it may be fun for the public to see when exactly we observed and roughly how much we analyzed per day. I know I added a couple of web features under the radar lately - I figure we'll publicize all the fun new tidbits in bulk at some point. - Matt 15 May 2008 23:35:49 UTC Okay today wasn't so great, but it could have been worse. Eric had continuing problems with ewen so he tackled that for a couple hours this morning, finally getting the thing to recognize its new SCSI drives upon reboot. 
The general network malaise that happens when ewen is offline masked the fact that, like before, BOINC mysql database server jocelyn suddenly rebooted itself for no apparent reason, causing the mysql engine to shut down ungracefully and requiring a lengthy cleanup. So that's why we were offline most of the day. Upon recovering, the replica server (sidious) was out of sync - no big surprise there but that means we'll have to rebuild the replica database yet again. What a pain! In theory we should be able to swap the roles of these two servers easily during such crises, but we haven't gotten a well oiled procedure in place yet for that. Maybe we'll start running drills on this soon. Thing is we didn't want to get fancy as we're near the end of the week, people are bogged down with the proposal, and I'm actually going out of town tomorrow for a quick private corporate gig in LA so I'm going to be completely out of touch for the next 40 hours starting.... now! - Matt 14 May 2008 23:48:03 UTC More of the same today. General progress slowed by grant proposal effort and continuing ewen debugging - as mentioned in yesterday's note, when ewen is down everything still works, more or less, just veeeeery sloooowly. I'm also experiencing some growing pains trying to install Fedora Core 9 on one of our test servers (which also, as it happens, sends out the "reminder" e-mails). Ran into problems with a standard "yum" live upgrade. Fair enough - I went to upgrade it from DVD but only then realized the system has only a CD drive. Sigh. So I had to pluck a DVD drive out of a defunct system. Then finally after the install X isn't working. I'm hoping a yum update at this point will fix that. On the bright side I continued Jeff's effort on Google Sky and converted our science status page to use it. Fun! I'll make a formal announcement of server status updates when I add one or two more things... 
- Matt 13 May 2008 22:11:58 UTC The standard weekly outage chores (database compression/backup, log rotation, general housecleaning) went by without much incident. It's the extra stuff we try to do at the same time that may or may not be as easy. Today Eric wanted to add a donated (and upgraded) 12TB disk array to his Hydrogen database server, ewen. We also took the opportunity to move a few things around in the closet now that there was rack space (and rack rails that fit!). The moving was fine - however ewen is having problems booting now. Eric added a couple SCSI cards, so maybe there's confusion about where the boot disk is, etc. In any case, ewen isn't really a SETI@home/BOINC server, but contains enough shared stuff that when it disappears, there's a general malaise in the BOINC backend. Uploads and downloads are fine - it's the splitter, validating, assimilating, etc. that's not going so well (if at all). Eric's beating his head on that. Meanwhile, random unix commands sometimes work immediately, sometimes take 30 seconds to respond. Not so fun. We hope to be beyond this before day's end. I did fight the crowds and downloaded Fedora Core 9 for soon-to-be server upgrades. I'm upgrading one test case now - so far so good. Jeff has been figuring out the Google Sky API. We'll probably replace the Sloan Survey pix on the science status page with this, as well as use Google Sky to show our current top candidates as they start rolling in via the NTPCkr. - Matt 12 May 2008 23:26:00 UTC Not really much of an exciting weekend server-wise, which is typically a good thing. Lots of little bits and pieces being put together to get the new project and scientific analysis software rolling, but nothing really to report outside of mundane details. Progress in general is temporarily slowed this week - we're a man down as Eric is lost in grant proposal land. Fedora Core 9 is coming out tomorrow. 
If the mirrors aren't swamped I may upgrade a test machine or two during the usual Tuesday outage. I'll also start bringing some recently donated servers on line which have been waiting on this release (I didn't want to install 8 just to have it become obsolete that much faster). We may also do some server closet shuffling during the downtime. Happy belated Mother's Day! - Matt 8 May 2008 21:17:25 UTC I'll start with hardware - just some minor things. First: the boinc.berkeley.edu website (and alpha projects) were down for a while this morning because the BOINC server froze. Still not sure why, but a power cycle cleared that up. Second: currently AstroPulse scientific data only exists in the "beta" realm - Bob and company are now creating the db spaces on the master science database server along with SETI@home. This may slow things down temporarily due to heavy disk I/O. Third: we got our second new enclosure (the previous one was broken) so we're starting to archive data off site again via our ISP, hence the slightly noticeable bump on our traffic graphs. I guess from this point on you shouldn't assume all transferred bits depicted on said graphs are due to workunit/result exchange. Software wise, we're chugging along on the various projects mentioned in previous threads. When we all get into programming mode this generally tends to uncover bugs/issues that went unnoticed during network manager mode (or scientist mode, or administrator mode, or ...). Things like being able to insert workunit_groups of any size, but only able to read ones under 8K. Not a problem when all we're doing is inserting, but now that we have to read them back in to do some precess adjustments, this constraint uncovered a few such groups that were extra-large in size. Why? Well, that's what I mean - one little headscratcher leads to another. 
I've been on this all day, and Jeff's been beating his head on this "ragged file" problem causing some splitters to error out - but when we restart them on the same files they work. Why? Why?! Actually, these problems are kinda fun as when we do discover the root cause there's a happy "a-HA!" moment. - Matt 5 May 2008 22:44:09 UTC Typical weekend - a couple weird things but nothing tragic. For example the assimilator queue ballooned for a while, but then worked its way back down to zero on its own. There might have been mysql database load causing some general malaise like the above - no smoking guns have been found yet. Otherwise general progress. With the servers doing well I continue to send out reminder e-mails to users who haven't returned results in a while. We consistently fight a general downward trend as people buy new computers and forget to reinstall BOINC. Looking at the recent active user graphs out there I'd say about 10% of the reminder e-mails result in a returning user. Most of them bounce (or get spam filtered). Also a large fraction of these e-mails are currently going to users who haven't sent results back in years. So I imagine the success rate will increase over time, but on the other hand I imagine we won't be sending out such mails as often in the future (the number of people who could be deemed "ready to remind" is finite). Meanwhile I'm working on finally running the precess fixer (ran into some embedded sql issues this afternoon), while Jeff is almost ready to throw the NTPCkr into beta. We actually discussed public data visualization of candidates at our general meeting this afternoon. And it sounds like AstroPulse is pretty much ready for prime time as well. Woo-hoo! Happy Cinco de Mayo! - Matt 1 May 2008 21:03:51 UTC Happy May Day! Not much to report these past couple of days. 
We've mostly been bogged down doing actual software development, which for me has meant trying to wrap my brain around how to pull useful information out of the science database in an efficient manner. The "efficient" part is the crux given the size of the database. Nevertheless, I will be restarting the skymap processing again - watch for new maps soon, albeit of coarser resolution, but perhaps animated over time. We shall see. Jeff's been in NTPCkr land, mostly, though we've been working through continuing data flow issues together as well. Note how I added a third color (gray) to the splitter status section of the server status page. This denotes files that didn't complete due to error which, at this point, is always due to "ragged" files (i.e. missing blocks at the head/tail containing the radar blanking signal). We had lingering problems rebuilding the BOINC db replica. Despite getting a clean dump from the master, upon reload the replica complained of broken tables that needed repair. These tables did break in the recent past but have since been fixed, but maybe there were lingering error flags hanging around. Anyway Bob cleaned all that up and it's catching up now (again). EDIT: in case you're watching the network graphs, we just figured out how to send more data to our archives over the ISP - so the spike is raw data archival traffic, not some kind of sudden workunit download frenzy. - Matt 29 Apr 2008 22:08:03 UTC During today's outage, Jeff and I did yet more reorganization of room 329, culminating in finally, for the first time ever, putting sidious in a rack. This was a major step in filling this particular rack, which will hopefully replace one of the three racks in the closet sooner than later. We also did the steps to rebuild the replica database, which is happening in the background now. May complete tonight or tomorrow, and then it shall "catch up" quickly after that and we'll be back in business on that front. 
Clarifying the bottleneck I mentioned yesterday - this is strictly due to our current data processing rate. Drives with raw data come in, which we always archive to off site storage as well as copy into our processing directory (where the splitters read them to make workunits). In a perfect world, we'd be processing data as fast as we archive them, but to do so would require a lot more active users. So frequently our 8 terabyte processing directory fills up with unsplit data, and everything logjams. So this isn't a database bottleneck - it's a data bottleneck. More people/computers is the solution. Still, people asked for more info about the quality/quantity of database throughput. Here's a short essay about that. This is by no means complete but it's a good start. We have two databases, the mysql database which is BOINC specific (running on jocelyn, replicated on sidious - we call it the "BOINC" database), and the informix database which is SETI specific (running on thumper, replicated on bambi - we call it the "science" database). The science database, while very very large (billions of rows) is not a problem under normal conditions, even as we insert over a million new rows every day. This is because inserts are generally at the ends of tables, so it's all pretty much sequential writes and that's it. With the introduction of actual scientific data analysis comes large numbers of random access reads. Earlier this year, tests using the NTPCkr (our software to do such analysis) showed this will be a problem so we spent a couple months reconfiguring the science database server/RAID systems to optimize random access performance. We seem to be in the clear for now as we continue NTPCkr testing. The BOINC database is largely where problems arise, partially because this is our public facing database, i.e. users notice quickly when it isn't working. 
This contains all data pertaining to user stats, the web site, result/workunit flow, and the whole BOINC backend state machine. On average it gets about 600 queries per second, peaking at well over 2000 per second (like now, as we recover from today's outage). Thanks to many years of gaining expertise forming proper queries and creating proper indexes, 99% of these queries are super duper fast. But there are still unavoidable issues. The lifetime of a particular workunit and its constituent results is long, as they are created, sit on disk waiting to be sent, hang out in the database as users process them after which they succumb to the whole validation/assimilation/deletion cycle, and finally get purged after a 24 hour grace period (so users can still see finished results up on the web for some time after completion). Due to this lifetime, at any given point we have roughly 3 million workunits and 6 million results in the BOINC database. This is all important data, but it's mostly metadata - the scientific stuff is contained on larger files on disk. So even with these large tables, and the user/host tables, and forum/post/thread tables, all the commonly accessed parts of the database fit into memory cache when it's all "tightly packed." We create upwards of a million workunits/results a day in this database, which means the tables would immediately grow too large to be useful, which is why we purge (i.e. delete) them when they are finished - the useful data has been assimilated into the science database at this point anyhow. But deleting isn't in sequence - it's random as results don't return in sequential order. When rows are deleted from a mysql table, it doesn't free up space until ALL rows from the entire database page are deleted - something that isn't likely when done in random order. So even though row counts remain stagnant on these two tables, the tables bloat to roughly twice the size on disk by week's end, and mysql memory cache takes a major hit. 
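To make the page effect concrete, here is a toy simulation. It is only a sketch of the rule described above (a page is reclaimed only once every row on it is gone), not a model of mysql's actual storage engine. The function name and the page count, rows per page, and delete fraction are made-up illustrative values.

```python
import random


def simulate_page_reclaim(num_pages=1000, rows_per_page=50,
                          delete_fraction=0.5, seed=0):
    """Count fully emptied pages after sequential vs. random deletes.

    Hypothetical toy model: rows are packed into fixed-size pages in
    insert order, and a page is only reclaimable once ALL of its rows
    have been deleted.
    """
    total_rows = num_pages * rows_per_page
    num_deleted = int(total_rows * delete_fraction)

    def empty_pages(deleted_row_ids):
        # Track how many live rows remain on each page.
        remaining = [rows_per_page] * num_pages
        for row_id in deleted_row_ids:
            remaining[row_id // rows_per_page] -= 1
        return sum(1 for live in remaining if live == 0)

    # Sequential purge: the oldest rows go first, so whole pages empty out.
    sequential_ids = range(num_deleted)

    # Random purge: the same number of rows, scattered over the table.
    rng = random.Random(seed)
    random_ids = rng.sample(range(total_rows), num_deleted)

    return empty_pages(sequential_ids), empty_pages(random_ids)


if __name__ == "__main__":
    seq, rnd = simulate_page_reclaim()
    print("pages reclaimed, sequential deletes:", seq)
    print("pages reclaimed, random deletes:    ", rnd)
```

In this model, deleting half the rows in order empties half the pages, while deleting the same number of rows at random leaves almost every page partly occupied, so the table keeps its full footprint on disk and in cache.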
This is why we have a weekly outage to, among other things, compress the tables (or "repack" them). Meanwhile, there are daily unavoidable long queries, for example to do user/host/team stats dumps. To dump all this data means reading in whole tables into memory (not just pertinent rows/fields) - queries like this temporarily choke memory cache. Indexes won't help - we're reading in everything no matter what. Also meanwhile, I haven't mentioned the "credited_job" table which is actually the largest table in the BOINC database. We're still just inserting into it (harmless sequential writes) but I'm afraid this is a disaster waiting to happen once we start actually reading from it. Bottom line, the BOINC/mysql database is usually fine as of now. It beautifully handles a stunning variety of queries from several public servers and a rather busy backend. A perfect open source solution that folds nicely into the general BOINC philosophy (keep it standard and free). SETI@home is rather large compared to other BOINC projects, so we had to put a lot more TLC into maintaining our mysql servers, and we pass our improvements on to the general BOINC community. - Matt 28 Apr 2008 22:59:14 UTC Back from a relatively painless weekend. Except the replica mysql database is screwed up again - it got stuck on a duplicate ID (not sure why) which is relatively harmless but this caused its logs to grow at an inordinate rate, filling up the data drives and bringing the whole thing out of sync. Fine. We'll recreate the replica again during the outage tomorrow (much like we did a couple weeks ago). Since we've been fairly stable the past couple of weeks I continued to send out the "reminder" e-mails today which has already rocketed our active user base back over 200,000. This is good, as our current data flow bottleneck is the amount of data we are able to process, so the more computers the better. Tell your friends! - Matt 24 Apr 2008 20:33:28 UTC Work week wrapup. 
No major news outside of things I already posted here and elsewhere. People are out sick. Man there's been a lot of nasty bugs going around this year. I've been catching up on minor nagging items. Mostly cleaning up the lab - some recently donated servers are stuck waiting on fedora core 9 to be released as well as having no place to physically put the things to set them up. We have a lunch table in the center of the lab piled with random stuff so we're all eating lunch at our desks. Also worked on donation system upgrades. The IT people on campus are now allowing us to pass hidden user ids which will vastly increase my ability to match green stars to specific donators (we've been relying on people entering the right e-mail address on the donation form). Some updates to the boinc web interface broke a few pages - I fixed all that. Yeah.. lots of the usual day-to-day tasks. - Matt 22 Apr 2008 22:27:41 UTC Back from a long weekend out of town. Didn't seem to miss very much. I checked the network graphs while I was away and saw no dips, so that's a pretty good sign things were generally healthy in my absence. There was another seemingly bogus disk failure on thumper. Is smartd being too sensitive? The drive tagged as potentially faulty was failed/re-added without much ado. Today had the usual outage. Nothing out of the ordinary there. One funny thing - for an unspecified amount of time nobody on the Berkeley campus (outside of the space lab) was able to connect to our servers to receive/send SETI@home data. This was due to asymmetrical routing - a problem on our public facing servers that send data over our ISP (as opposed to via the campus LAN). Jeff found and fixed the problem and I updated the network scripts to make sure a reboot doesn't break it again. Jeff just spent an hour or so walking me through the current nitpicker (i.e. the candidate-finder) code. This really is one of those simple concepts that requires a complex solution. 
I find it frustrating to describe why, as the reasons are hardly obvious, and the problems are nested. We used to do this stuff with our own human brains which can find patterns and detect duplicates and RFI quickly as long as the data fits on a couple pages. This isn't so much the case anymore, and getting the computers to smartly (and efficiently) do the same grouping, comparing, and discarding is difficult. Think of it this way: you have a bunch of friends and you realize two of them are single and, based on many different variables, perhaps quite compatible - so you set them up on a date. Easy, no? Now try to run a completely automated dating service trying to accurately pair up every single person on the planet with the best possible mate. Not as easy. In any case, I might start throwing random output from it on the science status page which is of anecdotal interest. Like extra info about where we're currently pointing and what we've seen there before. Check for that in the next day or so. - Matt 16 Apr 2008 21:34:36 UTC So far so good with the new workunit server. We recovered from the recent spate of outages fairly quickly. The assimilator queue is starting to drain at a good clip, too. If anybody's looking at the traffic graphs and noticing a "bump" over the last hour or so - that's us sending our raw data to HPSS over the Hurricane pipe (in addition to sending it over the standard campus pipe). With the recently purchased (and employed) disk enclosure this extra bandwidth is now possible, and every little bit helps (pun intended). Mostly working on programming today. Wrapping up work on the precess recalculator - will probably deploy next week. Astropulse and the ntpckr are both just around the corner as well. I know we've been saying that for a while, but it's getting truer every day. Lots of big things coming down the pike. - Matt 15 Apr 2008 22:24:02 UTC As mentioned yesterday the kind folks at Adaptec/SnapAppliance replaced our server. 
The leading theory for its failure is still localized to the ribbon cable connecting the faceplate to the motherboard, but they swapped out the whole thing anyway just to be safe. The RAID devices had to be massaged a bit and then spent all night resyncing. That wrapped up around 4am, but one of the RAID1 pairs needed to be resynced again. Once that finished, I tackled the usual Tuesday database compression/backup. Since that began early this week (no reason not to since we were already off line) that completed around 12:30pm and I started the public/beta projects. We'll be catching up for a while, I imagine. The assimilator queue blossomed again, but this (I think) was mostly due to one of the four assimilators being stuck on one particular result where the uploaded file got garbled and therefore became un-parseable. I blew this result away and that one assimilator seems to have pushed through for now. Jeff is trying to debug a new problem with the splitters - despite additional smarts/logic some are failing mid-file, unable to find the radar blanking signal. But when we look at the file by hand, we see the signal (or at least where the signal should be). Insert sound of head scratching here. In any case, if there are fewer splitters running than normal, that's why. Happy Tax Day, my U.S. compatriots. - Matt 14 Apr 2008 19:03:42 UTC Continuing problems with the workunit storage server... There were more resets over the weekend, ultimately resulting in one that caused the server to think enough drives had failed to call the entire RAID dead. We are confident we can trick the server into thinking otherwise - we actually have some helpful techs logged in doing that as I type. We still want to replace the whole box, which we'll hopefully do today, and then the drives will have to resync again. Chances are we'll be down until tomorrow (Tuesday). So while we are down we'll try to catch up on several things. 
Moving servers around the closet, incorporating the new drive enclosure that arrived today, getting more stuff on the new KVM, etc. - Matt 10 Apr 2008 17:53:43 UTC We thought we had the hardware problem with the workunit download server diagnosed, but looks like we were wrong. False positive. The good news is that the kind folks who donated the thing have another ready to ship. But until we get it, that probably means potential random resets all weekend. Jeff just put an /etc/rc script in place so that upon reset/reboot there's a chance it'll be operational, meaning short glitches instead of multi-hour outages. That's the hope anyway. We might actually test that later today (if it doesn't reset itself on its own). There was discussion about how to implement a second workunit storage server so we don't have this single point of failure anymore. Not as easy as it sounds. - Matt 9 Apr 2008 21:24:22 UTC Continuing on from yesterday's tech news note, we had a "take two" outage today for database maintenance. We "repaired" several tables (the word repair is in quotes because, while MySQL locked the tables due to potential corruption, the repair query found zero errors). Then we dumped the master database and are recreating the replica from that dump. This is actually happening now, and will probably take all afternoon, but since the master is back in one piece we started up the projects and are catching up, draining backlogs, etc. We'll start the replica once it's ready and it should catch up as well. Outside of that, Jeff and I are tackling the current state of data flow to/from Arecibo. We have a lot of scripts in place to automate most things, but there are still some parts we do by hand based on the situation. Do we need to empty the drives as soon as possible and get them back to Arecibo to collect more data? What if there's no space available on the splitter system? Things like that. So I'll be coding up more robust scripts in the near term. 
- Matt 8 Apr 2008 23:43:16 UTC Had a relatively painless weekend, which is a good sign as that probably means we correctly determined the cause of our workunit download server woes (broken faceplate sending bogus resets to the system). Everything else was okay except the database statistics on the server status page flatlined. This was fallout from the mysql database server rebooting itself on Thursday and the replica server getting out of sync. Since this was a harmless, cosmetic problem we let this fire burn until we re-synced the two databases today during the (extra long) weekly outage. Why were we down today for so long? What happened?! Seems like last week's database crash caused some minor confusion in (at least) the "credited_job" table, which of course is the largest table in the database. So we had to run a long, expensive "repair table" query after a longer, more expensive "optimize table" query failed with error thus preventing us from even backing up the database. How annoying. Even more annoying: the /tmp partition filled up during the repair so mysql twiddled its thumbs for 20 minutes before we realized and cleared out more space. Then /tmp filled up again. Then we realized it was trying to write about 10GB of data to /tmp. This wasn't gonna happen. So we killed the "repair table" query and simply restarted the project so people could get back to work. However, without credited_job the validators can't work, so they're offline for the night. We'll discuss tomorrow what to do next. We still haven't backed up or re-synced our databases. There might be an extra outage tomorrow. We employed the new workunit-generating splitters with radar blanking yesterday, but then overnight ran out of work to send out. This was due to the way our data was collected and stored in the raw data files. 
Long story short, data buffers are collected and stored in pairs, one which contains the radar blanking signal (which lets us know exactly when the noisy radar is on), the other of which does not and therefore gets its blanking signal from its sibling. However, the orientation of these pairs in the data isn't fixed and may reverse "polarity" at any time. So there's a good chance the first buffer in a data file is missing its sibling and therefore can't find any blanking information. This is a critical error, so splitters were getting hung up on these files as the queue slowly drained. Not a big deal, and Jeff reworked the logic in the splitter so these errors are not critical (we'll just skip the first buffer). Anyway, this only affects a couple months' worth of files - we already fixed the logic on the data recorder down at Arecibo to reduce the chance of "half pairs" happening in a single file. - Matt 3 Apr 2008 21:31:19 UTC Minutes after I went to bed last night the BOINC mysql database server crashed. This has happened before - some kind of kernel panic. The upshot of it was that we were offline all night until Jeff (who wakes up far earlier than I) kicked the system early this morning. And then it took mysql about six hours to do all its checks and clean itself up. Once back up, we found the master and replica servers were ever so slightly out of sync, which was no surprise. We're continuing to run this way for now - but with all queries aimed at the master. This way the replica (if it continues to work beyond update conflicts) will still be an adequate-enough safety net until we re-copy its database from the master early next week. Meanwhile, spent the morning doing other stuff while the project was down. Like tightening up various aspects of our source code management. 
Or working on the data recorder to ensure raw data files have even numbers of blocks (blocks are written in groups of two, with the radar blanking signal for both in just one of them - so files with odd numbers of blocks may be missing blanking signals at the end, thus rendering that last block useless). And Eric had to give a tour of the lab to prospective Ph.D. students. It's things like these (which I usually fail to mention) which occupy most of our time - eating up a half hour here, a half hour there... Of course before we have visitors Jeff and I have to drop everything and actually clean up the lab - piles of KVM cables recently removed from the server closet, random DIMMs too small to use, on every possible flat surface O'Reilly manuals (or good ol' K&R) lying open to specific pages, empty soft drink containers... In any event, recovery (yet again) is happening now. Hopefully as the weekend approaches there will be a wee bit more stability in our server closet. Of course I just sent out about 25K of those "please come back" e-mails yesterday. It's all about timing. - Matt 2 Apr 2008 22:54:30 UTC So far so good, running with the faceplate off the workunit download server. If this remains the case we'll get a free replacement faceplate from Adaptec. This little exercise has proven that this server is a bad single point of failure - if we actually lost all the data, it isn't a scientific disaster, but a BOINC disaster - there would be hundreds of thousands of workunits "in the field" that no longer exist, and are no longer verifiable. We can regenerate the workunits, but it would be a big waste of CPU time not to mention a public relations disaster (not like we haven't weathered those before). Remember radar blanking? Here's a recap: unlike the classic data, the multibeam data is blitzed with radar sources, adding a lot of noise to a small subset of our workunits. 
The radar's time frequency is short but random, making it very hard to remove by simply randomizing data based on certain thresholds. This is more an annoyance than a threat to science. Arecibo implemented a "radar blanking signal" which we now get in our data, telling us exactly when the radar is on so we can "blank" the data exactly at that time. Among other things, we've been working to get this coded up and tested in the splitter for a while now. Jeff has been managing this recently and this morning had some final data and plots from workunits sent to our clients with the radar blanking and without. Looks like we solved the problem. Expect slightly fewer RFI workunits on average in the near future. With Arecibo slated to be decommissioned in the not-too-distant coming years (write your local congressperson!) this has been an unintentional temporary boon for us as the observatory is prioritizing sky surveys to appease its current/remaining projects. That means we're collecting a lot more data than we originally intended, which means we can't seem to get disk drives back and forth between Arecibo and Berkeley fast enough. The bottleneck is our limited bandwidth to copy fresh data that arrives here down to HPSS (offsite archival storage) before erasing drives and sending them back. We're going to purchase another cheap SATA drive enclosure and try to use some of our excess Hurricane Electric bandwidth to speed up the archiving process. Outside of that (and countless day-to-day chores) I got the basic plumbing of the "precess fix" program working. We unknowingly double-precessed all multibeam signal coordinates, so they aren't in J2000 as much as J1993 (the observatory's multibeam receiver code had coordinate precession built in, unlike classic receiver code). Not a major tragedy, and easy to revert - but this is one of those things where you want to make sure the math and logic are correct before updating billions of rows in a database. 
Edit: Oh yeah, and I also sent out about 10000 reminder e-mails today. See other threads about waning user interest for more info. I'll send more each day. - Matt 1 Apr 2008 22:15:39 UTC Last night the workunit storage server acted up again. I attempted to reconfigure it at midnight last night, but then it reset itself an hour later, and again every hour since. So whatever the problem is, it's gotten worse. Jeff and I did some diagnosing during the regular weekly database backup outage today. The reigning theory is still a faulty faceplate sending erroneous resets to the motherboard. So as it stands now the server is running without its faceplate (and therefore no control panel - which makes powering on quite difficult)! And so far no resets. If this stays stable for a week I think we'll have nailed the problem. Meanwhile the kind folks at Adaptec already have a complete replacement at the ready if we need it - we might just need to replace the faceplate. No other real big shakes about today's outage. I added more machines to the new kvm (which meant being able to pull more cables out of the closet) and we added a new field to the workunit table in the BOINC database - so far that hasn't broken anything as far as we can tell. The beta uploads are failing again, but hopefully that will clear up on its own like last time (I'd still like an explanation, however). Happy April Fools, by the way! - Matt 31 Mar 2008 21:46:51 UTC The last few days were a little bumpy, with our workunit storage server disappearing out from underneath us at random (see previous posts for more info). This is still not quite clearly understood. The reigning theory is there's some faulty connection somewhere between the front face of the system (where the reset button is located) and the internal circuitry. This isn't too hard to imagine as there are some servers sitting right on top of it, and pressing ever-so-slightly down on the server's faceplate. 
A month ago we added that new heavy router to the stack. Perhaps this is the problem, which leads us to the general (and incredibly annoying) rack standards issue: all server racks are by default non-standard size and shape, and therefore we aren't properly racking as much as stacking. One of the upshots of this was that beta uploads were failing all weekend in various ways, most likely due to partially broken mounts between the upload server and the storage server (which contains the beta uploads as well as workunits - SETI@home public uploads are kept right on the upload server itself). This was very difficult to understand, but even worse: it just suddenly started working again - and during a meeting no less (when nobody was actually sitting at a computer doing any tweaking). I'm leaving early today to have a meeting down on campus with the donation department. Exchanging general ideas for improvement. - Matt 29 Mar 2008 5:16:39 UTC I was joking in my last post about machines dying at midnight starting this three day weekend. At least they were nice enough to wait 18 hours into the weekend to start failing. In this case, our workunit download server which failed earlier in the week croaked again. I happened to notice during my usual random check in from home that we weren't sending out any bits, which immediately led me to the faulty machine. For a short time I was able to log into it via a serial connection but it was in some funny, unhelpful single-user mode with a broken network config. Unable to do much I tried quitting out of that and it then basically became unreachable. Since its network configuration has reset, and the serial connection now shows no pulse, there's no option except drive up to the lab and kick the thing in person. Except it's 10pm on a Friday night, and it's raining, and the known fix will take an hour or two to enact. No thanks. Even if I wanted to go up to the lab, there's no guarantee any fix would work. 
And even if I did get it running, given current history there's no guarantee it would stay running through the night or the weekend, so I'm staying home. Bottom line: no workunits until somebody is in physical contact with the server. This may happen sometime before Monday, but don't count on it. I sent warnings to the others but not sure any of them will be free to go up to the lab. I have a gig tomorrow so my next 36 hours are occupied. - Matt 27 Mar 2008 22:40:40 UTC There's not much news to report on the technical front - but that doesn't mean I haven't been busy. I've mostly been engrossed in tasks that have little effect on the public servers, so anything I've been working on is either (a) too complicated to describe to everybody's satisfaction (including my own), or (b) relatively uninteresting. I've been lax in sending out regular "reminder" e-mails to participants who lapsed (i.e. have stopped processing data for N days) or never succeeded in processing work. We wanted to start these up in the fall, but there were server woes - and it's not good form to send "please come back" messages to people only to frustrate them with connection failures. Then everybody went on vacation at different times. Then it was donation season, and we try not to send e-mails to people more than quarterly, so that postponed the reminders until a month ago, but at that point we were having the science database/router woes. Anyway.. now seems like a good time to try and start again. Perhaps starting early next week. Tomorrow is a University Holiday, thus making this a three day weekend. Perhaps start an office pool involving which server will croak at midnight tonight. - Matt 24 Mar 2008 22:28:55 UTC Things have been running rather well over the past couple of weeks. Having effectively unlimited bandwidth really helps. 
It's a little more hectic behind the scenes as new data keeps getting sent up from Arecibo - we are continually working to offload the data to our local servers (and remote mass storage) so we can send back the blank drives for more. Steps will be taken soon to improve this situation (namely: sending some data to our remote storage via our faster Hurricane connection). There was a bit of a panic this morning, however. Suddenly gowron, our workunit storage server, reset itself. Not only did it reboot, but it lost all host/IP information. For all we could tell at first it lost everything! We had to connect to it over serial (most difficult part: finding the right cables) but once we got in we found our 2 terabytes of workunits were still intact (whew). So it was mostly a matter of reconfiguring the basic things and we were back in business. Why did it reset itself? That remains a mystery. Another minor gripe: I spent a man-day last week working on testing mdadm's "spare group" feature. That is, if a drive fails on a RAID device without a spare, it can steal a spare from another RAID device in the same RAID group - mdadm's way of enabling a "hot spare pool." We never had a case where this would happen, nor did we ever test it. Now that thumper is down two spares (due to making a new small, separate RAID1 for database indexes) I wanted to test this. I made simple test cases and failed drives - but the available spares in the spare group weren't being utilized. Long story short - I actually recompiled my own mdadm with fprintf's all over the place and found mdadm behaving strangely. Thing is, this is mdadm version 2.6.2 we're talking about here, and mdadm is already up to version 2.6.4. So I downloaded that, and it worked, so apparently this bad behavior has been fixed. 
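For anyone wondering what a "hot spare pool" is supposed to do, here is a toy model in Python of the policy as described above. It is only a sketch of the intended behavior, not mdadm's actual code: the `Array` class, the `fail_drive` function, and the device names are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Array:
    """A simplified RAID device: active drives, local spares, and a spare-group label."""
    name: str
    spare_group: str
    active: list
    spares: list = field(default_factory=list)
    failed: list = field(default_factory=list)


def fail_drive(arrays, array_name, drive):
    """Mark a drive as failed and try to replace it.

    Prefers a local spare; otherwise borrows one from another array
    in the same spare group. Returns the spare used, or None if the
    array is left degraded.
    """
    target = next(a for a in arrays if a.name == array_name)
    target.active.remove(drive)
    target.failed.append(drive)

    if target.spares:
        donor = target
    else:
        # Look for any other array in the same group that still has a spare.
        donor = next(
            (a for a in arrays
             if a is not target
             and a.spare_group == target.spare_group
             and a.spares),
            None,
        )
    if donor is None:
        return None  # nothing available: the array stays degraded

    spare = donor.spares.pop(0)
    target.active.append(spare)
    return spare


# md0 has no spare of its own; md1 shares its group, md2 does not.
arrays = [
    Array("md0", "pool", ["sda", "sdb"]),
    Array("md1", "pool", ["sdc", "sdd"], spares=["sde"]),
    Array("md2", "other", ["sdf", "sdg"], spares=["sdh"]),
]
borrowed = fail_drive(arrays, "md0", "sda")
```

In this model md0 ends up borrowing md1's spare, while md2's spare is never touched because it sits in a different group. The misbehavior described above corresponds to the borrowing step not happening at all.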
But Fedora doesn't have the latest version available yet, at least via "yum update," so we're pretty much waiting on the new version to become available before implementing a less trusted version, even if it seems to work better. - Matt 18 Mar 2008 21:15:54 UTC Today during the outage I installed the new network kvm in the closet and hooked up one of the servers. We're waiting on green cables to arrive (so we can tell them apart from other cables in the closet) before hooking up the other servers. Putting this server in actually maxed out our 24 port DLink gigabit switch - so I chained in an old reliable Netgear 100 Mbit switch to occupy the stuff that doesn't talk gigabit anyway - UPS's, service processors, older servers... Bill, who donated our previous and current routers, came by to pick up the 2811 we're no longer using, now that the current one has proven itself to be able to handle what we give it. Apparently this 2811 is off to Beirut. What an adventurous life this router is leading. Otherwise, a lot of my time the past couple of days has been spent mostly on generic network/systems administration not worth mentioning here (i.e. mundane drudgery). - Matt 14 Mar 2008 17:52:11 UTC We turned off the resend of old WU on client reset because of a huge IO load on the MySQL db. It was slowing down result validation, the main function. We have done a number of things to improve the db performance, reducing IO rates and hope to turn on the resend feature in the near future for a test period. If the IO load is manageable the feature will remain enabled. 13 Mar 2008 21:25:40 UTC A few small items today. Still messing with the new science database indexes. Bob just started dropping/recreating these one at a time, which may slow down the assimilator inserts, but we'll see. Having the indexes on a different volume can only help. 
We just got a used Raritan 16-port network KVM donated to us - I believe the donor would like to remain anonymous (if you're reading this, thank you!). Eric got this hooked up to a test server pretty quickly - it's pretty sweet. We'll get this in the closet sometime next week, and then we'll have the ability to reboot systems from home, which should minimize down time over the long haul. With the regular BOINC database performing quite well these days, we may attempt turning on the "resend lost results" feature again early next week and see if we can handle it. I have a gig tonight where I have to sing, but with my lingering cold/congestion I currently sound kinda like Brad Garrett. Should be interesting. - Matt 12 Mar 2008 22:32:31 UTC As for science database improvements... While getting the new science database RAID1 volume set up we discovered that the lvm gui doesn't allow for resizing of logical volumes containing xfs filesystems. Huh. We were able to grow these on the command line (both the logical volume and then the filesystem itself), so we'll just have to use the command line in instances like these. At any rate, Bob is building new db spaces for the indexes on this new volume. We'll recreate indexes there after dropping them from the old spaces (which are in I/O contention with the actual data). This will happen gradually over the next few weeks. And yes, there were still lingering issues with the donation script. Actually I should point out that the problems were not in my parsing script, nor the whole system I set up to garner information from campus. The problem is that the format of the confirmations from campus changes every so often. And by "change format" I mean they suddenly contain random line feeds in unexpected locations for no explicable reason. So my parsing script needs to be "improved" every so often to pick up the exciting new places these line feeds might happen to turn up. 
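One general way to make a parser indifferent to where line feeds land is to collapse all whitespace before matching anything. Below is a minimal Python sketch of that idea. The field labels (`Name`, `Amount`, `Date`) and the sample confirmation are invented for illustration, since the real campus format and the actual parsing script are not shown here.

```python
import re

# Hypothetical field labels, made up for this example.
LABELS = ("Name", "Amount", "Date")


def normalize(text: str) -> str:
    """Collapse every run of whitespace, stray line feeds included, into one space."""
    return re.sub(r"\s+", " ", text).strip()


def parse_confirmation(text: str) -> dict:
    """Extract 'Label: value' pairs from a confirmation, ignoring line breaks."""
    flat = normalize(text)
    label_re = "|".join(re.escape(label) for label in LABELS)
    # A value runs lazily until the next known label (or the end of the text),
    # so one field can never swallow the field that follows it.
    pattern = re.compile(rf"({label_re}):\s*(.*?)(?=\s*(?:{label_re}):|$)")
    return {m.group(1): m.group(2) for m in pattern.finditer(flat)}


# Line feeds dropped into the middle of a value and right after a label.
sample = "Name: Jane\nDoe\nAmount:\n25.00\nDate: 2008-03-12\n"
parsed = parse_confirmation(sample)
```

This only handles breaks that fall between words or around labels. A line feed that splits a label or a number in two would turn into a space and still need special handling.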
Anyway, it's fixed, and a couple "clogged" donations pushed through just now. - Matt 11 Mar 2008 22:09:13 UTC Typical Tuesday. The weekly outage went along just fine. This is the first time in many weeks the result table has been "lean" - i.e. no large excess of result entries due to blocked queues, waiting for purging, etc. How nice. Despite the happy current performance of our servers, we're still keen on improving science database throughput. We met today to discuss a plan to shuffle disks/RAID/LVMs around to optimize performance on thumper. I'm building the first RAID1 pair - it's syncing up now - where we'll start recreating indexes as soon as tomorrow. - Matt 10 Mar 2008 18:58:22 UTC Hello, folks - just getting over a really really bad cold. I rarely ever get sick like this so it's a bummer when I do. Anyway, I'm back, though still only about 80-90%. In the meantime, nothing much happened except the happy mixture of (a) enough download bandwidth to ensure an even flow of work, (b) a consistently long average workunit turnaround time, and (c) no unexpected other stresses, allowed us to finally, albeit slowly, catch up on the assimilator queue over the past week. At first I thought our queues were benefiting from the new splitter which might have been generating less noisy workunits (and therefore less prone to quick overflow and return), but the opposite was true: the new splitter was generating annoying broken workunits that errored out immediately. Sorry about that. In any case we're still in dire need of database server improvements, mostly in the RAID re-configuration realm. We're also getting smartd errors more and more - these drives are approaching retirement already. Can you believe it? - Matt (sniff cough) 4 Mar 2008 23:27:02 UTC Some positive progress today: During the weekly database backup outage I removed old kosh/penguin from the server closet, and replaced them both with bruno (the upload server) and its disk array. 
So the only backend servers still outside the closet are sidious and vader. In order to accommodate the new server I also put in a second KVM and did some recabling to daisy chain it with our current one. The upshot is that thinman (the web server) which was up until today totally headless now has a spot on the KVM, which gives us some warm fuzzies. Even better: Thanks to the "help wanted" post, user Gerry Green found the bug causing those occasional broken queries tying up our database. It was a bad function call lost in the "ask a friend" web code. Thank you Gerry! However, the outage was slowed due to our database simply getting larger and larger, and then we tried to let the assimilator queue drain a little bit before starting up again. A new splitter is also being rolled out today - the only difference is correcting a minor precession bug (for better accuracy we still have to un-precess our coordinates in all the previous signals up to this point - which we plan to do sooner than later). I'm reverting to four assimilators. Doesn't seem like 12 helped, and it only caused memory problems on bruno. We're really going to have to do some major reconfiguration on thumper before we can catch up again. - Matt 3 Mar 2008 23:13:14 UTC So it was a rough weekend, mostly due to the excess assimilators being employed to knock down the ridiculously large backlog of results waiting to be entered into the science database. Long, long ago we had chronic problems with a memory leak in the assimilators, but that hasn't been a problem so much lately as we've moved things to a more powerful server and got BOINC going. Now they all get restarted every week due to the database backup outage. Anyway... having 12 running at once seemed to exercise the memory problem enough to cause the upload server to lock up a couple times. This created a general malaise on the backend, aggravated by a current period of fast workunits creating a heavy load on everything. 
This morning bruno was rebooted and log jams were cleared. Servers are trying to get on top of their queues. But in the positive progress department, check out the most recent traffic graph (green = outbound, blue = inbound). Can you guess when we switched over to the new router? ![]() Yay! We now increased our bandwidth capacity by about 50%. The roving bottlenecks are surfacing elsewhere, though until we get beyond the current period of catchup we don't have a good sense of what's normal or what to expect. We still have a ways to go to fully capitalize on the full gigabit of bandwidth Hurricane Electric is offering us, but this is still a vast improvement for now. In regards to one comment in the previous thread: despite our small staff and minuscule pay scale we're generally close to 24/7 system monitoring, what with all of us on different schedules checking in regularly at random. And nope - I still don't have a cell phone. Never had one and, if possible, never will. - Matt 28 Feb 2008 21:25:13 UTC Fully recovered from the long outages earlier this week. I also employed more assimilators (and even more just now) to try to capitalize on periods of low I/O to help catch up on the big assimilator queue backlog. Seems to be working, sort of. We also changed the mount flags on the database volume to include "noatime" - we'll see if this actually makes a difference in performance. Jeff and I are still getting beyond the router config. One of our roadblocks was using cables that were gigabit capable mixed with ones that were not (once again it's cheap parts causing the headache). We might actually be ready to go except we have to upgrade the super-long cable going from our closet to the main lab server closet, which is inaccessible to us. Waiting on the appropriate parties to handle that. Regarding hardware/software RAID: We tend to shy away from hardware RAID as we've had many nightmares in the past regarding configuration and implementation. 
Namely, it takes forever to figure it out, and then drives fail spuriously and/or silently. The software RAID hit isn't enough to make us consider going hardware on our current systems any time soon. - Matt 27 Feb 2008 22:15:24 UTC So as the hours wore on last night the work queue was low enough that I had to stop scheduling lest we run out of work. This morning Jeff and I determined the science database server was in a stable-enough state to start everything up again, so we did. That's basically where we are now with that. The OS upgrade was a double leap frog (i.e. up 3 revision levels) so we're getting a few errors that are noisy but most likely bogus, caused by out-of-spec config files left behind and whatnot. We'll have to do a clean OS install at some point to clean out the chaff. At any rate we removed the old-OS variable from the mix, and the database is still slow as molasses. We really need to update the filesystems (both RAID and fs type, perhaps) and reorganize which data go where. Plans are being spelled out for that. The assimilator queue is getting to be more of a crisis, though. We'll panic more once the outage recovery mellows out a bit. More on the proposed RAID changes as there seems to be some interest. The current database (data *and* indexes) are on a single software RAID5 device. When we were just adding signals to the database, there were 0 reads and nothing but sequential writes, so this worked well. Now with all the indexes built, and some scientific analysis taking place, the read/write mix is far more random. Plus the stripe size is way too big for the random I/O (we're reading in a 64K stripe to read a 2K page - or something like that). It's very hard to predict what we'll ultimately need RAID-wise for any given server (as they change roles quite often), so we've had to bite the bullet and change RAID levels mid-stream before. 
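To put rough numbers on that stripe-size mismatch: if each random page read really does pull in a whole stripe, the disks move far more data than the database asked for. Here is a back-of-the-envelope sketch in Python using the 64K and 2K figures quoted above, which are themselves approximate. The function name is made up, and the model assumes the worst case where nothing read alongside the page is ever reused from cache.

```python
def read_amplification(stripe_kb: float, page_kb: float) -> float:
    """Data read from disk divided by data requested, for one random page read.

    Assumes a whole stripe is fetched whenever the requested page is
    smaller than the stripe, and that the extra data is never reused.
    """
    if stripe_kb <= 0 or page_kb <= 0:
        raise ValueError("sizes must be positive")
    return max(stripe_kb, page_kb) / page_kb


amplification = read_amplification(stripe_kb=64, page_kb=2)
useful_fraction = 1 / amplification
print(f"{amplification:.0f}x read amplification, "
      f"{useful_fraction:.1%} of each read is useful")
```

Under those assumptions the ratio works out to 32x, meaning only about 3% of what comes off the disks on a random index lookup is data anyone wanted. Sequential writes never paid this penalty, which fits with the setup working well back when the database was only taking inserts.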
This time, the general idea is to create a new RAID10, and drop the random-access indexes off the RAID5 and rebuild them on the RAID10. We shall see. Jeff, with my help, got the new router configured today. There were some blips as we swapped wires around to test this and that, and we eventually reached that magic 95% point where everything looks like it should work but just doesn't for some small number of unidentifiable reasons. E-mails to experts have been sent, and we'll sleep on it. Minor news: web server thinman choked on a bunch of stale cron job processes (presumably stuck on lost mounts over the past week) so I had to reboot it - the web site disappeared for a few minutes there. Also those root drive errors on thumper turned out to be bogus (again!). I added the wrongly failed drive back as a spare. Weird. - Matt 27 Feb 2008 0:09:25 UTC Let's see.. it's been a bit since I last wrote. I've been mostly working on code to pull pulses out of the database, which uncovered a couple general minor bugs that had to be fixed. These were successfully dumped and handed off to Josh to find good candidates for initial Astropulse analysis. Not much going on over the weekend but the science database server (thumper) is not performing. Jeff and I scanned all kinds of data during different tests and we're convinced it's the RAID configuration more than anything else. We're going to have to reconfigure all the file systems on that at some point. Painful, but we may be able to do it piece by piece without too much disruption. Today we actually upgraded the way-out-of-date OS on thumper, which was also a bit painful, but ultimately successful. It should have been up and running by now, but thanks to an 8 Terabyte ext3 filesystem that hasn't been checked in over 180 days, a forced check is running and will probably be running all night. Not sure if we'll implement the secondary server (bambi) in the meantime - it may be too late in the day to attempt that. 
We'll let the project run as best it can until we run out of work (we'll probably keep a buffer of work just so the recovery later isn't as painful). Meanwhile, the assimilator queue is growing and growing until we either let it drain, or we reconfigure thumper. Oh yeah.. bane (one of the download servers) just went kaput. Spent 20 minutes trying to figure out what went wrong with its network. Oh - the cable came out of the switch. Click. Voila! In good news, Jeff has been hammering on the new router today, and we got over a major hurdle of getting IOS installed on it. Only thing left now is configuration. It might be ready tomorrow! Buckle your seatbelts. - Matt 21 Feb 2008 21:17:55 UTC Yesterday I didn't have much news about anything to report. I was mostly spending my day elbow deep in pointing code, so we could determine when/where we observed known pulsars, and see if we actually found them in our data. However, we've been since experiencing some general aches and pains. In order to get the aforementioned code working we needed to add an index to the science database, and while it's able to create an index "live" the splitters/assimilators have been getting blocked for hours at a time. This should wrap up sometime later today. The lab in general has also been having mail server problems, which isn't helpful. - Matt 20 Feb 2008 0:10:42 UTC Another long weekend, literally thanks to the President's Day holiday, figuratively thanks to the various network bottlenecks. For the most part there was nothing out of the current usual - we were sending out a lot of fast workunits which meant our backend servers were swamped dealing with the increased number of results coming in. What was unusual was ptolemy having some kind of inexplicable freeze for several hours. It was sending away every scheduler request with 503 errors. 
Jeff examined everything but found nothing unusual going on to cause this - and service restarts and even a whole system reboot didn't fix the problem. Then all of a sudden it all just started working again. So we're calling this a fluke and perhaps something fishy further up the pike for now. One of the download servers was having fits all weekend, losing mounts, etc. but that didn't seem to cause any additional headaches from the perspective of the public. Jeff and Eric were on top of all this, which was good as I was spending most of the weekend out of town - it was a battle to get wireless to work at my in-laws' house. Had the usual Tuesday outage today. No news there except recovery was slowed by a broken query which erroneously tries to slurp up the entire user table into memory. This happened before, but we couldn't find the culprit. Can you? I posted a thread about this in our help wanted forum. I also just uploaded a new set of photos and descriptions for your viewing pleasure. - Matt 14 Feb 2008 22:11:21 UTC Right after writing yesterday's tech news I spotted the validators hadn't been running since the morning. Oops! Turns out I discovered something that's been a problem for many, many months but only got triggered now: when starting validators from the command line (which is how we do it 99% of the time) everything is fine. But when started via cronjob (which is what happened this time) they couldn't find the right libraries and immediately quit. Trivial environment/path issue - just funny we haven't seen it before. I started them up, the queues cleared out, and the assimilator queue returned to slowly draining itself. Things got a little weird over night. Our single download server seemed to be unable to get work out fast enough. First thing we did this morning was hook up vader again to be a redundant download server, so already my configuration explanation from yesterday is out of date. That's how it is around here. Anyway.. 
this download redundancy, however nice to have, didn't help very much nor did we expect it to, because we already guessed the router was the choke point. But why? The outgoing data was far less than normal. So what's the deal? I noticed the incoming data rate was strangely high, so I checked the router graphs not by bytes but by packets, and we were pegged packet-wise. I repeat: but why? Turns out it was a DNS loop brought on by our recent separation of the scheduler and uploader. Clients were coming into the "wrong" server and being redirected to the other (via apache). But due to incredibly short TTLs there were still a few DNS servers or caches out there saying the "other" was still "both" (standard round robin DNS). This bogus information only affected about 3% of incoming requests, but half those requests were being redirected right back to the same machine. Not very noticeable at first, but over time more computers with outdated DNS maps would connect and get stuck in a loop, and eventually we were distributed-DOS'ing ourselves. We broke those apache redirects and immediately everybody was happy, and just now reinstated the redirects using hard IP addresses to avoid further DNS mistakes. I brought the digital camera today and took pictures of the closet in its current state. I'll put them on line over the weekend or early next week. - Matt 13 Feb 2008 23:54:49 UTC I'm realizing the server status page is giving a slightly bogus picture of our current server setup, and it's actually too much work right now to fix the status script, so I'll just tell you now what the current situation is: our public web server is thinman, our scheduling server is ptolemy, our upload server is bruno, and our download server is bane. None of these currently has a redundant twin or a "hot" backup (but we have vader and maul all set up to be a replacement for any of the above if need be). 
More on that below. Our primary/secondary BOINC (mysql) database servers are jocelyn/sidious, and our primary/secondary SETI science (informix) database servers are thumper/bambi. Specs for all these are correctly noted on the status page. We have other systems employed for less interesting but important things, but that's basically the meat of it. If we could double the CPU/memory/disk space on everything we have we'd be set (for the time being). Anyway.. things are looking better. Weekly outage recovery is still a little weird - I don't think our single download server (bane) can handle such crunch periods alone so we'll probably bring vader back into the fold for that. The other servers are super happy given the recent changes to reduce NFS traffic. I enacted some more such changes this morning. This tweaking, coupled with server ewen (where Eric does his Hydrogen work) crashing and hanging the network a bit, made for a slightly bumpy ride this morning. However, smoother seas, and perhaps running "update stats" on a couple signal tables, made the assimilators much faster. We'll finally catch up on that queue in a couple hours I think. Due to the reduced dropped connections on the scheduling/upload servers it seems that the router got more cycles to spend on downloads, and we reached almost 70Mbps last night. Still need to get that new router going... Other than that - more mail drudgery. As much as I like computers, I hate when perfectly good but nevertheless wonky solutions to small problems become the foundations for advanced development, thus amplifying the original wonky-ness. Oh yeah - Eric sent some graphs around. Looks like the radar blanking code is working. Neat. Jeff's working that code into the splitter now so we can retest that small data file and compare results. - Matt 13 Feb 2008 0:34:39 UTC E-mail administration is utter torture. Time was every project in the lab had their own separate mail servers. 
Over the years people wisely moved towards a more unified lab-wide e-mail system. Of course, SETI was the last project to convert, pretty much due to not having the man-week to spare fixing something that ain't broke. Well, it suddenly broke last night enough that I had to pretty much drop everything today and make everyone bite the bullet to start switching over - something that should have happened years ago but nobody has had the time to deal with it. Not like I have the time to deal with it now. Ugh. At least it'll all be out of my hands in the coming weeks. Until then, I'll be up to my eyeballs in sendmail drudgery. Meanwhile, we had our usual outage today, during which we replaced the seemingly bad drive on thumper - the master science database. That was easy, but upon restart another of its 48 drives started complaining. So far the complaints can be seen as spurious enough to ignore. We'll do more robust RAID checking soon. Bob also moved some log files around to hopefully reduce random access disk I/O, and is running some "update stats" on the tables to see if that improves performance. In better news, I did some DNS twiddling to split the upload and scheduling services to two separate machines (as opposed to running both services on both machines). This vastly improved performance, as splitting the functionality reduced the NFS traffic between the two to zero. We had it set up the previous way for historic reasons which were no longer apt. This is all very good but as it stands we have single points of failure for all our public facing servers. We have some systems in line to fix that but they are in use for Astropulse testing. And we still need to work that router into the fold. Note regarding the previous thread: I should take updated photos of the server closet - not that much different but a lot neater. - Matt 11 Feb 2008 22:48:02 UTC Came into the lab this morning and it was well over 70 degrees. 
This may seem nice on a winter day, but (a) we have fairly warm winters here in the Bay Area, and (b) the usual temperature in the lab is closer to 60 degrees - even in the summer. This isn't great from a human perspective - we wear jackets while sitting at our computers all year round. From a hardware perspective, the extra cold lab air assists in keeping our systems nice and cool. This is why I was immediately concerned about the suddenly warmer air. Turns out a fuse blew over the weekend, and it was already repaired before anything came close to melting. Still.. a little bit of panic this morning. Despite the load on our backend servers being on the low side (averaged over the past 5 days or so) the assimilator queue was barely able to shrink. In fact, it's growing again due to the Monday bump. My guess (and others') which I already mentioned is that the new science database indexes, which add more random reads/writes during inserts, are to blame. We're doing more aggressive analysis and will try some "low hanging fruit" type solutions before too long. Not a major tragedy just yet, especially as workunits may be generally less noisy in the near future. The scheduling/upload servers are also on the brink of disaster - they have short but nevertheless frequent periods of dropping connections. They too would benefit from less noisy workunits. Or more/better hardware. On that note, if you check out the slightly updated hardware donation page you'll see I added an item for a KVM-over-IP which would help us upgrade our server closet faster. We're maxed out in the console department. In fact, our one public web server has no keyboard/mouse/monitor attached to it. If it freaks out, we hope we can log in remotely and fix it. Any incredibly generous takers? Anybody have strong opinions about which make/model to obtain? - Matt 7 Feb 2008 22:58:44 UTC We're having little luck getting science database thumper to perform up to expectations. 
We determined the fact it is both a database and raw data storage server isn't really the problem - the database alone is somehow constrained. Is it all the additional indexes we added recently? Extra load due to having to make logical logs for the replica? Something else entirely? Of course, while testing/tweaking, the OS root mirror drive on thumper failed. We got the notice from smartd but mdadm didn't notice, which was scary. We manually failed the mirror and brought in the hot spare which is sync'ing up now. Anyway.. the assimilator queue is growing and there doesn't seem to be much we can do about it now, at least anything drastic given it's the end of the week. We are sending out a lot of short work - maybe this will change soon and give us some relief. Other small news: recent splitter updates include (a) more realistic deadlines, i.e. they have been reduced 25%, and (b) radar blanking code - we're testing that now. There also has been a little bit of scheduler/upload server choking due to the aforementioned headaches - including one of the schedulers running out of work (as it runs faster than the other and therefore its queue depletes faster). Once again, we have little choice but to wait out the storm. - Matt 6 Feb 2008 23:04:24 UTC Recovery from yesterday's outage wasn't so bad after all, but we're hitting another wall. Well, not a wall as much as a mound. That mound is our science database server, thumper. Those watching the status page may have been noticing it's having a harder and harder time to keep up with making work (ready-to-send queue is hardly ever full) and keeping up with assimilation (ready-to-assimilate queue is hardly ever empty - in fact, it's been growing slowly over the past 24 hours). 
Of course, it's not the database load - thumper has almost 50 Terabytes of storage on it, so it also serves as our raw data buffer (where we keep all the data images for the splitters to chew on) as well as database backup storage (where we write/archive a 500GB data file every week). In short, we're hitting disk I/O limits on thumper. I fear making the "vertical" splitter (which acts on many raw data files simultaneously to reduce impact of hitting too much noise on a single file) has reduced any benefit of disk caching to zero. Since we're basically keeping up now, I whittled our number of splitters from 10 to 6 - hopefully this will help. I don't want to revert to non-vertical splitting just yet - we'll have greater problems if we do. Bob may also employ some different informix checkpointing parameters to reduce the impact of long checkpoints blocking science database traffic about 25% of the time. We're pretty much in wait-and-see mode on that. Jeff and I are more or less done hammering out the current set of kinks in our data pipeline from Arecibo to your computer. This will all be automated shortly. We also just threw a very short chunk of data into the splitter queue from last week (28ja08aa). It's already being split, actually. This contains radar blanking data. We're going to process it once without the blanker logic, and again with. It's a data-beta-test. We want to really make sure it works before processing dozens of whole files. I'll try to remember to throw up some before/after plots comparing the two runs once they are complete. - Matt 5 Feb 2008 23:55:44 UTC The regular weekly outage to hose down the database got started a little late today since Bob was out and I was busy voting (election day here in California - they hold elections in the U.S. in the middle of the work week and nobody gets the day off). 
Otherwise it was fine, though it took a little longer to compact the tables as it was a generally busy week, meaning a lot more database inserts/deletes and therefore a lot more fragmentation. Spent a large chunk of the day helping Dave install a new fastcgi-enabled scheduler on the alpha project which meant figuring out the differences between fcgid and mod_fastcgi behavior and determining which apache directives work, etc. Pretty annoying, but finally got it all squared away - the upshot of this is we're now getting real scheduler logs for the first time in years, as opposed to scheduler messages cluttering up apache error logs. Cool. Of course, I was distracted enough to not notice bane (the workunit download server) spiraled out of control trying to recover from the outage. I just rebooted it and started apache with a lower ceiling to hopefully prevent this from happening again. So I'm still operating on bane. Expect slightly slower, more painful recoveries from outages for the next while. Despite the red bar on the science status page saying ALFA is not running, we are indeed collecting data on and off. This is a false negative due to a change in reporting from the Arecibo feed which tells us telescope position/status/etc. Jeff's fixing this now. - Matt 4 Feb 2008 22:53:30 UTC Once again a normal weekend without anything bad to report. Though we are starting to "normally" push our current router to its limit - our normal Monday morning "bump" brought us just under 60 Mbits/sec. We really should be moving to the new router sooner than later - still waiting on OS upgrade support from others. Meanwhile, our web server situation is now completely down to the one new server "thinman." I turned aging server "kosh" off today. Just like "penguin" it served us well over its many years. Sun servers tend to last forever if you let them. 
Here's a reminder that our Classic data recorder was a Sun IPX, which was already about 5 or 6 years old when we put it into service as a 24/7 collector of raw data at Arecibo, and it lasted the 5 or 6 more years beyond that with nary a single problem. Jeff and I are mostly working on the data pipeline, which got "rusty" during the extended downtime at Arecibo. It should be running fully automatically any day now, with drives full of hot, fresh data arriving regularly. We're collecting data now, but having to kick the system along from time to time. - Matt 31 Jan 2008 22:54:06 UTC No big shakes today. Here's the lowdown: The RAID recovered just fine last night. Continuing install of OS'es on new desktop computers. Court (former SETI@home systems administrator extraordinaire) came by for a short visit which was nice. Fighting with gnuplot to get it to do what I want. Took some active measures (using creative load balancing) to rectify long-standing feeder mod polarity problems - in other words we have too many even-numbered results-ready-to-send in the database, so I'm currently giving preference to the even-numbered scheduler so the odd results could catch up. Should be completely transparent to our users. As a follow up to the television crews yesterday: I have no idea where/when the thing will be on air. I'm always pleased with increased media exposure, but personally I'm kind of cavalier about the whole television thing. Anyway I think Dan ended up being the only person on screen. I have been in many clips before. In fact, months before SETI@home launched a news crew showed up. I didn't know they were coming and arrived to work on little sleep, unshowered, unshaven and wearing a rocker t-shirt. I also had freshly dyed pink hair. I ignored the cameras best I could as I was actually quite busy. I also figured this footage would only be used for the local news, if at all. That night my sister who lives on the other side of the country called. 
She asked, "when did you dye your hair?" - Matt 31 Jan 2008 0:45:41 UTC Everything was kind of okay for most of the day. A couple new shuttle PCs came in - new desktops for Bob and Dan. I was setting those up, working on some database programming, etc. when the television crew for "Good Morning America" arrived. They were nice but they needed me to set up a shot with a computer running SETI@home. Oddly enough we don't have any systems readily available with a good display so I had to do some minor server reconfiguration to free up a fast enough computer that could show the screensaver in action. Then the NAS holding our web site, home accounts, etc. suddenly died and was in a vicious reboot cycle. WTH? I had to power cycle the whole thing to get it to boot for real, and only then it was clear that a drive failed and it was rebuilding the respective RAID volume. Ultimately no big deal, but it is quite disconcerting it didn't recover so easily from a simple drive failure and had to be dealt with manually. The projects were offline there for a bit as the dust settled. The RAID is still rebuilding now. Let's hope another drive doesn't go in the meantime. - Matt 30 Jan 2008 0:06:05 UTC Normal outage day for mysql database backup and compression. We took the opportunity to take care of two other things. First, we added a uniqueness constraint on a field in the analysis_config table in the science database. Interesting, no? Well, no, but long story short this constraint should have been there already, now it really is. Second, we upgraded the secondary science database server to the latest Fedora rev and it seems to have accepted its new OS kindly. So far so good with that. The recovery from the outage was slowed by a couple things. Bob also stopped/restarted mysql to incorporate/test some recently tweaked config parameters. 
This has the unfortunate side effect of flushing the 20+ GB of memory, which means that all has to be read in again before the project comes fully back up to speed. Meanwhile I thought I'd continue tweaking the apache config on bane as it was seemingly unhappy and I ended up just making it temporarily worse. Oh well. Hang in there. Workunits will come. Old web server penguin has been powered down and all its cables removed from the spaghetti in the closet. It has served us quite well. - Matt 28 Jan 2008 21:28:05 UTC Things are running more or less smoothly. The workunit/result traffic was fairly high over the weekend, but consistent and below our current cap, so no major faults there. Our active user count is still slowly climbing but the acceleration of growth is negative (at least until we have another press release or "reminder" e-mails are sent out). Since various index builds (and removals of seemingly unused indexes) the MySQL database is masterfully handling everything we give it. The router upgrade is still in limbo. One odd thing was our "feeder" polarity problem reared its ugly head again. Reminder: we have two scheduling/upload servers (bruno and ptolemy) each given a separate queue of work to send to our participants. If all is well, they should send out work at the same rate. However, in the past this wasn't always the case. DNS favoritism was causing one queue to run out faster than the other, causing errant "no work from project" messages given to half the clients. This was fixed with software load balancing on top of DNS. However, this time around it seems the increased traffic tickled an actual, particular disparity between the two. That is, bruno writes uploaded result files to directly attached RAID storage, while ptolemy writes to bruno's storage over NFS. We seemed to hit a "too many files open" limit on bruno, and therefore bumped up the maximum on that. We'll see if that helps. 
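To illustrate the "feeder" polarity problem described above, here is a minimal sketch in Python. It is a toy model, not the BOINC feeder: the function name, the tick-based refill, and all the numbers are assumptions made up for illustration. Two schedulers each own a separate queue that is refilled at the same rate, and clients are handed "no work" whenever the scheduler they land on has an empty queue.

```python
def simulate(ticks, requests_a, requests_b, refill, cap):
    """Toy model of two schedulers with separate work queues.

    Each tick, both queues are topped up by `refill` results (never
    beyond `cap`), then scheduler "a" receives `requests_a` requests
    and scheduler "b" receives `requests_b`. Returns the number of
    requests per scheduler that found the queue empty.
    All parameters are hypothetical, chosen only for illustration.
    """
    demand = {"a": requests_a, "b": requests_b}
    queue = {"a": cap, "b": cap}      # both queues start full
    no_work = {"a": 0, "b": 0}
    for _ in range(ticks):
        for name in queue:
            queue[name] = min(cap, queue[name] + refill)
            served = min(queue[name], demand[name])
            queue[name] -= served
            no_work[name] += demand[name] - served
    return no_work


if __name__ == "__main__":
    # Even split: neither queue ever runs dry.
    print(simulate(ticks=100, requests_a=5, requests_b=5, refill=5, cap=20))
    # Skewed split (e.g. DNS favoring one host): "a" drains and starts
    # refusing clients while "b" sits on a full queue.
    print(simulate(ticks=100, requests_a=7, requests_b=3, refill=5, cap=20))
```

The total demand is identical in both runs; only the split changes. That is why balancing the incoming requests (or temporarily favoring the fuller queue) helps without creating any more work.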
In case you haven't noticed, I un-DNS-aliased one of the three setiathome.berkeley.edu webservers last week, and another this morning. All public web traffic is theoretically aimed solely at our new 1U dual opteron system, and it's doing great. However, DNS rollout takes forever (even with time-to-live set for 5 minutes) - it will take a week or so for those old aliases to disappear. The old web servers (kosh and penguin) were wonderful sparc/solaris systems but are approaching 8 years old and therefore are relatively physically big and slow. We'll pull them out of the closet to make way for more modern systems - like bruno. Yeah, bruno is still sitting in our secondary lab, connected to the systems in our closet via some funky switching around the building. It will be great to have it on the same single switch as everything else. Other plans for the week: We're upgrading the fedora core levels on several systems, including our science database systems. We have already tested similar upgrades on our more-expendable desktops with little trouble. However, we will proceed with great caution given many terabytes of data are involved on the database servers - full recovery would be painful, to put it mildly. - Matt 24 Jan 2008 21:03:59 UTC I think I have the apache/tcp config in some kind of working order so that we won't suffer such wild dips like we had over the past couple of days. These pains were brought on by a confluence of three minor events: running out of work to send, waiting an extra precious day before enacting the database compression/backup, and reducing our backend to just one download server. You'd think the last item was the main culprit as we seemingly slashed our server capacity by 50%, but the real bottleneck is still the router (the new one still not config'ed yet - waiting on a new IOS image). The single download server (bane) can handle the traffic, but the apache config was such that when all the downloads started the cpu load went up to 400. 
Basically, MaxClients was set way too high but this went unnoticed when only half the load was on vader and half on bane. Then I set MaxClients too low - we were dropping connections long before hitting other theoretical limits. Now MaxClients is set just right. Or right enough for now. We're still experiencing catch up "malaise" but it's a much smoother ride in general than yesterday. I've actually been working on some scientific programming. With the new science indexes being built we're able to analyze some data to get an idea of the current RFI structure. Basically we're seeing the radar noise in the final data - the radar blanking signals are still being implemented so new data (once it finally starts coming in) should be far less noisy. I'm hoping this kind of work will inspire more scientific updates from the others (remember: I'm a math/computer geek, not an astronomer - everything I know about SETI/astronomy is from 10+ years of osmosis working here at the lab). - Matt 23 Jan 2008 23:27:33 UTC No news on the recently donated router (see yesterday's post). Basically we're in a holding pattern waiting to get the OS updated on the thing (currently running CatOS - needs to run IOS) and then configuration should be straightforward. There are some growing pains on having server bane be the single point of workunit download. I just tweaked the apache config to lessen the load. It's funny how seemingly unimportant differences in CPU/memory type/amount/speed from one server to the next require radically different settings in httpd.conf or else the whole thing grinds to a halt. Anyway, expect some download pains as knobs get turned and we slowly recover from running low on ready-to-send work. Due to the recent long weekend we had the weekly outage today instead of yesterday. All went well with that, and my recently mentioned fixes to speed things up worked well. 
During all that I finally finished the last parts of the disk usage shell game so our workunit storage (on the Snap Appliance) is up to its maximum size of 2.5TB, of which we're currently occupying 50% - that will last us a while. As well, we are pretty much ready to start OS upgrades on the science database servers next week. - Matt 23 Jan 2008 1:16:26 UTC To my fellow US citizens (and others as well), hope you had a happy MLK day (or whatever your state officially calls it). Those wondering why no tech news item yesterday, that's why. I'll start with the negative. Lots of the usual annoying little hiccups over the weekend. Here's a non-chronological digest: One of the servers (bruno) lost its automount again (hasn't happened in a while), having the effect of inflating the validator queue before I noticed and unclogged the pipes. We went through the raw data files on disk faster than expected over the long weekend, so the results-to-send queue dropped down and we're going to be recovering from that for a bit. The web sites were increasingly dragged down by obnoxious activity over the weekend but that finally disappeared after I blocked the offending IP addresses. Now the positive. Our new 1U dual opteron server "thinman" is now up and running as a public web server. We were going to use new server maul, but thinman is, well, thinner, and it's already in the closet. So that saves us one immediate closet upgrade. As well, we have been redundantly sending out workunits via both vader and bane. This is way overkill and a vestige of a time before we realized our problems were router-related. Since bane is also just 1U and already in the closet, I decommissioned vader as a download server. The bottom line is we only have two machines to get into the closet now (as opposed to 4): bruno and sidious. And we have a single web server which is much smaller and faster than the old servers (kosh and penguin) combined. They will be shut down sooner or later. 
In better news, Bill Woodcock (a key player in getting us set up with Hurricane Electric, i.e. our current ISP and donator of our two current HE routers) has donated another cisco router to us to replace the weaker 2811. It's a 7600 series, a bit overkill, but will give us tons of headroom to spare. We'll no longer be constrained by the 60Mb/sec cap! I guess we'll find the next set of bottlenecks quickly, including the 100Mb cap (due to our current lab wiring to campus). Of course, we have a lot of configuring to do before this thing is up and running, but at least it's in the rack! By the way, if you haven't heard of email bankruptcy, please read this article. I'm declaring "thread" bankruptcy, i.e. I am letting go all current questions, open-ended threads, unfinished story lines, etc. If anything is really important it will come up again. - Matt 17 Jan 2008 22:23:19 UTC No disasters or major revelations to report today. Interesting news from yesterday: Sun bought MySQL. Not sure how this will affect us, but it reminds me that I should mention that I am generally pleased with MySQL. There was that one comment about the professor who thought industrial grade software is the only way to go, and that MySQL is for mom-and-pop ventures. Let me address: Claiming the winners in the game of capitalism hold the best solutions to whatever problem is at best an arrogant assumption with obvious overtones of classism (both intellectual and economic), especially given that "mom-and-pop" crack. Other than that.. mostly spent the day cleaning up spills in various aisles. I also yum'ed up my desktop to Fedora Core 8 as an exercise to do so on heftier servers in the coming weeks. - Matt 16 Jan 2008 23:25:12 UTC The recovery went rather well yesterday, considering its extended length. Bob made some mysql tweaks to perhaps better use the memory on jocelyn (allow more protected space for query sorting, for example). 
Vexing time-sinks: I spent 45 minutes this morning trying to figure out why one of the download servers (bane) was having autofs problems. Long story short: the route map was ever-so-slightly messed up so that it couldn't mount a single particular machine on a different subnet in our lab (why it needed to mount this machine was due to an "ls" command in a script - which by default displays color, so ls will traverse sym links to see if they are broken or not in order to select the proper color scheme, and in this case one sym link was on this remote machine). Also: the new donated server came with rails! As some of you know we have hilariously bad luck with rack rails of infinitely different (and useless) non standard sizes, and this time is no different. We needed to shrink the rail depth which should be easy. I did this to one and it fit! I did this to the other and, due to different screw hole location, it remains 1 cm too deep and unable to get any smaller. Ha ha ha (sob). Bottom line: useless rails, yet AGAIN. But that's just a minor detail really - no need to rant and I don't want to seem ungrateful to our generous donor! We ended up putting the thing in the closet flat on top of the whole rack chassis. Works for me. We now have a new server called "thinman" (dual opteron, 16GB RAM) to help bolster the BOINC back-end! Woo-hoo! We'll update the server-wish-list with routers, servers, kvms, etc. soon. Other vexing time-sink: Bogus news reports that we found a "mystery" signal should be summarily ignored. This was a gross misinterpretation by a reporter of a quick comment Dan made off the record about AstroPulse progress and recently published millisecond pulsar findings by another group. These are new stellar phenomena which are astronomically interesting (and AstroPulse hopes to find many of) but not ET. Sigh. - Matt 16 Jan 2008 0:37:05 UTC Yeah... we're really pushing the boundaries of our mysql database these days. 
I'm finally catching up on several years' of backlogged archives and inserting zillions of rows to credited_job and this, on top of general increased usage, is gumming up the works. In fact, optimizing this table alone during today's outage took three hours (normally only a few minutes) - which explains the extreme length of today's downtime. I guess we'll have to turn off credited_job optimization until we actually use the table. This brings up several questions, the first of which was asked in a previous thread: Why are you guys using mysql instead of a more robust commercial product? Two main reasons: BOINC projects generally are small academic ventures with limited funds, and BOINC is an open-source project itself utilizing other open-source pieces of software. So all you need is a relatively cheap linux box which comes with php, apache, mysql, etc. and it's pretty much plug and play. Remember the project specific data, i.e. the science database, can be whatever you want. In our case, it's Informix. Why Informix? We got it for free 10 years ago - we now have 10 years of experience using it as a group and it is still free to us. Would we consider changing to Oracle/SQL server/etc.? If somebody wants to buy such a license and donate a man/year to change all our back end software to do so, then we would perhaps entertain the thought, but we have higher priorities, especially as Informix works perfectly well at this point. It's the BOINC/mysql part that needs help, and we're sticking with it for reasons stated above, and with SETI@home being the flagship project of BOINC we don't want to diverge from the standard. In other news, it seems that every day there's a different reason our web sites are so darn slow. Yesterday afternoon we were getting hit by some seemingly nefarious activity which I was able to block quite easily once I discovered it. But we were also getting hit by some scraping of stats pages via a robot (called BoincBot) that was not obeying robots.txt. 
I blocked these hits as well. We don't allow such activity on our web sites. If you want BOINC stats you can download the daily xml dumps just like everybody else. On the bright side, we obtained another server donation yesterday from a private party: a 1U dual-opteron (2.4GHz) server with 16GB memory. I installed FC8 on it just now, though there was a little bit of tweaking to get that to go. There's no DVD drive in the thing (only a CD drive) and for some reason there was some disconnect with the 3ware disk controller such that the linux installer couldn't see the two root drives. I ultimately took that out of the equation and plugged the drives straight into the SATA ports on the motherboard. All's well and it's getting all yummed up now. So we're looking for a KVM-over-IP, at least 16 ports (24 preferable), easy-to-use but secure connections via a web browser, etc. Any thoughts? The Belkin Omniview seems the cheapest/easiest, but only allows one person to connect to the whole unit at a time - not a showstopper. Any suggestions, experience with such devices, etc. out there? - Matt 14 Jan 2008 22:23:56 UTC Things ran quite well over the weekend. Looks like we added the right index to the mysql database to reduce the slow "validator fix" queries. A note about general BOINC/mysql implementation/design: there are a lot of features in BOINC that are seemingly excessive from a single-project perspective, but are there as every project has different needs. Project-specific factors (server power, workunit processing times, number of active users, min quorum, etc.) make some features less helpful. In the case of "resend lost workunits" (see last thread) this feature, implemented mostly for the benefit of Einstein@home, was most definitely weighing down our database server. We turned this off and have been running smoothly since. 
There were assumptions this would lead to greater problems down the line (fearing many results will be sitting on disk longer waiting for their redundant pairing to return) but in fact our "results returned and waiting for validation" number has been stable (if not slowly decreasing) since I made the change. Nevertheless, at some point soon we will see if we could optimize/reimplement this code, and Eric is actually making adjustments to the splitter which will perhaps create fewer "fast runners." Our new-hardware-to-obtain priorities are shifting. Namely, we need a router (we're not ignoring discussion about this on other threads but we are limited to what we can use for various configuration/policy reasons). We also need a new KVM - our current one in the closet is maxed out and we'd like to get more stuff in there ASAP. We also need three new desktop systems. Dan's using an old, sloooow solaris system which is out of support. Bob is on a slightly faster solaris system, but needs a safe mysql test sandbox. Josh's old super-cheap windows/intel box is basically a glorified console server. Had some minor issues due to the root drive on bruno filling up on Sunday. I scanned the drive and found only 4GB of stuff, while "df" was showing 40GB. Eric eventually found a deleted-yet-open file - an infinitely growing httpd log. Apparently httpd log rotation broke at some point, but we cleaned this up. Annoying, but harmless. Due to increased load in general, I changed the server db stats to update every hour (instead of half hour). Actually it's becoming clearer as we increase active user load and I'm populating credited_job, etc. that the mysql database might be our bottleneck du jour any jour now. There were also some issues with the user-of-the-day selection process which I tracked down and fixed this morning. - Matt 10 Jan 2008 22:47:31 UTC The public web site servers slowed to a crawl again this morning thanks to several robots/spiders scanning us at once. 
So I took another gander at my robots.txt file and used Google's webmaster tools to check how well this was being parsed. This uncovered a typo (a missing "s") and while I was at it I added some new rules to robots.txt. We'll see how this all fares. Bob and I brought the BOINC/science database servers down briefly this morning to tweak some parameters and clean out logs - some of you may have noticed a brief data server/web site outage in the process. The only tweak of note was on the science database: we reduced the checkpoint intervals and increased the between-database-ping timeouts. Why? We've been seeing the secondary spuriously enter recovery mode due to being unable to reach the primary, when really the primary was simply busy doing checkpoints at the time. Anyway, outage recovery was slowed by a confluence of various stats/update scripts starting up while the database was busy flooding its memory buffers. We really need to optimize those stats queries someday. As well, a relatively new BOINC feature ("resend lost workunits") was eating up a lot of database too, so we turned that off for now. Actually that last thing helped immensely. In the process of general disk cleanup, etc. I'm now forced to finally populate the credited_job table with three years' worth of purge archives. These archives are taking up 200GB on a 1TB filesystem which we really need to convert into workunit storage sooner than later, hence the push. Reminder: this is the table that contains the history of which users processed which workunits. Just between you and me... In addition to the outbound traffic squeezing through our maxed-out router, I am now sneaking out an additional 5-10% over the campus net. This is thanks to the simple/useful "pound" load balancing utility. The campus net can definitely handle this tiny increase. In fact I might bump up the percentage. But don't tell anybody. Mwha ha ha. 
[edit: I brought that percentage back down to 0% an hour later - we'll keep this extra power in our back pocket for now.] By the way, the optimized client discussion has been taken offline and is progressing. Turns out this may actually be a single bad host more than a bad client. - Matt 9 Jan 2008 22:51:15 UTC More blips and blops in our traffic caused by who-knows-what. We still don't have enough data yet to see if yesterday's BOINC result outcome index build helped with those regular slow validation-fix updates. In any case, I misspoke: we are running a version of MySQL where triggers are available to us - we only have to figure out how to implement them to do what we need. This morning the secondary download server bane was having a mount headache and I had to give it a virtual kick to get it going again. And that router is still a problem, but we're not convinced it's the only problem. Swapped out cables, switches etc. to no avail this morning. I installed some real load balancing between vader and bane (in practice round robin DNS is hardly balanced) which may help. There was still slowness to the web site as of a few minutes ago. This had nothing to do with recent web code tinkering/updates or database load or any such thing - this was strictly due to the aforementioned router problems, as half the web traffic was going through the same router (the other half over the standard campus network). I just moved the competing traffic onto the campus network as well, so that should improve web site performance in general. Regarding recent assimilator clogs, we had another one this afternoon. And yes, once again it was from a result produced by an optimized client. This time around I attached a debugger and found the problem was in XML parsing of the result and sure enough with enough eye-squinting I found a couple garbage characters in the uploaded result file. Specifically, in the power-of-time declaration of a pulse. 
Instead of: <pot length=211 encoding="x-csv"> It was: <pot length=211 encoding71x-csv"> So there are two problems. First, something is causing corruption in the xml (the non-standard client? something else on our end?). And second, the assimilator is too sensitive to such corruption. It shouldn't bail out so readily and create these large ready-to-assimilate queues. Minor updates to the server status page: I changed references to "beam/polarization pair" to the more concise "channel." I then added a parenthetic numeric value to the ends of each data file (representing total working/done channels for each file) so you don't have to count the little green squares. I also added total values at the bottom for all data files (mostly so we can see how long we have before we run out of data to split). Note how the "vertical" processing (i.e. splitting multiple files at once) has a negative side effect: we are forced to keep data files around much longer, which makes it difficult to keep a queue of data on disk. Some better "vertical" logic has been coded, to be rolled out in the next day or so. - Matt 8 Jan 2008 22:16:52 UTC So we've been running this annoyingly load-intensive query every day on the BOINC database to clean up results that failed validation. It took up to an hour to run, during which it hogs a bunch of database memory and slows everything down, including workunit distribution. Why not build an index? Well, indexes still take up disk/memory, and the main table field in question is of low cardinality, and we're only hunting for a few thousand out of millions of rows each time. So Bob was looking into implementing a newfangled mysql "trigger" to flag the few rows when they enter this bad state, making them much easier to find without needing the overkill of an index. However, we only discovered today triggers don't work in our current version of mysql. So we built an index after all. We'll see how much it helps. 
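For anyone curious what the trigger idea would look like, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (trigger syntax differs between the two), and the table names, column name and the "failed" state value are hypothetical, not the real BOINC schema. The trigger copies the id of any row entering the bad state into a small side table, so the cleanup job never has to scan the big one.

```python
import sqlite3

FAILED_STATE = 2  # hypothetical value meaning "failed validation"


def flagged_after_updates():
    """Build a toy result table, fail three rows, return the flagged ids."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        f"""
        CREATE TABLE result (
            id INTEGER PRIMARY KEY,
            validate_state INTEGER NOT NULL
        );
        -- Small side table holding only the rows that need cleanup.
        CREATE TABLE needs_cleanup (result_id INTEGER PRIMARY KEY);
        -- Flag a row at the moment it enters the bad state.
        CREATE TRIGGER flag_failed
        AFTER UPDATE OF validate_state ON result
        WHEN NEW.validate_state = {FAILED_STATE}
        BEGIN
            INSERT OR IGNORE INTO needs_cleanup (result_id) VALUES (NEW.id);
        END;
        """
    )
    # A thousand healthy rows stand in for the millions in the real table.
    conn.executemany(
        "INSERT INTO result (id, validate_state) VALUES (?, 0)",
        ((i,) for i in range(1, 1001)),
    )
    # A few results fail validation; the trigger records each one.
    conn.execute(
        "UPDATE result SET validate_state = ? WHERE id IN (17, 404, 999)",
        (FAILED_STATE,),
    )
    rows = conn.execute(
        "SELECT result_id FROM needs_cleanup ORDER BY result_id"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(flagged_after_updates())
```

The trade-off is the one described above: the side table only grows with the handful of bad rows, whereas an index on a low-cardinality field has an entry for every row in the table.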
Other than that and the usual database backup outage this morning, mostly spent the day moving large numbers of files/archives around to prepare to grow the workunit storage space again. I also got the new server (maul - see yesterday's note) up to speed, more or less. Still won't be live for at least a day or two, but it's working. It's a 4x2.66 GHz dual core intel with 4 GB of memory. Looks like another perfect web server to me. Also had to grow our home directory space because, as you know, no matter how much space you have, it's never enough. Somebody pointed to an article that mentioned the Cisco 2811 has a known throughput rated at about 61 Mbps. This was a surprise to me and Jeff - I guess this wasn't what we were told, and you'd think a router with 100 Mbps ports could reach a theoretical maximum of 100 Mbps. The cap seems to be due to CPU limits, and we are doing tunnel encryption and have a small but still non-zero set of access rules. Anyway live and learn. And no further progress on that since yesterday. Another storm is whizzing through. The top third of a 50 foot tree just broke off right outside my lab window. Cool. I understand why people are freaking out about this current weather, but this is nothing compared to the hurricanes I dealt with growing up in downstate NY. - Matt 7 Jan 2008 23:28:38 UTC Lots of weather in the Bay Area over the weekend, leading to many power outages. Luckily our project was not affected. The new pseudo-random nature of our workunit creation finally worked itself out, and we were sending data at a relatively even pace. Speaking of sending data... At the end of last week my suspicions were confirmed: the router between us and our ISP (a Cisco 2811) has been CPU bound for who-knows-how-long, thus causing an artificial 60 Mbit/sec cap on our outbound packets. Further research will determine whether we can improve its performance or if we need to procure a better router. We had an assimilator get jammed on a broken result. 
I had to delete the result to clear the pipes. This happened once before a week or two ago. A little detective work this morning uncovered that both such broken results were processed by optimized clients. I'm just sayin'. This could easily be a coincidence. Spent a large chunk of the day trying to coax another Intel-donated server to life. We've gotten a lot of stuff from Intel recently, all in varying states of functionality (some missing CPUs, some have test boards, etc.). This particular one (4 2.66GHz CPUs, 8 GB RAM) was dead in the water for a while as it wouldn't respond to any keyboard/mouse. However, the other day I noticed one of the front-side fan modules wasn't seated properly. I adjusted it, and now the server sees all input devices. It's still a little squirrelly, but may be a worthwhile web server after all. We're calling it "maul" (sticking to the current "darth" theme). I'll announce it again if it actually proves to be ready for prime time. - Matt 3 Jan 2008 20:54:14 UTC Spreading the workunit creation over several files at once seems to be helping create a healthier mix of fast/slow workunits. However, adding a second download server seems to have confirmed a suspicion of mine (key word: "seems"): that somewhere down the pike we're being capped at 60 Mbits/sec. For a while there we had two download servers and a workunit storage server with plenty of I/O capacity to spare, but still we were hitting a hard 60 Mbit ceiling outbound. Inquiries are being drafted/sent to the appropriate parties. It still could be a local problem, but we're not sure what else to try (given our current hardware). We are in the middle of building another helpful index on the science database. Looks like Bob's magic informix incantations are working - we can keep the project running simultaneously (though the assimilators might back up a bit). It is always happier around here when work is flowing. 
To be safe we increased the ready-to-send queue size to one million - we have the disk space now to keep more workunits around. The only downside is that this inflates the result table in the database by approximately 5-10%, which may exercise the RAM on the BOINC database server that much more. There is another problem Dave and I were poking at today: excessive "out of range" failures on our public web sites. Here's the deal: BOINC clients have a nice GUI which shows you icons, pictures, etc. from different projects as you select which to run on your computer. Where does it get these files? From the project's web servers. This is all well and good, but there are several (hundreds? thousands?) older clients out there making such requests but are being met with 416 "range not satisfiable" errors. Why? Because they have already downloaded the image file, but are making requests for more bytes beyond the file boundaries as if there was more to download. Obviously a bug somewhere, or a change in the way apache handles such things, but there's not much we can do about it. Even though this activity is creating bursts of heavy load on our web servers, this is a fire we're going to let burn for now. The official press release about multi-beam is finally out. This should help on many levels (though I'll be busier making sure the servers can handle any significant load increase). I guess I'll also be shaving every morning in case there is interest from the national television news media. I guess this is "technical" news: Our desks/chairs/furniture are mostly ancient hand-me-downs, some pieces older than I. We did get some new chair donations recently, but one of them broke - it came loose from its base, causing unsuspecting sitters to suddenly fall forward if their balance wasn't particularly keen. It's been lurking in our lab way too long, coaxing uninformed standers with tired legs to rest upon its comfortable and seemingly stable cushion base. 
I came to the lab this morning and that evil chair was by my desk with a note taped to it: "Matt - can you please toss this chair?" I guess enough was enough. I dragged it to the dumpster and sent it back to the dark void from whence it came. - Matt 2 Jan 2008 22:54:11 UTC Happy new year! Actually, being that every moment is the beginning of some arbitrarily defined era, I should be more clear: Happy new calendar year number 2008, whoever uses this particular calendar system which I usually do! The weekend was busy with the more-and-more-common fast workunits. Discussions today at the lab brought up the fact that about a third of our data will translate into these fast runners, so we better turn our attention back towards improving the data pipeline. We picked two low hanging fruits today: convert server bane from a redundant web server to a secondary download server. This will help determine if that bottleneck is the server or the storage. I also added a flag to the splitter scripts to select files in beam/polarization pair order, not filename order. This will help pseudo-randomize the creation of work, and hopefully spread the pain of fast workunit periods so we aren't so overwhelmed at times. Nevertheless, we have Astropulse coming down the pike, and have a lot of SETI@home data to go through (and we're starting to collect new data again!). So we need to upgrade the network/servers in a big way. And acquire more participants. Not sure how this will all happen yet, but it has to happen. Meanwhile, we might try another science database index build tomorrow (or soon thereafter). Bob found a way to do so while the database is up and inserting rows, so we might not have to shut down splitters/assimilators during the long build. Cool. - Matt 27 Dec 2007 20:41:10 UTC ("Tweenday" referring to the scant few work days between Xmas and New Year's holidays). 
As we progress in our back-end scientific analysis we need to build many indexes on the science database (which vastly speed up queries). In fact, we need and hope to create 2 indexes a week for the next month or two. Seems easy, but each time you fire off such a build the science database locks up for up to 6 hours, during which there will be no assimilation and no splitting of new workunits. Well, we were planning to build another index today but with the frequent "high demand" due to our fast-return workunits the ready-to-send queue is pretty much at zero. So if we started such an index build y'all would get no work until it was done. We decided to postpone this until next week when hopefully we'll have a more user-friendly window of opportunity. In the meantime, I've been trying to squeeze more juice out of our current servers. I'm kinda stumped as to why we are hitting this 60 Mbit/sec ceiling of workunit production/sending. I'm not finding any obvious I/O or network bottlenecks. However, while searching I decided to "fix" the server status page. I changed "results in progress" to "results out in the field" which is more accurate. This number never did include the results waiting for the redundant partners to return. So I added a "results returned/awaiting validation" row which isn't exactly an accurate description either but is the shortest phrase I could think up at the time. Basically these are all the results that have been returned and have yet to enter the validation/assimilation/delete pipeline, after which they are "waiting for db purging." To use a term coined elsewhere, most of these results, if not all, are waiting for their "wingman" (should be "wingperson"). At this point if you add the results ready to send, out in the field, returned/awaiting validation, and awaiting db purging, you have an exact total of the current number of all results in the BOINC database.
Thinking about this more, to get a slightly more accurate number of results waiting to reach redundancy before entering the back-end pipeline you take the "results returned/awaiting validation" and subtract 2 times the workunits awaiting validation and subtract 2 times the workunits awaiting assimilation. Whatever.. you get the basic idea. If I think of an easier/quicker way to describe all this I will. Answering some posts from yesterday's thread: > Missing files like that prompt me to make an immediate fsck on the filesystem. Very true - except this is a filesystem on network attached storage. The filesystem is proprietary and out of our control, therefore no fsck'ing, nor should there be a need for manual fsck'ing. > Why are the bits 'in' larger than the bits 'out'? In regards to the cricket graphs, the in/out depends on your orientation. The bytes going into the router are coming from the lab, en route to the outside world. So this is "outbound" traffic going "into" the router. Vice versa for the inbound. Basically: green = workunit downloads, blue line = result uploads - though there is some low-level apache traffic noise mixed in there (web sites and schedulers). - Matt 27 Dec 2007 0:05:28 UTC The weekend was difficult as we kept splitting noisy/fast work, so our back-end production was running full speed most of the time, clogging several pipes, filling some queues, emptying others, etc. We were able to keep reaching our current outbound ceiling of 60 Mbits/sec, so despite the problems we were sending out work as fast as we could otherwise. That's good, but bigger pipes would be better. Also one of the assimilators was failing on a particular result. We're not sure why, but I deleted that one result and that particular dam broke. Some untested forum code was put on line which also wreaked minor havoc. Not my fault. Anyway.. this is a short mini week for us in between Xmas/New Year's. Since we weren't around yesterday, we had our normal weekly outage today.
Also took care of cleaning some extra "bloat" in our database. About 20% of the rows in the host table were hosts that last connected over a year ago and ultimately never got any credit. We blitzed all those. Upon restarting everything this afternoon after the outage I noticed the feeder executables had disappeared sometime around 3-4 days ago (luckily images of the executables remained in memory since we had no downtime over the weekend). We have snapshots on that filesystem so recovery was instantaneous, but the initial disappearance is mysterious and a bit troubling. - Matt 23 Dec 2007 19:05:25 UTC Quick note: We never really did recover from the science database issues from a couple days ago due to DOS'ing ourselves with fast workunits. Whatever. We chose to let things naturally pass through the system. Kinda like kidney stones. Meanwhile, one of the assimilators is failing with a brand new error. If any of us have time we'll try to check into that over the coming days, but we may be out of luck until we're all in the lab doing "extreme debugging" together on Wednesday. Hang in there! - Matt 21 Dec 2007 18:27:07 UTC Happy Holidays! As a present thumper (our main science database) crashed for no reason this morning. Not even the service processor was responding. I wasn't planning on coming to the lab today but here I am. Long story short, Jeff/Bob/I have no idea why it crashed - I found it powered down (but with standby power on). I powered it up no problem. Some drives are resyncing, but there's no sign that any drives died. In fact, every service on it is coming up just fine, including informix. Also no signs of high temperatures, or other hardware failures. Well, jeez. While the main disks are syncing up I'll leave the assimilators/splitters off. We may run out of work, but hopefully not for too long. - Matt 20 Dec 2007 21:50:18 UTC We're about to enter the first of two long holiday weekends. 
I'm not going anywhere - I'll be around checking in from time to time. To reduce the impact of unexpected problems I reverted the web servers back to round-robin'ing between kosh, penguin, and the new bane, and also (thanks to the recent increase in storage capacity) doubled the size of our ready-to-send queue. That should fill up nicely this afternoon and give us a happy, healthy cushion. There was a blip yesterday afternoon due to our daily "cleanup" query to revalidate workunits that failed validation due to some transient error. Such a query hogs database resources and can cause a dip of arbitrary size in our upload/download I/O. We made an optimization this morning to hopefully mitigate such impacts in the future. Eric discovered yesterday that we were actually precessing our multi-beam data twice. Not a big deal as it's easy to correct, and we would have discovered this immediately once the nitpicker got rolling, but it's better we discovered this sooner than later as cleanup will be faster. Pretty much we just have to determine which signals in our database were found via the multi-beam clients (as opposed to the classic/enhanced clients) and unprecess them. (What is precessing?) - Matt 19 Dec 2007 21:46:14 UTC There were some minor headaches during the outage recovery last night, mostly due to the scheduler apache processes choking. They needed to simply be restarted, which happens automatically every half hour due to log rotation. Or they should be restarted - I just discovered this rotation script was broken on bruno and other machines. I fixed it. I'm still breaking in the new web server "bane" - still having to make minor tweaks here and there. Of course I asked people to troubleshoot it during the outage recovery and the ensuing problems noted above - not very smart. Should be nice and zippy now. In fact, as I type this it's the only public web server running. 
I'm "stress testing" right now, but will turn the old redundant servers back on before too long. There's a push to get BOINC version 6 compiled/tested/released, so all questions regarding BOINC behavior are taking a back seat. Please stay tuned! These types of questions are usually answered better/faster in the Number Crunchers forum. I'm mostly focused on the servers and the SETI science side of things (though I do some minor BOINC development from time to time - but usually not anything involving credit or deadlines). - Matt 18 Dec 2007 23:24:47 UTC Our Tuesday outage ran a little long this week because we're no longer dumping to the super fast Snap Appliance as we converted that space into more workunit storage. Instead we're currently writing to the internal disk space on thumper, which is vast but much slower for some reason. This situation will evolve, so nothing really to worry about. We also made the database change to fix the cryptic bug noted in this thread. Pretty much just adding a new row to the middle of the application table so it was in sync with the data structs in the code. And yep, after that it was behaving normally, even without our "force" to set values to where they should be regardless of what was erroneously culled from the database. So we're calling this fixed. I also got the new server "bane" on line as a third redundant public web server. Perhaps you noticed a speedup? Perhaps you noticed some unexpected garbage, broken links, or weird php behavior? Let me know via this thread if you see anything obviously (and suddenly) wrong with the web site. Over the coming days we will retire the current web servers kosh and penguin. Bane is a system with two Intel quad-core 2.66GHz CPUs and 4GB RAM in 1U of rack space. Alone it is more powerful than kosh and penguin combined, which together account for about 6U of rack space. - Matt 17 Dec 2007 23:57:49 UTC Another Monday back on the farm.
Due to faulty log rotation (and overly wordy logs) our /home partition filled up over the weekend, which didn't do much damage except it caused some BOINC backend processes to stop (and fail to restart). No big deal - the assimilators/splitters are catching up now. Jeff just kicked the validators, too. The hidden real problem is that the server start/stop script is 735 lines of python. In our copious free time we'll re-write a better, smarter version in a different scripting language (which will be, by default, easier to debug) - and it'll probably be only 100 lines or so, I imagine. Okay.. maybe 200. The mass mail pleading for donations is wrapping up without much ado, except a large number of them got blocked/spam filtered. No big surprise there, but we need to do more research about how to get around all that. - Matt 13 Dec 2007 20:50:46 UTC Roll up your sleeves, get the coffee brewing, etc. So yesterday's "bug" hasn't been 100% solved yet, but there is a workaround in place. Here are the details (continued from yesterday's spiel): We have two redundant schedulers on bruno/ptolemy, both running the exact same executable (mounted from the same NAS, no less), on the exact same linux OS/kernel. One was sending work, the other was not. By "not" I mean there was work available, but something was causing the schedule processes on bruno to wrongly think that the work wasn't suitable for sending out. Since this was all old, stable code, running on identical servers, this naturally pointed to some kind of broken network plumbing on bruno at first. A large part of the day was spent tracking this down. We checked everything: ifconfigs, MTU sizes, DNS records, router settings, routing tables, apache configurations, everything. We rebooted switches and servers to no avail. We had no choice but to begin questioning the actual code that has been working for months and happens to still be working perfectly on ptolemy. 
Jeff attached a debugger to the many scheduler cgi processes and eventually spotted something odd. Why was the scheduler tagging the ready-to-send results in shared memory (which is filled by the feeder) as "beta" results? We looked on ptolemy. They were not tagged as "beta" there. A clue! Scheduler code was pored through and digested and it was determined this was indeed the heart of the problem - results tagged as "beta" were not to be sent out to regular clients asking for non-beta work. So bruno refused to send any of these results out - it was erroneously thinking these were all "beta" results. But why?! After countless fprintf's were added to the scheduler code we found this actually wasn't the scheduler's fault - it was the feeder! The feeder is a relatively simple part of the back end which keeps a buffer of ready results to send out in shared memory for the hundreds of scheduler processes to pick and choose from. The scheduler plucks results from the array, creating an empty slot which the feeder fills up again. When the feeder first starts up it reads the application info from the database to determine which application is "current" and then gets the pertinent information about the application, including whether or not it is "beta." This information is then tied to the ready-to-send results as they are pulled from the database. We found that even though beta was "0" in the database, it was being set to "1" after that particular row was read into memory. Was this a database connection problem then? We checked. Both bruno and ptolemy were connecting to the same database and getting at the same rows with the same values, so no. However, during this exercise we noted that the C struct in the BOINC db code for the application had an extra field "weight" and of course this was the penultimate field, just before the final field "beta." What does that mean?
Well, when filling this struct with a stream coming from MySQL, whatever value MySQL thinks is "beta" will be put in the struct as "weight" and whatever random data (on disks or in memory) beyond that MySQL would put in the struct as "beta." This has been the case for months, if not years (?!) but being these fields are never used by us (our beta project is basically a "real" project that's completely separate from the public project so its beta value is "0" as well), this never was an issue. We were fine as long as beta happened to be set to "0" (correctly or incorrectly) which it always had been... ...until JUST NOW! And only on bruno! This seems statistically impossible without any good explanation, but before getting lost down that road we put in a one-line hack which forces beta to be "0" no matter what bogus values get put in the oversized C struct, and immediately bruno was back in business. Until we get the whole gang in the lab at the same time and we can answer the final questions and confirm the appropriate fixes, it will remain this way. Now back to some actual programming (helping Jeff wrap up work on radar blanking code). - Matt 12 Dec 2007 21:27:05 UTC Blech. The fallout from yesterday's business wasn't very pretty. The science database server had a migraine all night due to the load-intensive index build and subsequent mounting errors due to heavy disk i/o. So the assimilators were off until this morning after we rebooted the system and cleared its pipes. However, towards the end of the day yesterday I spotted something funny. Of two scheduling servers, bruno and ptolemy, the former was refusing to send out any work. This wasn't a network issue, nor was it a real lack-of-work issue. There was plenty of work in bruno's queue, and the feeder had it all stowed up in shared memory ready to go, but the scheduler for no apparent reason was allowing none of it through. Clients were requesting N seconds of work and bruno would send it 0 workunits. 
The clients requesting the same N seconds of work on ptolemy were getting work. This was weird and nothing like we've seen before. Of course, bruno and ptolemy have identical kernels, scheduler executables, apache configurations, database permissions, file server permissions, network routes, etc. etc. etc. Jeff and I have been beating our heads on this for basically all last night and this morning and we still have no idea. Jeff's adding some new debug code to the scheduler as I type. We do have a workaround - just dump all the traffic on ptolemy until we figure it out. We may very well do this by the end of the day if the real problem doesn't present itself. Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now. By the way, Bob is taking over adding a "median" form of the result turnaround time query and determining if it will hit the database as hard as I feared. Cool. - Matt 11 Dec 2007 22:35:37 UTC Okay so the weekly outage is running long and still going strong as I write and post this missive. So be it. What's the deal? I'll tell you. Short story: we're trying to get a lot done today. We fully expected things to take a while, and our expectations are being realized. As we continue pushing forward on the analysis code, we needed to build another index on the master science database (thumper). This takes many hours, during which the table in question is locked and therefore the parts of the back end that require science database access have to be shut down, which is why we time such events with the regular outages. However, we're also finally tackling the nagging workunit space problem. 
Our workunit storage server (gowron) shares workunit storage space with various BOINC database archives, so the easiest/best solution is to move those archives elsewhere. Where's elsewhere? We currently have a lot of space in a volume established for science database archives on thumper. So today we had the two BOINC backups and the index build all hitting the thumper disks pretty hard, thus slowing everything down. Seems kind of silly, but this is a special case as we're not normally doing index builds. Nevertheless we'll move the BOINC database archives elsewhere at some point down the line as time/disk space permits. Meanwhile.. we broke the archive space on gowron and converted it all into a bunch of RAID1 pairs which are taking a long time to sync up. Actually, there's even more ex-archive space available but we'll do that at another time. My guess is the syncing should be done around 3:30pm Pacific Time. Are you getting all this? Warning: this entire chapter will be on the test. By the way, while waiting for all the parts above to come together I burned a Fedora Core 8 DVD and installed it on our latest Intel donation (mentioned in an earlier post). We're going to call it "bane" - actually reusing a name/IP address of another potential server donation that didn't pan out so well. I don't believe in jinxes, and I'm all for recycling. Anyway, it's already up and configured and working a lot better than the old bane. Might have a new web server racked up by the end of the week! And we got the mass mail pipeline finalized. Maybe I'll start those up today too. This is actually the highest priority but it's not very good form to start a mass mail while the project is down. - Matt 10 Dec 2007 23:26:47 UTC We had another batch of "fast" workunits this weekend. No big deal, except we did run out of a ready-to-send queue for a while there. 
To help alleviate panic I added a couple items to the server status page for your (and our) diagnostic pleasure: count of results returned over the past hour, and their average "turnaround" time (i.e. "wall" time between workunit download and its result upload). It seems the current "normal" average is about 60 hours; during the weekend we were as low as 30. It would be more meaningful to have median instead of average (as there are always slow computers that turn around mere seconds before the deadline, thus skewing the averages), but MySQL doesn't have a "median" function and it's not really worth implementing one of our own - we have so many other fish to fry. Our air conditioner tech was in today to wrap up work on fixing the current (and hopefully last) coolant leak. No real news there, except it was fun to see our temperatures shoot up 6 degrees Celsius within a few minutes as the air conditioner was temporarily turned off. I'm about to start the latest donation drive. This will wreak havoc on a few of our isolated servers which are dedicated for such large mass mailings. Hopefully this will happen without incident - people are understandably sensitive about what they perceive as spam. - Matt 7 Dec 2007 18:25:47 UTC Another quick note to mention that last night's power outage was a success, or at least our part of it. Thanks to all the cable/power cleanup Jeff and I did weeks ago it was a breeze getting everything safely powered down last night. This morning after we got the "all clear" we brought everything back up. Ultimately everything was fine, but there were a few minor obstacles. Like our home directories being mounted read only (a misconfiguration in the exports file that got exercised upon reboot). And the BOINC database server booted up in the wrong kernel which didn't have fibre card support (thought we fixed that last time but I really fixed it now). Also the BOINC database replica needed some extra convincing that it was in fact a replica server.
We also moved vader into its new rack - part of the slooooow shuffle process of reorganizing the server closet (moving old stuff out, new stuff in, etc.). Anyway.. we're catching up on the big backlog now which will take a while of course. Hang in there. - Matt 6 Dec 2007 19:04:38 UTC Early tech news report today as we're going to have a power outage in about 4-5 hours. Yep. Everything is coming down. No web sites and no data servers until we power up Friday morning. That said, there's not much to report. Still waiting on final pieces to fall into place before I start sending out the mass donation e-mail. Slow steady progress on increasing space for workunit storage. Doing some actual programming again (mostly ramping up on Jeff/Eric's work on the nitpicker and data recorder code to deal with the radar blanking signal). Nothing terribly exciting - more of the same. Yeah... hopefully this will be the last lab-wide power outage to deal with those long-standing breaker problems. Yesterday afternoon we did get permission to use another project's espresso machine down in the community kitchen. For a moment there we were thinking of adding such a device to our hardware donation wish list. - Matt 5 Dec 2007 22:35:36 UTC Moving on... This morning Eric noticed our donation processing pipeline was clogged. Some backstory: central campus handles all the donation stuff. They send us an automated e-mail whenever people donate so we can give them a green star. I had to write a script that parses these e-mails. Not very elegant, but it works most of the time. But every so often, without warning, the format of the automated e-mail changes. This is exactly what happened a couple weeks ago - they removed a single "the" from one line and my parser went kaput. I fixed it, and suddenly we're a little bit richer. Sweet. This morning had a nitpicker (near time persistency checker) design review. Maybe we'll post the (rather cryptic) minutes somewhere soon.
I did update the plans page - it's really hard for us to keep all these informative pages in sync and up to date. I do have a public SETI wiki ready to go but we're too busy to get it started (import the current pages, etc.). Usual manpower problems around here. Our friend at Intel gave us a 1U server missing CPUs a few months ago, and yesterday came through with a pair of quad cores. I scraped together 4GB of RAM, and we're ordering some drives now. This may very well become our new public web server. If it actually works once I install an OS (no guarantees yet - it's an engineering test model) I'll take this off the hardware donation page. - Matt 4 Dec 2007 22:15:22 UTC Yesterday afternoon some of our servers choked on random NFS mounts again. This may have been due to me messing around with sshd of all things. I doubt it, as the reasons why are totally mysterious, but the timing was fairly coincidental. Anyway, this simply meant kicking some NFS services and restarting informix on the science db services. The secondary db on bambi actually got stuck during recovery and was restarted/fixed this morning before the outage. The outage itself was fairly uneventful. Question: Will doubling the WU size help? Unfortunately it's not that simple. It will have the immediate benefit of reducing the bandwidth/database load. But while the results are out in the field the workunits remain on disk. Which means the workunits will be stuck on disk for at least twice the current average time. As long as redundancy is set to two (see below) this isn't a wash - slower computers will have a greater opportunity to dominate and keep more work on disk than before, at least that's been our experience. Long story short, doubling WU size does help, but not as much as you'd think, and it would be months before we saw any positive results. Question from previous thread: Why do we need two results to validate?
Until BOINC employs some kind of "trustworthiness" score per host, and even then, we'll need two results per workunit for scientific validation. Checksumming plays no part. What we find at every frequency/chirp rate/sky position is as important as what we don't find. And there's no way to tell beforehand just looking at the raw data. So every client has to go through every permutation of the above. Nefarious people (or CPU hiccups) can add signals, delete signals, or alter signals and the only way to catch this is by chewing on the complete workunit twice. We could go down to accepting just one result, and statistically we might have well over 99% validity. But it's still not 100%. If one in every thousand results is messed up that would be a major headache when looking for repeating events. With two results, the odds are one in a million that two matched results would both be messed up, and far less likely messed up in the exact same way, so they won't be validated. Not sure if I stated this analogy elsewhere, but we who work on the SETI@home/BOINC project are like a basketball team. Down on the court, in the middle of the action, it's very hard to see everything going on. We're all experienced pros fighting through the immediate chaos of our surroundings, not always able to find the open teammate or catch the coach's signals. This shouldn't be seen as a poor reflection of our abilities - just the nature of the game. Up in the stands, observers see a bigger picture. It's no surprise the people in the crowd are sometimes confused or frustrated by the actions of the players when they have the illusion of "seeing it all." Key word: "illusion." Comments from the fans to the players (and vice versa) usually reflect this disparity in perspective, which is fine as long as both parties are aware of it. - Matt 3 Dec 2007 22:16:10 UTC I was out of town all weekend (on the east coast visiting family) but didn't miss much around here. 
However, we did have a long server meeting this morning as many things are afoot. First off, our power outage from last Thursday is now rescheduled for this upcoming Thursday (see notice on the front page). We're hyper-prepared now, so outside of shutting everything down Thursday afternoon and resurrecting the whole project Friday morning, it should be a breeze. There was discussion about our current workunit storage woes. Namely, we need more, and we have an immediate plan to make more (converting barely-used archive storage). This is because of our 2/2 redundancy, i.e. we send out two redundant workunits and need two results to validate. This means a large number of users finish their workunits quickly, but have to wait for their "partner" (or "wingman") to return the other before validating, during which time the workunit is stuck on disk taking up space. Months ago when we were 3/2 we'd send out three redundant workunits and only need 2 to validate, which means the workunit stays on disk only as long as the two fastest machines take to return their result - so they'd get deleted faster. That's the crux of it. Other than that chatted about making some minor upgrades to the BOINC backend (employing better trigger file standards, cleaning up the start/stop scripts (i.e. program them in something other than python)) and gearing up for the end-of-the-year donation drive. Most of the pieces are in place for that. - Matt 28 Nov 2007 22:28:56 UTC Turns out I was misinformed: while Arecibo Observatory is currently being recommissioned, the ALFA receiver still isn't attached yet and won't be until after some more cleanup. In short, the ETA is still TBD. So be it. Currently (at least as I am writing this) we are in the midst of another "crunch" period where workunits are returning much faster than normal, thereby swamping our servers. This time Jeff and I looked at the results.
The bunch we observed weren't "noisy" - they were normal workunits that just happened to finish quickly due to their slew rates. This isn't a scientific/project problem - it's simply extra load on our servers (a.k.a. a free "stress test"). We're getting prepared for another donation drive. I just updated the hardware donation page, for example. - Matt 27 Nov 2007 21:17:03 UTC Another week, another database backup/compression outage. This time around I took care of many house-keeping details while we were offline. I restarted the load balancers on our scheduling servers to enact higher timeouts - we're seeing occasional messages in our logs about such timeouts. We'll see if my adjustment helps. We moved vader onto a power strip to facilitate yet more ease during the power outage Thursday night. I also fully power cycled bambi to recover the drives that were wrongly reported as "failed" yesterday. I also compressed a bunch of old archives and logs, and uncovered many sym link chains that I then cleaned up, which in turn will hopefully reduce NFS problems in the future. [edit] UPDATE! This Thursday's electrical outage has been canceled. Woo-hoo! It shall be rescheduled sometime in the coming weeks. [/edit] - Matt 26 Nov 2007 22:18:15 UTC We survived the long weekend more or less unscathed. Another "busy" raw data file entered the queue and caused some extra traffic yesterday, but nothing nearly as bad as last Wednesday, and even that wasn't too bad. One user suggested we have the multiple splitters simultaneously chew on different files to mitigate the damage when one particular file is noisy. This would help, but at the expense of losing any benefits from file/disk caching. It's up for debate if caching is really an issue, but Jeff and I agree that of all the dozens of fires on our list this one is low priority. A bigger problem, though most people didn't even notice, was bambi's nfsd freaking out around Saturday afternoon.
This had the effect of causing the load on bruno and ptolemy to inflate for no good reason. Traffic was still pushing through at seemingly normal rates but there was a general "malaise" all over the backend. Eric actually stopped and restarted nfsd right after this happened but that didn't actually do anything. It wasn't until I fully rebooted bambi this morning that the loads on bruno/ptolemy plummeted. Slightly annoying: upon restarting bambi came up missing drives - this is a known problem where bambi's disk controller needs a full power cycle from time to time. We'll do that tomorrow during the usual outage. Looks like we're going to start taking new data at Arecibo again literally any minute now. Well, it could be thousands of minutes, but still.. We shipped some drives down there this weekend so hopefully they have one already mounted up ready to receive some hot, fresh bits whenever they start pouring in. Note the news on the front page. We're having a lab-wide power outage later this week. In theory no action on your part is necessary. - Matt 21 Nov 2007 21:41:44 UTC I wasn't selected for jury duty! Hooray! I fulfilled my civic duty without having to miss work! So we're in the middle of a slight server malaise - the data we're currently splitting/sending out is of the sort that it gets processed quickly and returned much faster than average. That's one big difference with our current multi-beam processing: the variance of data processing time per workunit is far greater than before, so we get into these unpredictable heavy periods and have no choice but to wait them out. Well... that's not entirely true. Jeff actually moved the rest of the raw data from this day out of the way so we can move to other days which are potentially more friendly towards our servers. Also we could predict, with very coarse resolution, what days might be "rough" before sending them through the pipeline. 
But we're going to split the data anyway at some point, so why not get it over with? At any rate we started more splitters to keep from running out of work, and we'll keep an eye on this as we progress into the holiday weekend. Happy Thanksgiving! Or if you're not in the U.S. - Happy Thursday! - Matt 20 Nov 2007 22:26:25 UTC The recovery from yesterday's outage (see my previous post) ended up going faster than expected. During the evening I turned the assimilators/splitters back on before we ran out of work or clogged the pipelines too much. Today we had the usual database backup/compression outage. Usual drill - no news there. We're back on line and catching up. Other than that, lots of minor hardware/software cleanup - basically getting ready for the long weekend (for those outside the U.S. I'm referring to Thanksgiving, i.e. an excessively large meal centered around turkey on Thursday, followed by three days of shopping, watching football, and digesting). I forgot to bring in a camera to take pictures of the cleaned-up closet. Maybe tomorrow (if I don't have jury duty - cross your fingers). I don't have a cell phone either, much less one with a camera in it. Not that there's much to see that's new - but it's good to post some pictures once in a while. - Matt 20 Nov 2007 0:05:57 UTC As we warned, we had a major outage today to do some massive cleaning/organization in our server closet. It went well: with dozens of cable ties and power strips on hand we got rid of about 95% of the spaghetti dangling from the backs of the racks, spilling into several piles on the closet floor. But that wasn't the main reason for this outage. We also installed a new UPS to replace a broken one - so jocelyn and isaac are protected again, as well as put everything on some kind of power switch so that when we have our lab-wide outage it'll be easy to just flick things on/off (as opposed to reaching behind big, heavy things to yank plugs from the wall). 
With the power off we were able to move racks around to allow enough of a gap to finally get the old E3500 out of there (the late, great galileo) - it had been collecting dust in the corner for years. Speaking of dust, we also vacuumed. But of course there were issues, which is to be expected when powering many massive servers off and on. We discovered jocelyn lost contact with its fibre-channel RAID (where the BOINC database resides). After some head scratching we realized this was due to fibre-channel support being lost in the recently upgraded kernel. We booted to an older kernel and it was fine. As I write this, both ewen (Eric's hydrogen database server) and thumper are doing forced checks of large disk volumes - that might take all night during which certain parts of our project will have to remain offline. We'll probably run out of work before too long. Apparently we need to turn off the forced checks. We also had some routing problems upon rebooting the Cisco but we quickly remembered that you have to do a "magic ping" to wake up the next hop and then traffic pushed through. - Matt 15 Nov 2007 20:35:13 UTC No real exciting news regarding the public facing stuff over the past 24 hours. Some of us have been lost in a grant proposal due today, some have yet more proposals to squeeze out. It's grant writing season. I've been playing with the new UPS's and some random php code. Jeff and I are making plans for our big preparatory power outage on Monday. We'll be switching all kinds of servers off and on over the course of a few hours, cleaning up cables, reducing the number of power strips, installing/implementing the new UPS's, moving stuff around on the racks, perhaps removing some things. Basically want to do as much as possible to make the real outage at the end of the month as smooth as possible. Once we settle on the real plan we'll post a warning message on the front page. 
- Matt 14 Nov 2007 21:32:32 UTC In case anybody noticed, we had the assimilators/splitters turned off for a bit to test the swap between our primary/secondary science database servers. Everything worked! So that was a valuable test, especially since we'll need to do this for realsies in the coming weeks to upgrade the OS on the current primary (thumper). Any mediawiki nerds out there? I need some assistance... We're trying to wiki-fy parts (or perhaps most) of the SETI@home public web site. However right off the bat I'm hitting an annoying problem: pages with '@' in their title, like, uh, "SETI@home." This is documented everywhere I could find as a "legal" wiki title character, but if I try to edit any page with '@' in the title it fails (saying the page - missing the at sign - doesn't exist - would you like to create it?). So I tried to escape it with '%40' but this also fails (as the software converts the escaped ASCII code to '@' which results in the same problem). What do I need to hack? Title.php? Something else? Google searches have proven useless so far (hard to search for '@' or 'at sign'). Dan and I re-seated the chips on the failing UPS (which I whined about yesterday). Now it works. All three new/used UPS's are charging now. Can't wait to add these to our server closet. Outage notices: There's gonna be a lab-wide outage later this month. Probably the night of November 29th, but this isn't official yet. Jeff and I will probably have our own full-day server outage prior to that (early next week?) to do some server closet maintenance in preparation for the real outage. - Matt 13 Nov 2007 22:16:57 UTC After the smoke cleared from the science database headaches of late last week, all was well for the long weekend. We had the day "off" yesterday, then did the usual outage today. We'll be bringing non-public-facing services up and down tomorrow for more planned science database testing (making the secondary the primary and then reversing again).
Working with three new/used UPS's this morning - varying APC models. The first was easy: batteries went right in via a pull-out module, the cabling was obvious, it tested just fine. The second was an older model. The cabling was far more difficult, I ultimately had to tape sets of batteries together to get them to safely slide in/out the only access hatch, and then it didn't work. The third was a similar older model that worked just fine. Anyway, we have annoying return/exchange bureaucracy ahead of us. - Matt 10 Nov 2007 2:47:11 UTC Just an update on the past 24 hours. After all the index builds pushed through from the primary to the secondary database server the dam broke on its own last night. However, the assimilators were unable to insert anything. With the assimilators clogged the workunit file server began to fill up. We had to stop the splitters to keep the volume from growing out of bounds. Things got cleaned up this morning, the databases safely restarted, and everything is back on track though we are still catching up. To answer questions from the previous thread: We do plan on doing the analysis on the secondary/replica server. Problems may only seem to happen on long weekends, but perhaps there's some truth to this. Chances are on a long weekend we make other semi-vacation-like plans and so there's less hands on deck to take care of problems. I'm personally not paid enough to care about 24 hour uptime. Don't like it? Donate some money and maybe we'll hire more staff. - Matt 8 Nov 2007 21:25:23 UTC As noted yesterday in my tech news item we had some database plans this morning. First a brief SETI@home project outage to clean up some logs. That was quick and harmless. We then kept the assimilators offline so we could add signal table indexes on the science database. Jeff's continuing work on developing/optimizing the signal candidate "nitpicker" - short for "near time persistency checker" i.e. 
the thing that continually looks for persistent, and therefore interesting, signals in our reduced data. The new indexes will be a great help. Of course, there were other things afoot to make the above a little more complicated. The science replica database server hung up again this morning. We found this was due to the automounter losing some important directories. Why the hell does this happen? The mounts time out naturally, but the automounter fails to remount them next time they are needed. Seems like a major Linux bug to me, as it's happening on all our systems to some extent. I adjusted the automounter timeouts from 5 minutes to 30 days. Doing so already helped on one other test system. Meanwhile, back on the farm... we're sending out some junky data that overflows quickly so that's been swamping our servers with twice the usual load. Annoying, but we'll just let nature take its course and get through the bad spots. This has the positive byproduct of giving us a heavy-load test to see how our servers currently perform under increased strain... except with the simultaneous aforementioned index build the extra splitter activity was gumming everything up. We have the splitters offline as I write this. Hopefully we'll be able to get them back online before we run out of work. If not, then so be it. - Matt 7 Nov 2007 21:32:50 UTC Let's see. Kind of getting bogged down in proposal land (Dan, Eric, and Josh are doing most of the work on that but I get pulled in from time to time to help with the menial stuff). After the proposal stress is behind us we'll begin the next donation push which will find me babysitting servers sending out hundreds of thousands of e-mails. Fun. Meanwhile I'll be chipping away at the zillion things on my to-do list which could easily take a man-year to complete.
Around the lab we've been discussing the notion of "e-mail bankruptcy" - realizing there is no way you can catch up on your teeming in-box, so you simply delete everything, then send out a mass e-mail to everyone saying something like "I deleted all my e-mails - sorry I didn't respond - if it's really important please send it again." In reality I do this all the time without sending that mass e-mail. Someday I might have to declare "to-do list bankruptcy." Warning: we might have a quick BOINC database outage tomorrow (to clean up old logs). And then we'll keep the assimilators offline an additional few hours so we can safely build indexes on the science database. The latter won't affect normal upload/downloads. - Matt 6 Nov 2007 22:21:30 UTC Another Tuesday, another regular weekly database backup outage. The web/data servers were in a funky state for a while there as we encountered some random minor issues. First, some new web code was wrongly accessing the database when the project was explicitly in "no db" mode. Dave fixed that. I also found some typos in the host_venue_action.php script (thanks to bug reports on this forum). I fixed that. And I also rebooted the scheduling servers during the outage to make sure the new load balancing regime worked without intervention upon restart. It did. I also fixed the "connecting clients" page again (hopefully for good this time). Also moved the db_purge archives to a different file system (as planned per yesterday's tech news item). And I effectively thwarted future complaints about our weekly outage starting too early/late by eliminating any mention of exact times. Ha ha. Other than that, still working on data pipeline automation scripts. Also spent a chunk of time helping the tangentially related CASPER Project upgrade their server's OS to one that was supported by our lab-wide data backup servers. And as for that one post about "setifiler1"...
A keen observer found "setifiler1" in all the pathnames relating to various recent errors. This is a red herring - setifiler1 is just a network attached storage server containing, among other things, many home accounts and web pages. So if any possible error shows up anywhere about anything, chances are the string "setifiler1" will appear in the pathname of the script/executable in question. - Matt 5 Nov 2007 22:34:40 UTC Well.. No bad news, really. Everything under my domain was working more or less. We did fill the data pipeline directory - an eight terabyte filesystem - with backlogged raw data. I'm only just now implementing my "janitor" scripts that check these files to make sure they have been successfully copied to our off site archives and fully processed by the splitters so we can safely delete our local copies. In the meantime we've been forming a long "delete queue." No big deal, except we were also keeping our db_purge archives on the same filesystem, which meant the db_purger stopped working, which in and of itself is also no big deal, but it's all getting cleaned up now. - Matt 1 Nov 2007 22:04:56 UTC So the new load balancing regime on the schedulers has been working great. That's good news. On the other hand, our science database replica still isn't quite perfect yet. At least we're finding it to be resilient (i.e. we don't have to reload it from scratch every time it barfs). It got into a funny state yesterday, and had to be ungracefully killed. We rebooted the system to clean the pipes and then it recovered just fine. However, the reboot tickled a disk controller problem we've seen before where a tiny random subset of disks were invisible after reboot. Luckily the RAID is robust enough that this wasn't a big issue. We fixed this problem the way we did before: a full power cycle. The disk controller must be hanging on to some broken bits that only a complete power down can remove. 
In any case, we really need to invest in those networkable power strips at some point. Smaller items: Various web site issues arose yesterday afternoon. A partial update of web code was in conflict with older parts. Dave cleaned that up this morning. Meanwhile Jeff and I are getting ever closer to fully automating the multibeam data pipeline, from Arecibo, to UCB, to the splitters, to our clients, and to/from our archives down at HPSS. We are hoping that someday soon we break through whatever bureaucratic dam(s) to get gigabit out of the lab (still currently stuck at a 100 Mbit ceiling for the whole lab, including our own private ISP strictly for SETI data downloads/uploads). By the way.. we believe we'll start collecting fresh data again at Arecibo before the end of the month. And oh yeah.. I'm closer to making this page ready for prime time (doing regular daily plots, making selectable archives depicting other signal types from other 24 hour periods, maybe even animating them): SETI@home Skymaps - Matt 31 Oct 2007 21:37:15 UTC Happy Halloween! We celebrated here in the Bay Area by having a 5.6 earthquake last night. No big shakes (ha ha) considering the relatively high magnitude. Anybody thinking Californians are crazy for living in such a seismic zone should remember the top two recorded earthquakes in the contiguous US were both in Missouri. I also grew up across the river from the Indian Point nuclear reactor, just outside NYC, which lies right next to a very active fault. Anyway... Somebody complained about the weekly outage time notices on the web being off from reality. They are semi-automated, and one mechanism was created during PST and the other during PDT. As well, we haven't been sticking to exact times lately as we've come to rely heavily on BOINC's fault tolerance, i.e. if it's convenient to bring down servers a half hour early then it's no big deal - the clients should fail to connect and back off gracefully. 
So those messages are under the category of "vaguely informative" or "better than nothing" but at some point I'll tighten up their accuracy. Jeff and I spent a chunk of time finally getting some reasonable load balancing to work such that we don't have to worry about feeder mod polarity issues (see older tech notes - basically round robin DNS doesn't work as expected and one server runs out of work faster than the other). We were lagging on this as actual requester IPs weren't showing up in the apache logs as the proxy was in the way. We discovered "mod_extract_forwarded" but we were using the wonderfully simple and effective "balance" utility which doesn't pass the expected "X-Forwarded-For" header to this module. Then I discovered "pound" which is like "balance" but does add the right headers to make this happen. Long story short: we're currently up with hopefully more equitable load balancing. Outside of that: messing around with beta splitters again this morning (the beta project is mostly Eric's domain which I try to avoid as much as possible) to keep work generation going and test out the new splitter compile. And working on skymap stuff for public web consumption. - Matt 30 Oct 2007 20:24:43 UTC Some small improvements today during the outage. First, just to get the ball rolling in some positive direction, we moved ptolemy (the redundant scheduling server, among other things) out of our secondary lab and into the actual closet. This was an easy procedure, except it wouldn't boot up after the move. After successive reboots, but before utter panic set in, I guessed it was a hardware RAID configuration problem - I pulled out all the superfluous non-boot drives and then it booted up just fine. Phew. Second, we've pretty much given up on bane, which meant its parts were free to cannibalize. So I upgraded the memory on sidious (MySQL replica server) - it was at 16GB, now it's at 24GB.
Sidious has been having more and more trouble keeping up with the master database on jocelyn as of late. Perhaps this will help. Jeff is compiling a new multibeam splitter with additional smarts to account for a new radar blanking signal in the actual data (to help keep radar noise out of the workunits before they are split). We'll test this in beta first - which as it happens ran out of work last night. So workunits generated by this new splitter should be in beta any second now, and then soon in the public project. - Matt 29 Oct 2007 21:17:47 UTC There were minor minor hiccups over the weekend, mostly due to a concentrated bunch of noisy workunits being pushed through the pipeline. Other than that - no big server issues to mention. Some people discovered a single BOINC client creating new, redundant hosts at the rate of one every few seconds. In the grand scheme of things this is no big deal. Bob usually checks for such things every so often and removes the zombie hosts to keep our hosts database as trim as possible. This case was slightly unusual due to the creation rate. I contacted the participant in question and we confirmed an old client on a system running Vista was to blame. - Matt 25 Oct 2007 20:30:00 UTC For some reason I'm in the "deal with boring, nagging sysadmin tasks" zone this week, so that's mostly what I've been working on. Gotta ride the wave when it happens, you know? Nothing really interesting there to report. Writing scripts, updating our UPS plans, cleaning up and improving our internal alert system... stuff like that. Last night the logical log on our primary science database filled up. This is the log that is used by the secondary to keep in sync with the primary. When the log is full, the primary halts all connections as a protective measure, as the secondary will lose track of future updates. What does all this mean for you? 
Well, with the primary effectively offline the assimilators and splitters were blocked, and we ran out of work to send this morning. We spotted this quickly enough, but apparently we need better alerts and some automatic logical log rotation system. We're still getting the feel of this Informix database replication stuff. - Matt 24 Oct 2007 20:56:54 UTC More of the same from yesterday. Getting the SETI gang ramped up on the wiki. When there's actual content I'll announce it. I had to screw around with the BOINC database a couple times. First, there was a minor issue with the my.cnf file, but the server has to be stopped/restarted to enact any changes (which meant quickly bringing the project down and back up). We're also continuing to have mod polarity issues due to DNS round robin not working as it should (one scheduler has plenty of work in its queue, the other gets pegged at zero so clients connecting to it are erroneously told we are out of work, etc.). We need a better solution instead of continually reversing the polarity "by hand" (changing command line options on the feeders and restarting them). We tried "balance" which may ultimately be our best bet, though I don't like that our apache logs only reflect the IP address of the balance server (and not the IP addresses of the connecting clients). Anyway... What else... oh yeah... The connection client type page *was* working, it just was firing up at the same time as the web log rotater, so it was analyzing empty log files. Ha ha. Suddenly some pigeons are nesting right outside the lab. Every so often I feel like I'm being watched, and I turn to find a pigeon standing on the other side of the window next to my desk, staring intently at me ("what is that funny monkey doing in there?"). - Matt 23 Oct 2007 21:58:02 UTC Lots of little things today.
Jeff and I are working on the automated data pipeline in preparation for the data recording to come back on line - where recording, reading, copying to offsite archives, splitting, deleting, etc. happens via a set of automated scripts. Bob is fairly convinced the science database replica is working adequately - we tested various shutdown scenarios and it came back on line after each one. I spent some time working on wiki-fying parts of the SETI@home website. There's been a growing list of planned edits/upgrades to our website that none of us ever got around to, so this has been a long time comin' (and it's far from useful yet). Speaking of lists of things to fix: I got that client-connection-types page working again. It's a permissions problem that breaks every time Linux automatically updates httpd. I grow weary of having to read manuals (very few well-written) every time I need to install/upgrade/fix anything. Things used to be much more intuitive and simple. Nowadays standards are pretty much entirely abandoned and direct contact with actual bits and bytes has been abstracted to death. It's like having a garage full of simple tools (c-clamps, screwdriver, jigsaw, etc.) that you don't have direct access to anymore - the garage is now guarded by Billy who will gladly obtain the proper tool and do whatever you tell him to do with it. Billy doesn't speak English - and the language he comprehends changes all the time - some days he only speaks Portuguese, sometimes Estonian, sometimes Afrikaans - every few months a new language is added to the list. You just want to hang a stupid picture frame in your hallway but there you are, desperately trying to figure out how to say "hammer" in Japanese. Billy doesn't like it when you yell. - Matt 22 Oct 2007 22:34:00 UTC Post weekend update: Things have been running relatively smoothly over the past week.
Bob, Jeff, and I got a few more warm fuzzies from the science database replica server today - we were able to stop/restart both sides without having to reinstall the whole database from scratch! I updated some splitter maintenance code, so that's why all the green dots disappeared from the server status page. I'll fix that eventually. But most of the day was spent working on swapping out a motherboard from a giant 4-processor Xeon server donated by Intel (and they donated the spare motherboard, too). This was the machine called "bane" that months ago I converted into a public web server and then after a week it crashed. Upon powering up it would beep out a cryptic error message and that was it. So I spent half the day today swimming in thermal grease (replacing heat sinks), unplugging, unscrewing, replugging, rescrewing, and scraping my fingers and arms on sharp metal things until the new motherboard was in place. Sure enough, same beeps. Sigh. These are used test systems, so there was no guarantee they'd work. - Matt 16 Oct 2007 21:11:03 UTC Turns out the air conditioner coolant was actually down to near 50% full. After the tech filled it to normal levels this morning the temperatures immediately dropped about 5 degrees Celsius all over the closet. Sweet. They'll check again for leaks in the coming days. The Tuesday outage for database backup/compression went just fine, except we wanted to take this opportunity to get a couple more Sun 220s shut down and removed from the closet, as well as get Eric's hydrogen database server ewen railed up and moved elsewhere in the racks (to improve its air flow). Well, none of that happened - once again despite having actual rails made for ewen they wouldn't fit in any of our non-standard racks in any configuration. Lots of heavy lifting, bolting/unbolting, cabling/decabling, and nothing to show for it. Very frustrating.
And due to routing/apache configuration issues galore we ultimately couldn't shut down our old public web servers. In fact, we had to move klaatu out of the way for what we thought was going to be a successful ewen relocation, which meant turning penguin back on and making *that* a public web server. And then I realized there were libs that only existed on klaatu's disks, so I had to recompile php/apache on kosh/penguin to remove that dependency. All these efforts, and we're basically where we were yesterday afternoon. Except the air conditioner is working for realsies. Maybe sometime this week I'll get back to what I was working on before all this nonsense. Hmm... What was I working on? - Matt 15 Oct 2007 22:50:59 UTC So the past two days we were fighting with what to do about sudden rising temperatures in our server closet. This sort of thing happens every year around this time, as the regular lab air conditioner which "assists" our closet by keeping things extra cool in the sunny summer obviously doesn't do the same as we enter foggy fall. We also have some nagging tiny imperceptible coolant leak so we need to recharge that every so often. In any case, the systems were getting hotter, so we ultimately had to shut everything down (the idle disks and CPUs generate far less heat). This morning the right people were called to inspect the situation. Turns out our air conditioner was more or less okay (we'll add more coolant soon) but the lab air conditioning system did konk out over the weekend. Apparently the lack of assist - even the slight amount during this wet weekend - pushed us over the edge. Before we figured this all out we had a meeting and planned on several courses of action to remove as many aging, less efficient systems from the closet. I planned to get three systems out by the end of the day (download server, and the two public web servers) but due to annoying little nested problems I've been only able to g |