Technical News - 2008

The news items below address various issues requiring more technical detail than would fit in the regular news section on our front page. These news items are all posted first in the Technical News discussion forum, with additional comments/questions from our participants.

(available as an RSS feed.)

That\'s it - the last tech news update (from me at least) for 2008. I\'m already looking forward to 2009. Maybe we\'ll get some or all of the above done.

- Matt
' ), array('29 Dec 2008 23:56:24 UTC', 'One short holiday week is behind us, now here comes another one.

We did fairly well over the weekend, considering we were pretty much maxed out the whole time. The assimilator queue finally drained, thanks to splitters starting to chew on raw data files physically located on the new raw data storage server (as opposed to located on the same server as the science database), but also thanks to the validator queue falling behind.

In times of low resources we do have some knobs to turn to help squeeze more juice out of our embattled servers. Sometimes you have to roll up your sleeves (or, in this case, pull out a calculator) and determine which processes need which resources, and which are claiming too much. After some investigation it was clear this time around we were giving httpd too much - and this is a tunable we have to adjust every so often, depending on how many people are connecting at any given time, and for how long - otherwise you have too many httpd listeners hanging out doing nothing and eating up valuable memory/cpu. Anyway, long story short I reduced the number of validators from 6 to 4, moved the validator logs to a different filesystem (to reduce i/o contention), and vastly reduced the number of httpd listeners. So far so good - that queue is draining (and therefore the assimilator queue is inflating again).
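
The calculator step above can be sketched roughly as follows. This is only an illustration of the arithmetic: the function name and all of the numbers are hypothetical, not measurements from the project servers, and real per-listener memory use varies with load.

```python
def max_httpd_listeners(total_ram_mb, reserved_mb, per_listener_mb):
    """Return how many httpd listeners fit in the RAM left over after
    everything else (validators, database clients, etc.) is accounted for."""
    if per_listener_mb <= 0:
        raise ValueError("per_listener_mb must be positive")
    available_mb = total_ram_mb - reserved_mb
    if available_mb <= 0:
        return 0
    return int(available_mb // per_listener_mb)


if __name__ == "__main__":
    # Hypothetical figures: 8 GB machine, 6 GB reserved, 16 MB per listener.
    print(max_httpd_listeners(total_ram_mb=8192, reserved_mb=6144, per_listener_mb=16))
```

The result is an upper bound to compare against the configured listener limit; it says nothing about how many listeners are actually busy at any moment.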

We will have the usual outage drill tomorrow, followed by another set of "days off."

- Matt
' ), array('24 Dec 2008 21:06:54 UTC', 'We seem to have gotten beyond the current period of high demand and back into a realm of working within our limited resources. Queues are filling or draining in a positive direction, albeit slowly. I did finally write a script to compute how many results passing through our validation queue are CUDA processed - currently roughly 3%. And speaking of that, I am now aware of the CUDA validation problems mentioned in other threads and I passed them along with screenshots, info, etc. to the proper authorities (i.e. Eric and Jeff).
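
A minimal sketch of the kind of counting such a script does, assuming each result record carries a hypothetical "plan_class" field that mentions cuda for GPU work. The field name and record layout are assumptions for illustration, not the actual schema or the actual script.

```python
def cuda_fraction(results):
    """Return the fraction of results whose (hypothetical) plan_class
    field mentions cuda; 0.0 if there are no results at all."""
    total = 0
    cuda = 0
    for result in results:
        total += 1
        if "cuda" in result.get("plan_class", "").lower():
            cuda += 1
    return cuda / total if total else 0.0


if __name__ == "__main__":
    sample = [{"plan_class": "cuda"}, {"plan_class": ""}, {}, {"plan_class": "CUDA23"}]
    print(cuda_fraction(sample))
```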

At this time of year I do a lot of prep for upcoming server projects without enacting anything too crazy, lest I break anything that\'s currently working just fine. For example, I\'m building more RAID mirror pairs on the workunit storage server, but won\'t actually add them until the new year. We added enough space yesterday to hold us over until then. I\'m also cleaning up the lab, labelling spare parts, placing things in boxes, organizing dozens of O\'Reilly books currently stored inefficiently in stacks, etc. We also tend to "store up for the winter" - at some point soon we\'ll pull up a bunch of data from HPSS to keep splitters happy until the new year.

Thanks for all the holiday wishes/greetings, and please accept my likewise sentiments. For those thinking I\'m going above and beyond the call of duty by working during vacation, don\'t give me too much credit. My vacation comes later.

- Matt

' ), array('23 Dec 2008 23:00:32 UTC', 'Today we had our weekly outage for mysql database backup, maintenance, etc. This week we are recreating the replica database from scratch using the dump from the master. This is to ensure that the crash last week didn\'t leave any secret lingering corruption. That\'s all happening now as I type this and the project is revving back up to speed.

Had a conference call with our Overland Storage connections to clean up a couple cosmetic issues with their new beta server. That\'s been working well and is already half full of raw data. Once the splitters start acting on those files the other raw data storage server will breathe a major sigh of relief. I was also set to (finally) bump up the workunit storage space yesterday using their new expansion unit - but waited until their procedure confirmation today lest I did anything silly and blew away millions of workunit files by accident. The good news is that I increased this storage by almost a terabyte today, with more to come. We have officially broken that dam.

I also noticed this morning the high load on bruno (the upload server) may be partially due to an old, old cronjob that checks "last upload" time and alerts us accordingly. This process was mounting the upload directories over NFS and doing long directory listings, etc. which might have been slowing down that filesystem in general from time to time. I cleaned all that up - we\'ll see if it has any positive effect.

Jeff\'s been hard at work on the NTPCkr. It\'s actually chewing on the beta database now in test mode. We did find that an "order by" clause in the code was causing the informix database engine to lock out all other queries. This may have been the problem we\'ve been experiencing at random over the past months. Maybe informix needs more scratch space to do these sorts, and it locks the database in some kind of internal management panic if it can\'t find enough. Something to add to the list of "things to address in the new year."

- Matt
' ), array('22 Dec 2008 23:32:27 UTC', 'Okay, well, it\'s not like we didn\'t see difficulties coming with the release of a client that could potentially improve our processing by 10x. But it hasn\'t been all that bad, either. Due to various reasons, mostly excessive i/o, the assimilator queue swelled, which caused the workunit storage to reach maximum capacity, which in turn constrained the splitters. This is still the case, more or less - though I am working to increase the workunit storage which will help break one of our dams. I already employed some of the Overland Storage for raw data images, which will eventually break another dam or two. There\'s still our network bandwidth limits, though... We\'re just crossing bridges as we get there.

In any case, I did add a new photo album of our server closet for the nerds in our audience.

Schedules will be erratic for the holidays, as you can imagine.

- Matt
' ), array('18 Dec 2008 22:41:17 UTC', 'Moving onward and upward. More and more people are switching over to the GPU version of SETI@home and Dave (and others) are tackling bugs/issues as they arise. As predicted we\'re hitting various bottlenecks. For starters, increased workunit creation (and current general pipeline management since we have full raw data drives that need to be emptied ASAP) has consumed various i/o resources, filled up the workunit storage, etc. On this front I\'m getting around to employing some of the new drives donated by Overland Storage. The first RAID1 mirror is syncing up - may take a while before that\'s done and we can concatenate it to the current array. Might not be usable until next week.

Also, as many are complaining about on the forums, the upload server is blocked up pretty bad. This is strictly due to our 100Mbit limit, and there\'s really not much we can do about it at the moment. We\'re simply going to let this percolate and see if things clear up on their own (they may as I\'m about to post this). Given the current state of wildly changing parameters it\'s not worth our time to fully understand specific issues until we get a better feel for what\'s going on. Nevertheless, I am working on using server "clarke" to configure/exercise bigger/faster result storage to put on bruno (the struggling upload server) perhaps next week.

As for the mysql replica, it did finally finish its garbage cleanup around midnight last night, but then couldn\'t start the engine because the pid file location was unreachable (?!). Bob restarted the server again, which initiated another round of garbage cleanup. Sigh. That finished this morning, and with the pid file business corrected in the meantime it started up without much ado - it still has 1.5 days of backlogged queries to chew on, though.

- Matt
' ), array('17 Dec 2008 23:50:51 UTC', 'So it\'s official: you can now run SETI@home on your NVIDIA GPU. Of course they\'re still working out the kinks, and it has yet to be seen what effects (immediate and long term) this will have on our servers and known bottlenecks. Such things are quite unpredictable, given the dizzying long list of variables.

In order to keep our bandwidth from going bonkers due to all the new client downloads, we employ the use of Coral Cache. This is all well and good, except that some ISPs out there firewall http redirects, which means a tiny subset of users cannot download these new clients. This is unfortunate, as we have no choice because we can\'t handle the new client downloads ourselves. So these few users will suffer a bit until we can remove such caching.

Our replica server never did recover from the outage yesterday, causing stats of various kinds to be jammed for the past day or so. This morning we found scary log messages and we couldn\'t even shut mysql down gracefully, so we had to kill the process and reboot the machine. It\'s been in really slow recovery mode all day. When finished there\'s a good chance it\'ll be out of sync from the master and will have to be rebuilt from scratch anyway. Sigh. In the meantime, I\'m pointing all queries at the master, which is loading it down a bit and causing us some minor grief (running out of work to send, for example).

- Matt' ), array('16 Dec 2008 23:43:25 UTC', 'First and foremost, it\'s snowing outside. This doesn\'t happen very often around here.

So today was an outage day - with one unexpected surprise: a visit from Court, systems administrator extraordinaire here in our lab a couple years back. Nice to see him again and catch up.

The standard outage stuff was, well, standard. Allow me to remind our new readers: Weekly we "compress" the mysql databases (which bloat from continual inserts/deletes all week, much like disk fragmentation) and back them up. These databases contain all the user/host/team info, and who is working on which workunits - basically all the generic volunteer computing stuff. The science is all kept in a separate database (using an Informix engine) on a different server altogether. The latter doesn\'t suffer from the same bloat, so we can do simple no-frills backups to disk while the database is live, without much ado. In theory we could do the mysql dumps live as well, but we choose to take things down to ensure the master/replica databases are in sync, and allow us some regular downtime to take care of pending server tasks. For example...

Today we finally turned off the old Network Appliance - a NAS server which worked fast and wonderfully, but (a) was only 3 Terabytes raw storage, (b) took up one third of our server closet, and (c) the individual disks have been failing at an increasing rate. We moved all of its functionality elsewhere already, so it was time to say goodbye. Jeff and I tore it apart shelf by shelf. Any sadness was lost in the joy of now having a completely empty rack full of completely useful shelves (we\'ve had ridiculous problems finding racks/shelves that matched in the past). It\'s kind of funny the most useful part of that system at this point was its racks/shelves. We put all the recently donated Overland Storage servers into this now-empty rack (containing 10 Terabytes worth of storage), as well as anakin (the scheduling server), and there\'s still room for a lot more stuff. We still have to configure/employ all this new storage, but it\'s all plugged in and on line at least.

Recovery from the outage is usually painful. Today seems a little worse. Part of that is our work-to-send queue is at zero and the splitters are waiting for some space to free up before creating new work. I also think server "bruno" is having result storage issues slowing things down (people are connecting okay, when they can push through the usual traffic jam). We might need to reconfigure/rebuild that RAID array sooner rather than later.

I brought the mini video camera to make a quick video tour of our server closet, but the noise of all the fans is so loud it\'s basically worthless. I did take some low-quality still photos though - I\'ll get those up on the web someday.

- Matt
' ), array('16 Dec 2008 0:10:30 UTC', 'Happy Monday, one and all.

So let\'s see... things are progressing in a generally positive direction. Our conversion from multi- to single-dimensional indexes on the result table in the BOINC/mysql database seems to have been a success, though I\'m still not sure if it\'s helping all that much just yet. In any case, we may continue doing the same on other tables. We might get the whole database, indexes and all, fitting entirely in memory. We don\'t need to (we\'re doing just fine with whatever level of paging is currently happening), but it\'d still be nice. In any case, at least we proved that we don\'t need to create extra unwieldy multi-dimensional indexes to do specific merges - mysql 5.x and up will figure out how to do the merges on its own.

Jeff and I plan to do some big steps towards moving things in and out of the server closet tomorrow. I\'ll try real hard to remember to bring a camera. If all goes well we\'ll at least have (a) more free rack space, (b) more available power, and (c) more workunit storage on-line (one less bottleneck to worry about!).

Thanks to those who\'ve been beta testing the CUDA version of the SETI@home client. Sorry if I confused people by vaguely mentioning this in my last missive. Once this is formally released I\'m sure we\'re going to exercise new and old bottlenecks, but it will be a huge step in the world of volunteer computing. We may run out of work more often. Depending on your perspective this may be seen as a "good problem."

And we did finally get the donation mass e-mail rolling out late last week. I really appreciate the generosity of the SETI@home community, especially in these dark economic times.

- Matt
' ), array('11 Dec 2008 0:21:20 UTC', 'During the wee hours this morning our upload server (bruno) froze. We are still unsure why, but recovery was a comedy of errors. Jeff was already about to power cycle it (having little other choice given the unresponsive console) when I got in around 8am. After rebooting, bruno failed to mount its result storage drives due to some kind of mdadm mismanagement. This forced us into a read-only please-fsck-your-drives mode. The drives, outside of pointless resyncing due to the hard power cycle, were fine - they didn\'t need to be fsck\'ed. Still, since root (/) was read-only we couldn\'t edit /etc/fstab to prevent this from happening again upon every reboot.

So I tried to get it into a real single user mode to make such an edit - all I wanted to do was comment out that one mount line. However, thus started a series of about 8 consecutive reboots, each taking about five minutes, and all wastes of time due to a typo or an unresponsive kvm. I ultimately gave up and booted from DVD in "rescue mode" where I could finally make the fstab edit. Finally all was well with the mount (which I did on the command line), but then I had all kinds of network errors with the system. More tweaks, more reboots... Long story short this server is being held together with figurative duct tape at the moment. We\'ll get it all sorted out later.

Jeff and I also worked together to get the remaining pieces of the "donation drive" in place, such as it is. I\'m sending test e-mails out now, and will probably start sending in earnest on Friday. Please send all questions/comments about our fundraising efforts to the principal investigators (Dan, Dave, Eric). I am simply implementing the technical aspects of this endeavor, though I would like to point out we finally updated the text on the plans page.

By the way.. did anybody notice this?

- Matt' ), array('10 Dec 2008 0:31:47 UTC', 'Tuesday outage day (mysql database backup/maintenance). Today Bob took care of the final step of the "single vs. multi-dimensional indexes" exercise. That is, he dropped all the multi-dimensional indexes on the result table in the main project on the master database and we crossed our fingers. Looks like mysql is neatly, or smartly, parsing queries and merging single indexes as needed just fine. The whole point was to reduce the number of indexes we need, and thus keep a slightly smaller footprint in memory, which in turn helps performance.

The raw data pipeline has been a major headache, if only because our hot-swap enclosures have been giving us grief. Jeff and I determined one of them is flat out broken, so that reduces our current maximum throughput by half until we get it replaced. This isn\'t a disaster, as we pretty much never reach half of our maximum throughput anyway, but still a slight inconvenience as we have to more rigorously schedule drive swaps.

Gearing up for the donation drive, I discovered our mass mail server lost its DNS entry for some reason. The lab DNS master replaced it, but not before I had turned sendmail on an hour earlier and started my tests, thus causing all kinds of circular bounces that clogged the entire lab\'s mail queue with literally thousands of e-mails (maybe tens of thousands). It\'s still draining as I type this. Don\'t blame me - I didn\'t remove that DNS entry.

We\'re another step closer to removing that NetApp box. In fact, it\'s out of the automounter maps, everything on it is sym-linked elsewhere or chmod\'ed to 0, and I scoured all the other servers to remove sym-links to it. Part of this project meant resurrecting server "clarke" (donated many months ago) to be a CPU server (or otherwise internal use) as it will soon have room in the closet. It had a stale configuration at this point which needed refreshing.

No news on the Overland boxes - though one question was: why not combine them into one big box? Well, we have two separate needs: workunit storage, and raw data storage. The former we already have, and it works great - we just need more room - so we\'ll plug in one of the new expansions and get that room. The latter we don\'t really have and would like to keep on separate volumes (as you read the raw data and write out workunits, so you don\'t want the I/O to compete as it would on shared drives). Also.. part of the deal is we\'re going to continue helping them beta test their latest OS, which they have on the second head unit they gave us. So in a sense we\'re obliged to have two separate entities - the raw data on the beta test head/expansion and the workunits on the known-reliable head and additional expansion. Other question: form factor - the heads are about 2U and the expansions are about 3U. We have 2 of the former and 3 of the latter now. We\'ll have room for them eventually. I will update closet photos when we do the next major move (next week, I hope?).

- Matt
' ), array('9 Dec 2008 0:45:19 UTC', 'Happy Monday, folks. Things were sort of okay over the weekend. The replica mysql database got stuck on Sunday - the usual drill - I logged in and quickly restarted it. The science database, however, also choked. This happened on Friday. Jeff\'s been doing some NTPCkr testing that would have gone all through the weekend except the excess I/O ate up all the informix threads, thus causing the splitters/assimilators to slow down and run out of work to send. Luckily I caught this before bedtime that night and broke that dam. Jeff\'s looking into why that happened.

In good news, Overland Storage (formerly Snap Appliance, or Adaptec), donated 10 Terabytes of NAS storage in the form of a new "head" and two expansion units. One of the expansion units we\'ll try to get on our current workunit storage server ASAP (so we stop running out of room to split new work), and the other stuff we\'ll make into a new temporary (possibly permanent) raw data reserve so we can do the big shell game and convert all the science database devices from RAID5 to RAID10. Thanks, Overland!

- Matt
' ), array('5 Dec 2008 23:12:26 UTC', 'Happy Friday! I don\'t really have much to add to the proceedings.. today was a lot like Wednesday when last I was here at the lab. Time spent on more filesystem shell games, compiling/running code, and working with Josh to figure out some weird discrepancies between beta/public Astropulse results.

I should point out I added a couple more stats to the server status page, those being mysql queries/second, along with the number of seconds the replica is behind the master. Maybe this will help clarify when things go awry, though I know sometimes more information obscures the pertinent stuff.
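
For the curious, a replica lag figure like this can be pulled from the Seconds_Behind_Master field that mysql reports in its SHOW SLAVE STATUS output. The sketch below only parses text that has already been captured in the one-field-per-line form; how that text gets fetched, and how the status page really does it, are left out and should be treated as unknown here.

```python
def seconds_behind_master(status_text):
    """Extract Seconds_Behind_Master from SHOW SLAVE STATUS text.

    Returns an int, or None if the field is missing or not a number
    (mysql reports NULL there when replication is not running)."""
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Seconds_Behind_Master":
            value = value.strip()
            return int(value) if value.isdigit() else None
    return None


if __name__ == "__main__":
    # Made-up sample text, not real output from our replica.
    sample = "Slave_IO_Running: Yes\nSeconds_Behind_Master: 5321\n"
    print(seconds_behind_master(sample))
```

Returning None rather than 0 for a stopped replica keeps "not replicating" from being displayed as "fully caught up."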

I foresee a couple dams breaking in the very near future, resulting in massive server closet updates/upgrades including, but not limited to: shutting down the incredibly solid (but physically large and logically small) NetApp rack to be replaced by a 3U system with twice the storage, thus making room to (finally) put vader and sidious in the closet, along with several UPSes, and another CPU server, clarke, which has been waiting for too long to be employed. Sometimes these things have to happen serially. Ducks in a row and all that.

- Matt
' ), array('3 Dec 2008 23:24:42 UTC', 'Ah, Wednesday. It\'s usually today when Jeff and I swap our "focus." Early in the week I\'m aimed at hardware/sysadmin and he\'s deep in software development, and then later in the week we switch. This is an attempt to make sure we both get some programming time while the other person is taking the helm. He\'s mostly working on the NTPCkr, and me on radar blanking stuff. Both projects are slow going.

There are a lot of chores we both manage. Maintaining the raw data pipeline eats up an astonishing amount of time so we swap those duties as well. Simply "walking the beat," chasing down alerts, fixing hung processes and broken services, could easily eat up a whole day every day if we\'re not careful. Today a huge chunk of time was spent by me moving home accounts off the old server onto the new one (and cleaning up a bunch of old garbage in the process). Also lost an hour with Jeff trying to figure out why his subversion repository was out of sync in such a way that he couldn\'t check changes in. I did get a moment to get the latest version of the software radar blanking signal generator to compile - and I just started a test run.

- Matt
' ), array('2 Dec 2008 23:27:39 UTC', 'Typical Tuesday outage day today (for database maintenance), and currently we\'re in the midst of smooth recovery from that, more or less. Things sometimes seem weirder on the server status page than they actually are, as the replica database (where we collect the stats) is too far behind the master. Sometime soon I\'ll add some stats to show this, hopefully thus reducing confusion (and fix the broken XML stuff while I\'m at it).

Major improvements during the outage: Jeff put in some freshly compiled servers that went into beta last week, Bob rebuilt an index that has been missing on result for some time (used for occasional statistics Eric checks by hand), and I changed data selection priority to match between both Astropulse and Multibeam splitters (so they chew on the same files at the same time - and make it easier to determine who\'s splitting faster).

I\'ve also been busy with other sysadmin-y tasks. Moving accounts around (still), kicking one of our internal diagnostic cronjobs that has been hanging on stale lock files in /var/lib/rpm, data pipeline management (including shipping empty drives to Arecibo), and messing around with FC10.

- Matt
' ), array('1 Dec 2008 21:29:48 UTC', 'Welcome back from the holiday weekend, those who actually had a holiday weekend. Things were more or less calm around here. However thanks to our predictable nemesis autofs some things got a little murky yesterday. The mysql replica lost contact with the master - a regular occurrence - but we didn\'t get the warnings as mail was hung on a dead mount. Now that the replica has fallen behind (though it is catching up) the stats/server pages are a bit behind as well. This will clean itself up in due time. A few hours perhaps.

Otherwise work/data seems to be flowing normally, or normal enough. Dave incorporated some new scheduler logic (not sure what offhand) that is being tested in beta, probably rolled out to the public tomorrow. I\'m bouncing around between data management, radar blanking code, and OS upgrade projects today.

- Matt
' ), array('26 Nov 2008 21:30:53 UTC', 'Oops. My web configuration changes yesterday afternoon seemed to work at first (I checked the logs, tested it myself, etc.) but something bad got exercised, probably at the next web log rotation (which quickly stops/starts the web server) which then made it impossible for people to see the home page for a couple hours. Instead they got a broken link to our subversion page (an interface to our freely available source code). My bad. I fixed this as soon as I noticed it later in the evening.

Later on we had some weird behavior on the scheduling server (anakin) where it ran out of memory due to too many httpd/cgi processes running. It actually recovered on its own around midnight, then got choked up again. Nothing really changed in either our configuration or our executables, so we restarted it again this morning with the "ceiling" process limit values lower than before. However I noticed the fastcgi\'s were growing as they stuck around. A memory leak perhaps? Dave pointed out we have been doing client logging the past couple of weeks (which we usually don\'t do). Maybe that part of the code contains a leak - he\'s checking. Maybe that combined with the short period of mysql query logging slowing everything down caused the scheduler fastcgi processes to bloat. Not sure exactly, but we turned client logging off, and I added another flag to the fastcgis to force them to exit from time to time regardless of error just to make sure they don\'t bloat for too long and eat up RAM. I also finally bit the bullet and figured out our broken/wonky web log rotation system given all the above and fixed all that (I think).
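
The "exit from time to time" idea is a common way to cap the damage from a slow leak: a worker handles a bounded number of requests and then quits so that something else can start a fresh process. Here is a toy sketch of that pattern. It is not the scheduler code or the actual flag; the names are hypothetical and a supervisor that restarts the worker is assumed.

```python
def serve(requests, handle, max_requests):
    """Handle requests until the cap is reached, then return the count.

    Returning lets the process exit so an (assumed) supervisor can start
    a fresh one, which bounds how long leaked memory can pile up."""
    handled = 0
    for request in requests:
        handle(request)
        handled += 1
        if handled >= max_requests:
            break
    return handled


if __name__ == "__main__":
    seen = []
    print(serve(iter(range(10)), seen.append, max_requests=3), seen)
```

The cap trades a little restart overhead for a hard ceiling on per-process growth, which is usually a fine deal while the leak itself is still being hunted down.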

Obviously I didn\'t get dinged with jury duty this time around, though last night the automated reporting instructions hotline told me to call again today at 11am for further instructions. So I did, but then the service kept saying it was "unavailable at this time." You know, I tried. Anyway.. Happy day of turkey. Actually I think we\'re having goose this year. Jeff and I will both be around and checking in from time to time (as usual).

- Matt
' ), array('25 Nov 2008 23:35:36 UTC', 'Happy Tuesday. We had the usual outage rigamarole today and should be recovering from that in due time. Right after the backup was finished we restarted mysql with full query logging turned on. We knew this would choke the server a bit, and would just be on temporarily. After about a half hour we had over a million queries in the log, so we brought everything back down and turned logging off. We\'ll parse this log file, and perhaps others we generate over the next 24 hours, in order to find pesky unoptimized queries, anything that would die if we remove all multidimensional indexes, or queries running far more often than we expected.
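
One simple way to dig through a log like that is to reduce each query to its "shape" and count the shapes, so the most frequent ones float to the top. The sketch below assumes the queries have already been extracted from the log as plain strings, one per entry; it only normalizes whitespace, case, and numeric literals (quoted strings are left alone), and none of it is the actual tooling used here.

```python
import re
from collections import Counter

# Standalone integers only; digits inside identifiers are left untouched.
NUMBER = re.compile(r"\b\d+\b")


def normalize(query):
    """Collapse whitespace, lowercase, and replace numeric literals with N."""
    return NUMBER.sub("N", " ".join(query.split()).lower())


def top_queries(queries, limit=10):
    """Return the most common normalized query shapes with their counts."""
    shapes = Counter(normalize(q) for q in queries if q.strip())
    return shapes.most_common(limit)


if __name__ == "__main__":
    sample = [
        "SELECT * FROM result WHERE id = 12",
        "select * from result  where id = 99",
        "SELECT * FROM host WHERE id = 3",
    ]
    for shape, count in top_queries(sample):
        print(count, shape)
```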

Also during the outage I moved some big directories around - more NAS shell games. Other than that I\'ve been reconfiguring some more web server stuff (internal use pages) and trying to maximize the raw data pipeline plumbing to get as much work online as possible. It doesn\'t help that a lot of our raw data drives are showing weird signs of corruption. Don\'t worry - we do checksums at every important transfer to ensure the data are sound, and the splitters cannot operate on garbage (there are keyword strings occurring regularly throughout the files). Nevertheless, we\'re having to throw away some files, which is sad. My spider sense tells me this has to do with our general SATA enclosure mounting/unmounting woes. For example, we\'re finding drives that are 500GB thinking they are 750GB when mounted. Was this because a drive previously on that mount point was 750GB and some bookkeeping bits haven\'t been cleared? I dunno, but I\'m sure this isn\'t good.

In a couple hours I get to call a number where an automated voice will tell me if I have to attend jury duty tomorrow or not. I get dragged in for potential jury duty an astonishing amount (pretty much the legal maximum) considering I never actually get selected for trial, and never will.

- Matt
' ), array('25 Nov 2008 0:04:53 UTC', 'Welcome back from the weekend, which was actually relatively painless except for the usual set of automounter issues. We\'re close to giving up on all that. Today was a day filled with lots of chores - including trying to maximize how much raw data we have on line for splitting over the long weekend.

We did have a server hiccup today due to an administrative script corrupting an /etc/passwd file (thanks to aforementioned automounter problems). It\'s hard to maintain a server if the "root" user disappears from the passwd file. So I had to boot from DVD to fix the corrupt file. Just so happens this was the server I was having BIOS issues with last week, and they happened again! Without my consent it reset the boot drive sequence, causing a little bit of annoyance and grief. Eric and I are thinking there\'s a dead battery involved.

Reminder: this is a "short week" for us, thanks to the turkey day.

- Matt
' ), array('21 Nov 2008 22:29:21 UTC', 'Let\'s see. Do I actually have any news to report...? Among other things today I\'ve been working on some web site configuration cleanup, the continual chore of raw data pipeline management, and discussing the general Astropulse game plan with Josh. I think when Jeff and Eric are in the lab we\'ll all figure out what our exact plans are, and what we need to do to enact these plans. Generally I keep myself out of as many loops as possible for my own sanity, but I have to ramp up on Astropulse sooner or later. It\'s no longer a "proof of concept" kind of project handled completely by Eric/Josh. Anyway this is the kind of day where I take care of a small subset of the little things that need fixing - it\'s been a long week and my brain is unable to deal with big projects, nor do I want to mess around with project critical stuff (especially as I am the only person on the "systems team" physically here at the lab right now and we\'re heading into the weekend).

Oh yeah.. keep in mind we are entering holiday hijinx time. Next week will be "short" (even shorter if I get called into jury duty the day before Thanksgiving).

- Matt
' ), array('20 Nov 2008 0:38:26 UTC', 'Today was a day mostly spent tracking down little problems involving BOINC, Astropulse beta, the Astropulse 5.00 release, the beta SETI@home splitter, raw data pipeline flow, data drives reporting wrong capacities to the OS... Lots of bizarre problem solving.

As for Astropulse 5.00 and an "official" statement (which was requested in the last thread) I just have to step back a moment and tell everybody that these threads are for entertainment purposes only and nothing I say should be considered official. I just work here and happen to suffer from hypergraphia. I do understand this is the most dynamic form of news on this site and so I nag the others to add content here and elsewhere. They never do, and I end up looking like the de facto spokesperson.

In practice, due to the incredible web of resource dependencies behind the scenes here, I have to keep tabs on pretty much every aspect of the whole BOINC/SETI@home/Astropulse family of projects since each program, server, budget constraint, etc. affects everything else. Jeff has to do the same. Nevertheless we can only keep track of so much, and what I believe is going to happen doesn\'t always necessarily happen.

That said.. from what I know and understand Astropulse queues did drain last week and the new client was released on Friday or Saturday. The vader choke hampered this a little bit, but shouldn\'t have affected progress on this front that much. Josh is a little puzzled about current results, or lack thereof. That was part of the problem solving today. I still have no real handle on the current Astropulse plans - just temporarily offering my mysql/BOINC expertise to the "Astropulse committee" (Josh and Eric) and then getting back to work on something else.

- Matt
' ), array('19 Nov 2008 1:59:09 UTC', 'Had the usual outage today (weekly mysql database reorg/backup). I also took this opportunity to do what I mentioned yesterday: the remaining last bits of NAS-box shuffling. This included breaking a (currently unused) RAID5 array, putting in bigger drives, and rebuilding it all as a RAID10. However, I quickly came to find the command line utility doesn\'t allow me to delete single logical drives - it\'s all or nothing. Not wanting to destroy the root logical drive, I was forced to go into the RAID BIOS, which meant the server (and the web site) had to be brought down temporarily.

Temporarily became a couple hours - after doing the reconfigure the regular BIOS surreptitiously changed the boot drive sequence. This meant the system wasn\'t booting after that, leading to much confusion and panic (and many long, slow reboots) until I discovered this tricky, pointless switcheroo. Anyway, everything was fine after that and I brought up the new partition and started moving things back to where they are supposed to be.

This included the beta download directory, which uncovered a "bookkeeping" error on our part which meant beta downloads of the new client were broken for the past few days. Oops. That should be fixed now.

We turned on query logging when bringing the project back up in order to do an inventory and determine any need for more/different indexes. I had to bounce the project again later in the afternoon to turn that logging back off (it eats up too much i/o to just leave on indefinitely). I also spent a lot of time helping the CASPER gang reconfigure their main web server. I\'m also supposed to be working on donation drive stuff. Oh well. I\'ll get to it tomorrow.

- Matt' ), array('18 Nov 2008 0:12:43 UTC', 'So vader went down again over the weekend. Actually just its ethernet connection went down. We\'re blaming Network Manager. Anyway, we remotely moved enough services around to get beyond vader missing from the server fold, and got everything working again this morning once back in the lab.

I don\'t have much time to report on all the other mundane details that occupied the rest of my day. Tomorrow is a standard outage day, during which we hope to get the bulk of the remaining NAS-box shuffling done - one more step towards major server closet overhaul.

- Matt
' ), array('14 Nov 2008 21:49:19 UTC', 'Happy Friday. After the Wednesday outage we had some splitter issues - Jeff incorporated new raw data reading logic that changed in our standard internal data handling libraries. This didn\'t break in tests, but broke in reality. Actually I\'m not sure if it actually broke as much as misbehaved. In any case, the splitters tore through all our raw data and called the files "done" so we ran out of work to send for a moment there. I "un-did" these files and Jeff fell back to the old splitter. The project of debugging this is still open as far as I know. In the meantime, the Astropulse splitters are disabled for a reason - Eric and Josh want to fully drain all those workunits before releasing another client.

Meanwhile since we still haven\'t gotten our shipment of the latest data drives we had to pull data up from our archives to ensure we have enough work to send over the weekend. These are raw files that were surplus at the time and therefore unsplit (and "saved for a rainy day," like today).

We had some more automounter issues, though they are happening far less frequently than before. I added some alerts so Jeff and I will get more warning when such things go awry on any particular system. I also cleaned up the server status page some more. Other than that most of my time has been spent on shuffling big bunches of data around like some shell game in preparation for optimizing file systems (probably early next week). This is mostly internal stuff and has little to do with public server performance.

- Matt
' ), array('12 Nov 2008 23:39:34 UTC', 'Let\'s see.. we had our weekly Tuesday outage today, since yesterday was a holiday. This meant the database had an extra day to get more bloated, which is perhaps why several queues started falling behind. Actually that probably has to do with our workunit storage server filling up again, which causes general backend malaise. So we were low on downloads for a while there before the outage.

Good news is that I found one reason why our apaches were randomly failing - kind of stupid, actually - there were two httpd log rotation scripts in occasional competition with each other. I think I cleared that up, but automounter/nfs problems are still creeping in there and wreaking havoc. I also finally employed new file_deleter logic to split the deletes between results and workunits, so they can run specific jobs on more appropriate machines. Hopefully that will help speed up the queue drainage.

I also added a tiny bit more logic to the server status page to help make clear which data files are being acted upon by which application. On that front, we were expecting raw data from Arecibo to show up today. It didn\'t. However, I\'ve been pulling up old raw data files from HPSS for Josh\'s pulsar testing, and found these haven\'t been chewed on by Astropulse yet, so I added those to the data queue. So you\'ll be gettin\' Astropulse work soon enough.

As for the project getting all of our stuff off the Network Appliance rack (to free up major amounts of space/power in our closet) we continue to make sloooow progress. Today we moved the boincadm account to this new machine, and so far so good - response times are still pretty snappy. Or snappy enough. The web page server does a lot of random access reading/writing in this directory, so it would be obvious if there were an i/o problem.

For what it\'s worth my schedule is changing a bit over the next month or so, and I\'ll be here on Fridays instead of Thursdays. This has nothing to do with anybody except those who expect my tech reports on specific days.

- Matt
' ), array('10 Nov 2008 23:24:13 UTC', 'Reminder: Tuesday (tomorrow) is a national holiday. We\'ll be having our weekly outage on Wednesday.

It\'s been a bit of a rocky weekend. A rocky week, actually - since the last regular Tuesday outage the CPU load on anakin (the scheduling server) has been kinda high. Like around 100. The obvious answer - turning on client logging for debugging on Tuesday - wasn\'t so obvious at first as we thought we vindicated that and moved on to finding other possible problems which were all red herrings. Eventually we were brought back to client logging and Dave made an optimization fix which we tested in beta this morning, and Jeff applied to the public project around noon. This brought the load back down to 1. I guess the fix worked.

Our raw data pipeline management still needs work. Lots of bottlenecks make it impossible to keep a steady, automatic flow of work. In a perfect world it would be simple and serial - data drives sent up here, files brought online and acted upon by both multibeam and astropulse splitters, copied down to archival off-site storage, and then deleted. However each step takes a rather long time (hours per file if not days), and storage is limited, so we have to parallelize as much as possible. One possible effect of this, and one we\'re seeing now, is that we currently don\'t have raw data available for astropulse to split. We\'re loath to copy data back up from the archives unless we really have to. We still might do so, but we are expecting a new shipment of drives directly from Arecibo today or Wednesday so astropulse at least will be fed then.

The bright side is this is now very clear on the server status page now that I made some updates to finally split out database counts/rates and splitter activity per application. There\'s still more updates to be done, but now it\'s much easier to tell what\'s going on between the two.

- Matt
' ), array('5 Nov 2008 21:32:17 UTC', 'At 7:30pm last night the scheduler apache server got hung up - probably from all the election night excitement. These apache servers need a kick fairly often. Unfortunately they die in various ways due to various things, so automating the checking of certain pulses doesn\'t always help - in fact such things usually make systems more complicated and unpredictable. In the case last night it failed during log rotation which issues an "httpd restart" - this time the head-in process didn\'t die, so port 80 got locked up. I had to log in and kill the zombie httpds by hand before restarting apache. Not a big deal, though it got missed for a couple hours as it was timed perfectly with the entire country busy watching the news.

- Matt
' ), array('4 Nov 2008 23:26:50 UTC', 'I don\'t know if you heard but today is Election Day in the U.S. Luckily I only had to wait in line an hour to cast my vote so the usual weekly maintenance outage wasn\'t delayed. However, I wanted to reboot jocelyn to pick up a new kernel, and had issues upon shutting down and coming up not unlike those I had with bruno a week or two ago. Namely - the server couldn\'t find its large storage partition and/or thought it was corrupt. Not sure why but the data storage partition, which is under LVM control, wasn\'t being activated. I had to go through the rigamarole of booting from CD, commenting out the mount point in /etc/fstab, rebooting, then typing "vgchange -a y" myself to finally see the partition. Then everything was kosher. So far the projects are coming up just fine, though slowed as the restarted database has to flood its memory caches before reaching maximum efficiency - this usually takes around 30-45 minutes, I think.

Next Tuesday is Veterans Day, so we\'ll probably have the weekly outage on Wednesday.

- Matt
' ), array('3 Nov 2008 23:46:19 UTC', 'Yeesh - another rocky weekend, but nothing out of the ordinary. One download server got a headache, the schedule process felt sick for a while, the workunit storage filled up again thus blocking the splitters... At least we don\'t have those Astropulse download spikes anymore, but we\'re still at a loss to exactly explain why bruno is so overloaded - and therefore why the queues can\'t seem to drain as fast as they used to. Anecdotal evidence shows the mysql database may seem fine on the surface but is about to collapse any second, and all those extra milliseconds it takes to respond is causing bruno\'s processes to get all gummed up. In any case I put some effort into moving as many of these processes elsewhere. I also asked Dave for a BOINC feature request - a file_deleter command line option where you can state "only delete results" or "only delete workunits" so you can have file_deleters running on more appropriate systems.

It\'s raining here in the Bay Area - and this wet weather is very much welcome given a ridiculously long summer of drought and fire, but rain also means our air conditioner isn\'t working as efficiently. So we got the server closet temperature to worry about on top of everything else.

- Matt
' ), array('30 Oct 2008 23:00:44 UTC', 'Okay. So the assimilator memory leak wasn\'t a problem so much as an effect. Its consumption of resources still needs to be addressed, but it was only affecting itself, and being aggravated by the other problems around it.

Poring through logs I confirmed that the network bursts were indeed due to Astropulse downloads - during the "baseline" 2 out of 100 workunit downloads are Astropulse, but during the "burst" 40 out of 100 are Astropulse. The Astropulse workunits are much larger in size than SETI@home workunits, hence the bandwidth consumption. I also confirmed it wasn\'t a single (or few) clients hitting us at once - connections were randomly distributed over many IP addresses.

It finally dawned on me, and now like most things is painfully obvious in hindsight. The SETI@home and Astropulse splitters have separate high water marks. For SETI@home, if we get above 50000 results ready to send, we temporarily halt splitting. For Astropulse, it is still set pretty low at 2500. Every so often a splitter process checks to see the size of the queue and if it should stop. Since there are many SETI@home splitters running at a time, and there is always a delay in transitioning state, thousands of workunits may be generated before the splitters actually realize they are above the high water mark. And then they go to sleep for a while - like an hour or so - until the queue drains enough and they wake up again and get back to work.
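
To put rough numbers on that overshoot, here is a minimal sketch in Python. The 50000 mark is the real one mentioned above; the splitter count and the batch size between checks are made-up values for illustration only, and the real splitters certainly do not run in neat lockstep rounds like this.

```python
def simulate_overshoot(high_water, n_splitters, batch_per_check, queue=0):
    """Return how far the queue ends up above the high water mark.

    Toy model: every round, all splitters read the same (stale) queue
    size, and if it is below the mark each one produces another batch
    before anybody looks again.
    """
    while queue < high_water:
        queue += n_splitters * batch_per_check
    return queue - high_water


# Hypothetical values: 6 splitters, 1000 results generated between checks.
overshoot = simulate_overshoot(50000, 6, 1000)
```

With these assumed numbers the queue lands a few thousand results above the mark before everything goes to sleep.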

The thing is, during SETI@home\'s "sleep until we\'re needed again" phase the Astropulse splitters continue to run since they haven\'t reached their high water mark even though it\'s much lower - those splitters are fewer in number and run slower. Now remember when workunits are created, the transitioners also create respective results to "send." New results are id\'ed serially - i.e. they are tagged with a number in the database which increases automatically. So during these periods you\'ll get an area in database id space rich in Astropulse results.

Moving on to the feeder. Since it\'s stupid regarding application types, it fills its own send queue with the oldest results ready to send regardless of application, and the way mysql works this tends to mean in database id order. Of course with the ready-to-send queue at 50000 or so, we have to send out 50000 results before we finally see the effects of what happened above - many hours, usually. Then suddenly - bam! - 20 times more Astropulse workunits than normal. That arbitrary time delay really confused matters.
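
The effect is easy to reproduce with a toy model (Python, purely illustrative). The 2-in-100 and 40-in-100 mixes are the ones measured from the logs; the application labels, the layout of the id space, and the window size are assumptions made up for this sketch, not anything read out of the real feeder.

```python
def astropulse_share(results_in_id_order, window):
    """Fraction of Astropulse results in each consecutive window of ids."""
    shares = []
    for start in range(0, len(results_in_id_order), window):
        chunk = results_in_id_order[start:start + window]
        shares.append(chunk.count("astropulse") / len(chunk))
    return shares


# Hypothetical id space: five normal stretches, then one stretch created
# while the SETI@home splitters were asleep.
baseline = ["setiathome"] * 98 + ["astropulse"] * 2
rich = ["setiathome"] * 60 + ["astropulse"] * 40
id_space = baseline * 5 + rich

# A feeder that hands out results oldest-id-first walks this list in order.
shares = astropulse_share(id_space, 100)
```

The share sits at the baseline for the first five windows and then jumps in the last one, long after the splitters that caused it went back to normal.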

Anyway, one easy solution is to make the feeder smarter. It does have an "-allapps" flag to send to all applications equally. We were hesitant to use this before due to fear this will give too many shared memory slots to Astropulse - and it may very well cause periods of low work during peak periods as the feeder has half the memory for SETI@home workunits than it did. Nevertheless we turned this on today and it had an immediate, positive smoothing effect. Sweet.

Other than that today... some data pipeline scripting, and continuing discussions amongst the gang regarding changing redundancy to zero - trying to wrap our brains around all the current bottlenecks and what will suffer depending on what we do. As it stands now, our servers most likely will not be able to support reducing redundancy all the way to zero *and* keeping up with current workunit demand. So we have to either improve our server i/o or figure out what other knobs to turn.

- Matt' ), array('29 Oct 2008 22:55:54 UTC', 'Well we haven\'t really gotten completely around the general problems with our raw data drives being unreadable via our tangled web of SATA enclosures and USB converters, etc. However I did find one thing this morning which helped. Turns out one enclosure just simply stopped working. Long story short, upon very careful inspection I found one of the drive bays had a tiny tiny piece of pink fluff wedged in the SATA power plug. The fluff was from our shipping containers to/from Arecibo. Bits of it get torn off from regular use, and it looks like some got stuck on a drive, which then got wedged into the power plug upon insertion into the enclosure. I dug it out, replaced the drives, and they were visible again. At least for now. I do appreciate the "modprobe" suggestion in the last thread, which may help other similar issues.

Jeff and I were discussing a lot of stuff today, focused mainly on future planning and needs, i.e. what are our current bottlenecks, how do we fix them, and then what will our new bottlenecks be? We\'re resurrecting conversation with campus, possibly to have them research the current cost/feasibility of increasing our bandwidth. We\'re also internally discussing needs regarding a potential move towards less redundancy - which will pretty much double our load if we decide to keep up with demand, and can keep up. As well we were scratching our heads about these semi-regular bandwidth spikes that max out our current bandwidth and wreak general havoc for an hour or so at a time.

As far as the last thing I found an important clue today. The assimilator code has a memory leak - it\'s had the leak for years now, but it\'s usually not a problem. It eventually reaches a limit, fails, then restarts within a few minutes. Today I found the assimilators have been dying quite often recently, and their failures are perfectly in tandem with upward bumps we see in upload traffic. No surprise, as the assimilators and uploads happen on the same machine (bruno) - so if bloated, resource-consuming assimilators suddenly disappear from the process queue, more resources are suddenly given to uploads.

The story goes on from there, but I have to get back to work and will leave the conclusion until tomorrow. You see, I put in an "assimilator killer" cronjob today that runs every two hours to restart the assimilators regularly and prevent them from bloating too much. I think observing the effects of that over the next 24 hours will inform what I think about other network problems we\'ve been having...

- Matt
' ), array('28 Oct 2008 22:59:27 UTC', 'Today\'s outage took a little longer than usual. This had mostly to do with the replica mysql database needing to be reloaded from scratch (since it fell behind over the weekend and would take days to catch up otherwise). Plus there was some more index manipulation, en route to a (slightly) more streamlined mysql database. I also replaced the drive that failed on bambi a week ago. So you can stop worrying about that.

Jeff and I spent way too much time fighting with our current raw data pipeline. We get SATA drives up from Arecibo full of data. What happens to this data is a matter of priorities. Do we need to send empty drives back down to Arecibo as soon as possible? Is the splitter data queue low? Is the raw data storage full? Etc. etc. So at any given time we\'re either (a) sending data to our offsite archival storage or (b) moving data over to the raw data storage, or (c) both of the above. We\'re not here 24/7 so to ensure continual data flow we have external SATA drive enclosures on a couple systems.

However, due to various annoying mechanical/form factor reasons, very few of our systems can host these enclosures. Also the drives should be swappable (otherwise what\'s the point?) but we\'re finding that very frequently a drive is pulled, another is put in to be read, and the OS can\'t see the new drive until we reboot the system. This has been a problem with the enclosure directly connected to a SATA card, or via a SATA to USB converter. We\'re trying to automate this whole process, but with the drives/enclosures constantly disappearing for no good reason we\'re up against a wall on this.

- Matt
' ), array('27 Oct 2008 21:52:50 UTC', 'Bit of a weird weekend. Towards the end of last week we had some science database issues - apparently informix "runs out of threads" and needs to be restarted every so often. Around this time there were continuing mount problems on various servers. The usual drill. Then I headed to San Diego for a gig (only gone 28 hours) and Jeff went on a backpacking trip.

Things were more or less working in our absence, but - as it happens sometimes - sendmail stopped working on bruno. This wouldn\'t be a tragedy except for the fact that bruno wasn\'t able to send us the usual complement of alerts. For example: "the mysql replica isn\'t running!" So we didn\'t realize the replica was clogged all weekend. The obvious effect of this is our stats pages have flatlined. It\'s catching up now, but we\'ll probably just reload it from scratch during the outage tomorrow.

We also had more air conditioning problems last night. At least the repairguy returned today with replacement parts in tow. So that\'s being addressed, but not before Jeff got the alarm at midnight last night and Dan trudged up to the lab to open the closet doors and let things cool off. And the httpd process on bruno, once again, crapped out at random - meaning uploads weren\'t happening for a short while there. Jeff gave that a swift kick, too.

On the bright side, we\'re discovering ways to tweak NFS which have been vastly improving efficiency/reliability here in the backend. This may help most of the chronic problems like the ones depicted above.

- Matt
' ), array('23 Oct 2008 20:55:56 UTC', 'There\'s been some problems with the web server lately which are hard to track down. However, this morning I found we were being crawled fairly severely by a googlebot. I thought I took care of that ages ago with a proper robots.txt file! I then realized the bot was scanning all the beta result pages, and I only had "disallow" lines for the main project - so I had to add extra lines for beta-specific web pages. This may help.
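
For anyone wanting to sanity check this sort of gap, here is a small sketch using the robots.txt parser from the Python standard library. The two paths are hypothetical stand-ins (the real lines in our robots.txt are not reproduced here); the point is only that a rule written for the main project pages does not cover the same page under a beta prefix.

```python
from urllib.robotparser import RobotFileParser


def crawl_allowed(rules, url, agent="Googlebot"):
    """Check a URL path against robots.txt rules given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(agent, url)


# Hypothetical rules: the original file only covered the main project.
main_only = ["User-agent: *", "Disallow: /result.php"]
# Adding a beta-specific line closes the gap.
with_beta = main_only + ["Disallow: /beta/result.php"]

before = crawl_allowed(main_only, "/beta/result.php")
after = crawl_allowed(with_beta, "/beta/result.php")
```

Under these assumed rules the beta page is crawlable with the first set and blocked with the second.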

So we\'ve been getting these frequent, scary, but ultimately harmless kernel warnings on bruno, our upload server. Research by Jeff showed a quick kernel upgrade would fix that. We brought the kernel in yesterday and rebooted this morning to pick it up. The new kernel was fine but exercised a set of other mysterious problems, mostly centered on our upload storage partition (which is software RAIDed). Lots of confusing/misleading fsck freakouts, mounting failures, disk label conflicts, etc. but eventually we were able to convince the system everything was okay, though not before a series of long, boring reboots.

Speaking of RAID, I still haven\'t put in the new spare on bambi. It\'s late enough in the week to not mess around with any hardware, especially after dealing with the above. Plus the particular RAID array in question is now 1 drive away from degradation (no big deal), and 2 drives away from failure. Plus it\'s a replica of the science database - and the primary is in good shape, and is backed up weekly. So no need to panic - we\'ll get the drive in there early next week.

Speaking of science database, I\'m finding our signal tables (pulse, triplet, spike, gaussian) are sufficiently large that informix is automatically guessing that with certain "expensive" queries indexes aren\'t worth using, and is reverting to sequential scans which take forever. This has to be addressed sooner than later.

- Matt
' ), array('22 Oct 2008 21:00:31 UTC', 'Really busy day for me, but not much on the public facing side of things. Jeff and I are revamping our current backend data pipeline in light of continual hardware and I/O headaches. I\'m pulling a bunch of stuff out of the database for Josh so he can do some more "find the known pulsar and see if it looks like RFI" game in Astropulse. I enacted the "no redundancy" policy on beta - we\'re curious to see how well it works in practice, mostly for the sake of general BOINC testing. I had some updating/programming to do regarding our donations database - stuff that campus requested.

Still no signs of the air conditioner being fixed (though it is running cooler in the closet than earlier in the week). And we haven\'t yet replaced the bad drive on bambi (though we have a spare sitting on the shelf).

- Matt
' ), array('21 Oct 2008 22:25:00 UTC', 'Today had the weekly outage for the mysql backup/compression/etc. Bob did some index manipulation on the beta project while we were down - to see if we can perform as well with fewer indexes (now that mysql merges indexes if possible on its own). During the outage one of bambi\'s 24 drives failed, or at least seemed to. A spare has been pulled in and is rebuilding the array now.

The forums were pretty slow yesterday - actually everything was. Queues were filling, storage was maxed out, servers and databases were slowed by all the above, causing all kinds of headaches. However overnight the dams finally broke through and everything more or less cleared up on its own. I like when that happens.

About our bandwidth.. We do have *two* 100Mbit connections to the world. First is Hurricane Electric (HE), which is what SETI pays for, and the other is the link supplied by campus which is shared by the entire lab. The HE traffic is strictly result uploads and workunit downloads, with occasional archival transfers to offsite storage. Everything else - most of the archival transfers, the public web sites, etc. go over the very underutilized campus link. So if there are web site connectivity problems, it has nothing to do with a maxed out link - it\'s probably due to the database server being overloaded, or something else.

- Matt
' ), array('20 Oct 2008 23:09:08 UTC', 'Hello. So the weekend was a bit "noisy" on the network backend. This was mostly due to these network bursts we\'ve been getting. It\'s still confusing to me why these bursts are happening - every few hours we get a bunch of Astropulse workunit downloads in quick succession that max out our bandwidth and wreak general havoc. And over time our workunit storage server filled up again, so queues are filling up, the splitters can\'t create work fast enough. Also the load on our upload server is unbearably high. I\'m hoping during the usual weekly outage tomorrow we can give certain servers a rest to help clear the pipes. Until then, practice patience.

We also have air conditioning problems in our server closet - apparently the temporary fix from last week is unfixing itself. It\'s not a disaster, but the temperatures are rising about a degree per day. I hope the facilities people will be checking it out tomorrow.

- Matt
' ), array('16 Oct 2008 16:46:38 UTC', 'Early note as I\'m leaving after lunch today. Looks like the translation code on the web broke sometime during yesterday evening. How embarrassing. Code was updated on this site (not by me!) which messed things up. The problem with the translation stuff is that it takes a while to "percolate" - you update the proper .inc files, you look at the web site, it looks fine, so you move onto something else - and don\'t notice when it breaks 10-15 minutes later. Normally I check in regularly on "off hours" to catch such problems but I was busy last night. Anyway, I don\'t want to apologize for this, especially as it wasn\'t my fault, and in fact I fixed it when I got in by falling back to older code.

I believe everything else is more or less recovering from various mounting/network/reboot issues yesterday. Hope y\'all are getting your workunits for the weekend!

- Matt
' ), array('15 Oct 2008 21:51:36 UTC', 'This morning the other building in the Space Lab "complex" started having network issues on one of their subnets. For various reasons I shan\'t go into here, we have some servers on that subnet. Since some of these "foreign" servers were recently mounted, this reverberated into all kinds of NFS malaise on most of our local servers, some of which needed rebooting to break various network logjams (and then in one case fsck\'ing after rebooting...). It\'s been that kind of day.

The good news is the mysql master/replica seemed to have survived the OS upgrade yesterday, though not without some confusion about unexpected behavior.

- Matt
' ), array('14 Oct 2008 23:21:23 UTC', 'Well here we are. I just had a long day mostly occupied with upgrading the last server that required a long-overdue OS upgrade. This was our master mysql server. We started the outage early so we could compress/back up the database like we usually do, then allowing enough time in the afternoon for me to install the new OS and configure everything. It seems every time we install a new OS on a server, a completely random set of unexpected hurdles eats up a couple hours. Today was no different.

Hurdle 1: This system has two hardware RAID devices, which the OS saw as /dev/sda and /dev/sdb - the former being the root drive, the latter being the data drive. The installer recognized both devices but swapped names - the root drive was /dev/sdb and the data drive was /dev/sda. Fair enough, but I had to be extra careful not to blow the data drive away. This would have been okay, except upon reboot the names were swapped yet again, and grub\'s device map was pointing to the wrong drive (it doesn\'t use partition UUIDs). This led to some confusion and having to edit grub config files in rescue mode, etc.

Hurdle 2: Actually this happens every time I install an OS, but each time it is slightly different. That is, despite entering the proper network info during the install process, things just don\'t work right out of the box. This time it took 45 minutes of hair pulling before I gave up, swapped the ethernet jack from eth1 to eth0 (it was working just fine in eth1 before the upgrade) and then, inexplicably, I could see the world using the exact same "broken" configuration on eth0 that I used on eth1. Very annoying.

Hurdle 3: I was able to get mysql to start up and see the data, but its master/replica configuration was messed up. I fixed it, but then the replica itself barfed for other reasons. Problem is it was still lodged on trying to replicate "alter table" commands which we do each week to compress the data. So every time I try to reset values an errant "alter table" seems to run, thus locking the database for 60-70 minutes. Makes debugging/progress very slow. In fact, the replica is still off - I just started the project running entirely on the master. I might get the replica working today. Maybe not.

- Matt
' ), array('13 Oct 2008 21:30:21 UTC', 'Busy day today. Jeff came in and found the server closet air conditioner went dead around 5am. So the entire closet was running pretty hot. Turns out there was another coolant leak (a problem we seem to deal with a lot). At any rate, this was fixed pretty quickly and everything cooled down to 2 degrees colder than before this weekend.

Problems over the weekend. The mysql replica lost its connection - a known, common problem (hopefully will be fixed once the replica is on the same switch as the master db). I discovered that and gave it a kick. Hours later the upload server needed a kick as well. Eric discovered that in the morning and got it working again.

We\'re also fairly pegged at our network limit again, I think thanks to the workunit turnaround time being pretty low (i.e. fast). Plus I have to send extra raw data to our archive over the same link. Oh well. Expect data transfer headaches for the next while.

I also am planning for our last OS upgrade tomorrow on jocelyn, the master mysql database server. This means, like when we upgraded bruno, an extra long outage tomorrow.

- Matt
' ), array('9 Oct 2008 20:26:27 UTC', 'Let\'s see.. We had one of our download servers choke on NFS again, which caused its httpd server to die. I gave both autofs and httpd a swift kick on that machine (vader) and it\'s back up serving workunits again. Of course that means there\'s a backlog of clients trying to connect to it, and we\'ll be dropping various other connections while that queue clears out.

Our mysql research led us to discover we needn\'t upgrade our current mysql version after all to make use of automatic index merges. We haven\'t been seeing this logic being employed due to (a) low cardinality of certain indexes and (b) mysql refusing to use multi-dimensional indexes in their merges. Fair enough. We\'ll just have to change around our current constraints.sql (dropping some 2-dimensional indexes and making new single dimensional ones) and see what sticks.

Other than that.. today I\'ve been working on LVM/xfs snapshots and making slow but steady progress on radar blanking testing.

- Matt
' ), array('8 Oct 2008 22:01:36 UTC', 'Some nagging network issues, mostly due to the known liabilities/usual suspects. Very often we are maxing out our 100Mbit private connection to the world, due to peak periods (catchup after an outage, a spate of "fast" workunits, new client downloads added to the usual set of workunit downloads) or sending our raw data to offsite archival storage. This is why download/upload rates are abysmally slow at times - if you can get through at all.

One solution would be to increase our bandwidth - we do pay for a 1Gbit connection, but due to campus infrastructure can only use 100Mbit of that. Getting campus to improve this infrastructure is currently prohibitive due to cost (which includes new routers and dragging new cables up a mountainside to our lab), bureaucratic red tape, and the backlog of higher priority networking tasks campus wishes to tackle first. In other words as far as I can tell it ain\'t never gonna happen. Another solution would be to reduce our result redundancy, as already discussed in recent threads.

We also had our science db/raw data storage server choke a bit today - perhaps because of the recent swarm of fast workunits and therefore increased demand on the splitters. We do try to randomize the data to prevent such swarms but you can\'t win all of the time. And our web log rotating script sometimes barfs for one reason or another and fails to restart one server or another. For a moment there both the scheduler and upload server were off - I caught it fairly quickly though and restarted them.

To clarify, result uploads and workunit downloads go over the private SETI net, along with scheduler traffic. Web pages and other stuff goes over the campus net (it\'s not that much - only a couple Mbits/sec at peak times). The archival storage (where we copy all our raw data offsite) sometimes goes over the campus net, sometimes over the private SETI net, sometimes both if we need to empty the disks as fast as possible to return to Arecibo.

Other than all that.. I fixed the fonts of the status pages, and Jeff elsewhere posted a quick note about NTPCkr progress.

- Matt
' ), array('7 Oct 2008 22:30:02 UTC', 'Had our weekly outage for mysql database backup/compression. Reminder: by "compression" I mean that the rather large tables in the database (notably "workunit" and "result" tables) stay stagnant in size if you go by number of rows. That is, workunits and results are created/deleted at about the same rate. However, when you delete a result you can\'t reclaim that space in the database again until either (a) a whole page of results is deleted (due to random nature of the project this rarely happens) or (b) we actively do this "compression." Why is this a problem? Well, imagine a city where, once you leave a parking space, nobody can ever park in that spot ever again unless all spaces in that neighborhood are vacated. This would make hunting for parking quite a chore. As time goes on, we see a similar effect on the database I/O. Seems silly that the database has this issue, but consider how many endeavors around the world, commercial or otherwise, require a database as large as ours in which a million rows get deleted and added every day? It\'s not a common problem, to say the least. At least at our scope.
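
To put a rough number on the parking problem, here is a small Python simulation sketch. The page capacity and table size are made-up values chosen purely for illustration (they are not our real table parameters), and the model simply deletes rows at random and counts how many pages end up completely empty, which is the only case where space comes back without a compression pass.

```python
import random

ROWS_PER_PAGE = 50  # hypothetical page capacity, not a real mysql figure
PAGES = 2000        # hypothetical table size in pages


def reclaimable_fraction(delete_fraction: float, seed: int = 1) -> float:
    """Delete rows at random and return the fraction of pages left fully empty."""
    rng = random.Random(seed)
    total_rows = ROWS_PER_PAGE * PAGES
    deleted = rng.sample(range(total_rows), int(total_rows * delete_fraction))
    remaining = [ROWS_PER_PAGE] * PAGES
    for row in deleted:
        # Each deleted row frees one slot on the page that held it.
        remaining[row // ROWS_PER_PAGE] -= 1
    empty_pages = sum(1 for count in remaining if count == 0)
    return empty_pages / PAGES


frac = reclaimable_fraction(0.5)
print(f"half the rows deleted, {frac:.2%} of pages reclaimable")
```

With these toy numbers, deleting half the rows at random is overwhelmingly likely to leave no page completely empty, which is the effect described above: lots of free slots, none of them usable until the table is rebuilt.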

People seem to be experiencing slowness uploading/downloading work. I know why: I\'ve been pumping raw data over our network to our offsite archive (HPSS) over the same network link as the uploads/downloads. Usually we don\'t, and in fact after the current batch is done (later tonight) I\'ll archive over the campus network (which is what we usually do).

- Matt
' ), array('6 Oct 2008 23:15:18 UTC', 'Let\'s see. No real major crises at the moment. We do have these network bursts which are entirely due to Astropulse workunits. Here\'s what happens, I think: an Astropulse splitter takes a long time to generate a set of workunits, and then dumps them on the "ready to send" pile. These get shipped to the next 256 clients looking for something to do, which in turn causes a sudden demand on our download servers as the average workunit size being requested goes from 375K to 8000K. We\'ll smooth this out at some point.
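
As a back-of-the-envelope check on that burst, here is a short Python sketch. The workunit sizes and the 256-client batch come from the paragraph above; treating K as 1000 bytes and assuming the whole 100 Mbit link is free for downloads are simplifications made only for illustration.

```python
# Rough estimate of the download burst caused by a batch of large workunits.
# Assumptions (not measured): K means 1000 bytes, the link is a clean 100 Mbit/s.

LINK_BITS_PER_SEC = 100_000_000  # 100 Mbit private connection
BATCH_CLIENTS = 256              # clients served from one batch


def burst_seconds(workunit_kbytes: int, clients: int = BATCH_CLIENTS) -> float:
    """Seconds needed to ship one workunit to each client at full link speed."""
    total_bits = workunit_kbytes * 1000 * 8 * clients
    return total_bits / LINK_BITS_PER_SEC


regular = burst_seconds(375)      # average request size before the burst
astropulse = burst_seconds(8000)  # Astropulse-sized request

print(f"regular batch:    {regular:.1f} s of saturated link")
print(f"astropulse batch: {astropulse:.1f} s of saturated link")
```

Under those assumptions the same 256 requests go from a few seconds of saturated link to a couple of minutes, before counting any of the other traffic sharing the connection.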

Lots of systems projects, mostly focused on improving mysql performance (Bob is researching better index usage in newer versions) and improving disk I/O performance (I\'m aiming to convert all our RAID5 systems to some form of RAID1). Also lots of software projects, mostly focused on radar blanking (the sooner we clean up the data the better). Unfortunately needs of the software radar blanker required us to break open working I/O code - Jeff implemented some new logic and we walked through the code together today. Hopefully soon we can get back to the NTPCkr.

Thanks for your input about the "zero redundancy" plan. Frankly I\'m a bit surprised how many are against it, though the arguments are all sound. As I said we have no immediate need to enact this feature. I still personally think it\'s worth doing if only for the reduction in power consumption - though I\'d feel a lot better if we could buff up the validation methods to ensure we\'re not getting garbage from wrongly trusted clients.

- Matt
' ), array('2 Oct 2008 21:22:11 UTC', 'Not much to report, really. We had a couple blips or brownouts which were minor and easily corrected. Mostly spending my day working on R&D type stuff (mysql replication, radar blanking, etc.) and data pipeline management - this included boxing up freshly reformatted drives to ship to Arecibo.

One thing in the works, maybe, is changing the workunit redundancy to effectively zero. There is already the mechanism in BOINC to "trust" hosts that continually return validated work. These hosts are then sent workunits that only they will have to process (not a redundant "wingman"). No validation is required (or actually possible) upon returning the result, and no waiting on others for credit, either. Of course, even trusted hosts will get occasional tests to prove they are still trustworthy. Plus there are quick tests we can do on the backend in lieu of "comparison validation." Other pros for doing this include using half the resources for the same amount of science (hooray!) and potentially getting through our backlog of data twice as fast.
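
The general shape of that trust logic can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual BOINC code: the Host class, the threshold of 10 consecutive valid results, and the 10% spot-check rate are all hypothetical values picked for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Host:
    consecutive_valid: int = 0  # validated results returned in a row


TRUST_THRESHOLD = 10   # hypothetical: valid results needed before trusting
SPOT_CHECK_RATE = 0.1  # hypothetical: fraction of trusted work still replicated


def copies_to_send(host: Host, rng: random.Random) -> int:
    """Return 1 for unreplicated work, 2 when a wingman is still needed."""
    if host.consecutive_valid < TRUST_THRESHOLD:
        return 2
    if rng.random() < SPOT_CHECK_RATE:
        return 2  # occasional test to prove the host is still trustworthy
    return 1


def record_result(host: Host, valid: bool) -> None:
    """Any invalid result resets the host to untrusted."""
    host.consecutive_valid = host.consecutive_valid + 1 if valid else 0
```

The reset-to-zero rule is the conservative choice here: one bad result sends the host back to full redundancy until it has earned trust again.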

The cons are mostly concerns. If we try to keep up with current demand for work we\'d have to run twice as many splitters, which is impossible given our current resources (we\'d at least need more cpus, more disks, and better disk i/o). Or we could split at today\'s rate and regularly run out of work, which might upset some people. If we do increase our splitter production rate and burn through our data, we will even more likely run out of work on a regular basis (since we can\'t pad fresh data with old data if we used up the old data).

Just some thoughts for now. We haven\'t really decided on anything yet.

- Matt
' ), array('1 Oct 2008 21:01:21 UTC', 'Random day. Fixed more stuff on bruno (which got upgraded yesterday), most notably the update_stats process which needed to be recompiled to find newer libraries. Also dealt with lots of internal data pipeline management. And some subversion repository cleanup (in preparation to possibly improve web page translations).

The big thing is that I finally got some time to reconfigure that one RAID5 system into RAID10 (effectively), and the write rates increased by over 16x. Now we\'re talking. As we get more disk space to work with, we\'ll pretty much convert all our RAID5\'s to something else to help get beyond several backend IO bottlenecks. I know this sounds like we only now just discovered the joys of non-parity-based RAID systems, but - like most things around here - we are always firmly aware of better solutions but lack the resources to enact them. Pretty much all our RAID5 systems were built grudgingly but we needed the extra storage at the time.

- Matt
' ), array('30 Sep 2008 23:28:47 UTC', 'We had an extended outage today (more than the regular 3-4 hour database maintenance outage) to finally upgrade one of our core servers, bruno. Usually the OS upgrades are trivial, however this particular machine required a little extra TLC, due to its functional importance, as well as its unique (but admittedly not that unusual) hardware configuration. In regards to the latter, we basically put off upgrading this system until a modern day OS would automatically support its fibre channel card (as opposed to us having to compile drivers into the kernel, etc... blech...).

Anywho... there were no major failures during the long procedure (which included backing everything up, reconfiguring root RAID devices (while trying not to destroy others), then resetting all the network/RAID/apache/etc. services). It still took longer than it should due to a steady stream of minor annoyances (installer crash on first attempt, missing sym links that had to be discovered/recreated, missing packages to be installed, having to recompile every BOINC service due to standard library changes). Doesn\'t matter - it\'s done. Or at least done enough - there are still some screws to tighten which I\'ll tackle later.

So, we\'ll be catching up for a while. If at first you don\'t connect, let your client try again later.

- Matt
' ), array('29 Sep 2008 22:17:27 UTC', 'Quick news for the beginning of the week. We chugged along nicely all weekend, though for server load reasons we were running fewer Astropulse splitters (and thus creating fewer Astropulse workunits) and so they\'ve been "falling behind" SETI@home in the competition for processing power. I changed that this morning. Also we\'re going to attempt the bruno upgrade again tomorrow. We realized last week we\'ll need a lot of time to do everything we\'d like, so the regular outage will start a bit early and possibly end later.

- Matt
' ), array('24 Sep 2008 21:12:42 UTC', 'Something we\'ve been lagging on is separating the database count totals on the server status page. Currently we\'re showing "totals" - for example, the "results ready to send" is a sum of both SETI@home and Astropulse results ready to send. For diagnostic purposes, it would be much better to split these into two separate columns.

However, this isn\'t so easy, as such queries become suddenly very expensive if we\'re adding an additional "where appid = N" conditional (AstroPulse and SETI@home are considered two different "applications" in the BOINC realm). I\'m talking the difference being a 3 second query versus a 3600 second query. Yup. We\'ve made joint indexes in the past for servers that needed them, but this hasn\'t been a priority for diagnostic stuff. We also don\'t really have the memory/resources to keep such extra indexes around. In any case, Bob pointed out that newer versions of mysql are smarter - doing the index joins automatically - so we may push on upgrading mysql sooner than later.
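
For illustration, here is roughly what a joint index for such a per-application count looks like. This sketch uses an in-memory SQLite database from Python rather than mysql, so it only demonstrates the idea and says nothing about our real query times; the toy table, the index name res_state_app, and the generated data are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE result (id INTEGER PRIMARY KEY, server_state INTEGER, appid INTEGER)"
)
# Made-up data: server_state 2 stands for ready to send, appid 1 and 2 for the two applications.
rows = [(i % 5, 2 if i % 3 == 0 else 1) for i in range(3000)]
conn.executemany("INSERT INTO result (server_state, appid) VALUES (?, ?)", rows)

# The joint (two-column) index lets the per-application count avoid a full table scan.
conn.execute("CREATE INDEX res_state_app ON result (server_state, appid)")

counts = dict(conn.execute(
    "SELECT appid, COUNT(*) FROM result WHERE server_state = 2 GROUP BY appid"
).fetchall())
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM result WHERE server_state = 2 AND appid = 1"
).fetchall()

print(counts)
print(plan[0][-1])  # query plan detail, which should mention the index
```

The catch mentioned above still applies: every extra index like this costs memory and slows down inserts, which is why it has not been a priority for purely diagnostic queries.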

Today I\'m actually lost in mundane bureaucracy land. I also should be working on the new software radar blanking embedder code. Sigh.

- Matt
' ), array('23 Sep 2008 23:17:01 UTC', 'We had the regular database maintenance outage today - no news there and we\'re recovering from that now. We have several backlogged data pipeline jobs adding much noise to our backend network, so progress is slower than normal.

We also planned to do some OS upgrading today but were blocked waiting for some backup jobs to finish. The influx of free time led me to do some extensive testing regarding our general bottlenecks as of late. I\'ll cut to the chase. We can blame RAID5 for pretty much everything. No real shocker there, but I was surprised by the extent of RAID5\'s lousy performance. In one example, a large file copied from temp space to a directly attached RAID5 partition took two minutes, and the same file copied over NFS to a remote RAID10 device took 6 seconds (file caching had nothing to do with it, in case you\'re wondering). While some systems handle RAID5 (or RAID4) much better than others, we simply can\'t afford the performance hit on the writes no matter how fast the parity bits are computed.

So why choose RAID5? Well, you get far more raw storage that way. But that\'s pretty much it as far as I care. Unfortunately in some cases (like our raw data storage buffer on thumper) we need every terabyte we can get. Seems kinda silly what with single terabyte drives readily available to the world, but spindle count is also quite important to us. In any case we have some convertin\' to RAID10 ahead of us on several systems and the usual round of careful/paranoid testing. I don\'t think we have much of a choice in making some of thumper\'s partitions RAID10 as well, and that\'ll mean sometime in the future a planned outage of indeterminate length.

- Matt
' ), array('22 Sep 2008 21:19:03 UTC', 'No big disasters over the weekend. However, turns out one of the download servers had its root partitions fill up yesterday due to faulty log rotation behaviour. I\'m figuring that\'s why outbound traffic was spotty for a while. I had to clean that mess up this morning - I think we\'re out of the woods on that front but the traffic graphs still seem kinda weird to me.

I plan to upgrade the OS on server bruno tomorrow, and with that being the "hub" computer for BOINC in a lot of ways, the outage may be longer than usual. Hopefully not too long.

It is coming clear that our hopes for the new NAS box we assembled aren\'t being realized - it\'s pretty slow. It is also clear that using thumper as both a raw data storage buffer and science database server isn\'t going to work out for much longer. The I/O on the machine is usually maxed out, and we need a better solution. Not sure exactly what that solution is yet.

I\'m going to be prioritizing helping to implement the new radar blanking code, as Astropulse is kinda blocked until it\'s ready. Jeff\'s been working pretty hard on that, as the program required some changes to core data management routines without breaking currently working software. Once we\'re over the hump on that he (or we) can turn our attention back to the NTPCkr.

- Matt
' ), array('18 Sep 2008 23:30:22 UTC', 'Just checking in before the weekend. Not much super urgent to report. The mysql replica fell behind again as our alert scripts didn\'t exactly work as expected. When the replica loses connection to the master the "seconds behind master" diagnostic variable gets set to NULL, which my scripts interpreted as "zero" as in "zero seconds behind master" - which is usually optimal. Ha ha ha. Anyway, it didn\'t fall that far behind and is catching up now.
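
The fix boils down to treating NULL as its own state instead of letting it collapse to zero. A minimal Python sketch of that check follows; the function name, the status strings, and the 600 second threshold are hypothetical, and None stands in for the NULL that comes back from the database.

```python
from typing import Optional

MAX_LAG_SECONDS = 600  # hypothetical alert threshold


def replica_status(seconds_behind_master: Optional[int]) -> str:
    """Classify replica health from the seconds-behind-master value.

    None stands in for SQL NULL, which means replication is not running,
    so it must never be coerced to 0.
    """
    if seconds_behind_master is None:
        return "disconnected"
    if seconds_behind_master > MAX_LAG_SECONDS:
        return "lagging"
    return "ok"


for value in (0, 42, 5000, None):
    print(value, "->", replica_status(value))
```

The important part is the explicit None test coming first: any numeric comparison or conversion done before it is where a NULL can quietly turn into a healthy-looking zero.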

Otherwise I\'ve been doing some data pipeline scripting updates - for example you may have noticed that the server status page no longer gets cluttered with files that finished "in error" - as mentioned in a previous post these files are finishing fine except for some "raggedness" at the very end. Also some fighting with sendmail, and moving servers around. I moved a rather heavy desktop server downstairs into a new office - while carrying it the weight was enough to keep me distracted from the fact the sharp corner was digging two bleeding holes into my wrist. No big deal - but I showed my wife the wound later and she said it looked like a snake bite, which was amusing as the offending server\'s name is "snake."

We also walked through Luke\'s radar blanking code today - he\'s back to school so he was wrapping it up best he can this week and all our free resources were aimed at making this possible. His program is pretty much doing its job - in fact it\'s detecting the radar in our data better than the embedded hardware radar blanking signal we currently use! Well, we\'ll confirm this with more analysis.

Thanks for the concerns/tips/suggestions regarding my previous post about the mysterious RAID controller card behaviour. Maybe I\'ll check jumpers/etc. next week.

- Matt
' ), array('16 Sep 2008 22:25:07 UTC', 'Another week, another database maintenance outage. This one was short but busy. We actually had major upgrade plans for one server but feared this would take all day and lock out the servers so we postponed it until next week which may be less stressful.

Eric cleared a bunch of space off the workunit storage so that bottleneck has been alleviated for now, i.e. we have elbow room to create enough workunits to keep up with demand. However this leads us to the first of two mysteries today. You see, he\'s moving all the beta workunits to our new homemade NAS box (ptolemy). While this move has already been helpful, it\'s taking forever to complete. Why are the disks pegged at 100% utilization? Lack of spindles? PCI bus traffic? Old/slow controller cards? RAID5 biting us again? We\'ll either sort that out or eventually give up on this machine as anything more than archival storage.

The other mystery has been a known issue for some time, but with the down time we revisited the problem: our secondary science database server, bambi, works great except for the fact that upon reboot there\'s a random chance one or two (or three) drives simply don\'t show up on the 3ware controller, causing all kinds of RAID panics/rebuilds. It\'s never clear why this happens, or when it will happen, and when it does it\'s not always the same drives that disappear.

However, a full power cycle always works. The only difference really is that the drives have to spin up on power cycle, but not on reboot. So we\'ve been assuming there\'s some spin-up settings that need to be tweaked. There\'s been talk of making bambi the primary database server, so today we looked for those settings. Couldn\'t find them - nothing in the regular motherboard BIOS, and nothing useful in the 3ware BIOS - and the latter was moot because the drives would have already disappeared according to the 3ware BIOS, so all the spin-up problems are happening before the 3ware is aware. I find nothing about this in any documentation or on the web. It\'s not a showstopper, we can still use bambi as the backup that it is, but this pretty much means we\'ll never be able to fully trust bambi as a "main" server.

Oh yeah.. other stuff. The mysql replica croaked this morning just before we arrived - a partition on the server filled up. Apparently when upgrading the OS we missed a sym link somewhere. So the replica is resync\'ing yet again. Also messing around getting the CUDA development/testing server up and running.

- Matt
' ), array('15 Sep 2008 23:14:16 UTC', 'Happy Monday, everybody. We\'ve been in a holding pattern all weekend, more or less, dealing with the usual constraints (not enough space for workunits, mostly). This morning was weird - something tripped the "stop all daemons" trigger on our back end, so we weren\'t sending out work for a couple hours until I noticed. Even then restarting everything was blocked by the lack of space again.

On the bright side, we\'ve been getting this homemade NAS box up (for use as general backup of stuff we don\'t want to waste time/money backing up to tape, as well as administrative stuff, home accounts, etc.). So far so good, and there\'s a lot of extra space on it to move the less-active beta downloads there thus freeing up space to make SETI@home/Astropulse workunits to keep up with demand. Woo-hoo! That\'ll break the dam, at least temporarily. We\'re still looking for a cleaner long term solution - several things are in the works on that front.

Other than that, spent a lot of today in meetings, installing high-end graphics cards (for CUDA development/testing), and writing scripts to kick the replica mysql database when it lags behind for no good reason.

- Matt
' ), array('11 Sep 2008 22:08:04 UTC', 'So we hit that brick wall again with the science database - that is, when we try to create a new index it works fine on the primary server but then clogs up sending the new index pages to the secondary. This clog locks up the database, the splitters grind to a halt, the assimilators grind to a halt, i.e. fun for everybody!

We thought we were out of the woods yesterday afternoon but checking in at 1am last night (this morning?) I saw this all happening again, so I gave things a swift kick and went to bed. This morning, once we were all here at the lab, we decided to just bite the bullet this time and shut down all the splitters/assimilators and let the clog work through naturally on its own, which it did. We also took the down time to do an "update statistics" on one signal table (this helps re-sort current indexes for speedier lookups) and add disk space for said indexes. I just turned things back on, we\'ll be catching up for a while, etc.

I did do some qlogic card testing today which got us over my "information gathering and training" hurdle so we can upgrade the remaining two servers with old OS\'s in the coming weeks. We also got our homemade NAS configured so that we may get the old NetApp rack out of the closet maybe next week. It\'s still working quite reliably, but it\'s taking up a third of our closet space, a seventh of our power, but delivering only 2 TB of raw disk space. Not really efficient, and we have a *lot* of servers waiting to get into the closet already.

- Matt' ), array('9 Sep 2008 22:36:13 UTC', 'Tuesday means down time. Same drill that happens every week: projects go down for a few hours, mysql databases are washed, dried, and neatly folded, and then we\'re back on line sometime in the afternoon (Pacific Time). Some people don\'t like the scheduling of these outages, but as it happens NERSC (where we archive all our raw data off site) has their weekly maintenance outage at the exact same time. Something about Tuesday morning that makes it particularly good for maintenance downtime: it\'s not Monday, when we\'re catching up on weekend issues, but it\'s still early enough in the week to recover from potential problems should any arise.

We tackled several other projects during the outage, as we always try to do. We upgraded the OS on sidious (mysql replica db server), which was long overdue. There\'s lots of configuration involved, but with extra care the software RAID partitions containing the database survived the ordeal. We also tested some 750GB drives in one storage server - we\'re still trying to figure out what we have and what we can use given our current storage needs (for workunits, results, or less interesting but equally important things kept on the NAS box which will soon disappear). I also finished getting a new desktop installed - replacing the old clunker which had been our "mass mail" server (for reminder e-mails and such). I\'ll wait until the current smoke has cleared before telling people to "please come back."

There are always other work items too confusing to mention here. In fact I avoid a lot of happenings/details in these glib tech news posts as it will only raise more questions which I don\'t have the time to answer. Sometimes I\'m cagey with my responses for political reasons - occasionally we have commercial vendors/anonymous donators/grant administrators involved in our decision making processes, occasionally I don\'t want to perpetuate the false impression I call the shots around here (I just work here - and post a lot because I happen to suffer from hypergraphia). I understand this vagueness is to the detriment of those who have a generally good understanding of the big picture and are keen to guess what our motivations and needs are, but without key bits of information people sometimes end up being a tad off base. Nevertheless it is amazing to me how much people glean from the scant amount of public relations material we barely manage to squeak out.

- Matt
' ), array('8 Sep 2008 23:02:11 UTC', 'The triplet table in the science database has been a headache for over a week now. We\'ve been trying to add some indexes to it, but this has been mysteriously filling up some kind of logical space (not physical space) such that new triplets couldn\'t be inserted. This has also been adversely affecting the science database replica. For now we\'re giving up on the indexes and letting triplet insertions continue, and allowing the replica to recover.

Internal discussions continued today regarding what to do next as far as general storage. As mentioned often recently, we\'re low on workunit storage - the crux of most of our recent public server problems. We just got some disks in the mail today which were slated for our new home-made NAS box, but we might instead aim these at workunit storage somehow. Testing will commence tomorrow during the outage, as will several other server-related tests/upgrades.

To clear up some confusion: a lot of raw data files depicted on the server status page are showing errors. This is somewhat misleading as these errors all happen at the very end of the particular file/channel. So it\'s not like we\'re losing half our data. Only about one tenth of a percent. What are the errors? At the very very end of the raw data files, some channels are missing the radar blanking signal, so it\'s impossible to remove the RFI. These channels exit in error, though there\'s nothing we can do about it. We have taken steps to try to reduce the number of files that exit this way.

- Matt
' ), array('4 Sep 2008 19:48:30 UTC', 'The good news is that recent woes due to lack of workunit disk space have seemingly passed for now. We\'re still on the very edge of our capacity, but now that we\'re prioritizing the smaller regular workunits (as opposed to the big Astropulse workunits) we were able to build up a ready-to-send queue and network traffic stabilized overnight.

The less-good news is that we still need to build some indexes on the science database. We\'re building one now, and it usually takes 12-24 hours. This adds a lot of CPU and disk I/O to the science database server, meaning the splitters can\'t add rows as fast, nor can the assimilators. So the ready-to-send queue drops, and the assimilator queue rises. As an added bonus, when the assimilator queue rises, that means the deleters slow down, which means the available workunit disk space reduces, and we\'re back to square one again. No big deal as long as people are patient. All the backend services are doing the best they can until the index build finishes, and then we should catch up again.

- Matt
' ), array('2 Sep 2008 22:16:36 UTC', 'Currently as I write this we\'re recovering from the weekly outage (during which we take care of database backups and other sundry server details). It may take a while...

This past Friday we overloaded our science database trying to create a new index. A database engine restart solved the problem, but not before choking the whole local network. As mentioned in many posts past, we\'re strangely sensitive to heavy network bandwidth (I think due to linux\'s imperfect handling of NFS dropouts), and such periods cause random unexpected events. This time, for example, the bottleneck from the primary science database server ultimately caused the BOINC/mysql replica server to disconnect from the master. So the replica fell behind all weekend. Sigh. Instead of actually letting it catch up we\'re just re-mirroring it from the master as we just backed it up this morning.

Meanwhile, we\'re out of space again on the workunit server, and with no fast/easy way to add space. Eric\'s playing with the splitter mix to reduce the number of Astropulse workunits being generated (they are much larger than SETI@home workunits). Maybe that will help, but not immediately. This is what\'s mostly causing our headaches today as we can\'t create enough work to keep up with demand.

- Matt
' ), array('28 Aug 2008 22:51:58 UTC', 'We have a lot of servers in play around here, and once in a while an operating system on one particular server falls far enough behind in spec that the best move is to do a clean reinstall of the latest OS version from DVD (as opposed to trying to do 3 or 4 separate upgrades over the net, one revision at a time). Such was the case with vader, and I bit the bullet yesterday and tackled that project. It mostly acts as a compute server and a redundant download server, so it wasn\'t really missed for the 24 hours it was offline. Only one annoying snag: we have a lot of systems already running this OS, but this was the first 64-bit clean install from DVD, and turns out there\'s a package dependency bug that caused the install to crash until I figured out the offending package and left it off the list. This morning I wrapped up work and it\'s back online. That\'s good, but I still have a few more servers needing similar upgrades.

This summer we have a volunteer undergrad, Luke, working on radar blanking code. Background: our multibeam data is inundated with military radar noise of semi-predictable rate and frequency. Such data collected since early 2008 has a "blanking signal" embedded by Arecibo within the raw data, so we can easily tell when the radar is on or off and we can ignore the loud noise. What Luke\'s working on is a program that analyzes pre-2008 data to retroactively find the radar noise and recreate a similar "blanking signal" so we can clean it up. We (me, Jeff, Eric, and Luke) had a code walkthrough yesterday. So far, so good. In the process of making this program Luke also found phase issues, even with the Arecibo blanking signal, which is probably why we still get overflow workunits from time to time. So there\'s still a little work to be done. When we have an observatory on the dark side of the moon, this won\'t be a problem. Don\'t see that happening anytime soon, though...

Still messing around with this new/old NAS system. It\'s becoming a real time sink. Lots of waiting through long reboots, then trying to figure out why X or Y isn\'t working as expected.

I don\'t come into the lab on Fridays, and Monday is a national holiday. So signing off for a few days...

- Matt
' ), array('26 Aug 2008 22:53:45 UTC', 'Ah, yes - here we go again - the regular Tuesday outage for mysql database backup/compression and other tasks better suited to happen during "quiescent" time.

For example, this week we replaced the failed drive in the workunit storage server with a new drive. That was painless. We also spent a bunch of time experimenting with the new-ish RAID server. I say "new-ish" as it\'s new to us, but it is an old system. For example, it can\'t handle logical volumes greater than 2TB. We however today confirmed (a) it can handle physical single drives at least 750GB in size, and (b) physical volumes greater than 2TB (i.e. put three 750GB drives together to make a 1.5TB RAID5).
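
The raw versus usable arithmetic behind that last point is easy to sketch in Python. The helper names below are made up for the example, and the formulas are just the standard capacity rules (RAID5 gives up one drive to parity, RAID10 gives up half to mirroring), ignoring filesystem and controller overhead.

```python
def raid5_usable_gb(drive_gb: int, drives: int) -> int:
    """RAID5 spends one drive worth of space on parity."""
    if drives < 3:
        raise ValueError("RAID5 needs at least three drives")
    return drive_gb * (drives - 1)


def raid10_usable_gb(drive_gb: int, drives: int) -> int:
    """RAID10 mirrors everything, so half the raw space is usable."""
    if drives < 4 or drives % 2:
        raise ValueError("RAID10 needs an even number of drives, at least four")
    return drive_gb * drives // 2


raw = 750 * 3                     # three 750GB drives, raw
usable = raid5_usable_gb(750, 3)  # what the logical volume actually offers
print(f"raw {raw} GB, usable {usable} GB")
```

So three 750GB drives are more than 2TB of physical disk while the resulting RAID5 logical volume stays under the 2TB limit.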

We also tested that this system is keeping up pretty well doing a continual backup of our upload directory. That is, we\'re doing a constant rsync with the upload directory to keep a "hot backup" around on a separate system. We didn\'t have the bandwidth/storage capacity to do this ourselves before (and daily backups to tape were too expensive).

Anyway.. the extended length of the outage today was mostly due to revamping the way we\'re doing the backups. We\'re working to include better query blocking (to ensure the database is totally update-free) and figure out the best way to maximize our time, thus ultimately shortening these outages.

- Matt
' ), array('25 Aug 2008 22:56:00 UTC', 'I\'ve been out for a couple weeks. I really need to get the others around here to chime in while I\'m away, but it\'s hard to convince people who aren\'t as hypergraphic as I. Anyway, it seems like whatever happened most everybody survived. Another problem: what I end up blathering on about in these posts is hardly comprehensive, and given arbitrary priority based on whatever is on my mind at the given time. This can be confusing, I imagine.

I might also just go ahead and start only posting here when I really need to (during *real* server issues) and post less important day-to-day type things in the blog. We\'ll see how that goes. It might help keeping specific issues contained to one meaningful thread.

In any case, a brief rundown of the past two weeks: A drive failed on the workunit storage server. Usual drill there except it hung after the failure, however once rebooted it recovered just fine using a spare drive. Outside of that were more minor issues (another server hung requiring reboot, the mysql replica stopped for no apparent reason and took a few days to catch up, etc...) causing various queues to drain or fill too fast, bottlenecks were exercised, and we had a couple temporary complete/partial public server outages... all told nothing out of the ordinary. We are still running a bit "hot" due to the Astropulse release - by "hot" I mean we\'re using far more storage/network resources than we\'d like, but we\'re otherwise okay.

Going back to catching up from the absence...

- Matt
' ), array('7 Aug 2008 22:11:38 UTC', 'Towards the end of the afternoon yesterday we put in a new scheduler to fix a bug with "anonymous platforms" and the way they handle Astropulse workunits. This is working fine as far as I know, but at first there were some brief issues with uploads in general (human error when installing new scheduler).

Today we got our new NAS machine into the closet. We\'re close to removing the old NetApp filer, which still works great after so many years, but the drives are too small, we can\'t afford support on this system, and buying new replacement drives is prohibitively expensive. Plus the thing is just physically huge - a whole rack taking up a third of our closet for only 3 TB raw space. We\'re replacing it with a 3U system that will ultimately have about 7 TB raw space. Getting that into the closet meant I was able to fire up another server-to-be today in our prep lab and get that configured.

Traffic-wise we\'re still trying to get a feel for our demand and our bottlenecks. Eric wrote a script that is busy deleting antique workunits/results that exist on disk but not in the database (not sure why the antique deleter built into BOINC isn\'t working...). This will clear up additional much needed room but this is pretty much all we can do short of getting a whole new workunit storage server.
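
The idea of finding files that exist on disk but not in the database comes down to a set difference. The sketch below is only an illustration of that idea, not the actual script: find_orphans and known_names are hypothetical names, and known_names is assumed to be a set of file names already pulled from the database by some other query.

```python
import os


def find_orphans(storage_dir, known_names):
    """Return paths under storage_dir whose file names are not in known_names."""
    orphans = []
    for root, _dirs, files in os.walk(storage_dir):
        for name in files:
            if name not in known_names:
                orphans.append(os.path.join(root, name))
    return sorted(orphans)
```

A cautious version would only list the orphans at first and delete them in a separate, reviewed step.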

Looks like web code was updated just now, breaking a thing or two. I think Dave\'s addressing that stuff. I\'ve been mostly catching up on several behind-the-scenes programming projects today.

- Matt
' ), array('6 Aug 2008 21:11:48 UTC', 'Generally speaking, the wealth of issues we\'ve been experiencing were simply due to Astropulse adding about 10-20 more Mbits/sec to our general average. This was a little higher than we expected, hence the initial air of mystery, but still quite within our abilities given current infrastructure. This traffic might go down a bit once everybody requesting their first Astropulse workunit gets their single copy of the Astropulse client.

So this explains the big rush once we released the first workunits and the longer "catching up" period, especially given the fact we were constrained all weekend due to lack of workunit storage space.

Today I\'ve been mostly working on build scripts and testing recent database code fixes. Getting back on the "development" train for a bit... We are also close to getting that new home-grown NAS into production.

- Matt
' ), array('5 Aug 2008 23:15:08 UTC', 'Today was another one of them "outage days" where we shut everything down to do basic weekly maintenance (database backup and whatnot). We had a particularly large task list this time around. A lot of it was fairly mundane - like moving/compressing files to make more room on various storage systems.

The sidious crash the other day did in fact break the mysql replica again. No big deal, but that meant recreating the database from the master - a seemingly weekly occurrence. It\'s easy to do, just adds extra time to the whole operation.

Also, we tried to fix that broken index on the science database. We found the corruption was actually not on the RAID system we thought (the one that required a drive replacement). Huh. Anyway.. the index repair on the whole table was taking too long. We might just go ahead and drop/rebuild the specific index later now that we are more sure what\'s what.

We brought all our backend services (feeder, transitioner, validator, etc.) up to spec on current BOINC code for the first time in a long time, so we carefully turned these on one at a time to observe the logs/results and make sure nothing got all screwy with the updated code.

So we\'re back up, more or less. The current mystery is why we are using so much bandwidth. Too many factors at play to make a clear determination - lots of known network bottlenecks, lots of database bottlenecks, unknown Astropulse behavior, etc. We\'ll give this a closer look tomorrow after (hopefully) some of the traffic jams disappear.

- Matt
' ), array('4 Aug 2008 21:37:18 UTC', 'Another wacky weekend for us. Astropulse is still ramping up - we\'re creating work, sending it out, receiving results back and assimilating them. However the validator stopped granting credit for these workunits - something we\'ll fix and we can also retroactively give people their credit. The workunit storage server ran low on room again, the bottleneck that\'s been giving everybody headaches over the weekend as the splitters could only create work as fast as workunits got deleted off disk. Right now things are generally running slow as I\'m moving stuff off the workunit server to make room causing lots of excess internal i/o. As an added bonus the mysql database replica server crashed this morning - it ran out of memory. No harm done, but it looks like it\'ll take a while to catch up again (it\'s been lagging behind all weekend). I would like to try to split the numbers on the status page between the two different applications (SETI@home/Astropulse) but those extra "where" clauses make the queries run forever.

In better news, looks like we got our new home-grown NAS/RAID box working as we\'d like it, so we may start employing that sooner than later (thus freeing up lots of room/power in our server closet). Also all drive issues on our science database server over the past couple of weeks have been completely dealt with at this point. Well.. there\'s one lingering corrupted index which we\'ll try to rebuild tomorrow during the outage.

I was actually out of the loop since Thursday as I went up to Seattle to play a gig on the main stage at the Microsoft Techready conference at Bell Harbor. Anybody around here attend that thing? Fun show/event, but the stage tent was completely inadequate and the entire band got soaked by rain and sea mist. I\'m amazed none of us were electrocuted.

- Matt
' ), array('30 Jul 2008 20:10:28 UTC', 'Looks like we\'re pretty much out of the woods regarding recent issues. Plus the stats dumps are working again (for the first time in days) so there was an artificially inflated bump in BOINC world-wide productivity for a moment there.

Following on with the science database server stuff. I continue to play the RAID "shell game" to get the root filesystems back on the actual root drives (just for our own sanity, mostly). I also still have to drop/rebuild that one index which gave us trouble a couple weeks ago (apparently "checking" the index didn\'t fix it) - all very minor issues.

Regarding our experience with drive failures... We see the obvious stuff - drives fail either (a) immediately, (b) after 2-4 years, or (c) never ever. I remind people that our original SETI@home data recorder contained drives that were already heavily used for about 5-6 years when we installed them down at Arecibo in 1998, and then they were reading/writing successfully until a couple years ago. They would still probably be working but we have since switched to the newer multibeam data recorder system. Anyway, we don\'t have enough data to prove that high temps or heavy loads kill drives faster. My gut feeling is they don\'t as much as you think. My gut feeling is also that more than half our "failures" are bogus - for example, we had a lot of fibre channel errors, or RAID card bugs, or smartd being oversensitive making it seem like perfectly good drives were unhappy. Many times we just remove and re-add the "broken" drive and it works just fine. In the current case we believe the drive replacement was necessary.

Regarding linux OS re-installs... We\'ve been using Fedora for a while now. Each OS rev has about 18 months of support, and we like to keep up to date for various compatibility/security/bug-fix reasons. It\'s easy to "yum upgrade" to the next OS rev, but after doing this a couple times you find configuration files get out of whack, and your system is littered with "rpmnew" files. Package conflicts arise. Plus every few years you learn enough that you might want to rethink your file systems/adjust partition sizes, etc. So a fresh install is more just "spring cleaning" than anything else.

- Matt' ), array('29 Jul 2008 23:13:57 UTC', 'Today we had our usual Tuesday outage which was a bit longer than usual as we had extra things to take care of (outside of the usual BOINC database table compression and backup to disk).

I failed to mention yesterday (though many have noticed) that db_dump hasn\'t been working for days, which means our stats have flatlined all weekend. This was because our mysql replica failed (we run these expensive stats lookups on the replica so they don\'t affect the more important updates running on the master). So part of the outage today was to rebuild this replica from scratch via the dump from the master. It was easy - we do this regularly anyway - just takes a long time.

Also, Jeff and I replaced a failed drive on thumper (the science database server). There are 48 drives on the thing so disk failures are common, and we get Sun support on this important system. We ask for a drive, they send one, we put it in and ship the old one back. Easy as pie. Unfortunately the software RAID on this system made some bogus complaints upon restart (unrelated to the device that required the new drive). I\'m not sure why mdadm gets confused - for example I converted a couple spare drives to a new RAID device, which works fine, but upon reboot (many months later) mdadm freaks out that those spares are missing. Anyway, this was mostly harmless, and another warning we really need a fresh OS install on this system sooner than later (that\'ll be scary).

We\'re running full bore now. It\'ll take a while to catch up, and we may temporarily run out of work again (still not a comfortable amount of free disk space on the workunit storage). But it\'ll all clear up eventually.

- Matt
' ), array('28 Jul 2008 21:27:00 UTC', 'Wow. What a weird weekend. A lot of little minor things went wrong causing a bunch of "perfect storms" in succession. I have a technical term for this which I can\'t say in public. Anyway, I\'ll spell some of it out in no particular order and in varying amounts of detail.

Our workunit storage server filled up again. We got the warnings too late, as mounting problems were keeping the server status scripts from running, which obscured a rather large assimilator queue backlog. When results stay on disk waiting to be assimilated, so do their respective workunits. Plus with Astropulse ramping up those giant workunits were filling up the storage faster than usual. Eric did already put in code for the splitter (which generates the workunits) to check for a full disk before attempting to write anything. Of course, this fix was only deployed in beta so far. The result: there are about 20000 workunits of zero length, which will cause annoying errors for all clients trying to download them, but they should pass through like kidney stones before too long. For a while I stopped the splitters to reduce the disk usage. Today we put the updated splitter in the main project.

We\'ve been having general scheduler problems over the last week as BOINC code updates were made in preparation for Astropulse. We hadn\'t built a new scheduler process in a while, which brought to light several problems, mostly due to our database schema being outdated and therefore out of sync with what the code expected. This didn\'t cause any data corruption, but caused random hosts to be unable to connect. For no real good reason a lot of hosts reporting problems were Macs, which added to the difficulty of diagnosis - we thought it was an architecture-dependent issue at first. In any case, we came to understand those problems late last week and planned to clean it all up early this week. There was some miscommunication and the new "broken" scheduler was turned on again last Friday for about a day.

On Sunday our bandwidth dropped to zero. At this point we threw up our hands and figured we\'ll figure this out when we\'re all in the lab together on Monday (today). Remember we do have a policy that it is perfectly okay for our project to be down for a day or two as this is BOINC and people can crunch on other projects in the meantime. Nevertheless, we don\'t want to be too cavalier about that as we know a lot of people just crunch SETI data. But still, given our meager resources our average uptime is quite good, so a day or two of occasional downtime is acceptable. But I digress... Turns out apache was the problem on this server (once again a problem obscured by alerts not running due to mounting issues) and we had to kick it a couple times (including a full system reboot due to messed up shared memory segments) to get it going again. Once going, both download servers choked. So I had to kick both of them as well.

Then we ran out of work. Remember how I said we put a fix in the splitter to keep from writing if the workunit storage server was full? Well, it was being extra cautious and not writing if it said storage server was over 90% full. So as I write this paragraph we\'re low on work to send out, but Eric gave me permission to turn file deletion on in beta so that\'ll clear up space soon enough and we\'ll generate fresh work.
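
A minimal sketch of that kind of guard, assuming a simple percentage cutoff. This is not the splitter code (which is not shown here); the function names are made up for illustration, and only the 90% figure comes from the text above.

```python
import shutil


def ok_to_write(total_bytes, used_bytes, max_fraction=0.90):
    """True while the storage is at or below the cutoff, False once it is over."""
    return used_bytes / total_bytes <= max_fraction


def ok_to_write_path(path, max_fraction=0.90):
    """Same check against a real filesystem, using the standard library."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free
    return ok_to_write(usage.total, usage.used, max_fraction)


print(ok_to_write(100, 90), ok_to_write(100, 91))
```

The cutoff is the knob: set it too high and writes can still fail on a nearly full disk, set it too low and work generation stops while there is still usable room.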

And oh yeah.. we were slashdotted again on Sunday.

That\'s enough for today. We\'ll have the usual outage tomorrow (may be slightly longer than normal) and maybe start splitting some more Astropulse workunits to send out!

- Matt' ), array('24 Jul 2008 21:35:24 UTC', 'Astropulse release progress has been slowed by various things. Some necessary updates were made to the generic BOINC scheduler which we then employed on Monday. After that we found several weird problems including computers being refused work because their hardware was wrongly deemed inappropriate. At first this seemed like a "Mac only" problem but as far as I could tell some Macs were still able to get work. In any case, we ultimately fell back to the "old" scheduler this morning. This improved things according to some rough, immediate analysis. It is still unclear the complete set of scheduler problems, their causes, and their solutions. We\'ll chip away at that as Dave works his way through a large e-mail backlog.

Yesterday Dave, Jeff, and I had a "work stoppage" and went for a hardcore hike in the Desolation Wilderness (near Lake Tahoe) - something we\'ve been talking about doing for way too long, as we are all avid hikers. We were joined by my wife and Daniel, a visiting BOINC developer from Spain. Since this is technical news, the technical details are thus: We took the Twin Bridges trailhead (at 6200\') up to and beyond Horsetail Falls. This included some surprisingly dangerous boulder scrambling which sapped more energy than originally expected. Our plan to bag Ralston Peak (9200\') was reduced to basic exploration up to (and ultimately into) Lake of the Woods (over 8000\'). The boulder scrambling downward was even worse, but all knees/ankles survived intact. All told, about 7-8 miles of hiking/scrambling, almost 2000 vertical feet gained and lost, taking about 8 hours including lengthy breaks. I felt poorly acclimated, even though I easily conquered a similar hike in Yosemite (up to the top of Nevada Falls and back) six days earlier. Dave was acclimated but started the hike a bit exhausted as he did about 800 feet of rock climbing in upper Yosemite the previous day.

- Matt
' ), array('22 Jul 2008 21:16:02 UTC', 'Yesterday afternoon we installed a new scheduler which included some updates necessary for the upcoming Astropulse rollout. However, our network performance took an immediate hit. After about 10 minutes trying to figure out what was causing this Jeff and I realized our scheduler switch perfectly coincided with several expensive credit-analysis queries Eric was running, also in regards to the Astropulse rollout. So it wasn\'t the scheduler - just the database getting overloaded. That got cleared up quickly.

Last night I noticed people complaining about Mac computers being denied work. This is still an issue, probably with the new scheduler implementation, and we\'ll address it shortly.

We had the regular weekly outage today during which I tackled some extra things. First off, due to continuing mysql database performance issues we completely dropped the credited_job table (before we just dropped the indexes). Reminder: this is the table that connects user ids in the mysql database to result ids in the science database, so we know who did what. This is also the only table in the mysql database that grows without bounds, and therefore has been the cause of many headaches as of late. Don\'t worry - we have all this data archived in three formats in three different locations, and will continue to collect this data in flat file format. I also checked the integrity of the database filesystem now that it was cleaner. No problems there. I started up the projects and mysql is currently handling well over 2000 queries/sec without breaking a sweat.

- Matt' ), array('21 Jul 2008 18:49:42 UTC', 'I was out of the lab since last Wednesday hence the dearth of tech news reports. Though not all that much to report. We had a couple of the usual/typical blips that required minor maintenance, most notably the db_purge process (the thing that keeps the result/workunit tables trim by actually deleting database rows from the BOINC database once the scientific data has been inserted into the science database) - this process hung for some unknown reason and the BOINC db grew great in size. A simple restart fixed that.

As for that index corruption in the science database I mentioned last week, that index was rebuilt just fine, but only after we took one drive in the particular RAID holding these indexes off line - smartd was reporting a lot of errors so we think that drive was the culprit of the corruption. We\'ll try to replace it sooner or later (the system is now down to only 47 out of 48 500GB drives).

I haven\'t fully caught up yet from being gone but I imagine there will be some AstroPulse ramping up to report sooner or later. I see scheduler updates have been made (and I think put into beta). I\'ll meet with Jeff/Eric later and discuss.

Looks like there will be a campus network outage that affects us this upcoming Wednesday morning - it will last about a half hour, starting at 6:30am (Pacific Time). A couple router upgrades from what I can tell.

- Matt
' ), array('15 Jul 2008 22:42:09 UTC', 'Had the typical weekly outage today - the results of which were much happier than last week. We were also hoping to fsck the mysql data drive that gave us grief last week to make sure it\'s okay, but the outage was taking too long so we\'ll do that later. We did fire off our weekly science database backup which quickly failed due to finding a corrupt page or two. This happens from time to time - and it turns out this particular corruption is within an index that we can easily drop and recreate if the usual data-cleanup utility doesn\'t work. Also science database replication broke at some recent point, probably because the primary database catching up on backlogged inserts caused some kind of handshake timeout. No big deal - replication is catching up now.

The campus network graphs are all out, which is how we confirm what our current bandwidth usage is. I hope this will get fixed soon. I feel like a doctor without a stethoscope.

- Matt
' ), array('14 Jul 2008 23:07:47 UTC', 'So the second half of last week was spent trying to figure out why our database server was so painfully slow. Bob, Jeff, Eric, and I were scratching our heads, trying this and that to diagnose and fix this mysterious problem. Everything was fine before the Tuesday outage, nothing changed during the outage, but upon restarting the project we couldn\'t handle very much load.

We were quick to blame mysql, as it has had random episodes in the past of secretive bookkeeping causing us grief. We ruled this out. We started blaming the "credited job" table which is growing infinitely. This is the table keeping track of which user did which workunit. We do nothing but insert into this table (no random access selects), so why would that be a problem? Nevertheless we turned off inserts (back to writing similar info to flat files for later parsing) to no avail.

Maybe it was hardware? Did a disk fail? Is a disk about to fail? We ruled all that out as well, which brought the focus back on mysql with dozens of server tuneables that we tweaked for various reasons over the years. Did we go too far with some of those variables? We convinced ourselves that wasn\'t it.

Of course in hindsight the ultimate solution seems obvious: the filesystem where all the data is kept. Just because the hardware seems okay, and I/O rates are normal, doesn\'t mean the filesystem is happy. And the focus was back on "credited job" as this table is constantly growing and therefore a big ol\' file - much bigger than anything else. A file that is constantly growing during all other inserts and updates that happen as the project is running will likely become interleaved and fragmented to the nth degree. Without fearing data loss we dropped the credited job indexes and that alone broke the dam. Well, jeez.

We\'re still catching up from the backlog, but mysql is performing incredibly well at this point. This is good, as we\'re hoping to release Astropulse before the end of the week. More on that later.

Happy Bastille Day, by the way.

- Matt
' ), array('8 Jul 2008 23:19:59 UTC', 'Weekly outage day (to compress/backup BOINC database). It lasted a little longer than usual due to some confusion - unbeknownst to me a recent web code update was made that broke the "stop_web" mechanism which keeps the database quiescent during the outage. It\'s also taking a long time to recover. Not sure why but we\'ll see if the clog pushes through. I took advantage of the outage to move server anakin into the closet. We also upgraded the RAID card BIOS to see if that fixes our minor issues with ptolemy\'s current hardware RAID setup. Well, its logical volume initialization is still way too slow, but maybe we\'ll live with that if all future resyncs are fast.

Just wrapped up the scoring meeting I mentioned yesterday. The bottom line being our current scoring algorithms for individual signals (spikes, gaussians, pulses, triplets) are sound, the multiplet scores (interesting groups of signals of a single type) are 99.9% sound, and metacandidate scores (of single sky pixels containing "candidates" like individual signals, multiplets, or stuff observed from previous SETI projects, as well as interesting celestial objects) are still way up for debate as this is where individual philosophies differ, but we\'ll probably just go with the easiest solution (multiply all the candidate probabilities together) and see what that list looks like. Jeff will write all this up. Maybe we\'ll even have a science newsletter.
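
As a rough sketch of what "multiply all the candidate probabilities together" could look like in code: the function names are hypothetical, and the exact meaning of each probability (and whether low or high products rank as more interesting) is an assumption left open here, as it is not spelled out above.

```python
import math


def metacandidate_score(probabilities):
    """Product of the per-candidate probabilities for one sky pixel."""
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return math.prod(probabilities)


def metacandidate_log_score(probabilities):
    """Sum of logs: same ordering as the product, but no underflow on long lists.

    Requires every probability to be greater than zero.
    """
    return sum(math.log(p) for p in probabilities)


print(metacandidate_score([0.5, 0.1]))
```

With many candidates per pixel the plain product shrinks toward zero quickly, which is the usual reason to rank by the log form instead.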

Jeez... still having a hard time recovering...

- Matt
' ), array('7 Jul 2008 22:23:11 UTC', 'Rather dull holiday weekend except for the fact I was up in Oregon and remotely dealing with several server issues hidden from the public - nothing really newsworthy. Various previously mentioned projects are continuing along: I\'m installing an OS on ptolemy in the hopes we can flash upgrade the current RAID cards\' software and see if that helps, otherwise we\'re buying new cards that we *know* work. I might do a bit of physical server shuffling during the weekly outage tomorrow - get some of the newer stuff into the closet - maybe.

Looks like the big "scoring meeting" is also tomorrow where we will try to settle on our candidate scoring algorithms. Basically we need to pool together our scoring techniques from previous reobservation runs and apply them to the nitpicker which, unlike all prior data analysis, runs and updates in real time as signals flow in. It was easier before, at least in the candidate analysis I\'ve done. You\'d turn the crank, look at the results, adjust some variables and turn the crank again. Not so easy to be as casual and change algorithms when the crank is turning 24/7 and a million signals are added every day.

Oh yeah - back to the "ALFA running" problem on the science status page. Turns out we need to recompile our program that peeks at the observatory status broadcasts for our own status pages. This hasn\'t been recompiled in ages, and much has changed in the meantime. An added complication is that this is running on a Solaris machine down in Puerto Rico, making recompiling old, stale code a challenge. Jeff is tackling that.

- Matt
' ), array('3 Jul 2008 21:11:53 UTC', 'Crazy day getting ready for the long July 4th weekend. There was more testing on ptolemy with more depressing results (why isn\'t it picking up the hot spare when I pulled a drive out from an active array?!). I actually yanked the whole server out of the closet (which required me temporarily shutting down one of the download servers which was physically in the way - but nobody seemed to notice much). We opened it up and found the RAID is indeed on cards and not the motherboard, which is good as this means if we can\'t get this to ultimately work we can get some 3ware cards (or some such) instead.

Meanwhile, with ptolemy pretty much gone we\'ve been having mounting problems with servers still requesting its disks. No matter how hard you try there\'s always some dependencies that hide until too late. So it\'s been a morning full of killing automounter processes, cleaning up stale mounts, deleting bogus trigger files, restarting services, etc. This was mostly hidden from the public - except for several status pages being out of whack. Actually the assimilators all froze but this was hidden behind the stale server status page. Now the queue is pretty large, but it should drain out just fine.

Eric and Jeff are still getting to the bottom of the database/esql interface woes, doing some extreme programming over by Jeff\'s desk. Converting lists with cryptic, undocumented size limits to blobs. One of the last major hurdles for the first rev of the nitpicker. Then it\'s doing all the scoring algorithms, which we\'ll discuss next week.

- Matt' ), array('2 Jul 2008 22:29:10 UTC', 'Working on ptolemy\'s conversion into a NAS box today, with the focus on putting bigger drives in it and testing out its onboard RAID controllers. We\'re finding the hardware RAID to be a bit outdated and not exactly everything we want. For example, it has a 2TB logical drive size limit, and we can\'t create logical drives using more than half the physical drives (they are split over two separate controllers). I guess we can deal.

Some web/user interfaces got broken over the past 24 hours. First, the credit certificates. Incomplete updates were made which were confusing. Dave cleaned that up. Second, the "special user" tags got reset by accident - this also got cleaned up but in the process we temporarily gave some users extra powers (the mysql table dumps were comma delimited so forum signatures containing commas offset the values, blah blah blah).
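
The comma problem is easy to reproduce. The sketch below uses a made-up three-column row (id, forum signature, special-user tag; none of these are the real table columns) and compares a naive comma join with the quoting done by the Python csv module.

```python
import csv
import io

# Hypothetical row: id, forum signature, special-user tag.
row = ["42", "Hello, world", "moderator"]

# Naive dump: join with commas and no quoting, then split on commas to reload.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")   # 4 fields: the signature was cut in two

# Quoted dump: the csv module quotes any field that contains the delimiter.
buf = io.StringIO()
csv.writer(buf).writerow(row)
quoted_fields = next(csv.reader(io.StringIO(buf.getvalue())))  # 3 fields again

print(len(naive_fields), len(quoted_fields))
```

In the naive version every column after the signature shifts one place to the right, which is how a value can end up in the wrong field on reload.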

Regarding the "ALFA running" bit on the science status page - I think I fixed this, but we haven\'t collected ALFA data since, and won\'t for a while, so I don\'t have truly positive confirmation yet. Not a big crisis either way, though I hope we get more ALFA time soon.

- Matt
' ), array('1 Jul 2008 22:09:19 UTC', 'Today\'s Tuesday, which means we went through the usual database cleanup/backup outage. That went smoothly. As I may have already noted before, the replica mysql server has been regularly failing when actually writing the dump to disk. Our suspicion was that this server was having difficulty reaching the NAS via NFS - and mysql has been ultra-sensitive to any NFS issues. The master server doesn\'t have this problem, but maybe that\'s because it\'s attached to the NAS via a single switch (as opposed to the replica, which is going through at least three switches). Anyway.. we dumped the replica database locally and it worked fine. Our theory was strengthened, though not 100% confirmed.

While the project was down we plucked out an old (and pretty much unused) serial console server from the closet. That saves us an IP address (we get charged per IP address per month as part of university overhead - which is another reason I try to keep our server pool lean and trim). I also cleaned up our current Hurricane Electric network IP address inventory and found and cleaned up some old, dead entries in the DNS maps. Not sure if this is what has been causing lingering scheduler-connection problems. We shall see.

As noted in the previous tech news thread, the science status page has been continually showing Alfa (the receiver from which we currently collect data) as "not running" for a while now. This was lost in the noise as Alfa actually hasn\'t been running much recently, but it still should have been shown as "running" every so often as data trickles in here and there. Looking back at the logs there has been a problem for some time now. We get the telescope specific data (pointing information, what receivers are on, etc.) every few seconds as they are broadcast to all the projects around the observatory. Perhaps the timing/format of these broadcasts has changed? In any case, I\'m finding our script that reads these broadcasts is occasionally missing information, so I made it more insistent. We\'ll see if that helps.

- Matt
' ), array('30 Jun 2008 21:58:57 UTC', 'A rather static weekend which is always welcome. This morning I found that, despite DNS changes made several days ago, many clients are still connecting to the old scheduling server. I find this particularly frustrating as there is no legitimate reason for anything to be caching bogus domain information for more than 5 days, especially if said domain had a 5 minute time to live. We need to get to work on this server, so I opened up a currently unused port on one of our non-public servers and gave it the old scheduler IP address to forward along to the new address, thereby acting as a "detour" so we can get to work. Hopefully over time clients will get wind of the correct IP address so we can turn off this detour as well.

Eric\'s back in town. Overheard him and Jeff talking a bit about current nitpicker/database programming woes. Seems like an effective new strategy is being enacted. Other than that, no real news to report and nothing but chores and meetings all day today for me, pretty much.

- Matt
' ), array('26 Jun 2008 21:07:44 UTC', 'The new scheduler continues to handle its new duties just fine. Slowly but surely people are moving their connections over to this new server, but I\'m not convinced the change rate is fast enough to do a wholesale cutover by next week. We shall see.

Funny aside: while getting new-ish donated server "clarke" up yesterday I was annoyed to find that Fedora Core 9 was booting to run level 5 (where it loads the X windowing environment). We don\'t need X on these servers, so we typically set our servers to boot to run level 3 via a change in /etc/inittab. In doing so, I\'d comment out the old line with a "#" and enter in a new line with the adjusted run level. It was still booting up in X. Why? Turns out the latest inittab parser (new with FC9, I guess) ignores "#" comments in inittab, and just looks for lines containing the string "initdefault" and parses the first one it finds. Since I left the old line in there commented out (or so I thought) it was superseding the line I wanted. So much for standards (and clear documentation stating when/how standards change).
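
To make the difference concrete, here is a small sketch of the two parsing strategies. Neither function is the real FC9 parser; naive_runlevel only mimics the behaviour described above (first line containing "initdefault" wins), and the sample file contents are invented.

```python
# Sample inittab: the old line commented out first, the intended line second.
INITTAB = """#id:5:initdefault:
id:3:initdefault:
"""


def naive_runlevel(text):
    """Mimic the described behaviour: the first line containing initdefault wins."""
    for line in text.splitlines():
        if "initdefault" in line:
            return line.split(":")[1]
    return None


def comment_aware_runlevel(text):
    """Skip blank lines and # comments, then match on the action field."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split(":")
        if len(fields) >= 3 and fields[2] == "initdefault":
            return fields[1]
    return None


print(naive_runlevel(INITTAB), comment_aware_runlevel(INITTAB))
```

The practical workaround with a parser like the first one is to delete the old line outright rather than comment it out.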

Nitpicker weirdness: While finally getting around to testing the few optimizations I made to Jeff\'s code I found that multiple runs of the nitpicker on the same pixel were producing slightly different results each time. We believe this is due to the order in which the database pulls out rows - unless requested otherwise databases generally pull things out in random order, i.e. the order which requires the least I/O at that exact point in time (mostly due to page caching or where the many drive arms are currently located in our RAID set). Sorting query output adds significant (and usually unnecessary) overhead. But there are a lot of "fuzzy compares" in the nitpicker (due to floating point computations on different chips you can\'t expect decimal values to be "exactly exact"). When two items are close enough to be called "duplicates" you only need one, but which one you pick may cause different results down the road. So Jeff is elbow deep in this problem right now.
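
To make the order dependence concrete, here is a minimal sketch. It is not the nitpicker code: the function name, the tolerance, and the sample values are all made up for illustration. It keeps a value only if nothing already kept lies within the tolerance, which is one simple form of fuzzy duplicate removal.

```python
def fuzzy_dedup(values, tol):
    """Keep a value only if it is not within tol of one already kept."""
    kept = []
    for v in values:
        if all(abs(v - k) > tol for k in kept):
            kept.append(v)
    return kept


TOL = 0.05  # hypothetical tolerance

# The same three rows, returned by the database in two different orders.
run_a = fuzzy_dedup([1.00, 1.04, 1.08], TOL)
run_b = fuzzy_dedup([1.04, 1.00, 1.08], TOL)
print(run_a, run_b)  # the two runs keep different survivors

# Sorting first makes the outcome independent of retrieval order,
# at the cost of the sort itself.
stable = fuzzy_dedup(sorted([1.04, 1.00, 1.08]), TOL)
print(stable)
```

In the first order 1.00 and 1.08 both survive; in the second, 1.04 is seen first and absorbs both neighbors. Sorting (in the query or afterwards) is one way to get repeatable runs, though as noted above it adds overhead.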

Apropos of nothing, the entire northern half of the state of California is on fire. The smoke ending up here in the Bay Area is intense. I feel like I\'m smoking a couple packs a day just walking around outside. I can smell it sitting here at my desk.

- Matt
' ), array('25 Jun 2008 22:23:54 UTC', 'This morning we turned off the scheduling server on ptolemy and started it up on anakin. This basically worked right out of the box. Pretty quickly we determined the lower traffic rates were due to DNS rollout. Despite having the TTL (time to live) on the download name set to 5 minutes, it sometimes takes weeks to fully convince the world the change has been made. This is due to various types of DNS caching I still don\'t fully understand (why don\'t they all obey the TTL?). Stopping/restarting the BOINC client sometimes resolves this.

However, after an hour or so I decided to play nice and turn ptolemy back on, set in a way using apache to forward all lagging scheduling requests over to anakin with a "permanently moved" warning. I guess I should have done this from the get-go, but better late than never. Immediately this seemed to help, but only the uploads. Download traffic still remained under some rather low ceiling.

So I checked the two redundant download servers (bane and vader). Turns out bane wasn\'t serving any download requests. Was it even getting any? That part is a total mystery - nothing changed in any configurations pertaining to these servers. I double checked the DNS updates. No smoking guns there, either. Well, bane had weird dns/mounting/apache problems before that a quick reboot cleared up, so after rebooting it seemed to be "better" but not by much. Instead of 0 requests per second before reboot, it started serving 2 or 3 - vader is serving around 10. What\'s the deal, then? Perhaps this has to do with our "pound" load balancing utility recognizing bane was having trouble (strangely coincident but unrelated to the anakin switch) and favoring vader until bane got better. I filed this under "unrelated and currently harmless problem."

Anyway.. I then noticed (in between doing other tasks, hence the lag) the upload traffic was increasing way beyond expectations. I assumed everything was okay as all the apache logs were reporting no errors, but indeed the requests forwarded from ptolemy to anakin were failing. Why? Because the http headers were missing variables, including the all-important "Content-Length." Why?!! This I have no idea, but apparently somewhere between apache and/or the boinc client, redirected traffic ends up with different and less informative http headers. And so the schedulers on anakin were saying, "I don\'t know what you want - try again in 10 seconds." This got worse and worse as more clients wrapped up their current workunits and tried to connect.

The solution to all that was to *not* do apache redirects (both 301 and 302 redirects had the same effect) but to use good ol\' pound to simply shovel ptolemy\'s packets towards anakin. This helped all our DNS-lagging clients to finally connect again, but won\'t help to inform them that the scheduling server has indeed changed. Hopefully the clients will learn on their own in the coming days. We plan to turn off ptolemy outright early next week.

Nitpicker progress has been slowed by database programming issues. Informix has undocumented limits on user-defined lists in certain contexts. We may have to work around all that using something other than lists. Jeff\'s been banging on this and other similar programming hurdles for a while, hence the lack of recent info. Plus we have yet to sit down and discuss candidate scoring algorithms which will only happen if we can manage to get the four parties involved (Dan, Eric, Jeff, and me) in the same room at the same time without greater problems hanging over our heads. This hasn\'t happened in, well, months. At least glacial speeds are non-zero speeds.

- Matt
' ), array('24 Jun 2008 21:50:01 UTC', 'Had the usual outage today. No news there, and we\'re recovering normally at the moment.

Continuing along the hardware vs. software RAID theme, we have vast experience getting bitten by both - in the early days of SETI@home we got burned by hardware RAID, hence our current general affinity towards software. However, today Jeff and I got over the (very small) hump of learning how to query the recently donated IBM Xseries on-board RAID from within linux and decided that we\'re going to learn to enjoy living with a zillion different kinds of RAID, each employed based on current needs and resources.

Tomorrow we\'re going to attempt converting our scheduler to the new-used system "anakin" so we can then convert the current scheduler (ptolemy) into a NAS box (to ultimately replace the NAS taking up one third of our server closet). Expect funky DNS rollout issues.

- Matt
' ), array('23 Jun 2008 22:22:22 UTC', 'Another weekend without much ado. Our assimilator queue is low but not exactly pegged at zero. What\'s causing it to not run as fast as all the other backend processes? Not entirely sure, but we know of several things that happen from time to time which may be the problem (i.e. cause extra load on the science database), or at least aggravate the problem. But for now, it\'s not even close to a tragedy, so we\'re just keeping our eye on it.

I guess we did have a disk failure on thumper (the master science database server), or at least a disk complaint. It didn\'t cause any downtime or data loss, but it\'s getting us to reconsider our current stance on software vs. hardware RAID. We\'ve been sticking with software RAID due to ease of use and quickness of warning, but we\'re finding it sometimes doesn\'t behave the exact way we expect, or sometimes not the best way. So this event inspired some additional R&D on that front.

I just rebooted the main web server, so that was offline for a couple minutes. No big deal - just some mounting issues that needed to be cleared out.

- Matt' ), array('19 Jun 2008 19:41:22 UTC', 'We\'re still maintaining an assimilator queue, but it is indeed draining over time. Besides the nitpicker CPU consumption issues addressed yesterday, we\'re also doing several data transfers down to HPSS (our off-site storage) including a large science database backup, as well as several raw data files (we keep copies of all raw data down there). All these things - the backups, the raw data storage, the nitpicker, and the assimilation of new results - run on thumper (because that\'s where all the data are). So there\'s basic I/O contention at the moment.

Other than that I have nothing to report - I\'ve been mostly occupied by bureaucratic/policy tasks for the past while. I was also annoyed to find somebody threw away my plastic fork, which I admit has been sitting used and unwashed on my desk for days, but nevertheless I came to work expecting to eat my lunch with it. The lab kitchen is oddly devoid of utensils. I did find a pile of aged wooden coffee stirrers, out of which I fashioned a pair of makeshift chopsticks.

There\'s a halo around the sun at the moment. Cool.

- Matt' ), array('18 Jun 2008 23:16:03 UTC', 'The assimilator queue grew again. The main culprit this time was the NTPCkr - from here on out I\'ll simply refer to it as the nitpicker - as a reminder this is the program that is pretty much the culmination of all our SETI@home data collection and analysis, i.e. it\'s the thing that\'ll find the aliens if they exist. All other analyses so far using SETI@home data were cursory by comparison.

Anyway.. we\'re finding every so often that we have "deep" pixels containing tens of thousands of multiplets, each containing thousands of signals. When my "science status page updater" hits one of these it hangs on for quite a long time, causing a heavy CPU load on the database server as it tries to wade through this flood of signals gathering statistics. My optimizations (mentioned earlier in the week) helped, but not enough. We may devise/implement more. In any case, the heavy nitpicker load made the assimilators slow down. We killed those particular processes and I think we\'re catching up again. Slowly.

So the donation processing suite had been choked for a couple weeks and nobody noticed. This was caused by a suddenly (and silently) more stringent firewall, and masked by several things. We\'ve been getting the donations, just no confirmations. So there\'s quite a few missing green stars I imagine. Not exactly sure what to do about that just yet.

- Matt
' ), array('17 Jun 2008 20:44:23 UTC', 'Ho hum weekend, which is good. The air conditioning people came up yesterday (Monday) and today to do follow-up inspection of our server closet system (which failed last week) and found a couple more leaks which have been repaired. We seem to really be pushing it beyond its limits. Had the usual database outage today. No big whoop there.

Somebody noted earlier that their results were getting validated surprisingly quickly. We didn\'t change anything. This may have been due to a longer-than-usual period this past weekend of fast workunits - the average turnaround time was roughly 10 hours (about 20%) shorter than normal, meaning pairs were getting matched up that much faster.

A lot of what\'s been going on the past couple of days has been post-vacation catchup (half the staff was out of town). While I have a zillion other things to do I discovered a couple ways to optimize the NTPCkr so I coded that up and I\'m testing it now. Every little speedup on this front helps. Jeff\'s still working on the scoring part. We\'re getting there...

- Matt
' ), array('11 Jun 2008 21:25:25 UTC', 'Some general BOINC code got updated on our servers this morning, which broke a couple things (some pages went blank, and the php "magic quotes" got messed up causing all kinds of backslashes to appear everywhere). I whined to Dave and he fixed it, which is usually how these particular problems sort themselves out. The problem with the web code is that it is being completely or partially used by all kinds of BOINC projects, so a "fix" for one project may end up unexpectedly being a "bug" for another, which is why this kind of thing happens from time to time. We try to keep SETI@home as up to date with the BOINC source tree as possible, even if that means we\'re on the "bleeding edge." Of course this is all web code, so problems like these are cosmetic and relatively minor in the grand scheme of things. We do more thorough alpha/beta testing of the important back-end functions - you know, the ones that update millions of database records every day.

Other than that today has seen more OS installs/RAID manipulations on various donated servers that have been anxiously waiting their call to duty (I got beyond the issues I was having yesterday). Slowly but surely we\'ll get these up and running. I also got a bunch of data drives from Arecibo - it\'s been a while since we got a batch of fresh data up here, so I\'m now lost in data pipeline management mode.

- Matt
' ), array('10 Jun 2008 22:20:19 UTC', 'Normal Tuesday outage. Didn\'t really do anything special this time around. I did mess around with server "anakin" a bit (the presumptive replacement scheduling server) - for starters it keeps booting up in X (though the inittab says not to) and one of its drives got marked as "defunct" (the hardware RAID is rather confusing - I can\'t figure out how to "unfail" the drive). Both really minor issues. At least there was zero fallout from the air conditioner failure yesterday. Other than that I\'m mostly working on mundane sys admin chores and catching up on some back-end diagnostic/analysis stuff.

- Matt
' ), array('9 Jun 2008 20:52:35 UTC', 'Over the weekend the scheduler ceased operations on its own again. I was able to remotely fix this Saturday morning and recovery was swift. This was the same problem as earlier in the week but this time we had a smoking gun: the CGI output log file was maxed out at 2GB in size (this is running on a 32 bit system). Cleaning out the logs solved the problem. The thing is: We\'ve been letting these logs grow to 2GB in size for months without any issue. So why is this a problem all of a sudden? However strange, I put a log rotation script in place to prevent this from happening again any time soon. Funny side note: I would have gotten the alerts faster but coincidentally the lab-wide mail servers conked out as well Saturday morning. Other than that, nothing much to report the past couple of days.
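
For illustration, a size-triggered rotation can be sketched roughly as below. This is not the script actually put in place: the function name, the threshold (half of the signed 32-bit limit), and the number of kept copies are assumptions chosen for the example.

```python
import os

# Largest file size a signed 32-bit offset can represent (just under 2GB).
LIMIT_BYTES = 2**31 - 1


def rotate_if_large(path, threshold=LIMIT_BYTES // 2, keep=3):
    """Rename path to path.1 (shifting older copies up) once it passes threshold.

    Returns True if a rotation happened, False otherwise.
    """
    if not os.path.exists(path) or os.path.getsize(path) < threshold:
        return False
    # Shift path.(keep-1) -> path.keep, ..., path.1 -> path.2; the oldest is overwritten.
    for i in range(keep - 1, 0, -1):
        older = "{}.{}".format(path, i)
        if os.path.exists(older):
            os.replace(older, "{}.{}".format(path, i + 1))
    os.replace(path, path + ".1")
    open(path, "a").close()  # leave an empty log in place for the writer
    return True
```

One caveat with any rename-based rotation: a process that already has the log open keeps writing to the renamed file until it reopens its logs, so the writer (apache, in this case) generally has to be restarted or signalled afterwards.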

Which brings us to today. Around 12:30 our server closet air conditioning unit died. Within 30 minutes all the servers warmed up over 5 degrees Celsius and I started getting alerts. This may be a significant problem (i.e. we may need more than just a coolant refill). So depending on how fast we can get the maintenance people up here I might have to shut down parts or all of the project to prevent server burnout. Meanwhile, I have the server closet doors open to help cool things down, much to the annoyance of all the projects on this floor (the fan noise is about 20-30 decibels louder with the doors open). The poor people across the hall from the closet are being deafened - my desk is a few doors down.

- Matt
' ), array('5 Jun 2008 21:24:59 UTC', 'Another mild day in server land. Lots of minor apache issues. There was an annoying web scrape yesterday afternoon that gummed up the works for a moment. This morning I found a bug in the web log rotation script that prevented our public web server from restarting - so it\'s been running for weeks non-stop during which the httpd processes bloated in size (apparently there are small/tolerable memory leaks in php/apache/boinc code somewhere). Then later our scheduling server was suddenly unable to run the scheduler cgi. We were dropping connections so I got alerts right away about this. I had to stop/restart apache twice, though, to get it working again. Not sure why the first restart didn\'t take.

Jeff\'s adding more star catalog data to our database. Bob worked on another alert script to better check our current database storage allocations (and prevent another minor mishap like earlier this week). Eric and I swapped drives between his hydrogen server "ewen" and ptolemy (for when the latter becomes a storage server) - ewen freaked out a little bit unexpectedly - we umounted the filesystems before pulling the drives, but an xfs daemon woke up and thought that particular partition should still be around, etc. No big deal - just a lot of alert e-mails that were scary at first.

- Matt
' ), array('4 Jun 2008 20:06:25 UTC', 'Things are continuing to clear up nicely since the science database kerfuffle earlier this week. The assimilator queue is still large, but now that everything is more or less "caught up" it\'s draining at a pretty good clip.

Nobody probably noticed but for a while there this morning (actually still as I type this sentence) we had two scheduling servers - ptolemy and anakin. I finally got anakin up and configured and made it a secondary scheduler to test it out. Once we\'re ready to convert ptolemy into something else, we now have another scheduling server in our back pocket.

- Matt
' ), array('3 Jun 2008 21:46:01 UTC', 'Good news. The science database problems were far less severe than we thought. Short story: we ran out of space. Long story: due to a slightly confusing configuration we thought we ran out of extents for reasons unclear. Informix categorizes all usable storage space into dbspaces, fragments, chunks, extents... maybe more things I\'m not sure. We\'ve had problems in the past where we ran out of extents long before running out of actual disk space and we thought this is what happened again. The solution for such is painful - basically like rebuilding a RAID system (unload everything, recreate, and reload). Luckily we discovered we had some fragments/chunks misaligned (some fragments had more chunks than others) so all we had to do was add more chunks, and we had plenty of disk space for that. We added enough to get by for now, and will do more when we catch up from the queue draining/filling.

We had our usual outage today (for BOINC database backup/compression, etc.). Between the usual recovery for that and the recovery for all the above it may be a bumpy ride for the next 24 hours or so.

Yesterday afternoon server "bane" (one of the two download servers) was having mounting issues which required a reboot to clean up. I was home at the time and rebooted it remotely. Of course, like my desktop last week, a new kernel was yum\'ed in during the recent past and messed up grub for some reason, so it wouldn\'t load the OS. I had to get Jeff, who was still at the lab, to deal with booting from the emergency DVD and then boot from an older kernel. While bane was down half the download connections were failing, but usually retries were successful as we have the two redundant servers.

Today I got server anakin more officially racked up (actually just sitting in a rack directly on top of a UPS) to ultimately become the new scheduler. It\'s a recently donated Dual Xeon (used) that is actually less powerful than our current scheduler, ptolemy, but should be able to handle the job just fine. We plan on making ptolemy, with its 16 mostly unused drive bays, a network storage server to replace our ageing Network Appliance server, which fell out of service long ago and its many drives are dying with regularity - infrequent but still worrisome.

- Matt
' ), array('2 Jun 2008 18:58:32 UTC', 'Early Sunday morning I discovered the assimilators were all failing. Immediate analysis uncovered zero smoking guns. All the assimilators were choking on the same subset of results, and all while inserting pulses. Plus the actual processes were seg-faulting before they could produce any useful error codes. Checking the failing result files and database entries showed nothing obvious (all different sizes, submitted at different times, created by different clients, etc.). I did all I could do. I told the other guys (Bob, Jeff, Eric) - Bob\'s checking the database now for any subtle weird behaviour (once again I found no obvious problems yesterday) and Jeff\'s recompiling the assimilator code (perhaps a version that outputs useful error information). In the meantime, the assimilation queue grows, and our disk usage grows with it (as we haven\'t deleted anything in over a day) - sooner rather than later I\'ll have to stop the splitters to prevent storage disasters. I\'ll update this thread if we figure out what\'s up on that front.

The only other real gripe right now is that our data recorder system at Arecibo is only seeing one of two data drives. Not a tragedy - we can still record data but this will put additional strain on the operators down there until we figure out why.

- Matt
' ), array('29 May 2008 22:40:14 UTC', 'I spent the entire day so far (and will certainly continue after writing this missive) doing nothing anybody will ever care about - mostly revolving around php programming for the upcoming letter drive (more on that later). My desktop was getting funky X errors so I decided it was due for a reboot, and then it wouldn\'t come up again. This new Fedora Core 9 distro apparently yum\'ed in something which broke the boot loader. An hour or two spent trying to suss that out and ultimately reinstalling the OS and I\'m back in business.

We did have a software meeting earlier - we\'re getting back on track with various stagnant analysis/database projects. Also discussed the Google Sky map stuff - they get their images from many different sources, so it\'s still unclear what epoch the coordinates are in. No simple official statements like, "Google Sky coordinates are entirely in J2000." So we\'re going to have this cosmetic issue where the image data on the science status page may not exactly line up with our reality (which is J2000). In any case, this is hardly a scientific issue as it doesn\'t affect our analysis - just what\'s in that neat little Google window.

- Matt
' ), array('28 May 2008 20:04:41 UTC', 'People noticed there were short network "hiccups" during the course of the evening, ending this morning. All of it was quite mysterious - no database problems, no workunit storage server problems, and at first no obvious download server problems. Upon further examination I found the DNS configuration was "lopsided" towards one of the two download servers. We have load balancing software on both machines so they were sending equal numbers of workunits, but all initial requests hit only one of the two. This hasn\'t been a problem before, but apparently this week\'s outage caused enough strain on apache such that every few hours the load got fairly high and log rotation would take abnormally long (several minutes) and nothing could get through during that time. We are also at our highest active user level in over a year (about 10% higher than a couple months ago), so maybe that added to the apache/server stress level, and what we were seeing were outage "aftershocks." In any case, I fixed the DNS so perhaps this won\'t be so drastic next week (and hopefully for many weeks to come).

Work on the NTPCkr continues - Jeff uploaded the Hipparcos Catalog to the database, so I added a star count on the science status page for the pixel we are currently observing. Of course, the more stars in a pixel the higher the score. However, there are only about 100,000 catalogued stars and 15,000,000 pixels. So odds are pretty high we are observing zero (known) stars at any given moment.
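
As a rough back-of-the-envelope check on those odds, here is a small sketch. It assumes the catalogued stars are spread uniformly and independently across the pixels (a Poisson model), which real star fields are not, so treat the number as an order-of-magnitude estimate only. The star and pixel counts are the round figures quoted above.

```python
import math

stars = 100_000       # approximate number of catalogued stars
pixels = 15_000_000   # approximate number of sky pixels

# Average number of catalogued stars per pixel.
mean_per_pixel = stars / pixels

# Under a Poisson model, the chance that a given pixel holds no stars.
p_zero = math.exp(-mean_per_pixel)

print("mean stars per pixel: {:.4f}".format(mean_per_pixel))
print("chance of zero stars in a pixel: {:.2%}".format(p_zero))
```

That works out to about one catalogued star per 150 pixels, so under this simplified model better than 99% of pixels contain none.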

Oh yeah the idle splitter processes - a couple were shirking their duties. I told them to stop slacking off and get back to work. Not that we needed them but it looks bad to have \'em sitting around doing nothing (in reality they were stuck on some stale trigger files).

- Matt
' ), array('27 May 2008 21:23:45 UTC', 'Long holiday weekend (Memorial Day). On the actual day off (yesterday) the BOINC web/download server was misbehaving. In theory I should have been able to connect to the KVM from home but that wasn\'t working properly (couldn\'t access via the web due to incompatibilities with newer JRE versions, couldn\'t access via the standalone client since I ain\'t got no Windows machines and the client only works on Windows, etc.) so I had to drive up to the lab to kick it in person. No big deal - just a runaway job that clobbered the process queue. Had the usual database backup outage today. Not much news to report.

To answer RHWhelan from my last thread: > seems that most of the data we analyze gets dumped soon after we report. Not sure what you mean by dumped but nothing important is getting thrown out. Your SETI@home client reduces about 350K of raw data into a few signals which get plopped into a result file and uploaded to our server. Once these signals are verified and put into our master database the result file (and its sister row in the database) are deleted to make way for more. The signals themselves never get deleted.

> It also appears that the real staff spends more time transferring, storing and manipulating data and hardware than actually analyzing the results. I don`t mean to be critical, I am actually very devoted to the philosophy of SETI but I must admit it seems a bit futile. It appears that way because it\'s completely true. And there\'s nothing wrong with that. To be clear, the "real staff" running the entire show is me, Jeff, Eric, and Bob - all working part time (combined we\'re about 3 full time employees). Anyway... I understand the feelings of frustration due to perceived futility - science takes time, underfunded/understaffed science takes even more. We\'re only just now turning the corner on the analysis. Until final results start appearing, we\'re still productively collecting/reducing data - not as interesting, but still quite useful. I don\'t expect everybody to maintain interest until we have some real data products, and then I expect interest to jump.

> Are there ever any "HITS" or even slightly suspicious data streams? There are hits and then there are HITS. We haven\'t really looked for the HITS yet as we\'ve been unable to until very recently (that part is working now in beta). There are no data "streams" as data don\'t come to us in streams - the earth rotates so signals that persist over time that are actually originating from outer space will only last a few seconds as our beam passes over them.

When I first started working on SETI in 1997 the group here (just Dan and Jeff at the time) was wrapping up final analysis on SERENDIP III. Didn\'t find anything really interesting. Then we started collecting data for SERENDIP IV. We were starting to dig into the final analysis of that data set (about 60GB) when SETI@home came into being and derailed that, though Jeff and I have been plotting to wrap that up sometime soon (once we get the SETI@home final analysis rolling). SERENDIP IV is actually interesting, even with 11 year old data - the analysis is hardly as deep as SETI@home, but much wider: the frequency range is about 35 times bigger than SETI@home. We are also doing Optical SETI, and pulsar searching... The point is that SETI@home isn\'t all we do, nor is our lab here at Berkeley the only SETI lab on the planet. Nevertheless we do have the biggest, bestest search going by far.

- Matt
' ), array('22 May 2008 22:35:37 UTC', 'More database poking/prodding today. Tweaking different mysql variables (and even adding "noatime" and "nodiratime" to the mount options of the data partitions) didn\'t really help all that much in regards to the transaction committing stuff I was whining about yesterday. So be it. Bob and I also found this morning that our science database indexes were in need of rebuilding as well. Every few weeks we need to run an "update statistics" query to keep those indexes in line.

Slowly working my way through the OS upgrade queue. We\'re getting FC9 installed on one of three recently donated servers (dual 2.80GHz Xeon / 4 GB RAM) so we can finally start getting these (and another equally powerful P4 server with more RAM, also recently donated) thrown into the fold. The use of these is still up for debate, though they all will be perfectly good general backup/redundant/compute servers. We are definitely missing some redundancy on the backend. I mean, we do have server "maul" sitting around which is quite powerful but being a test model donated by Intel it has an engineering motherboard with keyboard/mouse issues, so we don\'t want to trust it with anything that needs to have 24/7 uptime - instead it\'s up and running as a test/compute server, i.e. if it goes off line for any period of time we won\'t be sad.

Anything else? Just some work on more internal data plots for data integrity checking, and the final bits and pieces of that proposal which is due tomorrow.

- Matt
' ), array('21 May 2008 22:16:59 UTC', 'The BOINC mysql replica wrapped up its resync. This morning Bob did some testing to see if we can improve our failure/recovery situation. MySQL allows different levels of log commitments to disk: commit only when the buffer is full, commit at least once a second, or commit on every transaction. We\'ve been sticking with the middle option, as that affords us the most protection without heavy disk I/O - the worst case is that we lose one second\'s worth of data. However, we\'ve proven a couple times now that we do many updates per second (i.e. hundreds) and that\'s enough to bring the master/replica majorly out of sync if one crashes before being able to commit. So today we tried the last option and expected an increase of disk I/O and sure enough this commit level brought the database to its knees almost instantaneously. We tried this first on the replica and thought it was its software RAID or low number of spindles causing the headache, but applying this to the heftier master had the same effect. So it\'s back to the drawing board on that front: we don\'t have the server capacity to commit on every transaction. Maybe there are other screws we can tighten to make this possible. Bob\'s looking into that. More tests to come, or we\'ll just put this on the back burner.
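
The tradeoff between the last two options can be sketched with a toy model. This does not touch MySQL at all: the function, the policy names, and the rate of 300 updates per second are assumptions for illustration (the text above only says "hundreds"). It counts how many disk flushes each policy performs before a crash and how many updates are lost at the crash.

```python
def simulate(updates_per_second, crash_ms, policy):
    """Return (flushes_done, updates_lost) for a crash crash_ms milliseconds in.

    policy is "every_transaction" or "once_per_second".
    Updates are assumed to arrive at a steady rate.
    """
    total = updates_per_second * crash_ms // 1000
    if policy == "every_transaction":
        # One flush per update; nothing is pending when the crash hits.
        return total, 0
    if policy == "once_per_second":
        flushes = crash_ms // 1000              # one flush per completed second
        durable = updates_per_second * flushes  # updates covered by those flushes
        return flushes, total - durable
    raise ValueError("unknown policy")


RATE = 300  # hypothetical update rate

print(simulate(RATE, 10900, "once_per_second"))
print(simulate(RATE, 10900, "every_transaction"))
```

In this model, a crash 10.9 seconds in costs 270 updates under the once-per-second policy after only 10 flushes, while flushing on every transaction loses nothing but needs 3270 flushes over the same period, which is the kind of I/O jump described above.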

Other than that... Got FC9 running on my desktop. So two computers are upgraded now, and I\'m getting to understand all the gotchas. Also Jeff and I actually are discussing SERENDIP again. You ever hear of that? That\'s the project we were working on before SETI@home happened, and it\'s been in limbo for about 10 years. But as Dan continues to build SERENDIP-like spectrometer boards to help other SETI scientists around the world, these other projects may want to incorporate our data collection/analysis software, so we better dust that off sooner than later. In the process we can maybe throw the old SERENDIP IV data into the same database as SETI@home to buff up our sensitivity even more. That\'s the hope, anyway.

- Matt
' ), array('20 May 2008 20:44:57 UTC', 'Today\'s weekly backup/compression outage was more or less normal, running the "recover replica from backup" drill without ado or incident. That\'s all continuing now behind the scenes as we already have the main project up and going through its usual quick recovery.

In the previous thread Joker mentions some (broken) changes on the account page, etc. I see that a lot of php files were updated on our web site. We sync our web site from time to time with the most current versions in the BOINC html repository, and of course this may alter behavior of certain pages or break them altogether. The appropriate parties have been notified.

- Matt
' ), array('19 May 2008 23:11:32 UTC', 'Fairly straightforward weekend, server-wise. We\'re still without our BOINC mysql replica database (see previous note) but we\'ll clean all that up tomorrow during the usual Tuesday outage. We\'ll also test some mysql configuration options which may protect us from such failures but at the expense of increasing disk I/O. Basically mysql could write every transaction immediately to disk as opposed to writing all queued transactions in a batch once per second - which doesn\'t sound like much but we can do hundreds of updates per second at times.

Still fighting with Fedora Core 9 on the test system. Ultimately trying to yum up from FC6 failed, and trying an upgrade from DVD failed - I just couldn\'t get X to work. So I did a clean install and that fixed the X problem, but there are some surprising but minor issues I\'m working around. For example, a bug (or feature) prevented the ifcfg-eth0 script from having a "GATEWAY=" line, so I had to add that by hand to get network connectivity. And autofs wasn\'t installed by default. I yum\'ed it in and it isn\'t working. I\'m debugging that now. Oh I see - "grpid" isn\'t a valid mount option anymore (?!).

I did add yet more info of nonzero interest to the science status page - namely a link to a chart noting our entire SETI@home data distribution history. I made this chart for internal use originally, but decided it may be fun for the public to see when exactly we observed and roughly how much we analyzed per day. I know I added a couple of web features under the radar lately - I figure we\'ll publicize all the fun new tidbits in bulk at some point.

- Matt
' ), array('15 May 2008 23:35:49 UTC', 'Okay today wasn\'t so great, but it could have been worse. Eric had continuing problems with ewen so he tackled that for a couple hours this morning, finally getting the thing to recognize its new SCSI drives upon reboot. The general network malaise that happens when ewen is offline masked the fact that, like before, BOINC mysql database server jocelyn suddenly rebooted itself for no apparent reason, causing the mysql engine to shut down ungracefully and requiring a lengthy cleanup.

So that\'s why we were offline most of the day. Upon recovering the replica server (sidious) was out of sync - no big surprise there but that means we\'ll have to rebuild the replica database yet again. What a pain! In theory we should be able to swap relation between these two servers easily during such crises, but we haven\'t gotten a well oiled procedure in place yet for that. Maybe we\'ll start running drills on this soon. Thing is we didn\'t want to get fancy as we\'re near the end of the week, people are bogged down with the proposal, and I\'m actually going out of town tomorrow for a quick private corporate gig in LA so I\'m going to be completely out of touch for the next 40 hours starting.... now!

- Matt
' ), array('14 May 2008 23:48:03 UTC', 'More of the same today. General progress slowed by grant proposal effort and continuing ewen debugging - as mentioned in yesterday\'s note, when ewen is down everything still works, more or less, just veeeeery sloooowly. I\'m also experiencing some growing pains trying to install Fedora Core 9 on one of our test servers (which also, as it happens, sends out the "reminder" e-mails). Ran into problems with a standard "yum" live upgrade. Fair enough - I went to upgrade it from DVD but only then realized the system has only a CD drive. Sigh. So I had to pluck a DVD drive out of a defunct system. Then finally after the install X isn\'t working. I\'m hoping a yum update at this point will fix that. On the bright side I continued Jeff\'s effort on Google Sky and converted our science status page to use it. Fun! I\'ll make a formal announcement of server status updates when I add one or two more things...

- Matt' ), array('13 May 2008 22:11:58 UTC', 'The standard weekly outage chores (database compression/backup, log rotation, general housecleaning) went by without much incident. It\'s the extra stuff we try to do at the same time that may or may not be as easy. Today Eric wanted to add a donated (and upgraded) 12TB disk array to his Hydrogen database server, ewen. We also took the opportunity to move a few things around in the closet now that there was rack space (and rack rails that fit!). The moving was fine - however ewen is having problems booting now. Eric added a couple SCSI cards, so maybe there\'s confusion about where the boot disk is, etc.

In any case, ewen isn\'t really a SETI@home/BOINC server, but contains enough shared stuff that when it disappears, there\'s a general malaise in the BOINC backend. Uploads and downloads are fine - it\'s the splitter, validating, assimilating, etc. that\'s not going so well (if at all). Eric\'s beating his head on that. Meanwhile, random unix commands sometimes work immediately, sometimes take 30 seconds to respond. Not so fun. We hope to get beyond this before day\'s end.

I did fight the crowds and downloaded Fedora Core 9 for soon-to-be server upgrades. I\'m upgrading one test case now - so far so good.

Jeff has been figuring out the Google Sky API. We\'ll probably replace the Sloan Survey pix on the science status page with this, as well as use Google Sky to show our current top candidates as they start rolling in via the NTPCkr.

- Matt
' ), array('12 May 2008 23:26:00 UTC', 'Not really much of an exciting weekend server-wise, which is typically a good thing. Lots of little bits and pieces being put together to get the new project and scientific analysis software rolling, but nothing really to report outside of mundane details. Progress in general is temporarily slowed this week - we\'re a man down as Eric is lost in grant proposal land.

Fedora Core 9 is coming out tomorrow. If the mirrors aren\'t swamped I may upgrade a test machine or two during the usual Tuesday outage. I\'ll also start bringing some recently donated servers on line which have been waiting on this release (I didn\'t want to install 8 just to have it become obsolete that much faster). We may also do some server closet shuffling during the downtime.

Happy belated Mother\'s Day!

- Matt
' ), array('8 May 2008 21:17:25 UTC', 'I\'ll start with hardware - just some minor things. First: the website (and alpha projects) were down for a while this morning because the BOINC server froze. Still not sure why, but a power cycle cleared that up. Second: currently AstroPulse scientific data only exists in the "beta" realm - Bob and company are now creating the db spaces on the master science database server along with SETI@home. This may slow things down temporarily due to heavy disk I/O. Third: we got our second new enclosure (the previous one was broken) so we\'re starting to archive data off site again via our ISP, hence the slightly noticeable bump on our traffic graphs. I guess from this point on you shouldn\'t assume all transferred bits depicted on said graphs are due to workunit/result exchange.

Software wise, we\'re chugging along on the various projects mentioned in previous threads. When we all get into programming mode this generally tends to uncover bugs/issues that went unnoticed during network manager mode (or scientist mode, or administrator mode, or ...). Things like being able to insert workunit_groups of any size, but only able to read ones under 8K. Not a problem when all we\'re doing is inserting, but now that we have to read them back in to do some precess adjustments, this constraint uncovered a few such groups that were extra-large in size. Why? Well, that\'s what I mean - one little headscratcher leads to another. I\'ve been on this all day, and Jeff\'s been beating his head on this "ragged file" problem causing some splitters to error out - but when we restart them on the same files they work. Why? Why?! Actually, these problems are kinda fun as when we do discover the root cause there\'s a happy "a-HA!" moment.

- Matt
' ), array('5 May 2008 22:44:09 UTC', 'Typical weekend - a couple weird things but nothing tragic. For example the assimilator queue ballooned for a while, but then worked its way back down to zero on its own. There might have been mysql database load causing some general malaise like the above - no smoking guns have been found yet.

Otherwise general progress. With the servers doing well I continue to send out reminder e-mails to users who haven\'t returned results in a while. We consistently fight a general downward trend as people buy new computers and forget to reinstall BOINC. Looking at the recent active user graphs out there I\'d say about 10% of the reminder e-mails result in a returning user. Most of them bounce (or get spam filtered). Also a large fraction of these e-mails are currently going to users who haven\'t sent results back in years. So I imagine the success rate will increase over time, but on the other hand I imagine we won\'t be sending out such mails as often in the future (the number of people who could be deemed "ready to remind" is finite).

Meanwhile I\'m working on finally running the precess fixer (ran into some embedded sql issues this afternoon), while Jeff is almost ready to throw the NTPCkr into beta. We actually discussed public data visualization of candidates at our general meeting this afternoon. And it sounds like AstroPulse is pretty much ready for prime time as well. Woo-hoo!

Happy Cinco de Mayo!

- Matt
' ), array('1 May 2008 21:03:51 UTC', 'Happy May Day!

Not much to report these past couple of days. We\'ve mostly been bogged down doing actual software development, which for me has meant trying to wrap my brain around how to pull useful information out of the science database in an efficient manner. The "efficient" part is the crux given the size of the database. Nevertheless, I will be restarting the skymap processing again - watch for new maps soon, albeit of coarser resolution, but perhaps animated over time. We shall see. Jeff\'s been in NTPCkr land, mostly, though we\'ve been working through continuing data flow issues together as well. Note how I added a third color (gray) to the splitter status section of the server status page. This denotes files that didn\'t complete due to error which, at this point, is always due to "ragged" files (i.e. missing blocks at the head/tail containing the radar blanking signal).

We had lingering problems rebuilding the BOINC db replica. Despite getting a clean dump from the master, upon reload the replica complained of broken tables that needed repair. These tables did break in the recent past but have since been fixed, but maybe there were lingering error flags hanging around. Anyway Bob cleaned all that up and it\'s catching up now (again).

EDIT: in case you\'re watching the network graphs, we just figured out how to send more data to our archives over the ISP - so the spike is raw data archival traffic, not some kind of sudden workunit download frenzy.

- Matt' ), array('29 Apr 2008 22:08:03 UTC', 'During today\'s outage, Jeff and I did yet more reorganization of room 329, culminating in finally, for the first time ever, putting sidious in a rack. This was a major step in filling this particular rack, which will hopefully replace one of the three racks in the closet sooner than later. We also did the steps to rebuild the replica database, which is happening in the background now. May complete tonight or tomorrow, and then it shall "catch up" quickly after that and we\'ll be back in business on that front.

Clarifying the bottleneck I mentioned yesterday - this is strictly due to our current data processing rate. Drives with raw data come in, which we always archive to off site storage as well as copy into our processing directory (where the splitters read them to make workunits). In a perfect world, we\'d be processing data as fast as we archive them, but to do so would require a lot more active users. So frequently our 8 terabyte processing directory fills up with unsplit data, and everything logjams. So this isn\'t a database bottleneck - it\'s a data bottleneck. More people/computers is the solution.
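
A back-of-the-envelope sketch of that logjam, in Python. Only the 8 terabyte capacity comes from the note above; the function name and the arrival/splitting rates are made up purely for illustration.

```python
def days_until_full(capacity_tb, used_tb, arrive_tb_per_day, split_tb_per_day):
    """Return days until the processing directory fills, or None if it never does."""
    net_growth = arrive_tb_per_day - split_tb_per_day
    if net_growth <= 0:
        # Splitters keep up with incoming data, so the directory never fills.
        return None
    return (capacity_tb - used_tb) / net_growth


# Hypothetical rates: raw data arrives faster than the splitters consume it.
print(days_until_full(8.0, 2.0, 1.0, 0.5))
# Hypothetical rates: enough active users that splitting outpaces arrival.
print(days_until_full(8.0, 2.0, 0.5, 1.0))
```

The only way to push the first case toward the second is to raise the splitting rate, which in practice means more computers asking for work.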

Still, people asked for more info about the quality/quantity of database throughput. Here\'s a short essay about that. This is by no means complete, but it\'s a good start.

We have two databases, the mysql database which is BOINC specific (running on jocelyn, replicated on sidious - we call it the "BOINC" database), and the informix database which is SETI specific (running on thumper, replicated on bambi - we call it the "science" database).

The science database, while very very large (billions of rows) is not a problem under normal conditions, even as we insert over a million new rows every day. This is because inserts are generally at the ends of tables, so it\'s all pretty much sequential writes and that\'s it. With the introduction of actual scientific data analysis comes large numbers of random access reads. Earlier this year, tests using the NTPCkr (our software to do such analysis) showed this will be a problem so we spent a couple months reconfiguring the science database server/RAID systems to optimize random access performance. We seem to be in the clear for now as we continue NTPCkr testing.

The BOINC database is largely where problems arise, partially because this is our public facing database, i.e. users notice quickly when it isn\'t working. This contains all data pertaining to user stats, the web site, result/workunit flow, and the whole BOINC backend state machine. On average it gets about 600 queries per second, peaking at well over 2000 per second (like now, as we recover from today\'s outage). Thanks to many years of gaining expertise forming proper queries and creating proper indexes, 99% of these queries are super duper fast. But there are still unavoidable issues.

The lifetime of a particular workunit and its constituent results is long, as they are created, sit on disk waiting to be sent, hang out in the database as users process them after which they succumb to the whole validation/assimilation/deletion cycle, and finally get purged after a 24 hour grace period (so users can still see finished results up on the web for some time after completion).

Due to this lifetime at any given point we have roughly 3 million workunits and 6 million results in the BOINC database. This is all important data, but it\'s mostly metadata - the scientific stuff is contained on larger files on disk. So even with these large tables, and the user/host tables, and forum/post/thread tables, all the commonly accessed parts of the database fit into memory cache when it\'s all "tightly packed."

We create upwards of a million workunits/results a day in this database, which means the tables would immediately grow too large to be useful, which is why we purge (i.e. delete) them when they are finished - the useful data has been assimilated into the science database at this point anyhow. But deleting isn\'t in sequence - it\'s random as results don\'t return in sequential order. When rows are deleted from a mysql table, it doesn\'t free up space until ALL rows from the entire database page are deleted - something that isn\'t likely when done in random order. So even though row counts remain stagnant on these two tables, the tables bloat to roughly twice the size on disk by week\'s end, and mysql memory cache takes a major hit. This is why we have a weekly outage to, among other things, compress the tables (or "repack" them).
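
To see why random deletes free so little space, here is a toy simulation in Python. It is a simplified model and not how mysql actually lays out storage: the row count, the 50 rows per page, and the function name are all assumptions chosen for illustration. A page only counts as freed once every row on it is gone.

```python
import random


def pages_freed(total_rows, rows_per_page, deleted_row_ids):
    """Count pages whose rows have ALL been deleted (only those release space)."""
    deleted_per_page = {}
    for row_id in deleted_row_ids:
        page = row_id // rows_per_page
        deleted_per_page[page] = deleted_per_page.get(page, 0) + 1
    freed = 0
    for page, count in deleted_per_page.items():
        # The last page may be only partially filled, so use its real row count.
        rows_on_page = min(rows_per_page, total_rows - page * rows_per_page)
        if count == rows_on_page:
            freed += 1
    return freed


TOTAL_ROWS = 100000      # hypothetical table size
ROWS_PER_PAGE = 50       # hypothetical rows per database page
N_DELETE = TOTAL_ROWS // 2

sequential = range(N_DELETE)                       # oldest rows first
rng = random.Random(0)                             # seeded for repeatability
scattered = rng.sample(range(TOTAL_ROWS), N_DELETE)

print("pages freed, sequential deletes:",
      pages_freed(TOTAL_ROWS, ROWS_PER_PAGE, sequential))
print("pages freed, random deletes:",
      pages_freed(TOTAL_ROWS, ROWS_PER_PAGE, scattered))
```

Both runs delete the same number of rows, but the sequential run empties whole pages while the scattered run leaves nearly every page partly occupied. That gap is what the weekly repack cleans up.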

Meanwhile, there are daily unavoidable long queries, for example to do user/host/team stats dumps. To dump all this data means reading in whole tables into memory (not just pertinent rows/fields) - queries like this temporarily choke memory cache. Indexes won\'t help - we\'re reading in everything no matter what.

Also meanwhile, I haven\'t mentioned the "credited_job" table which is actually the largest table in the BOINC database. We\'re still just inserting into it (harmless sequential writes) but I\'m afraid this is a disaster waiting to happen once we start actually reading from it.

Bottom line, the BOINC/mysql database is usually fine as of now. It beautifully handles a stunning variety of queries from several public servers and a rather busy backend. A perfect open source solution that folds nicely into the general BOINC philosophy (keep it standard and free). SETI@home is rather large compared to other BOINC projects, so we had to put a lot more TLC into maintaining our mysql servers, and we pass our improvements on to the general BOINC community.

- Matt' ), array('28 Apr 2008 22:59:14 UTC', 'Back from a relatively painless weekend. Except the replica mysql database is screwed up again - it got stuck on a duplicate ID (not sure why) which is relatively harmless but this caused its logs to grow at an inordinate rate, filling up the data drives and bringing the whole thing out of sync. Fine. We\'ll recreate the replica again during the outage tomorrow (much like we did a couple weeks ago).

Since we\'ve been fairly stable the past couple of weeks I continued to send out the "reminder" e-mails today which has already rocketed our active user base back over 200,000. This is good, as our current data flow bottleneck is the amount of data we are able to process, so the more computers the better. Tell your friends!

- Matt
' ), array('24 Apr 2008 20:33:28 UTC', 'Work week wrapup. No major news outside of things I already posted here and elsewhere. People are out sick. Man there\'s been a lot of nasty bugs going around this year. I\'ve been catching up on minor nagging items. Mostly cleaning up the lab - some recently donated servers are stuck waiting on fedora core 9 to be released as well as having no place to physically put the things to set them up. We have a lunch table in the center of the lab piled with random stuff so we\'re all eating lunch at our desks. Also worked on donation system upgrades. The IT people on campus are now allowing us to pass hidden user ids which will vastly increase my ability to match green stars to specific donators (we\'ve been relying on people entering the right e-mail address on the donation form). Some updates to the boinc web interface broke a few pages - I fixed all that. Yeah.. lots of the usual day-to-day tasks.

- Matt
' ), array('22 Apr 2008 22:27:41 UTC', 'Back from a long weekend out of town. Didn\'t seem to miss very much. I checked the network graphs while I was away and saw no dips, so that\'s a pretty good sign things were generally healthy in my absence. There was another seemingly bogus disk failure on thumper. Is smartd being too sensitive? The drive tagged as potentially faulty was failed/re-added without much ado. Today had the usual outage. Nothing out of the ordinary there.

One funny thing - for an unspecified amount of time nobody on the Berkeley campus (outside of the space lab) was able to connect to our servers to receive/send SETI@home data. This was due to asymmetrical routing - a problem on our public facing servers that send data over our ISP (as opposed to via the campus LAN). Jeff found and fixed the problem and I updated the network scripts to make sure a reboot doesn\'t break it again.

Jeff just spent an hour or so walking me through the current nitpicker (i.e. the candidate-finder) code. This really is one of those simple concepts that requires a complex solution. I find it frustrating to describe why, as the reasons are hardly obvious, and the problems are nested. We used to do this stuff with our own human brains which can find patterns and detect duplicates and RFI quickly as long as the data fits on a couple pages. This isn\'t so much the case anymore, and getting the computers to smartly (and efficiently) do the same grouping, comparing, and discarding is difficult. Think of it this way: you have a bunch of friends and you realize two of them are single and, based on many different variables, perhaps quite compatible - so you set them up on a date. Easy, no? Now try to run a completely automated dating service trying to accurately pair up every single person on the planet with the best possible mate. Not as easy. In any case, I might start throwing random output from it on the science status page which is of anecdotal interest. Like extra info about where we\'re currently pointing and what we\'ve seen there before. Check for that in the next day or so.

- Matt
' ), array('16 Apr 2008 21:34:36 UTC', 'So far so good with the new workunit server. We recovered from the recent spate of outages fairly quickly. The assimilator queue is starting to drain at a good clip, too. If anybody\'s looking at the traffic graphs and noticing a "bump" over the last hour or so - that\'s us sending our raw data to HPSS over the Hurricane pipe (in addition to sending it over the standard campus pipe). With the recently purchased (and employed) disk enclosure this extra bandwidth is now possible, and every little bit helps (pun intended).

Mostly working on programming today. Wrapping up work on the precess recalculator - will probably deploy next week. Astropulse and the ntpckr are both just around the corner as well. I know we\'ve been saying that a while, but it\'s getting truer every day. Lots of big things coming down the pike.

- Matt
' ), array('15 Apr 2008 22:24:02 UTC', 'As mentioned yesterday the kind folks at Adaptec/SnapAppliance replaced our server. The leading theory for its failure is still localized to the ribbon cable connecting the faceplate to the motherboard, but they swapped out the whole thing anyway just to be safe. The RAID devices had to be massaged a bit and then spent all night resyncing. That wrapped up around 4am, but one of the RAID1 pairs needed to be resynced again. Once that finished, I tackled the usual Tuesday database compression/backup. Since that began early this week (no reason not to since we were already off line) that completed around 12:30pm and I started the public/beta projects. We\'ll be catching up for a while, I imagine.

The assimilator queue blossomed again, but this (I think) was mostly due to one of the four assimilators being stuck on one particular result where the uploaded file got garbled and therefore became un-parseable. I blew this result away and that one assimilator seems to have pushed through for now.

Jeff is trying to debug a new problem with the splitters - despite additional smarts/logic some are failing mid-file, unable to find the radar blanking signal. But when we look at the file by hand, we see the signal (or at least where the signal should be). Insert sound of head scratching here. In any case, if there are fewer splitters running than normal, that\'s why.

Happy Tax Day, my U.S. compatriots.

- Matt
' ), array('14 Apr 2008 19:03:42 UTC', 'Continuing problems with the workunit storage server... There were more resets over the weekend, ultimately resulting in one that caused the server to think enough drives have failed to call the entire RAID dead. We are confident we can trick the server into thinking otherwise - we actually have some helpful techs logged in doing that as I type. We still want to replace the whole box, which we\'ll hopefully do today, and then the drives will have to resync again. Chances are we\'ll be down until tomorrow (Tuesday).

So while we are down we\'ll try to catch up on several things. Moving servers around the closet, incorporating the new drive enclosure that arrived today, getting more stuff on the new KVM, etc.

- Matt
' ), array('10 Apr 2008 17:53:43 UTC', 'We thought we had the hardware problem with the workunit download server diagnosed, but looks like we were wrong. False positive. The good news is that the kind folks who donated the thing have another ready to ship. But until we get it, that probably means potential random resets all weekend. Jeff just put an /etc/rc script in place so that upon reset/reboot there\'s a chance it\'ll be operational, meaning short glitches instead of multi-hour outages. That\'s the hope anyway. We might actually test that later today (if it doesn\'t reset itself on its own). There was discussion about how to implement a second workunit storage server so we don\'t have this single point of failure anymore. Not as easy as it sounds.

- Matt
' ), array('9 Apr 2008 21:24:22 UTC', 'Continuing on from yesterday\'s tech news note, we had a "take two" outage today for database maintenance. We "repaired" several tables (the word repair is in quotes because, while MySQL locked the tables due to potential corruption, the repair query found zero errors). Then we dumped the master database and are recreating the replica from that dump. This is actually happening now, and will probably take all afternoon, but since the master is back in one piece we started up the projects and are catching up, draining backlogs, etc. We\'ll start the replica once it\'s ready and it should catch up as well.

Outside of that, Jeff and I are tackling the current state of data flow to/from Arecibo. We have a lot of scripts in place to automate most things, but there are still some parts we do by hand based on the situation. Do we need to empty the drives as soon as possible and get them back to Arecibo to collect more data? What if there\'s no space available on the splitter system? Things like that. So I\'ll be coding up more robust scripts in the near term.

- Matt
' ), array('8 Apr 2008 23:43:16 UTC', 'Had a relatively painless weekend, which is a good sign as that probably means we correctly determined the cause of our workunit download server woes (broken faceplate sending bogus resets to the system). Everything else was okay except the database statistics on the server status page flatlined. This was fallout from the mysql database server rebooting itself on Thursday and the replica server getting out of sync. Since this was a harmless, cosmetic problem we let this fire burn until we re-synced the two databases today during the (extra long) weekly outage.

Why were we down today for so long? What happened?! Seems like last week\'s database crash caused some minor confusion in (at least) the "credited_job" table, which of course is the largest table in the database. So we had to run a long, expensive "repair table" query after a longer, more expensive "optimize table" query failed with error thus preventing us from even backing up the database. How annoying. Even more annoying: the /tmp partition filled up during the repair so mysql twiddled its thumbs for 20 minutes before we realized and cleared out more space. Then /tmp filled up again. Then we realized it was trying to write about 10GB of data to /tmp. This wasn\'t gonna happen. So we killed the "repair table" query and simply restarted the project so people could get back to work. However, without credited_job the validators can\'t work, so they\'re offline for the night. We\'ll discuss tomorrow what to do next. We still haven\'t backed up or re-synced our databases. There might be an extra outage tomorrow.
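
One cheap guard against this kind of surprise is to check free scratch space before kicking off a long repair. The Python sketch below is only an illustration: the helper name is hypothetical, the 10GB figure is the rough size mentioned above, and the scratch location is whatever the local system reports as its temp directory (an assumption, since the real server wrote to /tmp).

```python
import shutil
import tempfile


def has_room(path, needed_bytes):
    """Return True if the filesystem holding path has at least needed_bytes free."""
    return shutil.disk_usage(path).free >= needed_bytes


GIB = 1024 ** 3
scratch_dir = tempfile.gettempdir()   # stand-in for the /tmp partition
needed = 10 * GIB                     # rough size of the temporary data

if has_room(scratch_dir, needed):
    print("enough scratch space; safe to start the repair")
else:
    print("not enough scratch space; free some up or use a bigger partition")
```

A check like this would not have made the repair any faster, but it would have flagged the problem before the first 20 minutes of thumb twiddling.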

We employed the new workunit-generating splitters with radar blanking yesterday, but then overnight ran out of work to send out. This was due to the way our data was collected and stored in the raw data files. Long story short, data buffers are collected and stored in pairs, one which contains the radar blanking signal (which lets us know exactly when the noisy radar is on), the other of which does not and therefore gets its blanking signal from its sibling. However, the orientation of these pairs in the data isn\'t fixed and may reverse "polarity" at any time. So there\'s a good chance the first buffer in a data file is missing its sibling and therefore can\'t find any blanking information. This is a critical error, so splitters were getting hung up on these files as the queue slowly drained. Not a big deal, and Jeff reworked the logic in the splitter so these errors are not critical (we\'ll just skip the first buffer). Anyway, this only affects a couple months\' worth of files - we already fixed the logic on the data recorder down at Arecibo to reduce the chance of "half pairs" happening in a single file.
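
The "skip the orphan" logic can be sketched roughly as follows in Python. This is a simplified model and not the splitter code: the Buffer class, the pair_id tag, and the has_blanking flag are hypothetical stand-ins for whatever the raw data format really records. A buffer is usable if it, or its sibling in the same pair, carries the blanking signal.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    pair_id: int        # hypothetical tag shared by the two buffers of a pair
    has_blanking: bool  # True if this buffer carries the radar blanking signal


def usable_buffers(buffers):
    """Keep buffers that can find blanking info; skip orphaned half pairs."""
    # Pairs for which at least one buffer in this file carries the signal.
    pairs_with_signal = {b.pair_id for b in buffers if b.has_blanking}
    return [b for b in buffers if b.pair_id in pairs_with_signal]


# A file that starts mid-pair: the first buffer lost its sibling (which held
# the blanking signal) to the previous file. Pair 2 has reversed "polarity".
example_file = [
    Buffer(0, False),
    Buffer(1, True), Buffer(1, False),
    Buffer(2, False), Buffer(2, True),
]

kept = usable_buffers(example_file)
print(len(example_file) - len(kept), "buffer(s) skipped")
```

The point is that the orphan is dropped quietly instead of stopping the whole file, and the order of the two buffers within a pair does not matter.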

- Matt
' ), array('3 Apr 2008 21:31:19 UTC', 'Minutes after I went to bed last night the BOINC mysql database server crashed. This has happened before - some kind of kernel panic. The upshot of it was that we were offline all night until Jeff (who wakes up far earlier than I) kicked the system early this morning. And then it took mysql about six hours to do all its checks and clean itself up. Once back up, we found the master and replica servers were ever so slightly out of sync, which was no surprise. We\'re continuing to run this way for now - but with all queries aimed at the master. This way the replica (if it continues to work beyond update conflicts) will still be an adequate-enough safety net until we re-copy its database from the master early next week.

Meanwhile, spent the morning doing other stuff while the project was down. Like tightening up various aspects of our source code management. Or working on the data recorder to ensure raw data files have even numbers of blocks (blocks are written in groups of two, with the radar blanking signal for both in just one of them - so files with odd numbers of blocks may be missing blanking signals at the end, thus rendering that last block useless). And Eric had to give a tour of the lab to prospective Ph.D. students. It\'s things like these (which I usually fail to mention) which occupy most of our time - eating up a half hour here, a half hour there... Of course before we have visitors Jeff and I have to drop everything and actually clean up the lab - piles of KVM cables recently removed from the server closet, random DIMMs too small to use, on every possible flat surface O\'Reilly manuals (or good ol\' K&R) lying open to specific pages, empty soft drink containers...

In any event, recovery (yet again) is happening now. Hopefully as the weekend approaches there will be a wee bit more stability in our server closet. Of course I just sent out about 25K of those "please come back" e-mails yesterday. It\'s all about timing.

- Matt
' ), array('2 Apr 2008 22:54:30 UTC', 'So far so good, running with the faceplate off the workunit download server. If this remains the case we\'ll get a free replacement faceplate from Adaptec. This little exercise has proven that this server is a bad single point of failure - if we actually lost all the data, it isn\'t a scientific disaster, but a BOINC disaster - there would be hundreds of thousands of workunits "in the field" that no longer exist, and are no longer verifiable. We can regenerate the workunits, but it would be a big waste of CPU time not to mention a public relations disaster (not like we haven\'t weathered those before).

Remember radar blanking? Here\'s a recap: unlike the classic data, the multibeam data is blitzed with radar sources, adding a lot of noise to a small subset of our workunits. The radar\'s time frequency is short but random, making it very hard to remove by simply randomizing data based on certain thresholds. This is more an annoyance than a threat to science. Arecibo implemented a "radar blanking signal" which we now get in our data, telling us exactly when the radar is on so we can "blank" the data exactly at that time. Among other things, we\'ve been working to get this coded up and tested in the splitter for a while now. Jeff has been managing this recently and this morning had some final data and plots from workunits sent to our clients with the radar blanking and without. Looks like we solved the problem. Expect slightly fewer RFI workunits on average in the near future.
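
In its simplest form, blanking just means overwriting the samples taken while the radar was on. The Python sketch below assumes one on/off flag per sample and a constant fill value; the function name, the flag layout, and the choice of fill are all assumptions for illustration, and the real splitter may substitute something other than a constant.

```python
def blank_samples(samples, radar_on, fill=0.0):
    """Replace every sample taken while the radar was on with a fill value."""
    if len(samples) != len(radar_on):
        raise ValueError("samples and radar_on must be the same length")
    return [fill if on else s for s, on in zip(samples, radar_on)]


# Made-up data: two loud samples coincide with the blanking signal.
samples = [0.2, 0.1, 9.7, 8.9, 0.3]
radar_on = [False, False, True, True, False]
print(blank_samples(samples, radar_on))
```

Because the blanking signal says exactly when the radar is on, nothing has to be guessed from thresholds, which is what made the earlier approach so unreliable.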

With Arecibo slated to be decommissioned in the not-too-distant coming years (write your local congressperson!) this has been an unintentional temporary boon for us as the observatory is prioritizing sky surveys to appease its current/remaining projects. That means we\'re collecting a lot more data than we originally intended, which means we can\'t seem to get disk drives back and forth between Arecibo and Berkeley fast enough. The bottleneck is our limited bandwidth to copy fresh data that arrives here down to HPSS (offsite archival storage) before erasing drives and sending them back. We\'re going to purchase another cheap SATA drive enclosure and try to use some of our excess Hurricane Electric bandwidth to speed up the archiving process.

Outside of that (and countless day-to-day chores) I got the basic plumbing of the "precess fix" program working. We unknowingly double-precessed all multibeam signal coordinates, so they aren\'t in J2000 as much as J1993 (the observatory\'s multibeam receiver code had coordinate precession built in, unlike classic receiver code). Not a major tragedy, and easy to revert - but this is one of those things where you want to make sure the math and logic are correct before updating billions of rows in a database.

Edit: Oh yeah, and I also sent out about 10000 reminder e-mails today. See other threads about waning user interest for more info. I\'ll send more each day.

- Matt' ), array('1 Apr 2008 22:15:39 UTC', 'Last night the workunit storage server acted up again. I attempted to reconfigure it at midnight last night, but then it reset itself an hour later, and again every hour since. So whatever the problem is, it\'s gotten worse. Jeff and I did some diagnosing during the regular weekly database backup outage today. The reigning theory is still a faulty faceplate sending erroneous resets to the motherboard. So as it stands now the server is running without its faceplate (and therefore no control panel - which makes powering on quite difficult)! And so far no resets. If this stays stable for a week I think we\'ll have nailed the problem. Meanwhile the kind folks at Adaptec already have a complete replacement at the ready if we need it - we might just need to replace the faceplate.

No other real big shakes about today\'s outage. I added more machines to the new kvm (which meant being able to pull more cables out of the closet) and we added a new field to the workunit table in the BOINC database - so far that hasn\'t broken anything as far as we can tell. The beta uploads are failing again, but hopefully that will clear up on its own like last time (I\'d still like an explanation, however).

Happy April Fools, by the way!

- Matt
' ), array('31 Mar 2008 21:46:51 UTC', 'The last few days were a little bumpy, with our workunit storage server disappearing out from underneath us at random (see previous posts for more info). This is still not quite clearly understood. The reigning theory is there\'s some faulty connection somewhere between the front face of the system (where the reset button is located) and the internal circuitry. This isn\'t too hard to imagine as there are some servers sitting right on top of it, and pressing ever-so-slightly down on the server\'s faceplate. A month ago we added that new heavy router to the stack. Perhaps this is the problem, which leads us to the general (and incredibly annoying) rack standards issue: all server racks are by default non-standard size and shape, and therefore we aren\'t properly racking as much as stacking.

One of the upshots of this was that beta uploads were failing all weekend in various ways, most likely due to partially broken mounts between the upload server and the storage server (which contains the beta uploads as well as workunits - SETI@home public uploads are kept right on the upload server itself). This was very difficult to understand, but even worse: it just suddenly started working again - and during a meeting no less (when nobody was actually sitting at a computer doing any tweaking).

I\'m leaving early today to have a meeting down on campus with the donation department. Exchanging general ideas for improvement.

- Matt
' ), array('29 Mar 2008 5:16:39 UTC', 'I was joking in my last post about machines dying at midnight starting this three day weekend. At least they were nice enough to wait 18 hours into the weekend to start failing.

In this case, our workunit download server which failed earlier in the week croaked again. I happened to notice during my usual random check in from home that we weren\'t sending out any bits, which immediately led me to the faulty machine. For a short time I was able to log into it via a serial connection but it was in some funny, unhelpful single-user mode with a broken network config. Unable to do much I tried quitting out of that and it then basically became unreachable. Since its network configuration has reset, and the serial connection now shows no pulse, there\'s no option except to drive up to the lab and kick the thing in person.

Except it\'s 10pm on a Friday night, and it\'s raining, and the known fix will take an hour or two to enact. No thanks. Even if I wanted to go up to the lab, there\'s no guarantee any fix would work. And even if I did get it running, given current history there\'s no guarantee it would stay running through the night or the weekend, so I\'m staying home.

Bottom line: no workunits until somebody is in physical contact with the server. This may happen sometime before Monday, but don\'t count on it. I sent warnings to the others but not sure any of them will be free to go up to the lab. I have a gig tomorrow so my next 36 hours are occupied.

- Matt' ), array('27 Mar 2008 22:40:40 UTC', 'There\'s not much news to report on the technical front - but that doesn\'t mean I haven\'t been busy. I\'ve mostly been engrossed in tasks that have little effect on the public servers, so anything I\'ve been working on is either (a) too complicated to describe to everybody\'s satisfaction (including my own), or (b) relatively uninteresting.

I\'ve been lax in sending out regular "reminder" e-mails to participants who lapsed (i.e. have stopped processing data for N days) or never succeeded in processing work. We wanted to start these up in the fall, but there were server woes - and it\'s not good form to send "please come back" messages to people only to frustrate them with connection failures. Then everybody went on vacation at different times. Then it was donation season, and we try not to send e-mails to people more than quarterly, so that postponed the reminders until a month ago, but at that point we were having the science database/router woes. Anyway.. now seems like a good time to try and start again. Perhaps starting early next week.
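
As a rough illustration of the selection logic described above (not the actual reminder script), here is a minimal Python sketch. The record layout, the 60-day lapse threshold, and the 90-day gap between e-mails are all assumptions made for the example; the post only says "N days" and "quarterly".

```python
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration.
LAPSE_DAYS = 60
MIN_DAYS_BETWEEN_EMAILS = 90


def needs_reminder(user, now):
    """Return True if this (hypothetical) user record should get a reminder."""
    last_email = user["last_email"]
    if last_email is not None and now - last_email < timedelta(days=MIN_DAYS_BETWEEN_EMAILS):
        return False  # e-mailed too recently
    last_result = user["last_result"]
    if last_result is None:
        return True  # never succeeded in processing work
    return now - last_result >= timedelta(days=LAPSE_DAYS)


now = datetime(2008, 3, 27)
users = [
    {"name": "active", "last_result": datetime(2008, 3, 20), "last_email": None},
    {"name": "lapsed", "last_result": datetime(2007, 11, 1), "last_email": None},
    {"name": "never", "last_result": None, "last_email": None},
    {"name": "recently_mailed", "last_result": datetime(2007, 6, 1),
     "last_email": datetime(2008, 2, 15)},
]
to_remind = [u["name"] for u in users if needs_reminder(u, now)]
print(to_remind)
```

Checking the e-mail gap first keeps the "no more than quarterly" rule from being bypassed by either of the other two conditions.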

Tomorrow is a University Holiday, thus making this a three day weekend. Perhaps start an office pool involving which server will croak at midnight tonight.

- Matt
' ), array('24 Mar 2008 22:28:55 UTC', 'Things have been running rather well over the past couple of weeks. Having effectively unlimited bandwidth really helps. It\'s a little more hectic behind the scenes as new data keeps getting sent up from Arecibo - we are continually working to offload the data to our local servers (and remote mass storage) so we can send back the blank drives for more. Steps will be taken soon to improve this situation (namely: sending some data to our remote storage via our faster Hurricane connection).

There was a bit of a panic this morning, however. Suddenly gowron, our workunit storage server, reset itself. Not only did it reboot, but it lost all host/IP information. For all we could tell at first it lost everything! We had to connect to it over serial (most difficult part: finding the right cables) but once we got in we found our 2 terabytes of workunits were still intact (whew). So it was mostly a matter of reconfiguring the basic things and we were back in business. Why did it reset itself? That remains a mystery.

Another minor gripe: I spent a man-day last week testing mdadm\'s "spare group" feature. That is, if a drive fails on a RAID device without a spare, it can steal a spare from another RAID device in the same spare group - mdadm\'s way of enabling a "hot spare pool." We never had a case where this would happen, nor did we ever test it. Now that thumper has two fewer spares (due to making a new small, separate RAID1 for database indexes) I wanted to test this. I made simple test cases and failed drives - but the available spares in the spare group weren\'t being utilized. Long story short - I actually recompiled my own mdadm with fprintf\'s all over the place and found mdadm behaving strangely. Thing is, this is mdadm version 2.6.2 we\'re talking about here, and mdadm is already up to version 2.6.4. So I downloaded that, and it worked, so apparently this bad behavior has been fixed. But Fedora doesn\'t have the latest version available yet, at least via "yum update," so we\'re pretty much waiting on the new version to become available before implementing a less trusted version, even if it seems to work better.

- Matt
' ), array('18 Mar 2008 21:15:54 UTC', 'Today during the outage I installed the new network kvm in the closet and hooked up one of the servers. We\'re waiting on green cables to arrive (so we can tell them apart from other cables in the closet) before hooking up the other servers. Putting this server in actually maxed out our 24 port DLink gigabit switch - so I chained in an old reliable Netgear 100 Mbit switch to occupy the stuff that doesn\'t talk gigabit anyway - UPS\'s, service processors, older servers...

Bill, who donated our previous and current routers, came by to pick up the 2811 we\'re no longer using, now that the current one has proven itself to be able to handle what we give it. Apparently this 2811 is off to Beirut. What an adventurous life this router is leading.

Otherwise, a lot of my time the past couple of days has been spent mostly on generic network/systems administration not worth mentioning here (i.e. mundane drudgery).

- Matt
' ), array('14 Mar 2008 17:52:11 UTC', 'We turned off the resend of old WU on client reset because of a huge IO load on the MySQL db. It was slowing down result validation, the main function.

We have done a number of things to improve the db performance, reducing IO rates and hope to turn on the resend feature in the near future for a test period.

If the IO load is manageable the feature will remain enabled.

' ), array('13 Mar 2008 21:25:40 UTC', 'A few small items today.

Still messing with the new science database indexes. Bob just started dropping/recreating these one at a time, which may slow down the assimilator inserts, but we\'ll see. Having the indexes on a different volume can only help.

We just got a used Raritan 16-port network KVM donated to us - I believe the donor would like to remain anonymous (if you\'re reading this, thank you!). Eric got this hooked up to a test server pretty quickly - it\'s pretty sweet. We\'ll get this in the closet sometime next week, and then we\'ll have the ability to reboot systems from home, which should minimize down time over the long haul.

With the regular BOINC database performing quite well these days, we may attempt turning on the "resend lost results" feature again early next week and see if we can handle it.

I have a gig tonight where I have to sing, but with my lingering cold/congestion I currently sound kinda like Brad Garrett. Should be interesting.

- Matt
' ), array('12 Mar 2008 22:32:31 UTC', 'As for science database improvements... While getting the new science database RAID1 volume set up we discovered that the lvm gui doesn\'t allow for resizing of logical volumes containing xfs filesystems. Huh. We were able to grow these on the command line (both the logical volume and then the filesystem itself), so we\'ll just have to use the command line in instances like these. At any rate, Bob is building new db spaces for the indexes on this new volume. We\'ll recreate indexes there after dropping them from the old spaces (which are in I/O contention with the actual data). This will happen gradually over the next few weeks.

And yes, there were still lingering issues with the donation script. Actually I should point out that the problems were not in my parsing script, nor the whole system I set up to garner information from campus. The problem is that the formatting of the confirmations from campus changes every so often. And by "changes" I mean they suddenly contain random line feeds in unexpected locations for no explicable reason. So my parsing script needs to be "improved" every so often to pick up the exciting new places these line feeds might happen to turn up. Anyway, it\'s fixed, and a couple "clogged" donations pushed through just now.
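
One common way to make a parser tolerant of stray line feeds is to collapse all whitespace before looking for fields. The Python sketch below shows the idea only; it is not the actual donation script, and the field labels and sample message are hypothetical since the real confirmation format is not shown here.

```python
# Hypothetical field labels; the real confirmation format is not shown in the post.
LABELS = ["Donor:", "Amount:", "Date:"]


def parse_confirmation(raw):
    """Collapse stray line feeds, then split the text on known labels."""
    flat = " ".join(raw.split())  # any run of whitespace becomes one space
    fields = {}
    for i, label in enumerate(LABELS):
        start = flat.find(label)
        if start < 0:
            continue  # label missing from this message
        start += len(label)
        # The value runs until the nearest later label, or the end of the text.
        end = len(flat)
        for other in LABELS[i + 1:]:
            pos = flat.find(other, start)
            if pos >= 0:
                end = min(end, pos)
        fields[label.rstrip(":")] = flat[start:end].strip()
    return fields


raw = """Donor: Jane
Example Amount:
25.00 Date:
2008-03-12"""
parsed = parse_confirmation(raw)
print(parsed)
```

This only helps when the line feeds fall between tokens. A line feed landing in the middle of a value would still leave a stray space to clean up.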

- Matt
' ), array('11 Mar 2008 22:09:13 UTC', 'Typical Tuesday. The weekly outage went along just fine. This is the first time in many weeks the result table has been "lean" - i.e. no large excess of result entries due to blocked queues, waiting for purging, etc. How nice.

Despite the happy current performance of our servers, we\'re still keen on improving science database throughput. We met today to discuss a plan to shuffle disks/RAID/LVMs around to optimize performance on thumper. I\'m building the first RAID1 pair - it\'s syncing up now - where we\'ll start recreating indexes as soon as tomorrow.

- Matt' ), array('10 Mar 2008 18:58:22 UTC', 'Hello, folks - just getting over a really really bad cold. I rarely ever get sick like this so it\'s a bummer when I do. Anyway, I\'m back, though still only about 80-90%.

In the meantime, nothing much happened except that the happy mixture of (a) enough download bandwidth to ensure an even flow of work, (b) a consistently long average workunit turnaround time, and (c) no unexpected other stresses allowed us to finally, albeit slowly, catch up on the assimilator queue over the past week. At first I thought our queues were benefiting from the new splitter which might have been generating less noisy workunits (and therefore less prone to quick overflow and return), but the opposite was true: the new splitter was generating annoying broken workunits that errored out immediately. Sorry about that. In any case we\'re still in dire need of database server improvements, mostly in the RAID re-configuration realm. We\'re also getting smartd errors more and more - these drives are approaching retirement already. Can you believe it?

- Matt (sniff cough)' ), array('4 Mar 2008 23:27:02 UTC', 'Some positive progress today: During the weekly database backup outage I removed old kosh/penguin from the server closet, and replaced them both with bruno (the upload server) and its disk array. So the only backend servers still outside the closet are sidious and vader. In order to accommodate the new server I also put a second KVM and did some recabling to daisy chain it with our current one. The upshot is that thinman (the web server) which was up until today totally headless now has a spot on the KVM, which gives us some warm fuzzies.

Even better: Thanks to the "help wanted" post, user Gerry Green found the bug causing those occasional broken queries tying up our database. It was a bad function call lost in the "ask a friend" web code. Thank you Gerry!

However, the outage was slowed due to our database simply getting larger and larger, and then we tried to let the assimilator queue drain a little bit before starting up again. A new splitter is also being rolled out today - the only difference is correcting a minor precession bug (for better accuracy we still have to un-precess our coordinates in all the previous signals up to this point - which we plan to do sooner than later).

I\'m reverting to four assimilators. Running 12 doesn\'t seem to help and only caused memory problems on bruno. We\'re really going to have to do some major reconfiguration on thumper before we can catch up again.

- Matt
' ), array('3 Mar 2008 23:13:14 UTC', 'So it was a rough weekend, mostly due to the excess assimilators being employed to knock down the ridiculously large backlog of results waiting to be entered into the science database. Long, long ago we had chronic problems with a memory leak in the assimilators, but that hasn\'t been a problem so much lately since we moved them to a more powerful server and got BOINC going. Now they all get restarted every week due to the database backup outage. Anyway... having 12 running at once seemed to exercise the memory problem enough to cause the upload server to lock up a couple times. This created a general malaise on the backend, aggravated by a current period of fast workunits creating a heavy load on everything.

This morning bruno was rebooted and log jams were cleared. Servers are trying to get on top of their queues. But in the positive progress department, check out the most recent traffic graph (green = outbound, blue = inbound). Can you guess when we switched over to the new router?

Yay! We have now increased our bandwidth capacity by about 50%. The roving bottlenecks are surfacing elsewhere, though until we get beyond the current period of catchup we don\'t have a good sense of what\'s normal or what to expect. We still have a ways to go to fully capitalize on the full gigabit of bandwidth Hurricane Electric is offering us, but this is still a vast improvement for now.

In regards to one comment in the previous thread: despite our small staff and minuscule pay scale we\'re generally close to 24/7 system monitoring, what with all of us on different schedules checking in regularly at random. And nope - I still don\'t have a cell phone. Never had one and, if possible, never will.

- Matt' ), array('28 Feb 2008 21:25:13 UTC', 'Fully recovered from the long outages earlier this week. I also employed more assimilators (and even more just now) to try to capitalize on periods of low I/O to help catch up on the big assimilator queue backlog. Seems to be working, sort of. We also changed the mount flags on the database volume to include "noatime" - we\'ll see if this actually makes a difference in performance.

Jeff and I are still getting beyond the router config. One of our roadblocks was using cables that were gigabit capable mixed with ones that were not (once again it\'s cheap parts causing the headache). We might actually be ready to go except we have to upgrade the super-long cable going from our closet to the main lab server closet, which is inaccessible to us. Waiting on the appropriate parties to handle that.

Regarding hardware/software RAID: We tend to shy away from hardware RAID as we\'ve had many nightmares in the past regarding configuration and implementation. Namely, it takes forever to figure it out, and then drives fail spuriously and/or silently. The software RAID hit isn\'t enough to make us consider going hardware on our current systems any time soon.

- Matt
' ), array('27 Feb 2008 22:15:24 UTC', 'So as the hours wore on last night the work queue was low enough that I had to stop scheduling lest we run out of work. This morning Jeff and I determined the science database server was in a stable-enough state to start everything up again, so we did. That\'s basically where we are now with that. The OS upgrade was a double leap frog (i.e. up 3 revision levels) so we\'re getting a few errors that are noisy but most likely bogus, caused by out-of-spec config files left behind and whatnot. We\'ll have to do a clean OS install at some point to clean out the chaff.

At any rate we removed the old-OS variable from the mix, and the database is still slow as molasses. We really need to update the filesystems (both RAID and fs type, perhaps) and reorganize which data go where. Plans are being spelled out for that. The assimilator queue is getting to be more of a crisis, though. We\'ll panic more once the outage recovery mellows out a bit.

More on the proposed RAID changes as there seems to be some interest. The current database (data *and* indexes) is on a single software RAID5 device. When we were just adding signals to the database, there were 0 reads and nothing but sequential writes, so this worked well. Now with all the indexes built, and some scientific analysis taking place, the read/write mix is far more random. Plus the stripe size is way too big for the random I/O (we\'re reading in a 64K stripe to read a 2K page - or something like that). It\'s very hard to predict what we\'ll ultimately need RAID-wise for any given server (as they change roles quite often), so we\'ve had to bite the bullet and change RAID levels mid-stream before. This time, the general idea is to create a new RAID10, and drop the random-access indexes off the RAID5 and rebuild them on the RAID10. We shall see.
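
To put rough numbers on the stripe-size point, here is a back-of-the-envelope Python sketch. It assumes, as the paragraph above suggests, that each random 2K page read pulls in a whole 64K stripe. The small-write I/O counts are the usual textbook figures for RAID5 and RAID10, not measurements from thumper.

```python
# Figures quoted in the post ("or something like that"), so treat them as approximate.
STRIPE_BYTES = 64 * 1024
PAGE_BYTES = 2 * 1024

# Textbook I/O counts for one small random write (not measured on thumper).
RAID5_WRITE_IOS = 4   # read data, read parity, write data, write parity
RAID10_WRITE_IOS = 2  # write the block to both mirrors


def read_amplification(stripe_bytes, page_bytes):
    """Bytes pulled off disk per byte of page actually wanted."""
    return stripe_bytes / page_bytes


amplification = read_amplification(STRIPE_BYTES, PAGE_BYTES)
write_ratio = RAID5_WRITE_IOS / RAID10_WRITE_IOS
print(amplification, write_ratio)
```

Under these assumptions each random page read moves 32 times more data than needed, and each small index write costs twice the disk operations on RAID5 that it would on RAID10.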

Jeff, with my help, got the new router configured today. There were some blips as we swapped wires around to test this and that, and we eventually reached that magic 95% point where everything looks like it should work but just doesn\'t for some small number of unidentifiable reasons. E-mails to experts have been sent, and we\'ll sleep on it.

Minor news: web server thinman choked on a bunch of stale cron job processes (presumably stuck on lost mounts over the past week) so I had to reboot it - the web site disappeared for a few minutes there. Also those root drive errors on thumper turned out to be bogus (again!). I added the wrongly failed drive back as a spare. Weird.

- Matt
' ), array('27 Feb 2008 0:09:25 UTC', 'Let\'s see.. it\'s been a bit since I last wrote. I\'ve been mostly working on code to pull pulses out of the database, which uncovered a couple general minor bugs that had to be fixed. These were successfully dumped and handed off to Josh to find good candidates for initial Astropulse analysis.

Not much going on over the weekend but the science database server (thumper) is not performing. Jeff and I scanned all kinds of data during different tests and we\'re convinced it\'s the RAID configuration more than anything else. We\'re going to have to reconfigure all the file systems on that at some point. Painful, but we may be able to do it piece by piece without too much disruption.

Today we actually upgraded the way-out-of-date OS on thumper, which was also a bit painful, but ultimately successful. It should have been up and running by now, but thanks to an 8 Terabyte ext3 filesystem that hasn\'t been checked in over 180 days, a forced check is running and will probably be running all night. Not sure if we\'ll implement the secondary server (bambi) in the meantime - it may be too late in the day to attempt that. We\'ll let the project run as best it can until we run out of work (we\'ll probably keep a buffer of work just so the recovery later isn\'t as painful).

Meanwhile, the assimilator queue is growing and growing until we either let it drain, or we reconfigure thumper.

Oh yeah.. bane (one of the download servers) just went kaput. Spent 20 minutes trying to figure out what went wrong with its network. Oh - the cable came out of the switch. Click. Voila!

In good news, Jeff has been hammering on the new router today, and we got over a major hurdle of getting IOS installed on it. Only thing left now is configuration. It might be ready tomorrow!

Buckle your seatbelts.

- Matt
' ), array('21 Feb 2008 21:17:55 UTC', 'Yesterday I didn\'t have much news about anything to report. I was mostly spending my day elbow deep in pointing code, so we could determine when/where we observed known pulsars, and see if we actually found them in our data.

However, we\'ve been since experiencing some general aches and pains. In order to get the aforementioned code working we needed to add an index to the science database, and while it\'s able to create an index "live" the splitters/assimilators have been getting blocked for hours at a time. This should wrap up sometime later today. The lab in general has also been having mail server problems, which isn\'t helpful.

- Matt' ), array('20 Feb 2008 0:10:42 UTC', 'Another long weekend, literally thanks to the President\'s Day holiday, figuratively thanks to the various network bottlenecks. For the most part there was nothing out of the current usual - we were sending out a lot of fast workunits which meant our backend servers were swamped dealing with the increased number of results coming in. What was unusual was ptolemy having some kind of inexplicable freeze for several hours. It was sending away every scheduler request with 503 errors. Jeff examined everything but found nothing unusual going on to cause this - and service restarts and even a whole system reboot didn\'t fix the problem. Then all of a sudden it all just started working again. So we\'re calling this a fluke and perhaps something fishy further up the pike for now. One of the download servers was having fits all weekend, losing mounts, etc. but that didn\'t seem to cause any additional headaches from the perspective of the public. Jeff and Eric were on top of all this, which was good as I was spending most of the weekend out of town - it was a battle to get wireless to work at my in-laws\' house.

Had the usual Tuesday outage today. No news there except recovery was slowed by a broken query which erroneously tries to slurp up the entire user table into memory. This happened before, but we couldn\'t find the culprit. Can you? I posted a thread about this in our help wanted forum.

I also just uploaded a new set of photos and descriptions for your viewing pleasure.

- Matt
' ), array('14 Feb 2008 22:11:21 UTC', 'Right after writing yesterday\'s tech news I spotted that the validators hadn\'t been running since the morning. Oops! Turns out I discovered something that\'s been a problem for many, many months but only got triggered now: when starting validators from the command line (which is how we do it 99% of the time) everything is fine. But when started via cronjob (which is what happened this time) they couldn\'t find the right libraries and immediately quit. Trivial environment/path issue - just funny we haven\'t seen it before. I started them up, the queues cleared out, and the assimilator queue returned to slowly draining itself.
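
For readers who have not hit this before: cron typically starts jobs with a much sparser environment than a login shell, so anything that depends on library search paths can fail there while working fine by hand. A minimal Python sketch of the usual remedy, setting the environment explicitly, follows. The variable name LD_LIBRARY_PATH and the directory are assumptions for illustration; the post does not say which setting was actually missing.

```python
def build_env(base, lib_dirs):
    """Return a copy of base with lib_dirs prepended to LD_LIBRARY_PATH."""
    env = dict(base)
    existing = env.get("LD_LIBRARY_PATH", "")
    parts = list(lib_dirs) + ([existing] if existing else [])
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env


# Stand-in for the sparse environment a cron job typically receives.
cron_like = {"PATH": "/usr/bin:/bin", "HOME": "/home/example"}

# Hypothetical library directory, used only for this example.
env = build_env(cron_like, ["/opt/example/lib"])
print(env["LD_LIBRARY_PATH"])
```

A dictionary like this can be passed as the env argument of subprocess.Popen, so the launched program sees the same library path whether it is started by hand or from cron.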

Things got a little weird over night. Our single download server seemed to be unable to get work out fast enough. First thing we did this morning was hook up vader again to be a redundant download server, so already my configuration explanation from yesterday is out of date. That\'s how it is around here. Anyway.. this download redundancy, however nice to have, didn\'t help very much nor did we expect it to, because we already guessed the router was the choke point. But why? The outgoing data was far less than normal. So what\'s the deal? I noticed the incoming data rate was strangely high, so I checked the router graphs not by bytes but by packets, and we were pegged packet-wise. I repeat: but why?

Turns out it was a DNS loop brought on by our recent separation of the scheduler and uploader. Clients were coming into the "wrong" server and being redirected to the other (via apache). But despite incredibly short TTLs there were still a few DNS servers or caches out there saying the "other" was still "both" (standard round robin DNS). This bogus information only affected about 3% of incoming requests, but half those requests were being redirected right back to the same machine. Not very noticeable at first, but over time more computers with outdated DNS maps would connect and get stuck in a loop, and eventually we were distributed-DOS\'ing ourselves. We broke those apache redirects and immediately everybody was happy, and just now reinstated the redirects using hard IP addresses to avoid further DNS mistakes.
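
The loop is easier to see in a toy model. The Python sketch below is a simplified simulation, not a reconstruction of the real setup: the addresses are made up, a stale resolver is assumed to return either machine at random (standing in for round robin), and a hostname redirect is assumed to make the client resolve the name again.

```python
import random

MACHINES = {"scheduler": "10.0.0.1", "upload": "10.0.0.2"}  # hypothetical addresses
MAX_HOPS = 20  # give up after this many redirects


def resolve(service, stale, rng):
    """Fresh DNS maps each name to one machine; stale round robin may return either."""
    if stale:
        return rng.choice(sorted(MACHINES.values()))
    return MACHINES[service]


def hops_needed(service, stale, redirect_by_ip, rng):
    """Count redirects until the request reaches the machine serving `service`."""
    target = MACHINES[service]
    current = resolve(service, stale, rng)
    hops = 0
    while current != target and hops < MAX_HOPS:
        hops += 1
        # A hostname redirect makes the client resolve again; an IP redirect does not.
        current = target if redirect_by_ip else resolve(service, stale, rng)
    return hops


rng = random.Random(0)
by_name = [hops_needed("upload", True, False, rng) for _ in range(10000)]
by_ip = [hops_needed("upload", True, True, rng) for _ in range(10000)]
fresh = [hops_needed("upload", False, False, rng) for _ in range(100)]
print(max(by_name), max(by_ip), max(fresh))
```

In this model a stale client redirected by hostname has an even chance of landing on the wrong machine again at every hop, while a redirect to a hard IP address always finishes in at most one hop.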

I brought the digital camera today and took pictures of the closet in its current state. I\'ll put them on line over the weekend or early next week.

- Matt
' ), array('13 Feb 2008 23:54:49 UTC', 'I\'m realizing the server status page is giving a slightly bogus picture of our current server setup, and it\'s actually too much work right now to fix the status script, so I\'ll just tell you now what the current situation is: our public web server is thinman, our scheduling server is ptolemy, our upload server is bruno, and our download server is bane. None of these currently has a redundant twin or a "hot" backup (but we have vader and maul all set up to be a replacement for any of the above if need be). More on that below. Our primary/secondary BOINC (mysql) database servers are jocelyn/sidious, and our primary/secondary SETI science (informix) database servers are thumper/bambi. Specs for all these are correctly noted on the status page. We have other systems employed for less interesting but important things, but that\'s basically the meat of it. If we could double the CPU/memory/disk space on everything we have we\'d be set (for the time being).

Anyway.. things are looking better. Weekly outage recovery is still a little weird - I don\'t think our single download server (bane) can handle such crunch periods alone so we\'ll probably bring vader back into the fold for that. The other servers are super happy given the recent changes to reduce NFS traffic. I enacted some more such changes this morning. This tweaking, coupled with server ewen (where Eric does his Hydrogen work) crashing and hanging the network a bit, made for a slightly bumpy ride this morning. However, smoother seas and perhaps running "update stats" on a couple signal tables made the assimilators much faster. We\'ll finally catch up on that queue in a couple hours I think. Due to the reduced dropped connections on the scheduling/upload servers it seems that the router got more cycles to spend on downloads, and we reached almost 70Mbps last night. Still need to get that new router going...

Other than that - more mail drudgery. As much as I like computers, I hate when perfectly good but nevertheless wonky solutions to small problems become the foundations for advanced development, thus amplifying the original wonky-ness.

Oh yeah - Eric sent some graphs around. Looks like the radar blanking code is working. Neat. Jeff\'s working that code into the splitter now so we can retest that small data file and compare results.

- Matt
' ), array('13 Feb 2008 0:34:39 UTC', 'E-mail administration is utter torture. Time was every project in the lab had their own separate mail servers. Over the years people wisely moved towards a more unified lab-wide e-mail system. Of course, SETI was the last project to convert, pretty much due to not having the man-week to spare fixing something that ain\'t broke. Well, it suddenly broke last night enough that I had to pretty much drop everything today and make everyone bite the bullet to start switching over - something that should have happened years ago but nobody has had the time to deal with it. Not like I have the time to deal with it now. Ugh. At least it\'ll all be out of my hands in the coming weeks. Until then, I\'ll be up to my eyeballs in sendmail drudgery.

Meanwhile, we had our usual outage today, during which we replaced the seemingly bad drive on thumper - the master science database. That was easy, but upon restart another of its 48 drives started complaining. So far the complaints can be seen as spurious enough to ignore. We\'ll do more robust RAID checking soon. Bob also moved some log files around to hopefully reduce random access disk I/O, and is running some "update stats" on the tables to see if that improves performance.

In better news, I did some DNS twiddling to split the upload and scheduling services to two separate machines (as opposed to running both services on both machines). This vastly improved performance, as splitting the functionality reduced the NFS traffic between the two to zero. We had it set up the previous way for historic reasons which were no longer apt. This is all very good but as it stands we have single points of failure for all our public facing servers. We have some systems in line to fix that but they are in use for Astropulse testing. And we still need to work that router into the fold.

Note regarding the previous thread: I should take updated photos of the server closet - not that much different but a lot neater.

- Matt
' ), array('11 Feb 2008 22:48:02 UTC', 'Came into the lab this morning and it was well over 70 degrees. This may seem nice on a winter day, but (a) we have fairly warm winters here in the Bay Area, and (b) the usual temperature in the lab is closer to 60 degrees - even in the summer. This isn\'t great from a human perspective - we wear jackets while sitting at our computers all year round. From a hardware perspective, the extra cold lab air assists in keeping our systems nice and cool. This is why I was immediately concerned about the suddenly warmer air. Turns out a fuse blew over the weekend, and it was already repaired before anything came close to melting. Still.. a little bit of panic this morning.

Despite the load on our backend servers being on the low side (averaged over the past 5 days or so) the assimilator queue was barely able to shrink. In fact, it\'s growing again due to the Monday bump. My guess (and others\') which I already mentioned is that the new science database indexes, which add more random reads/writes during inserts, are to blame. We\'re doing more aggressive analysis and will try some "low hanging fruit" type solutions before too long. Not a major tragedy just yet, especially as workunits may be generally less noisy in the near future. The scheduling/upload servers are also on the brink of disaster - they have short but nevertheless frequent periods of dropping connections. They too would benefit from less noisy workunits. Or more/better hardware.

On that note, if you check out the slightly updated hardware donation page you\'ll see I added an item for a KVM-over-IP which would help us upgrade our server closet faster. We\'re maxed out in the console department. In fact, our one public web server has no keyboard/mouse/monitor attached to it. If it freaks out, we hope we can log in remotely and fix it. Any incredibly generous takers? Anybody have strong opinions about which make/model to obtain?

- Matt' ), array('7 Feb 2008 22:58:44 UTC', 'We\'re having little luck getting science database thumper to perform up to expectations. We determined the fact it is both a database and raw data storage server isn\'t really the problem - the database alone is somehow constrained. Is it all the additional indexes we added recently? Extra load due to having to make logical logs for the replica? Something else entirely? Of course, while testing/tweaking, the OS root mirror drive on thumper failed. We got the notice from smartd but mdadm didn\'t notice, which was scary. We manually failed the mirror and brought in the hot spare which is sync\'ing up now. Anyway.. the assimilator queue is growing and there doesn\'t seem to be much we can do about it now, at least anything drastic given it\'s the end of the week. We are sending out a lot of short work - maybe this will change soon and give us some relief.

Other small news: recent splitter updates include (a) more realistic deadlines, i.e. they have been reduced 25%, and (b) radar blanking code - we\'re testing that now. There also has been a little bit of scheduler/upload server choking due to the aforementioned headaches - including one of the schedulers running out of work (as it runs faster than the other and therefore its queue depletes faster). Once again, we have little choice but to wait out the storm.

- Matt
' ), array('6 Feb 2008 23:04:24 UTC', 'Recovery from yesterday\'s outage wasn\'t so bad after all, but we\'re hitting another wall. Well, not a wall as much as a mound. That mound is our science database server, thumper. Those watching the status page may have been noticing it\'s having a harder and harder time to keep up with making work (ready-to-send queue is hardly ever full) and keeping up with assimilation (ready-to-assimilate queue is hardly ever empty - in fact, it\'s been growing slowly over the past 24 hours).

Of course, it\'s not the database load - thumper has almost 50 Terabytes of storage on it, so it also serves as our raw data buffer (where we keep all the data images for the splitters to chew on) as well as database backup storage (where we write/archive a 500GB data file every week). In short, we\'re hitting disk I/O limits on thumper. I fear making the "vertical" splitter (which acts on many raw data files simultaneously to reduce impact of hitting too much noise on a single file) has reduced any benefit of disk caching to zero. Since we\'re basically keeping up now, I whittled our number of splitters from 10 to 6 - hopefully this will help. I don\'t want to revert to non-vertical splitting just yet - we\'ll have greater problems if we do. Bob may also employ some different informix checkpointing parameters to reduce the impact of long checkpoints blocking science database traffic about 25% of the time. We\'re pretty much in wait-and-see mode on that.

Jeff and I are more or less done hammering out the current set of kinks in our data pipeline from Arecibo to your computer. This will all be automated shortly. We also just threw a very short chunk of data into the splitter queue from last week (28ja08aa). It\'s already being split, actually. This contains radar blanking data. We\'re going to process it once without the blanker logic, and again with. It\'s a data-beta-test. We want to really make sure it works before processing dozens of whole files. I\'ll try to remember to throw up some before/after plots comparing the two runs once they are complete.

- Matt
' ), array('5 Feb 2008 23:55:44 UTC', 'The regular weekly outage to hose down the database got started a little late today since Bob was out and I was busy voting (election day here in California - they hold elections in the U.S. in the middle of the work week and nobody gets the day off). Otherwise it was fine though it took a little longer to compact the tables as it was a generally busy week meaning a lot more database inserts/deletes and therefore a lot more fragmentation.

Spent a large chunk of the day helping Dave install a new fastcgi-enabled scheduler on the alpha project which meant figuring out the differences between fcgid and mod_fastcgi behavior and determining which apache directives work, etc. Pretty annoying, but finally got it all squared away - the upshot of this is we\'re now getting real scheduler logs for the first time in years, as opposed to scheduler messages cluttering up apache error logs. Cool. Of course, I was distracted enough to not notice bane (the workunit download server) spiraled out of control trying to recover from the outage. I just rebooted it and started apache with a lower ceiling to hopefully prevent this from happening again. So I\'m still operating on bane. Expect slightly slower, more painful recoveries from outages for the next while.

Despite the red bar on the science status page saying ALFA is not running, we are indeed collecting data on and off. This is a false negative due to a change in reporting from the Arecibo feed which tells us telescope position/status/etc. Jeff\'s fixing this now.

- Matt
' ), array('4 Feb 2008 22:53:30 UTC', 'Once again a normal weekend without anything bad to report. Though we are starting to "normally" push our current router to its limit - our normal Monday morning "bump" brought us just under 60 Mbits/sec. We really should be moving to the new router sooner than later - still waiting on OS upgrade support from others.

Meanwhile, our web server situation is now completely down to the one new server "thinman." I turned aging server "kosh" off today. Just like "penguin" it served us well over its many years. Sun servers tend to last forever if you let them. Here\'s a reminder that our Classic data recorder was a Sun IPX, which was already about 5 or 6 years old when we put it into service as a 24/7 collector of raw data at Arecibo, and it lasted 5 or 6 more years beyond that with nary a single problem.

Jeff and I are mostly working on the data pipeline, which got "rusty" during the extended downtime at Arecibo. It should be running fully automatically any day now, with drives full of hot, fresh data arriving regularly. We\'re collecting data now, but having to kick the system along from time to time.

- Matt' ), array('31 Jan 2008 22:54:06 UTC', 'No big shakes today. Here\'s the lowdown:

The RAID recovered just fine last night. Continuing install of OS\'es on new desktop computers. Court (former SETI@home systems administrator extraordinaire) came by for a short visit which was nice. Fighting with gnuplot to get it to do what I want. Took some active measures (using creative load balancing) to rectify long-standing feeder mod polarity problems - in other words we have too many even-numbered results-ready-to-send in the database, so I\'m currently giving preference to the even-numbered scheduler so the odd results could catch up. Should be completely transparent to our users.
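
The "mod polarity" balancing described above can be illustrated with a short sketch. It assumes (the post does not spell this out) that each ready-to-send result is tied to a feeder by its ID modulo the number of schedulers, and that traffic is then steered toward whichever feeder has the larger backlog. The function names and sample IDs are hypothetical.

```python
from collections import Counter


def feeder_for(result_id, num_feeders=2):
    """Assumed assignment rule: result ID modulo the number of feeders."""
    return result_id % num_feeders


def backlog_by_feeder(ready_ids, num_feeders=2):
    """Count ready-to-send results per feeder slot."""
    counts = Counter(feeder_for(rid, num_feeders) for rid in ready_ids)
    return [counts.get(slot, 0) for slot in range(num_feeders)]


def traffic_weights(backlog):
    """Steer traffic to each feeder in proportion to its backlog."""
    total = sum(backlog)
    if total == 0:
        # Nothing queued anywhere: split evenly.
        return [1.0 / len(backlog)] * len(backlog)
    return [count / total for count in backlog]


# Hypothetical queue with a surplus of even-numbered results.
ready = [10, 12, 14, 16, 18, 20, 11, 13]
backlog = backlog_by_feeder(ready)
weights = traffic_weights(backlog)
print(backlog, weights)
```

With more even-numbered results queued, the even-numbered feeder gets the larger share of connections until the two backlogs level out.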

As a follow up to the television crews yesterday: I have no idea where/when the thing will be on air. I\'m always pleased with increased media exposure, but personally I\'m kind of cavalier about the whole television thing. Anyway I think Dan ended up being the only person on screen. I have been in many clips before. In fact, months before SETI@home launched a news crew showed up. I didn\'t know they were coming and arrived to work on little sleep, unshowered, unshaven and wearing a rocker t-shirt. I also had freshly dyed pink hair. I ignored the cameras best I could as I was actually quite busy. I also figured this footage would only be used for the local news, if at all. That night my sister who lives on the other side of the country called. She asked, "when did you dye your hair?"

- Matt
' ), array('31 Jan 2008 0:45:41 UTC', 'Everything was kind of okay for most of the day. A couple new shuttle PCs came in - new desktops for Bob and Dan. I was setting those up, working on some database programming, etc. when the television crew for "Good Morning America" arrived. They were nice but they needed me to set up a shot with a computer running SETI@home. Oddly enough we don\'t have any systems readily available with a good display so I had to do some minor server reconfiguration to free up a fast enough computer that could show the screensaver in action.

Then the NAS holding our web site, home accounts, etc. suddenly died and was in a vicious reboot cycle. WTH? I had to power cycle the whole thing to get it to boot for real, and only then it was clear that a drive failed and it was rebuilding the respective RAID volume. Ultimately no big deal, but it is quite disconcerting it didn\'t recover so easily from a simple drive failure and had to be dealt with manually. The projects were offline there for a bit as the dust settled. The RAID is still rebuilding now. Let\'s hope another drive doesn\'t go in the meantime.

- Matt
' ), array('30 Jan 2008 0:06:05 UTC', 'Normal outage day for mysql database backup and compression. We took the opportunity to take care of two other things. First, we added a uniqueness constraint on a field in the analysis_config table in the science database. Interesting, no? Well, no, but long story short this constraint should have been there already, now it really is. Second, we upgraded the secondary science database server to latest Fedora rev and it seems to have accepted its new OS kindly. So far so good with that.

The recovery from the outage was slowed by a couple things. Bob also stopped/restarted mysql to incorporate/test some recently tweaked config parameters. This has the unfortunate side effect of flushing the 20+ GB of memory, which means that all has to be read in again before the project comes fully back up to speed. Meanwhile I thought I\'d continue tweaking the apache config on bane as it was seemingly unhappy and I ended up just making it temporarily worse. Oh well. Hang in there. Workunits will come.

Old web server penguin has been powered down and all its cables removed from the spaghetti in the closet. It has served us quite well.

- Matt' ), array('28 Jan 2008 21:28:05 UTC', 'Things are running more or less smoothly. The workunit/result traffic was fairly high over the weekend, but consistent and below our current cap, so no major faults there. Our active user count is still slowly climbing but the acceleration of growth is negative (at least until we have another press release or "reminder" e-mails are sent out). Since various index builds (and removals of seemingly unused indexes) the MySQL database is masterfully handling everything we give it. The router upgrade is still in limbo.

One odd thing was our "feeder" polarity problem reared its ugly head again. Reminder: we have two scheduling/upload servers (bruno and ptolemy) each given a separate queue of work to send to our participants. If all is well, they should send out work at the same rate. However, in the past this wasn\'t always the case. DNS favoritism was causing one queue to run out faster than the other, causing errant "no work from project" messages given to half the clients. This was fixed with software load balancing on top of DNS. However, this time around it seems the increased traffic tickled an actual, particular disparity between the two. That is, bruno writes uploaded result files to directly attached RAID storage, while ptolemy writes to bruno\'s storage over NFS. We seemed to hit a "too many files open" limit on bruno, and therefore bumped up the maximum on that. We\'ll see if that helps.

In case you haven\'t noticed, I un-DNS-aliased one of the three webservers last week, and another this morning. All public web traffic is theoretically aimed solely at our new 1U dual opteron system, and it\'s doing great. However, DNS rollout takes forever (even with time-to-live set for 5 minutes) - it will take a week or so for those old aliases to disappear. The old web servers (kosh and penguin) were wonderful sparc/solaris systems but are approaching 8 years old and therefore are relatively physically big and slow. We\'ll pull them out of the closet to make way for more modern systems - like bruno. Yeah, bruno is still sitting in our secondary lab, connected to the systems in our closet via some funky switching around the building. It will be great to have it on the same single switch as everything else.

Other plans for the week: We\'re upgrading the fedora core levels on several systems, including our science database systems. We have already tested similar upgrades on our more-expendable desktops with little trouble. However, we will proceed with great caution given many terabytes of data are involved on the database servers - full recovery would be painful, to put it mildly.

- Matt
' ), array('24 Jan 2008 21:03:59 UTC', 'I think I have the apache/tcp config in some kind of working order so that we won\'t suffer such wild dips like we had over the past couple of days. These pains were brought on by a confluence of three minor events: running out of work to send, waiting an extra precious day before enacting the database compression/backup, and reducing our backend to just one download server. You\'d think the last item was the main culprit as we seemingly slashed our server capacity by 50%, but the real bottleneck is still the router (the new one still not config\'ed yet - waiting on a new IOS image). The single download server (bane) can handle the traffic, but the apache config was such that when all the downloads started the cpu load went up to 400. Basically, MaxClients was set way too high but this went unnoticed when only half the load was on vader and half on bane. Then I set MaxClients too low - we were dropping connections long before hitting other theoretical limits. Now MaxClients is set just right. Or right enough for now. We\'re still experiencing catch up "malaise" but it\'s a much smoother ride in general than yesterday.
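
As a rough illustration of the MaxClients tuning above, here is a minimal sketch of one common rule of thumb: divide the memory left over for apache by the footprint of a single httpd child. This is an assumption for illustration, not necessarily how the value on bane was chosen, and every number below is hypothetical.

```python
def estimate_max_clients(total_ram_mb, reserved_mb, per_child_mb):
    """Largest number of httpd children that fit in the leftover memory."""
    if per_child_mb <= 0:
        raise ValueError("per_child_mb must be positive")
    usable_mb = total_ram_mb - reserved_mb
    # Never return a negative ceiling if the reservation exceeds total RAM.
    return max(usable_mb // per_child_mb, 0)


# Hypothetical server: 8 GB RAM, 2 GB kept for the OS and file cache,
# roughly 24 MB resident per httpd child.
limit = estimate_max_clients(total_ram_mb=8192, reserved_mb=2048, per_child_mb=24)
print(limit)
```

A memory-based ceiling is only a starting point: as the post notes, setting it too low drops connections early, so the final value still has to be found by watching load and connection counts.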

I\'ve actually been working on some scientific programming. With the new science indexes being built we\'re able to analyze some data to get an idea of the current RFI structure. Basically we\'re seeing the radar noise in the final data - the radar blanking signals are still being implemented so new data (once it finally starts coming in) should be far less noisy. I\'m hoping this kind of work will inspire more scientific updates from the others (remember: I\'m a math/computer geek, not an astronomer - everything I know about SETI/astronomy is from 10+ years of osmosis working here at the lab).

- Matt
' ), array('23 Jan 2008 23:27:33 UTC', 'No news on the recently donated router (see yesterday\'s post). Basically we\'re in a holding pattern waiting to get the OS updated on the thing (currently running CatOS - needs to run IOS) and then configuration should be straightforward. There are some growing pains on having server bane be the single point of workunit download. I just tweaked the apache config to lessen the load. It\'s funny how seemingly unimportant differences in CPU/memory type/amount/speed from one server to the next require radically different settings in httpd.conf or else the whole thing grinds to a halt. Anyway, expect some download pains as knobs get turned and we slowly recover from running low on ready-to-send work.

Due to the recent long weekend we had the weekly outage today instead of yesterday. All went well with that, and my recently mentioned fixes to speed things up worked well. During all that I finally finished the last parts of the disk usage shell game so our workunit storage (on the Snap Appliance) is up to its maximum size of 2.5TB, of which we\'re currently occupying 50% - that will last us a while. As well, we are pretty much ready to start OS upgrades on the science database servers next week.

- Matt' ), array('23 Jan 2008 1:16:26 UTC', 'To my fellow US citizens (and others as well), hope you had a happy MLK day (or whatever your state officially calls it). Those wondering why no tech news item yesterday, that\'s why.

I\'ll start with the negative. Lots of the usual annoying little hiccups over the weekend. Here\'s a non-chronological digest: One of the servers (bruno) lost its automount again (hasn\'t happened in a while), having the effect of inflating the validator queue before I noticed and unclogged the pipes. We went through the raw data files on disk faster than expected over the long weekend, so the results-to-send queue dropped down and we\'re going to be recovering from that for a bit. The web sites were increasingly dragged down by obnoxious activity over the weekend but that finally disappeared after I blocked the offending IP addresses.

Now the positive. Our new 1U dual opteron server "thinman" is now up and running as a public web server. We were going to use new server maul, but thinman is, well, thinner, and it\'s already in the closet. So that saves us one immediate closet upgrade. As well, we have been redundantly sending out workunits via both vader and bane. This is way overkill and a vestige of a time before we realized our problems were router-related. Since bane is also just 1U and already in the closet, I decommissioned vader as a download server. The bottom line is we only have two machines to get into the closet now (as opposed to 4): bruno and sidious. And we have a single web server which is much smaller and faster than the old servers (kosh and penguin) combined. They will be shut down sooner or later.

In better news, Bill Woodcock (a key player in getting us set up with Hurricane Electric, i.e. our current ISP and donator of our two current HE routers) has donated another cisco router to us to replace the weaker 2811. It\'s a 7600 series, a bit overkill, but will give us tons of headroom to spare. We\'ll no longer be constrained by the 60Mb/sec cap! I guess we\'ll find the next set of bottlenecks quickly, including the 100Mb cap (due to our current lab wiring to campus). Of course, we have a lot of configuring to do before this thing is up and running, but at least it\'s in the rack!

By the way, if you haven\'t heard of email bankruptcy, please read this article. I\'m declaring "thread" bankruptcy, i.e. I am letting go all current questions, open-ended threads, unfinished story lines, etc. If anything is really important it will come up again.

- Matt' ), array('17 Jan 2008 22:23:19 UTC', 'No disasters or major revelations to report today. Interesting news from yesterday: Sun bought MySQL. Not sure how this will affect us, but it reminds me that I should mention that I am generally pleased with MySQL. There was that one comment about the professor who thought industrial grade software is the only way to go, and that MySQL is for mom-and-pop ventures. Let me address: Claiming the winners in the game of capitalism hold the best solutions to whatever problem is at best an arrogant assumption with obvious overtones of classism (both intellectual and economic), especially given that "mom-and-pop" crack.

Other than that.. mostly spent the day cleaning up spills in various aisles. I also yum\'ed up my desktop to Fedora Core 8 as an exercise to do so on heftier servers in the coming weeks.

- Matt
' ), array('16 Jan 2008 23:25:12 UTC', 'The recovery went rather well yesterday, considering its extended length. Bob made some mysql tweaks to perhaps better use the memory on jocelyn (allow more protected space for query sorting, for example).

Vexing time-sinks: I spent 45 minutes this morning trying to figure out why one of the download servers (bane) was having autofs problems. Long story short: the route map was ever-so-slightly messed up so that it couldn\'t mount a single particular machine on a different subnet in our lab (why it needed to mount this machine was due to an "ls" command in a script - which by default displays color, so ls will traverse sym links to see if they are broken or not in order to select the proper color scheme, and in this case one sym link was on this remote machine). Also: the new donated server came with rails! As some of you know we have hilariously bad luck with rack rails of infinitely different (and useless) non standard sizes, and this time is no different. We needed to shrink the rail depth which should be easy. I did this to one and it fit! I did this to the other and, due to different screw hole location, it remains 1 cm too deep and unable to get any smaller. Ha ha ha (sob). Bottom line: useless rails, yet AGAIN.

But that\'s just a minor detail really - no need to rant and I don\'t want to seem ungrateful to our generous donor! We ended up putting the thing in the closet flat on top of the whole rack chassis. Works for me. We now have a new server called "thinman" (dual opteron, 16GB RAM) to help bolster the BOINC back-end! Woo-hoo! We\'ll update the server-wish-list with routers, servers, kvms, etc. soon.

Other vexing time-sink: Bogus news reports that we found a "mystery" signal should be summarily ignored. This was a gross misinterpretation by a reporter of a quick comment Dan made off the record about AstroPulse progress and recently published millisecond pulsar findings by another group. These are new stellar phenomena which are astronomically interesting (and AstroPulse hopes to find many of) but not ET. Sigh.

- Matt
' ), array('16 Jan 2008 0:37:05 UTC', 'Yeah... we\'re really pushing the boundaries of our mysql database these days. I\'m finally catching up on several years\' worth of backlogged archives and inserting zillions of rows to credited_job and this, on top of general increased usage, is gumming up the works. In fact, optimizing this table alone during today\'s outage took three hours (normally only a few minutes) - which explains the extreme length of today\'s downtime. I guess we\'ll have to turn off credited_job optimization until we actually use the table.

This brings up several questions, the first of which was asked in a previous thread: Why are you guys using mysql instead of a more robust commercial product? Two main reasons: BOINC projects generally are small academic ventures with limited funds, and BOINC is an open-source project itself utilizing other open-source pieces of software. So all you need is a relatively cheap linux box which comes with php, apache, mysql, etc. and it\'s pretty much plug and play. Remember the project specific data, i.e. the science database, can be whatever you want. In our case, it\'s Informix. Why Informix? We got it for free 10 years ago - we now have 10 years of experience using it as a group and it is still free to us. Would we consider changing to Oracle/SQL server/etc.? If somebody wants to buy such a license and donate a man/year to change all our back end software to do so, then we would perhaps entertain the thought, but we have higher priorities, especially as Informix works perfectly well at this point. It\'s the BOINC/mysql part that needs help, and we\'re sticking with it for reasons stated above, and with SETI@home being the flagship project of BOINC we don\'t want to diverge from the standard.

In other news, it seems that every day there\'s a different reason our web sites are so darn slow. Yesterday afternoon we were getting hit by some seemingly nefarious activity which I was able to block quite easily once I discovered it. But we were also getting hit by some scraping of stats pages via a robot (called BoincBot) that was not obeying robots.txt. I blocked these hits as well. We don\'t allow such activity on our web sites. If you want BOINC stats you can download the daily xml dumps just like everybody else.

On the bright side, we obtained another server donation yesterday from a private party: a 1U dual-opteron (2.4GHz) server with 16GB memory. I installed FC8 on it just now, though there was a little bit of tweaking to get that to go. There\'s no DVD drive in the thing (only a CD drive) and for some reason there was some disconnect with the 3ware disk controller such that the linux installer couldn\'t see the two root drives. I ultimately took that out of the equation and plugged the drives straight into the SATA ports on the motherboard. All\'s well and it\'s getting all yummed up now.

So we\'re looking for a KVM-over-IP, at least 16 ports (24 preferable), easy-to-use but secure connections via a web browser, etc. Any thoughts? The Belkin Omniview seems the cheapest/easiest, but only allows one person to connect to the whole unit at a time - not a showstopper. Any suggestions, experience with such devices, etc. out there?

- Matt
' ), array('14 Jan 2008 22:23:56 UTC', 'Things ran quite well over the weekend. Looks like we added the right index to the mysql database to reduce the slow "validator fix" queries. A note about general BOINC/mysql implementation/design: there are a lot of features in BOINC that are seemingly excessive from a single-project perspective, but are there as every project has different needs. Project-specific factors (server power, workunit processing times, number of active users, min quorum, etc.) make some features less helpful. In the case of "resend lost workunits" (see last thread) this feature, implemented mostly for the benefit of Einstein@home, was most definitely weighing down our database server. We turned this off and have been running smoothly since. There were assumptions this would lead to greater problems down the line (fearing many results will be sitting on disk longer waiting for their redundant pairing to return) but in fact our "results returned and waiting for validation" number has been stable (if not slowly decreasing) since I made the change. Nevertheless, at some point soon we will see if we could optimize/reimplement this code, and Eric is actually making adjustments to the splitter which will perhaps create less "fast runners."

Our new-hardware-to-obtain priorities are shifting. Namely, we need a router (we\'re not ignoring discussion about this on other threads but we are limited to what we can use for various configuration/policy reasons). We also need a new KVM - our current one in the closet is maxed out and we\'d like to get more stuff in there ASAP. We also need three new desktop systems. Dan\'s using an old, sloooow solaris system which is out of support. Bob is on a slightly faster solaris system, but needs a safe mysql test sandbox. Josh\'s old super-cheap windows/intel box is basically a glorified console server.

Had some minor issues due to the root drive on bruno filling up on Sunday. I scanned the drive and found only 4GB of stuff, while "df" was showing 40GB. Eric eventually found a deleted-yet-open file - an infinitely growing httpd log. Apparently httpd log rotation broke at some point, but we cleaned this up. Annoying, but harmless.
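
The df-versus-scan mismatch above is the classic signature of a file that was deleted while a process still holds it open. As a minimal sketch (Linux only, and not the method actually used here), the /proc filesystem can be walked to list such files: the symlinks under /proc/PID/fd end in " (deleted)" once the file is unlinked. The function names are hypothetical.

```python
import os


def is_deleted_target(link_target):
    """Linux appends this suffix to /proc fd links of unlinked files."""
    return link_target.endswith(" (deleted)")


def find_deleted_open_files(proc_root="/proc"):
    """Yield (pid, path, size_bytes) for open files that were unlinked."""
    if not os.path.isdir(proc_root):
        return
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue  # skip non-process entries
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            fd_path = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(fd_path)
                if is_deleted_target(target):
                    # stat follows the fd link to the still-open file.
                    yield int(pid), target, os.stat(fd_path).st_size
            except OSError:
                continue


if __name__ == "__main__":
    for pid, path, size in find_deleted_open_files():
        print(pid, size, path)
```

Run as root to see every process. Some entries that are not regular files (anonymous memory files, for example) can also carry the suffix, so the sizes are the thing to look at.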

Due to increased load in general, I changed the server db stats to update every hour (instead of half hour). Actually it\'s becoming clearer as we increase active user load and I\'m populating credited_job, etc. that the mysql database might be our bottleneck du jour any jour now. There were also some issues with the user-of-the-day selection process which I tracked down and fixed this morning.

- Matt
' ), array('10 Jan 2008 22:47:31 UTC', 'The public web site servers slowed to a crawl again this morning thanks to several robots/spiders scanning us at once. So I took another gander at my robots.txt file and used Google\'s webmaster tools to check how well this was being parsed. This uncovered a typo (a missing "s") and while I was at it I added some new rules to robots.txt. We\'ll see how this all fares.
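
A quick local check of how robots.txt rules get parsed can also be done with the Python standard library. The sketch below is only an illustration: the rules, paths, and bot name are hypothetical, and the misspelled directive is an invented example, not the actual typo mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule sets; the real robots.txt is not shown in the post.
GOOD_RULES = ["User-agent: *", "Disallow: /stats/"]
TYPO_RULES = ["User-agent: *", "Disalow: /stats/"]  # misspelled directive


def is_allowed(rules, user_agent, url_path):
    """Parse the given robots.txt lines and check one path for one agent."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(user_agent, url_path)


print(is_allowed(GOOD_RULES, "ExampleBot", "/stats/top_users.php"))
print(is_allowed(TYPO_RULES, "ExampleBot", "/stats/top_users.php"))
```

This parser silently ignores a directive it does not recognize, so a one-letter typo can leave a path open to every well-behaved crawler without any error being reported.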

Bob and I brought the BOINC/science database servers down briefly this morning to tweak some parameters and clean out logs - some of you may have noticed a brief data server/web site outage in the process. The only tweak of note was on the science database: we reduced the checkpoint intervals and increased the between-database-ping timeouts. Why? We\'ve been seeing the secondary spuriously enter recovery mode due to being unable to reach the primary, when really the primary was simply busy doing checkpoints at the time. Anyway, outage recovery was slowed by confluence of various stats/update scripts starting up while the database was busy flooding its memory buffers. We really need to optimize those stats queries someday. As well a relatively new BOINC feature ("resend lost workunits") was eating up a lot of database too, so we turned that off for now. Actually that last thing helped immensely.

In the process of general disk cleanup, etc. I\'m now forced to finally populate the credited_job table with three years\' worth of purge archives. These archives are taking up 200GB on a 1TB filesystem which we really need to convert into workunit storage sooner than later, hence the push. Reminder: this is the table that contains the history of which users processed which workunits.

Just between you and me... In addition to the outbound traffic squeezing through our maxed-out router, I am now sneaking out an additional 5-10% over the campus net. This is thanks to the simple/useful "pound" load balancing utility. The campus net can definitely handle this tiny increase. In fact I might bump up the percentage. But don\'t tell anybody. Mwha ha ha. [edit: I brought that percentage back down to 0% an hour later - we\'ll keep this extra power in our back pocket for now.]

By the way, the optimized client discussion has been taken offline and is progressing. Turns out this may actually be a single bad host more than a bad client.

- Matt' ), array('9 Jan 2008 22:51:15 UTC', 'More blips and blops in our traffic caused by who-knows-what. We still don\'t have enough data yet to see if yesterday\'s BOINC result outcome index build helped with those regular slow validation-fix updates. In any case, I misspoke: we are running a version of MySQL where triggers are available to us - we only have to figure out how to implement them to do what we need. This morning the secondary download server bane was having a mount headache and I had to give it a virtual kick to get it going again. And that router is still a problem, but we\'re not convinced it\'s the only problem. Swapped out cables, switches etc. to no avail this morning. I installed some real load balancing between vader and bane (in practice round robin DNS is hardly balanced) which may help.

There was still slowness to the web site as of a few minutes ago. This had nothing to do with recent web code tinkering/updates or database load or any such thing - this was strictly due to the aforementioned router problems, as half the web traffic was going through the same router (the other half over the standard campus network). I just moved the competing traffic onto the campus network as well, so that should improve web site performance in general.

Regarding recent assimilator clogs, we had another one this afternoon. And yes, once again it was from a result produced by an optimized client. This time around I attached a debugger and found the problem was in XML parsing of the result and sure enough with enough eye-squinting I found a couple garbage characters in the uploaded result file. Specifically, in the power-of-time declaration of a pulse. Instead of:

<pot length=211 encoding="x-csv">

It was:

<pot length=211 encoding71x-csv">

So there are two problems. First, something is causing corruption in the xml (the non-standard client? something else on our end?). And second, the assimilator is too sensitive to such corruption. It shouldn\'t bail out so readily and create these large ready-to-assimilate queues.
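
To illustrate the second point, here is a minimal sketch of a more forgiving approach: validate each header, set aside the ones that fail, and keep going. This is not the assimilator code (which is not shown here); the pattern only covers the header form quoted above, and the function names are hypothetical.

```python
import re

# Matches only the header form quoted above; anything else counts as corrupt.
POT_HEADER = re.compile(r"""<pot length=(\d+) encoding="([^"]+)">""")


def parse_pot_header(line):
    """Return (length, encoding), or None if the header is malformed."""
    match = POT_HEADER.fullmatch(line.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def split_results(header_lines):
    """Separate parseable headers from corrupt ones instead of aborting."""
    good, quarantined = [], []
    for line in header_lines:
        parsed = parse_pot_header(line)
        if parsed is None:
            quarantined.append(line)  # set aside for later inspection
        else:
            good.append(parsed)
    return good, quarantined


samples = [
    """<pot length=211 encoding="x-csv">""",
    """<pot length=211 encoding71x-csv">""",
]
good, quarantined = split_results(samples)
print(good, quarantined)
```

Quarantining a bad result keeps one corrupt upload from stalling everything queued behind it, at the cost of needing a separate pass to review what was set aside.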

Minor updates to the server status page: I changed references to "beam/polarization pair" to the more concise "channel." I then added a parenthetic numeric value to the ends of each data file (representing total working/done channels for each file) so you don\'t have to count the little green squares. I also added total values at the bottom for all data files (mostly so we can see how long we have before we run out of data to split). Note how the "vertical" processes (i.e. splitting multiple files at once) has a negative side effect: we are forced to keep data files around much longer, which makes it difficult to keep a queue of data on disk. Some better "vertical" logic has been coded, to be rolled out in the next day or so.

- Matt' ), array('8 Jan 2008 22:16:52 UTC', 'So we\'ve been running this annoyingly load-intensive query every day on the BOINC database to clean up results that failed validation. It takes up to an hour to run, during which it hogs a bunch of database memory and slows everything down, including workunit distribution. Why not build an index? Well, indexes still take up disk/memory, and the main table field in question is of low cardinality, and we\'re only hunting for a few thousand out of millions of rows each time. So Bob was looking into implementing a newfangled mysql "trigger" to flag the few rows when they enter this bad state, making them much easier to find without needing the overkill of an index. However, we only discovered today triggers don\'t work in our current version of mysql. So we built an index after all. We\'ll see how much it helps.
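
Here is roughly what the trigger idea looks like, sketched in Python against an in-memory SQLite database as a stand-in. This is not MySQL syntax and not the BOINC schema: the table names, the column name, and the use of 2 as the "bad" state are all made up for illustration.

```python
import sqlite3

BAD_STATE = 2  # hypothetical value meaning "failed validation"


def build_demo_db():
    """Create a toy result table plus a trigger that records bad rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE result (id INTEGER PRIMARY KEY, validate_state INTEGER);
        CREATE TABLE bad_result (result_id INTEGER PRIMARY KEY);
        CREATE TRIGGER flag_bad_result
        AFTER UPDATE OF validate_state ON result
        WHEN NEW.validate_state = 2
        BEGIN
            INSERT OR IGNORE INTO bad_result (result_id) VALUES (NEW.id);
        END;
    """)
    return conn


def demo():
    """Flip two rows out of a thousand to the bad state and list what was flagged."""
    conn = build_demo_db()
    conn.executemany(
        "INSERT INTO result (id, validate_state) VALUES (?, 0)",
        [(i,) for i in range(1, 1001)],
    )
    conn.execute(
        "UPDATE result SET validate_state = ? WHERE id IN (17, 404)",
        (BAD_STATE,),
    )
    rows = conn.execute("SELECT result_id FROM bad_result ORDER BY result_id")
    return [row[0] for row in rows]
```

The cleanup job would then read the small bad_result table rather than scanning the big one, which is the whole appeal over indexing a low-cardinality field.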

Other than that and the usual database backup outage this morning, mostly spent the day moving large numbers of files/archives around to prepare to grow the workunit storage space again. I also got the new server (maul - see yesterday\'s note) up to speed, more or less. Still won\'t be live for at least a day or two, but it\'s working. It\'s a 4x2.66 GHz dual core intel with 4 GB of memory. Looks like another perfect web server to me. Also had to grow our home directory space because, as you know, no matter how much space you have, it\'s never enough.

Somebody pointed to an article that mentioned the Cisco 2811 has a known throughput rated at about 61 Mbps. This was a surprise to me and Jeff - I guess this wasn\'t what we were told, and you\'d think a router with 100 Mbps ports could reach a theoretical maximum of 100 Mbps. The cap seems to be due to CPU limits, and we are doing tunnel encryption and have a small but still non-zero set of access rules. Anyway live and learn. And no further progress on that since yesterday.

Another storm is whizzing through. The top third of a 50 foot tree just broke off right outside my lab window. Cool. I understand why people are freaking out about this current weather, but this is nothing compared to the hurricanes I dealt with growing up in downstate NY.

- Matt
' ), array('7 Jan 2008 23:28:38 UTC', 'Lots of weather in the Bay Area over the weekend, leading to many power outages. Luckily our project was not affected.

The new pseudo-random nature of our workunit creation finally worked itself out, and we were sending data at a relatively even pace. Speaking of sending data... At the end of last week my suspicions were confirmed: the router between us and our ISP (a Cisco 2811) has been CPU bound for who-knows-how-long, thus causing an artificial 60 Mbit/sec cap on our outbound packets. Further research will determine whether we can improve its performance or if we need to procure a better router.

We had an assimilator get jammed on a broken result. I had to delete the result to clear the pipes. This happened once before a week or two ago. A little detective work this morning uncovered that both such broken results were processed by optimized clients. I\'m just sayin\'. This could easily be a coincidence.

Spent a large chunk of the day trying to coax another Intel-donated server to life. We\'ve gotten a lot of stuff from Intel recently, all in varying states of functionality (some missing CPUs, some have test boards, etc.). This particular one (4 2.66GHz CPUs, 8 GB RAM) was dead in the water for a while as it wouldn\'t respond to any keyboard/mouse. However, the other day I noticed one of the front-side fan modules wasn\'t seated properly. I adjusted it, and now the server sees all input devices. It\'s still a little squirrelly, but may be a worthwhile web server after all. We\'re calling it "maul" (sticking to the current "darth" theme). I\'ll announce it again if it actually proves to be ready for prime time.

- Matt
' ), array('3 Jan 2008 20:54:14 UTC', 'Spreading the workunit creation over several files at once seems to be helping create a healthier mix of fast/slow workunits. However, adding a second download server seems to have confirmed a suspicion of mine (key word: "seems"): that somewhere down the pike we\'re being capped at 60 Mbits/sec. For a while there we had two download servers and a workunit storage server with plenty I/O capacity to spare, but still we were hitting a hard 60 Mbit ceiling outbound. Inquiries are being drafted/sent to the appropriate parties. It still could be a local problem, but we\'re not sure what else to try (given our current hardware).

We are in the middle of building another helpful index on the science database. Looks like Bob\'s magic informix incantations are working - we can keep the project running simultaneously (though the assimilators might back up a bit). It is always happier around here when work is flowing. To be safe we increased the ready-to-send queue size to one million - we have the disk space now to keep more workunits around. The only downside is that this inflates the result table in the database by approximately 5-10%, which may exercise the RAM on the BOINC database server that much more.

There is another problem Dave and I were poking at today: excessive "out of range" failures on our public web sites. Here\'s the deal: BOINC clients have a nice GUI which shows you icons, pictures, etc. from different projects as you select which to run on your computer. Where does it get these files? From the project\'s web servers. This is all well and good, but there are several (hundreds? thousands?) older clients out there that are making such requests and being met with 416 "range not satisfiable" errors. Why? Because they have already downloaded the image file, but are making requests for more bytes beyond the file boundaries as if there were more to download. Obviously a bug somewhere, or a change in the way apache handles such things, but there\'s not much we can do about it. Even though this activity is creating bursts of heavy load on our web servers, this is a fire we\'re going to let burn for now.
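
As a rough illustration of why those requests fail: a range request that starts at or beyond the end of the file cannot be satisfied, so the server answers 416 instead of 206. Below is a minimal Python sketch of that check. The function name is hypothetical, and it only handles the simple open-ended form (Range: bytes=N-), which is an assumption about what these clients send.

```python
def range_response(first_byte, file_size):
    """Return (status, Content-Range value) for a "Range: bytes=first_byte-" request."""
    if first_byte >= file_size:
        # Nothing left to send: the client already has the whole file.
        return 416, "bytes */%d" % file_size
    last_byte = file_size - 1
    return 206, "bytes %d-%d/%d" % (first_byte, last_byte, file_size)
```

A client that has already fetched all 1000 bytes of an image and asks again starting at byte 1000 lands in the first branch every time, which matches the behavior described above.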

The official press release about multi-beam is finally out. This should help on many levels (though I\'ll be busier making sure the servers can handle any significant load increase). I guess I\'ll also be shaving every morning in case there is interest from the national television news media.

I guess this is "technical" news: Our desks/chairs/furniture are mostly ancient hand-me-downs, some pieces older than I. We did get some new chair donations recently, but one of them broke - it came loose from its base, causing unsuspecting sitters to suddenly fall forward if their balance wasn\'t particularly keen. It\'s been lurking in our lab way too long, coaxing uninformed standers with tired legs to rest upon its comfortable and seemingly stable cushion base. I came to the lab this morning and that evil chair was by my desk with a note taped to it: "Matt - can you please toss this chair?" I guess enough was enough. I dragged it to the dumpster and sent it back to the dark void from whence it came.

- Matt
' ), array('2 Jan 2008 22:54:11 UTC', 'Happy new year! Actually, being that every moment is the beginning of some arbitrarily defined era, I should be more clear: Happy new calendar year number 2008, whoever uses this particular calendar system which I usually do!

The weekend was busy with the more-and-more-common fast workunits. Discussions today at the lab brought up the fact that about a third of our data will translate into these fast runners, so we better turn our attention back towards improving the data pipeline. We picked two low hanging fruits today: convert server bane from a redundant web server to a secondary download server. This will help determine if that bottleneck is the server or the storage. I also added a flag to the splitter scripts to select files in beam/polarization pair order, not filename order. This will help pseudo-randomize the creation of work, and hopefully spread the pain of fast workunit periods so we aren\'t so overwhelmed at times.

Nevertheless, we have Astropulse coming down the pike, and have a lot of SETI@home data to go through (and we\'re starting to collect new data again!). So we need to upgrade the network/servers in a big way. And acquire more participants. Not sure how this will all happen yet, but it has to happen.

Meanwhile, we might try another science database index build tomorrow (or soon thereafter). Bob found a way to do so while the database is up and inserting rows, so we might not have to shut down splitters/assimilators during the long build. Cool.

- Matt
' ) ); ?>
3 May 2018, 0:15:54 UTC
Of course it's been a busy time as always, but I'll quickly say I posted something on the general SETI blog:

The short story: Last week I and others on the Breakthrough Listen team took a trip to Parkes Observatory in NSW, Australia. A short but productive visit, getting much needed maintenance done. However, we're still not exactly quite there for generating/distributing raw data from Parkes to be used for SETI@home. This will happen, though!

- Matt

see comments

4 Mar 2018, 17:15:26 UTC
I'm still here! Well, still working 200% time for Breakthrough Listen. Yes I know long time no see. I've been quite busy, and less involved with the day-to-day operations of SETI@home... but I miss writing technical news updates. And clearly we (as in all the projects of the Berkeley SETI Research Center) could always use more engagement. So I want to get back into that habit.

However, I would like a different forum than these message boards. I'll be writing mostly of Breakthrough Listen stuff, so this may not be the most appropriate place. Also non-SETI@home participants would be unable to post responses.

Any thoughts on preferred communication methods in this day and age? It's hard because people have their favorites (blogs, facebook, twitter, reddit, instagram, tumblr, ello, mastodon, vero, etc.).

Basically I'm looking for places I can post stuff where anybody can view, anybody can respond in kind and ask questions (not that I'll have much time to answer but I'll try since I'll be likely asking questions too), but I won't have to moderate or spam filter or what-have-you. I know there's a lot of hostility towards facebook and the like, but it *works* (at least for now). There are general Berkeley SETI Research Channels already, but I don't want to overwhelm them with my frequent informal nerdish ramblings.

Now that I'm writing all this maybe I should just start a Breakthrough Listen forum here on the SETI@home page, echoing whatever I write elsewhere. Still, I'm curious about what people think about this sort of thing these days.

In the meantime, if you want - I encourage you to sign up for the general Berkeley SETI Research Center social media channels:


Thank you all -

- Matt

see comments

17 May 2016, 23:00:11 UTC
I haven't written in a long long while, with good reason: As of December 2015 I moved entirely to working on Breakthrough Listen, and Jeff and Eric heroically picked up all the slack. Of course we are all one big SETI family here at Berkeley and the many projects overlap, so I'm still helping out on various SETI@home fronts. But keeping Breakthrough moving forward has been occupying most of my time, and thus I'm not doing any of the day-to-day stuff that was fodder for many past tech news items.

That said, I thought it would be fun to chime in again on some random subset of things. I guess I could look and see what Eric already reported on over the past few months, but I won't so I apologize if there's any redundancy.

First, thanks to gaining access to a free computing cluster (off campus) and a simultaneous influx of free time from Dave and Eric we are making some huge advances in reducing the science database. All the database performance problems I've been moaning about for years are finally getting solved, or worked around, basically. This is really good news.

Second, obviously we are also finally splitting Green Bank data for SETI@home. While Jeff and Eric are managing all that, it's up to me to pass data from the Breakthrough Listen pipeline at Green Bank to our servers here at Berkeley. This is no small feat. We're collecting 250TB of data a day with Breakthrough Listen - and maybe eventually recording as much as 1000TB (or 1PB a day). When we aren't collecting data we need every cycle we have to reduce the data to some manageable size. It's still in my court to figure out how to get some of this unreduced data to Berkeley. Shipping disks is not possible, or at least not as easy as it is at Arecibo (because we aren't recording to single, shippable disks, but to giant arrays that aren't going anywhere). We may be able to do the data transfers over the net, and in theory have 10GB links between Berkeley and Green Bank, but in practice there's a 1GB chokepoint somewhere. We're still figuring that out, but we have lots of data queued up so no crisis... yet.
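
To put rough numbers on that chokepoint, here is a small back-of-the-envelope sketch in Python. It assumes the link figures above are gigabits per second, decimal units (1 TB = 10^12 bytes), and a link running flat out with no protocol overhead, none of which is guaranteed. The function names are just for illustration.

```python
SECONDS_PER_DAY = 86_400


def tb_per_day(link_gbit_per_s):
    """Terabytes a fully used link of the given speed can move in one day."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e12


def days_to_move(data_tb, link_gbit_per_s):
    """Days needed to move data_tb terabytes over the link."""
    return data_tb / tb_per_day(link_gbit_per_s)
```

Under those assumptions a 1 Gbit/s path moves about 10.8 TB per day, so a single day's 250TB of raw data would take roughly 23 days to ship. Even a clean 10 Gbit/s path tops out near 108 TB per day, which is why only a subset of the unreduced data can realistically travel over the net.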

Third, our new (not so new anymore) server centurion has been a life-saver, taking over as both the splitter for Green Bank data (turns out we needed a hefty server like this - the older servers would have fallen behind demand quickly) as well as our web server muarae1 when that system went bonkers around the start of the year. Well, we finally got a new muarae1 so centurion is back to being centurion - a dedicated splitter, a storage server, and potentially a science database clone and analysis machine. We also got a muarae2 server which is a backup (and eventual replacement) for the web server.

Fourth, our storage server bucket is having fits. All was well for a while but this is an old clunker of a machine so it's no surprise its internal disk controllers are misbehaving (we've seen similar behavior on similar old Sun servers). No real news here, as it doesn't have any obvious effect on public data services, but it means that Jeff and I have to wake up early and meet at the data center to deal with it tomorrow.

And on that note.. there's lots more of course but I should get back to it...

see comments

10 Nov 2015, 23:15:41 UTC
So one thing I left off yesterday's catchup technical news item was the splitter snafu from last week which caused a bunch of bogus broken workunits to be generated (and will continue to gum up the system until they pass through).

Basically that was due to us having the splitter code cracked open to eventually work with Green Bank data. Progress is being made on this front. However some code changes for Green Bank affected our current splitter for Arecibo, as we needed to change some things to make the splitter telescope agnostic (i.e. generalized to work with any data from any telescope). These changes were tested in beta, or at least we thought they were thoroughly tested, but things definitely broke in the public project. We fixed that, but not before a ton of bad workunits made their way into the world. We still have some clean up to do on that front.

BUT ALSO we needed to update some fields in the current science database schema to also make the database itself telescope agnostic. Just a few "alter table" commands to lengthen the tape name fields beyond 20 characters. We thought these alters would take a few hours (and complete before the end of today's Tuesday outage). Now it looks like it might take a day. We can't split/assimilate any new work until the alters are finished. Oh well. We're going to run out of work tonight, but should have fresh work sometime tomorrow morning. It is a holiday tomorrow, so cut us some slack if it's later than tomorrow morning :).

- Matt

see comments

10 Nov 2015, 0:29:29 UTC
Okay. Every time I put off writing a tech news item a bunch more stuff happens that causes me to continue putting off even further! So here's a quick stream-of-consciousness update, though I'm sure I'm missing some key bits.

First off, the AP migration is officially finished! As a recap, things got corrupted in the database late last year - and to uncorrupt it required a long and slow migration of data among various servers and temporary tables. A lot of the slowness was just us being super careful, and we were largely able to continue normal operations during most of this time. Anyway I literally just dropped the last temporary table and its database spaces a few hours ago. Check that off!

One of the temporary servers used for the above is now being repurposed as a desperately needed file server just for internal/sysadmin use (temporary storage for database backups, scratch space, etc.). For this I just spent a couple hours last week unloading and reloading slightly bigger hard drives in 48 drive trays.

A couple months ago we also checked another big thing off our list: getting off of Hurricane Electric and going back to using the campus network for our whole operation. The last time campus supported all our bandwidth needs was around 2002 (when the whole campus had a 100Mbit limit, and paid for bandwidth by the bit). The upshot of this is that we no longer have to pay for our own bandwidth ($1000/month) and we can also manage our own address space instead of relying on campus. Basically it's all much cheaper and easier to maintain now. Plus we're also no longer relying on various routers out of our control including one at the PAIX that we haven't been able to log into for years.

But! Of course there were a couple unexpected snags with this network change. First, our lustre file server had a little fit when we changed its IP addresses to this new address space. So we changed it back, but it still wouldn't work! Long story short, we learned a lot about lustre and the voodoo necessary to keep these things behaving. Making matters more confusing was a switch that was part of this lustre complex having its own little fit.

The other snag was moving some campus management addresses into our address space, which also should have been trivial, but unearthed this maddening, and still not completely understood, problem where one of the two routers directing all the traffic in and out of the campus data center seemed unhappy with a small random subset of our addresses, and people all over the planet were intermittently unable to reach our servers. I think the eventual solution was campus rebooting the problem router.

Those starving for new Astropulse work - I swear new data from Arecibo will be coming. Just waiting for enough disks to make a complete shipment. Meanwhile Jeff is hard at work making a Green Bank splitter. Lots of fresh data from fresh sources coming around the bend... Part of the reason I bumped up the ceiling for results-ready-to-send was to do a little advance stress testing on this front.

Oh yeah there was that boinc bug (in the web code) that caused the mysql replica to break every so often. Looks like that's fixed.

Over the weekend lots of random servers had headaches due to one of the GALFA machines going down. It's on the list to separate various dependencies such that this sort of thing doesn't keep happening. Didn't help that me and Eric were both on vacation when this went down.

Meanwhile my daily routine includes a large helping of SERENDIP 6 development, a lesser helping of messing around with VMs (as we start entering the modern age), and taking care of various bits of the Breakthrough Listen project that have fallen on my plate.

- Matt

see comments

31 Aug 2015, 21:01:09 UTC
Right now there's a whole bunch of activity taking place regarding the Astropulse database cleanup project. Basically this week all AP activity will be off line (possibly longer than a week) as I'm rebuilding the server/OS from scratch as we're upgrading to larger disks, then merging everything together onto this new system. So all the assimilators and splitters will be offline until this is finished.

The silver lining is we're currently mostly splitting data from 2011 which has already been processed by AP, so it wouldn't be doing much anyway. Good timing.

There will be new data from Arecibo eventually, and progress continues on a splitter for data collected at Green Bank.

Uh oh, looks like the master science database server crashed. Garden variety crash at first glance (i.e. requiring a simple reboot). I guess I better go deal with that...

- Matt

see comments

21 Aug 2015, 22:47:12 UTC
Those panicking about a coming storm due to lack of data... The well is pretty dry but Jeff and I just uncovered a stash of tapes from 2011 that require some re-analysis, so that's why you'll see a bunch showing up in the splitter queue over the weekend (hopefully before the results-to-send queue drops to zero).

In the meantime, we are still recording data at AO (not fast enough to keep our crunchers supplied), but.... this situation has really pushed us to devote more resources to finally finishing the GBT splitter, which will avail to us another reserve supply of data in case we hit another dry spell.

The network switch on Tuesday seems to have gone fairly well. We are now sending all our bits over the campus net just like the very old days <waxes nostalgic>.

- Matt

see comments

17 Aug 2015, 17:56:43 UTC
I've been meaning to do a tech news item for a while. Let's just say things have been chaotic.

Some big news is that campus is, for the first time in more than a decade, allowing SETI@home traffic back on the campus network infrastructure, thus obviating our need to pay for our own ISP. We are attempting this switchover tomorrow morning. Thus there will be more chaos and outages and DNS cache issues but this has been in the works for literal years so we're not stopping now. I apologize if this seems sudden but we weren't sure if this was actually going to happen until this past weekend.

We are finally seeming to get beyond the 2^32 result id problem and its various aftershocks. Due to various residual bugs after deploying the first wave of back-end process upgrades we have a ton of orphan results in the database (hence the large purge queue) which I'll clean up as I can.
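
For anyone wondering what a 2^32 problem looks like in practice, here is a tiny Python sketch. It assumes the issue is the familiar one of ids outgrowing an unsigned 32-bit field, and the function names are made up for illustration rather than taken from the BOINC code.

```python
import struct

MAX_UINT32 = 2**32 - 1  # largest id an unsigned 32-bit field can hold


def fits_uint32(result_id):
    """True if the id can be stored in an unsigned 32-bit field."""
    return 0 <= result_id <= MAX_UINT32


def wrap_uint32(result_id):
    """What the id silently becomes if the high bits are simply dropped."""
    return result_id & 0xFFFFFFFF


def pack_uint32(result_id):
    """Pack the id as 4 little-endian bytes; raises struct.error if too big."""
    return struct.pack("<I", result_id)
```

Any back-end process that still truncates the way wrap_uint32 does would match a new result to the wrong (or no) row, which is one plausible way to end up with orphans.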

Re: BOINC server issues galore, all I gotta say is: Ugh. Lots of bad timing and cursed hardware.

The Astropulse database cleanup continues, though progress has stalled for several months due to one Informix hurdle requiring us to employ a different solution, then simply just failing to coordinate the schedules between me, Jeff, Eric, and our various other projects. But we will soon upgrade the server and start merging all the databases back into one. This hasn't slowed the public facing part of the project, or reduced science, but it will be wonderful to get this behind us someday.

So much more to write about, but as I wait for dust to settle ten more dust clouds are being kicked up...

- Matt

see comments

23 Jun 2015, 22:16:16 UTC
Catching up on some news...

We suddenly had a window of opportunity to get another SERENDIP VI instrument built and deployed at Green Bank Telescope in West Virginia. So we were all preoccupied with that project for the past month or so, culminating in most of the team (myself included) going to the site to hook everything up.

So what does this mean? Currently we have three instruments actively collecting data. One SETI@home data recorder at Arecibo - where all our workunits come from, and two SERENDIP VI instruments - one at Arecibo and one at Green Bank - collecting data in a different format. Once the dust settles on the recent install and we get our bearings on the SERENDIP VI data and bandwidth capabilities we will sort out how to get the computing power of SETI@home involved. Lots of work ahead of us and a very positive period of scientific growth and potential.

We also are in a very positive period of general team growth as the previously disparate hardware/software groups have slowly been merging together the past couple of years, and now that we all have a place to work here at Campbell Hall on campus - proximity changes everything. Plus we have the bandwidth to pick up some students for the summer. Basically the new building and the Green Bank project rekindled all kinds of activity. I hope this yields scientific and public-outreach improvements we've been sorely lacking for way too long (getting Steve Croft on board has already helped on these and many other fronts). We still need some new hardware, though. More about all this at some point soon...

Meanwhile, some notes about current day-to-day operations. Same old, same old. We got some new data from Arecibo which Jeff just threw into the hopper. I just had to go down to the colo and adjust some loose power cables (?!?!) that caused our web server to be down for about 12 hours this morning. Some failed drives were replaced, some more failed drives need to be replaced. Now that Jeff and I are back to focusing on SETI@home the various database-improvement projects are back on our plates...

Speaking of databases, the Astropulse rebuild project continues! As predicted the big rebuild project on the temporary database completed in early June. To speed this up (from a year to a mere 3 months) I did all the rebuilding in 8 table fragments and ran these all in parallel. I thought the merging of the fragments into one whole table would take about an hour. In practice it took 8 days. Fine. That finished this past weekend, and I started an index build that is still running. When that completes we then have to merge the current active database with this one. So there are many more steps, but the big one is behind us. I think. It needs to be restated that we are able to achieve normal public-facing operations on Astropulse during all this, outside of some brief (i.e. less than 24 hour) outages in the future.
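
The fragment-and-parallelize approach can be sketched in a few lines of Python. This is only an illustration of the general idea: splitting by row id range is an assumption (the real fragments may be defined differently), and rebuild_fragment is a placeholder that just counts rows where the real job would reload them.

```python
from concurrent.futures import ThreadPoolExecutor


def fragment_ranges(first_id, last_id, fragments=8):
    """Split an inclusive id range into near-equal contiguous pieces."""
    total = last_id - first_id + 1
    base, extra = divmod(total, fragments)
    ranges, start = [], first_id
    for i in range(fragments):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more fragments than rows
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def rebuild_fragment(id_range):
    """Placeholder for the per-fragment work; returns rows handled."""
    low, high = id_range
    return high - low + 1


def rebuild_in_parallel(first_id, last_id, fragments=8):
    """Run every fragment concurrently and return the total rows handled."""
    ranges = fragment_ranges(first_id, last_id, fragments)
    with ThreadPoolExecutor(max_workers=fragments) as pool:
        return sum(pool.map(rebuild_fragment, ranges))
```

Note the sketch says nothing about the merge step afterwards, which (as the 8 days above show) can cost far more than the parallel part saves if it is underestimated.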

Speaking of outages, this Saturday (the 27th) we will be bringing the project down for the day as the colo is messing with power lines and while they are confident we shouldn't lose power during their upgrades we're going to play it safe and make sure our databases are quiescent. I'll post something on the front page about this.

Still no word about new cricket traffic graphs, but that's rolled up with various campus-level network projects so there's not much we can do about that.

- Matt

see comments

21 May 2015, 17:50:33 UTC
It's been a busy bunch of weeks of behind-the-scenes stuff. I'll catch up on at least a few things.

So the data center (where most of our servers are housed, along with most of the important servers on campus) had a big network migration over to a new infrastructure this past week. This was mostly to bring the center up to 10Gbe potential (and beyond), but being we are still constrained by various other bottlenecks (our own Hurricane Electric route, our various 1Gb NICs, our ability to create workunits, etc.) we're not going to see any change in our traffic levels. Anyway, it went according to plan for the most part, though the next day I had to do some cleanup to get all the servers rolling again.

Yes, the cricket traffic graphs have changed during the migration. Did any of you find the new one yet? I haven't actually looked - we have our own internal graphs so this isn't a pressing need - but I did ping campus just now about it. Oh - just as I was typing that last sentence campus responded saying they are still evaluating options for how to gather/present this information. Looks like changes are afoot.

There is a push to finish getting our SERENDIP VI technology installed at Green Bank Telescope. Maybe as soon as mid-June, though nothing is set in stone (until we buy our flights). So the whole team is a bit preoccupied with preparing for that. A GBT splitter is already in the works.

We're well beyond the annoying wave of science database hangups for now (that were plaguing us last month as I was trying to migrate a full table into new database spaces). Now we're back to plotting the next big thing (or several things) to clean up the database, make it faster, etc.. A mixture of removing redundant data and obtaining new server hardware. I know I've mentioned all this before. We do progress on this front but at glacial speeds due to incredible caution, lack of resources, and balancing priorities. For example one priority was an NSF proposal round that pretty much occupied me and Dan and several others for the entire past week.

The whole Astropulse database cleanup project continues. As predicted, it's taking a long time to merge/reload all the data into the temporary database (current prediction: it'll complete in early June). Meanwhile we're still using server marvin (as normal) to collect current results. Once the merge/reload completes (all fingers/toes crossed) we will stop the database on marvin for a few days - merge them both together, and reload it all back on marvin. Note: if any of this fails at any time, we won't lose any data (we just have to try again in a different manner). We also aren't constrained by any of this - we are splitting/assimilating AP workunits as fast as we can, just as during normal operations.

- Matt

see comments

9 Apr 2015, 16:47:04 UTC
So! All the recent headaches are due to continuing issues with the master science database. While all the data seems to be intact, there's something fundamentally wrong causing informix to keep hanging up (usually when we are continuing work on reconnecting the fragmented result tables).

During the previous recent crashes I would clean up what informix was complaining about in various error messages (always the result table indexes) but this time I'm in the process of doing a comprehensive check of everything in the database just to be sure. And, in fact, I'm seeing minor problems that I've been able to clean up thus far (once again, no loss in data - just internal bookkeeping and broken index issues).

I thought this full check would be done by now (ha ha) but it's not even close. Meanwhile we *should* be able to do Astropulse work, but the software blanking engine requires the master science database to do some integrity checks, so that is all offline as well.

There are ways to speed up such events in the future. We're working on enacting several improvements. Yes, we here are all beyond tired of our project grinding to a halt and things will change for the better.

- Matt

see comments

1 Apr 2015, 23:05:33 UTC
Quick update:

We keep having these persistent science database crashes. It's a real pain! Despite how bad this sounds, this shouldn't greatly affect normal work flow as we get on top of this. Yesterday we had the third crash in as many weeks, always failing due to some corrupted index on the result table that we drop and rebuild. Not sure what the problem is, to be quite honest, but I'm sure we'll figure it out.

The BOINC (mysql) databases are fine, and the Astropulse databases are operating normally.

Lots of internal talks recently about the current server farm, the current database throughput situation, and looking forward to the future as we are expecting a bunch more data coming down the pike from various other sources. Dave is doing a bit of R&D on transitioning to a much more realistic, modern, and useful database framework, as well as adding some new functionality to the backend that will help buffer results before they go in the database, so we can still assimilate even if the database is down (like right now).
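
The buffering idea is simple enough to sketch. The Python below is not Dave's design, just a minimal in-memory illustration under a couple of assumptions: the class name is invented, the database insert is passed in as a function, and a down database shows up as a ConnectionError. A real version would presumably spool to disk so buffered results survive a restart.

```python
from collections import deque


class BufferedAssimilator:
    """Accept results at any time; write them out once the database responds."""

    def __init__(self, insert_fn):
        self.insert_fn = insert_fn  # callable that stores one result
        self.pending = deque()      # results waiting for the database

    def assimilate(self, result):
        """Queue the result, then try to drain the queue in arrival order."""
        self.pending.append(result)
        return self.flush()

    def flush(self):
        """Insert queued results until done or the database is unreachable."""
        while self.pending:
            try:
                self.insert_fn(self.pending[0])
            except ConnectionError:
                return False  # database still down; keep everything queued
            self.pending.popleft()
        return True
```

A result is only removed from the queue after its insert succeeds, so an outage in the middle of a flush loses nothing and preserves ordering.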

None of the above is an April Fools joke of any sort. Not that you should have read it that way.

More to come...

- Matt

see comments

25 Mar 2015, 18:57:16 UTC
So! We had another database pileup yesterday. Basically informix on paddym crashed again in similar fashion as it did last week, and thus there was some rebuilding to be done. No lost data - just having to drop and rebuild a couple indexes, and run a bunch of checks which take a while. It's back up and running now.

While taking care of that oscar crashed. Once again, no lost data, but there was some slow recovery and we'll have to resync the replica database on carolyn next week during the standard outage.

So there is naturally some concern about the recent spate of server/database issues, but let me assure you this is not a sign of impending project collapse but some normal issues, a bit of bad timing, perhaps a little bad planning, and not much else.

Basically it's now clear that all of paddym's failures lately were due to a single bad disk. That disk is no longer in its RAID. I should have booted that drive out of the RAID last week, but it wasn't obviously the cause of the previous crash until the same thing happened again.

The mysql crashes are a bit more worrisome, but I'm willing to believe they are largely due to the general size of the database growing without bounds (lots of user/host rows that never get deleted) and thus perhaps reaching some functional mysql limits. I'm sure we can tune mysql better, but keep in mind due to the paddym issues lately, the assimilator queue gets inflated with waiting results, and thus the database inflates upwards to 15% more than its normal size. Anyway, Dave and I might start removing old/unused user/host rows to help keep this db nice and trim.

The other informix issues are due to picking table/extent sizes based on the current hard drive sizes of the day, and really rough estimates about how much is enough to last for N years. These limits are vague and, in general, not that big a deal to fix when we hit them. In the case of paddym, which has a ton of disk space, we recently hit that limit in the result table, and just created db spaces for a new table and are in the process of migrating the old results into this new table - which would have been done by now if it weren't for those aforementioned crashes. As for marvin and the Astropulse database, we didn't have the disk space, so we had to copy the whole thing to another system - and the rows in question contain these large blobs which are incredibly slow to re-insert during the migration.

In summation, these problems are incredibly simple and manageable in the grand scheme of things - I'm pretty sure once we're beyond this cluster of headaches it'll be fine for the next while. But it can't be ignored that 1. all these random outages are resulting in much frustration/confusion for our crunchers, and 2. there is always room for improvement, especially since we still aren't getting as much science done as we would like.

So! How could we improve things?

1. More servers. Seems like an obvious solution, but there is some resistance to just throwing money and CPUs at the project. For starters, we are actually out of IP addresses to use at the colo (we were given a /27 subnet) and it's a big bureaucratic project to get more addresses. So we can't just throw a system in the rack at this point. There are workarounds in the meantime, however. Also, more servers equals more management. And we've been bitten by "solutions" in the past to improve uptime and redundancy that actually ended up reducing both. In short we need a clear plan before just getting any older servers, and an update to our server "wish list" is admittedly way overdue.

2. More and faster storage. If we could get, like, a few hundred usable TB of archival (i.e. not necessarily fast) storage and, say, 50-100 TB of usable SSD storage - all of it simple and stupid and easy to manage - then my general anxiety level would drop a bit. We actually do have the former archival storage. Another group here was basically throwing away their old Sun disk arrays, which we are starting to incorporate into our general framework. One of them (which has 48 1TB drives in it) is the system we're using to help migrate the Astropulse db, for example. A lot of super fast disk space for our production databases wouldn't solve all our problems, but it would still be awesome. Would it be worth the incredibly high SSD prices? Unclear.

3. Different databases. I'm happy with mysql and informix, especially given their cost and our internal expertise. They are *fine*. But, Dave is doing some exploratory research into migrating key parts of our science database into a cluster/cloud framework, or otherwise, to achieve google/facebook-like lookup speeds. So there is behind-the-scenes R&D happening on this front.

4. More manpower. This is always a good thing, and this situation is actually improving, thanks to a slightly-better-than-normal financial picture lately. That said, we are all being pulled in many directions these days beyond SETI@home.
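
On the IP address point in item 1, the arithmetic behind a /27 is easy to check. Here is a minimal sketch using Python's standard ipaddress module. The network below is a documentation-range placeholder, not our actual subnet, and in practice a gateway or two usually eats into the usable count as well.

```python
import ipaddress

# Hypothetical example network (documentation range), NOT the project's real subnet.
subnet = ipaddress.ip_network("192.0.2.0/27")

# A /27 leaves 32 - 27 = 5 host bits, so 2**5 addresses in total.
total = subnet.num_addresses

# hosts() skips the network and broadcast addresses.
usable = len(list(subnet.hosts()))

print(total, usable)  # 32 30
```

So there are roughly thirty addresses for every server, management interface, and switch in the rack, which is why one more box is not a trivial request.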

As I said before way back when, every day here is like a game of whack-a-mole, and progress is happening on all fronts at disparate rates. I'm not sure if any of this sets troubled minds at ease. But that's the current situation, and I personally think things have been pretty good lately but the goodness is unfortunately obscured by some simultaneous server crashes and database headaches.

- Matt

20 Mar 2015, 19:37:21 UTC
Another weekend approaches. Perfect time for an update.

Master science database (informix/paddym): Due to transient disk issues last weekend the database crashed, but we quickly recovered. However this caused a big assimilator queue backlog, which we are just now about to clear out. I'll wait till Monday before I start up the result table merges (we hit extent limits on the old table, so I'm shoveling all those into a new larger table with more extents).

Random crashes: yesterday one of our file servers choked up for no obvious reason (it's a lustre system, on which I'm not an expert, so nothing that happens on that is obvious to me). I don't think this had any public effects, but it was hanging a bunch of our servers up a bit. I actually spent all morning getting that in working order again. This morning synergy (the scheduling server) had some automounter freakout so I just rebooted it to clear some pipes. All is well now.

Oh yeah the results-to-send queue got kinda low as part of that synergy freakout, but also due to hitting a bunch of tapes with data the splitter is deeming unworthy of workunit creation. So some CPU and I/O time is being wasted as it goes through those. It's best to let this just push through on its own. I did just add a bunch more files to the blanking/splitter queue for processing over the weekend (they'll show up within the next 24 hours or so).

Unless there's weirdness before the end of the day I won't do anything crazy that'll mess things up for the weekend. I've been otherwise working on some GBT (Green Bank Telescope) code in advance of getting SERENDIP VI hardware working there.

- Matt

16 Mar 2015, 21:55:46 UTC
Happy Monday!

So yeah things were looking good Friday afternoon when I got marvin (and the Astropulse database) working enough to generate new work and insert new results, and thus bring Astropulse on line. A couple stupid NFS hangs at the end of the day rained on my parade, but things were still working once stuff rebooted.

But turns out pretty much all the data in our queue was already split by Astropulse so only a few thousand workunits were generated. We broke the dam, but there was not much on the other side. There will be actual AP work coming on line soon (the raw data has to go through all the software blanking processing hence the delay).

Meanwhile over the weekend our main science database server on paddym crashed due to a bungled index in the result table. I think this was due to a spurious disk error, but informix was in a sad state. I got it kinda back up and running Sunday night, but have been spending all day repairing/checking that index (and the whole database) so we haven't been able to assimilate any results for a while. Once again: soon.

- Matt

11 Mar 2015, 21:09:20 UTC
How about a new thread?

Last night I noticed the assimilators were failing, and this led to the usual conclusion: we ran out of extents on a table in the master database - this time the result table itself. Each result has a very basic entry in this table - it's basically a bridge between its signals and the parent workunit. Anyway, the solution is simple but painful - we gotta rebuild the whole table again.

Luckily, we can do this in parallel with inserting new stuff. So I made a new result table this morning (hence the assimilator queues backing up in the meantime) and then over the next weeks (months) I can silently shovel the older results into this new table. There is a balancing act between size limits and performance. When building these tables we hope to build them big enough so we don't run into these logical barriers, but not so big they don't perform. Sometimes we aim too low, and then we hit these barriers (like running out of extents).

In the case of the Astropulse database, I'm still making a massive backup copy of the whole database as it lives on disk (about 13TB) to archival storage before I do any of the next steps. The plan is still the same as when I last mentioned it - we will begin the reloads from the beginning using a new method that will only take about 2 months. In the meantime we will set up another functionally temporary db to insert new Astropulse data, which we will then merge again after these 2 months (or however long it takes). So we may see AP back on line again in the near future. Anyway, it's still a mess, but there's slow/steady/cautious progress.

I'm keeping "resend lost results" off for now - this functionality has been clobbering the BOINC/mysql database for a while. I think this is partially due to that database being a bit bloated with undigested workunits (i.e. stuff that's been stuck a while due to the Astropulse issues).

- Matt

5 Mar 2015, 0:12:03 UTC
Some updates!

The AstroPulse database is still in recovery mode. Since Eric last wrote about this, I set up a temporary server with an effectively infinite amount of disk space (38TB usable), and then Eric copied the whole thing over to it. The original server lacked the space to build temporary tables, dbspaces, do unloads, etc., which led to various other problems.

Anyway, we decided to keep the rebuild of the db nice and simple - basically unload all the signals into files, drop all the tables and corrupted dbspaces, and then rebuild it all from scratch via these files. The first phase went along swimmingly, albeit slower than expected (which is always the case, I guess). Then I started the reloads - which were taking much, much longer than expected. Once it got rolling I estimated it would take about 3 months to finish!

Some analysis yesterday revealed this was due to basic inefficiencies in the load command which weren't a problem in the past (on much smaller tables with much smaller row sizes). So... we're kind of back to square one unless we decide to let this all take three months. I'm trying several timing tests in the meantime to determine the best course of action.

I mentioned this in another thread, but I'll repeat it here: recently we've been crossing some vague (and still unknown) internal limit with our mysql database (the BOINC/user/web database). This has resulted in certain web and scheduler queries clogging up the works. We've been attacking each clog as it happens. Nothing on the db server has yet leaped out as an obvious problem, so it's just basic whack-a-mole for now.

Other behind the scenes stuff:

Eric's had a recent run of bad luck with his own servers - we had to completely rebuild an OS, replace a power supply, and then another power supply, and then a whole 3ware card that went dead for no good reason.

One of our servers (the four-headed monster that is muarae{1,2,3,4}) developed a weird power issue - muarae2 seems completely dead. Fair enough, but when you try to power cycle it for some reason muarae4 power cycles as well. This is a bit worrisome as muarae1 is our main web server, so it's not great it's part of this slightly dysfunctional complex.

The servers vader and georgem also had dead power supplies, or so I thought. I got a replacement for georgem but that didn't work either! Long story short, it turns out one of the power loops in the back of the rack (at the colocation facility) got a little messed up (though this wasn't very obvious). When I moved these "broken" supplies to a different loop they were fine.

Otherwise a lot of attention spent on SERENDIP VI (code walkthroughs, plots, data pipeline, trying to track down obnoxiously persistent performance issues), proposals, and the usual mix of daily chores and repairs. I am finding it hilarious how our 45-drive JBOD at the colo (which we got about 3 years ago) is having drives drop like flies right now. I've been replacing about 1 drive a week on average in that array for the past 3-4 months.

As for me, I'm back to full time - have been since December, and will be until July. My schedule was fairly erratic the past three years, hence falling out of the tech news habit (though Eric definitely picked up the slack).

- Matt

21 Feb 2015, 23:38:12 UTC
Sorry for the long time with no updates. It's been a few busy weeks for me. First there was the ICON EUV critical design review (CDR) to prepare for. Now I'm in a panic about getting the automation in our calibration facility up and running before next week, and I'm preparing for a talk I'm giving next week. It hasn't left a lot of time for updates.

Well, as you've surmised, the repair on the Astropulse database came to a screeching halt a couple weeks ago when the impossible happened. We were moving files to free up space and then linking them back in. The script was checking before a file was moved to see if it was already a link or if it was a file to prevent a link being moved on top of a file, using the standard unix command "test -f" which returns true for files and false for links. Except on that day it returned true for about 10 links in a row. About the only thing I can come up with for a reason would be a bug in the xfs file system or in the "test" command, which is built into bash.
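
A note for anyone scripting a similar move: POSIX specifies that `test -f` resolves symbolic links, so it reports true for a link that points at a regular file, and `test -L` (or `-h`) is the check for the link itself. Whether that accounts for what happened here is only a guess, since the script and its history are not shown. The sketch below illustrates the same distinction in Python, with made-up file names; it is not the script that was used.

```python
import os
import tempfile

def classify(path):
    """Classify a path without following a final symlink."""
    if os.path.islink(path):   # inspects the link itself (lstat)
        return "link"
    if os.path.isfile(path):   # follows links, so islink must be tested first
        return "file"
    return "other"

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical names, for illustration only.
    target = os.path.join(tmp, "chunk.dat")
    link = os.path.join(tmp, "chunk.link")
    with open(target, "w") as f:
        f.write("data")
    os.symlink(target, link)

    # isfile() resolves the link, much like `test -f` does.
    follows = os.path.isfile(link)
    kinds = (classify(target), classify(link))

print(follows, kinds)  # True ('file', 'link')
```

Checking for the link first, and only then for a regular file, keeps a link from ever being treated as the file it points to.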

Anyway about 4.5 million astropulse signals lost their data blobs when that happened. This isn't as bad as it seems because that blob is just a little chunk of the workunit that can be recovered from the original data file.

But we decided that marvin was too tight on disk space for this reorganization, so we moved the astropulse database and all its data to another machine where we made a dump of all the signals, dropped the old data chunks, added new ones, and now we're reloading all the signals at which point we will move everything back to marvin. Matt says the reload will probably be done Monday, then another day or two to copy everything back to marvin.

22 Jan 2015, 0:36:56 UTC
PaddyM crashed last night, possibly due to extra-high demand on its disks. It came back up OK, but in order to get things in a state where it won't happen again, we're going to have to do some shuffling.

In order to do the shuffling I've had to bring the Astropulse database back down for a while. That means we will probably run out of Astropulse work (if we haven't already). I'm hoping the file moving will go fast enough that I can bring AP back on line tonight, but I wouldn't say that I'm confident that will happen.

13 Dec 2014, 4:31:16 UTC
As some of you have noticed the Astropulse database is back down. We brought it down to delete some empty data chunks and move the new data chunks onto local storage which went just fine. But when I brought the database back up it complained that one of the root data chunks was corrupt. This was a chunk that wasn't touched in the process.

After many attempts to repair the damage it became apparent that we needed to restore the fragment containing that chunk. (Yes, I know this is all gibberish.) Unfortunately in order to restore one 16GB file we need to read in the entire 4.5TB backup and then run the logs since that backup was made (Tuesday). The first step will take at least 36 hours. I can't estimate the duration of step 2 yet.

25 Nov 2014, 1:48:38 UTC
I was right. It failed. But it never gave an error message. It just undid everything it did and ended up in the same place it started. Harumph.

So I renamed the ap_signal table, created a new ap_signal table with the proper structure to avoid the problem, and now I'm copying everything into the new table. Early estimates are that it will finish on Thursday. But early estimates are never right. These things seem to slow down as they run.

If you're looking for another use to which to put your computer, some of our crunchers encouraged me to create a SETI@home fundraising campaign at Bitcoin Utopia. It allows you to use BOINC to generate donations in cryptocurrencies like bitcoin for SETI@home.

21 Nov 2014, 18:07:08 UTC
It's still running. Apparently it has slowed down over time. It's still got 16.5M rows to go. So it'll be another two days before I can tell you how many more days it will be.
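
For the curious, the "two days" figure is roughly what a straight-line extrapolation gives when this update is compared with the 20 Nov one (31M rows left at that point). This is a rough sketch only: it assumes a constant rate, which is exactly what does not seem to hold here, and it assumes the row counts measure forward progress.

```python
from datetime import datetime

# Timestamps and row counts taken from the two updates (UTC).
t0 = datetime(2014, 11, 20, 1, 15, 21)   # 31M rows left
t1 = datetime(2014, 11, 21, 18, 7, 8)    # 16.5M rows left
rows_left_t0 = 31.0e6
rows_left_t1 = 16.5e6

elapsed_hours = (t1 - t0).total_seconds() / 3600
rate = (rows_left_t0 - rows_left_t1) / elapsed_hours   # rows per hour
days_remaining = rows_left_t1 / rate / 24

print(f"{rate:,.0f} rows/hour, about {days_remaining:.1f} days to go")
```

That works out to roughly 355,000 rows per hour and just under two days, if nothing slows down further.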

20 Nov 2014, 1:15:21 UTC
The AstroPulse database rebuild is continuing with 31M rows left to go (as far as I can tell). In which case, by Friday morning (PST) we should know whether it worked. Meanwhile Matt and Jeff are scrounging the archives for data we've overlooked.

Unfortunately, I think the rebuild didn't work. There hasn't been an error message or any error indication, but the appearance of the read/write statistics leads me to believe that it failed about 5 days ago and has spent the time since then undoing what it had done so far. Stopping it now would only make recovering things worse. That's not how databases work.

If I'm right, we'll end up doing a data dump and reload. At this point we're sticking with Informix. I looked into other databases, and PostgreSQL was the only feature complete database in the correct price range. With MySQL and its derivatives I would need to come up with a way to emulate defined types and LISTs. PostgreSQL has defined types and its array support is very much like lists. The only thing we use that PostgreSQL is missing is synonyms. Annoyingly "end" is a reserved word, so columns named "end" are forbidden, so that would have to change. It would probably only take me a few days of coding to write the interface layer to our database classes and to modify the schema_to_class compiler to parse PostgreSQL's schema syntax (it currently does Informix and MySQL). And a couple days to build and test all the server components. Of course those are full time days, and I don't really have any of those in the next couple weeks. So Informix it is, for the time being. I may peck at the PostgreSQL code during my down time, for future use.

13 Nov 2014, 17:14:27 UTC
The Astropulse database fix is taking longer than expected. By one measure (the amount of data written into the new database spaces) we're 70% done. By another measure (the number of table rows showing up in the table stats) we are 25% done. This is after 5 days. I'm inclined to believe the 25% number, in which case we will run out of SETI@home work before Astropulse is back online. If you don't have a low share back-up project, now might be the time to add one.

Informix never ceases to astonish me with the way it does things. The table rebuild is maxing out neither CPUs nor I/O, primarily because it doesn't seem to be running the table creation in parallel. It's working on one table fragment at a time. It also seems to be making a copy of the data in two places and leaving large portions of the allocated data chunks unused. It's certainly not the way I would have written it, but then again I don't write databases. The worst case is that these multiple copies of data will cause it to exceed the limits again and we'll have to start over. It looks like we're 68% of the way to exceeding those limits. Hope Informix is smart enough to avoid the problem.

9 Nov 2014, 0:15:29 UTC
Here's a long overdue status update that will hopefully answer some of the questions you may have about the last week.

    1. Bruno started hanging with a "stuck cpu" linux kernel message. I don't know what causes this sort of thing. Moving all the services except uploads to other machines seems to have solved the problem, so far. Next week we're planning to replace Bruno with a Sun X4540 that the Lab was removing from service.

    2. Around the same time, the Astropulse assimilators started failing with a message "-603 Cannot close TEXT or BYTE value." Turns out we had run up against another informix limit. I'm resolving that, but it's looking like we are only 1/8th of the way done after 24 hours. Until it's done we won't be able to generate Astropulse work.

    3. New Data. Finally, in the last few weeks, we've gotten some data from 2014 split and out the door. Not a lot, though. There are a few reasons why it took so long. First, Arecibo itself and the ALFA receiver that we use for SETI@home were offline for much of 2014 (mostly January to June), so we don't have that much data. Second, because of funding, Astronomy isn't top dog at Arecibo anymore, so Astronomers get a smaller fraction of the observing time. And compounding it is that disks are bigger and we don't send a box until it's full (to save on shipping costs), so it takes longer to fill a box of disks. Which brings us to...

    4. Old Data. We have been working on old data. It's not a "make work" thing. Most of the data we've been sending had a problem the first time around. Either part of the data was left unprocessed, or the results were questionable. In addition, all the old data that has been sent had not been processed with S@H v7, so there was no autocorrelation analysis done on it. We've still got big chunks of data that have never been processed with Astropulse to send out. We don't believe in making work for the sake of making work.

    5. Will we run out of data? It depends upon what you mean by that. We may run out of SETI@home data taken by the current data recorder, although there is still plenty of Astropulse data to process. Jeff is prioritizing the GBT data splitter, so we hope to have that on line before too long. It will also be the starting point for the next thing, which will be to use SERENDIP VI as a data recorder. It should be capable of much higher data rates (GBps) than the current recorder, and therefore much higher bandwidths. It should also give us our first taste of the 327MHz Sky Survey data.

Hope that answers some of your questions.

22 Oct 2014, 4:48:46 UTC
Angela won't let me get into bed tonight until I post a Tech News update. This is going to be a short update because it is late and I am tired.

The reason for the long outage today was that our backup database machine crashed over the weekend and we decided that it was easier to restore that machine from a new backup of the primary database than to try to recover the data and replay the database logs to the backup. It never quite works out when you do that.

For any guys thinking about proposing to their girlfriends this upcoming holiday season, buy that ring now. Marriage is a wonderful institution and I am so lucky to be a married man. She is making me say that.

24 Jul 2014, 21:22:58 UTC
We had a few extra minor planned outages this week (nothing as severe as I warned on the front page) for repairs down at the colocation facility where most of our servers are kept. Quite simply, they needed to replace some backup power circuits that were "out of tolerance" (i.e. not up to high standards for availability in case other circuits failed). In theory we didn't really need to bring servers/services down, because most of them have multiple power supplies spread over different circuits. But (1) we used this as an exercise to confirm that all of our systems will indeed stay afloat if power is suddenly cut to one of their supplies (all systems passed this test!) and (2) we have some systems which only have one supply, and thus had to be moved onto safer power for the repairs. Also (3) if the databases and servers are off, the systems are pulling much less power, thus lessening the impact when power was moved around during the procedure.

Anyway, everything went smoothly, we're back on line (still recovering a little bit as I write this sentence). As you were :).

- Matt

8 May 2014, 17:57:25 UTC
Well, the network is completely down at the Space Sciences Lab. In fact, for most of campus. So why not whip up a tech news article?

"But wait," you ask, "if the network is down at SSL and most of campus, why didn't I or my clients notice this?" The answer: because all of our SETI@home servers are down at the data center/colocation facility. One of the reasons we made this move over a year ago was to avoid these power/cooling/network issues that were all too common here at the lab and elsewhere on campus. So here is a good example of why this was a good move.

Oh the internet just came up as I was just writing that last sentence. There was a major router failure on campus affecting many departments, and it was replaced fairly quickly (the whole outage lasted ~2 hours). Pretty severe failure it seems, and handled fairly well.

Anyway, the projects I mentioned in the last thread are still cookin' on the stove, with the usual bureaucratic obstacles and other mundane derailments due to our general lack of manpower. But there's positive progress on all fronts. I'm back to full speed development/testing on the science database optimizations, our initial issues with the Xyratex box (pretty much all due to our previous lack of lustre expertise) are behind us, and the new general web site will be launched once I get back from yet another short out-of-town stint.

Okay, back to work.

- Matt

7 Jan 2014, 23:34:02 UTC
The nightmare that was my 2013 schedule is behind us. Actually 2014 may be better, or may be worse. Time will tell.

Ugh, it's been a while. Once again I'm finding myself in major-catchup-mode when trying to figure out what to mention. On the surface, it doesn't seem that much has been happening (maybe), but we've been quite busy on various projects. Here's some of them off the top of my head.

First, neither the perfect solution nor the money ever appeared to magically speed up our science databases, so we are working on plan B: vastly reduce the size of these databases so we can actually work with them! We've been coming up with clever ways to quickly bring our final science results down to less than 10% of their original size with a minimal, perhaps zero, compromise in sensitivity. That's one project. Maybe we'll create a new client which can do this reduction for us (which will require much larger workunits).

Second, we recently obtained a generous donation of a lustre file server from Xyratex. There was some effort to install this system and ramp up on managing lustre, but now we have a 120TB sandbox to play with. Currently we're using it to house all kinds of data from various other SETI projects, or as a backup/SETI@home data buffer. As we get more comfortable with the system we'll push on it with more i/o intensive projects.

Third, as SETI@home chugs along we are also dividing our efforts working on various other SETI projects. This will all be clearer after we launch (no ETA as of yet) our new web site.

Meanwhile we had a few crashes recently. As we now have a faster network inside the colocation facility and larger disk arrays, we are hitting some linux i/o limits causing CPUs to randomly lock up (requiring hard power cycles). Anybody have any tips on this front? I'm messing with /sys/block/<device>/queue/scheduler to see if that helps.
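
For those who haven't poked at that sysfs file: reading it returns the available i/o schedulers on one line, with the active one in square brackets (e.g. "noop deadline [cfq]"), and writing a name to it as root switches schedulers. Here is a small sketch for inspecting it. The helper names are made up for illustration, and the device name you pass in (say, "sda") depends on the machine.

```python
from pathlib import Path

def parse_scheduler(text):
    """Return (active, available) from a sysfs scheduler line like 'noop deadline [cfq]'."""
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]      # strip the brackets marking the active scheduler
            active = name
        available.append(name)
    return active, available

def read_scheduler(device):
    """Read the scheduler line for a block device name such as 'sda' (hypothetical)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler(path.read_text())

# Parsing a sample line, so this runs on any machine.
sample = parse_scheduler("noop deadline [cfq]")
print(sample)  # ('cfq', ['noop', 'deadline', 'cfq'])
```

Which scheduler actually helps with these lockups, if any, is still an open question.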

Also one of Eric's systems had a RAID6 on it which suffered 3 drive failures within a week over the holiday. Such cruel timing. We're recovering from that (it's backed up regularly) but until it's back on line other co-dependent systems are getting headaches.

Same old, same old. I'm looking forward to the newer web site, which will have more contributors and more information - one of our general problems over the years.

- Matt

13 Aug 2013, 21:07:53 UTC
Hello again! Once again I'm emerging from a span of time where I was either out of the lab or in the lab working on non-newsworthy development, and realizing it's been way too long since I drummed up one of these reports.

We had our usual Tuesday outage again today. Same old, same old. However last week we had some scary, unexpected server crashes. First oscar (our main mysql server) crashed, and then a couple hours after that so did carolyn (the replica). Neither crashed so much as the kernels got into some sort of deadlock and couldn't be unwedged - in both cases we got the people down at the colocation facility to reboot the machines for us and all was well. Except the replica database needed to be resync'ed. I did so rather quickly though the project had been up for a while and thus was not at a safe, clean break point. I thought all was well until after coming out of today's outage when the replica hit a point of confusion in its logs. I guess I need that clean break point - I'm resync'ing again now and will do so again more safely next week. No big deal - this isn't hurting normal operations in the least.

Though largely we are under normal operating conditions, there are other behind the scenes activities going on - news to come when the time is right. One thing I can mention is that we're closer and closer to deciding that getting our science database entirely on solid state drives is going to be unavoidable if we are to ever analyze all this data. We just keep hitting disk i/o bottlenecks no matter what we try to speed things up.

Any other thoughts and questions? Am I missing anything? Yes, I know about the splitters getting stuck on some files...

- Matt

19 Jun 2013, 19:12:40 UTC
Here's a (long overdue) status report. I was out of the lab for all of May. During that time Eric, Jeff, and company got V7 out the door. Outside of that, operations were pretty much normal (weekly outages, a couple server hiccups, and slow but steady scientific analysis and software development). V7 gives us, among other things, a new ET signature to look for: autocorrelations. Eric described this and more in his thread here.

I think it's safe to say the move to the colocation facility is looking to be a success. The extra bandwidth alone is a huge improvement (yes?). Having less mental clutter involving system admin is another gain. Thus far we had only one minor crisis that required us to actually go there and fix things in person. That's not the worst problem, as the facility is easy enough to get to and near a good cafe. I still spend a lot of time doing admin, but definitely less than before, and with the warm fuzzy feeling that if there are power or heating issues somebody else will deal with it.

Server-news-wise, we did acquire another donated box - a 3U monster that actually contains four motherboards, each with 2 hexa-core Xeon CPUs and 72GB of memory, and 3 SATA drives. Despite being in one box, they are four distinct machines: muarae1, muarae2, muarae3, and muarae4. You may have noticed (or not) that muarae1 has already been employed to replace thinman as the main SETI@home web site server. We hope to retire thinman soon, if only because it is physically too large by today's standards (3U, 4 cpus, 28GB) and thus costing us too much money (as the colocation facility charges us by the rack space unit). It is also too deep for its current rack by a couple inches and hindering air flow. The plans for the remaining muaraes are still being debated. Eric is already using another as a GALFA compute server. By the way, as I write this thinman is still around and getting web hits from the few people/robots out there that have IP addresses hard wired or really stubborn DNS caches.

The current big behind-the-scenes push involves cleaning up the database to get all the different data "epochs" (classic, enhanced, multibeam, non-blanked, hardware-blanked, software-blanked, V7, etc.) into one unified format, while (finally) closing in on a giant programming library to reduce and analyze data from any time or source. Part of the motivation is the acquisition of data from the Green Bank Telescope, and folding that data into our current suite of tools. In particular, my current task is porting the drifting RFI detection algorithm (which I last touched 14 years ago!) from the hard-wired SERENDIP IV version to a generalized version.

Oh yeah there is a current dearth of work as I am about to post this message. We are on it. We burned through the last batch much quicker than expected.

- Matt

see comments

8 Apr 2013, 22:10:38 UTC
So! We made the big move to the colocation facility without too much pain and anguish. In fact, thanks to some precise planning and preparation we were pretty much back on line a day earlier than expected.

Were there any problems during the move? Nothing too crazy. Some expected confusion about the network/DNS configuration. A lot of expected struggle due to the frustrating non-standards regarding rack rails. And one unexpected nuisance where the power strips mounted in the back of the rack were blocking the external sata ports on the jbod which holds georgem/paddym's disks. However if we moved the strip, it would block other ports on other servers. It was a bit of a puzzle, eventually solved.

It feels great knowing our servers are on real backup power for the first time ever, and on a functional kvm, and behind a more rigid firewall that we control ourselves. As well, we no longer have that 100Mbit hardware limit in our way, so we can use the full gigabit of Hurricane Electric bandwidth.

Jeff and I predicted based on previous demand that we'd see, once things settled down, a bandwidth usage average of 150Mbits/second (as long as both multibeam and astropulse workunits were available). And in fact this is what we're seeing, though we are still tuning some throttle mechanisms to make sure we don't go much higher than that.

Why not go higher? At least three reasons for now. First, we don't really have the data or the ability to split workunits faster than that. Second, we eventually hope to move off Hurricane and get on the campus network (and wantonly grabbing all the bits we can for no clear scientific reason wouldn't be setting a good example that we are in control of our needs/traffic). Third, and perhaps most importantly, it seems that our result storage server can't handle much higher a load. Yes, that seems to be our big bottleneck at this point - the ability of that server to write results to disk much faster than current demand. We expected as much. We'll look into improving the disk i/o on that system soon. And we'll see how we fare after tomorrow's outage...

What's next? We still have a couple more servers to bring down, perhaps next week, like the BOINC/CASPER web servers, and Eric's GALFA machines. None of these will have any impact on SETI@home. Meanwhile there are lots of minor annoyances. Remember that a lot of our server issues stemmed from a crazy web of cross dependencies (mostly NFS). Well in advance we started to untangle that web to get these servers on different subnets, but you can imagine we missed some pieces, and you can imagine the resulting fallout: a decade's worth of scripts scattered around in a decade's worth of random locations, each expecting a mount to exist and not getting it. Nothing remotely tragic, and we may very well be beyond all that at this point.

- Matt

see comments

28 Mar 2013, 19:49:07 UTC
Once again we had a long period of rather stable uptime and thus little drama and stuff to report about. We've also been quite busy preparing for the big move to the colocation facility next week! I posted about this on the front page already, but brace for a long 3-day outage starting on Monday during which we'll unrack most of our servers, schlep them to the colo, hook them up, then battle a hundred expected network issues, and then a hundred unexpected network issues. Brace for unreachable servers and web sites! (I'll put up some stub web sites best I can.)

Earlier this week we already brought one test server down there and hooked it up, and we've been getting our feet wet with the various remote connectivity and network management tricks and tools. Fun stuff!

So I have little to report at the moment except I'll see y'all on the other side, hopefully with improved uptime and network bandwidth! And unless I forget to take nicer pictures on Monday during the big move, here's one last iPhone 3GS version of the server closet taken a few minutes ago...

- Matt

see comments

21 Feb 2013, 20:34:01 UTC
I already posted this on the front page, but FYI there's going to be another lab-wide power outage all weekend, during which all our servers will be unreachable. Hopefully this is the last of this sort of thing, and/or we relocate to the colocation facility before it happens again.

Meanwhile, we've hit a few bumps in the road. I don't think anything dire is happening outside of normal, expected drive failures and kernel hangs. But it's been causing cascading failures on the public facing servers thanks to the web of dependencies each machine has on another. It may seem bad, but everything is more or less okay. I think. I continue to aggressively upgrade and prepare for the impending probable move to the colocation facility, so maybe I'm exercising some lingering, forgotten hardware and configuration issues.

That's all I have to report for now, tech-wise. Behind the scenes development has been largely focused on getting a new polyphase filter bank splitter into production. The current splitter has standard, known FFT artifacts causing dips in sensitivity at the edges of workunits and rolloffs at the edges of the whole 2.5MHz band, but this new splitter will create workunits that exhibit more even sensitivity across the whole spectrum, as well as more sensitivity in general to find signals in the noise. We also are turning corners on (finally) getting the NTPCkr back into regular production.

- Matt

see comments

30 Jan 2013, 20:12:18 UTC
The other day synergy (the scheduling server) had one of its (more and more frequent) CPU locks. I'm pretty sure this is a problem with the linux kernel, and not hardware, as this problem happened on bruno when it was the scheduling server. Maybe this could be a software bug, but it's a pretty ugly crash that seems to be an inability to handle high demand. Maybe it's the way we have the system tuned. In any case, this happened just before the regular weekly outage, so the timing wasn't too bad.

During the outage I wrapped up one lingering project - merging a couple large tables in the Astropulse database. This is why the ap_assimilators have been off for most of the past week. I also have been getting more aggressive in upgrading the OSes on the backend systems for increased security and stability.

In reality the main push for upgrading the OSes is to bring everything to a point which will require a minimal amount of hands-on server administration... because we are currently evaluating the pros and cons of moving our server farm to a colocation facility on campus. We haven't decided one way or another yet, as we still have to determine costs and feasibility of moving our Hurricane Electric connection down on campus (where the facility is located). If we do end up making the leap, we immediately gain (a) better air conditioning without worry, (b) full UPS without worry, and (c) much better remote kvm access without worry (our current situation is wonky at best). Maybe we'll also get more bandwidth (that's a big maybe). Plus they have staff on hand to kick machines if necessary. This would vastly free up time and mental bandwidth so Jeff, Eric, and I can work on other things, like science! The con of course is the inconvenience if we do have to be hands-on with a broken server. Anyway, exciting times! This wouldn't be possible, of course, without many recent server upgrades that vastly reduced our physical footprint (or rackprint), thus bringing rack space rental at the colo within a reasonable limit.

I'll have more news on this front, of course, as we work our way through various hurdles, or end up backing out of the move and keeping things where they are. I should mention recent a/c fixes in our current closet were a total success, so there now seems to be less of a reason to rush into a colo situation. On the other hand, we have yet another planned lab-wide power outage coming up in February. We're getting real sick and tired of those. This wouldn't be an issue at the colo.

- Matt

see comments

10 Jan 2013, 21:55:19 UTC
The new year is unfolding nicely, more or less. Wow - 2013. Every new year now sounds like a science fiction year. I don't really have anything major to report, but here's another update anyway.

We were supposed to have some more lab-wide power repairs last weekend. This got postponed to a later date which has yet to be settled upon.

As I've been mentioning for years, the boinc server backend (everything pertaining to creating the workunit, sending it out, receiving the result and processing it) performs in many parts on a set of constantly changing servers of disparate make and model and power, and thus some problems involve so many moving targets that it's almost impossible to diagnose them. I tend to refer to these times when performance is lower than expected as "server malaise." It also doesn't help we are dealing with an almost constant malaise given we are pretty much maxed out on our network connection to the world 24 hours a day. This is like running a retail business with a line out the door 24 hours a day - no quiet time to clean the place up, restock the shelves, etc.

Usually when we see some queue backing up, or network traffic drop, the procedure is somewhat like this: 1. check to see if a server or important service (httpd, informix, mysql) isn't running - these are easy to find and hopefully easy to fix. 2. check to see if some BOINC mechanism (validation, assimilation, etc.) is stuck on something - these are relatively easy to find (by scanning logs and process tables) and sometimes easy to fix, but not always. 3. check to see if everything is kind of working, just slowly. If this is true, we tend to write it off as "server malaise" and wait and see if it improves on its own - the functional equivalent of "take two aspirin and call me in the morning." Usually we find things improve on their own over time, or if not then more obvious clues as to actual problems make themselves clearer. We simply don't find it an efficient use of our very limited time to understand and solve every problem perfectly.
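The three-step procedure above can be sketched as a tiny decision function. This is only an illustration of the ordering, not our actual tooling: the function name, the three check callables, and the returned strings are all made up for the example.

```python
from typing import Callable


def triage(
    service_down: Callable[[], bool],
    mechanism_stuck: Callable[[], bool],
    merely_slow: Callable[[], bool],
) -> str:
    """Walk the checks in the order described above and return the next action.

    Each argument is a hypothetical check that returns True when it finds
    the corresponding problem.
    """
    # Step 1: is a server or important service (httpd, informix, mysql) down?
    if service_down():
        return "restart the server or service"
    # Step 2: is a BOINC mechanism (validation, assimilation, ...) stuck?
    if mechanism_stuck():
        return "unstick the BOINC mechanism"
    # Step 3: everything works, just slowly -> "server malaise".
    if merely_slow():
        return "server malaise: wait and see"
    # None of the checks matched, so there is nothing to act on yet.
    return "nothing obviously wrong"
```

The point of the ordering is that the cheap, obvious checks come first, and "wait and see" is only reached once they have been ruled out.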

I mention all this as we certainly had a few malaises over the past few weeks. The one last week was due to one cronjob failing to run, which didn't update some statistics, which led to some splitters running too much and generating too much work, which led to a bloated database and bloated filesystem, which led to slow backend processing, which took about 4 days to clear out, but it eventually did without any effort on our part. During that time general upload/download bandwidth was constrained a tad, but we survived.

Otherwise, things are well. The recent (or relatively recent) server upgrades have been a major blessing, and more are planned. During the outage on Tuesday I actually moved some servers around such that *all* the SETI related servers are now in the closet (as opposed to our auxiliary lab). This is a first, I think. Outside of our desktops all SETI machines are in the racks.

Of course, this is just in time for the closet a/c to be in need of repair. This surgery is happening on Monday, and may take a couple days, during which the projects will all be down (with limited servers left up to keep the web site alive with a warning on the front page and status updates). We hope to be back up Tuesday afternoon. There is a chance repairs won't work. We have a plan B (and C) if this happens but let's just be positive and cross that bridge if/when we get there.

Oh yeah one random note. Yesterday I had some fun with this database weirdness. Somewhere along the line, perhaps during one of many sudden power outages, a small set (i.e. about 10 out of 3,000,000,000) of the spikes in the database were cloned, and became two entries in the database, with the same id #s. This is "impossible" as id #s are primary keys and supposed to be unique. So which of the clones we were seeing was depending on how you were selecting these spikes - selecting by id or by some other field you'd get one clone or the other. This wasn't apparent at all until I tried to update values in these spikes, and then when selecting them I'd get the unupdated clone version and it looked like the update wasn't working. Long story short I finally figured this out and got rid of the clones. But yeah databases sure can be funny sometimes.
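For the curious, the usual way to hunt for this kind of clone is a GROUP BY on the id column. Below is a minimal sketch using Python's built-in sqlite3 as a stand-in (the real science database is informix, and the table and column names here are hypothetical). The demo table deliberately has no primary key constraint, since that is the only way to simulate the "impossible" state.

```python
import sqlite3


def find_duplicate_ids(conn, table, id_column="id"):
    """Return (id, count) pairs for ids that appear more than once.

    The table and column names are interpolated directly into the SQL,
    so they must come from trusted code, never from user input.
    """
    query = (
        f"SELECT {id_column}, COUNT(*) FROM {table} "
        f"GROUP BY {id_column} HAVING COUNT(*) > 1 ORDER BY {id_column}"
    )
    return conn.execute(query).fetchall()


# Simulated damage: a "spike" table without a PRIMARY KEY constraint,
# holding one cloned row (id 2 appears twice with different values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spike (id INTEGER, power REAL)")
conn.executemany(
    "INSERT INTO spike VALUES (?, ?)",
    [(1, 24.0), (2, 30.5), (2, 31.0), (3, 27.2)],
)
duplicates = find_duplicate_ids(conn, "spike")
```

On a genuinely corrupted database the result of a query like this may itself depend on whether the engine walks the index or the raw table, which is more or less the weirdness described above.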

- Matt

see comments

20 Dec 2012, 21:11:10 UTC
One more quick update before the apocalypse. Or holiday week off. Or whatever.

We seem to be still having minor headaches due to fallout from the power failures of a couple weeks ago. The various back end queues aren't draining as fast as we'd like. We mostly see that in the assimilator queue size. We recently realized that the backlog is such that one of the four assimilators is dealing with over 99% of the backlog - so effectively we're only 25% as efficient dealing with this particular queue. We're letting this clear itself out "naturally" as opposed to adding more complexity to solve a temporary problem.
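Here is the arithmetic behind that 25% figure, as a rough sketch. It assumes every assimilator drains its own share of the backlog at the same rate and cannot help with anyone else's share, which is a simplification; the function name is made up for the example.

```python
def drain_efficiency(shares):
    """Efficiency of draining a backlog split unevenly across workers.

    `shares` holds each worker's fraction of the backlog (summing to 1).
    With equal per-worker rates, the drain time is set by the largest
    share, while the ideal (balanced) time is set by 1 / number of workers.
    """
    ideal_share = 1.0 / len(shares)
    return ideal_share / max(shares)


# One of four assimilators holds 99% of the backlog; the rest split 1%.
lopsided = drain_efficiency([0.99, 0.01 / 3, 0.01 / 3, 0.01 / 3])
balanced = drain_efficiency([0.25, 0.25, 0.25, 0.25])
```

Under those assumptions the lopsided case works out to 0.25 / 0.99, i.e. roughly a quarter of the balanced throughput.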

I did cause a couple more headaches this morning moving archives from one full partition on one server to a less full partition on another. This caused all the queues to expand, and all network traffic to slow down. This is a bit of a clue as to our general woes. Maybe there's some faulty internal network wiring or switching or configuration...?

On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.

Okay. See you on the other side...

- Matt

see comments

12 Dec 2012, 23:08:56 UTC
I returned to the lab again on Monday (after nearly 2 months off traveling all over Europe from France to Bulgaria and everything in between). Many thanks once again to Jeff and Eric who maintained operations during my absence (and dealt with the heinous power outage/database corruption woes last week).

During that power failure we lost one of our lesser servers (lando). Not sure exactly what happened to it, but it kept crashing. Luckily we had an ample replacement server on the shelf, and thus lando has been reborn. I set up this new system and more and more we're using Scientific Linux, which is a lot like Fedora but geared towards a bit more stability (instead of major version upgrades every 6 months and falling off support shortly after each upgrade). Basically it's an OS for people who use computers to actually compute! So far so good.

Anyway, the fallout of this last outage is that we are weighing several giant plans to move forward in the new year regarding how we maintain (or perhaps relocate) our server closet, with better network, cooling, power, remote kvm access, and UPS protection all parts of this equation.

Our assimilators are falling behind, or not catching up as fast as they should. Jeff and I are stumped about this at the moment, as there are no obvious smoking guns, but it may just be a typical case of several hidden bottlenecks working in conjunction with each other to give us a headache. It's not a real problem right now, but we'll be kicking things around on this front in the coming days.

I also just started a secondary funding drive e-mail, basically a follow-up to the mass mail sent in October/November. If you haven't opted out of such mails, or your spam filter isn't too aggressive, then you should be seeing one of those in your mailbox sometime in the near future. Of course, we already vastly appreciate the donation of your computer cycles!

Okay, back to work. I'll be around for the next while. There's more crazy world tour plans in the spring, but nothing solid yet, and definitely nothing until then. I'll be here until at least mid April, if not longer...

- Matt

see comments

6 Dec 2012, 18:50:30 UTC
We have recently come out of a painful outage. Last Thursday, 11/29, there was an unexpected power outage at Space Sciences Lab. It lasted some 20 minutes. Eric came over as quickly as he could to shut machines down, but he works in another building from where our machine room is, so the UPS's had run out their fairly short on-battery time by the time he got there. It was a perfect storm in that Matt and I (who work a few feet from the machine room) were both out.

Most machines came through OK, but three did not. Lando, an older administrative work horse (and splitter machine) appears to be dead. We have some spares from which to choose its replacement. More tragic was the fact that the master BOINC database, and its replica, suffered unrepairable corruption. This was an astonishing bit of bad luck. Both machines are on UPS and both machines have battery backed RAID controllers. One would think that all database logging would have at least made it to the RAID controller, but it obviously did not.

In order to recover the master database, we had to actually delete all of the underlying files and then recreate all of the databases from scratch before recovering from backup. A simple recovery from the backup did not work. After recreating the databases and then recovering from the backup, we ran all of the MySQL binary logs to recover up to a point in time just before the outage. Then we took a fresh backup of the database in case the next step did more harm than good. The next step was to run an extensive table check/repair on all tables in both the production and beta databases. All tables reported OK. Good! We then brought the projects up and used the fresh backup to restore the replica.

One might ask why we don't have machines automatically shut down in an on-battery situation. A good question with a lot of history. To make a long story short, our server complex has enough cross dependencies that if machines come down in the "wrong" order, other machines can hang. Plus some of our old UPS's would hiccup and cause a spurious shutdown (I'm not sure if our current crop have this problem). This was enough of a headache that we went with a very simple design. Our database machines would have battery backed RAID and be on UPS with no automatic shutdown. The theory was that the UPS would hold the machines for the duration of very short (one or two minute) power outages and, beyond that, the RAID controllers would save any pending IO. This very simple design has served us well but, as we see, not in all cases.

Eric came up with a good compromise. We will configure the BOINC replica database machine to immediately shut down (after stopping the database and unmounting its file system in case the shutdown hangs) upon detecting an on-battery condition. Nothing is dependent on this machine, so a spurious shutdown would not be a disaster. This should prevent a disaster of this magnitude from recurring.
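The ordering in that compromise can be sketched as below. This is purely illustrative: the step names are placeholders rather than real commands, and `run_step` is a hypothetical hook that would wrap whatever actually stops the database, unmounts the file system, and halts the machine.

```python
from typing import Callable, List

# Placeholder step names in the order described above. The database is
# stopped and its file system unmounted first, so the data is safe even
# if the final shutdown step hangs.
SHUTDOWN_STEPS = [
    "stop database",
    "unmount database filesystem",
    "shut down machine",
]


def handle_power_event(on_battery: bool, run_step: Callable[[str], None]) -> List[str]:
    """Run the shutdown steps in order when on battery; return the steps run."""
    if not on_battery:
        # On mains power there is nothing to do.
        return []
    completed = []
    for step in SHUTDOWN_STEPS:
        run_step(step)
        completed.append(step)
    return completed
```

Because nothing else depends on the replica, a spurious on-battery signal only costs us a restart and a replica catch-up.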

see comments

2 Oct 2012, 23:18:47 UTC
Hello again. Today was the usual outage day, but we got a *lot* done, so I figured I'd report on a bit of it.

Everything in the server closet is now on the new Foundry X448 switch. Of course this is all internal traffic - the workunits/results are still going over our Hurricane Electric network. Still, it's a major improvement in quality and may actually grease several wheels. In fact, we may use it to replace the HE router as well at some point.

The download servers have been trading off for a bit - we are now currently settled on using vader and georgem as the download server pair. As well, I just moved from apache to nginx on those servers. I think it's working well, but if any of you notice weird behavior let me know!

Otherwise, Jeff and Eric worked pretty hard today to align the beta and public projects - for the first time in a while (years?) their database configurations match, which will make the immediate future of development a lot easier (we've been dealing with having several code sandboxes and so forth for a while).

In less great news, carolyn (the mysql server) crashed for no known reason. Probably a linux hiccup of some sort, which is common for us these days. The very silver lining is that it crashed right after the backup finished, and in such a manner that didn't cause any corruption or even get the replica server in a funny state. It's as if nothing happened, really.

However one sudden crisis at the end of the day today: the air conditioning in the building seems to have gone kaput. Our server closet is just fine (phew!) but we do have several servers not in the closet and they are burning up. We are shutting a few of the less necessary ones off for the evening. Hopefully the a/c will be fixed before too long.

- Matt

see comments

18 Sep 2012, 21:12:27 UTC
Sorry I was away for a while there and then once again fell out of the habit of making regular tech news reports. Then again it's a sign of some stability here that I haven't had that much to say.

Lately I've been largely working on this random noise data file (10ja10zz). We recently encountered more issues with how results are being reported - nothing we can't fix - but in order to recalibrate everything we felt the need to see what would happen when straight up random noise enters the system.

So I created this bogus file and it already passed through the system a couple times using the standard splitter and the new polyphase filter bank splitter (which creates workunits with a flatter frequency response). We have several more tests to do, so expect another pass or two (or ten) of this file.

A word about the name "10ja10zz" - as many of you know the naming convention for these tapes is DDMMYYNN where DD is the day of the month, MM is a two character abbreviation of the month, YY is the two digit year, and NN is the sequence "number" for that day, i.e. we start with "aa" then "ab" then "ac" etc. Usually we never get past something like "aj." I wanted to create a bogus name that fit in with this format. To make it easier I wanted the day-of-month and year to match, and "01" would make sense but "01" isn't really a valid value for multibeam format files (which this is, and multibeam started in '06) so I just flipped it and made it "10" for both. I picked "ja" for January as that seemed easy enough, and then "zz" as that's the last possible sequence "number" and highly unlikely. It was only after the fact that somebody pointed out to me that the "ja" in January and the bogus "zz" spell "jazz." So we've been since calling this the "jazz" file.
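To make the DDMMYYNN format concrete, here is a small parser sketch. It only splits the name into fields and turns the two-letter sequence into a zero-based index ("aa" is 0, "ab" is 1, and so on). The class and function names are invented for the example, and the month code is left as-is because only "ja" is spelled out above.

```python
import re
from typing import NamedTuple


class TapeName(NamedTuple):
    day: int             # DD, day of the month
    month: str           # MM, two-letter month abbreviation (kept as text)
    year: int            # YY, two-digit year
    sequence: str        # NN, two-letter sequence "number"
    sequence_index: int  # zero-based position of NN: "aa" -> 0, "zz" -> 675


_TAPE_PATTERN = re.compile(r"^(\d{2})([a-z]{2})(\d{2})([a-z]{2})$")


def parse_tape_name(name: str) -> TapeName:
    """Split a DDMMYYNN name into its fields, or raise ValueError."""
    match = _TAPE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a DDMMYYNN name: {name!r}")
    day, month, year, sequence = match.groups()
    # Treat the two letters as a base-26 number with 'a' as zero.
    index = (ord(sequence[0]) - ord("a")) * 26 + (ord(sequence[1]) - ord("a"))
    return TapeName(int(day), month, int(year), sequence, index)


jazz = parse_tape_name("10ja10zz")
```

With 26 letters in each position there are 676 possible sequence values per day, which is why "zz" (index 675) is about as unlikely as it gets when real days rarely pass "aj" (index 9).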

Meanwhile we have gotten a new switch in the closet - a nice Foundry X448 - once again donated by the GPU User's Group. We've been slowly physically moving things around to make room for this switch (in a logical place in the racks) and today I got a few servers plugged into it, including the web server. That means these very bits you are reading right now went through that switch.

- Matt

see comments

26 Jul 2012, 20:54:37 UTC
A quick update. I'm the only one here in the lab this whole week, so I've been busy dealing with chores more than anything (though I did end up with some time to clean up a couple coding projects).

After the regular outage we had a bit of a network freakout caused by our science backup again. I guess this is what happens when you speed up disk i/o to the point where reading from it is so fast that writing backups over the network without throttle causes NFS to barf. Oops. I thought we got over this, but apparently not. Sorry about that. This actually mostly affected the mysql database server carolyn even though the backup was happening from science database server paddym. In any case, outside of a temporary outage and some minor cleanup there really wasn't any harm. Oh yeah I guess the mysql replica server on oscar got confused during all that so it'll be offline until I can resync it during the next weekly outage on Tuesday.
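For what "throttle" means here, a minimal sketch of a rate-capped copy follows. It is not the script we actually run: the function name and parameters are invented, and the clock and sleep functions are passed in only so the pacing logic can be exercised without waiting in real time.

```python
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=64 * 1024,
                   sleep=time.sleep, clock=time.monotonic):
    """Copy src to dst, pausing so the average rate stays at or below the cap.

    `src` and `dst` are binary file-like objects. Returns bytes copied.
    """
    start = clock()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Earliest moment at which `copied` bytes are allowed to have been
        # written without exceeding the cap; sleep if we are ahead of it.
        earliest = start + copied / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return copied
```

The idea is simply that the reader is never allowed to get ahead of what the network and NFS on the other end can comfortably absorb.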

The science database is actually bloated temporarily as one thing I'm working on this week is finally merging fractured tables. Over the years we hit various logical limits in our larger tables (workunit, gaussian, triplet) and had to split these tables into smaller pieces. Now with the power and disk space of paddym we can finally merge these tables back into one again. So while this process is happening in the background there are redundant versions of signals in multiple tables. Fair enough. We'll drop the fractured tables eventually.

I'm also working on getting a backup web server at the ready, namely jocelyn (the former mysql replica doing nothing now that oscar is the mysql replica). I'm not sure if high loads on the current web server are local, but in any case we have a mirror which we may employ at some point.

My tech news updates will become more staggered than usual as I'll be on the road playing rock star for 11-12 weeks before the end of the year. More info (if you are so inclined to care) is in a staff blog thread over here.

- Matt

see comments

19 Jul 2012, 18:30:56 UTC
Well, the database shuffles continue. We decided that, hey, oscar is no longer the master science database server, and has the same configuration as our master mysql database server (carolyn) so it could easily take over the replica mysql duties from jocelyn (which had been failing at keeping up over the past few weeks). We think jocelyn finally reached its limit on this front, and thus oscar is now the replica mysql database, and jocelyn may be retired soon. Well, jocelyn may be an ample compute server but honestly oscar, even after taking on all the mysql duties, still has more memory and cpu cycles left over than jocelyn has doing nothing. It has 28GB of memory, which is a lot for being a compute server, but not a database. Plus jocelyn's storage is an external fibre channel jbod using software RAID. We can pretty much remove 4U of gear from the closet today if we wanted to. Or jocelyn will become another web server. Many options. It's great to see these server upgrades finally moving forward after certain dams were broken.

Also for the record jocelyn had been doing such a non-perfect job of keeping up in general over the past year that we have been aiming all queries (except maybe a scant one or two statistical queries per day) at carolyn. In other words, jocelyn was mostly just an up-to-date (or close-to-up-to-date) live backup of carolyn for a while. This may change, now that oscar's "seconds behind master" has pretty much been pegged at zero since it started up. Not that carolyn needs any extra help.

Oh yeah - thanks for spotting the "robots.txt" issue in the last thread. I added "/sah/" to the disallow list. We have been hit pretty hard by spiders lately and the /sah/ portion of the URLs were getting lost in the log noise - anyway we'll see if adding that line helps.
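If you want to check what a rule like that does, Python's standard urllib.robotparser can evaluate it. The robots.txt fragment below is a guess at the relevant part only (the real file has other entries not shown here), and the example.org URLs and the helper name are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical fragment: just the rule discussed above, applied to all robots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /sah/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if a well-behaved robot may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(ROBOTS_TXT, "http://example.org/sah/some_page.php")
allowed = is_allowed(ROBOTS_TXT, "http://example.org/forum_index.php")
```

Of course this only helps with spiders that honor robots.txt in the first place.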

- Matt

see comments

12 Jul 2012, 20:13:42 UTC
There has been all kinds of slow shuffling behind the scenes lately, but the bottom line is: paddym is now our new master science database server, having taken over all duties from oscar! The final switchover process was over the past few days (hence some minor workunit shortages) and had the usual expected unexpected issues slowing us down yesterday (a comment in a config file that actually wasn't acting like a comment, and some nfs issues).

What we gain using paddym is a faster system in general, with more disk spindles (which enhances read/write i/o), a much faster (and more usable) hardware RAID configuration, and most importantly a LOT more disk space to play with. We have several database tables that are actually fragmented over several tables - now we have the extra room to merge these tables together again (something that several database cleaning projects have been waiting on for months). And, the extra disk i/o seems to help - a full database backup yesterday took about 7 hours. On oscar it usually took about 40.

So that's all good news, and thanks again to the GPU User's Group gang who helped us acquire this much needed gear! And lest we forget as an added bonus we now have oscar up for grabs in our server closet - it will become a wonderful compute server, among other things.

Meanwhile our mysql replica database on jocelyn has been falling behind too much lately. It's swapping, so I've been adjusting various memory configuration variables and trying to tune it up. I'm thinking this is becoming a new issue as, unlike the result and workunit tables which are constantly churning and roughly staying the same size, the user and host tables slowly grow without bounds. Maybe we're starting to see the useful portions of the database not fitting into memory on jocelyn anymore...

- Matt

see comments

20 Jun 2012, 21:53:23 UTC
Some news. Yesterday we had our usual weekly outage, and shortly after the floodgates opened again bruno (the upload server) crashed. Except we quickly found it didn't actually crash. It was turned off. By the web-enabled power strip. For no apparent reason. We turned it back on and everything was okay, but now it seems like we have a flaky web-enabled power strip on our hands. It is interesting to note that this power strip was plugged into the same breaker as thinman - the previous webserver system that died during the last unexpected power issues. So maybe some funky voltage clobbered this strip as well. Well, we have a spare one which works so no big shakes there. And yes, we ruled out foul play.

As for the crashy desktop machines, I may have fixed one. The theory being, oddly enough, too much thermal grease was employed thus reducing the effectiveness of the heat sink. Oops. Well, I'm not quite convinced that was the problem, and we're burning it in now. If it survives a week without crashing, great. The other system is not doing as well. I think we're aiming to get insurance money from the university to cover the cost of these systems killed or injured during these outages. Meanwhile, we're operational, so no real disaster.

In better news, georgem is now not only hosting all the workunits and running some backend BOINC services and scientific analysis processes, but it's also hosting all the data (~13TB) from a recent survey of the galactic center collected at Green Bank Telescope. Several grad students will be processing this data on georgem itself.

Also paddym has been cleared to finally reformat all its drives into a giant RAID10, and we can now start the process of duplicating the whole SETI@home informix science database on oscar over there. As well it's already actually serving a mysql database containing Kepler data, also collected at the GBT, which we're soon to use old SERENDIP code to analyze in-house.

Oh yeah we also found a bug that had been causing a lot of Astropulse splitters to fail, thus reducing the amount of AP workunits being sent out. This has been fixed, and so expect more AP work.

- Matt

see comments

11 Jun 2012, 22:17:30 UTC
Kind of a bumpy weekend. So we moved that database (which handles the website) from Dan's new but oddly crashy desktop to my new desktop. Then over the weekend MY new desktop started crashing at random. You'd think this is now clearly related to the database, but Dan's desktop continued to crash after moving the mysql database off of it. And upon further inspection both systems sometimes crash before the OS is even loaded.

So this looks like a hardware problem after all. Funny how both of these new systems are failing in the same manner. We think it has to do with the power outages from a couple weeks ago sending some jolts into these perhaps more sensitive systems.

But speaking of outages, completely separate from those previous power issues which have since been fixed, there was a brand new problem affecting just this building (and all the projects within it, including SETI@home/BOINC). This one was worse, starting in the middle of the night, and by the time anybody could do anything power was up and down several times, and some outlets delivering half power, etc.

The repairs were much faster, and we were stable again around noon, but upon turning everything back on we found we completely lost thinman, the main web server. Totally dead. However, quite luckily, we happened to have a spare old frankenstein machine kicking around, and I was able to do a "brain transplant" i.e. swap the drives from thinman to this other machine. Now this other machine thinks it is thinman and is working quite well as a web server. Dodged a major bullet there.

I also happened to have my old desktop nearby, so I'm using that as I diagnose the new crashy one. Not sure who is responsible for all these damages and lost time, but it definitely shouldn't be us.

- Matt

see comments

7 Jun 2012, 21:24:31 UTC
Hello again. So it seems we have the lab-wide (actually hill-wide) power issues behind us. The bottom line is the short circuit (in a major underground line that brings power to several buildings) was found, and fixed, and we are back in action. That sure ate up a lot of our time. This is also proposal season, so I've been lost in some paperwork as well.

Some minor issues also caused some bumps in the road. Dan's new-ish desktop has been crashing at random. I'm still trying to diagnose that. This would normally be no problem except as it happened I was keeping a database on that system which helped serve the web site. So for a couple days that site was getting all messed up until I finally moved that database elsewhere. The machine is still crashing, I have no idea why (it isn't temperature - I'm guessing it's a software issue but not sure what). At least the web site is stable.

Also had one, maybe two, disk failures in synergy. Not that big a deal since the RAID protected the data thus far but I'll need to procure some disk replacements soon for that. They're 1TB SAS drives, so we don't have any spares kicking around (unlike SATA drives, of which we have plenty at the moment).

Outside of that I've been helping Andrew sort out and archive 13TB of data he recently collected at Green Bank, while using paddym as temporary storage for that. Once we're done with that I can then reconfigure paddym and we can start making it a master science database!

- Matt

see comments

29 May 2012, 20:24:57 UTC
Hello all - here's a quick message to inform you that yes, we are coming out of the usual Tuesday maintenance outage at the moment, but in less than two hours I'll be bringing everything down again for a planned lab-wide power outage to make repairs after the short circuit that clobbered several buildings two weeks ago. This planned outage will last roughly two days. I'll try to bring a few things up here and there (like the web site, just to inform people what's going on) on generator power if possible, but simply expect everything to be off and unreachable for about 48 hours starting about 90 minutes from the posting of this message.

Also I'll note that yes, bruno (the upload server) had a garden variety kernel choke on Saturday morning. I was able to kick it with Dan's help (he ultimately came up to the lab himself to power cycle the machine). Usual drill around here.

- Matt

see comments

22 May 2012, 22:45:02 UTC
During the normal weekly outage last week I took the opportunity to convert georgem not only into the workunit storage server, but a single workunit download server (as opposed to using vader and anakin, which are mounting georgem's disks over the network). This was a bust. I believe I had apache cranked way too high and the kernel crashed. Before it completely went down for the count there were some NFS inconsistencies causing corrupt workunits to be generated on georgem, which only happened for a short time and we didn't notice until they were already sent out.

In any case the crash definitely seemed like an OS/software problem and not due to struggling hardware. Nevertheless I felt pretty heroic about being able to completely stop everything and revert back to using vader and anakin as download servers before I left the lab for the day. But that heroism got lost...

Because that night (Tuesday, a week ago) the lab had a sudden, unexpected major power outage. In fact, all the buildings that make up the Space Lab went dark, as well as the nearby Math Sciences Research Institute and the Lawrence Hall of Science down the hill. Of course lots of our systems went down in an instant, others after the UPS batteries drained, and none of it graceful. Even worse: an hour or two after the outage power came back up for only a split second, jolting everything before we had the chance to reach the lab and unplug everything.

Without any known cause there wasn't much we could do. Jeff did come up early the next day and unplugged everything to prevent further power surges. I came up the following day to check in on progress, clean things up, etc. but as I left the campus electricians were still popping down every manhole and doing laborious tests to find the short, and it seemed like we wouldn't be back up until Monday.

But luckily they soon found the short, and it was in a part of the loop with a spare cable in the same conduit which made replacement far easier. Power came on and stabilized early Friday morning. Jeff, Eric, and I all worked together to power everything back up safely and start the projects. We were very lucky: thus far it seems like we escaped with no hardware damage, nor any data corruption. Some RAID sets had to resync - no big deal. Phew.

- Matt

see comments

8 May 2012, 21:24:15 UTC
Being a Tuesday, we had our weekly outage (database maintenance, backups, etc.). When we came back up today did you notice anything different? Hopefully not. But I did take one important step today. As of right now, workunits are being stored on (and therefore served off of) georgem's disks. So far so good. If all goes well, I'll make georgem the one and only download server (currently vader and anakin are still handling that task in tandem).

We have been doing some testing with a new splitter, hence why a relatively small number of recently sent workunits are 2.8MB in size. Oops. These should behave normally, and yield normal results, but will take 8 times longer to process.

I also got a new RAID card (once again from the generous folks from the GPU User's Group) to put in paddym. We're still waiting on drives currently in use holding data taken at Green Bank to return to us (any day now) which we'll then put into paddym and start attempting to make it a science database server. One step at a time though.

- Matt

see comments

30 Apr 2012, 22:53:46 UTC
End of the month update. I've actually been gone for most of it, but there haven't been too many noticeable problems or major issues, right?

Well, this weekend we had yet another signal table run out of extents. I went through the usual grind this morning to create new database spaces and got things rolling again by noon (local time). You may have noticed a dearth of work overnight, but we're back to full production now.

The newer servers (georgem and paddym) continue to get configured, assembled, and put into action. It's still unclear, but becoming ever more likely, that paddym will become the new science database server, and oscar will then become the compute server paddym was originally intended for.

By the way, one drive failed on carolyn over the weekend and the hardware RAID gracefully handled it without user intervention. Yay! A RAID configuration that actually did what it's supposed to!

Otherwise, things have been fairly light in server-land. I'm mostly working on data cleaning, analysis code, and other non-server development lately hence not much to report.

I should mention the GPU User's Group still continues to spoil us - here's a pic of my desktop at the moment, complete with new 24" monitor and ergonomic keyboard:

- Matt

see comments

3 Apr 2012, 22:12:45 UTC
Today's regular outage went pretty quickly and smoothly. All the databases are fairly happy at the moment, and therefore maintenance was minimal.

During the outage we finally got the other recently donated server, paddym, into the closet. Here's a picture:

Here's the current inventory starting from the top of the left rack: a bunch of network switches (with the small CASPER server lost somewhere in there), oscar and carolyn (the two HP servers donated last year mounted next to each other on a sliding shelf), paddym, synergy, bruno (with all the blue lights), and thumper.

From the top of the right rack: anakin, georgem, the KVM for the closet, the 45-drive JBOD, one of Eric's hydrogen survey servers, and the whole gowron complex (the Snap Appliance and external drive arrays).

Not shown: various UPS's on the bottoms of these racks, and the rightmost rack in the closet, which contains most of the other servers commonly mentioned here (except for a few which still hang out in our satellite lab in room 329).

In the meantime Jeff and I have been mostly working on software. I actually got old SERENDIP IV RFI rejection code, which I haven't touched in about 12 years, to start reading data from a mysql database (instead of from flat files). This plumbing will come in handy when working on new data being collected at Green Bank. Jeff is continuing to optimize the NTPCkrs. We actually stumbled upon a major potential path of improvement yesterday. We shall see.

But speaking of science analysis, we also decided recently that the next priority is some major spring cleaning of our science data. We've been managing through the years, but there have been many events that caused the data to be non-standard. Like when we discovered some subset of our data was accidentally precessed twice, or had the frequencies reversed. These data aren't corrupted, by the way - the broken fields can be recalculated. We also have double entries of signals which may skew statistics. Sometimes the tables aren't fully accessible at once. Like the few times we ran out of extents in one table, and therefore split it into two, but never got around to merging it back into one.

We've been getting by with one software hack after another, but enough is enough. The next step is to tackle all these old problems and make the database one whole cohesive data set again. This shouldn't take too long, especially now that we have both paddym and georgem (and all the associated drives) to help out. Plus we can do most, if not all of this, in parallel with the normal daily public project operations and data analysis R&D. It's just a large set of nagging problems we'd like to get behind us already, and now we have the resources to do so.

Oh yeah I should also point out, on top of gathering funds and purchasing georgem and paddym, the GPU Users Group also came through by getting us a couple new spiffy, fast, and wonderfully quiet desktop machines to replace our current noisy/flakey ones that have been dropping like flies. Here's one of them in action (and yes it is actually hooked up to a perfectly good 19" Sun monitor!):

That's about it for technical news, though I should mention I'm revving up to head out again and go play rock star for a couple weeks. I'll be quickly passing through Argentina, Chile, and Brazil this time around. See you back here in a few.

- Matt

see comments

27 Mar 2012, 22:49:20 UTC
Another outage day (for database backups, maintenance, etc.). Today we also tackled a couple extra things.

First, I did a download test to answer the question: "given our current hardware and software setup, if we had a 1Gbits/sec link available to us (as opposed to currently being choked at 100Mbits/sec) how fast could we actually push bits out?" Well, the answer is: roughly peaking at 450 Mbits/sec, where the next chokepoint is our workunit file server. Not bad. This datum will help when making arguments to the right people about what we hope to gain from network improvements around here. Of course, we'd still average about 100Mbits/sec (like we do now) but we'd drop far fewer connections, and everything would be faster/happier.
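For a rough sense of what those two numbers mean, here is a back-of-the-envelope sketch. It is an illustration only, not anything run on the project servers: the function name and the 1 GB payload are made up, and it ignores protocol overhead and contention entirely.

```python
def transfer_seconds(size_bytes, link_mbits_per_sec):
    """Ideal time to push size_bytes over a link, ignoring all overhead."""
    bits = size_bytes * 8
    return bits / (link_mbits_per_sec * 1_000_000)


CURRENT_CAP_MBITS = 100    # the current choke point mentioned above
MEASURED_PEAK_MBITS = 450  # the peak seen in the download test

# How much burst headroom the faster link would provide.
headroom = MEASURED_PEAK_MBITS / CURRENT_CAP_MBITS

if __name__ == "__main__":
    payload = 1_000_000_000  # 1 GB, an arbitrary illustrative size
    print(f"burst headroom: {headroom:.1f}x")
    print(f"at 100 Mbit/s: {transfer_seconds(payload, CURRENT_CAP_MBITS):.0f} s")
    print(f"at 450 Mbit/s: {transfer_seconds(payload, MEASURED_PEAK_MBITS):.1f} s")
```

The point is less the average rate than the bursts: with the same average demand, a link that can peak several times higher clears each burst sooner, so fewer connections pile up and get dropped.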

Second, Jeff and I did some tests regarding our internal network. Turns out we're finding our few switches handling traffic in the server closet are being completely overloaded. This actually may be the source of several issues recently. However, we're still finding other mysterious chokepoints. Oy, all the hidden bottlenecks!

We also hoped to get the VGC-sensitive splitter on line (see previous note) but the recent compile got munged somehow so we had to revert to the previous one as I brought the projects back up this afternoon. Oh well. We'll get it on line soon.

We did get beyond all the early drive failures on the new JBOD and now have a full set of 24 working drives on the front of it, all hooked up to georgem, RAIDed up and tested. Below is a picture of them in the rack in the closet (georgem just above the monitor, the JBOD just below). The other new server paddym is still on the lab table pending certain plans and me finding time to get an OS on it.

Oh yeah I also updated the server list at the bottom of the server status page.

- Matt

see comments

22 Mar 2012, 19:54:22 UTC
Since my last missive we had the usual string of minor bumps in the road. A couple garden variety server crashes, mainly. I sometimes wonder how big corporations manage to have such great uptime when we keep hitting fundamental flaws with linux getting locked up under heavy load. I think the answers are they (a) have massive redundancy (whereas we generally have very little mostly due to lack of physical space and available power), (b) have far more manpower (on 24 hour call) to kick machines immediately when they lock up, and (c) are under-utilizing servers (whereas we generally tend to push everything to their limits until they break).

Meanwhile, we've been slowly employing the new servers, georgem and paddym (and a 45-drive JBOD), donated via the great help of the GPU Users Group. I have georgem in the closet hooked up to half the JBOD. One snag: of the 24 drives meant for georgem, 5 failed immediately. This is quite high, but given the recent world-wide drive shortage quality control may have taken a hit. Not sure if others are seeing the same. So we're not building a RAID on it just yet - when we get replacement drives it'll soon become the new download server (with workunit storage directly attached) and maybe upload server (with results also directly attached). Not a pressing need, but the sooner we can retire bruno/vader/anakin the better.

I'm going to get an OS on paddym shortly. It was meant to be a compute server, but may take over science database server duties. You see we were assuming that oscar, our current science database, could attach to the other half of the JBOD thus adding more spindles and therefore much needed disk i/o to the mix. Our assumptions were wrong - despite having a generic external SATA port on the back it seems that the HP RAID card in the system can only attach to an HP JBOD enclosure, not just any enclosure. Maybe there's a way around that. Not sure yet. Nor are there any free slots to add a 3ware card. Anyway, one option is to just put a 3ware card in paddym and move the science database to that system (which does have more memory and more/faster CPUs). But migration would take a month. Long story short, lots of testing/brainstorming going on to determine the path of action.

Other progress: we finally launched the new splitters which are sensitive to VGC values and thus skip (most) noisy data blocks instead of splitting them into workunits that will return quickly and clog up our pipeline. Yay! However there were unexpected results last night: turns out it's actually slower to parse such a noisy data file and skip bad blocks than to just split everything, so splitters were getting stuck on these files and not generating work. Oops. We ran out of multi-beam work last night due to this, and I reverted back this morning just to get the plumbing working again. I'm going to change the logic to be a little more aggressive and thus speed up skipping through noisy files, and implement that next week.
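The splitter source isn't shown here, so the following is only a guess at what "more aggressive" skipping could look like. It assumes one VGC reading per data block and that readings below some cutoff mean a noisy block; the function name, the data layout, the threshold, and the stride are all hypothetical.

```python
def blocks_to_split(vgc_values, threshold, skip_stride=1):
    """Return the indices of blocks worth splitting into workunits.

    vgc_values  -- one VGC reading per data block (hypothetical layout)
    threshold   -- readings below this are treated as noisy (assumed cutoff)
    skip_stride -- how far to jump after hitting a noisy block;
                   1 inspects every block, larger values skip faster
    """
    keep = []
    i = 0
    while i < len(vgc_values):
        if vgc_values[i] >= threshold:
            keep.append(i)
            i += 1
        else:
            # Noisy block: jump ahead rather than inspecting each neighbor.
            i += skip_stride
    return keep


if __name__ == "__main__":
    readings = [5, 5, 1, 1, 1, 1, 5, 5]
    for stride in (1, 3, 4):
        print(stride, blocks_to_split(readings, threshold=3, skip_stride=stride))
```

In this toy run a stride of 4 finds the same good blocks as checking every block, while a stride of 3 happens to jump straight past the clean blocks at the end. That is the tradeoff with skipping harder: faster passes through noisy files, at the risk of discarding some usable data.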

I'm also working on old SERENDIP code in order to bring it more up to date (i.e. make it read from a mysql database instead of flat files). I actually got the whole suite compiled again for the first time in a decade. Soon chunks of SERENDIP can be used to parse data currently being collected at Green Bank and help remove RFI.

- Matt

see comments

8 Mar 2012, 23:29:39 UTC
The good news: The two new servers arrived (bought by donation made to, and assembled and shipped by, the GPU Users Group)! Here they are unpacked on the table in the center of our lab, along with the 45-disk JBOD (also donated by the GPUUG).

From left to right, that's the JBOD, georgem (Supermicro box), and paddym (Intel box). I'll have better pix when we actually start playing with this stuff. These will go a LONG way towards upgrading (and retiring) a lot of the older systems in the closet. We are excited, to say the least.

In less good news, it pretty much seems that bane (the former scheduling server) is toast. We hoped to revive it and use it to replace a bigger/older internal admin machine, but no dice. Fine. Meanwhile people who diligently scan our network graphs may have noticed how "grassy" they are (as opposed to flat) due to bursty activity. The obvious first suspect was synergy, now loaded with the extra burden of the scheduling server. Wrong. The next suspect was carolyn, the mysql server, as it was getting a little extra I/O this week due to a science database backup being stored on its internal drives. Nope. We ultimately found what we think is the cause: turns out upon reboot from the power outage on Monday bruno (the upload server) started up an automatic RAID verify, which is slowing uploads down. This verify should end sometime tonight, and things are already seeming to flatten out.

Also... I've been wasting way too much time today getting a new desktop for Dan in order (as his died on Monday as well). Luckily Jeff had an old shuttle PC he donated from home kicking around. However it's been a hilarious comedy of errors. The first two drives I put in it failed during OS install. The third drive worked great, but I installed an older version of Fedora to save Dan from having to deal with (the atrocity which is) Gnome 3. Well, while configuring that OS I was stumped why I couldn't upgrade any of the security packages. Turns out that version of the OS was already end-of-lifed. Aaaah! Well, I'm installing the latest version of the OS now and Dan will have to just deal with learning the Gnome 3 ropes. The irony of course is that, due to obvious priorities (because Dan can't work) I've been spending most of my day fighting with a very old desktop PC, while three shiny new boxes on the table behind me go untouched. So be it.

- Matt

see comments

6 Mar 2012, 22:39:37 UTC
Yesterday (Monday) there was an emergency generator test which affected the whole lab. Even though this test was mostly for the benefit of another project here at SSL, this still meant we had to power everything down, wait for the test to complete, then power everything back up. For the most part it went okay, but we had a few casualties on the way back up. A small subset of outlets on the back of one of our UPSes failed (a broken internal breaker?) - not a big deal. Dan's cheap desktop system also mysteriously died, and won't power on anymore. That's a worse problem, but not a showstopper. However our scheduling server, bane, failed to boot. This seemed like an OS install problem, even though we had successfully rebooted it before after upgrading it the other day.

Luckily we have synergy in our racks, and it took me less than 10 minutes to configure it as a replacement scheduling server. But before I take any pride in that feat I admit that some internal server errors were getting lost in the noise upon bringing the projects back on line. Turns out a max request size setting for mod_fcgid was high by default in the older OS on bane, but not as high by default in the newer OS on synergy, so we needed to set that explicitly by hand. Fair enough, but all evening a set of crunchers were finding it impossible to connect and get work. I fixed that this morning before the standard weekly outage.
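A changed default like this is easy to miss because nothing in the config file itself changes. Below is a small sketch of a check for it, assuming the setting in question is mod_fcgid's FcgidMaxRequestLen directive (the post only says "a max request size setting"). The function name and the sample config values are illustrative, not taken from the real servers.

```python
def explicit_max_request_len(conf_text):
    """Return the last FcgidMaxRequestLen value set in conf_text, or None.

    None means the server is relying on whatever default its mod_fcgid
    version ships with, which is the situation described above.
    """
    value = None
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split()
        # Apache directive names are case-insensitive.
        if len(parts) == 2 and parts[0].lower() == "fcgidmaxrequestlen":
            value = int(parts[1])
    return value


if __name__ == "__main__":
    sample = "# FcgidMaxRequestLen 1\nFcgidMaxRequestLen 1073741824\n"
    print(explicit_max_request_len(sample))
    print(explicit_max_request_len("LoadModule fcgid_module modules/mod_fcgid.so"))
```

Running something like this over the config before moving a service to a newer OS would flag any value being inherited from defaults rather than set by hand.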

Also it should be noted a bookkeeping cronjob running on bane (now missing with bane out of commission) caused the splitters to run out of work overnight as well. This also was fixed this morning. We should be more or less back to normal after we catch up for a bit. Sorry about all the workflow hiccups.

Meanwhile, what's up with bane?! I spent half the day today installing and reinstalling the OS, thinking I'm getting on top of the problem each time, but nope. Seems like the Fedora 16 installer has some issues in general, compared to previous versions. Yeah, I know, we should be using <insert your favorite Linux distro here>. I'll keep kicking it - though we'll probably keep the scheduler on synergy, and hopefully use bane to replace a much larger, less efficient administrative system in the closet.

New server-wise, I just checked the tracking info. Looks like they will arrive on Thursday.

- Matt

see comments

1 Mar 2012, 21:26:17 UTC
End of the week wrapup. I'm still working on the workunit table cleanup, but we're in the paranoid-testing-before-we-drop-the-old-table phase. So far, so good.

As for Astropulse, the splitters are off, and will remain off for at least the weekend, for a couple reasons. First, we made some global changes to the science database schema, and thus the db library code, which affect both multibeam and Astropulse. So we still need to recompile the Astropulse splitter to accommodate these changes (and it cannot be run until we do). Second, from what I understand we are close to releasing another Astropulse client, which will also require some splitter-related tweaking. Both these things are waiting on Eric, and he's out of the lab until Monday.

However, somewhat conveniently, we had a RAID drive fail on the Astropulse database server this week, so it's been quite nice and easy to replace this drive and rebuild the RAID while everything is quiescent. So there's that silver lining.

In case nobody noticed I had to mess around with jocelyn (the mysql replica server) today. Its root filesystem filled up as the qlogic card started cluttering the logs with dozens of useless messages a second. I upgraded several packages and the kernel and rebooted the system and that seems to have calmed it down.

The two new servers (paid for by donations to the GPU User's Group) have been assembled and are soon to be en route. We'll start playing with those hopefully by early next week! These will really go a long way towards improving the performance per rack unit of our server closet!

And to echo what I already posted on the front page: The entire lab is undergoing some electrical power tests on the morning of Monday, March 5th. All SETI web sites and servers will be unreachable for 2 hours (from 8am to 10am, Pacific Time).

- Matt

see comments

28 Feb 2012, 23:36:32 UTC
Over the past few days (starting around Friday) we had continuing fallout with the science database repairs made over the previous weeks. Nothing we couldn't diagnose quickly or handle, but there were patches of low workunit availability. Long story short, after rebuilding our workunit table we hit some index corruption issues that didn't rear their head until we suddenly stumbled upon them.

Today we dropped all the indexes and rebuilt them from scratch during the usual weekly outage. So far so good. I also used the outage this week to upgrade the OS on the scheduling server (which was fairly out of date).

We also brought in some freshly compiled splitters which contain new database plumbing - a step toward making the splitters themselves more sensitive to telescope status when the data were recorded. This code is currently dormant but after some testing and calibration will ultimately keep us from creating and sending out large numbers of "noisy" workunits.

This plumbing however hasn't been compiled yet into the Astropulse splitters, which is why they shall remain off for now.

- Matt

see comments

22 Feb 2012, 21:34:44 UTC
So... another week another minor server crisis. This one was brewing for a while - we've been getting memory errors/upsets on our main internal file server (which hosts, among other things, all the files that make up the SETI@home web site). We got replacement memory, and were hoping for a quiescent moment to swap it out, but after two crashes in one day (on Tuesday) I just went ahead and did the swap.

So far so good (i.e. no further crashes), except we're still getting memory upsets in the server log. I only replaced 2 of the faulty DIMMs (which were noted as faulty by the motherboard), but maybe others need replacing as well.

In the meantime I found that project recovery today was significantly slowed by the result web pages on our site, so those are turned off at the moment (as I'm writing this).

Meanwhile other tasks this week included cleaning up the lab (the fire marshal is visiting today) and resurrecting SERENDIP code I haven't touched in over a decade. I got it to compile, now I'm just removing the non-fatal compiler warnings one by one. We'll use this code to help process Kepler data (which happens to be in a similar format to our old SERENDIP data). Maybe I'll even get back to analyzing the SERENDIP IV data set (also over a decade old and it may be worth taking another look at it with this code).

- Matt

see comments

16 Feb 2012, 21:04:27 UTC
Hello gang. I'm back from the latest bout of alternative career maintenance. Seems like I didn't miss too much, and unlike normal the server problems waited until *after* I returned. My next disappearance (only about 10 days) will be in mid-April (touring in Argentina, Chile, and Brazil).

Before the usual Tuesday server outage Jeff noticed the splitters having trouble inserting new work into the science database. After some detective work and tests we found we hit one of several possible informix logical limits: we ran out of extents in the workunit table.

Not a big deal, and we hit this limit with other tables several times before. But the fix is a bit of a hassle. Basically you have to recreate a whole new table from scratch with more extents and repopulate it with all the data from the "full" table. We have a billion workunits in that table, so to speed this process up we only moved over workunits 90 days old (or newer) before turning the projects on again. We only need 90 days of recent workunits around for the assimilators to work, but to get the NTPCkrs rolling again we need to repopulate the whole thing, which we'll do more casually.
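The rebuild-then-backfill approach can be sketched in a few lines. This uses Python's built-in sqlite3 purely as a stand-in (the real database is Informix, and extents are not modeled at all), and the table and column names are invented for the illustration.

```python
import sqlite3


def rebuild_recent_first(conn, cutoff_day):
    """Create the replacement table and copy only the recent rows into it."""
    conn.execute(
        "CREATE TABLE workunit_new (id INTEGER PRIMARY KEY, created_day INTEGER)"
    )
    # Step 1: move just the rows the live pipeline needs right away.
    conn.execute(
        "INSERT INTO workunit_new "
        "SELECT id, created_day FROM workunit WHERE created_day >= ?",
        (cutoff_day,),
    )
    conn.commit()


def backfill_older(conn, cutoff_day, batch_size=1000):
    """Step 2: copy the older rows over in small batches, at leisure."""
    while True:
        rows = conn.execute(
            "SELECT id, created_day FROM workunit "
            "WHERE created_day < ? "
            "AND id NOT IN (SELECT id FROM workunit_new) LIMIT ?",
            (cutoff_day, batch_size),
        ).fetchall()
        if not rows:
            break
        conn.executemany("INSERT INTO workunit_new VALUES (?, ?)", rows)
        conn.commit()
```

Splitting the copy this way keeps the downtime proportional to the recent slice rather than to the whole billion-row table; the batched backfill can then run alongside normal operations.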

Not sure if anybody noticed, but I got the "connecting client types" page working again (for the umpteenth time). Let's see how long before it breaks again for some inexplicable reason:

Okay. I'm sure there's lots more to report but I'm going back to beating down my e-mail spool.

- Matt

see comments

12 Jan 2012, 21:01:42 UTC
Hello people. I'm actually about to head out again shortly (once again for a whole month) so let me get y'all caught up before I disappear.

Let's see. There's been a lot of the usual hiccups over the past couple of weeks. Overloaded servers locking up and requiring a hard restart, drives failing and being replaced, bringing machines down on purpose to upgrade the OS, etc. No singular event was tragic or noteworthy, but the quantity of such events has been slightly higher than normal.

Meanwhile various projects have been pushing along. After enough analysis, database tweaking, and data dumping/reloading, we finally created some test "small signal tables" containing the top 1% signals on which to do our final analysis. Turns out doing the same on the 100% full (and constantly growing) tables was a performance disaster. Basically we're now determining what our i/o needs and parameters are with much smaller cases, and then going from there. Right now the signal tables entirely fit in memory, but part of this equation is adding more spindles to the science database array to improve disk i/o as well. This is where the GPU User's Group-donated JBOD comes in. More on that below.
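As an illustration of the "small table" idea, here is a minimal sketch of pulling out the top 1% of a set of signals. The post doesn't say how signals are ranked, so the "score" field is a hypothetical stand-in, as is the function name; the real selection happens in the database, not in Python.

```python
import heapq
import math


def top_fraction(signals, fraction=0.01, key=lambda s: s["score"]):
    """Return the highest-ranked `fraction` of signals (at least one if any)."""
    signals = list(signals)
    if not signals:
        return []
    n = max(1, math.ceil(len(signals) * fraction))
    # nlargest avoids sorting the full set when only a small slice is needed.
    return heapq.nlargest(n, signals, key=key)


if __name__ == "__main__":
    fake_signals = [{"id": i, "score": i} for i in range(1000)]
    small_table = top_fraction(fake_signals)
    print(len(small_table), small_table[0], small_table[-1])
```

Working on a slice this size means i/o parameters can be measured on something that fits comfortably in memory before scaling back up to the full tables.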

Another project I've been working on is to get the splitters (the programs that make workunits out of raw data) to become sensitive to VGC (voltage gain control) values available in the raw data headers so that we can avoid splitting areas with low VGC values (and therefore loud noise). In layman's terms: we're trying to set everything up to automatically reject noisy workunits before sending them out. We know one or two beams (out of fourteen) are sometimes flaky, and keeping those workunits out of the pipeline will help reduce network competition for downloads.

This should have been fairly straightforward, however during the course of testing we're finding more than one or two beams with various problems. More like 5 or 6. This may be for several different reasons, including bogus or misreported VGC values. This is on a front burner, with several parties involved here and at Arecibo.

Speaking of network competition - yes, we're aware that we are dropping all kinds of connections during uploads/downloads. This isn't because of our router (which was definitely the problem over the summer before we added RAM to it), but somewhere else further up the pipeline. Still figuring this out, but it's certainly load related.

Hardware wise, we took an archive server out of the closet to make way for the JBOD mentioned above. The archive server will move into our secondary lab down the hall (where other servers currently reside). We were going to install the JBOD on Tuesday but the hole in the rack made for it isn't big enough to let us mess with internal cabling. Given that we hope to hook this up to at least two separate servers, we'll likely need to mess with internal cabling. So we're going to try to do our best with that while the JBOD is still on the table in our lab.

Oh yeah.. our web server crashed due to overloading last Friday, likely due to an article Andrew published about recent Kepler analysis results. He didn't state clearly enough that these plots were radio frequency interference, and thus we got clobbered due to confused news reports that we found ET. The usual drill, basically. The text of the article was cleaned up. Eric suggested we put disclaimers on the top of every web page on our site that say, "everything we find is Radio Frequency Interference unless we specifically tell you otherwise."

Okay. I should wrap this up. As a parting gift here are a couple random recent photos:

Here's the new JBOD, as seen from behind, sitting on our lab table. There are only 21 (currently empty) drive bays back here, but there are 24 full ones on the front.

Here's the current state of our server closet across the hall. Note the hole in the middle rack - that's where the JBOD is going.

And for fun, last week I shot this photo of the entire Bay Area consumed in fog, while we at the lab (over 1000 feet above sea level) were enjoying lovely weather above said fog.

Wow those pictures are blurry. Well, it's from my iPhone 3GS. Not exactly state of the art.

So! I'm now officially on the road. I'll be playing with my band MoeTar in Whittier, California on Saturday (opening up for the Allan Holdsworth Band), then I drive up to Seattle to meet some of the guys in Secret Chiefs 3, and then we all drive in the tour van to Denver, where we meet the remaining guys (flying in from NYC and Sydney, Australia). We'll rehearse two days, then tour for a few weeks all over the western US (with one stop in Vancouver), co-headlining with the awesome band Dengue Fever. Should be fun!


- Matt

see comments

20 Dec 2011, 23:35:51 UTC
Hello, world. Here's another random, non-comprehensive status update regarding our servers, quite possibly the last one before the end of the year.

So what are we dealing with lately? Well, carolyn (the mysql server) seems to have some funky memory. Or maybe it's an overactive watchdog in the kernel. Hard to tell, but the warning messages we're getting aren't giving us the warm fuzzies. Operations are more or less normal, so we're just keeping an eye on it for the moment (don't really want to do any surgery before the holidays). Meanwhile, it did have a standard issue CPU lock last week, requiring a hard reset and database recovery. However annoying (and it seems the modern day linux kernels are getting more and more prone to this sort of misbehavior) it's so far easy enough to recover from after hard power cycling the machine. We have to do this quite often on bruno (the upload/compute server), and had to do it on oscar (the informix database server) the other day as well. Every time we eventually recover just fine.

Also mysql-wise, we seem to be having performance issues that defy easy understanding and explanation. Maybe this is memory related (hope not), but probably just due to some black-box mysql internal bookkeeping. During some testing/tweaking I turned off the daily stats dump scripts, and (oops) forgot to turn them back on. So there was a period of 5-6 days without stats dumps. Sorry about that.

Another thing we have to keep an eye on is server closet temperature. Seems like (without clear notification) we are already in "holiday energy curtailment" mode. With less people around, lab-wide environmental controls (which assist our server closet cooling) are ramped down to save energy. Makes sense, but that still means temperatures rise in our closet, which isn't happy-making. So far they only went up a degree or two on average. Just one more thing to worry about.

Onto brighter news. The gang over at the GPU Users Group has been incredibly helpful to us thus far. They recently donated a 45-drive JBOD array, and as of today 28 2TB and 6TB drives (for this array and/or data transport to/from Arecibo). We'll use this, along with more drives to come and another whole server, to upgrade various parts of our current server backend in the near future: science database server storage, upload server, download server, and main BOINC admin/compute server... They are still collecting donations over at their site (see our donation page for a link to their paypal-based donation site) going towards this new hardware.

Happy holidays, safe travels, and all that. See you in the new year...

- Matt

see comments

9 Dec 2011, 0:14:48 UTC
Had a couple server mishaps yesterday afternoon and this morning. For no apparent reason (at the time) carolyn wedged pretty hard. That's our mysql database server, so when that gets locked up, everything BOINC/SETI@home related does as well. I was able to recover it by the early evening without too much ado, except - as it always happens when a master mysqld database suddenly crashes - the replica database on jocelyn is all out of whack. I'll sort that out next week. During the recovery though, and continuing through today, I'm seeing weird kernel messages relating to power. From what I've read this is likely due to faulty (or unseated) memory, but may be worse - like a CPU or motherboard problem. Great. Anyway, this is all on my radar.

This morning Jeff came in and found bruno (the main BOINC admin machine and upload server) was now wedged. This happens from time to time on these busy servers due to non-hardware reasons, and a quick reboot usually fixes it, which it did in this case. All systems seem to be go for now (except for the replica db mentioned earlier).

Otherwise, smooth sailing... I guess. In the background Jeff's actually been working on some time-critical non-SETI work and I've been immersed in the usual dozen-or-so mini projects. However, we're making progress on streamlining the science database - a first pass at improving the NTPCkr throughput and then determining what hardware we may need (if any) beyond that.

- Matt

see comments

29 Nov 2011, 22:53:31 UTC
We're coming out of our usual weekly maintenance outage. I was quite productive today. Outside of the usual tasks I upgraded the OS on a couple backend servers. This was much smoother compared to similar chores last week.

I also rebooted the master mysql database server, thinking it could stand to have its pipes cleaned (see my post griping about this last week). Well, that didn't help. This may just be a perfect storm. There are a lot of BOINC backend queries which run "every 24 hours" but the way these jobs are implemented they run on average "every 24.05 hours." Over time they migrate to when the outage is happening, and therefore they wait, and then slam against the database once we come back online. We might have to force migrate these until later. At least that's the next thing to try.
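To put a rough number on that drift: a job that fires every 24.05 hours starts 0.05 hours (3 minutes) later each run, so it eventually slides into any fixed window. The Python sketch below just does that arithmetic. The function names and the "hours before the outage" input are made up for illustration and are not taken from BOINC.

```python
from fractions import Fraction

PERIOD_HOURS = Fraction("24.05")  # effective period quoted in the post
DAY_HOURS = 24


def drift_minutes_per_run(period_hours=PERIOD_HOURS):
    """Minutes by which the job's start time slips later on each run."""
    return (period_hours - DAY_HOURS) * 60


def runs_until_overlap(hours_before_outage, period_hours=PERIOD_HOURS):
    """Runs needed before a job that currently starts `hours_before_outage`
    hours ahead of the outage drifts into the outage start time.
    (Hypothetical helper; the real job schedule is not given in the post.)"""
    slip = period_hours - DAY_HOURS
    # Ceiling division on exact fractions.
    return -(-Fraction(hours_before_outage) // slip)


def runs_for_full_lap(period_hours=PERIOD_HOURS):
    """Runs needed for the start time to drift all the way around the clock."""
    return DAY_HOURS / (period_hours - DAY_HOURS)
```

With the quoted period this works out to 3 minutes of slip per run and 480 runs for a full lap around the clock, so any given "daily" job should collide with the weekly outage window roughly once every 16 months unless it is nudged.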

By the way, as a cost saving measure, the entire Space Lab recently migrated to using Calmail - the campus wide e-mail system. That way we can stop wasting scant precious IT resources on maintaining our own lab wide mail servers. Turns out Calmail is kind of a bust. Now that most of our lab is dependent on it, it's been crashing almost constantly. For example, we didn't have e-mail for most of the Thanksgiving weekend, and today the whole system is once again kaput. One or two short outages here and there are acceptable, but I'm starting to call this a major disaster.

- Matt

see comments

23 Nov 2011, 23:59:58 UTC
Before we disappear for the long Thanksgiving holiday weekend I figure I'd catch you up on a couple things.

Keeping up with good security practices I'm in OS upgrade mode around here. So far so good getting our machines up to the latest rev of Fedora (FC16) but I hit a couple snags with vader yesterday. Vader is one of the two download servers, as well as a general BOINC backend server - you may have noticed I moved some assimilators/splitters/etc. off of it yesterday before the upgrade.

Anyway, there were a bunch of tiny annoyances during the whole process that ate up my whole day. Things like messed up network configurations and such. I'm kind of peeved how much Fedora and linux have changed over the past couple of years. I don't need job security in the form of relearning fundamental changes to OSes that worked just fine a month ago. Long story short it seemed like the only way I could truly configure the network was to yum in an old version of the network configuration GUI and use that to create the proper startup scripts.

I had to reboot vader a lot during all this, and some more again this morning, but downloads are now back to normal (albeit dropping packets as usual since we're constantly maxed out). I also got some of the BOINC backend processes running on it again.

Meanwhile after the last couple of Tuesday outages we've had a hard time recovering in general. Those watching the traffic graphs may have noticed how depressed they were upon coming back on line. This was mostly mysql's fault. It's doing some kind of mysterious i/o and/or internal bookkeeping causing queries to take forever after our weekly outages. My suspicion is that we just need to reboot the mysql server to clear some pipes. It's been a while. We'll do that next Tuesday.

- Matt

see comments

9 Nov 2011, 20:53:50 UTC
Funny story. About 3 years ago I realized that the BOINC database has result ids stored as integers, which are 4 bytes long and signed by default. The sign takes up one bit, thus leaving 31 bits remaining for the value. That means the maximum value is 2^31 - 1 (2 to the power of 31, minus one, or 2147483647). I mentioned this at the time, noting we were well on our way towards this maximum value, and put it on the "things we'll need to fix eventually" list.

Nobody has really been watching this (I've been pretty much out for over two months until this week), and sure enough we hit that limit yesterday, and the whole BOINC backend pretty much barfed. We tried to implement a "quick fix" by changing the result id signed integer to an unsigned integer (both in mysql and the C code), thus giving us an extra bit for the value. Now that means the maximum value is 2^32 - 1 (2 to the power of 32, minus one, or 4294967295). That should have bought us a couple more years.

However, this quick fix didn't really work. There's all kinds of code in BOINC that needs to be changed to get unsigned integers to work. Dave made some of these changes and Jeff tested them this morning, but still to no avail. More necessary fixes were found. We seem to be once again creating and sending out work at the moment. However the hood is wide open on BOINC now, so we're watching things carefully over the next day or so.

We're certainly not done - there are tons of cosmetic fixes that need to be made (our logs are full of entries containing negative result ids). In the long term we'll have to do the same for workunit ids, and at that point we'll probably go ahead and make them long longs (which are always 8 bytes, as opposed to longs, which are 4 bytes on 32-bit systems and 8 bytes on 64-bit systems) in the C code and bigints in mysql. At that point our id space will max out at 9223372036854775807 (2^63 - 1), which should probably be enough. That's a million results a day for roughly 25 billion years. If we're still running SETI@home 25 billion years from now there's probably nobody out there. Agreed?

We've been bitten by this long ago in informix, and have since been storing larger numbers there as int8's (8 byte integers) or doubles.
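The integer limits involved are easy to check directly. The Python sketch below lists the maximum values for signed 4-byte, unsigned 4-byte, and signed 8-byte integers, shows how an id just past the signed 4-byte limit looks when it is read back as a signed value (one plausible source of negative ids in logs), and estimates how long an id space lasts at the million-a-day rate mentioned above. The helper names are illustrative only and are not BOINC code.

```python
import struct

INT32_MAX = 2**31 - 1   # signed 4-byte integer
UINT32_MAX = 2**32 - 1  # unsigned 4-byte integer
INT64_MAX = 2**63 - 1   # signed 8-byte integer (C long long, mysql BIGINT)


def as_signed32(value):
    """Reinterpret the low 32 bits of `value` as a signed 4-byte integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


def years_to_exhaust(max_id, ids_per_day=1_000_000):
    """Years until ids run out at a steady rate (365.25-day years)."""
    return max_id / ids_per_day / 365.25


if __name__ == "__main__":
    print("signed 32-bit max:  ", INT32_MAX)
    print("unsigned 32-bit max:", UINT32_MAX)
    print("signed 64-bit max:  ", INT64_MAX)
    # The first id past the signed limit wraps to a negative number.
    print("2^31 read as signed:", as_signed32(INT32_MAX + 1))
    print("years at 1M ids/day (64-bit): %.3g" % years_to_exhaust(INT64_MAX))
```

The same byte pattern means different things depending on whether the reader treats it as signed or unsigned, which is why every piece of code that touches the id has to agree on the type.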

Warning: since we didn't come across this problem in advance and solve it gracefully, there may be some ugliness in the form of blocked results in weird states - these will most likely time out on their own and get resent. Sorry if this causes any confusion in the coming weeks.

By the way, it should be mentioned there were some random download server issues over this past weekend. No big deal - usual stuff regarding linux kernel hangs. We kicked the servers on Monday morning and they went back to work.

- Matt

see comments

2 Nov 2011, 17:26:55 UTC
Usually when I take large chunks of time away from the lab the servers get sad and Jeff has to deal with some extra sysadmin chaos beyond the usual grind we both deal with day to day. This time they kindly waited for me to get back.

The BOINC web/alpha project server went kaput on Monday, the day I returned. This wasn't the worst tragedy, as all I had to do was reinstall the OS. However during that process one of the non-OS filesystems in the machine got screwed up. Classic linux software RAID behavior: failing for no apparent reason, and instead of recovering gracefully like it should it gets into a funny state that makes recovery impossible. As I griped about in the past: linux software RAID is really only good for organized, faster storage, with the side benefit of maybe, on very rare occasions, actually protecting the data. The con of linux software RAID is that it will eventually go bonkers and ruin your day.

Fine. I rebuilt the RAID, and we have daily backups of the filesystem in question (which happens to hold the entire BOINC web site). Well, turns out the lab-wide backup system was also having problems. Talk about bad timing. Long story short we had to wait 48 hours for the backup system to get back on line, and only just now am I recovering the web site. It should be back later this afternoon in some form or another (I hope).

- Matt

see comments

6 Oct 2011, 22:20:37 UTC
Hey gang. I've been back in the lab for a few days. Figured I'd say hi and mention a couple things.

The HE problems are indeed getting weirder, and multi-faceted. We know the router itself needs more memory. Getting memory isn't the problem. Getting access to the router is. Knowing this, one hopeful option is to perhaps get ourselves off the current link and move entirely back to using campus infrastructure, now that there's enough bandwidth to handle us. But there are so many parties involved on all fronts that, as always, this sort of thing is moving at a snail's pace. Meanwhile, one of the routers in our chain, unrelated to us but still affecting us, was the victim of a DDOS attack the other day. Another reason we need to simplify our setup already.

Note that there have been other issues affecting general connectivity. For example: our mysql schedule database swelled too large because db_purge wasn't running for a while, so it started falling out of memory and slowing everything down. This is clearing up on its own at the moment. Some scheduler bugs were also introduced, but they have since been mostly, if not entirely, fixed. Meanwhile we turned off "resend lost results" until the smoke clears a bit.

We're also weighing our options for improving the science database throughput. The solutions include (and aren't mutually exclusive) moving entirely to solid state disks (which I find a little scary), changing the schema of our signal tables to bifurcate into good/uninteresting signals (which will vastly reduce lookups and what we need to keep in memory, but will require major changes to all our backend code), and perhaps just adding another disk enclosure with SATA drives.

Meanwhile I just started another informative mass e-mail. It's going out now verrrry slowly (due to recent campus mail configuration changes). If you're curious, here it is.

By the way that Secret Chiefs 3 US/Canada tour was super fun, and I'm about to head out on a shorter one in Europe (Iceland/France/England). There may be other similar tours on my plate in the new year (Western US, Australia, South America). Sorry about the absence, but I'll be back in November and then not going anywhere for a couple months I think.

- Matt

see comments

24 Aug 2011, 20:34:46 UTC
I'm still here, but this is probably my last tech news item for a long while. Eric/Jeff will try to keep you up to date on the nerdy behind the scenes stuff while I'm gone. They are equally (if not far more) qualified to do so.

So.. regarding this current dearth of workunits. We had a routine drive swap on thumper (our file server, where we keep all the raw data among other things) after one drive started showing signs of impending failure. This unexpectedly caused three problems: 1. the drive swap confused the RAID and we couldn't easily get it out of degraded state, 2. this somehow in turn corrupted the xfs filesystem on said RAID, causing us to lose our on-line cache of raw data, and 3. other systems couldn't mount this filesystem anymore, even after it seemed to be in a stable enough state.

Tie all that together, and you can't make workunits. The good news is we didn't really lose any data, as it's all archived elsewhere, so the weekend was spent copying a lot of raw data back onto systems in our lab. Anyway the long and the short of it is after the dust settled it was easy to un-degrade the RAID (though once again I'm annoyed by the wonky/unpredictable nature of linux software RAID). That took a day to resync. Then I spent a day copying everything off the xfs-corrupted filesystem, made a fresh new reformatted partition, and just started copying everything back. I also kicked all the other machines enough to start mounting this new, remade partition.

All you really need to know is: it's all looking pretty good, and we'll start making workunits again probably by sometime tomorrow morning, if not sooner.

Meanwhile everything else is pretty much fine. I'm actually mostly busy helping Dan/Eric cobble together a spate of NASA grant proposals. Keep your fingers crossed on those.

- Matt

see comments

11 Aug 2011, 23:07:27 UTC
Okay, we didn't fix the HE connections problem, but are getting closer to understanding what's going on. Basically our router down at the PAIX keeps getting a corrupted routing table. We reboot it, which flushes the pipes, but this only "evolves" the issue: people who couldn't connect before now can, but people who could connect before now cannot, or people don't see any change in behavior. This is likely due to a mixture of: (a) low memory on this old router, (b) our ridiculously high, constant rate of traffic, and perhaps also (c) a broken default route.

We're looking into (c) at the moment, and solving (a) may be far too painful (we don't have easy access to this router, which is a donated box mounted in donated rack space 30 miles away). So I've been arguing that we need to deal with (b) first, i.e. reduce our rate of traffic.

Part of reducing our traffic means breaking open our splitter code. Basically, one of the seven beams down at Arecibo has been busted for a while, thus causing a much-higher-than-normal rate of noisy workunits. We've come up with a way to detect a busted beam automatically in the splitter (so it won't bother creating workunits for said beam) but this means cracking open the splitter. This is a delicate procedure, as you can really screw things up if the splitter is broken - and it usually needs oversight from Eric, who is the only one qualified to bless any changes to it. Of course, Eric has been busy with a zillion other things, so this kept getting kicked down the street. But at this point we all feel this needs to happen, which should reduce general traffic loads, and maybe clear up other problems - like our seemingly overworked router facing HE being unable to handle the load.

Of course, it doesn't help we're all bogged down in a wave of grant proposals and conferences, and I'm having to write a bunch of notes as part of a major brain dump since I'm leaving for two months (starting two weeks from now). I'll be on the road (all over eastern North America in September, all over Europe in October) playing keyboards/guitar with the band Secret Chiefs 3. It's been a crazy month thus far getting ready for that.

- Matt

see comments

9 Aug 2011, 22:04:43 UTC
It's looking like we might have found the culprit of the random HE connection problems - a corrupt routing table in one of our routers. I believe we cleaned it up. So... did we? How's everybody doing now? Of course, we're coming out of a typical Tuesday outage, so there's a lot of competing traffic.

Also jocelyn survived just fine doing its mysql replica duties over the weekend and through the outage. Though we hit one snag with a difference between 5.1 and 5.5 mysql syntax. How annoying! Not a major snag, though, and everything's fine.

Jeff and Bob are still doing tons of data-collecting tests trying to figure out the best way to configure the memory on oscar, the main informix/science database server. Will more memory actually help? The jury is still out. Or the trial is still going on. Pick your favorite metaphor.

- Matt

see comments

4 Aug 2011, 21:28:25 UTC
Not that it's all bright and shiny, but how about I just report some good news?

Looks like we got beyond the issues with the mysql replica on jocelyn. Basically we swapped in a bunch of different qlogic cards (which we had lying around) and one of them seems to be working. We're also using a new fibre cable (this new card had a different style jack so I was forced to do so). So far, so good - it recovered from the backup dump taken this past Tuesday, and as I type this sentence it is only 21K seconds behind (and still catching up best I can tell). Of course, we need to wait and see - chances are still good it may hiccup like before.
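A replica that is "N seconds behind" only catches up if it replays the master's log faster than the master writes it. (In mysql the lag figure is presumably the `Seconds_Behind_Master` value reported by `SHOW SLAVE STATUS`.) The Python sketch below estimates catch-up time from the lag and a replay rate. The replay rate is a hypothetical input, since the post gives no such measurement.

```python
def catch_up_seconds(lag_seconds, replay_rate):
    """Estimate wall-clock seconds until a replica catches up.

    lag_seconds: how far behind the master the replica currently is.
    replay_rate: seconds of master activity the replica applies per
                 wall-clock second (assumed constant; hypothetical input).

    While the replica replays, the master keeps producing new work at
    1 second per second, so the lag shrinks at (replay_rate - 1).
    """
    if replay_rate <= 1.0:
        return float("inf")  # never catches up at this rate
    return lag_seconds / (replay_rate - 1.0)
```

For example, at the 21K-second lag quoted above and an assumed replay rate of 1.5, the replica would need 42,000 seconds (a bit under 12 hours) to draw level. At a rate of 1.0 or below it never would.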

And also finally there's some non-zero hope in the HE connection issues front: one tech there may have a clue about a router configuration we may need to add/update on our end, though I'm still unsure what changed in the world to break this. I sent them some test results, now I'm just waiting to hear back.

You may have noticed some of our backend services going down today. This was planned. The short story is we just plucked 48GB of memory out of synergy (back-end compute server) and added it to oscar (the main science database server). So now oscar has 144GB of RAM to play with - the greater plan being to see if this actually helps informix performance, or are we (a) hopelessly blocked by bad disk i/o, and/or (b) dealing with a database so big that even maxing out memory in oscar at 192GB won't help. In any case, testing on this front moves forward. The more we understand, the more we learn *exactly* what hardware improvements we need.

- Matt

see comments

27 Jul 2011, 20:37:40 UTC
Here's another end-of-the-month update. First, here's some closure/news regarding various items I mentioned in my last post a month ago.

Regarding the replica mysql database (jocelyn) - this is an ongoing problem, but it is not a show stopper, nor does it hamper any of our progress/processing in the slightest. It's really solely an up-to-the-minute backup of our master mysql database (running on carolyn) in case major problems arise. We still back up the database every week, but it's nice to have something current because we're updating/inserting/deleting millions of rows per day. Anyway, I did finally get that fibrechannel card working with the new OS (yay) and Bob got mysql 5.5 working on it (yay) but the system's issues with attached storage devices remain, despite swapping out entire devices - so this must be the card after all. We'll swap one out (if we have another one) next week. Or think of another solution. Or do nothing because this isn't the highest priority.

Speaking of the carolyn server, last week it locked up exactly the same way the upload server (bruno) has, i.e. the kernel freaks out about a locked CPU and all processes grind to a halt. We thought this was perhaps a bad CPU on bruno, but now that this happened on carolyn (an equally busy but totally different kind of system with different CPU models running different kinds of processes) we're thinking this is a linux kernel issue. We'll yum them up next week but I doubt that'll do anything.

We're still in the situation where the science databases are so busy we can't run the splitters/assimilators at the same time as backend science processing. We're constantly swapping the two groups of tasks back and forth. Don't see any near-term solution other than that. Maybe more RAM in oscar (the main science informix server). This also isn't a show-stopper, but definitely slows down progress.

The astropulse database had some major issues (we got the beta database in a corrupted state such that we couldn't start the whole engine, nor could we drop the corrupted database). We got support from IBM/informix who actually logged in, flipped a couple secret bits, and we were back in business.

So... regarding the HE connection woes. This remains a mystery. After starting that thread in number crunching and before I could really dig into it I had a couple random minor health issues (really minor, everything's fine, though I claimed sick days for the first time in years) and a planned vacation out of town, and everybody else was too busy (or also out of town) to pick up the ball. I have to be honest that this wasn't given the highest priority as we're still pushing out over 90Mbits/sec on average and maxing out our pipe - so even if we cleared up these (seemingly few and random) connection/routing issues they'd have no place to go. Really we should be either increasing our bandwidth capacity or putting in measures to not send out so many noisy workunits first.
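For a sense of scale, here is the arithmetic for what a sustained outbound rate means in data volume per day. This Python sketch assumes decimal units (1 Mbit = 1,000,000 bits, 1 GB = 10^9 bytes). The function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def bytes_per_day(mbits_per_sec):
    """Bytes moved in 24 hours at a sustained rate given in megabits/sec."""
    bits_per_sec = mbits_per_sec * 1_000_000  # decimal megabits
    return bits_per_sec * SECONDS_PER_DAY // 8


def gigabytes_per_day(mbits_per_sec):
    """Same volume expressed in decimal gigabytes."""
    return bytes_per_day(mbits_per_sec) / 1e9
```

At the 90 Mbits/sec average quoted above, that comes to 972 GB pushed out per day.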

Still, I dug in and got a hold of Hurricane Electric support. We're kind of finding if there *is indeed* an issue, it's from the hop from their last router to our router down at the PAIX. But our router is fine (it is soon to reach 3 years of solid uptime, in fact). The discussion/debugging with HE continues. Meanwhile I still haven't found a public traceroute test server anywhere on the planet that consistently fails to reach us (i.e. a good test case that I have access to). I also wonder if this has to do with the recent IPv6 push around the world in early June.

Progress continues in candidate land. We kind of put on hold the public-involvement portion of candidate hunting due to lack of resources. Plus we're still finding lots of RFI in our top candidates which is statistically detectable but not quite obvious to human eyes. Jeff's spending a lot of time cleaning that up, hopefully to get to a point where (a) we can make tools to do this automatically or (b) it's a less-pervasive, manageable issue.

That's enough for now.

- Matt

see comments

23 Jun 2011, 21:35:31 UTC
Here's another catch-up tech news report. No big news, but more of the usual.

Last week we got beyond the annoying limits with the Astropulse database. There's still stuff to do "behind the scenes" but we are at least able to insert signals, and thus the assimilators are working again.

The upload server (bruno) keeps locking up. This is load related - it happens more often when we are maxed out, and of course we're pretty much maxed out all the time these days. We're thinking this may actually be a bad CPU. We'll swap it out and see if the problem goes away. Until then.. we randomly lose the ability to upload workunits and human intervention (to power cycle the machine locally or remotely) is required.

We've been moving back-end processes around. I mentioned before how we moved the assimilators to synergy as vader seemed overloaded. This was helpful. However one thing we forgot about is that the assimilators have a memory leak. This is something that's been an issue forever - like since we were compiling/running this on Sun/Solaris systems - yet completely impossible to find and fix. But an easy band aid is to have a cron job that restarts the assimilators every so often to clear the pipes. Well, oops, we didn't have that cron job on synergy and the system wedged over the weekend. That cron job is now in place. But still.. not sure why it's so easy for user processes to lock up a whole system to the point you can't even get a root prompt. There should always be enough resources to get a root prompt.

The mysql replica continued to fall behind, so the easiest thing to try next was upgrading mysql from 5.1.x to 5.5 (which employs better parallelization, supposedly, and therefore better i/o in times of stress). However, Fedora Core 15 is the first version of Fedora to have mysql 5.5 in its rpm repositories. So I upgraded jocelyn to FC15.. only to find for some reason this version of Fedora cannot load the firmware/drivers for the old QLogic fibre channel card, and therefore can't see the data drives. I've been beating my head on this problem for days now to no avail. We could downgrade, but then we can't use mysql 5.5. I guess we could install mysql 5.5 ourselves instead of yumming it in, but that's given us major headaches in the past. This should all just work like it had in earlier versions of Fedora. Jeez.

Thanks for the kind words in the previous thread. Don't worry - I won't let it get to my head :).

- Matt

see comments

14 Jun 2011, 23:06:13 UTC
Usual outage day. Project goes down, we squeeze and copy databases, project comes back up. It seems the mysql replica is oddly unable to keep up with much success anymore. I think the cause is our ridiculously consistent heavy load lately thus keeping the databases busier than normal. Anybody have any theories about what is causing the ridiculously consistent heavy load? What's also a little strange is the CPU/IO load on jocelyn is low... so what's the bottleneck? I'd have to guess network, but it's copying the logs from the master faster than executing the SQL within those logs. So...?

And speaking of high production loads I also just noticed we're low on work to split. Prepare for tonight to be a little rocky as files are slow to transfer up from the archives and get radar blanked before being splittable.

By the way, the Astropulse assimilators are off because the database table containing the signals had one of its fragments run out of extents. In layman's terms it reached an arbitrary limit that we'll now have to work around. We'll sort this out shortly.

Kepler data is here in a big ol' box and being archived down to HPSS. It sure is nice seeing the network graph for the whole lab going from a baseline of ~50 Mbits/sec to ~250 Mbits/sec when we started that procedure. Too bad we're still currently stuck using the HE connection for our uploads/downloads. Maybe someday that'll change.

Sorry my posts continue to be intermittent. I apologize but expect things to get worse as the music career will temporarily consume me. You may see rather significant periods of silence from me for the next... I dunno... 6 to 12 months? I'm sure the others will chime in as needed if I'm not around.

- Matt

see comments

9 Jun 2011, 22:33:54 UTC
So bruno (the upload server) has been having fits. Basically an arbitrary CPU locks up. I'm hoping this is more of a kernel/software issue than hardware, and will clear up on its own. In the meantime, we did get it on a remote power strip so we can kick it from home without having to come to the lab.

As for thumper we replaced the correct DIMMs this time around on Tuesday. But then it crashed last night! So there was some cleanup this morning, then re-replacing the DIMMs with the originals, and then coming to terms with the fact that the most likely scenario is that those replacement DIMMs were actually DOA. So we're back to square one on that front, hoping for no uncorrectable memory errors until the next step.

In better news we moved some assimilator processes to synergy and were pleasantly surprised how much faster they ran. In fact, we are running the scientific analysis code now which has been causing the assimilators to back up, but they aren't. That's nice. Really nice, actually. [EDIT: I might have spoken too soon on this front - not so nice.]

Still trying to hash out the next phase for the NTPCkr and how to present all this to the public. We're doing a bunch of in-house analysis ourselves just to get a feel for the data and clean up junk, and as expected most of the "interesting" stuff is turning out to be RFI. We want to get it to a point where we're presenting people with candidates that contain signals which aren't always obvious RFI. That would be boring and useless.

- Matt

see comments

1 Jun 2011, 23:09:44 UTC
Long time no speak. I've been out of town and/or busy and/or admittedly falling out of the habit of posting to the forums.

So I was gone last week (camping in various remote corners of Utah, mostly) and like clockwork a lot of server problems hit the fan once I was out of contact. Among other things, the raw data storage server died (but has since been recovered), oscar wedged up for no reason (a power cycle fixed that) and Jeff's desktop had some issues as well (nothing a replacement power supply couldn't handle).

Then we had the holiday weekend of course, but we all returned here yesterday and continued handling the fallout from all that, as well as the usual weekly outage stuff. We're still using thumper as the active raw data storage server and worf is now where we're keeping the science backups. Basically they switched roles for the time being, until we let this all incubate and decide what to do next, if anything.

This morning we brought the projects down to replace some DIMMs (they have been sending complaints to the OS) on thumper. One thing I kinda loathe about professional computing in general is poor documentation - a problem compounded by chronic zero-index vs. one-index confusion, and physical hardware labels vs. how they are depicted in the software. Long story short despite all kinds of effort to determine exactly which DIMMs were broken, it wasn't until after we did the surgery and brought everything back on line that we found out we probably replaced the wrong ones. Oops. We'll have to do this again sometime soon.

There are some broken astropulse results clogging one of the validators (which is why it shows up in red on the status page). We'll have to figure out an automated way to detect these results and push them through (it's a real pain to do by hand). In the meantime, this is causing our workunit storage server to be quite full, and might hamper other workunit development sooner than later.

Gripes and server issues aside, there is continuing happy progress. I'm still tinkering with visualization stuff for web based analysis of our candidates (for private and potential public use), and we have tons of data from the Kepler mission arriving here any day now which will be fun to play with.

- Matt

see comments

26 May 2011, 16:10:32 UTC
Yesterday evening, one of our storage servers lost contact with its expansion enclosure. Upon power cycle, the expansion reappeared but the raid that spanned both head and expansion looks to be gone. This server contained the raw data that the splitter uses to create workunits. No raw data was lost because it is all backed up.

The evening before last, our science database machine locked up. A power cycle "fixed" that problem. As of yet we find no clues as to the cause of the lock up.

see comments

26 Apr 2011, 22:38:58 UTC
Heigh Ho - it's been a while. Not much to write about as most everything has been status quo, plus I was out of town for a few days in there. Yeah, there have been some minor quirks in the meantime - par for the course, I guess. We did have our Tuesday outage today, which beyond the normal tasks included swapping out dying drives with new ones in two of our major servers: thumper and bruno. You may have noticed the web site go dead for 15 minutes there while thumper was off line. The fact that both drive swaps/reboots went along quickly and without any hitches speaks well of our current server quality and configuration, I guess.

In case anybody missed it, Eric responded nicely to the current wave of news regarding the Allen Telescope Array going into hibernation. I swear every time there's a SETI related article in a major publication (positive or negative) we have to do some kind of damage control, cleaning up various journalistic errors and reader misconceptions.

- Matt

see comments

7 Apr 2011, 22:42:01 UTC
Turns out one of the inputs to our data recorder got messed up. I don't have the details, but this only happened recently - we haven't yet gotten the raw data from Arecibo to split into workunits, etc. And the problem will be fixed soon if not already. The key now is to add some smarts to our scripts to avoid splitting the particular beam on that particular day or so. I guess this is a feature we've been meaning to add, and now we have a reason.
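
The skip logic described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual splitter scripts: the `(date, beam)` tuple layout, the names `BAD_BEAM_DAYS`, `should_split`, and `splittable`, and the example date and beam number are all assumptions.

```python
from datetime import date

# Hypothetical blacklist of (observation date, beam number) pairs known to be bad.
# The date and beam below are placeholders, not the real affected data.
BAD_BEAM_DAYS = {
    (date(2011, 4, 1), 3),
}

def should_split(obs_date, beam, bad=BAD_BEAM_DAYS):
    """Return True unless this beam on this day is blacklisted."""
    return (obs_date, beam) not in bad

def splittable(chunks, bad=BAD_BEAM_DAYS):
    """Filter a list of (obs_date, beam) chunks down to the usable ones."""
    return [c for c in chunks if should_split(c[0], c[1], bad)]
```

Keeping the blacklist as data rather than code means a newly discovered bad beam/day pair only needs one more entry, not another special case in the scripts.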

I've been messing a lot with gprof the past couple of days trying to find the subroutines causing the greatest slowdowns in our final analysis code. This actually wasn't getting me very far - I just now had to resort to some fprintf's with microsecond-level timestamps. No big news - we're database i/o constrained but now I have some numbers. We're brainstorming ideas about how to keep only the stuff we need in memory (as opposed to whole tables or indexes). Also I've been getting back to work cleaning up the NTPCkr pages for ultimate public consumption (and ultimately public assistance in helping us find and score detections). Slow, steady progress.
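
For readers curious what "timestamps around the suspect code" looks like in practice, here is a small Python analogue of the fprintf approach (the real analysis code is not shown in this post, so the labels and the stand-in workloads are assumptions). It records elapsed wall-clock time per labelled block in microseconds.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, sink):
    """Append (label, elapsed microseconds) to sink when the block exits."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_us = (time.perf_counter() - start) * 1e6
        sink.append((label, elapsed_us))

timings = []
with timed("db_read", timings):
    time.sleep(0.01)      # stand-in for a database read
with timed("scoring", timings):
    sum(range(1000))      # stand-in for CPU-bound work

for label, us in timings:
    print(f"{label}: {us:.0f} us")
```

Unlike a sampling profiler, this measures wall-clock time, so time spent waiting on i/o shows up directly.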

Otherwise all systems are go - bulking up the data pipeline before the weekend, queues are draining, etc. The mysql replica seems to be chugging along just fine. We'll see how it goes through the weekend before we call that project done.

- Matt

see comments

5 Apr 2011, 22:43:45 UTC
Happy Tuesday! We had our usual outage today (mysql cleanup/backup). The replica mysql server is still having issues. Over the weekend while it was catching up after being rebuilt it hit some corrupted relay log data. This is a bit troublesome - either the logs were corrupted on carolyn (the master) or they got corrupted during transfer to the replica, or there are still fibre channel issues on this system causing random storage corruption (even after swapping out the entire disk cage, cable, and gbic). I'm rebuilding the replica yet again (with today's backup) and we'll go from there...

Some good news: The entire lab recently upgraded to a gigabit connection to the rest of the campus (and to the world). Actually that was months ago. We weren't seeing much help from this for some reason. Well today we found the bottleneck (one 100Mbit switch) that was constraining the traffic from our server closet. Yay! So now the web site is seeing 1000Mbit to the world instead of a meager 100Mbit. Does it seem snappier? Even more important is our raw data transfers to the offsite archives are vastly sped up, which means fewer opportunities for the data pipeline to get jammed (and therefore running low on raw data to split). Note this doesn't change our 100Mbit limit through Hurricane Electric, which handles our result uploads/workunit downloads. We need to buy some hardware to make that happen, but we may very well eventually move our traffic onto the SSL LAN - this is a political problem more than a technical one at this point.
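
As a back-of-the-envelope illustration of what the switch replacement means for bulk transfers, the sketch below computes ideal transfer times at both link speeds. The 50 GB file size is a made-up example (the post gives no file sizes), and the calculation ignores protocol overhead and disk limits.

```python
def transfer_seconds(size_bytes, link_mbit_per_s):
    """Ideal transfer time: bits to move divided by the link rate."""
    return size_bytes * 8 / (link_mbit_per_s * 1_000_000)

size = 50 * 10**9                    # hypothetical 50 GB file
slow = transfer_seconds(size, 100)   # through the old 100Mbit switch
fast = transfer_seconds(size, 1000)  # on the gigabit path

print(f"100Mbit: {slow / 60:.1f} min, 1000Mbit: {fast / 60:.1f} min")
# prints "100Mbit: 66.7 min, 1000Mbit: 6.7 min"
```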

Over the weekend we had some web servers croak here and there, affecting the home page, workunit downloads, and scheduling. We think this was all due to removing ptolemy from our server mix (the system is powered off and its name and IP address are back in the general pool). Many machines/scripts still have references to ptolemy (or files on ptolemy). We did our best to clean this up before shutting it off, but we knew there would be some minor aches and pains. In this case, a generic web log rotation script was having fits and not killing/restarting apache very well.

- Matt

see comments

29 Mar 2011, 23:06:03 UTC
Well, we had more data pipeline issues over the weekend resulting in low work production, but we cleaned that up on Monday and I added yet more AI to the whole suite of scripts to hopefully get beyond more potential snags.

Today was our usual outage day and we made the most of it. Besides the usual mysql database compression and backup, we took care of the following:

1. Got the seemingly broken 3510 fibre-attached RAID on jocelyn out of the closet, and replaced it with a fibre-attached JBOD which we already had lying around. A software RAID partition is syncing up as I type this.

2. Built another fragmented index for one of the signal tables on the science database and played around with some speed tests.

3. Finished moving over by hand the remaining workunits that were copied elsewhere when the workunit storage server locked up a month or so ago.

4. Did all the tweaking necessary on all the systems to let go of internal server ptolemy so we could shut it down for good. Yet one more 32-bit machine retired.

5. Noticed that synergy rebooted itself a couple times this morning, so we took it off one potentially flaky power strip. Man that system is sensitive to slight power fluctuations. At least we hope that's the case.

6. With ptolemy offline Eric immediately cannibalized one of its drives to use in one of his hydrogen study database servers, which had a recently failed drive in it.

7. Replaced the failed drive on our raw data storage server with one of many kindly donated (once again) by Overland Storage.

I think that's about it, minus some less interesting details.

- Matt

see comments

22 Mar 2011, 22:41:34 UTC
Our raw data storage server had a drive failure early over the weekend, which locked a bunch of stuff up including workunit production. Oh well. We were able to sort it out when we all got back in the lab on Monday, but it wasn't until late in the day that enough radar-clean data was created for the splitters to chew on and make more workunits.

At nearly the same time the above drive failed (during major thunderstorms here in Berkeley, which is probably just coincidental) the replica database on jocelyn crashed yet again. This system keeps losing the external storage (in the form of a Sun 3510) and mysql freaks out. We're not sure what the issue is but today we became fairly confident the problem is local to the 3510 (and not jocelyn itself). An amber light on the back of it means "RAID controller failure" which in this case means this box is pretty much useless. However, on a long shot Jeff suggested I reseat all the drives (most of which have been mounted in the system since we first got it roughly 8 years ago). I did, and the 3510 for the moment seemed willing to play nice. I started recovering the replica database one more time but the 3510 disappeared yet again. We're brainstorming where to move the database - it's not worth replacing that 3510, so we'd need other storage options... Or perhaps not have a replica but some other home-grown backup option.

Meanwhile we still have creepy rpc.idmapd problems. This daemon, only on a few select systems, keeps dying at random with an "I/O Possible" message. When it dies, some mounted file systems are suddenly full of files owned by "nobody." I have a workaround for the time being - a cron job that restarts rpc.idmapd every few minutes.
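
A watchdog along these lines might look like the sketch below. Note it differs from the workaround described above: it restarts the daemon only when it is found missing, rather than unconditionally every few minutes. The restart command shown in the comment is an assumption (init script names vary by distribution), and the `run` parameter exists only so the logic can be exercised without touching a real system.

```python
import subprocess

def is_running(name, run=subprocess.run):
    """True if a process with exactly this name exists (pgrep exits 0)."""
    return run(["pgrep", "-x", name], capture_output=True).returncode == 0

def ensure_running(name, restart_cmd, run=subprocess.run):
    """Restart the daemon only if it is missing; return True if restarted."""
    if is_running(name, run):
        return False
    run(restart_cmd, check=False)
    return True

# Example call from a cron-driven script (restart command is an assumption):
#   ensure_running("rpc.idmapd", ["service", "rpcidmapd", "restart"])
```

Run from cron every few minutes, this leaves a healthy daemon alone and logs nothing, which makes actual deaths easier to count.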

Had the usual Tuesday outage today. Spent that time messing with the above, dropping some unneeded science database indexes (maybe that'll speed things up as it'll free up buffer space?) and building a necessary index.

- Matt

see comments

17 Mar 2011, 23:01:18 UTC
I think we'd all like more consistency, but given the random nature of raw data collection, delivery, and processing, we're always going to hit periods where workunits are scarce, and that's okay. Part of the delivery requires humans regularly swapping drives in docks, so we're constrained by work hours and sleep and such. Even when new data is brought on line it takes about 4 hours before the first file is "radar cleansed" and made available to the splitters, which then take however long to convert the clean raw data into workunits. In short, the feedback loop isn't so great, so an even flow is difficult.

We had a few more gremlins to play with since yesterday's power outage recovery. Weird stuff involving idmapd of all things, which never ever gave us problems before. It's dying for some reason on a couple systems, messing with file ownerships, though with mostly cosmetic results.

Outside of that, looks like Eric has been rolling out the new SETI@home client in beta. It's been a while since we had one of those.

- Matt

see comments

16 Mar 2011, 22:05:46 UTC
The lab wide power outage more or less went along without any major problems. We powered down all the systems yesterday afternoon (and unplugged most of them to be safe) and brought them back on line this morning. Right off the bat there were no failed disks or broken RAIDs or anything like that except for Eric's hydrogen survey system which was giving him hell for a couple hours due to one partition that needed a ridiculously long fsck. But there were still funny little gremlins like mangled mount permissions, or various headaches caused by NetworkManager. I swear this program or whatever it is only exists to create random problems for network managers to solve and thus protect their jobs (hence the name of the software package itself). We try to remove this package from every system but sometimes it comes creeping back as a dependency from a seemingly unrelated update. Yeah, we have it excluded in our yum.conf's but that line was missing on two machines, and those were the two having problems upon reboot this morning.
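
A quick audit for the missing exclude line could be sketched like this. It assumes the usual yum.conf layout (an INI-style `[main]` section whose `exclude` option is a space-separated list of shell globs); the sample config text is invented for illustration.

```python
import configparser
import fnmatch

def excludes_package(conf_text, package):
    """True if the [main] 'exclude' option in yum.conf text covers package."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    patterns = parser.get("main", "exclude", fallback="").split()
    return any(fnmatch.fnmatchcase(package, p) for p in patterns)

# Invented sample; a real check would read /etc/yum.conf on each machine.
sample = """[main]
cachedir=/var/cache/yum
exclude=NetworkManager* kernel-debug
"""

print(excludes_package(sample, "NetworkManager"))
```

Running something like this across every host would have flagged the two machines where the line had gone missing before the reboot did.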

Anyway... as much of a pain as this was for everyone it completely obscured a problem down on campus which would have caused an hour long network outage last night. But since we were already completely off line, no harm no foul.

- Matt

see comments

14 Mar 2011, 22:21:53 UTC
First and foremost, there are some electrical tests going on this week at the lab, which probably won't affect us and probably won't cause power surges, but there might be minute-long blips. Given that and the funky wiring in this building which has always been full of gremlins we're not taking any chances. We'll take everything off line once the usual outage tasks are over on Tuesday afternoon, then come back up Wednesday morning. Roughly 4pm to 10am, Pacific Time. Sorry for any inconvenience.

Good news: With the questionable UPS out of the loop, synergy didn't reboot itself on the two-week mark. So we're fairly certain those reboots were not due to any problem with synergy.

We're still having science database resource issues. Pretty much when the backend science stuff runs (ntpckr and rfi) the science database throughput nosedives due to all the random i/o, which in turn causes the assimilators to slow down and that queue to back up. This morning I put some logic into my network/project monitoring code so that when the assimilator queue hits a high water mark, it shuts off the ntpckrs and rfis.
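
The high water mark idea can be sketched as a small decision function. This is an illustration only: the thresholds are invented, and the separate low water mark (so the backend jobs do not flap on and off around a single threshold) is an assumption added here, not something the monitoring code is known to do.

```python
def backend_should_run(queue_len, running, high=100_000, low=50_000):
    """Decide whether ntpckr/rfi style jobs should run.

    Stop at or above `high`, resume only at or below `low`,
    and otherwise keep whatever state we are already in.
    """
    if queue_len >= high:
        return False
    if queue_len <= low:
        return True
    return running

# Walk through a queue that backs up and then drains.
state = True
for q in (30_000, 90_000, 120_000, 80_000, 40_000):
    state = backend_should_run(q, state)
    print(q, "run" if state else "paused")
```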

- Matt

see comments

9 Mar 2011, 23:07:10 UTC
Sorry for the lack of news lately. Busy busy busy. Among other things I've been roped into helping Dan on his latest proposal (I generally hope to avoid these).

We started recovering the replica mysql database this morning - it's taking a while to catch up so we won't really be using it until, say, tomorrow. Also regarding database config we finally got beyond some shared memory configuration woes on oscar - so we can build indexes and update stats without having to make the database quiescent (which is what we've been having to do lately). I also made some major strides towards getting ptolemy off line for good. Just gotta copy some archival stuff off and that's about it.

Last night there seems to have been some network issue that was further down the pike but affected all our traffic for a few hours. Whatever it was (I still don't know) it was fixed and we recovered without any intervention. Gotta love that.

And yes, there was a problem with the beta uploads, but Eric pointed that out to Jeff yesterday and he fixed it pretty quick. I think it was just a broken symlink from when we moved back to using gowron again for uploads/downloads.

- Matt

see comments

3 Mar 2011, 23:07:27 UTC
The replica mysql database is down and will stay down until we can recover it during the next weekly outage on Tuesday. This is fallout from the failed (and recovered) RAID on that system. We thought we'd pushed beyond the minor corruption caused by the short blip, but apparently not. The easiest thing to do is rebuild it from scratch from the master after a backup. In the meantime, carolyn can handle the full load on its own without breaking a sweat.

Bob's been doing a great job keeping the splitters well fed with raw data (even over the weekends). Still, we get gummed up with funny raw data files that break the whole analysis suite for one reason or another. So we spent some time this morning adding some more "broken file" smarts to the software. It's all a work in progress. We're running low on our data backlog for the moment, but Bob's getting more from the off site archives as we speak.

- Matt

see comments

1 Mar 2011, 23:15:19 UTC
Happy March to one and all. Haven't had much to write about lately, but here's a round up.

We had our usual weekly maintenance outage today during which we took care of all kinds of stuff besides the usual mysql database compression/backup. Early this morning I noticed the replica mysql server had some broken tables, which led me to discover a drive had failed on that system last night - a 73GB fibre channel drive. Not a big deal, as we have tons of these kicking around from older servers at this point. This was easy enough to hot swap, though I got lost in some internal closet networking updates as this disk array is only accessible via telnet. And then the mysql daemon on the replica freaked out a little bit when the new drive was introduced, so I had to reboot the system, re-fix broken tables, etc. etc. etc. The replica is still catching up (will be for a while).

Today we also moved synergy off the probably-flakey UPS. Yeah, I know we should have done this earlier, but just haven't gotten around to it yet. If anything this gave us one more data point in the form of yet another automatic biweekly reboot on Sunday around 3pm (a couple days ago). Now that the UPS is out of the equation, we have to wait 2 weeks to see if this was indeed the problem.

What else... we moved a lot more bits from ptolemy onto thumper. You may notice some general speedups on the website or elsewhere. We hope. And Jeff and I tackled a ton of timing tests for the science database on oscar. We're finding all the bottlenecks and finding ways around them. The good news is the database select throughput has gone from 100 spikes/second to 17,000 spikes/second. However these are under optimal conditions. In reality we'll have to deal with many of the aforementioned bottlenecks. Also: gowron is back to being the main workunit server (the full transition is far from complete, though).

That's been my day so far. How's your day?

- Matt

see comments

22 Feb 2011, 23:10:23 UTC
We're back from the long weekend, which seems to have been a good period of catching up after some hard times. All systems were happy, and Bob kept the splitters well fed with raw data. We had our usual outage today for database backup. We were hoping to get back to using gowron as the workunit storage server but the rsync back is taking forever (perhaps slowed by an automatic RAID recheck on thumper which nobody asked for - I set it so these automatic rechecks won't happen again).

As for the ptolemy shutdown project, that continues to drag on a little bit. Once again thumper is part of this mix, so the last few (large and active) directories being copied over are taking forever.

- Matt

see comments

17 Feb 2011, 21:34:34 UTC
It's been raining pretty hard all week, especially today. This is never good for the air conditioning (it's more efficient on drier days). Strangely the beeper went off while Jeff and I were here in the lab - and the only reason it goes off is if computers in the closet are well beyond some high temperature threshold. Well, actually there's another reason it would go off: somebody misdialing and calling the beeper number. Turns out the latter was the case. Ha! Still, temperatures are slightly higher today in the closet. Keeping an eye on that.

Projects continue along, like the decommissioning of ptolemy. This was a multi-purpose, heavily used internal file server, so I've been lost in a lot of rsync'ing, cleaning up stale symbolic links and hard paths in scripts/crontabs, etc. However annoying it's a good opportunity to do some filesystem spring cleaning. However thorough I'm being I'm still bracing for unexpected things to break when we cut over (next week some time?).

Speaking of next week, Monday is a holiday (President's Day) so don't expect much activity from us. However, by Tuesday's normal weekly outage we hope to have all the workunits copied back to gowron from thumper so we can revert to our original state and regain a little more normalcy.

We did shut down the assimilators/splitters for a while today to do some database read/cache settings during quiet states to remove all variables and confirm our understanding of how informix is caching results/indexes/etc. The good news is we seem to understand the plumbing. The bad news is we still aren't where we want to be i/o-wise. Still working on it.

- Matt

see comments

15 Feb 2011, 23:50:09 UTC
The weekly outage went by super fast this morning. Most of the time is spent waiting for mysql to compress all its tables. But if we've been silent most of the week, there isn't much to compress.

Nevertheless it still took some extra time to come back online as we had to confirm everything transferred from gowron to thumper before letting the floodgates open. And here we are, back in business. Thumper is nicely handling the workunit storage traffic while we reconfigure gowron, a process which seems to be going along smoothly.

- Matt

see comments

14 Feb 2011, 22:27:06 UTC
Slow, steady progress... We're hoping to have everything copied from gowron onto thumper by tomorrow. Yeah, I know it's going slowly, but there are lots of bottlenecks (degraded RAID, NFS, tons of small files as opposed to a few big ones). After the usual outage we might actually have thumper ready to be the temporary workunit storage server so we can get back to business while doing the necessary upgrades on gowron (which may take as much as a week, unobtrusively running in the background).

That new-ish server synergy rebooted itself on Sunday. This concerned me as this has happened a couple times already. However, I discovered the three reboots thus far all happened on Sunday at 3pm, and two weeks apart from each other. There are no smoking-gun cronjobs, but it is plugged into an old UPS of unknown quality, so we're going to remove that from the equation and watch what happens. The reboots have all been harmless thus far.
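
Spotting this kind of pattern by eye works for three reboots, but it can also be checked mechanically. The sketch below tests whether a list of event times is evenly spaced. The timestamps are illustrative: they are three Sundays two weeks apart at 3pm, chosen to match the description above rather than copied from any log.

```python
from datetime import datetime, timedelta

def common_interval(times, tolerance=timedelta(hours=1)):
    """Return the gap between events if all gaps agree within tolerance, else None."""
    times = sorted(times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    if not gaps:
        return None
    if all(abs(g - gaps[0]) <= tolerance for g in gaps):
        return gaps[0]
    return None

# Illustrative reboot times (not taken from real logs).
reboots = [
    datetime(2011, 1, 16, 15, 0),
    datetime(2011, 1, 30, 15, 0),
    datetime(2011, 2, 13, 15, 0),
]

print(common_interval(reboots))                # the shared gap, if any
print({t.strftime("%A") for t in reboots})     # which weekdays are involved
```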

Somebody somewhere on these forums asked what our server makeup was. It certainly isn't limited to what's on the server status page. If you just count the unix-based machines, there are currently 26 systems all told. Combining all the stuff inside, we have roughly 100 CPUs, 500GB RAM, and 150 TB raw storage. There are also several appliances (routers, switches, UPSes, kvms, remote controlled power strips, etc. etc.). Usually in these threads I'm griping about public facing servers, or ones causing the BOINC back end to jam up for one reason or another. I rarely mention the mundane, day-to-day, garden variety IT stuff.

- Matt

see comments

10 Feb 2011, 22:13:48 UTC
First the good news. I have thumper all configured and ready to roll as our mega file server. In fact it's already rolling. Note this isn't a public facing server, but will indirectly help the various public services in many ways, including making the sysadmins working on SETI@home/BOINC a lot happier in general. Lots of really fast disk storage for database backups, raw data transfer buffers, doesn't randomly reboot itself like our current home account server, etc.

Mmmkay. Now the less good news. Looks like gowron is having some fundamental RAID issues. The issue has been whittled down to one RAID1 pair tagged as degraded that won't rebuild no matter what we do. The guys at Overland have been super helpful - but this is actually an old SnapAppliance (not a box that Overland sells) and running a (very) old version of the OS. So it's looking like our best bet to move forward is to upgrade the OS on the thing. However to do so we need to copy the workunits on the system (about 2 terabytes' worth) elsewhere temporarily. How about... thumper! That copy process is happening now.

Meanwhile, we'll be off for the foreseeable future. Like at least until next week, I imagine. Bummer.

- Matt

see comments

9 Feb 2011, 0:03:30 UTC
My touring band arrived in Boston, MA, and we began hunting for the club we were scheduled to play at. Once in the heart of town we pulled over and looked at a map (i.e. 1998 technology). Lo and behold we were only two blocks away from the venue! However, it still took us 90 minutes (no exaggeration) to get there, due to an impossible sequence of non-right angle one-way streets. One false move, and you were forced to drive around the whole park and try again. On two occasions the club was in our actual line of sight but we still couldn't legally reach it from our current approach. Eventually we stumbled upon the correct permutation of traverses and suddenly landed in front of the building, much to our surprise.

I mention this story as it is an exact analogy to me getting a bootable OS on thumper this whole past week. One false move, and I had to reinstall the OS from scratch. However, we seem to be done with that, and of course in hindsight the ultimate solution is simple enough. The main obfuscations were (a) funky linux drive enumeration, (b) weird unpredictable linux raid behavior, but mostly (c) grub installation via the fedora installer is a bit, well, confused. I'm thinking the installer isn't ever expecting a 48 drive system where the only available boot drives are #0 and #1 according to the BIOS, but #24 and #28 according to linux. I basically had to install an OS without a raided boot, then reinstall grub by hand on both drives (using the grub shell as grub-install wouldn't work), then replace the flat boot partition with a raided boot partition, etc. etc. ETC.!
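
The enumeration mismatch is easier to reason about if drives are matched by a stable identifier, such as a serial number, instead of by position. The toy model below builds two orderings of 48 made-up serials so that the BIOS drives #0 and #1 land at Linux positions 24 and 28, as in the story above. The serials and the orderings are invented purely to show the lookup.

```python
def index_by_serial(enumeration):
    """Map serial -> position for one enumeration (a list of serials in order)."""
    return {serial: i for i, serial in enumerate(enumeration)}

def translate(bios_order, linux_order, bios_index):
    """Find the Linux position of the drive the BIOS calls #bios_index."""
    serial = bios_order[bios_index]
    return index_by_serial(linux_order)[serial]

# Invented serials for a 48-drive box.
bios_order = [f"SER{i:02d}" for i in range(48)]
rest = bios_order[2:]
# Invented Linux ordering that puts the two boot drives at positions 24 and 28.
linux_order = rest[:24] + [bios_order[0]] + rest[24:27] + [bios_order[1]] + rest[27:]

print(translate(bios_order, linux_order, 0), translate(bios_order, linux_order, 1))
# prints "24 28"
```

On a real Linux system the persistent names under /dev/disk/by-id serve the same purpose as the serial lookup here.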

Of course I'm taking the day off tomorrow so I won't complete the configuration until Thursday.

Meanwhile, the project is just now coming up from the regular weekly outage - sorry about the delay. Usual stuff, though we added some spike-time-index fragmentation performance tests (which we wanted to do during a quiet time). They weren't all that positive - unclear what the current bottleneck is, though it doesn't seem like either the database or disk/network i/o. Maybe some code somewhere needs optimizing.

- Matt

see comments

7 Feb 2011, 22:26:46 UTC
Wow what a mess this thumper OS install has been. I really don't want to go into details except that I've probably installed the OS at least twenty times over the past week, and that I'm reconsidering my career path (just kidding). It's amazing how stupidly complicated this has been - I'm just trying to get it configured the way that makes the most sense, but looks like we're going to have to stick with what works instead. There has been little pressure to rush this as nothing is directly depending on this system, but given how much of a time sink it has been and the need for its disk space is growing we need to get something going. Also ptolemy rebooted itself last night as a reminder that we really do need to start wrapping things up on this front.

Meanwhile on Friday gowron (the workunit storage server) had a drive failure that locked up the whole system until I came in (on my off day) and forced a reboot. This inspired the whole RAID to resync, which takes at least a day. Fine. We came in this morning and started the projects up and replaced the failed drive... only to have ANOTHER drive fail on the system, locking it up, etc. etc. etc. So the current resync will happen all night, leading us into our regular weekly outage tomorrow.

Oh yeah, during all this I had to force reboot bruno (the general BOINC administrative and upload server) which apparently spiralled out of control last night due to gowron's missing mount.

All the newer systems are still working great, and science database tests and improvements continue along as planned.

- Matt

see comments

2 Feb 2011, 0:13:35 UTC
So bruno is back, and synergy is free to get back to bashing on it with some actual science analysis stuff. We'll still keep a constant rsync of the result uploads happening in the background from bruno to synergy, so synergy can be a "hot backup" for bruno if it comes to that again.

Today during the usual outage we took care of that swap, but I also attempted to get thumper converted (as I mentioned yesterday) to its new role as internal-use mega file server. However, this system has funny disk controllers which renumber the drives upon every boot, making installing an OS quite difficult, being as there are 48 drives in the system and it's hard to tell which ones are the boot drives.

By the way, the lab-wide outage tomorrow (which I also mentioned yesterday) will indeed affect all traffic including uploads/downloads, so expect an hour or two of silence from us in the morning.

- Matt

see comments

31 Jan 2011, 23:44:06 UTC
Tomorrow during the usual weekly outage we're planning to make the switch from "synergy" bruno back to the "real" bruno. The synergy substitute has performed wonderfully, though it did reboot itself yesterday (not exactly sure why but I think a runaway process clobbered the system). This was why the uploads weren't working between yesterday afternoon and this morning. Speaking of outage tasks, I also hope to start converting thumper into a non-database server tomorrow. Going to be a pretty busy day I guess.

Today I've been doing the usual post-weekend cleanup and some analysis stuff. I'm doing a timing test right now on oscar doing some of the hefty reads necessary to generate all the plots for public consumption. On thumper these were taking 50 minutes per a specific time-wise chunk of spikes. Now it's more like 35 minutes per chunk. So that's good, and we haven't even fragmented this table/index yet, and still have plenty of RAM to use for buffer space. Slow, steady progress.

Oh yeah, heads up: There's going to be a lab-wide network outage out of our control Wednesday morning (Pacific time) around 5:30am until 7am. This may not affect the Hurricane traffic, so you might not notice except for not being able to reach this web site, and even then we might be down much less than 90 minutes.

- Matt

see comments

27 Jan 2011, 22:00:38 UTC
The bruno revival project continues: The old bruno is fully configured and we're rsync'ing the results from synergy to it. We're looking to completely flip the systems' identities again on Monday, thus getting us back to normal.

Meanwhile the ptolemy/thumper project continues: The last of the data on thumper we care about is being copied off, and we hope to completely blitz all the filesystems early next week. Then we'll copy everything on ptolemy to thumper and shut off ptolemy for good. I think at that point the only active 32-bit system in the server closet is anakin, which is just doing downloads, so whatever.

We've been really busy, network wise, and just barely managing to keep up with raw data demands. Hopefully this will calm down sooner than later.

Some of the backend processing (and web pages) are gummed up as we're doing a major index drop/rebuild on the science database (which locks some of the tables). Some of the indexes existed in only one physical file (or chunk) - we're now fragmenting these over several files, to allow quicker lookups and/or more simultaneous lookup threads. This index build should wrap up sometime later tonight (hopefully).
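
To give a feel for what fragmenting an index means, here is a toy in-memory model that splits keys into fragments by range, so that a lookup only has to touch one fragment and different fragments can be searched independently. The post does not say which fragmentation scheme is used on the science database, so range-by-key is an assumption, and the boundary values are invented.

```python
import bisect
from collections import defaultdict

def fragment_for(key, boundaries):
    """Fragment number for key: fragment i holds boundaries[i-1] <= key < boundaries[i]."""
    return bisect.bisect_right(boundaries, key)

def build_fragments(keys, boundaries):
    """Group keys into sorted per-fragment lists, keyed by fragment number."""
    frags = defaultdict(list)
    for k in keys:
        frags[fragment_for(k, boundaries)].append(k)
    return {i: sorted(v) for i, v in frags.items()}

boundaries = [100, 200, 300]            # invented split points -> 4 fragments
frags = build_fragments([250, 10, 150, 120, 999], boundaries)
print(frags)
```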

- Matt

see comments

25 Jan 2011, 23:56:04 UTC
Progress. We had our regular weekly outage (mysql backup/compression) during which we continue fixing older problems and tackled newer stuff.

To update the bruno status: I think I solved all its disk problems. I "shredded" the root drives - some lingering vestige of a former partition on there was making the Fedora installer go nuts. Then I was able to successfully get a new OS on there and boot it up. I then managed to upgrade the firmware on the 3ware raid card, which seems to have removed its penchant for making drives go missing upon regular system reboots. So... without any need for additional hardware we got the old bruno ready to assume its old duties again. Meanwhile synergy has been doing a good job pretending to be bruno. By next week sometime we'll be back to where we were.

Meanwhile, I finally had a moment to add the memory recently donated for synergy - so it's up to a full 96GB of RAM (just like oscar and carolyn).

There was some hardware shuffling in the closet, so both oscar and carolyn were shut down during the outage, which means both databases need to flood their caches for a while before the project gets back up to speed. During the oscar reboot I set the data partition to mount with the "noatime" flag - this may help i/o a little bit. Also still messing with raid configuration on those systems. We may see additional performance improvements over time.
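
After a remount it is worth confirming the flag actually took effect. The sketch below parses /proc/mounts style text (device, mount point, filesystem type, comma-separated options, then two numeric fields) and reports the options for a mount point. The device name, mount point, and filesystem type in the sample are invented.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of options for mount_point from /proc/mounts style text, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            return set(fields[3].split(","))
    return None

# Invented sample; on a live Linux system you would read /proc/mounts instead.
sample_mounts = """/dev/sda1 / ext3 rw,relatime 0 0
/dev/md0 /data ext3 rw,noatime 0 0
"""

opts = mount_options(sample_mounts, "/data")
print("noatime" in opts)
# prints "True"
```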

We're also aggressively working on the ptolemy/thumper transformations, which means getting all the stuff on thumper currently off of it so we can reformat everything on that system and have it take over ptolemy's duties (all internal use). I was hoping to do this partition by partition but long ago we decided to make all these partitions on top of a single LVM volume group, which means removing partitions requires a major song and dance - unless we just blow it all away at once. I choose the latter.

- Matt

see comments

24 Jan 2011, 18:37:38 UTC
The problems last week with bruno (which continue, and I'll address below) completely overshadowed problems with our radar blanking suite which suddenly was unable to convert raw data into clean data which can then be split into workunits. So we ran out of work to send out over the weekend, and I was personally unable to do anything to help the effort in figuring out why. However, immediately this morning I spotted the problem. Long story short, this was one of those cases where the wild error messages with impossible number values were obscuring the less obvious real problem, which was simply a configuration file had gone missing. I replaced this file, and new work should be coming down the pike shortly.

Back to bruno: the woes continue with this system regarding its drives, though I am trying a few more things out before I throw my hands up in complete frustration. It would indeed be a shame to simply abandon this server as it has a lot to offer if it works. We'll have our server meeting later today to discuss where to go next on this front - I just wanted to give y'all an earlier than normal "heads up" today given the loss of work, etc.

- Matt

see comments

21 Jan 2011, 0:21:17 UTC
As expected it took about 1.5 days to copy all the results from our failed upload server (bruno) to the new upload server (synergy). I was out yesterday hence the lack of update from me, but nothing could get done until the result copy finished anyway.

Jeff and I tackled the remaining stuff this morning to bring synergy back up, and it's now pretending to be bruno. It's working fairly well except, predictably, the disk i/o subsystem isn't happy with lots of little random i/o's (there are only 4 working spindles on synergy, as opposed to 20 on bruno). Still, it's working heroically to recover from the past two days of data distribution silence.

Meanwhile, what the heck is wrong with bruno? I wish we knew. I've been battling this all day since getting synergy on line. It seems there are fundamental issues that transcend disks/partitions/controllers. Random drives are disappearing, random partitions are disappearing, and this was still happening after taking the 3ware card out of the system entirely... We're stumped. It might just be a cluster of simple problems with confounding symptoms. I give up for now.

By the way, bruno was named after Giordano Bruno.

Also by the way, somebody asked if we should have two upload servers. We used to have the upload server split onto two systems but this wasn't helping - in fact it was making it worse. The problem is not a lack of network bandwidth, but disk i/o. The results have to live somewhere, and require lots of random read/writes. So it's best if the upload server saves the results on directly attached storage. If it is also serving them over NFS (or some equivalent) such that a second upload server can write to them, it's too much of an overhead drag. So the upload server has to be a singular server which also (1) holds the results and (2) does as much of the backend processing on these result files as possible. I think right now the only backend processing on results which bruno does NOT do is assimilation, which vader handles. You might think "why not just have each upload server save the results IT gets on ITS own storage?" Then we end up with two piles of results, randomly split, and then the NFS/mounting bottleneck is simply pushed down the pike to the validators, which need to read both piles at once.

- Matt

see comments

18 Jan 2011, 22:02:28 UTC
Nothing like coming back from a long holiday weekend and having one of your main production servers croak as soon as you arrive. It's a sunny day outside and I was stuck wearing my fleece jacket and fingerless gloves inside a well air-conditioned server closet.

So what happened? Not sure exactly, but bruno (the upload server, as well as the main boincadm administrative server) was all hung up as soon as we started the normal Tuesday outage. I had to reboot it, and that was that - it wouldn't come up properly again.

It seems to be a multiple-part problem. There was a disk failure, and the 3ware card in this system has always given us trouble. What kind of trouble? Well, if you reboot the system (without a full power cycle) random drives go missing. That's kind of a problem, no? I don't think this is a single broken card - a labmate has similar problems with the same model in his system (I forget the model #, but it's 24-channel). Anyway, the big RAID10 holding all the results was tagged as degraded and is rebuilding now.

That's fine, except the OS (which is on separate partitions and not under the jurisdiction of the 3ware card) isn't booting either. Jeez! The good news is I can boot off a Fedora live CD and see both the root and upload storage drives, so there's no data loss. It just won't boot!

The other good news is that, if we need it, we have a backup system already: synergy! It might be getting pulled into prime time sooner than expected. It doesn't have nearly as many disk spindles as bruno, but this might not be an issue - there's still plenty of disk space on it. And a lot of memory for potential file system caching. It's still undecided if we're going to make synergy the new bruno, but I'm at least copying everything there now just to be safe.

I might still be able to get bruno up this afternoon, but if not, looks like we're down for the evening (it'll take that long to copy everything over to synergy).

- Matt

see comments

13 Jan 2011, 23:21:50 UTC
The extra memory for synergy has arrived. I haven't put it in yet - will wait until next week as it's kinda busy right now. I know the server status page says 96GB are in the system - I'm just getting ahead of myself.

It's official: the lab has a gigabit link down the hill. This has been the case for a couple weeks now, actually. It turns out this whole project ended up being easier/cheaper than expected, so the whole lab paid for it. This means we're sharing the link, and it's still separate from our hurricane electric traffic which includes the workunit/result distribution, and which is still capped at 100MBit/sec. So... the only advantage is that we no longer have to throttle our raw data transfers to our off-site data archives, but that may potentially be a huge gain - if we're data starved and/or need data from the archives, we can get it more quickly. We may still be able to put some of our hurricane traffic on this single link, but some hardware is needed (which Jeff is pricing out) and more political bridges need to be crossed.
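To put the difference between the two links in perspective, here is a rough back-of-the-envelope sketch. The 1000 GB transfer size is purely an illustrative assumption (not an actual archive size), and the calculation ignores protocol overhead and the fact that the gigabit link is shared.

```python
def transfer_hours(size_gb, link_mbit_per_sec):
    """Idealized hours to move size_gb (decimal gigabytes) over a link of the given speed."""
    bits = size_gb * 8 * 1000**3                     # gigabytes -> bits
    seconds = bits / (link_mbit_per_sec * 1000**2)   # Mbit/sec -> bit/sec
    return seconds / 3600


if __name__ == "__main__":
    size_gb = 1000  # hypothetical amount of raw data to move
    for label, speed in (("100 Mbit/sec", 100), ("1 Gbit/sec", 1000)):
        print(f"{label}: {transfer_hours(size_gb, speed):.1f} hours")
```

Under these assumptions the gigabit link is simply ten times faster; real-world throughput would be lower on both.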

Oh yeah... thanks for the reminder in the last thread. I reset the purger so results stick around at least 24 hours before being deleted.

- Matt

see comments

13 Jan 2011, 0:04:58 UTC
So synergy is currently acting as a maul replacement. In fact, I turned off maul to keep the temperatures down in our secondary lab where these and other servers are currently located. I've been running all kinds of other tests on synergy that, up until recently, I had been running on vader. In short, we're burning it in.

We continue to work on the disk i/o issues on oscar. When we do these weekly database backups it really gums up the works. Some of you have noticed the assimilators falling behind, etc. Actually Jeff stopped the aforementioned science analysis programs to allow the database to finish up its current extra load. It's also running a weekly "update stats" on various tables. I'm still hopeful we have a solution within our current means that we have yet to try (give more memory to informix, db or raid configuration tweaking, fragmenting the indexes, etc.).

Meanwhile, I'm back to work mostly on some actual science/visualisation stuff.

- Matt

see comments

10 Jan 2011, 23:54:09 UTC
Good news: synergy was waiting for me in a box when I arrived this morning. Even better is that after it shipped last week another donation was given to double its memory, so it'll have a total of 96GB of RAM very soon. Thanks again to Todd and the GPU Users Group for all your generosity and help!

Todd shipped it with an OS so I could make sure it wasn't DOA. Everything looked good, but then there was a comedy of errors trying to locate missing Fedora install CDs, and then trying to burn new ones. This turned out to be amazingly difficult (broken CD burners, broken CD burning software). Ultimately it was lucky that Bob brought in his Mac laptop as that was the first thing in our office that was successful in creating an installer. Jeez. Anyway, the afternoon is being spent getting this system set up with a "general SETI configuration" on our lab bench. I guess you might want a photo of "first light."

By the way, the bus broke down on the way up the hill to the lab this morning. We actually had to get out and walk the remaining 300-500 feet (in altitude). Happy Monday!

- Matt

see comments

6 Jan 2011, 22:15:44 UTC
The informix tweak planned yesterday was postponed and completed today. Why was it postponed? Because the weekly science backup (which happens in the background - doesn't require an outage like the mysql database) wasn't done yet. Normally it takes a few hours. But during major activity it looks like it'll take 10 days! Jeff stopped the ntpckr/rfi processes and that sped things up.

This clearly points out oscar's inability to handle the crazy random i/o's we desire, though to be fair oscar is indeed operating better in its current state than the old science database. There are still MANY knobs to turn in informix-land before we need to add more disk spindles. For example, we still haven't given all the memory available in the system over to informix. The tweak we made today added an additional 20GB to the buffers. Note that it takes about a week to fill these buffers, so we won't notice any improvement, if any, until then.

Meanwhile I've been back to working on my various ntpckr and data testing projects. It's hard to page these pieces of code back into my RAM once they've been flushed to disk - know what I mean?

- Matt

see comments

4 Jan 2011, 23:07:58 UTC
Short message today. Wow, that outage went pretty quick this morning! Of course the replica mysql database on jocelyn is sweating to catch up as I write these sentences, but still.

We plan at least one more tweak on informix, which we'll likely do tomorrow (we only need to stop the assimilators/splitters when we bounce the informix engine, so you shouldn't notice anything except for some red lights on the server status page).

Tracking the new server: last seen in Minnesota this morning (which is closer than where it was in Wisconsin yesterday). It's slated to be here on Friday.

- Matt

see comments

3 Jan 2011, 22:12:30 UTC
Happy New Year!

We seem to have been running rather smoothly since last I wrote. And the servers were nice enough to wait until we were all back in the lab before going crazy. Well, it's not that bad - just lots of tiny problems. The astropulse science database server got stuck in some deadlock and needed a hard power cycle. Then I tried fixing this nagging db_dump problem (it hasn't been regularly generating daily stats dumps) by moving the process from bruno to lando, and lando couldn't handle it - and I ended up needing to hard power cycle lando as well. Having lando reboot caused bruno to freak out a little bit as it was in the middle of a package update when lando disappeared out from underneath it. So I had some rpm database cleanup to deal with to get bruno to start uploading results again. Oy!

Meanwhile we were bringing services up and down in a controlled manner to make some more science database tweaks on oscar. We're still deep into trying to figure out how to improve its throughput in general. We're finding the disks are still a bottleneck (even after the RAID restripe), but if we can get informix to cache more efficiently then disk i/o is less important. Or we'll add more disk spindles. In the meantime we have many knobs to turn.

I got the tracking # for new donated server "synergy" - should be here on Friday afternoon, so we'll start playing with it next week!

Regarding the weekly outages: we're sticking to the general idea which I mentioned recently - generally have our standard Tuesday half-day outage, and perhaps other planned one or two day outages as needed (with some effort to provide ample advance warning), but otherwise leave all public facing data services up and running as much as possible. However, at the first sign of trouble, these services may be taken down for extended periods. The bottom line is you probably won't notice any difference between current operations and those from six months ago (before the weekly extended outages). The goal is to aim for 24/7 uptime while maintaining staff sanity.

- Matt

see comments

29 Dec 2010, 21:18:22 UTC
Last post of the year!

I understand the SETI@home/BOINC back-end is a bit confusing. I think there are only a handful of people who understand all the relationships between every step of the BOINC finite state machine, the SETI@home data pipeline, and every server in our closet. I can only really scratch the surface of these details during these relatively pithy missives.

So let me try my best to quickly answer the general question: "did those two brand new servers (oscar and carolyn) help?"

First of all, there are about 100 known problems with our systems at any given time. Most of these are low priority and "time out" on their own. There are still a bunch of high priority issues, of which these new servers addressed *some*.

One easy problem to fix was replacing mork (the randomly crashing master mysql database server) with carolyn. This has been great... and I think we just (as of this morning) solved the current batch of disk i/o problems (by properly tweaking the write cache settings). Carolyn also has a lot more disks and memory than mork had, so there's still a lot of room to grow as needed.

The other new server, oscar, is taking care of two major problems. First, the science database on thumper was abysmally slow - so now this is on oscar. That's great, however it's not running as fast as we'd like. We still haven't fully benchmarked this, though. Maybe at worst we'll need to add more disks to the system - we shall see. The second major problem oscar is helping to resolve is ptolemy - our internal administrative file server - which, like mork, also randomly crashes from time to time locking everything up. Now that thumper is off the hook as a science database server, it can take over and easily handle ptolemy's current functionality, and then we can retire ptolemy. There's actually a third minor problem that's also getting fixed: thumper's root partition is on a messed-up software RAID device which kinda works but has been scaring us for way too long. It'll be great to have this server-shuffle opportunity to have a quiet moment to reinstall the OS on thumper and fix that RAID.

But that's pretty much it for now. Other major problems exist with no clear solutions. In fact, many of them are data driven, or network infrastructure driven, and therefore out of our sysadmin hands - no server upgrades will solve them.

That said, I hardly want to sound hopeless (and definitely not ungrateful). We're pros at working through our various struggles, and gaining oscar and carolyn has been the largest improvement in years, with still more benefits to come as we get rolling full bore in the new year and more aggressively shake out the remaining configuration problems. Plus a third new donated server (synergy) is coming down the pike, which we'll use to address other current shortcomings. When oscar, carolyn, and synergy are all being used to their fullest potential (a month or two from now?) let's revisit what our biggest system needs are. It may very well be that these new machines did in fact shake out the bulk of our current issues, and we'll be in good shape for years to come.

Happy new year! May we have actually publishable results in 2011 - positive or negative I don't care - it's science either way. We certainly could stand to get something meaningful in the journals concerning all the data we've been reducing for 11 years.

- Matt

see comments

27 Dec 2010, 22:00:31 UTC
Ah, the few days back at the lab between Xmas and New Year's... The university assumes nobody works at this time, so the buses aren't running, and so I have to drive into the lab. But of course they are still handing out parking tickets to people without regular parking permits (like myself, who rarely drives to the lab). So I gotta park elsewhere. Anyway...

Except for bruno (the upload server) having fits we were pretty much running smoothly all weekend. However bruno is also the main BOINC back-end administrative server for the SETI@home/Astropulse project, so when it has fits, everything kinda gums up. We couldn't get into bruno remotely (full process table?) so it waited until this morning when Jeff got in and rebooted it.

There was some cleanup after that, and we seemed out of the woods, but we're still having these mysql issues where the database enters these long periods of flushing pages to disk. We all agree that this is largely due to the increased demand (after all the long/short outages over the past two months, and perhaps a bout of short runners). Increased demand means more deltas, which in turn means more fragmented pages. We have these weekly outages to defragment the database, but given the load it's like 3-4 weeks of fragmentation within one week. We're thinking the outage tomorrow will largely fix this, but we're still tuning other stuff in the meantime. We already gave mysql access to more memory, but Bob predicted this wouldn't help, and he was right. He's trying other stuff now.

So the plan is to hang on, have just the normal outage tomorrow, then be up (as best we can) the rest of the week and throughout the New Year's Eve weekend. Then in the new year we can really start squeezing these new servers and see what they got.

Oh yeah - I turned off the "resend lost results" for now to reduce the load on mysql. This is temporary.

- Matt

see comments

22 Dec 2010, 21:25:56 UTC
Everything on the oscar-raid-restripe project went along swimmingly, and I was able to take care of several steps at home last night so that we could get the whole project back online in the morning today.

Funny aside #1: at the exact moment I was finally comfortable enough to issue the command that blew away the old raid device from my home terminal, the network connectivity in my house randomly disappeared. Needless to say this incited minor panic as I suddenly couldn't reach any SETI servers and thought I must have somehow locked up the whole server closet. Eventually I figured out it was a local problem: my home router had rebooted itself, and I was soon back in business. Phew.

Funny aside #2: turns out when I installed the OS on oscar/carolyn I didn't bother removing the NetworkManager package, which gets installed by default for reasons beyond my power of conception. I've ranted about this before but it seems the only reason NetworkManager exists is to generate completely random, unexpected, obfuscated network problems on systems in order to give network administrators (a) something to do to kill time, and (b) job security. In this case, without cause or prior consent it added funny loopback address statements in /etc/hosts which caused remote informix connections to fail. I immediately removed NetworkManager at that point and added an "exclude=NetworkManager*" line to yum.conf (which I should have done before).
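As an illustration of the kind of check that would have caught this sooner, here is a minimal sketch that scans hosts-file text for non-localhost names bound to a loopback address. The sample entries (including the 10.0.0.5 address) are made up for the example, and treating any name containing "localhost" as expected is a simplifying assumption.

```python
import ipaddress


def loopback_names(hosts_text):
    """Map each hostname bound to a loopback address to that address."""
    found = {}
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not a parseable address; skip the line
        if addr.is_loopback:
            for name in fields[1:]:
                found[name] = fields[0]
    return found


def suspicious_names(hosts_text):
    """Loopback-bound names that are not some variant of 'localhost'."""
    return {name: addr for name, addr in loopback_names(hosts_text).items()
            if "localhost" not in name}


if __name__ == "__main__":
    # Hypothetical hosts file with a real hostname wrongly tied to loopback
    sample = (
        "127.0.0.1 localhost.localdomain localhost oscar\n"
        "::1 localhost6\n"
        "10.0.0.5 carolyn  # made-up address\n"
    )
    print(suspicious_names(sample))
```

A machine whose own hostname resolves to a loopback address tends to confuse services that hand that name or address to remote clients, which is consistent with the connection failures described above.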

Anyway, right off the bat there seems to be at least some improvement with the smaller stripe size on the raid, but we're collecting i/o stats under load and have some tweaks in mind which may continue to help general performance. All told, this ended up just being a one-day outage well worth taking.

Meanwhile, upon turning everything back on mysql started exhibiting some old, bad habits. I haven't seen this behavior in a long while, but sometimes when it gets hit hard enough it says, "Yo! Back off! In fact, I'm going to block all incoming queries and write to disk for 10-20 minutes. I won't tell you why, and there's nothing you can do about it." Fair enough - but if you were having trouble reaching the website an hour ago, that's the reason.

That's pretty much it for this week. It's officially the Xmas holiday starting tomorrow. Whatever you choose to do, enjoy! I'll be checking in from home over the "time off," but we're hoping it's kind of a "set it and forget it" kind of weekend instead of an "oh well everything will be down until we return on Monday" kind of weekend.

- Matt

see comments

22 Dec 2010, 0:39:00 UTC
Today's Tuesday, so we had our usual "mysql reorg/backup" outage, but everything is still down as we continue with the "oscar restriping" project. Most of the data has been copied from oscar to carolyn as I write this, but this will take well into the evening to complete. Later on tonight I hope to restripe the RAID partition and start copying everything back - then we'll be ahead of the game. Otherwise, we'll just start copying everything back tomorrow morning.

Meanwhile, I turned the web site features back on but we're leaving the rest of the project down to reduce i/o contention during these major copy/transfer phases.

Not sure if anybody noticed but the lab had some major DNS issues for a while today. This was out of our jurisdiction, but still affected us - we couldn't see our own web sites/servers from within the lab, but it was clear from the httpd logs that (at least some) people were able to connect. Weird stuff.

- Matt

see comments

20 Dec 2010, 23:43:02 UTC
Over the weekend we "caught up" with workunit demand, given the low workunit limits that were set, so I set the limits to the effectively-infinitely-high setting. Of course, this was around the time the software blanking suite of programs needed to be kicked forward a couple times (we're not sure exactly why they hang - they just do). So we were low on raw data for a while there.

You may notice a new addition to the server status page. Jeff has been folding his NTPCkr/RFI code into the BOINC management fold, so those processes are on the page now. They are currently running on maul, which is a compute server donated a while ago that's been working "behind the scenes." It's a nice system, but it loses contact with its keyboard more often than not - something weird about the USB bus. So the ability to login isn't guaranteed unless you ssh in. That alone is enough to make me insist it never rises above "compute server" status.

We're also trying to max i/o on oscar in preparation for the big restripe project tomorrow. This is why the assimilators are currently off. We're doing a database backup today, then moving all the database files to carolyn tomorrow, and on Wednesday restriping oscar and moving the database files back. That's the plan, anyway - hopefully this will all be done by Thursday morning, which is when I'll try to start everything up again.

- Matt

see comments

16 Dec 2010, 22:49:16 UTC
We're back to shoveling out workunits as fast as we can. I mentioned in another thread that the gigabit link project is still alive. In fact, the whole lab is interested in getting gigabit connectivity to the rest of campus, which makes the whole battle a lot easier (we'll still have to buy our own bits and get the hardware to keep them separate). Still, it's slow going due to campus staff cutbacks and higher priorities.

With the heavy load on oscar (splitting and assimilating full bore) I got some good i/o stats to determine how much we should reduce the stripe size on its database RAID partition. This will be enacted next week during the return of the 3-day weekly outage. It's unclear how regular these extended weekly outages will be - we'll figure that all out in the new year.

But back to oscar... we were pushing it pretty hard today - almost too much. It looked like we were about to run out of workunits for a minute there but I caught it just in time. We're still trying to figure some things out.

By the way, I think there was some general maintenance around the lab, which may have caused a temporary network "brown out."

- Matt

see comments

15 Dec 2010, 23:57:44 UTC
We're still struggling to get raw data on line fast enough to keep up with workunit demand.

I should point out that the system that locked up over the weekend is an otherwise great file server that was graciously donated to us by Overload Storage along with continued speedy tech support and free replacement drives whenever we ask. The catch is that we're officially beta-testers on this funny version of the OS. So the system unexpectedly locks up, I dunno, once a year? That's more than acceptable as it's just a raw data storage server, and the upshot of this system locking up is, at worst, a temporary dearth of workunits. Far worse things can happen.

ALSO - more importantly - if this system didn't fail we would have run out of raw data anyway and be in the same boat we are currently in. Maybe even sooner.

Anyway... Another hangup was a cheap gigabit switch that fried a month ago and that we replaced with a cheap 100 Mbit switch. This was being used to connect our non-closet servers with the closet. During the mega-outage, fast connectivity wasn't necessary. However, when we started to split/assimilate again it became apparent we needed to get that stuff back on a gigabit link. So one is on order, and in the meantime Jeff dug up a tiny 5-port switch so a few of the machines (that are splitting and running the software blanking suite) are talking full speed again to the closet. This may improve the workunit shortage situation over the next couple days.

- Matt

see comments

14 Dec 2010, 23:19:15 UTC
So over the weekend we had a drive failure on our raw data storage server (where the data files first land after being shipped up from Arecibo). Normally a spare drive should have been pulled in, but it got into a state where the RAID was locked up, so the splitters in turn got locked up, and we ran out of workunits. The state of the raw data storage server was such that the only solution was to hard power cycle the system.

Of course, the timing was impeccable. I was busy all weekend doing a bunch of time-pressure contract work (iPhone game stuff). Dude's gotta make a living. I did put in an hour or so on both Saturday night and Sunday afternoon trying to diagnose/fix the problem remotely, but didn't have the time to come into the lab. The only other qualified people to deal with this situation (Jeff and Eric) were likewise unable to do much. So it all waited until Monday morning when I got in.

I rebooted the system and sure enough it came back up okay, but was automatically resyncing the RAID... using the failed drive! Actually it wasn't clear what it was doing, so I waited for the resync to finish (around 4pm yesterday, Pacific time) to see what it actually did. Yup - it pulled in the failed drive. I figured people were starved enough for work that I fired up the splitters anyway and we were filling the pipeline soon after that.

In fact, everything was working so smoothly that we ran out of raw data to process - or at least to make multibeam workunits (we still had data to make astropulse workunits). Fine. Jeff and I took this opportunity to force fail the questionable drive on that server, and a fresh spare was sync'ed up in only a couple hours. Now we're trying our best to get more raw data onto the system (and radar blanked) and then served out to the people.

Meanwhile the new servers, and the other old ones, are chugging along nicely. The downtime yesterday afforded us the opportunity to get the weekly mysql maintenance/backup over early, and I also rigged up some tests on oscar/carolyn to see if I can indeed reset the stripe sizes of the large data partitions "live." The answer is: I *should* be able to, but there are several impossible snags, the worst of which is that live migration takes 15 minutes per gigabyte - which means in our case, about 41 days. So we'll do more tests once we're fully loaded again to see exactly what stripe size we'd prefer on oscar. Then we'll move all the data off (probably temporarily to carolyn), re-RAID the thing, then move all the data back - should take less than a day (maybe next Tuesday outage?).
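The arithmetic behind that 41-day figure can be sketched as follows. The partition size isn't stated above, so the roughly 3.9 TB value below is back-derived from the two quoted numbers (15 minutes per gigabyte, about 41 days) rather than measured.

```python
MINUTES_PER_DAY = 24 * 60


def migration_days(size_gb, minutes_per_gb=15):
    """Days needed to live-migrate size_gb at the given per-gigabyte rate."""
    return size_gb * minutes_per_gb / MINUTES_PER_DAY


def implied_size_gb(days, minutes_per_gb=15):
    """Data size (GB) implied by a migration lasting the given number of days."""
    return days * MINUTES_PER_DAY / minutes_per_gb


if __name__ == "__main__":
    size = implied_size_gb(41)  # back-derived, not a measured partition size
    print(f"Implied data size: {size:.0f} GB")
    print(f"Live migration time: {migration_days(size):.1f} days")
```

Compared with an estimated day for the copy-off, re-RAID, copy-back route, it is easy to see why the live option was dropped.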

- Matt

see comments

7 Dec 2010, 23:57:54 UTC
Today was a "normal" Tuesday outage to back up the mysql database. You may have noted the result table sizes have dropped considerably since we turned on the "resend-lost-results." Hopefully this solved a lot of the ghost workunit problems people have been wondering about forever. If the database can handle it, there's no reason not to leave that setting as is. A lot of people also noticed the server status page line "Results returned and awaiting validation" should really read "Results returned and awaiting validation as long as all the other back-end queues are zero." So most of the time this reads correctly, but if there's a large backlog somewhere this can be quite misleading. It's a painful query to get exactly what we want all the time, so fixing this is low priority.

Meanwhile, after the outage we started the splitters up (though there were some initial configuration snags that required a quick shut down and restart). Actual new work is being generated and sent.

So here we are.

<sound of champagne cork>

Well, not so fast. I'd say we're "at the light at the end of the tunnel" as far as the public side is concerned, but there is still major cleanup on the inside before we're fully out of the tunnel. Some agenda items include:

1. Getting oscar up to speed: Right now it's operating pretty much as fast as thumper (which seems disappointing at first), though without using any CPU or disk i/o (which means it's able to do a LOT MORE if we tell it to). That's because informix is configured exactly as it was on thumper, so there are some artificial bottlenecks in place. We're collecting stats to understand what knobs to turn, and then we'll really crank them up.

2. Converting thumper to its new role as internal file server: Remember that our main internal file server (which houses a bunch of important, heavy-random-access data and accounts) is as much of a crashy liability as mork was. So this conversion still needs to take place, but can happen over time while we're live.

3. Basic electrical stuff: Jeff and I tried to move as much around as possible, but there are still some server closet power issues to address.

4. All the tiny specks of sysadmin revolving around replacing old servers with new ones (dangling mounts, dead entries in /etc/hosts, zillions of scripts referring to now-defunct paths, etc.).
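As a rough illustration of item 4, here is a minimal sketch for hunting down leftover references to retired machines in config files and scripts. The sample text and paths are invented for the example; only the name "mork" (the retired mysql server) comes from the posts above.

```python
import re


def find_stale_refs(text, defunct_names):
    """Return (line number, name) pairs for each whole-word mention of a defunct name."""
    if not defunct_names:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in defunct_names) + r")\b"
    )
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


if __name__ == "__main__":
    # Hypothetical script contents; the mount paths are made up
    sample = (
        "#!/bin/sh\n"
        "mount mork:/export/db /mnt/db\n"
        "echo backup done\n"
        "scp dump.sql mork:/backups/\n"
    )
    for lineno, name in find_stale_refs(sample, ["mork"]):
        print(f"line {lineno}: still refers to {name}")
```

The whole-word match keeps a name from matching inside longer words, at the cost of missing references that are assembled dynamically in a script.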

I'm also busy revving up the engine to start sending out the annual end-of-the-year news/funding drive mass e-mail. I know many of you already donated in some form or another (thank you!) but this sort of thing needs to happen. I apologize for any redundancy on this front.

- Matt

see comments

2 Dec 2010, 23:32:50 UTC
Short status update: I turned on both the result/task pages and the resend-lost-results - the latter of course clearing out the pipes (still clogged with various ghost workunits).

The science database is fully loaded on oscar now, and we're now rebuilding all the indexes, running "update stats" on all the tables. This is taking a bit longer than we thought, though we have some knobs to turn to speed things along after the current set of queries pushes through. We're still looking at maybe starting the assimilators on Monday, and then new work creation on Tuesday. While not far fetched, that's still being optimistic. The database on thumper is indeed turned off. I'll adjust the server status page once oscar is able to show a green "running" status.

Jeff and I did some more server closet cleanup, but nothing noticeable in a picture.

- Matt

see comments

30 Nov 2010, 23:10:57 UTC
As planned, we are now recreating the master science database on oscar using the cleaned-up backup dump from thumper. This should take about a day. We were worried about the slow disk i/o when we started this process - isn't this new machine supposed to be faster? Well, I dug into the RAID config on oscar a bit and tweaked a few parameters which quickly sped up the disk i/o to roughly 900% better than this morning.

While this is going on Jeff and I tackled the closet some more - today's job included more power cable organization, but we mostly worked on rewiring all the ethernet cables so they were more orderly. Perhaps you noticed various servers going down at random as we unplugged/replugged systems one by one. Here's where we're at now:

- Matt

see comments

29 Nov 2010, 23:19:13 UTC
For those who were celebrating, hope you had a lovely thanksgiving weekend. Things around here were fairly mellow, though progress continues.

The thumper-to-oscar conversion is still on schedule. In fact, just minutes ago we dropped the old spike table now that all those spikes were copied into the current spike table. It's amazing how fast you can drop a table containing 1.3 billion rows, though I did feel a disturbance in the force.

This morning we stopped the assimilators so the database is in a quiescent state. We're now backing it up one last time, and tomorrow morning we hope to "recover" oscar using this backup, which means it'll get populated with all current scientific data. This may take a day, and then we'll burn it in by starting up the assimilators again, maybe run some NTPCkrs. If all goes well we're still on for opening the floodgates again early next week.

In the meantime Jeff and I continue to rearrange the closet, cleaning it up, shuffling servers between racks and breakers to regain some organizational sanity. We also were tired of having the kvm monitor on top of the rack which was hard to read from down below. This picture shows the current status of things, including the monitor now nicely at eye level.

- Matt

see comments

24 Nov 2010, 22:49:14 UTC
Informix is running on oscar and is now initializing all of its dbspaces. We hope to start moving the science data over in the first part of next week.

see comments

23 Nov 2010, 20:59:01 UTC
Okay then - after some extreme DBA this morning carolyn is now the master mysql database server and jocelyn is the replica. So that project is officially DONE! Actually, there's a lot of low-priority cleanup to deal with, but all the main plumbing is working and the projects are back up such as they are.

Now all server side focus is on oscar. By far the most important thing to fix during this major long outage was our science database - getting a new mysql database rolling was just icing on the cake. But I guess we still need to finish making the cake. Most of our i/o bottlenecks over the past few years have been somehow linked to thumper (both as a database and file server) so getting this done is essential before we get back on line.

Jeff found a comprehensive list of missing spikes (which I mentioned yesterday) and will begin inserting those. We'll then eat some turkey, then have an all-hands-on-deck week next week to get oscar going. We simply cannot get back on line before then, and so we're still looking at new workunits being generated a couple weeks from today at the earliest. I guess if we're really lucky it'll be sooner, but highly doubtful. I know we're anxious to get rolling again, but remember that when you're dealing with billions of rows of data (in the form of a terabyte of raw files), each step takes many hours no matter how clever you are or how fast you type. It's also easy to get lost in theoretical maximum speeds, which never take into account (a) the dizzying array of initial preparations before even starting, (b) actual speeds, (c) the many extra steps necessary when being careful (like backing up a database one last time before dropping a table containing a billion rows), and (d) unpredictable software/hardware behavior requiring us to go back N steps in the cookbook and try again.

- Matt

see comments

22 Nov 2010, 18:50:27 UTC
I'll write today's message early as this week is a short holiday week so we're kinda busy.

First and foremost, carolyn is now the *only* mysql replica - I just turned the other replica (the troublesome server mork) off, perhaps for good. Yay! That's one of the two new servers more or less ready for prime time, though we still hope to make carolyn the master (and jocelyn the replica) today or tomorrow.

We're still far from getting the whole project back on line - we have the other new server, oscar, installed and ready to roll, but still need to (a) install and configure informix on it, (b) clean up the science database on thumper, and then (c) transfer all the data from thumper to oscar. This may take a while - the spike merge (which was the last major part of the "clean up") did finally complete last week (after running about 2-3 months) but there was still a discrepancy of about a million missing spikes which Jeff is successfully tracking down. So there are a few extra merges to do yet. We probably won't really dig into getting oscar on line until after Thanksgiving.

Of course, what's a weekend without an unexpected server crash or two? On Saturday afternoon a major lightning storm swept through the Bay Area. Other projects in the lab (located in the other building) had major power outages. Luckily we were spared a full outage, but apparently a couple of our servers got hung up around this time, perhaps due to some kind of non-zero power fluctuation. The servers were thumper and marvin - each located in different rooms, and on different breakers. It is funny that these two machines are our current two informix servers (thumper holds the SETI@home scientific data, and marvin holds Astropulse). So there was some cleanup to deal with this morning (database/filesystem recovery, hung mounts, etc.) but really no big shakes and we're back to normal (whatever normal is these days). Both systems were on surge protectors so I'm not sure why they were so sensitive - maybe the crashes were random and the timing was coincidental with the storm.

- Matt

see comments

19 Nov 2010, 0:41:31 UTC
Today we got carolyn up and running as a mysql replica - if all goes well both mork and carolyn will be replicating from the master database on jocelyn without hitch over the weekend. This will be a good "burn in" to prove that carolyn is handling the job, and we'll make it the master next week. Maybe. It is a short week due to the holiday, but then again we're being more aggressive than normal to push projects forward.

To answer some questions from yesterday's thread:

First and foremost, in that picture oscar is on top.

And that picture of the "old bruno" is not the "new bruno" which is still being used. The "new bruno" is actually the "old bambi" which is why you don't see bambi anymore on the server status page.

We did confirm with an HP technician that, ventilation-wise, it is okay to stack these servers on top of each other. However, this is likely not to be their final resting place in the racks. We might clear up space in the middle rack where the rails should work and put oscar there. Or install yet another shelf in the current rack if there's space.

Yeah, paypal is still not accepted by the University. This is completely out of our control. Despite major griping by us and other groups around campus, not to mention obvious benefits for using paypal for donations, there are various legal and bureaucratic reasons that make this a non-starter.

Those books in the background of that picture are actually a dictionary and thesaurus, neither of which have been actually cracked open in, I dunno, a decade?

- Matt

see comments

17 Nov 2010, 23:19:37 UTC
Here are some initial pictures/notes regarding the newest additions to our server family, oscar and carolyn (bought via the kind donations of several generous SETI@home participants).

This is the old version of the server bruno, along with its fibre-channel attached disks. This machine used to be the upload server, as well as the result file storage, and therefore also handled many result-related tasks. But it was having too many problems related to the funky RAID setup and limited storage, so we migrated all of its functionality to a bigger/better server and, yesterday, finally pulled this out of the closet to make some room in the racks.

Here are the boxes the two servers came in, already unpacked. A lot of time was spent figuring out how to use and install the rail systems that came with, only to discover at the last minute they wouldn't fit in any of our racks. I think in our entire history we were successfully able to properly rack mount 2 servers. Okay maybe 3.

Here are the two servers sitting together in the rack on a shelf where the old bruno used to be. Actually maxwell (the old BOINC web/alpha server) was sitting in that space too, and was moved one rack over. They already have their RAIDs configured and OSes installed. We're now tackling the database configurations and installs. We'll put mysql on carolyn and informix on oscar. The server cut off at the bottom is ptolemy, which will be replaced by thumper once oscar replaces thumper.

More to come as we progress...

- Matt

see comments

3 Nov 2010, 18:39:10 UTC
Quick update during our mega-outage. All the bureaucracy is behind us - new servers were ordered days ago, just waiting on those to arrive and doing major database cleanup/etc. in the meantime.

To that end, among other things we've been trying to drain all the outstanding workunits/results as much as possible, but in a sane, orderly fashion. I just turned on the file delete/database purge processes, but only *after* granting all pertinent credit to users/hosts/teams (regardless of overdue/wingman status). I'm talking about 3,290,000 results credited over the past 24 hours. I may have to do this granting again once this first round of cleanup is over.

- Matt

see comments

28 Oct 2010, 21:49:45 UTC
We've decided to keep the project down until the new servers are up and running and the databases migrated to them.

The forums will stay up.

The back end and the upload server will stay up until we clear the outstanding results.

The time line we are looking at is about one month - two weeks for the servers to arrive and another two to get them going. We'll see as time goes on whether or not that's too aggressive.

The down time will be used for preparing the databases for migration. For example, on the science side, we can finally finish a big merge of the spike table and drop the old spike table. This will make the database smaller and easier to migrate.

We will also use the time for science processing and analysis.

More later...

see comments

28 Oct 2010, 18:29:47 UTC
The order is out the door and we expect to have the new machines in hand in 2 weeks.

We are getting two identical HP servers, each consisting of:

A Proliant DL180 G6 chassis with redundant power and fans.
The chassis has 12 drive bays and these are all populated with 1TB 3G SATA drives
2 Xeon quad core E5620 processors.
96GB RAM as 6x16GB DIMMs. This allows for doubling the memory while still using the original DIMMs.
An external (unpopulated) 12 bay drive cage. We may well need this for the science DB server (oscar).

see comments

28 Oct 2010, 1:27:39 UTC
Just a quick note. Obviously, jocelyn is up. Mork is recovering.

The purchase orders for both oscar and the new mork went out late today or will go out early tomorrow. It takes a while for these things to work their way through the purchasing pipeline.

We decided to go with HP for these machines. They gave us a very good deal. We are getting two identical (oscar class) machines. I'll post the specs in another note. We hope to have them on hand in about 2 weeks.

At this point, we are discussing what we will do between now and when the new servers are on line.

see comments

22 Oct 2010, 3:25:35 UTC
Well, bummer. The boinc db on jocelyn crashed last night. The mysql message made mention that the crash could be due to file system cache corruption. So I rebooted jocelyn in hopes of clearing this. I then ran checks on all of the tables and did a backup in case we need it to get mork going again as the replica.

I will attempt to start the project tomorrow morning, pacific time.

see comments

21 Oct 2010, 1:50:22 UTC
Our capacity is a bit dicey right now. So to keep things from getting out of hand while we are not watching, we are running uploads only overnight and will turn on downloads tomorrow morning (pacific time).

see comments

20 Oct 2010, 19:18:15 UTC
The good news is that forums are up and the projects will be up soon.

The bad news is that the work limits will be quite restrictive for a while. This is because we swapped the boinc mysql db master and replica servers. The master had been mork and mork has just become too unreliable. The new master is jocelyn and it has less than half the memory of mork.

The good news is that the bad news is temporary, because yesterday we ordered a new mork! It should be here in a couple of weeks. Details on this will follow. BTW, we also ordered the new science db server!

We'll have to feel out the outage schedule over the next few weeks. Thank you for your amazing support and your patience.

see comments

13 Oct 2010, 18:55:52 UTC
I'm starting a thread to let people know what's going on with the mork (our boinc DB server) issue.

As most of you know, mork will sometimes hang, requiring a power cycle to boot. There are no footprints as to what causes this. So we strongly suspect hardware.

Mork has a sister machine (mindy, of course) that never really worked (both are donated, used, HW). So mindy is mork's parts machine. This is a little dicey because we don't know why mindy did not work.

The RAM in these machines is arranged on 4 daughter boards. Last week we swapped all four of mindy's identically populated memory boards into mork. But at least one of the "new" sticks was bad because mork then showed differing amounts of memory across subsequent boots.

So we returned mork's original memory and ran the first three memtest tests. They showed no error. The final several tests are very time consuming and we may or may not do them, as mork's OS is down for these tests.

Today, we swapped mindy's two power supplies into mork. This is not because we strongly suspect the power supplies but because this is an easy exercise.

If mork hangs again, we are likely to replace the entire machine. Further component testing is becoming too cumbersome and time consuming. And after all, we now have the funds to do this because of your very generous donations (thank you!!!).

see comments

12 Oct 2010, 2:37:25 UTC
I apologize for the low limits for this run. We really wanted to get through the run without mork locking up. This, plus we did not start coming off the 90Mbps max until today. There was apparently quite a backlog of demand. In addition, we are still tuning the scheduler on bane.

We will get good compression with the boinc DB reorg tomorrow and I plan to bring the project up with the high limits the next run.

see comments

6 Oct 2010, 17:57:16 UTC
It's been a painful week, but with some progress.

The server run before last was cut short by our upload space filling up. That was fixed by the bruno migration and we started the last server run a bit early.

But a crash of our primary boinc db machine, mork, got the secondary db server, jocelyn, out of sync. That meant that all of the read only queries had to go to mork instead of jocelyn. This overwhelmed mork and I turned off web access just so the server run could continue. Then mork crashed again Monday evening. Ouch.

Yesterday, we did our normal backup of mork and are recovering jocelyn from that today. The forums are up, but result viewing is disabled at the moment. We need to clear the back end queues ahead of the next server run and mork resources are needed for that.

Mork's tendency to crash seems to have accelerated. Perhaps this is secondary to the cooling crisis we had a couple of weeks ago. Actually, "crash" is not the correct term. It simply hangs and requires a power cycle to boot. Fortunately, we have mork on a networked power strip and can power cycle it remotely. Upon boot, there are no footprints whatsoever as to the cause of the hang. This sounds like hardware. So today we are going to bring mork down to swap out all of the memory and remove a couple of unused components in a desperate attempt to fix the problem. The forums of course will be down during this operation.

see comments

25 Sep 2010, 20:22:56 UTC
Reality got a little ahead of us on this one. We were days, a week tops, away from migrating upload service from bruno to bambi. This will double our upload space and allow us to turn off bruno.

We will now move this up and make it top priority come Monday. We need to reconfigure the raid on bambi and then let the raid sync. At that point we can both turn the projects on and start migrating the results from bruno to bambi. We hope that this will be early in the week. We'll then leave the projects on through next weekend, i.e. no normal 3 day outage.

see comments

23 Sep 2010, 18:10:36 UTC
Sorry about the extended two-day website brown-out just now. The mysql database server crashed during the "re-org," so that had to be restarted, then it crashed *again*. We didn't get a successful backup out of the thing until last night. That's a little bit annoying, and a little bit worrisome.

Let's see.. it's been a while since I put forth a litany of server issues. Except for the a/c debacle last week everything has been more or less status quo, but this week there was extra shuffling. Allow me to elaborate:

There have been some interesting unexpected consequences due to these extended weekly outages. For example, the amount of results hanging out in the mysql database has pretty much doubled (growing slowly but consistently over the past two months), which is causing minor indigestion: the database backups and re-orgs take much longer, and workunits and results are hanging out on disk much longer (and filling up their respective disks). But also some power users are trying to return hundreds, perhaps thousands, of results in a single scheduler request. This last thing was an issue because these requests were failing due to an apache request-limit-size bottleneck, and then the scheduler itself would barf on it. Well, the thing is, up until this week the scheduler had been running on anakin - one of the last few 32-bit machines in our closet. A new scheduler was built and tested to work on 64-bit systems. Long story short, this week we moved the scheduler onto bane, which was an under-utilized 64-bit machine just handling one half of the workunit downloads. And moved bane's downloads onto anakin. This was done via ip address swapping, so no worries about DNS rollout. We'll try this out either today, or when we open the floodgates tomorrow. By the way, we're looking into the "ghost" issue. That might explain the aforementioned "result indigestion" or at least part of it.

Also the server has been suffering from OS rot, getting hit by several simultaneous web spiders, and just plain getting outdated and outgrown. It has served us well, but we finally bit the bullet and moved all that functionality to a newer, faster, better system and so far so good.

Fairly soon I'm going to blow away the current filesystems on bambi now that marvin is the trusted Astropulse database server. This should be quick, though I expect some snags (we had trouble before on this system having the BIOS recognize the 3ware RAID volumes as bootable drives). Once that's done we'll start moving all of bruno's functionality to bambi, and finally retire bruno (another flailing, troublesome 32-bit machine).

We're still trying to nail down the exact specs of the new science database server - Jeff has been doing some additional research regarding CPU upgrades - but that'll get purchased really really soon I swear.

- Matt

see comments

17 Sep 2010, 18:06:06 UTC
Except for a couple of minor details, we have decided on the new science database machine. We again want to thank the donors very much for making this purchase possible! We received $9K in donations earmarked for this server. We received another $1K during the donation drive period that was not earmarked. We're calling it $10K donated for the server. The machine we are getting is around $13K with tax and shipping.

We are planning to get a Silicon Mechanics iServ R515.v2.1 outfitted as follows:

CPU: 2 x Intel Xeon E5620 Quad-Core 2.40GHz, 12MB Cache, 5.86GT/s QPI, 80W, 32nm
RAM: 96GB (12 x 8GB) Operating at 1066MHz Max (DDR3-1333 ECC Registered DIMMs)
NIC: Dual Intel 82574L Gigabit Ethernet Controller - Integrated
Management: Integrated IPMI 2.0 with KVM over LAN
Ext. SAS Connector: External SAS / SATA Connector for JBOD Expansion (SFF-8088) - Integrated
Drive Set: 12 x 1TB Seagate Constellation ES (6Gb/s, 7.2K RPM, 16MB Cache) 3.5" SAS
3ware 9750-4i, 6Gb/s SAS/SATA RAID (4-Port Int) 512MB Cache & BBU
Power Supply: Redundant 1200W high-efficiency power supply with PMBus - 80 PLUS Gold Certified
Warranty: Standard 3-Year Warranty

This system has 24 drive bays, of which we are initially populating 12. We are giving it the maximum possible memory w/o going to 16GB DIMMS (which I think it will take but is not a normal option and would be very expensive).

see comments

15 Sep 2010, 20:21:10 UTC
This has been a very difficult several days and we are still far from out of the woods.

This past Saturday morning the air conditioning in our server closet started acting up, apparently cycling on and off. Around noon that day, we deemed it bad enough to come to the lab. It's a good thing, because the AC was completely down when we got here. We shut most machines down and restarted the AC. It seemed to hold. But later that day our monitors showed the temperature increasing again, even with a small number of machines running. We came back to the lab and shut down everything except the web servers. That small load is OK even with no AC.

That's the way it has been, off and on, since. The physical plant people have been here several times. They have been doing a good job, even though low staffing levels have cut into the time that they can give us. The current diagnosis is that the AC has a bad condenser fan. Now it is a matter of getting the part - not trivial, unfortunately. In the meantime, they rigged up a piggyback fan, which did help some. Just not enough to run the project.

We're hoping that the new fan gets here soon.

see comments

10 Sep 2010, 20:01:49 UTC
Uploads are disabled for the moment.

see comments

4 Sep 2010, 0:04:12 UTC
Hi All,

This will likely be the last "server run" post, unless we change things. The limits are as they have been:


see comments

27 Aug 2010, 16:56:33 UTC
Same protocol as last week. We're first letting the uploads peak and trail off before starting downloads. We're battling a workunit storage problem and the uploads will push a lot of work all the way through file deletion.

The limits for the run are:


see comments

20 Aug 2010, 15:18:54 UTC
We're starting with just the uploads for an hour or two. This was suggested on the forums as a way to minimize timeouts. Also, we need to clear some workunit storage space and moving completed work through quickly will do this.

As for the limits, I accidentally started with the ending limits last week! But it was OK so that's where we'll start this week. I may raise the limits even more as the run progresses and we come off the peak.

see comments

19 Aug 2010, 21:58:18 UTC
Hey gang. Another week slips by much faster than expected. Maybe it seemed fast because I've been lost in a land of javascript, php, broken web standards, pointless browser differences, and ultimately little final results. What's this all about? I'm working on some more fun features for the NTPCkr candidate public voting pages coming down the pike. For example, a way to easily zoom into these waterfall plots to closely inspect interference near candidates. There's some neat flash/javascript based graphic packages out there that sort of do this, but underneath the flashy good looks it's all clumsy and client side and can't handle the amount of data we're pushing out. So I'm rolling my own tools, after trying out another javascript based package that should have been plug and play but was more like just plug.

This should have been easy, but nothing works as expected on the WWW. It's becoming a major time sink, though I'm close to finishing one test example - which only works on Chrome. And Chrome does this terribly annoying thing of resizing images however it sees fit, with no option for (a) users to turn this off or (b) web designers to force a certain size. One general problem I have with the internet and all related technology is that there are way too many people who implement "practical" features with zero thought about design, and somehow even less consideration for the actual designers. I swear - I don't know how anybody does web development full time without stabbing themselves in the eye with a fork. It's like being a surgeon who only has access to a random pile of variably sized band aids. And you're asking yourself, "well how do I make an incision?" and the experts reply, "well, duh, you use the wrapping and make a papercut, you n00b!" Anyway...

Server wise, the databases are playing nice this week thus far, and the mysql replica is working and caught up for the first time in a week. We had some issues with the upload storage just before the planned start of the outage on Tuesday. This is just one of those things that will go away in time as server shuffles continue. Bob is working on getting Astropulse copied to its new server. I didn't have much time for any other upgrades beyond that, but have been helping Jeff brainstorm through the current NTPCkr performance issues. Oh yeah - he's running the show tomorrow and may try the "only let uploads through at first" for a couple hours upon opening the floodgates.

Hunh. Just noticed now our workunit storage server is quite full again. Well, other things are stored on that server and I'm finding one of the causes of bloat is the db purge archives, which hold all workunit/result information from the mysql database as flat files, written out before the rows are deleted. If we didn't purge these from mysql we'd have billions of rows by now, which would be impossible to deal with. At any rate, the only really useful information in these files is which participant worked on which result, which will come in handy when we need to figure out who gets to share our Nobel prize. So I guess I have some file parsing/management in my near future to whittle these 700GB of archives to 10GB of user-to-result lists.
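For the curious, here is roughly what that whittling could look like. This is only a sketch, not the script we will actually run: it assumes the archives are XML-ish flat files made of `<result_archive>` records with `<id>` and `<userid>` fields, and the sample record and its values are made up for illustration.

```python
import re

# Matches a line such as "<id>1001</id>" or "<userid>42</userid>".
# The tag names are an assumption about the archive layout.
FIELD = re.compile(r"<(id|userid)>(\d+)</\1>")


def user_result_pairs(lines):
    """Yield (userid, result_id) for each complete <result_archive> record."""
    record = None
    for line in lines:
        text = line.strip()
        if text == "<result_archive>":
            record = {}
        elif text == "</result_archive>":
            # Emit only records that carried both fields we care about.
            if record is not None and "id" in record and "userid" in record:
                yield record["userid"], record["id"]
            record = None
        elif record is not None:
            match = FIELD.fullmatch(text)
            if match:
                record[match.group(1)] = int(match.group(2))


# Hypothetical sample data; every other field in a record is simply skipped.
sample = """<result_archive>
    <id>1001</id>
    <name>hypothetical_result_0</name>
    <userid>42</userid>
</result_archive>
<result_archive>
    <id>1002</id>
    <userid>7</userid>
</result_archive>""".splitlines()

pairs = list(user_result_pairs(sample))
print(pairs)
```

Because it reads line by line and keeps only one record in memory, the same generator could be pointed at an open file handle instead of the sample list, which matters when the input is hundreds of gigabytes.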

- Matt

see comments

13 Aug 2010, 15:37:15 UTC
We're on line with these limits:


planned Monday limits:


With the MySQL replica down, read-only queries that would normally go to the replica will hit the primary instead. We'll see what the impact of this is.
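To make the routing concrete, here is a minimal sketch of the idea. It is not the actual BOINC web code; the class and method names are invented for illustration, and the server names are just labels.

```python
class QueryRouter:
    """Pick a database server for a query (illustrative only)."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica
        self.replica_up = True

    def pick(self, read_only):
        # Read-only queries prefer the replica, but fall back to the
        # primary when the replica is down or out of sync.
        if read_only and self.replica_up:
            return self.replica
        # Writes always go to the primary.
        return self.primary


router = QueryRouter(primary="mork", replica="jocelyn")
normal_target = router.pick(read_only=True)

router.replica_up = False
fallback_target = router.pick(read_only=True)

print(normal_target, fallback_target)
```

The fallback keeps the web pages working, at the cost of piling every read onto the server that is already handling all the writes.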

see comments

12 Aug 2010, 20:58:48 UTC
Wrapping up the weekly "extended outage." Jeff's actually out today, but will be back to turn the servers on tomorrow (i.e. Friday, when I'm usually out).

I finally got around to testing a drive on mork (the mysql server) that the RAID card deemed "failed" at some point, but maybe that was a transient problem as it seems fine now. Nevertheless I went through the rigamarole of pulling that drive, putting a new one in, testing it, making it a new hot spare, etc.

That's all good, but the week in general has been tainted by mork issues. It had one of its regular mystery crashes on Tuesday (followed by a long recovery). Then last night, and again this morning, the RAID mirror of two solid state drives (where we keep the innodb logs) started going flakey on us. The partition would just disappear, sending mysql into fits. We were able to quickly recover, but we're abandoning the solid state drives for now. Honestly, they weren't adding all that much to the i/o picture because we were cautious about how we were implementing them. Now I'm glad we were cautious. The upshot of all the above meant that we had to recover the replica as many as four times so far from the weekly backup. What a pain. The latest replica recovery is happening as I type this. All I hope is that all systems are normal and stable by tomorrow.

Everything else is fine. In fact, more than fine as a set of very generous participants donated $6000 towards a new server that will become the new science database server. THANK YOU!! We're still spec'ing out said server, but will go ahead sooner than later now that we don't have to set up a funding drive!

Meanwhile I'm still chipping away at various data analysis projects, and Jeff's been fighting with data synchronization issues that have been creeping in more and more lately. We also had a "design meeting" regarding where to go with the public involvement of candidate selection. I'm finding some plug-n-play visualization utilities on line, but pretty much I'm finding (like always) it might just be easier and better if I do it all myself with tools I already know. However, some improvements go beyond that scope, so I'm digging into AJAX which is good stuff to know, I guess.

- Matt

see comments

6 Aug 2010, 18:32:48 UTC
We're on line with the same starting limits as last week:


and with the planned Monday limits also the same as last week:


Astropulse is back on line and that should ease the raw data consumption rate. Without AP running SAH tears through the data such that keeping the splitters fed over the weekend becomes a challenge.

see comments

5 Aug 2010, 21:28:30 UTC
Another catchup post. I'm still trying to page in everything I missed in July - it doesn't help that shortly after the last post I got a nasty summer cold. I'm back in business now.

We had another mysql database server crash over the weekend, which Jeff handled remotely without much ado. The upload server also had its directly attached storage array freak out again. This is becoming a common event, resulting in the software RAID getting in some funky state (which has always been reversible thus far).

Other than that, the servers are still chugging along. As for the grand server shuffle, progress has been made and a definite plan is in motion. Basically marvin is becoming bambi (the Astropulse database) and bambi is becoming bruno (the upload/BOINC admin server) and bruno is being turned off. Meanwhile some new machine (we'll acquire somehow) will become thumper (the science database) and thumper will become ptolemy (internal file server) and ptolemy will shut off. Getting bruno and ptolemy out of the picture means two of the three servers prone to random crashes/hardware issues will no longer be on line. The third such server is mork, which is the only server remotely close to handling the mysql database load, so no options for fixing that anytime soon. We have our hands full anyway fixing what we got.

I also (finally) got a test suite working for all my birdie tests (i.e. putting a fake signal or "birdie" in the raw data, blanking it, splitting it, then running clients on it to see if the birdie still appears). This took me a while as I had to remember all the various bits and pieces of this puzzle, some of which I haven't touched for months. Now it's all in one big script, which is nice. Oh yeah I also parallelized the software blanking pre-processing, so new data can get on line twice as fast as before (if resources are available).

Jeff's going to put some newly compiled Astropulse back end services on line tomorrow. Hopefully that's all good or else we'll likely run out of work over the weekend (which happened last weekend, but was mostly hidden by the mysql database server crash).

It's summertime, so people are in and out of the lab a lot, but enough of us will be in one room at the same time next week that more meaningful plans/management discussions will take place regarding NTPCkr and other scientific analysis stuff.
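The core of a birdie test, boiled way down, looks something like the toy below. This is not the real test suite: it assumes simple Gaussian noise, a made-up frequency bin and amplitude, and a naive DFT standing in for the client, and it skips the blanking and splitting stages entirely.

```python
import cmath
import math
import random


def make_noise(n, rng):
    """Unit-variance Gaussian noise standing in for raw data."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def inject_birdie(samples, bin_index, amplitude):
    """Add a pure tone centred on the given frequency bin."""
    n = len(samples)
    return [
        s + amplitude * math.cos(2 * math.pi * bin_index * t / n)
        for t, s in enumerate(samples)
    ]


def power_spectrum(samples):
    """Naive DFT power for bins 0 .. n/2 - 1 (fine for small n)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        acc = sum(
            s * cmath.exp(-2j * math.pi * k * t / n)
            for t, s in enumerate(samples)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def strongest_bin(samples):
    """Index of the most powerful bin, ignoring the DC bin."""
    spectrum = power_spectrum(samples)
    return max(range(1, len(spectrum)), key=spectrum.__getitem__)


rng = random.Random(2010)  # fixed seed so the toy run is repeatable
BIRDIE_BIN = 37            # hypothetical frequency bin for the fake signal
data = inject_birdie(make_noise(256, rng), BIRDIE_BIN, amplitude=2.0)
recovered = strongest_bin(data)
print(recovered == BIRDIE_BIN)
```

In the real thing the interesting question is whether the birdie survives every processing stage in between, so each stage would sit between the injection and the detection step.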

- Matt

see comments

30 Jul 2010, 13:27:54 UTC
We had a machine crash (actually more of a hang) last night. The machine was mork, which runs the boinc database. MySQL recovered surprisingly quickly. So now I am letting the queues drain before bringing the project on line.

see comments

28 Jul 2010, 23:25:41 UTC
Hi, All - I'm back from three weeks off. I don't claim to know all the details about what happened while I was away, but outside of (the usual) fits and starts here and there it generally seems positive. The extended weekly outages seem to have given Jeff a lot of time/focus on NTPCkr progress, for example.

I am however a little disappointed how slow the spike table merge has been going. It's still not even close to finishing. At current rates, if running 24/7 (which isn't likely) it'll take roughly 20 days to complete.

Now that I'm back some more effort will be applied on major server shuffling. The current plan is that we have server "marvin" ready to go (after I re-RAID and reinstall the OS) to become the new astropulse database server, freeing up bambi to become either the new upload server or internal file server (both of which need replacing). We'll probably end up buying a new server once we spec it out to become the new SETI@home science database server, and turn thumper into whatever bambi didn't between the two upload/file server options. Follow all that?

Nobody else was around on Monday when Dave requested some minor fixes checked into BOINC code get put forth to the public. This was fine and I did so, unaware that all this VLAR code realized during my absence was turned off by default. So for a couple hours there all workunit types were being sent out to all processor types. So be it. There may be other issues with this scheduler and the default settings. Waiting on input regarding that...

So what did I do on my summer vacation? Among other things managed family visits, tackled some iPhone game contract work (so I guess this wasn't a full vacation), recorded and mixed a bunch of music and played a few good gigs - one of my bands played last Saturday night at the Great American Music Hall (that's me on bass guitar, occasionally shaking the cramps out of my right hand). Anyway, it's nice to be back at the SETI mine.

- Matt

23 Jul 2010, 15:54:19 UTC
The servers are all up. We are trying to start with a high limit this week to see what happens. The limits currently are:

40 per CPU
320 per GPU

Depending on how it goes, we may set the final day limits to 8x this rather than unlimited. The calculated hope is that this will allow for a 4 day queue filling (3 + 1 for good measure) while not maxing out bandwidth usage. As always, we'll see.

Server software changes this week include the much desired VLAR behavior (not assigning VLAR WUs to GPUs) and a hook in the assimilator for doing RFI filtering at assimilate time (this should reduce the burden on the back end "ntpckr/rfi filter" loop). The VLAR change will not be evident until all previously split work is distributed.
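
A minimal sketch of the VLAR rule described above, assuming a VLAR workunit is simply one whose angle range falls below some cutoff. The cutoff value, the `Workunit` class, and the function names are hypothetical and are not taken from the BOINC scheduler code.

```python
from dataclasses import dataclass

# Hypothetical cutoff in degrees; the real threshold is not given here.
VLAR_MAX_ANGLE_RANGE = 0.1


@dataclass
class Workunit:
    name: str
    angle_range: float  # degrees


def is_vlar(wu, cutoff=VLAR_MAX_ANGLE_RANGE):
    """True if the workunit counts as VLAR under the assumed cutoff."""
    return wu.angle_range < cutoff


def eligible_for(wu, resource):
    """resource is "cpu" or "gpu"; VLAR work is kept off GPUs."""
    if resource == "gpu" and is_vlar(wu):
        return False
    return True


def pick_work(queue, resource, count):
    """Return up to `count` workunits from the queue that may go to `resource`."""
    return [wu for wu in queue if eligible_for(wu, resource)][:count]
```

With a rule like this, CPUs still see every workunit type, while a GPU request just skips past any VLAR work in the queue.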

19 Jul 2010, 16:28:59 UTC
Even though we have less than a day left in this run, I am starting a sticky locked thread as Pappa suggested.

16 Jul 2010, 3:47:04 UTC
We have a connectivity problem somewhere between our SSL router and our PAIX router. The connection between our PAIX router and Hurricane Electric seems fine.

Several people are looking into this.

The problem needs to be resolved prior to bringing the main project servers on line.

14 Jul 2010, 23:06:04 UTC
We are beta testing a change whereby VLAR WUs are not scheduled onto GPUs. We hope to move this to the main project next week.

10 Jul 2010, 15:01:29 UTC
Things are looking OK from this end. When the project was brought on line yesterday, none of the public facing servers were dropping TCP connections with the exception of the upload server. TCP drops on the upload server went to zero in about three hours.

The boinc database is keeping up. It was doing ~1000 queries per second most of the day yesterday. It's down to about half of that now. Hiding those two threads (jobs limits and outage schedule) really helped. I'm not sure why those queries were hanging around so much. Number of posts? Waves of popularity?

The assimilators suddenly decided to start crashing on vader - a general protection exception in libc. I need to track that down. In the meantime, I moved the assimilators to bambi where they appear to run fine, except for the known, occasional memory leak which can, and did, bring a machine to its knees. Another thing to track down. In the meantime (there are too many "meantimes"), I put an assimilator restarter in place on bambi. This method has been working well on vader. The assimilator queue is now draining.
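
For readers wondering what a "restarter" amounts to, here is a minimal sketch of the general idea: run a process, and if it dies with an error, start it again. This is an illustration only, not the script running on bambi; the function name, the restart cap, and the stand-in crashing command are all assumptions.

```python
import subprocess
import sys
import time


def run_with_restarts(command, max_restarts, delay_seconds=0.0):
    """Run `command`; if it exits non-zero, start it again.

    Returns the number of restarts performed. A real restarter would
    likely loop forever and log each crash; the cap keeps the demo finite.
    """
    restarts = 0
    while True:
        result = subprocess.run(command)
        if result.returncode == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(delay_seconds)


# Stand-in for a crashing assimilator: a process that always exits with 1.
crashy = [sys.executable, "-c", "import sys; sys.exit(1)"]
restart_count = run_with_restarts(crashy, max_restarts=2)
```

A restarter like this papers over crashes rather than fixing them, which is why the underlying exception and memory leak still need tracking down.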

Others have reported it, but I will report it again here. The job limits we started, and ran with, yesterday were:

CPU 5 per processor
GPU 40 per processor
total (global) limit : 140

About an hour ago, I upped it just a bit to 6, 48, 150. I will remove all limits on Monday.
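
To make the numbers concrete, here is a small worked calculation. It assumes the limits combine as (CPUs x per-CPU limit) + (GPUs x per-GPU limit), capped by the total limit for a single host; that reading of "total (global) limit" is an assumption, and the function name and example hosts are hypothetical.

```python
def max_jobs_in_progress(n_cpus, n_gpus, per_cpu, per_gpu, total_limit=None):
    """Most jobs one host could hold at once under the assumed limit rules."""
    limit = n_cpus * per_cpu + n_gpus * per_gpu
    if total_limit is not None:
        limit = min(limit, total_limit)
    return limit


# Yesterday's limits: 5 per CPU, 40 per GPU, 140 total.
small_host = max_jobs_in_progress(4, 1, per_cpu=5, per_gpu=40, total_limit=140)
big_host = max_jobs_in_progress(8, 3, per_cpu=5, per_gpu=40, total_limit=140)

# After the bump to 6, 48, 150.
small_host_bumped = max_jobs_in_progress(4, 1, per_cpu=6, per_gpu=48, total_limit=150)
```

Under these assumptions a host with 4 CPUs and 1 GPU goes from 60 to 72 jobs, while a host with 8 CPUs and 3 GPUs would want 160 and gets held at the 140 cap.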

We will go for a better mix of files (and angle ranges) going into next week's server run.

1 Jul 2010, 22:39:33 UTC
Barring unexpected incident, we'll be turning the spigots back on tomorrow (Friday) morning as planned. Thanks for your patience as we sort out what kind of server outage schedule ends up being the most productive - a definite work in progress. So what did we accomplish this week during the downtime?

Programming wise, Jeff was able to tackle some longstanding datarecorder issues. You may have noticed our results-to-send queue has been growing rather large - these are some test tapes Jeff's been splitting which will be sent out rather quickly once the floodgates open (the status page already shows the schedulers are up which is wrong - that's a bug I need to fix). I did some cleanup of our various internal libraries - stuff that would never get done under normal-operation circumstances but has been bugging us for a long time. I also fixed a web site bug here, a donation processing bug there.

Server wise, I got to upgrade the OS/mysql versions on the BOINC database servers - another thing that's been bothering me for a while. We also were able to do some testing/planning for some major server shuffling - trying to get the right services on the right servers, and the most important services on the most reliable servers. We still may have to get new hardware. I'll let you know.

Data wise, we were able to get back to merging our various spike tables together full bore - doing so while the project was up was causing all kinds of headaches. We'll have to turn the merge off over the weekend, of course. I also was able to do a whole bunch of data integrity testing - it's nice to be able to pull 1 Gbyte of signals out of the science database without the query getting blocked, or worrying about blocking other queries.

In short, it may not seem like much this first week given the extended downtime, but the mood around here is a lot better when we have the time and resources to take care of longstanding projects without worrying about squeezing them in edgewise. I think general productivity will vastly improve over time, and we'll adjust the outage schedules accordingly.

Speaking of time, I'm actually outta here - going on a three week vacation for various reasons. It'll actually be a "staycation" so I'll be on call to help in case of a crisis...

- Matt

23 Jun 2010, 21:40:02 UTC
Since last I wrote a lot has happened. Looking at the traffic graphs it's like feast or famine - either we are unable to create/send out workunits, or we're sending out as many as we can fit through the pipe. Mostly it's been the usual gremlins.

However regarding the past 24 hours it was a new problem: the result space on the upload server filled up unexpectedly, which would have been fine except this (perhaps) inspired some RAID freakout on the system. We couldn't really sort it out until this morning. From the looks of it we had something like a six drive simultaneous failure. Jeff and I beat on it for a while - we eventually assumed this was just a hardware blip, and the data was more or less intact on the drives, but the RAID metadata got a little screwed up. Long story short we were able to carefully bring down the RAID and recreate the meta devices from scratch with the data intact, and all was well. Phew. For the record we do have a virtually-up-to-date result storage backup at all times in case of catastrophic failure on this system.

In any case, the main culprit was our disks filling up, so as I write this we're keeping the project down until major queues drain and the constituent workunit/result files can be deleted.

On a more happy (perhaps) note, yesterday the core group of us were in the same place at the same time (which is rare) and we had an ad hoc meeting about our current project status/plans, especially in light of many recent server problems, increasingly random schedules, and embarrassingly low funding. We're all kind of tired and beaten up and wanting some results already - so I like to think this paved the way for several large and ultimately positive changes in the future.

Also Jeff has been working on this nagging mysterious problem where some of our raw data files are only getting partially processed (which vastly increases our "burn rate" and leads to unexpected workunit shortages). He found some major clues today, and we brainstormed why this is happening and what the exact effect is. At least there's a smoking gun on that front.

- Matt

16 Jun 2010, 20:02:05 UTC
Another day, another perfect storm.

We had our usual weekly outage yesterday (for database backups/maintenance/etc.) during which we take care of other hardware/project issues. Such as yesterday - we finally got our remote-controlled power strip configured and hoped to put one of our crashy servers (ptolemy) on it.

This meant bringing ptolemy down, which pretty much kills *everything* including all the web sites/BOINC servers. We did so, only to find that during the course of installation the config on the power strip got reset somehow, so we had to fall back. All told, this meant an hour of delay/downtime, and we were once again at square one.

After that Dave and Jeff were coordinating getting some new scheduler fixes online, which required some database updates. So we didn't start the backup until after noon, which in turn meant the projects wouldn't be ready to come back on line until well after 5pm. Jeff manned that from home, but it turns out some poorly behaved yum upgrade of httpd on anakin in the meantime secretly broke the httpd config which was impossible to diagnose/fix at the time. So we were down for the night until we could figure it out in the morning.

I guess one silver lining of being down all night was that Jeff and I had an opportunity to retry installing the power strip on ptolemy with minimal interruption (as we were already in the middle of a major interruption!). This time: success - as far as we can tell after one test, if ptolemy now crashes the power strip will detect this within 30 minutes and power cycle it. Hopefully this will vastly reduce our downtime when this happens again (usually on the weekends).
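
The logic behind this kind of watchdog can be sketched in a few lines. How the actual power strip detects a crash is not described here, so everything below (polling the host, the function names, resetting the clock after a power cycle) is an assumption; only the 30 minute figure comes from the post.

```python
TIMEOUT_SECONDS = 30 * 60  # "within 30 minutes", per the post


def should_power_cycle(last_response_time, now, timeout=TIMEOUT_SECONDS):
    """True if the host has been silent for at least `timeout` seconds."""
    return (now - last_response_time) >= timeout


def watchdog_step(host_is_alive, last_response_time, now):
    """One polling step. Returns (new_last_response_time, power_cycle_now)."""
    if host_is_alive:
        return now, False
    if should_power_cycle(last_response_time, now):
        # After cycling, restart the clock so the host has time to boot.
        return now, True
    return last_response_time, False
```

Times are plain seconds, so the same functions work with `time.time()` in a real loop or with hand-picked numbers in a test.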

As I type this Jeff is still getting most of the BOINC back-end pieces working one by one, but at least we're doling out work for the moment as fast as we can.

I know most of you who read these updates know this already, but it bears repeating: nobody working directly on SETI@home (all 5 of us) works full time, and we all have enough other things going on that make it impossible for us to be "on call" in case of outage/emergencies. In my case, I currently have four regular separate sources of income with jobs/gigs in four completely different industries (covering all the bases in case one or more dry up). As for last night, when the httpd problems arose, I was working elsewhere, and when I checked in again around 10:30pm everyone else was asleep and I didn't want to start up the scheduler processes without others' input as they were still effectively on the operating table. We've pretty much given up any hope for 24/7 uptime, but BOINC takes care of that as long as you sign up for other projects.

On a more positive note: the "spike merge" is coming along, albeit slowly. May take one more whole week to complete. And we're still doing R&D regarding server shuffling to improve our science database throughput (and therefore speed up our candidate searching).

- Matt

9 Jun 2010, 22:34:27 UTC
Let me address the "no work" issues as of late. We've been running low on work to send out (or had the schedulers turned off) for several reasons:

1. Each raw data file has to go through a local software based radar analysis - a suite of programs that takes over 3 hours to run per file. This should keep up with the incoming data flow, but some nagging NFS/mounting bugs cause this suite to lock up several times a week. Each time it does, the whole system for getting new data on line is clogged until a human can figure out where it was in the process, clean it up, and start the broken file over again (resulting in many hours of lost processing time). For example this morning we found it all jammed last night, cleaned it up around 9am, and finally around 12:30pm new workunits were available again. We're working on adding some band-aid solutions to this particular problem.

2. Server crashes: mork and ptolemy are prone to crashing for no apparent reason. Either of them going down causes the project to halt until we recover. Sometimes it takes days to fully get back to a regular work-flow pace again. We're trying to shuffle services around to get ptolemy out of the picture. Why ptolemy instead of mork? Mork is a much bigger system and therefore much harder to replace - plus when it goes down the download servers are at least still able to work for a while.

3. Some data files error out pretty quickly due to noise or garbage data.

4. The CUDA clients sure burn through work fast.

5. Some CUDA clients were returning garbage. To combat this a fix to the scheduler was put on line this Monday, but we were unable to start it without errors. It took Eric, Jeff, and me all day, and most of the next morning, to finally find the obscure problem - which was actually a misleading redirect in the apache config (that was put in many months ago). By the time we fixed it, we were already into the weekly outage.
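
One generic band-aid for the lock-ups in item 1 is to give each stage a time budget, so a hung run is noticed automatically rather than by a human the next morning. The sketch below shows that pattern only; it is not the fix the team is working on, and the function name, commands, and timeouts are hypothetical.

```python
import subprocess
import sys


def run_stage(command, timeout_seconds):
    """Run one pipeline stage; return "ok", "failed", or "timed_out"."""
    try:
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return "timed_out"
    return "ok" if result.returncode == 0 else "failed"


# Stand-ins for a healthy stage and a hung one.
quick = [sys.executable, "-c", "pass"]
stuck = [sys.executable, "-c", "import time; time.sleep(30)"]

quick_status = run_stage(quick, timeout_seconds=10)
stuck_status = run_stage(stuck, timeout_seconds=1)
```

A caveat: a process blocked on a hung NFS mount may not respond to being killed, so a timeout like this can flag the problem without necessarily clearing it.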

So lots of battles on this front. In any case we are collecting data at this point (on 2TB drives, which means we'll lose less data waiting for the Arecibo operators to swap out the older 500/750GB drives), and still have a backlog of stuff to process in our archives. The lab is also getting a Gbit link to the world in July so the slow transfers to/from these archives will no longer be a bottleneck. Note this link is for the whole lab and our SETI specific data link will remain at 100MBit. Still, it's an improvement.
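
For a sense of scale, here is the back-of-the-envelope arithmetic for moving one 2TB drive's worth of data over each link. It assumes the full line rate is available, decimal units (1 TB = 10^12 bytes), and no protocol overhead, none of which holds in practice, so treat the results as lower bounds.

```python
def transfer_hours(size_bytes, link_bits_per_second):
    """Ideal transfer time in hours at the full line rate."""
    return size_bytes * 8 / link_bits_per_second / 3600


TB = 10**12

hours_at_100mbit = transfer_hours(2 * TB, 100 * 10**6)  # about 44.4 hours
hours_at_1gbit = transfer_hours(2 * TB, 10**9)          # about 4.4 hours
```

So even in the ideal case a full drive takes nearly two days at 100Mbit, versus an afternoon at a gigabit.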

- Matt

2 Jun 2010, 19:31:08 UTC
Another monthly-ish report.

First the good news, before it seems like I only want to blather about the funny/annoying stuff. Jeff has been hammering on the NTPCkr to incorporate the newer RFI removal code. Before the plumbing was of the form: signals come in, pixels are scored, the best ones are displayed for the public to see. Now the plumbing is: signals come in, pixels are scored, the best ones are displayed in a sort of "preview" form and sent into the RFI loop, which then forces the pixels to be rescored (after bad signals are removed), and if they still happen to be in the top ten they'll have all the associated plots for the public to analyze.
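
The new plumbing can be sketched as a small pipeline. This is a toy model only: the scoring function (sum of signal powers), the `is_rfi` flag, and the data layout are invented for illustration and are not how the NTPCkr actually scores pixels or detects RFI.

```python
def score_pixel(signals):
    """Hypothetical score: sum of the powers of a pixel's signals."""
    return sum(s["power"] for s in signals)


def top_pixels(pixels, n):
    """Return the ids of the n best-scoring pixels, best first."""
    ranked = sorted(pixels, key=lambda pid: score_pixel(pixels[pid]), reverse=True)
    return ranked[:n]


def remove_rfi(signals):
    """Drop signals flagged as interference (the flagging itself is not shown)."""
    return [s for s in signals if not s["is_rfi"]]


def candidates_after_rfi_loop(pixels, n=10):
    """Preview the top n, clean them, rescore, and keep those still in the top n."""
    preview = top_pixels(pixels, n)
    cleaned = dict(pixels)
    for pid in preview:
        cleaned[pid] = remove_rfi(pixels[pid])
    final_top = top_pixels(cleaned, n)
    return [pid for pid in final_top if pid in preview]
```

Here `pixels` maps a pixel id to its list of signals. A pixel whose score came mostly from interference drops out after rescoring, and only the survivors would get the full set of plots.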

I'm also still pecking away at data integrity tests. I have the "birdie injector" (which sticks fake signals in the raw data) working to some extent. After some full tests we're finding these birdies in the results reported back from the clients - though it seems that we might have to add another retroactive signal correction in the future. Don't worry - if this is in fact true it's not a big deal. It's easy to fix and there's no lost scientific integrity. Other than that, there's continuing testing happening in my copious free time. I also wrote up a scientific newsletter about radar blanking.

Of course, our server woes have peaked a bit recently, coinciding nicely with the holiday weekend and a mass e-mail. The mass mail was part of the problem actually - there was a link to several video files which were much larger than I assumed. Like hundreds of megabytes larger. So that made our web site (and the whole lab's internet connection) a little bit sluggish. Oopsie.

But the two machines prone to random crashes did just that. First mork went down taking the BOINC user database with it. Recovery was easy enough, but then the next day ptolemy went down taking everything with it. Dan actually came up on Sunday to power cycle the thing by hand (with my guidance via phone).

Yeah - it's on our list of over 200 critical things to get a remote power strip installed on ptolemy. I'd rather we just have systems that didn't crash. Unfortunately the functionalities of these machines are such that transferring them to other machines is impossible. However, we do have thumper... and marvin...

I've been hoping to reorganize thumper and upgrade its OS for some time now. We finally had the window to do that this past month, but there was one hangup after another postponing this project. Meanwhile we have marvin set up for test database purposes. It's more or less a functional equivalent of thumper, but with a lot less drive spindles. Still, the plan now is to burn marvin in, move the science database there (temporarily if not permanently), and then thumper is free to be completely wiped clean. Maybe we'll make thumper the new ptolemy and retire old ptolemy. That'll be one less crashy server to deal with. As for replacing mork... well we need another system with a bunch of CPUs, many disk spindles, and at least 64GB of memory. Not happening any time soon.

Let's see... other projects... oh yeah - we're now merging the spike tables. We had to split the spike signal table a while back due to reaching some logical constraint in the database. After the dust settled on various other projects we're ready to make that one whole spike table again. Easier said than done, but what isn't around here? Anywho.. this was why the spike signal counts on the science status page seemed a little off for a while.

- Matt

29 Apr 2010, 21:01:37 UTC
Okay it's been a while and nobody else is chiming in so here are some general random updates. Sorry these are so few and far between. Not my fault.

We did successfully, finally, split the informix databases up. Instead of both redundantly housing SETI@home and Astropulse, one is specifically running SETI@home, and the other Astropulse. We lost our redundancy, but we back these systems up weekly and in a pinch can always regenerate lost data by splitting it again and sending it out to y'all. What we gained was a massive amount of i/o. Actually more like the Astropulse i/o isn't clobbering normal SETI@home day-to-day operations anymore. Like all things, this procedure took far longer than expected - mostly due to one of the SETI@home tables being strangely hard to drop off the one server that no longer needed it - something about a user-defined type in that table causing informix to crash when the deletes were done en masse.

There are still the usual set of other systems projects or problems waiting for our time and attention. Our master mysql server, mork, has been stable but may reboot itself unexpectedly at any point. Luckily when this happens recovery has been short and painless. We'd replace this system, but need another system with similar drive space and cpus and 64GB RAM which we don't have. Even worse is our main file server (which, among other things, houses our web site and home accounts), which is slow and also prone to unwarranted random crashes. Some systems still need an OS upgrade. I also want to rebuild the RAID on thumper...

In brighter news, the gigabit link project got a kick in the right direction. Short story: turns out the whole lab wants a gbit connection to campus and suddenly has some discretionary funds for this. So we might partially piggy-back on that bandwidth. Anyway, the increased-bandwidth patient still has a pulse. Of course, we haven't really hit our own 100Mbit ceiling too much lately, so this is hardly an emergency at the moment.

Also... our data drive bay down at Arecibo was broken. We finally shipped them our working bay from here at Berkeley just so they could continue to collect data, but that meant we couldn't read new disks, only process data already on disk (or in our archives, i.e. old stuff that had yet to be properly processed). Anyway, we got the broken bay sent up here last week and Jeff found it was just a bent pin in the cable that connects to the power supply, so we have two functioning bays again in the two separate locations, and are reading newer data off drives for the first time in a while.

Other than that (and the usual set of minor tweaks and crashes that require a few minutes here and there) we've been running fairly well in a steady state. Dan continues to mostly work on CASPER stuff. Dave is working entirely on BOINC development. Jeff and Bob are manning the general data pipeline. Jeff and Eric are working on NTPCkr stuff - mostly RFI analysis/excision and candidate rescoring. While I seem to be part of all projects around here (like everybody) I've been forcing myself away from systems stuff except as needed - another reason why I've had little motivation to write tech news reports.

I've been mostly working on data quality stuff - one program that injects fake signals into raw data to test various parts of our blanking/analysis suite, and a bunch of other programs to test basic data integrity. Stuff that should have been done years ago, but better late than never. In short, the results are pretty much all good, but there are several database corrections of varying magnitude which need to be carried out before we can truly reduce the data even more. Stuff like pointing corrections, or general rescoring.

The basic game plan, as it has been, is to rally behind the NTPCkr suite once the RFI/scoring stuff is working and the science database can handle the full analysis load in earnest. If you're frustrated by lack of advancement on this front, maybe it'll help to think of all the previous NTPCkr pieces made public as part of a "proof of concept beta test." We do hope to have this rolling, complete with volunteer analysis and input, sooner rather than later. It's funny the SETI institute is working on their own volunteer analysis project. Basically just another thing that gets the public confused about who actually manages SETI@home. Anyway, you know how little labor resources we have, so we do what we can.

By the way, that NASA balloon project that crashed and burned this morning involved the great efforts by several of our lab mates here at Berkeley. Many years of planning/production lost in an instant. Total bummer.

- Matt

22 Apr 2010, 4:28:46 UTC
We had a couple of problems tonight. ptolemy, our main file server for user accounts, went down at about 5:05pm. Of course that's 5 minutes after Matt and Jeff left, so that left me as the default sysadmin. They're both more patient than I am and are less likely to just pull the plug out of the wall.

So I rebooted ptolemy, and it crashed again about 5 seconds after it came back up. And again. And again. Eventually I figured out that vader was trying to do a lot of writes to ptolemy and that was causing the crash.

I couldn't get vader to respond to anything, so I just pulled the plug out of the wall. I tried a few times to restart it, but it just hangs during the boot process. So our assimilators are down, among other things. We may run out of work at some point.

Hopefully Matt or Jeff will fix it tomorrow.

17 Mar 2010, 17:11:28 UTC
Thumper crashed around midnight, stopping anything that needed to talk to the science database. We're rebooting now, but it'll probably be several hours to resync the RAID arrays before we can turn work generation or result handling back on.

17 Mar 2010, 2:20:50 UTC
I'm not the best person to do tech news, because much of the time I don't have a clue, but here's what's up.

The sudden upswing today maybe means we're back to full upload capability. Maybe when I get home, I'll see that my upload backlog has cleared. <hoping> Who knows if campus will tell us what the problem was. Hopefully it won't happen again this coming weekend.

Lando has been upgraded to FC11, but can't run Astropulse splitters anymore until we build new ones. We're temporarily running an Astropulse splitter on thumper.

There are rumors of a potential upgrade to part of a gigabit link, but they are still rumors AFAICT.

22 Feb 2010, 6:36:19 UTC
Tonight's database problem was caused by a bunch of queries to a certain forum thread hanging. I don't know yet whether this was an accidental or a deliberate denial-of-service attack. Probably accidental, but I'm checking it out anyway.

19 Feb 2010, 19:17:01 UTC
Gargh! The science database on thumper went down at 2am due to a filled root partition. One of the raid arrays on thumper lost a drive at about the same time, and uploads are still too slow.

I've fixed the first problem, a hot spare automatically fixed number 2, and I will be working on number 3 now.

Happy Friday!


17 Feb 2010, 22:51:35 UTC
Well, shoot. Right at the end of the work day yesterday the air conditioning unit failed. What's worse is that the cause is still a complete mystery. When the campus A/C techs came up in the early evening they just pressed the reset button and it came back to life.

But that was after a panicked fury of shutting down every server possible to save their lives. Eric was the first on the scene and smelled burned plastic, heard broken fans, and quickly started unplugging everything he could. I came up later after the A/C was on to get the web servers going again (so people could at least see we were still alive).

This morning we rolled up our sleeves and surveyed the damage, which actually wasn't too bad. We definitely lost one UPS, and possibly a power supply in one of our file servers (though it seems okay for now). Eric's hydrogen survey server seemed to take the brunt of the damage, and he was ready to reinstall the OS on what disks remained visible to the system, when suddenly after the nth reboot all drives were visible again and all data was still intact. Well, that was a pleasant surprise.

Still, there was a bit of RAID and database recovery on various servers, which is why the project largely remained offline until the end of the day today. This is still going on, so we probably won't be fully back to normal until tomorrow morning at the earliest.

- Matt

16 Feb 2010, 23:38:24 UTC
Hello again. Happy President's Day - we had the Monday off, plus I took the whole previous week off to go hang out in Kauai. First real vacation in a while, and last for the foreseeable future.

So what did I miss? Looks like the upload/scheduling servers have been clogged a while due to a swarm of short-runners (workunits that complete quickly due to excessive noise). This should simmer down in due time. Plus we're having the usual outage today so there will be painful recovery from that as well. And things were running a little late today as a permissions problem held up the start of the outage. Patience.

While we did finally get the science database back in working order, we were finding the server still didn't have enough resources to meet our demands. So a new plan is being put into action over the coming weeks: instead of having both SETI@home and Astropulse reside on one server (thumper) and both replicated to another (bambi) - we're going to have SETI@home live on thumper and Astropulse live on bambi, both without replication. This will keep painfully long Astropulse analysis queries from clobbering the SETI@home project (which has been happening a lot lately). We may implement some form of our own replication, but we do back up the database regularly (and store those backups off site), so the replica doesn't buy us that much, especially considering we could double our database power by converting it to another primary server.

- Matt

5 Feb 2010, 0:06:07 UTC
So yeah, turns out the science database was having a migraine, not just a headache. I had to give it another swift kick last night. But after some rough seas this morning it seems to have just suddenly righted itself (at least for now). The symptoms were kind of new - Informix would be stuck at a checkpoint, while there was literally zero disk i/o on the system for upwards of an hour. Stopping/restarting Informix helped both times, but didn't seem to solve anything in the long term. What's more mysterious is the cause. We were running fine last week, even after starting Astropulse. What changed? We were quick to blame some extra Astropulse analysis queries (as they wrecked us before) but we still got the same symptoms after killing those. Was it merely the weekly post-outage recovery, which normally floods all our servers? Well, this was the first time we had an outage recovery while Astropulse was involved in a while, so maybe that's part of it. In any case, we're keeping a closer eye on the science database these days.

- Matt

3 Feb 2010, 23:42:45 UTC
Nothing major to report, hence the lack of updates. We had our usual weekly outage yesterday for mysql maintenance. During that we threw a newly compiled transitioner and scheduler into beta containing various bug fixes. The recovery was fine, though it's hard to tell as the network graphs (which are hosted by central campus, not us) seem to be broken at the moment.

We're actually having server closet temperature issues again as well. So I spent a chunk of time going around to various servers and implementing "sensors" to get more temperature data - want to make sure we're not being misled by a single server with a broken fan or something. Should have done this a while ago, but I can say that about pretty much everything I'm working on.
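
The reason for collecting readings from many servers is easy to show in code: with several data points, one machine with a broken fan stands out as an outlier instead of being mistaken for a closet-wide problem. The sketch below assumes the readings have already been gathered; the function name, the 10 degree margin, and the example values are hypothetical.

```python
import statistics


def summarize_temperatures(readings_c, outlier_margin_c=10.0):
    """readings_c maps server name to a temperature in Celsius.

    Returns (median, outliers), where outliers are the servers running
    more than outlier_margin_c above the median.
    """
    median = statistics.median(readings_c.values())
    outliers = sorted(
        name for name, temp in readings_c.items()
        if temp - median > outlier_margin_c
    )
    return median, outliers


# Made-up readings: three servers agree, one is running hot.
example = {"server_a": 30.0, "server_b": 31.0, "server_c": 32.0, "server_d": 55.0}
closet_median, hot_servers = summarize_temperatures(example)
```

If the median itself climbs, the closet is the problem; if only one name shows up in the outlier list, it is probably that machine.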

The science database is having a bit of a headache, mostly due to some extra Astropulse related analysis queries above and beyond the usual set of splitters/assimilators hitting the thing. We had to give it a kick an hour ago, and it recovered just fine (mostly since that kick killed the analysis query hogging all the i/o). We really need to improve this part of our server farm - when the NTPCkr is fully operational the science database is going to need all the juice it can get!

- Matt

27 Jan 2010, 23:23:25 UTC
As predicted, the science secondary did indeed catch up to the primary again, so all's well on that front (for now). And in case anybody noticed, we quickly turned the splitters/assimilators off for a bit to replace the failing drive on thumper - something we planned to do during the outage yesterday but couldn't. Easy squeasy - I'm glad we pay for Sun service on that system as drives are going fast. I can safely say the rumors that SATA drives fail frequently are true.

What you may notice is our servers being clogged for a spell, as Eric just turned the astropulse splitters back on (hooray!). We'll see if all goes well on that front - it's been a while and certain parts of the engine may need oil.

- Matt

27 Jan 2010, 0:01:32 UTC
Happy Tuesday outage day! We're recovering from our regular weekly maintenance downtime now. Jeff and I hoped to replace a potentially bad drive on thumper during the outage, but then realized we hadn't "failed" that drive yet. Upon doing so, this triggered the expected RAID resync, which took 4 hours. That wrapped up just as I was bringing the projects back on line. So we'll replace the drive later. Maybe tomorrow (it doesn't necessarily have to be during a Tuesday outage - we can power down thumper, swap the drive, and bring it back up without interfering with any public workunit/result scheduling or transactions).

Catching up from the weekend... the good news is the secondary science database server finally became operational again. The bad news is that for some mysterious reason it lost contact and fell behind a bit this morning, which in and of itself wasn't a big deal (this happens all the time), but this forced the primary into some quiescent mode which made no sense to us. Ultimately we found the only way to get things straight again was to bounce both database engines. The secondary is still catching up as I write this.

- Matt

see comments

21 Jan 2010, 22:42:54 UTC
The mysql replica did finally catch up on its own without any intervention on our part. I like when that happens. Likewise, the science secondary database is chipping away at its backlog - still may be a few days away from being completely caught up and functional.

There's been a programming push lately. Jeff and Eric are working hard on the RFI code, and Eric and I have been working on getting a new fake data generator rolling for more robust testing purposes. The NTPCkr development/testing/employment progresses at a slow pace - a lot is waiting on the current state of our science databases. Outside of getting the secondary operational, there are other major improvements we hope to make to speed things up.

The weather has been wacky, severe, and continuous. I like it, but still keep your fingers crossed we don't get a power outage.

- Matt

see comments

19 Jan 2010, 23:13:31 UTC
Long holiday weekend (it was Martin Luther King Day yesterday) during which there were no major snags, but a couple of minor ones. The data pipeline ran dry, but Jeff got on top of that before most people noticed. The mysql replica also lost touch with the master and threw itself offline. This is an old problem that went away for a while, but is apparently back to annoy us. Not a big deal, except the alert e-mails kinda got "lost in the noise" of the holiday weekend and we didn't kick the thing back to life until this morning.

Meanwhile, we're having the usual weekly mysql maintenance outage. The replica caught up a bunch while we were offline today, but still may take a day or two to fully get back in sync. Until then, any queries made to that database will be slightly out of date. Fine.

In better news, this last iteration of the secondary science database recovery project seems to have worked, or at least is working. As expected, it took 6 days to fully back up and restore the secondary from the primary, but this time we had enough logical logs on line that they didn't "wrap around" during this process and force us to try to recover from continuous logical log backups. We tried recovery from the continuous logs last week, and the bothersome thing is that it should have worked. Anyway... the secondary is up and doing the final stages of recovery now.

- Matt

see comments

13 Jan 2010, 0:05:06 UTC
We had an unexpected short outage early this morning. One of our internal file servers crashed, hanging everything. Jeff noticed it upon arrival at the lab this morning and kicked it (and the projects) back to life. Of course an hour later we had to bring everything back down again for the usual weekly maintenance (for mysql database compression and backup). The first outage caused a bit of delay, hence the extended length of what was otherwise a rather vanilla outage.

Jeff and I have been on a binge cleaning up the lab a bit, which has gotten overrun with cables, hardware, compact disc cases, etc. that we'll never need or use. Get it all out of here! Last week we uncovered an unlabeled box containing a motherboard and some RAM which would fit perfectly in this one tower Intel donated a year ago but never worked. So I spent a chunk of yesterday replacing this motherboard, only drawing blood a few times during the course of handling or maneuvering around all the unfortunately sharp heat sinks/solder joints/inner edges of the case. I also managed, while forcing the main power supply plug into the new board, to jam my right index finger down full force onto a set of exposed pins, one of which plunged a good half centimeter underneath my fingernail.

Did I ever mention how much I hate dealing with hardware?

Anyway... the new board works, but for some reason installing the OS on it has been a pain. I'm currently at attempt number five - all the problems stemming from the disk partition layout. This OS install worked perfectly on other systems, including the easy disk formatting GUI. Not sure why on this system I'm only able to create three primary partitions via the GUI, not four. I ultimately had to partition it myself in rescue mode to get it to behave how I wanted. Weird and frustrating. On top of that, the installer was logically swapping sdb and sdc, so when I placed a RAID on what I thought was sda/b it came up "missing" a drive and failed. Whatever. It's sort of working now. Not sure exactly what we'll do with it - probably just replace the slightly less powerful (and crashy) BOINC web server. Two CPUs, 8 GB of memory...

Meanwhile, some more bad news: we're having to backtrack a few steps in the secondary science database recovery project (on bambi). We were able to recover from the backup (a process that took a week or so) but the logical logs have since wrapped around. So we could recover from the continuous logical log backup, right? I mean, that's why we do the continuous logical log backup, no? Well, apparently we can't. Not sure why. So we're going to try to do the whole recovery/rebuild again in a manner that will hopefully take less than the time for those logical logs to wrap around (about 4 days). We'll see. Let me remind you this has zero effect on the public part of the project - well, except that astropulse is still kind of on hold until we're done. Yes, that's much greater than zero.
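
To make the timing problem concrete, here is a minimal sketch of the check involved. The function name is made up for illustration and is not taken from the project's actual tooling; the numbers are the rough figures quoted above (about a week for the restore, about 4 days for the logical logs to wrap around).

```python
def restore_beats_log_wrap(restore_days: float, wrap_days: float) -> bool:
    """Return True if a backup/restore finishes before the logical logs wrap.

    If the restore takes longer than the wrap-around period, the oldest log
    records needed to roll the secondary forward have already been
    overwritten, so it cannot catch up from the online logs alone.
    """
    if restore_days <= 0 or wrap_days <= 0:
        raise ValueError("durations must be positive")
    return restore_days < wrap_days


# Rough figures from the post: restore took about a week, logs wrap in ~4 days.
print(restore_beats_log_wrap(restore_days=7, wrap_days=4))  # False

# A hypothetical faster rebuild that finishes in 3 days would be safe.
print(restore_beats_log_wrap(restore_days=3, wrap_days=4))  # True
```

In other words, the options are to make the rebuild faster, keep more logical logs on line, or both.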

- Matt

see comments

6 Jan 2010, 23:47:01 UTC
Still catching up from the reduced/random schedule during the holidays. The science database rehabilitation project still continues. We're nearing the end: the primary science database (thumper) is now corruption free, stable, and logging properly. The secondary science database (bambi) is being rebuilt as I type using the science database backup we made on Monday. The rebuilding is going rather slowly - we predict it will take 11 days (!) at current rates. As I typed this paragraph we noticed the rebuild was stuck. We feared we had to reboot the system and start again from scratch but luckily we were able to find the errant process locking the whole system, and everything else sprang to life, continuing where it left off. Phew.

By the way... not to rain on the parade, but during the holidays one of the drives in thumper's RAID issued some warnings. Last time that happened we got some, well, um... corruption. I doubt we'll have to go through this whole rigamarole again. If anything, just a small part of the cookbook. Ah, probably not worth worrying about. We'll run some checks when all the above is through and see where we're at.

In better news, I got scram_peek working again. What's that? It's a little utility that runs down at the telescope and reads various diagnostics as they are broadcast around the local net. Stuff like current telescope position, if alfa is running, etc. This hasn't been working since our data recorder issues a loooong time ago, so our science status page (where we post such info) has been rather stale. One major stumbling block was the old scram_peek ran on a solaris machine, but that particular system died. We had no other solaris system handy so I had to recompile it on linux. It's really old code, linking against even older libraries. I had some compiler errors to work through - annoying but nothing too extreme.

Anyway, I'm looking at the science status page right now and the ALFA receiver light is green. That's beautiful. You may also notice the # of spikes in the science database is shockingly low. That's because we recently split the spike table into two (it grew beyond the bounds a single logical table could handle). We'll combine them again at a later point. Until then, that number is off by a billion or so (1,341,844,240 to be exact).

- Matt

see comments

5 Jan 2010, 0:14:02 UTC
Hi - just a quick note to say happy new year and we are slowly ramping up services/etc. again after the time away. Well, it wasn't really time away, as Jeff and I (and Eric and Dan) were all around dealing with the planned, massive lab wide power outages during the holidays. Of course there were some glitches, not sure if I'll ever get around to spelling them all out... nothing really all that exciting except one file server keeps coming up in "forced RAID resync" mode despite going down gracefully. This is why we're still keeping the project offline for now. Not so great, but I took the opportunity to do tomorrow's usual outage today. So once the RAID is resync'ed (tomorrow morning, hopefully), we'll turn everything on.

That one mysql database server did crash again, as it usually does, thus getting the replica out of whack. I'm also cleaning that up today.

- Matt

see comments

23 Dec 2009, 0:30:10 UTC
Oy! So the day ended yesterday with some good and bad news. The good news was that the air conditioning problems we were having were not due to our a/c unit, but due to the whole building turning off some circulation fans in order to save money over the holidays. Apparently these fans have been helping us out a lot. So we got them to turn those fans back on again until we figure out a better situation in the new year. At least they said they'd turn them back on again...

The bad news was that we discovered that spikes and gaussians were failing to be inserted into the science database by the assimilators. These were actually two separate problems that pretty much ate up our entire day today trying to figure out. The spike table simply needed more space. The gaussian table errors were terribly misleading, and we barked up several trees before determining there was some corruption in one of the indexes. We dropped a couple of the less crucial indexes until we were able to insert gaussians again. Jeez.

Other than that... we're ramping down our presence here at the lab now that the holidays and forced furloughs are upon us, but we'll of course be popping in from time to time anyway (remotely or directly) to deal with various chores, including this massive power outage on Sunday/Monday.

Happy remainder of the year!

- Matt

see comments

21 Dec 2009, 22:52:47 UTC
Regarding the science database issues over the past couple of months, let me recap: this is the informix/science database, not the mysql/user database. We noticed that one of the tables in astropulse got corrupted. No big deal - we lost a couple rows out of 80 million. In the process of fixing this we noticed that the astropulse portion of the database hasn't been replicated properly to the secondary informix/science database. So this whole project was to fix two things: the corrupted table, and the broken replication. Ultimately we learned a lot along the way cleaning this up ourselves, but each iteration has been sloooow, and a lot of time was lost trying certain things which seemed obvious, but didn't work like we expected. We're nearing the final stages of this (we hope). One silver lining is that we'll get to test recovery of the secondary from our weekly database backup - these backups are 1.2 terabytes in size at this point, so we don't test this procedure often.

So.. there's been upload issues starting yesterday. Not sure why, maybe we were just being blitzed more than normal. We tweaked our configuration around this morning so that the scheduling server is now also handling 25% of the upload load. Maybe that'll help push the clog through.
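
For illustration only, here is a small sketch of what a 75/25 split looks like as a weighted rotation. The post does not say how the split is actually implemented (DNS, a load balancer, or something else), and the server names and the `build_rotation` helper below are hypothetical.

```python
from itertools import cycle


def build_rotation(weights):
    """Expand integer weights into one repeating rotation of server names."""
    rotation = []
    for name, weight in weights.items():
        if weight < 0:
            raise ValueError("weights must be non-negative")
        rotation.extend([name] * weight)
    if not rotation:
        raise ValueError("at least one server needs a positive weight")
    return rotation


# A 3:1 weighting sends one upload in four (25%) to the scheduling server.
weights = {"upload_server": 3, "scheduling_server": 1}
targets = cycle(build_rotation(weights))

assigned = [next(targets) for _ in range(1000)]
share = assigned.count("scheduling_server") / len(assigned)
print(share)  # 0.25
```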

In worse news, our server closet temperature shot up way too high this weekend. Machines were running 10 degrees (Celsius) hotter than normal, and well beyond spec and in some cases the "danger zone." This isn't good. We're hoping somebody on campus will come up today and inspect our a/c system, but given it's the holidays we might not get anybody until the new year, in which case... we'll have to shut everything down for a while. We shall see.

- Matt

see comments

18 Dec 2009, 0:06:37 UTC
Regarding the secondary science database recovery debacle, we're throwing in the towel on that one. We tried to be clever by only dealing with specific sub-databases/tables in question, but the inner workings of Informix are way too complex and protective. So at our next earliest convenience we're going for the slow, brute force method, i.e. we're going to totally drop all secondary databases, back up everything on the primary, then recreate all the secondary databases from the backup. This is much like how we do it in MySQL land, but that database is 10s of gigabytes - the science db is upwards of 100 times that size.

We ran out of work to send out this afternoon. That'll be fixed shortly. Minor problem.

The donation drive letters continue to trickle out, requiring occasional attention on my part. I did pass along comments to the higher-ups made here and elsewhere, but that's as far as I went. I'm kind of sticking to my role as "the guy who just sends the mail along" for my own sanity.

- Matt

see comments

17 Dec 2009, 0:08:41 UTC
Outside of the usual end-of-the-year fund raising efforts that occupy a lot of my time, there's some actual technical projects going on. You may have noticed a dip in work an hour or so ago. Maybe not - it was quick. We're in (what we hope to be) the final stages of this massive science database shell game that's been taking months to complete at this point. The problem with a database so huge, so active, and so uniquely implemented is that paths of action are never entirely clear and one small misunderstanding could lead to a week of cleanup and starting again from scratch. All part of the big learning curve. Bottom line is we're almost there.

Oops... spoke too soon. Looks like the current restore phase aborted on its own <big sigh>.

Jeff and I also spent a moment considering some massive power outages over the holidays. Yes, there are major power upgrades happening on the hill starting later this month (affecting many buildings, not just ours). It makes some sense to do such things during what is usually "down time," i.e. when most everybody is on winter vacation. Of course that means computing staff has to be around to (a) safely power everything off before the outages and (b) power everything back on. And yes I mean two outages - one on December 27th/28th and another the following week. So that's four total complete power ups or power downs combined. In theory we could just leave everything off for a whole week after the first power down to save ourselves the extra cycle, but given we're trying to keep participants happy during donation season... well, it's worth the trouble. Yes, there will be an announcement on the home page once we have a more solid plan.

Oh... look at that. Seems like Dave is implementing some new generic BOINC project news/announcement code - the upshot of which is that thread creation is broken and there's a new message board forum called "News" with my name on every post. I'll have him fix that shortly... I only write technical news. Okay.. it's fixed.

- Matt

see comments

10 Dec 2009, 21:33:23 UTC
So the first round of donation pleas are being sent. Sending out mass-mails is an art and a science, neither of which I claim to be good at. I'm sure lots of these mails are being spam blocked or whatever, but we can only do so much given our resources to get the word out. One good piece of news is that paypal donations may actually be a possibility in the very near future. Imagine that.

I've said it before, but it's worth repeating: I appreciate all the efforts and contributions (in all forms) of our wonderful volunteer user base. I know we're already getting your valuable computer time, so the monetary donations you may decide to give us are in addition to your current level of generosity.

Anyway.. back to work. What was I working on? Oh yes. Data pipeline stuff. What else... So we did end up abandoning that strangely resource-hungry science database index building task. We're now just dumping the whole table fragment to an ascii file, dropping the fragment, and rebuilding from scratch via that file. That may end up being a lot faster after all.

An ATI version of the client is currently in beta test. I have no idea about anything beyond that, but it seemed worthy of at least mentioning it here.

- Matt

see comments

7 Dec 2009, 23:59:35 UTC
Hello again. After the long holiday weekend I disappeared out of town for a week, during which the rest of the gang put out several fires. To recap: around the time I was last here we were dealing with a trio of disasters. First, we wasted 2 days pulling data up from the archives that happened to be bad/useless. Second, the secondary database (and splitter server) bambi crashed. And third, the switch handling all our Hurricane Electric traffic gave up the ghost. This all got cleared up in bits and pieces by me and Jeff before, around, or immediately after turkey day. Then I hit the road.

Meanwhile, the astropulse signal table reload project still lingers on! I'll spare you the details because I wasn't here and don't really understand them, but playing a shell game with a hundred million rows' worth of data ain't easy, and there have been annoying or unexpected hurdles each step of the way. As it stands now, the project is pretty much off again as we're trying to rebuild an index and all resources are required to get this done sooner than later. It's already taken a couple days and hasn't shown much progress.

It's not all bad news. Eric continues to make progress on RFI mitigation, and Jeff is moving forward on other aspects of the science code. The NTPCkr is waiting on the above science database issues, but improvements are still being made. And mork hasn't crashed in a couple weeks now (not sure why, though - it's due for a crash). I also may try another OS upgrade tomorrow on bambi, which will test doing the same upgrade on thumper, which will then solve several root/RAID issues on thumper, and then we can start improving the disk I/O issues elsewhere on the system.

And it's donation season! Actually, it's getting late in the season. I'm going to be lost in mass mail coding/etc. for a while...

- Matt

see comments

26 Nov 2009, 17:07:21 UTC
Oh well, we tried. We thought we would just have to put some extra minutes monitoring the data pipeline over the weekend (after wasting a lot of time bringing up many broken files), which wouldn't have been too bad, but...

Then bambi crashed last night - it's our secondary science database server but also manages a lot of the data pipeline stuff. I happened to be free so I drove up to the lab around 10:30pm and rebooted it. After that, the pipeline zipped right along.

That is... until 11pm when the router up and died. Or something along the entire Hurricane Electric network path died. We have no idea. Jeff and I fought with it (both remotely) this morning, but we're throwing up our hands at this point and going on holiday.

Might as well have everything fail at once, and at the start of a long holiday weekend. Why not?

- Matt

see comments

25 Nov 2009, 23:34:12 UTC
Okay then. The mysql commit behavior we were testing was an absolute failure - though for expected reasons (not enough disk i/o, even with the solid state drives). It was worth a shot, but we fell back to the old commit behavior for now.

However, this caused a lot of backend processes to clog up including the transitioners, which ultimately meant the splitters burned through all kinds of raw data files before they realized we had more than enough work on disk. This could have been bad, i.e. filled up our workunit storage server, but luckily it didn't even come close to doing that.
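
One way to picture that runaway is a toy simulation, under the assumption that the splitters stop only once the amount of ready-to-send work reaches a high-water mark, and that work only counts as ready after a transitioner has processed it. The function, its parameters, and all the numbers are invented for illustration; this is not the actual splitter logic.

```python
def simulate(steps, split_batch, transition_rate, high_water):
    """Return how many workunits the splitters create over `steps` ticks.

    pending: workunits created but not yet processed by the transitioner
    ready:   workunits the splitters can "see" as ready to send
    """
    pending = 0
    ready = 0
    created = 0
    for _ in range(steps):
        # Splitters only look at the ready count, not at the pending backlog.
        if ready < high_water:
            pending += split_batch
            created += split_batch
        # The transitioner moves a limited amount of work per tick.
        moved = min(pending, transition_rate)
        pending -= moved
        ready += moved
    return created


# Healthy transitioner: splitting stops once 300 workunits are ready.
print(simulate(steps=10, split_batch=100, transition_rate=100, high_water=300))  # 300

# Clogged transitioner: the ready count lags, so splitting never stops.
print(simulate(steps=10, split_batch=100, transition_rate=10, high_water=300))   # 1000
```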

Anyway, we reverted this morning and all the dams broke for a while... until we ran out of work to send out. Turns out the last 10 files I brought up from Arecibo are all broken. <sad trombone>Fwa wa wa waaaaa</sad trombone>. This is particularly frustrating as I was busting my hump trying to get enough work on line before the long holiday weekend, and now we have zero. So it'll be up to me and Jeff to check in over the next few days and kick the pipeline along. We'll be out of real work to send out until this evening at the earliest, and quite probably hit long periods of no work throughout the weekend. Fine.

In better news, we did the last bits to get the Astropulse signal table fully copied over to another database fragment - only losing a few rows here and there (as opposed to many thousands as originally thought). Work will resume on Monday to exchange the old/new fragments and hopefully the science database will be much happier.

That's it for now.

- Matt

see comments

24 Nov 2009, 22:46:11 UTC
At the end of the day yesterday our raw data file server lost a drive. The bottom line as far as you're concerned is that we had to stop the creation of workunits until we got on top of the RAID resync issues this morning. But by then we were into our normal weekly outage, so you've been unable to get any work for a while, and will continue to not be able to do so until I start splitting up again - probably later this evening.

Meanwhile, every other part of the project is coming back online. We're testing the new mysql commit behavior (mentioned in yesterday's post). It's not looking good right out of the gate, but that may be due to mysql needing to read everything back into memory again after a bounce to pick up the configuration change. I may have to bounce it again if it continues to be a problem. I hope not, but it's no big deal either way.

Looks like Bob got most, if not all, the corrupt astropulse table finally copied over to another table so we can drop/recreate the data and get rid of this corruption (which has been causing us random headaches over the past month or two). I just ran some preliminary tests on the data integrity. Looks good.

- Matt

see comments

23 Nov 2009, 22:46:08 UTC
How about that? We made it through the weekend without a server crash! We haven't done much to improve the situation, so maybe we're just getting lucky (or maybe we've just been unlucky). Anyway, we've been happily shovelling data through the pipeline and collecting results.

However, we're still working on getting the corruption out of the science database. Every step takes a long time (days), as we're playing a large shell game with a database table that is reaching 100GB in size. That doesn't sound like much in some regards, but this is all being done on a row-by-row basis, plus we have to ensure data integrity at each step, etc. It's slow.

Back to the mysql database for a second - one thing we'll try tomorrow is moving mysql to commit-on-every-transaction behavior. Normally now it commits either once a second, or when the buffer is full. We tried this before and it was a major failure - the disk array on jocelyn couldn't handle it. But now we're on mork, where the logs are on solid state drives. Worth a shot. Normally we're processing hundreds of queries per second - so this new behavior will prevent up to hundreds of queries from disappearing during a crash, not to mention keep the replica in sync as well so we don't have to go through the painful exercise of recreating it every time the master goes nuts.
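
As a back-of-the-envelope illustration of what is at stake: if commits only reach disk once a second, a crash can lose up to one second's worth of work. The rate below is a made-up figure in the "hundreds per second" range. In stock MySQL/InnoDB this trade-off is normally controlled by the `innodb_flush_log_at_trx_commit` setting, but that this is the setting being changed here is an assumption, not something the post states.

```python
def worst_case_lost_transactions(tx_per_second: float, flush_interval_s: float) -> int:
    """Upper bound on committed transactions not yet on disk when a crash hits.

    flush_interval_s is how long commits may sit in memory before being
    flushed; 0 models flushing on every commit.
    """
    if tx_per_second < 0 or flush_interval_s < 0:
        raise ValueError("inputs must be non-negative")
    return int(tx_per_second * flush_interval_s)


rate = 300  # hypothetical: "hundreds of queries per second"

# Flushing once a second: up to a full second of work can vanish.
print(worst_case_lost_transactions(rate, 1.0))  # 300

# Flushing on every commit: nothing already committed is lost.
print(worst_case_lost_transactions(rate, 0.0))  # 0
```

The price is one disk flush per commit instead of one per second, which is why the speed of the log device matters so much.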

Still.. I admit I'm feeling fairly certain that we won't be able to stay this way very long and will have to revert to our current behavior. It'll be fun to try, though. This may make the recovery after the outage more painful than usual.

It's also rapidly approaching beg-for-donations season. A mass e-mail probably won't happen for a couple weeks (given everybody's holiday schedules). Once again it's up to me to figure out how to squeak out a large pile of e-mails before we're (wrongly) spam blocked - a mystical art.

Also, for our non-U.S. folks, this upcoming Thursday is our Thanksgiving holiday, so please forgive the short work week in advance.

- Matt

see comments

17 Nov 2009, 22:48:26 UTC
Okay so mork (the mysql database server) crashed again on Friday, and Jeff/Eric took care of getting that all back on line without much ado. Okay, yes, this is a crisis now, but we're not sure what the problem is, nor do we have any immediate solution (since we don't have another 24 processor system with 64GB of memory hanging around). Each time this happens jocelyn (the replica server) gets out of sync and is rendered useless until we can recover it during the next Tuesday weekly outage (which we're just getting out of now, and the jocelyn recovery is taking place as I type). So it's slightly frustrating that jocelyn, a powerful server in its own right, is twiddling its thumbs a lot of the time these days waiting to be resynced. Sigh.

We're also still hitting one snag or another trying to remove the corruption in the astropulse signal table. We'll fix it eventually - it's just a matter of shuffling around rather large tables containing millions of rows, etc.

I tried doing an OS upgrade on our web server this afternoon, but this had to be abandoned as the root RAID device was showing up half degraded during the install for no apparent reason - and when I'd bail on the install and restart the old OS the root RAID would look just fine. Weird.

Wow. Rereading these tech news items they always sound so negative. Okay then here's some good news: Eric and Jeff have been making great leaps in various parts of the scientific analysis back end, i.e. in the NTPCkr and first levels of interference rejection. I'm hoping there's more specific news to report on those fronts in the near future.

And there was recent mention of SETI@home perhaps suffering from "feature/scope creep." I actually completely agree with this concern, but this is a common, general problem with academic (i.e. non-professional) endeavors. The lack of resources is usually the main cause, then catalysed by the lack of hard deadlines and financial risk. That said, I think we do a pretty amazing job, given what we have, keeping the whole engine running while making slow but nevertheless non-zero progress on the final data products. The glacial speeds sometimes drive me crazy, but I usually solve that by involving myself in other professional/commercial jobs on the side that have harder defined goals and immediate rewards. I would like to see SETI@home "take a break" to devote all our efforts towards the science part for a while, but I admit there's both pros and cons going this route. I'm currently outvoted on this front, so we stick with the status quo.

- Matt

see comments

13 Nov 2009, 0:01:13 UTC
Turns out the replica recovery was much faster than expected on Tuesday, so I was able to get that on line before the day was out. Then we had the day off yesterday, and now today. Let's see. Seems like I've been lost in testing land today. First, we finally decided on a method to fix the corruption in our Astropulse signal table. It's just one row that needs to be deleted, but we can't just delete it using sql - we have to dump the entire database fragment (containing 25% of all the ap signals) and reload it without the one bad row. I wrote a program to test the data flowing in and out of this plumbing to make sure all the funny blob columns remain intact during the procedure. Bob also sleuthed out that this particular corruption actually happened months ago, not during this last RAID hiccup. Fine. Second, I'm also working on a suite of more robust tests/etc. for the software radar blanked results, now that we're getting lots of them.
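
The post doesn't describe the test program itself, so the following is only a toy sketch of the general idea: checksum every row before the dump, reload everything except the bad row, and confirm that each surviving row (blob columns included) comes back byte for byte. The row layout, the helper names, and the use of in-memory tuples in place of a real database fragment are all assumptions.

```python
import hashlib


def row_digest(row):
    """Checksum one row; bytes (blob) columns are hashed exactly as stored."""
    h = hashlib.sha256()
    for col in row:
        data = col if isinstance(col, bytes) else repr(col).encode("utf-8")
        h.update(type(col).__name__.encode("utf-8"))  # keep b"1" distinct from 1
        h.update(len(data).to_bytes(8, "big"))        # length prefix avoids ambiguity
        h.update(data)
    return h.hexdigest()


def reload_without(rows, bad_ids, id_index=0):
    """Return the rows to reload, skipping any whose id is in bad_ids."""
    return [r for r in rows if r[id_index] not in bad_ids]


def verify_reload(original, reloaded, bad_ids, id_index=0):
    """True if reloaded is exactly original minus the bad rows."""
    expected = {r[id_index]: row_digest(r)
                for r in original if r[id_index] not in bad_ids}
    actual = {r[id_index]: row_digest(r) for r in reloaded}
    return expected == actual


# Toy stand-in for a table fragment: (id, some value, blob column).
fragment = [(1, 0.5, b"\x00\x01blob"), (2, 1.5, b"\xff\xfe"), (3, 2.5, b"")]
reloaded = reload_without(fragment, bad_ids={2})
print(verify_reload(fragment, reloaded, bad_ids={2}))  # True
```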

- Matt

see comments

10 Nov 2009, 22:58:34 UTC
Today's Tuesday - that means we had our normal weekly maintenance outage, and we're recovering from that now. Outside of the normal database compression, backup, and log rotation type tasks we also took care of the following:

1. Replaced the faulty drive on thumper (the primary science database server). This system is on Sun Service so such hardware failures are trivial. A drive fails, we call Sun, they send us a new drive right away, we plop it in, we send back the old drive, done. However there are still nagging problems on thumper at the OS/database level that still require our attention (a corrupt row in the Astropulse signal database and that funky root/RAID configuration that can only be fixed during a clean OS install).

2. Upgraded mysql on both the master and replica servers (mork and jocelyn) to version 5.1.37. This was finally made available in the Fedora distros and from what I've been told may fix those unload/reload formatting bugs. While we were at it, we yum'ed up pretty much everything.

3. Rebooted mork and ptolemy to pick up crash-dump parameters for the kernels. We were going to install debug versions of the kernels but Jeff was having odd results with that while testing one on his desktop, so we're holding off for now. Rebooted jocelyn to pick up a new kernel as well.

That's about it for the outage. Recovery will continue for a while. I'm rebuilding the replica mysql database right now using the dump from today. When that's finished we'll start up the replica (maybe tomorrow morning).

Speaking of tomorrow morning, it's a holiday (Veterans Day), so I won't be up at the lab (probably just doing the usual "check in from home every few hours and tweak this and that").

- Matt

see comments

10 Nov 2009, 0:24:48 UTC
Our master mysql database server (mork) crashed on Sunday. The first crash when we brought mork on line way back when was a "fluke" - the crash a few weeks ago was explainable (or so we thought) - but now we're in the realm of "grave concern" about this particular server. However, the result of each crash is just an annoying chunk of downtime - the actual data remain intact after recovery, and recovery goes along without too much ado. Maybe we have just been lucky so far. I could see a flat out crash being a bit more disastrous.

Eric did the remote work of initial and post-reboot cleanup, Dan actually came up to the lab to physically power cycle the machine, which Jeff walked him through over the phone. I assumed we'd all just wait until the next day when we're all back at the lab to set things right (after all, we've had longer unexpected outages before). When I returned from prior obligations to find the projects up I was pleased by the heroic effort. Still, I quickly noticed that the splitters were in a funny state which required my intervention or else we would have immediately run out of work to send out, so I fixed all that.

Anyway, we'll have to do some extra recovery tasks tomorrow during the regular outage. This will include putting a debug kernel on mork and some other crash-test stuff that may hopefully give us clues if mork decides to disappear again.

- Matt

see comments

5 Nov 2009, 22:53:58 UTC
Eeeeoooo. Looks like this minor corruption in the science database is really snagging us, at least right now. We're talking one or two rows of the zillions in the astropulse signal table - but informix isn't being very informative about which row or two, nor what to do about it. Meanwhile, this broke the replication of astropulse - or at least we think it broke replication. This may very well have failed for some other reason.

This hasn't been a public data flow issue - we can still split/assimilate multibeam and astropulse work for the most part. Still, it's been preventing us from doing any science for a while now. So it's roll-up-our-sleeves time. We're doing a more robust table check (and hopefully repair) overnight tonight, and had to shut off astropulse splitting for now. Which means only multibeam workunits for the near term.

Meanwhile we filled up the raw data drive during all this software blanking analysis. I forgot to carry the one or something. Anyway, no big deal, some minor cleanup this morning, and we're back on track with that.

- Matt

see comments

4 Nov 2009, 23:28:41 UTC
Our internal file server ptolemy crashed again early this morning and Eric had it rebooted by the time I got in. This is getting to be more than a minor concern. We're going to start collecting kernel crash dumps so we can at least get a clue what's wrong if this happens again.

Informix tweaking continues. Some page corruption did get uncovered during the last science database backup, probably due to the RAID hiccup last week. Not a big deal, but that's just another thing on the list of "maybe that's the problem" when trying to get the database to do anything outside of the usual splitting/assimilating.

Meanwhile, version 2 of the raw data pipeline is getting more and more automated - you should see a few more files appear on the to-split queue throughout the evening without any intervention from me.

- Matt

see comments

3 Nov 2009, 22:13:35 UTC
Tuesday is our outage/maintenance day. This was the first database compression/backup using the solid state drives on mork for the innodb logs - there are a lot of variables at play (like the result table only being 80% the size it was last week), but at first glance it seems like that alone shaved quite a bit off the compression time. Cool. Bob also tweaked another informix parameter, bounced the science database, did some table checks, etc. - maybe this will improve our science database performance (which has been strangely prone to "locking up" as of late). Or maybe not (after restarting the project we still had some queries lock everything up - some work still to be done, I guess).

I also got a couple scripts in order such that I'm getting on top of the data pipeline again. Hopefully we won't run out of workunits again as badly as this past weekend.

Just got back from a meeting discussing the university's current furlough plan - yeah, due to state budget cuts we are being forced to take days off - a kind, gentle way of enacting pay cuts, but not pay cuts really in our case - since we aren't paid by state funds (it's all donations) we are only being forced to take days off for "parity" but SETI still gets to keep its funds. Fair enough, as I understand we're all swimming in the same bowl of soup and belts are being tightened all around. And I already take several days off a month without pay, so in my particular case it's a complete wash.

- Matt

see comments

2 Nov 2009, 22:57:50 UTC
In case you haven't noticed, we've been low on workunits. As warned in several previous tech news items (and now on the front page) we're still in the process of converting our data pipeline to use the new radar blanking suite (to vastly reduce noise/interference). This conversion process has been slowed by several factors, including these two: it takes a long time to bring up old data from our archives (approximately 4 hours per 50 GB file), and it turns out a lot of these files contain garbage that makes them impossible to process (which we can only discover after spending the time to bring the files up here). We are also low on current data because ALFA has been offline for a month due to maintenance.
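
To put the "4 hours per 50 GB file" figure in perspective, here is a rough back-of-the-envelope sketch. It assumes a single retrieval stream running around the clock and decimal units (1 GB = 1000 MB); the function name is made up for illustration and is not part of the actual pipeline.

```python
def retrieval_rate(file_gb=50.0, hours_per_file=4.0):
    """Return (files per day, GB per day, MB per second) for one serial retrieval stream."""
    # Assumption: files are pulled one at a time, back to back, with no idle gaps.
    files_per_day = 24.0 / hours_per_file
    gb_per_day = files_per_day * file_gb
    # Assumption: decimal units, so 1 GB = 1000 MB.
    mb_per_sec = file_gb * 1000.0 / (hours_per_file * 3600.0)
    return files_per_day, gb_per_day, mb_per_sec


files, gb, mbps = retrieval_rate()
print(f"{files:.0f} files/day, {gb:.0f} GB/day, about {mbps:.1f} MB/s")
```

Under those assumptions that is at most six files a day, and any file that turns out to contain garbage costs its full four hours before it can be rejected.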

In better news, ALFA is back up and we're collecting new data again. As well I moved the "testing phase" version of the data pipeline onto the main production data file server, which should generally help as we'll at least speed up disk i/o. Also our assimilator queue finally drained to zero again. I see that people are complaining about lack of work on various threads. We don't guarantee a steady stream of work, but do understand that such a steady stream is important for maintaining public interest. We're doing what we can. I'm getting another file on line as I type this - should be splittable (I hope) sometime this evening.

Our science database server (thumper) lost another disk over the weekend. No big deal, and the RAID recovered with a spare just fine - but nevertheless this is just another reminder that we really need to reconfigure the disk arrays on that system - they are unwieldy and inefficient.

- Matt

see comments

29 Oct 2009, 21:35:05 UTC
As predicted the data well temporarily ran dry overnight, but I'm trying my best to keep up with demand today (and set it up for over the weekend).

Weird thing today - I've been noticing intermittent problems connecting to the science database to make the most trivial queries. We thought this, and the assimilator queue backing up, were probably due to Bob's recent configuration changes to the informix database engine perhaps not helping so much. But then I noticed one of the assimilators was inserting thousands and thousands of signals as fast as it possibly could from a single result file... since 7:40am yesterday morning!

This is not normal. Result files usually contain a handful of signals, maybe a few dozen tops. If they reach 30K in size they are automatically "cut off" and sent back to us. I tracked down the result file with all the signals - it was 1.6 gigabytes in size! Not sure how this happened, nor how it passed validation (though I have my theories), but it sure contained a lot of signals repeated over and over and over again. I moved that out of the way and hopefully that'll improve performance in general around here.
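
A check like the following could flag such a file before an assimilator picks it up. This is only a minimal sketch, not anything running here: the 30K limit is taken from the paragraph above, while the function name and the idea of scanning a single flat directory are assumptions made for illustration.

```python
import os

# Assumed cutoff, based on the roughly 30K limit mentioned above.
MAX_RESULT_BYTES = 30 * 1024


def find_oversized(directory, limit=MAX_RESULT_BYTES):
    """Return (path, size) pairs for regular files larger than limit, biggest first."""
    oversized = []
    for entry in os.scandir(directory):
        # Skip subdirectories and symlinks; only plain files are checked.
        if entry.is_file(follow_symlinks=False):
            size = entry.stat(follow_symlinks=False).st_size
            if size > limit:
                oversized.append((entry.path, size))
    return sorted(oversized, key=lambda item: item[1], reverse=True)
```

Run against a result directory, a 1.6 gigabyte file would come back at the top of the list, since results are sorted largest first.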

- Matt

see comments

28 Oct 2009, 22:43:56 UTC
Jeff is back in town and back in action here at the lab. He's now working on the NTPCkr/RFI stuff (which has been languishing due to lack of effort and the science database throughput woes which I've been alluding to lately).

As predicted, I did finally get the astropulse version of the splitter to compile (just some library/linking bugs that had to be hunted down and exterminated). So astropulse workunits using the software radar blanking system are going out! Meanwhile, I hit some more management snags with the multibeam stuff - I'm trying to blank/split really old files which we recorded before we had all the kinks worked out. Long story short, some files I spent a lot of time (days) pulling up from our archives and doing the first stages of radar analysis are unsplittable. Darn. I was hoping to just get beyond the dearth of data in the nick of time, but it looks like I got to pull more files up from the archives, and we'll run a bit dry before they are splittable.

Today is a particularly windy day, which means it's fairly clear. Here's a picture taken from my iPhone looking out from the lab patio onto the Bay. That's the Lawrence Hall of Science directly below me, then downtown Berkeley, then the Bay itself, then San Francisco, the Golden Gate Bridge, and the Marin Headlands in the distance. The detail isn't so great, so you can't see that the Bay Bridge is completely devoid of cars right now (it's shut down due to technical difficulties), which is quite rare and quite odd.

- Matt

see comments

27 Oct 2009, 21:59:49 UTC
As many of you already know, Tuesday is the regular outage day where we dry clean the mysql database and pack it down tight. We're recovering from it now. Today I also did some testing of the newly employed solid state RAID 1 on mork (the master mysql server). It seemed fine, so this device now holds the mysql/innodb logical logs, thus resulting in far less competing writes with the data RAID 10 (where the logs used to be kept). Will this help much? I dunno. A non-zero amount at least.

I'm still assembling the new data pipeline. Got a few files in the queue now for multibeam analysis, but I can't seem to get a new astropulse splitter to compile. I need to recompile so that it reads the software radar blanking bit instead of the hardware one, but I'm hitting some library/include issues. Sigh. One of those problems you know you'll get working eventually but right now the path isn't exactly clear, and everything will be annoying until you finally get a successful "make."

- Matt

see comments

26 Oct 2009, 21:51:34 UTC
Okay, so where are we... Over the weekend the raw data queue shrunk down pretty far, but don't fear. Astropulse ran out of work to do, and multibeam has maybe another day or two, tops. Meanwhile I'm working behind the scenes actually splitting a bunch of software-radar-blanked data from 2006. This is actually going out now to people, but just doesn't show up on the server status pages. I'd have to do some minor hacking to get these files to show up on that page, but that'll be moot fairly soon as all data will be software-radar-blanked and I'll just point the script to look in the new data directory (as opposed to having it look through two directories and figure out the combined status of everything).

Anyway, there's that. We might run a little dry over the next few days as I'm still scraping together disk/memory resources to get these old files pulled up from the archives, analysed, and embedded with the new blanking signal. Only then can these files be split into workunits. I'm working on it.

Meanwhile, we're still having sporadic problems with informix locking up on us. It's getting to be really frustrating, as you don't really notice anything is wrong until the workunit queue runs dry or something like that. The idea of migrating to another database engine is on the table again. Also, bruno was having some nagging mount issues so I just now rebooted it. You may have noticed the whole project disappearing for a half hour there. That was me.

Rumor has it Jeff is back in town. He was away for several weeks hiking in the Himalayas. I imagine he has jet lag and other kinds of recovery to deal with, and he'll appear maybe later this week.

- Matt

see comments

22 Oct 2009, 20:52:51 UTC
Eeewww. Last night ptolemy (an internal-use file server) crashed. Eric rebooted it this morning, and I still had a bunch of cleanup to do after that which took me until just about now. Other systems had to be rebooted, nfs/autofs daemons kicked, stale trigger files removed, etc. I also bounced informix as it seemed like the science database was locked, but this turned out to be two different coincident problems, one affecting the splitters and one affecting the assimilators, making it seem like both were hanging on the science database.

The latter problem was a real nuisance. I had to reboot vader, mess around with iptables/network configs, /etc/exports, etc. all of which seemed to do nothing. The problem was that vader couldn't mount the result storage device (which is exported from bruno) while all other systems had no trouble mounting it. I never figured out the exact problem, but yum'ing in the latest nfs-utils package seemed to massage the right muscle and suddenly it was visible on vader. Fine. Everything is sort of catching up now. Bob also got the mysql replica in working order again, so that's good.

Hopefully this isn't a sign that ptolemy is on its way out... Ugh.

- Matt

see comments

21 Oct 2009, 22:50:09 UTC
The mysql replica was turned back on today, then turned back off - Bob noticed it was still misconfigured, so he's re-recreating it and will probably turn it back on soon (within the next 24 hours).

Meanwhile, the software blanking pipeline is still warming up - actually some workunits from 2006 will go out (secretly) very soon. It's hard to tell how fast I can get this data up from HPSS and blanked. It all may be too slow to keep up with workunit demand, but we'll do what we can. It's hard to automate this stuff - I find the more I automate things the more time I spend cleaning up large, unexpected disasters.

- Matt

see comments

20 Oct 2009, 22:19:49 UTC
Recovering from the weekly maintenance outage right now, during which we took care of a couple extra things (above and beyond the usual mysql database compression/dumping). Eric replaced a failed root drive on his hydrogen database server. While he was at it, he upgraded the system's OS (it was way out of date). Meanwhile, I took the opportunity to finally bite the bullet and remove the SETI network's reliance on this server, as it hosted (for only historic reasons) the 32-bit libraries for informix - so when this server went down pretty much everything hung waiting for it to return. So this pointless dependency is no more, which is a bit of a relief.

I also added a couple recently donated solid state drives to mysql master database server mork, if only to create a tiny RAID1 on which to put mysql logs, and thus hopefully reduce disk contention on the data drives (which currently also hold the logs). We'll implement that new mini RAID over the course of the coming week.

Also, it turns out mysql replication was broken for the beta project this whole past week. Oops. So tomorrow we'll start the recovery of that (using today's mysql dump). I also turned off "show tasks/results" as the project recovers. Maybe I'll turn that on tomorrow after the smoke clears.

I'm still pulling up files for future software radar blanking analysis/processing. It's really slow given our various network bottlenecks (real or imposed).

Oh yeah.. I guess this is also technical news: Most days I take the train to downtown Berkeley, walk across campus, then ride the hill shuttle up to the lab (which is 1.5 miles up a very tall/steep hill). The shuttle's brakes failed on the way home - or at least showed enough signs of pre-failure such that the driver refused to go any further. He called dispatch to get another bus, but nobody was responding to his pleas. Given my tight schedule (and lack of cell phone service on the hill) I had no choice but to walk all the way down the hill myself, which wasn't the first time, and was no big deal - just terribly annoying.

- Matt

see comments

19 Oct 2009, 21:03:04 UTC
Happy Monday. We had some "brown-outs" during the weekend brought on by our science database getting clobbered. We're still not exactly sure why it locks up the way it does, but we'll improve the underlying disk i/o subsystem someday, and that could only help (usually when it has fits I find the respective disk arrays are at or very near 100% utilization).

It's another rainy day around here, which means the air conditioner isn't as efficient, and the server temps are on the rise again. Scary, but there's not much we can do about it right away.

I'm actually pulling up old (2006) data from our off-site archives to be the first "production" data processed using the software radar blanking. We shall see how well it works (in both multibeam and astropulse) later this week, I imagine, if all goes well.

- Matt

see comments

15 Oct 2009, 19:08:57 UTC
FYI, the replica database server caught up and I turned the result views back on, etc. There are complaints that some results may have disappeared from the beta database. Bob and Eric are looking into it.

In the last thread, this old post of mine (from 2005) was quoted to point out the comedic irony that little has improved despite my claims:

> The SETI@home Classic backend is a tangled mess. There have been many problems over the years, most of which were invisible to the participants. None of these problems were fatal to the project or its science, but have resulted in an obnoxious web of ridiculous dependencies, confusing configurations, and unweildy databases. I am practically drooling dreaming of day when we get to turn all that stuff off and be done with it already. The BOINC backend is sooooo much easier to deal with.

I can see why this is funny (and I agree that it is), but allow me to point out in case people want to use this as some sort of sign of failure:

1. With the old server backend there was 0% chance that science would ever get done. Things like the NTPCkr were impossible in the old days.

2. We had a larger staff at the time of that post. Since we are currently working with less labor resources comparisons are unfair.

3. Our uptime has been much better since moving to BOINC, and downtime has been far more productive (users can work on other projects, etc.).

4. Yeah I admit there are still ridiculous dependencies, confusing configurations, and unwieldy databases, but it's a completely different set than what I was referring to back then, and generally things are better across the board.

- Matt

see comments

14 Oct 2009, 17:51:33 UTC
Got finished kinda late yesterday hence the lack of tech news report. So last week we had lots of database cleanup to deal with due to server crashes over the preceding weekend. The mysql replica database suffered quite a bit, so we planned to recover it using a standard mysql dump file, except we discovered that the latest version of mysql is buggy and the dump files sometimes contain syntax errors. Great.

So this week we recovered the replica by copying all the myisam and innodb data files (and logs) from mork (the master database server). We actually did the first rsync on Monday to help speed things along on Tuesday, but there are so many large files it took forever even just to do the "delta" rsync. That's why the outage was so long (this final rsync could only happen when the master database was quiescent).

This morning Bob and I made sure all the right config tweaks were made in /etc/my.cnf and started up the replica server. Only one minor snag at first which we fixed, now it's running again and catching up! We still have to figure out how to get mysqldump to work 100% of the time syntax-error-free. That's actually kind of scary.

Meanwhile, the Bay Area was hit with a record breaking storm yesterday. Yeah, I grew up in New York so I can say it was only really a "storm" by Bay Area standards but still we had low temperatures and high humidity. This wreaks havoc on our air conditioner, and the server closet has been hovering a few degrees away from disaster for a couple days now. In fact, ewen (Eric's hydrogen survey server) just lost a drive in the root RAID. It shouldn't be a big deal to replace, except that when ewen goes down for maintenance everything hangs (as there are lots of informix libs living on that system - we really need to move them off but have been loathe to do so for fear of breaking something else).

- Matt

see comments

12 Oct 2009, 23:02:21 UTC
The latest software blanking tests were also a success, so we'll start putting older pre-hardware-blanked data into production, now that we can remove the radar. Yay! May take a few days to rev up this engine. Meanwhile Eric has been making progress on the "zone RFI" rejection software/algorithms, so we can start getting rid of the garbage that makes up our current "top candidates."

The mysql replica was pretty much rendered useless by all our poking and prodding last week. We'll recreate it from scratch tomorrow (we hope). We are still concerned that we suddenly don't have a reliable backup mechanism, if mysqldump occasionally gives us dumps containing hidden syntax errors!

- Matt

see comments

7 Oct 2009, 20:35:19 UTC
The replica recovery is on hold for a while. We're experiencing random, intermittent issues when trying to recover one database with a mysqldump from another. This used to always work perfectly, but then something in mysql 5.1 screwed up the quoting and backslashing. I was able to get around this before by writing a script that parsed the large dump files one line at a time, but even that isn't working now. Bob has found other complaints on the web about this, so maybe there's a bug fix somewhere (we're certainly not going to pore through 20GB of ascii looking for missing backslashes and whatnot). We might have to do another dump from scratch, which won't happen until next week, which means the replica may be offline until then. Still, when we recover from the outage (and the weekend backlog) I might still turn on the "show results" flag so users can see recent result history, etc.

I'm working on a second test file to help solidify our warm feelings towards the software radar blanking suite. This will get split/sent out tomorrow (unless I suddenly disappear on an impromptu vacation, which is known to happen from time to time).

Due to popular demand I put a little effort in this morning to cleaning up the technical news main page - it's been a while since I have done so, so the page has gotten rather large and painful to load for people on slow connections.

- Matt

see comments

6 Oct 2009, 22:43:01 UTC
Quick post-weekly-outage wrapup: everything went fine, albeit a little slow given recent events. The replica recovery is going on now. Hopefully it'll continue along safely overnight and we can turn the replica back on sometime tomorrow.

One hilarious note. All our server reboots over the weekend dislodged several instances of sendmail, which then went on to send forth unexpectedly large queues of cronjob/server related e-mails to me, Jeff, and Eric. We're talking about 35,000 e-mails, all of which went through the lab spam firewall first, thus clobbering everybody's e-mail in the entire Space Lab for about 24 hours. Fun.

- Matt

see comments

5 Oct 2009, 20:43:16 UTC
Okay that was an ugly weekend. On Saturday morning I came to realize that our master mysql database server (mork) had crashed. I was the only one available at the time so I came up to the lab and rebooted the thing. We really need to improve our remote kvm/power cycle situation. I babysat the reboot long enough to see that mysql was recovering, knowing though that the replica would be out of sync (and need to be regenerated from scratch during the next weekly backup).

But then everything else crashed, and also hard enough to require human intervention. This time Eric eventually came up on Sunday to try to reboot a series of servers, but to no avail - they kept locking up shortly after reboot.

So Monday morning (today) we came into the lab and started cleaning up the server situation. Eric finally found the cause of the latter, if not all, of our problems. We have a pseudo user account that is the "user" that runs a lot of stuff: apache processes, cron jobs, some of the BOINC back end servers, etc. For some reason its .history file had grown to 8GB in size, and it was full of garbage. Not sure why just yet, but that meant every time one of the above processes started, the shell tried to read in this impossibly large history file. Oops. Once Eric deleted this file all these dams broke free and we were able to safely recover all the databases/etc. throughout our long morning.

- Matt

see comments

1 Oct 2009, 19:41:59 UTC
Some random news items as the work week winds down. First, we did finally get some data drives from Arecibo - the last of them until we start observing again in early November (at the earliest). So that'll tide us over for a short while. Second, it seems like the third time's the charm: preliminary results from the third software radar blanked data test are looking good! We might roll this into production as early as next week. This means we can start analyzing a wealth of pre-2008 multibeam data that was otherwise useless.

We're still having some science database throughput issues that are keeping us from running the NTPCkr as much as we'd like. More and more this is becoming my number one priority.

- Matt

see comments

29 Sep 2009, 21:36:20 UTC
Hello all - usual outage day again today. It's an interesting battle between our two mysql database servers. Okay maybe not that interesting. But mork has far more RAM, and jocelyn has a much faster disk array. And we see what we expect - mork is a much better master server as it can hold the database in memory and do all kinds of random access, but during the outages jocelyn does its database table compression much faster, as it involves a lot of sequential writes to disk. Anyway, we're back up - not much shakin' on that front.

We did have an outage last night for an hour. This was a known event involving some network infrastructure maneuvering down on campus. It was unclear how long this would take, so we didn't bother with any kind of panicky warning on the home page that we were going to be down for an unspecified amount of time. I think you're all used to that by now anyway. Plus, the good news is that this was one more task out of the way such that campus can get back to determining our bandwidth upgrade needs.

I found yet another radar blanking bug. At least I'm *finding* the bugs I guess, and it's much easier to fix them once they are spotted. Anyway, iteration 3 will commence sometime in the next day or so.

And thank you to Tiaan Geldenhuys, who donated a bit of javascript to our NTPCkr page such that if you zoom in on the Google skymap you'll see the border of the "pixel" which makes up this candidate.

- Matt

see comments

24 Sep 2009, 19:29:14 UTC
Hey gang. Sorry to say the first software radar blanker tests were kind of a bust - apparently some radar still leaked through. But we have strong theories as to why, and the fixes are trivial. I'll probably start another test this afternoon (a long process to reanalyze/reblank/resplit the whole test file - may be a day or two before workunits go out again).

To answer one question: these tests are happening in public. As far as crunchers are concerned this is all data driven, so none of the plumbing that usually required more rigorous testing has changed, thus obviating the need for beta. And since there are far more flops in the public project, I got enough results returned right away for a first diagnosis. I imagine if I did this in beta it would take about a month (literally) before I would have realized there was a problem.

To sort of answer another question: the software blanker actually finds two kinds of radar - FAA and Aerostat - the latter of which hits us less frequently but is equally bad when it's there. The hardware blanker only locks onto FAA, and as we find misses some echoes, goes out of phase occasionally, or just isn't there in the data. Once we trust the software blanker, we'll probably just stick with that.

On the upload front: Sorry I've been ignoring this problem for a while, if only because I really see no obvious signs of a problem outside of complaints here on the forums. Traffic graphs look stable, the upload server shows no errors/drops, the result directories are continually updated with good looking result files, and the database queues are normal/stable. Also Eric has been tweaking this himself so I didn't want to step on his work. Nevertheless, I just took his load balancing fixes out of the way on the upload server and put my own fix in - one that sends every 4th result upload request to the scheduling server (which has the headroom to handle it, I think). We'll see if that improves matters. I wonder if this problem is ISP specific or something like that...
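
The "every 4th request" idea amounts to a simple counter-based split. The sketch below only illustrates that logic; the real change lives in the upload server's web/balancer configuration, and the function name and the "upload"/"scheduler" labels here are hypothetical.

```python
from itertools import count


def make_router(every=4, primary="upload", overflow="scheduler"):
    """Return a function that picks a target for each incoming upload request."""
    counter = count(1)  # request numbers start at 1

    def route():
        # Every `every`-th request goes to the overflow target, the rest stay put.
        return overflow if next(counter) % every == 0 else primary

    return route


route = make_router()
targets = [route() for _ in range(8)]
# Requests 4 and 8 land on the scheduler; the other six stay on the upload server.
```

With `every=4` this shifts a quarter of the upload load onto the second server, regardless of which client each request comes from.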

I'll slowly start up the processes that hit the science database - the science status page generator, the NTPCkrs, etc. We'll see if Bob's recent database optimizations have helped.

- Matt

see comments

23 Sep 2009, 20:46:09 UTC
Had more science database woes at the end of the day yesterday - processes (including splitters) getting logjammed. I'm hoping a couple "update stats" commands will fix all that.

Speaking of splitters, I'm actually running (drumroll please) the first software radar blanked data through a splitter right now, and workunits will be distributed to the public fairly soon. This is still in test phase - we shall see if the software blanking performs better than (or worse, or the same as) the hardware blanking. I'm guessing that with a couple tweaks here and there my code will be far better.

- Matt

see comments

22 Sep 2009, 20:43:14 UTC
Today was an outage day, with nothing special to report on that front. One interesting note is that our master mysql database server (mork) has 24 processors and 64 GB of memory, and the replica server (jocelyn, which used to be the master) has 4 processors and 28 GB of memory. Eric recently cleaned out really old rows from the beta result table - now the entire database fits better in memory on jocelyn, and in turn this database engine generally performs better than mork. How could this be? Because despite having far less memory and fewer processors, jocelyn has more disk spindles (and faster disks, for that matter) than mork. Not really all that surprising, but it's fun to see our suspicions about disk performance confirmed with memory being less of a bottleneck. In any case, both servers are zippy and today's outage wasn't very long, was it?

So the weekend went by with nary a blip, or even a single alert from my web of alert scripts. This pretty much never happens. We always get some kind of warning, severe or otherwise - high load on this server, replica database is falling behind, rising temperatures in the closet... but nope. Everything was just fine.

However yesterday we did have one short traffic dip due to the science database getting locked up on too many internal user queries, so the splitters weren't creating work for a couple hours there. No biggie - we killed the queries and informix sprung back to life. It is a bit worrisome how locked up the database can get, though, and it's hardly predictable when (or why) it does.

I'm actually running my software radar blanker through an entire 50GB test file right now. It processes in roughly twice real time (meaning a file containing n hours of data takes 2n hours to find radar and blank it). Not to worry - we can run many of these in parallel. I could also make several code optimizations if need be. Anyway, I'm hoping by the end of the week to trust this suite of software enough to start processing our large backlog of 2007-2008 data by next month.
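
For a feel of what "twice real time" means for a backlog, here is a small sketch of the arithmetic. It assumes the work divides evenly across parallel runs and that those runs do not slow each other down (disk i/o contention could easily break that assumption); the function names are illustrative only.

```python
import math


def blanking_hours(data_hours, slowdown=2.0, workers=1):
    """Wall-clock hours to blank `data_hours` of recorded data."""
    # Assumption: the backlog splits evenly and workers do not contend for resources.
    return data_hours * slowdown / workers


def workers_to_keep_up(slowdown=2.0):
    """Smallest number of parallel runs that processes data as fast as it was recorded."""
    return math.ceil(slowdown)


print(blanking_hours(10))             # 10 hours of data, one run
print(blanking_hours(10, workers=4))  # same data, four parallel runs
print(workers_to_keep_up())           # runs needed to match real time
```

So under these assumptions two parallel runs are enough to match real time, and anything beyond that starts eating into the backlog.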

Oh yeah one more thing - we do know that "queries/second" field is blank on the server status page. For some reason the same exact informational query on one server returns in a different format than the other, so our general "db stats" script is sorta broken. Bob is fixing it.

- Matt

see comments

17 Sep 2009, 19:55:10 UTC
As Josef pointed out in yesterday's thread we are indeed unable to get any new data from the telescope until early November. This is a problem because we have only a few drives full of data on our shelf, and maybe a few drives down at Arecibo (which we asked to have shipped up to Berkeley).

The silver lining is that Jeff has been putting effort into getting the data recorder crashing issues fixed - now that project can be back-burnered and he can focus on RFI issues. Meanwhile I'm cracking on the software radar blanking stuff. I actually made a significant advance this morning, discovering that at any given time the radar patterns we are locking onto can drift as little as 0.1 samples, with drastic results in our ability to find the radar. I've solved that little bit, and it's all pretty much plumbing/testing/deploying at this point. Hopefully I can get this rolling before we completely run out of data. Of course, I always feel that running out of data shouldn't be that big a deal.

By the way, one of the reasons I've been lax with these threads lately is that I'm getting tired of being the sole focus for tech support/donation queries/etc. Please don't be insulted if I address roughly 0% of your requests that are personally addressed to me. I simply don't have the time. I keep asking for additional web presence and user interaction from the others or perhaps the hiring of actual web support staff, to no avail.

- Matt

see comments

16 Sep 2009, 20:53:41 UTC
Hello again. Sorry about the lack of information lately. I was out sick a large chunk of last week.

Anyway... it's been business as usual more or less. The raw data pipeline really shrunk down but fresh data finally arrived from Arecibo, so we were able to flood the queues again. But I see that we're in a period of about two weeks of zero observations, so we might tighten the belt again before too long. The new mysql setup (mork as master, jocelyn as replica) has been working quite well the past couple of weeks. We have another mork-like server (tentatively called mindy) but it, like most of our equipment around here, was a donated system of unknown quality. Several hours of fighting with it yesterday makes me believe mindy may be a dud (processor errors during boot, etc.).

There have been complaints lately about uploads. I don't see any immediate problems on my end. I see files appearing on the server at the normal rate. The traffic graphs don't show anything vastly awry. Eric's been messing with the apache/balance settings on that system so I defer all questions to him.

Eric and Jeff are working on the first gross-level RFI removal infrastructure. Once that's in place the NTPCkr data will start making slightly more sense (the top candidates are all pretty much junk right now). Until then, I will only upload the top ten list by hand every so often.

- Matt

see comments

3 Sep 2009, 17:32:56 UTC
Sorry about the delay in posting. I've been around, just busy. Those interested in more info should note that we are posting general weekly meeting updates at

Outside of lots of little network/system hiccups which have been addressed in our usual whac-a-mole manner, there have been continuing data pipeline issues. The data recorder at Arecibo has been crashing, seemingly randomly. This wouldn't be a big deal but it requires human intervention to reboot, so when it locks up at night, we can miss hours of data. Meanwhile, our reserves are pretty much running dry. We do expect a shipment of at least 4 full data drives by early next week. We may run out of data over the weekend, but that's okay. And yes we are aware of splitters stuck on certain files.

On a more positive note, server mork (a new 24 processor/64 GB RAM intel system) is working beautifully as our master mysql database server (handling a sustained 2500 queries/second without breaking a sweat). Meanwhile we reconfigured jocelyn to be the replica server now. There are some gotchas we've been working around so not all pieces have fallen into place on that front, but we're close. The former replica server, sidious, has been retired (it's actually powered off and sitting on a lab bench).

I haven't updated the NTPCkr candidate list in a while as the candidate scorer program seems to lock up the primary science database. I'll mess around with that today (mainly trying to force it to connect to the secondary science database server).

Little progress on the radar blanking front, though still non-zero progress. Finding the time is difficult.

- Matt

see comments

19 Aug 2009, 22:07:01 UTC
Okay. Spent a large chunk of the day hacking the last final bits of the NTPCkr web page together and made it available for public viewing. Yippee! There's a link on the front page in the news section if you're looking for it.

There's still a ton more work to be done on this page, as well as the NTPCkr itself, and this is still just the first step in many as far as final data analysis is concerned. We haven't even touched radio frequency interference removal yet (outside of the tools we already have from other SETI projects that we could retrofit for SETI@home). Still, it's a (seemingly rare) major step in the right direction around here.

I also had a code walkthrough with Jeff/Eric about my radar blanking difficulties. Eric had several good things to try, which I'll get started on once I post this message. Actually I might look into the stuck science status page first...

- Matt

see comments

18 Aug 2009, 22:41:06 UTC
Outage day, usual drill: shut everything down, back up the mysql databases, fire off a science database backup as well while we're at it, compress the mysql tables (which get fragmented over the course of a week), and start everything back up. As far as that was concerned, everything went smoothly.

However, we were hoping to hook up a couple extra solid state drives to the new replica server mork. The plan was to put some mysql logs on these drives to help unload extra i/o from the rest of the database drives. We got all the hardware in place and hooked it up today, only to find the server BIOS wasn't seeing these drives. In the time allotted for this task I determined this was either due to (1) bad cables, or (2) motherboard weirdness. Since this is an Intel donated server with an "experimental" motherboard, all bets are off. I did prove we could see the SSDs when I swapped cables around, but given the current setup we couldn't run normally like that (long story). In any case, I think we're fine without these drives for now, and may still go along with the plan to make mork the master next week.

Other than that, radar blanking woes continue. I'm going to have Eric and Jeff look at my code tomorrow and point out what I'm doing wrong, if anything. I also hope to get some version of the NTPCkr page online tomorrow (he says with little fanfare).

- Matt

see comments

17 Aug 2009, 21:22:19 UTC
Okay, things haven't been running so well the past couple of days. First, there were some mount problems in the middle of last week which caused our assimilator queue to clog up. This inflates our result table causing all kinds of table fragmentation which never helps the general pipeline. Later in the week I noticed the spike table in the science database was running out of space, so Bob added a few more database chunks. That process eats up a bunch of disk i/o, causing splitters/assimilators to slow down temporarily. But then we hit some major chokepoint causing work production to grind to a halt.

Actually it was worse than that - things were working normally, but only really slowly. This makes it hard to find an obvious smoking gun. Usually this is a symptom of heavy disk/database i/o on thumper. We were testing all that this morning by turning processes off but to no avail.

So.. remember how I mentioned in my last note how we just got new raw data from Arecibo? Well, the script copying it over to the raw data storage server failed to register the file system was full, and packed it up tight. Turns out this caused the storage server some distress, and when I finally checked into it this morning the load was high and all the nfsd's were in disk wait. I deleted one excess file, the nfsd's sprung to life and the whole dam broke, the splitters charged full steam ahead, and the network bandwidth is now tapped out trying to catch up on demand. Fair enough.
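
A minimal sketch of the kind of guard the copy script was missing, assuming a plain file copy onto a locally mounted filesystem. The function name and the headroom parameter are hypothetical; this is not the actual transfer script.

```python
import shutil
from pathlib import Path

def copy_if_room(src, dest_dir, reserve_bytes=0):
    """Copy src into dest_dir only if the destination filesystem has room
    for the file plus reserve_bytes of headroom. Returns True if copied."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    needed = src.stat().st_size + reserve_bytes
    free = shutil.disk_usage(dest_dir).free   # bytes available on that filesystem
    if free < needed:
        return False                          # refuse rather than pack the disk tight
    shutil.copy2(src, dest_dir / src.name)
    return True
```

A caller would check the return value and stop (or alert someone) on False instead of retrying into a full disk. Choosing a nonzero reserve leaves slack for whatever else writes to the same partition.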

- Matt

see comments

13 Aug 2009, 20:22:28 UTC
I was actually out the past couple of days. Family stuff, including an adventure where we had to tow our Prius almost 100 miles back to Oakland (it freaked out and lost power on I-5). It's in the shop now - luckily these newfangled cars store debugging information so they were able to locate the problem (flakey potentiometer causing erratic accelerator information, and as a failsafe the Prius cut its own power).

Anyway.. during the past couple of days Jeff and Bob handled the Tuesday outage, and Eric tackled a couple general network issues as well (the upload server got misconfigured somehow and was dropping excess connections, and then the assimilators were dead in the water for a while there, causing the queue to back up, the workunit disks to fill up, and finally the splitters to shut down - which is why we ran out of work to send out last night). All seems much better now, albeit jammed with traffic.

In better news we did finally get the first two data drives from Arecibo as recorded by the upgraded data recorder and new external drive docks under normal operations. So we're not going to run out of raw data after all, or at least not just yet. I'm copying those raw data files onto our local drives as I type.

- Matt

see comments

10 Aug 2009, 21:30:52 UTC
Happy new work week (for those with "standard" work week formation)! The weekend was rather quiet - no major outages or glitches. We burned through all the data we have on line for Astropulse, but still have plenty to process for multibeam. We do have at least a couple drives containing hot, fresh data coming up from Arecibo any day now. We're also hoping the amount of time ALFA gets to observe actually increases, or else we'll continually be dangerously close to being, if not completely, out of data to process. As far as problems/concerns go, this is a good one.

I got the first rev of the daily cronjob running right now which creates an updated "top ten" candidate list (via the NTPCkr) to be parsed by some PHP for public consumption. It's running now, and taking a long time. I'll see how long it takes before making anything live, being as how we'd like to run this every day, but may be forced to pace it slower than that.

As for radar blanking, I'm finding the correlations still aren't clearly defining which is radar and which isn't. I'm going to talk to Dan about that shortly.

- Matt

see comments

5 Aug 2009, 21:57:21 UTC
Raw data pipeline: Jeff and I are mining old files that were only partially done for one reason or another. Hopefully these can keep us crunching until we get more data from Arecibo. To add insult to injury, it seems the observatory has been suffering from several power outages the past few days, probably due to thunderstorms.

MySQL databases: So far so good with mork as the replica. We recovered pretty quickly from the outage yesterday. I'm hoping the freeze yesterday was a fluke, or caused by some temporary variable which has since changed, or at the very least next time it happens we'll have some kind of smoking gun somewhere on the system. We're looking into getting its twin "mindy" on line sooner than later.

NTPCkr: Jeff and I met this morning to discuss the current status of what we need to do to get this thing on line. To be clear, Jeff has been doing pretty much all the work on the NTPCkr engine, and I've been helping with the cosmetic/web stuff. Anyway, Jeff has a couple bugs to clear up. Nothing major - things like the reporting mechanism sometimes spits out the same candidate twice. I've been working on web site stuff, like putting in all the hooks to allow people to discuss candidates amongst themselves on a separate forum. Once we clear up our current set of bugs/updates I'll fire up a daily cronjob which will (a) generate the current "top ten" list, (b) pull all the data from the science database from these candidates (if not already on disk) for plotting purposes, and (c) create discussion threads for each candidate (if they don't already exist). Then we're live, but we'll have many "version 2.0" tasks to address right away.
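
The three cronjob steps above can be sketched roughly as follows. Everything here is a stand-in: the function name, the (id, score) input shape, and the two sets that represent "already on disk" and "thread already exists" are assumptions for illustration, not the NTPCkr code. The point is that each step is safe to repeat daily, and that a candidate reported twice only appears once.

```python
def daily_update(scored, on_disk, threads, top_n=10):
    """scored:  list of (candidate_id, score), possibly with repeats.
    on_disk: set of candidate ids whose plot data is already cached.
    threads: set of candidate ids that already have a discussion thread.
    Returns the top-N ids; the two sets are updated in place."""
    best = {}
    for cid, score in scored:
        # Keep one entry per candidate so a repeated report cannot
        # put the same candidate in the list twice.
        if cid not in best or score > best[cid]:
            best[cid] = score
    # (a) current top list, highest score first, ties broken by id
    top = sorted(best, key=lambda c: (-best[c], c))[:top_n]
    for cid in top:
        if cid not in on_disk:
            on_disk.add(cid)   # (b) stand-in for pulling data from the science database
        if cid not in threads:
            threads.add(cid)   # (c) stand-in for creating a discussion thread
    return top
```

Running it a second time with the same input changes nothing, which is the property a daily job needs.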

- Matt

see comments

4 Aug 2009, 22:52:27 UTC
Tuesday is our usual outage day, as many of you are firmly aware. Today was the usual drill, except we have two replica databases to deal with. We set the "alter table" scripts on these two systems simultaneously, prepared to laugh at how much faster mork will perform than sidious.

And it was doing great, even faster than the master database (jocelyn)... until it crashed. And it was the worst kind of crash - the system simply froze, requiring a hard reset, and there was not a trace of any evidence anywhere upon reboot about what happened. So now we have the complete opposite of a warm fuzzy feeling about mork, but nevertheless even with this setback, and the ensuing innodb database recovery, it still wrapped up all its tasks around the same time as the master database, and so both master/replica are back online and serving requests. I didn't need to temporarily turn off the "show tasks" pages because we can handle them, even right after an outage. The old replica (sidious) is still chugging away on its table compression tasks, and will probably be done with those around midnight.

Meanwhile the rest of the day I've been gathering data and making plots to better understand the radars that clobber our Arecibo data. Selecting thresholds is rather difficult, as it changes from file to file where the baby ends and the bathwater begins. Sigh. But we're close, and can do a rough enough job of getting most of the radar out without losing too much data.

People asked about the NTPCkr pages. Oh yeah.. That.. Jeff and I were pushing on those last month, then I disappeared on vacation, and then we both were at OSCON in San Jose, and then the new replica server finally started working so that's been occupying our time, along with scrounging data together to process. Sorry about the delays. I know we're close to publishing something. This is kind of an important addition to the web site so we want to make sure it kinda works before embarrassing ourselves with broken/misleading information.

- Matt

see comments

3 Aug 2009, 23:19:10 UTC
A relatively spotless weekend (though I did arrive this morning to find 1000 e-mails in my inbox - all warnings about mount issues from a behind-the-scenes compute server). The new replica server "mork" caught up pretty much instantly last week once the whole database was read into memory (about 32GB) and is now actually serving as the main replica for now, if only to stress test it. We still may crack it open and reconfigure it if we find the drive configuration is a bottleneck. In any case, if you're looking at results on this web site, you're pulling them off mork.

We are also getting close to running out of data. Just as we got the data recorder working again they had two weeks without any Alfa observations. We're currently trying to split raw data files that were only partially split for one reason or another, but after that... looks like my software radar blanker project has been bumped up in priority. No need to panic, at any rate - we probably have a couple weeks, I think, and we might get a burst of new data from Arecibo during that time.

- Matt

see comments

30 Jul 2009, 19:46:03 UTC
So we seem to have gotten over the hump with this new replica server. I should point out working on this server has had zero effect on the rest of the normal project operations, except for perhaps eating up all my time. Anyway, my script got around the dump/restore bug, and after some configuration headaches this morning we are successfully replicating on mork! Of course, sidious continues to be the replica we are using for production, while mork is considered "beta test."

It is catching up on the backlog far more slowly than we hoped, especially given the power of the machine. Of course, power is measured in network, disk, memory and cpu. This system certainly has cpu (24 processors!) but word on the street is that mysql actually *drops* in performance after n processors. What "n" is, and what the penalty is, remains unclear. Also, this system has fewer disk spindles than sidious (8 compared to 10), and they are slower disks, I think. So we may be seeing a disk i/o hit, but iostat doesn't really show anything amiss. The system is also in our lab and not in the closet, so there may be an extra network hop or two slowing things down. Anyway, as it progresses we'll gauge its performance and act accordingly.

As for changing linux flavors, the current issue here is mysql versions, and not so much linux distributions. As mentioned elsewhere we're trying to adhere to a homogeneous setup, and we have less than zero time to mess around with anything experimental like trying new OSes on for size. In any case, Fedora works well enough, and while I generally swear by open source software for both philosophical and practical reasons, I do understand that you get what you pay for.

- Matt

see comments

29 Jul 2009, 22:32:28 UTC
So.. getting mork on line as a test replica server still continues to be one headache after another. We finally got the hardware working, finally got the drive configuration set up, finally got the OS installed, finally got MySQL fired up, and we were populating the databases using Tuesday's dump files.

Then we hit a completely mysterious error and consistently at the same point in the dump file. Long story short, I spent pretty much all day today trying to find the cause of this error. At this point we're about 90% convinced it's an actual bug in the MySQL version that comes standard with Fedora Core 11 (version 5.1.35) where it fails reading mysqldumps containing large text fields. This seems like a major problem, no? Anyway, the same mysqldump worked on a test 5.0.x database engine. So I'm looking to upgrade this version beyond what's in the current Fedora repositories. What a pain!!!

I just turned on the "show results" flag, even though our current replica is still far behind reality.

- Matt

see comments

28 Jul 2009, 22:32:02 UTC
Had the usual Tuesday outage today for database maintenance. Nothing too exciting to report about that except we continue to make progress getting new server mork on line as secondary replica (and hopefully someday primary master). MySQL is running on it, and all the tables are being populated as I type this.

A note about the "old junk" I mentioned yesterday. I was talking about real junk (gutting parts servers, shipping boxes, etc.). We still have the E450s that were our various servers during the "classic" phase of SETI@home. We keep talking about auctioning those off but I doubt any of us will ever have the time to coordinate that. Maybe we'll donate them to the Smithsonian.

- Matt

see comments

27 Jul 2009, 21:54:56 UTC
Not much time to report very much, but the good news is that we finally got one of those new Intel machines working. Eric was in over the weekend installing a new disk controller card, and Jeff and I wrapped up the OS install/configuration today. We now have a new system with 24 processors (4x6 2.13 GHz) and 64 GB ram. We'll try to make this a replica mysql server (in addition to sidious) and see how it does, maybe tomorrow...?

Data-wise, we're finding the Alfa receiver isn't on as much as we thought, and we're running low on data from our archives, as well as data currently on-line. Actually, that's not true at all - we have plenty of data taken between January and April 2008 which has the hardware radar blanking signal (so we can reject RFI), but was accidentally pre-precessed (so we have to unprecess after the fact). Not that big a deal.

About to disappear into the basement and throw out a bunch of old computer junk we haven't used in many years (various people are complaining about how much space it's taking up, which is fair).

- Matt

see comments

23 Jul 2009, 20:21:14 UTC
Oh, hello. I was out of town most of last week on vacation, then Jeff and I were at OSCON 2009 down in San Jose until today. Despite being billed as an open source developer conference we got all kinds of linux sysadmin and mysql tips and tricks from various experts that we may apply towards better diagnosing of system/network/database issues in the future.

That all said, I haven't had the time to catch up on the lengthy discussions here in this forum during my absence. I imagine it has been mostly about our continuing network struggles. This may all become quite moot quite fast as Eric started rolling out the updated scientific analysis configuration, which is an easy knob to turn as we can increase sensitivity, thus improving our science, with the additional happy side benefit of reducing demand on our servers. I think, though, that we have now just reached the limits of that particular knob before getting diminishing returns.

Apparently there were a few servers that needed to be kicked while I was away. Jeff and Eric took care of all that. Mount issues and the like. We also seem to have our new disk arrays set up both at Arecibo and here, so the raw data pipeline should be kicking into full swing again soon. This is good as we're down to our last 10 files that we've been bringing up from the archives (there are a lot more files, but they require the radar blanking software to work in order to be processed, and I haven't gotten around to that yet).

- Matt

see comments

13 Jul 2009, 22:11:48 UTC
The data pipeline over the weekend seemed to be more or less okay, thanks to running out of Astropulse workunits and not having any raw data to split to create new ones. Of course, I shovelled some more raw data to the pile this morning, and our bandwidth shot right back up again. This pretty much proves that our recent headaches have been largely due to the disparity of workunit sizes/compute times between multibeam/Astropulse, but that's all academic at this point as Eric is close to implementing a configuration change which will increase the resolution of chirp rates (thus increasing analysis/sensitivity) and also slow clients down so they don't contact our servers as often. We should be back to lower levels of traffic soon enough.

We are running fairly low on data from our archives, which is a bit scary. We're burning through it rather quickly. Luckily, Andrew is down at Arecibo now, with one of our new drive bays - he'll plug it in perhaps today and we'll hopefully be collecting data later tonight...?

To be clear, we actually have hundreds of raw data files in our archives, but most of them suffer from (a) lack of embedded hardware radar signals (therefore making it currently impossible to analyse without being blitzed by RFI), or (b) accidental extra coordinate precession, or (c) both of the above. Software is in the works (mostly waiting on me) to solve all the above.

- Matt

see comments

9 Jul 2009, 22:09:13 UTC
Not much news. Eric, Jeff, and I are still poking and prodding the servers trying to figure out ways to improve the current bandwidth situation. It's all really confusing, to tell you the truth. The process is something like: scratch head, try tuning the obvious parameter, observe the completely opposite effect, scratch head again, try tuning it the other direction just for kicks, it works so we celebrate and get back to work, we check back five minutes later and realize it wasn't actually working after all, scratch head, etc.

Thanks for all the suggestions the past couple of days (actually the past ten years). Bear in mind I'm actually more of a software guy, so I'm firmly aware that there's far more expertise out there regarding the nitty gritty network stuff. That said, like all large ventures of this sort the set of resources and demands are quite random, complicated, and unique - so solutions that seem easy/obvious may be impossible to implement for unexpected reasons - or there are some key details that are misunderstood. This doesn't make your suggestions any less helpful/brilliant.

Okay.. back to multitasking..

- Matt

see comments

8 Jul 2009, 19:03:43 UTC
Once again it took the replica all night to recover. I started it up this morning, and it's catching up now. Well, almost. I'll turn the "show tasks/results" feature back on once it really starts catching up.

There's been a lot of discussion lately about our bandwidth woes. I actually talked to Blurf this morning on the phone regarding the (rather generous) push to donate money/hardware towards solving this problem. Let me try to paint a big picture here.

We pay for a gigabit of bandwidth from our private ISP (Hurricane Electric), but can only use 100 Mbits given current campus infrastructure. Most of campus is on gigabit already, but our lab is all the way up the hill - so it's much harder and more expensive to improve the old wiring/routing. The entire rest of the Space Lab uses about 10 Mbits/sec, so there is absolutely zero push by anybody else to spend money/effort on this project. Luckily, there was a spare 100 Mbit cable which is what we are using for the Hurricane Electric link.

While we pay for our bits, they still have to route through campus in order to ultimately hook up with the right backbones. That means we have to adhere to campus's network specs, which in turn means we can only use very specific brands/models of hardware, and can only act once they've fully researched our needs. We opened up a ticket months ago asking to start this research. We got word a couple days ago this research has more or less finally begun. Not much progress, but still non-zero. This may seem impossibly slow, but campus really pretty much always has much bigger fish to fry. Plus our requests usually present them with something new they haven't dealt with before, and therefore they are far more careful.

Ultimately, we should be presented with a couple options from campus which include exact pieces of hardware to be obtained. It's still not clear how much cable has to be upgraded and where, but we know we'll need two new routers, if not also other hardware. When campus gives us this final report, only then can we start figuring out how to obtain the necessary hardware.

As for other options, like going wireless... There actually used to be a building down in the flats that got wireless bandwidth from us. The experience was that it was quite slow and prone to suffering during bad weather, etc. This was a while ago, but still there is enough concern about reliability that nobody seems to want to go down this path.

Of course, another option is relocating our whole project down the hill (where gigabit links are readily available), or at least the server closet. Since the backend is quite complicated with many essential and nested dependencies it's all or nothing - we can't just move one server or functionality elsewhere - we'd have to move everything (this has been explained by me and others in countless other threads over the years). If we do end up moving (always a possibility) then all the above issues are moot.

Another important thing to consider is that we can always reduce our bandwidth demands via other means, which I also explained in other recent threads. Things like removing redundancy (and putting a cap on workunit downloads per day per host), or adding scientific analysis. Or, to be a little extreme, calling SETI@home done, turning off the downloads for good, and moving on to the next thing (something I am actually in favor of doing sooner than later, but the others around here seem to disagree).

I definitely appreciate past and current efforts to help us get beyond the current bandwidth crisis. However, as noted above, there are enough variables involved that I'd hate for you all to start collecting money directed towards a solution to a problem which might just go away. In the meantime, thanks as always for your patience (and crunching time when you actually do get workunits) - we'll keep working with what we got and see if we can't get beyond the storm sooner.

- Matt

see comments

7 Jul 2009, 22:35:01 UTC
Had the usual weekly database maintenance outage again today. It looks like our mysql database has shrunk for two weeks in a row now (due to fewer results out in the field). This is a good thing as it means more internal I/O resources. We're recovering from the outage now as I type this. I still expect it to take a while (maybe a full 24 hours?) before we stop dropping connections left and right.

As for that raw data storage server issue mentioned yesterday... turns out it was, uh, user error. A partition filled up. Oops. Still, not sure why the data transfer tools (to pull data up from our off-site archive) weren't noticing that the disk was full and kept trying to write to it over and over and over again.

Question: does anybody out there actually *use* NetworkManager? Or does it exist simply to confound and annoy? I'm willing to believe it's a useful tool, but unfortunately my experience pretty much shows the latter - it randomly and unexpectedly breaks network connections without remorse. I have made it a habit to remove that package from all my machines whenever I find it. Of course, I just installed a clean OS on my desktop. Suddenly firefox is starting up in "work offline" mode, even though I uncheck the box every time. I did some research and found, ha ha, it was my old nemesis NetworkManager getting in the way - it got reinstated with the new OS install. One quick "yum erase" and firefox was once again starting up actually connected to the internet, which I think is preferable, no?

- Matt

see comments

6 Jul 2009, 23:00:18 UTC
It's still pretty ugly out there - we've maxed out our bandwidth and mysql resources. We were able to squeeze out a few more cycles from the upload/scheduling servers this morning, but generally it's been quite impossible the past week or so. Clearly this is a result of increasing our user base, and the growing percentage of results being processed by cuda clients.

To solve this problem we have several options. There is non-zero but nevertheless slow progress on both the bandwidth and mysql fronts, so we're effectively stuck with what we got for now. We could go to single redundancy and keep the split rate the same. This will immediately cut our outgoing bandwidth in half, but people will, on average, get less work to chew on. We could also increase the resolution of chirp rates that we process, thus lengthening the time it takes to process a workunit. We may do both. From what Eric tells me compressing workunits only helps multibeam, and only by about 20%. Almost not worth considering, since that will get us 5-10 Mbits back, and we need something like 50.
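
The back-of-the-envelope arithmetic behind those options, as a rough sketch. The 20% compression figure, the 5-10 Mbit savings, and the target of about 50 Mbits come from the paragraph above. The multibeam traffic range of 25-50 Mbits is simply back-calculated from those figures, and treating all 100 Mbits of outgoing traffic as workunit downloads is a simplifying assumption, not a measurement.

```python
NEEDED_MBITS = 50.0  # rough savings target quoted above

def compression_savings(multibeam_mbits, ratio=0.20):
    """Mbits/sec saved if only multibeam workunits shrink by `ratio`."""
    return multibeam_mbits * ratio

def redundancy_savings(workunit_mbits, copies_now=2, copies_after=1):
    """Mbits/sec saved by sending each workunit to fewer hosts,
    with the split rate held constant."""
    return workunit_mbits * (1 - copies_after / copies_now)

low = compression_savings(25.0)      # about 5 Mbits back
high = compression_savings(50.0)     # about 10 Mbits back
halved = redundancy_savings(100.0)   # half of the assumed workunit traffic

for name, saved in [("compress (low)", low), ("compress (high)", high),
                    ("single redundancy", halved)]:
    verdict = "enough" if saved >= NEEDED_MBITS else "not enough"
    print(f"{name}: {saved:.0f} Mbits saved, {verdict}")
```

Under these assumptions compression alone falls well short of the target, while dropping to single redundancy roughly meets it, which matches the reasoning in the post.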

The other annoying thing is that on Friday/Saturday our raw data storage server got hung up while we were copying a file up from our archives. This caused splitting to slow down until we ran out of work to send. Not sure why this was the case, as I killed that transfer and everything worked fine after that. Even more mysterious is that, while bringing the same file up again this morning it choked our server once more. Why this one particular file is having such a random and extreme negative effect is beyond me at this point, but we're doing other tests, etc.

You know, I should point out that while I write these daily missives I tend to disagree with a lot of policies that end up getting enacted around here, which makes it difficult for me to defend one practice or another that might be discussed on these threads. Anyway, don't blame the messenger.

- Matt

see comments

2 Jul 2009, 18:24:11 UTC
Looks like we're back in another noisy period, or at least the bandwidth is maxed out enough that it's constraining both downloads and uploads. Let's just try to ride this storm out - it should hopefully clear up on its own.

Regarding the videos I linked to yesterday, there were plans to get the powerpoint images linked into the actual camera footage, but I guess that never panned out. That's fine. Or maybe that only happened on the live feed... Anyway, you get the basic gist of what we're trying to say from this footage. I was kind of rushing through my talk - how do you condense 10 years of effort into 20 minutes?

We were hoping to get the NTPCkr pages up this week but I'm finding that I really need to update the FAQ and other informational pages before making this live, lest we get flooded with common questions. Plus we have a little bit of feature creep, which is okay - better to rush and do these things now or they'll probably never get done.

- Matt

see comments

1 Jul 2009, 19:38:40 UTC
Sorry about the forums (and other web site features) being shut off for over a day. These Tuesday outages are really taking forever. I guess we've been really busy, which means our tables get ridiculously fragmented throughout the week. Plus I noticed our database is easily 50% larger than it was about 2 months ago. And the replica lost a couple of its CPUs recently (it's a used/donated system and the CPUs were known to be flakey from the start). Anyway, since the normal recovery procedure was so painful last week I opted to keep all web page database lookups offline until the replica was caught up. Once again, I'm sorry for the inconvenience.

To make up for that, how about some videos from the SETI@home 10 year anniversary? I'll link these to the home page soon enough. Consider this a sneak preview for those who read these threads. Let me know if there are problems downloading/viewing these mpegs.

Data recorder-wise... After all the effort to work with what we got, we're finally throwing in the towel on the current set of data drive enclosures. We have a plan B and plan C already in place - just a matter of deciding which one to enact. Meanwhile, I'm pulling old data off the archives at a pretty good clip - hopefully fast enough to keep up with demand.

Otherwise, I'm still working on NTPCkr and radar stuff. And I adjusted the stats scripts that generate the numbers on the server status page. The Astropulse numbers up until this morning reflected version 5 workunits/results, now they reflect version 5.05.

- Matt

see comments

29 Jun 2009, 22:16:48 UTC
Another wacky weekend. Sounds like we were sending out a bunch of short workunits, which strains our bandwidth resources. Plus uploads were clogged for a while. The server was too busy and dropping connections, so the uploads weren't even reaching the server. On Saturday morning I did some TCP tweaking and seemed to clear up that log jam for the time being.

This morning it came to my attention that we've been sending out workunits with the "application/x-troff-man" mime type. This was because files with numerical suffixes (like workunits) are assumed to be man pages. This may have been causing some blockages at firewalls. I changed the mime type to "text

The SERENDIP web page was updated for the first time in many years. There's a link on the front page about that.

We plan to get the public NTPCkr candidate lists on line this week, ready or not. Trying to squeeze a couple more features in at the last minute, but I'm sure there will be bugs to work out and more features to add later on.

- Matt

see comments

25 Jun 2009, 20:59:16 UTC
Fallout continues from the outage on Tuesday. Turns out the minor corruption in various MyISAM tables is messing up replication. Every so often a duplicate entry appears on the replica queue which is easy to remove but requires human intervention. This is causing the replica to fall further and further behind. I'm loath to give up on it, though, as that means being forced to point all queries, including non-essential ones, at the master. And that'll break everything.

We also had to fall back to using two download servers, but we did so using simple DNS round-robin load balancing. Obviously this wasn't working out so well. DNS rollout/caching is never balanced (we saw this several times before, especially during the feeder mod polarity issues a year or two ago). So this morning we fell all the way back to using "pound" - which forces exactly 50% of all incoming connections to go to the first server, and the rest to the second one. This immediately broke the current download log jam, though of course we're still maxed out bandwidth-wise as I write this paragraph.
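To make the difference concrete, here is a minimal sketch (not pound's actual algorithm, just a toy model) comparing strict per-connection alternation with DNS round-robin where each resolver caches a single answer. The backend names and the per-resolver client counts are made up for illustration.

```python
from collections import Counter
from itertools import cycle

# Hypothetical backend names, for illustration only.
BACKENDS = ["download-a", "download-b"]


def assign_strict(num_connections, backends=BACKENDS):
    """Hand each incoming connection to the next backend in turn."""
    rotation = cycle(backends)
    return Counter(next(rotation) for _ in range(num_connections))


def assign_by_cached_dns(clients_per_resolver, backends=BACKENDS):
    """Toy model of DNS round-robin: each resolver caches one answer,
    so every client behind that resolver lands on the same backend."""
    rotation = cycle(backends)
    load = Counter()
    for clients in clients_per_resolver:
        load[next(rotation)] += clients
    return load


strict = assign_strict(1000)
# Invented resolver sizes: one big ISP resolver dominates the traffic.
skewed = assign_by_cached_dns([700, 50, 200, 50])
print("strict alternation:", dict(strict))
print("cached DNS answers:", dict(skewed))
```

In this toy model the per-connection split is even by construction, while the DNS split depends entirely on how many clients sit behind each cached answer.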

Seems like there are a lot of frustrated people on these threads. There's no right or wrong way to feel about these outages. We're kind of a special case. At the core we're an academic project with no deadlines - normally nobody gets hurt if science is delayed a day or a month or a decade. On the other hand, we're forced to be "professional" since we're asking for various forms of support from many thousands of people, and you can't have that large a number of people involved without some sort of professional grade management and public relations. It's a daily puzzle marrying the two completely separate worlds.

- Matt

see comments

24 Jun 2009, 19:56:44 UTC
Despite efforts to reduce the outage time yesterday, the database was bloated enough (for various reasons) to take all day compressing/backing up. The replica wasn't even close to being done by the time I left the lab, and still wasn't done before I went to bed last night. That meant all queries had to be aimed at the master, including all the read-only stuff that usually hits the replica - stats collection scripts, result state count scripts, the daily credit multiplier calculation (which is rather expensive), and lots of annoying web scraping queries.

All those excess things pretty much killed us throughout the evening. The replica was finally available in the morning, albeit fairly far behind the master. Nevertheless I was able to start cleaning up the mess. However, two other problems were revealed.

First, going to one download server wasn't a good thing. It seems impossible to me that apache can't handle all the downloads on one system - especially given the abundance of free resources. It drops connections regardless of how much network/httpd.conf tweaking I do. So we fell back to using two download servers, and that immediately solved everything. Of course, we've been offline for 24 hours, so there's gonna be lots of traffic for a while making it hard to upload/download anything.

Second, there was minor corruption in the MyISAM tables in the mysql database. Not sure what caused that but given the database was clogged all night all bets are off. The most notable effect of this was some weird behavior in the forums. Some simple "repair table" commands found the problems and claim to have fixed them.

Anyway.. it's clear we still have much work to do cleaning up our current mysql situation. Sigh.

In better news, looks like me and Jeff are going to the OSCON 2009 in San Jose in July - the O'Reilly open source convention. Maybe we'll get some hot tips about improving the linux/apache/mysql/php performance around here. Tim O'Reilly himself helped hook us up with free passes (he's been nice to us over the years).

- Matt

see comments

23 Jun 2009, 23:09:29 UTC
Usual outage today (which happens every Tuesday for mysql database compression/backup). It went really long - I guess we've been busy inserting/deleting all last week. We went back to an older policy of doing simultaneous compression on both the master and replica, which should vastly speed up post-outage recovery. Until today we've been letting the compression commands (i.e. "alter table user type = innodb") pass from the master to replica via the usual channels, but they wouldn't happen in parallel (as the loooong queries had to complete successfully on the master before the replica would start processing them). This caused the replica to be as many as four hours behind when the project started up again in the afternoon. The benefit of doing it that way was less work/management and accidental updates/inserts during the outage wouldn't get lost. Going back to doing it in parallel, we have to stop the replica before we start and reset the master after we're done, thus increasing the chance of these lost queries, but so far we've had 0 such incidents during these weekly outages since we started using mysql years ago.

A weekly planned outage is usually a good time to take care of some offline chores. Today I cleaned up lots of unnecessary mounts in an effort to reduce our automounter maps as much as possible (so we don't have such a tangled web which can be quite painful when one server disappears). I also made vader the sole download server, thus freeing bane to be whatever we want - which will be useful to handle certain services temporarily as we go around upgrading the out-of-date operating systems on lots of these machines. I think vader can handle the load alone.

I hear the presentations from the 10th anniversary celebration have all been converted to mpegs. It's a few gigs worth of stuff on a computer down on campus. A flash drive containing all that will appear up here at our lab sometime in the near future. Or it may be hosted on an interim server. We shall see.

- Matt

see comments

22 Jun 2009, 20:53:56 UTC
It's fairly clear that the recent updates we made to the general mysql/state counts/splitter fold have vastly improved our recent weekend woes. There were still a couple dips here and there, but no wild swings like before.

Except this morning one particular query - from the scheduler - was clogging the works. We figured we'll just let it push through, i.e. let nature take its course. We assumed it was an expensive lookup, but after a couple hours of waiting I ran the same query on the replica and found there was only one (!) row in question. So what the heck is mysql doing? We killed the query and eventually the logjam cleared.

I'm finally scraping up enough space to pull a lot more work up from our archives, so Astropulse will be kicking in again, at least at some low level. This should also help reduce the demand on our limited resources since those workunits take longer to process, which means a lighter load on our database/download/upload/scheduling servers.

- Matt

see comments

18 Jun 2009, 22:36:40 UTC
Some things got lost in the server reboot chaos/mayhem yesterday. One being that results were not being correctly stored on disk, despite all diagnostics showing otherwise (the incoming traffic looked normal, the upload apache servers were responding with "200" status, all the BOINC backend queues were nice and low). However, after rebooting the upload server yesterday the result RAID partition failed to mount. Actually this is a known quantity - there's something odd about this particular RAID partition that requires human intervention after every reboot to get going. Well, that human intervention didn't happen. Oops. Anyway, this was ultimately discovered thanks to various complaints from various parties, and fixed. Hopefully not too much headache/annoyance out there as the backlog of failed results clears out and corrects itself.

The new splitter method is now in production - where we're getting counts from a regularly updated table rather than each splitter process making the same redundant query over and over again. This would seem like a job for triggers, and we may go that route, but we already had the programming/plumbing in place to make this table (i.e. the process that collects numbers for the server status page, which already displays those same counts) - so this was easier to implement. We'll see if we get fewer network dips over the next few days...

- Matt

see comments

17 Jun 2009, 20:16:11 UTC
I've been busy. Almost too much to write about, none of it all that interesting in the grand scheme of things, so I'll just stick to recent stuff.

Our main problem lately has been the mysql database. Given the increased number of users, and the lack of astropulse work (which tends to "slow things down"), the result table in the mysql database is under constant heavy attack. Over the course of a week this table gets severely fragmented, thus resulting in more disk i/o to do the same selects/updates. This has always been a problem, which is why we "compress" the database every tuesday. However, the increased use means a larger, more fragmented table, and it doesn't fit so easily into memory.

This is really a problem when the splitter comes along every ten minutes and checks to see if there's enough work available to send out (thus asking the question: should I bother generating more work). This is a simple count on the result table, but if we're in a "bad state" this count which normally takes a second could take literally hours, and stall all other queries, like the feeder, and therefore nobody can get any work. There are up to six splitters running at any given time, so multiply this problem by six.

We came up with several obvious solutions to this problem, all of which had non-obvious opposite results. Finally we had another thing to try, which was to make a tiny database table which contains these counts, and have a separate program that runs every so often do these counts and populate the proper table. This way instead of six splitters doing a count query every ten minutes, one program does a single count query every hour (and against the replica database). We made the necessary changes and fired it off yesterday after the outage.
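A minimal sketch of that idea, assuming a summary table and a threshold check that are both invented here for illustration. It uses Python's built-in sqlite3 as a stand-in for mysql and a single connection where the real setup counts against the replica; the table names, the column name, the state value 2 and the low-water mark are all assumptions, not the project's actual schema.

```python
import sqlite3


def refresh_counts(conn):
    """Run the expensive count once and store it in a tiny summary table.
    Meant to be run by one periodic job, not by every splitter."""
    (ready,) = conn.execute(
        "SELECT COUNT(*) FROM result WHERE server_state = 2"
    ).fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO state_counts (name, value) VALUES (?, ?)",
        ("ready_to_send", ready),
    )
    conn.commit()


def splitter_should_generate(conn, low_water_mark):
    """Cheap single-row lookup each splitter does instead of a full count."""
    (ready,) = conn.execute(
        "SELECT value FROM state_counts WHERE name = 'ready_to_send'"
    ).fetchone()
    return ready < low_water_mark


# Demo data: 30 results ready to send, 70 in some other state.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (id INTEGER PRIMARY KEY, server_state INTEGER)")
conn.execute("CREATE TABLE state_counts (name TEXT PRIMARY KEY, value INTEGER)")
conn.executemany(
    "INSERT INTO result (server_state) VALUES (?)", [(2,)] * 30 + [(4,)] * 70
)
refresh_counts(conn)
print(splitter_should_generate(conn, low_water_mark=100))
```

The trade-off is staleness: the splitters act on a count that may be up to one refresh interval old, which is fine for a "do we need more work" decision.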

Of course it took forever to recover from the outage. When I checked in again at midnight last night I found the splitters finally got the call to generate more work.. and were failing on science database inserts. I figured this was some kind of compile problem, so I fell back to the previous splitter version... but that one was failing reading the raw data files! Then I realized we were in a spate of raw data files that were deemed "questionable" so this wasn't a surprise. I let it go as it was late.

As expected, nature took its course and a couple hours later the splitter finally found files it could read and got to work. That is, until our main /home account server crashed! When that happens, it kills *everything*.

Jeff got in early and was already recovering that system before I noticed. He pretty much had it booted up just when I arrived. However, all the systems were hanging on various other systems due to our web of cross-automounts. I had to reboot 5 or 6 of them and do all the cleanup following that. In one lucky case I was able to clean up the automounter maps without having to reboot.

So we're recovering from all that now. Hopefully we can figure out the new splitter problems and get that working as well or else we'll start hitting those bad mysql periods really soon.

- Matt

see comments

11 Jun 2009, 21:52:12 UTC
Spent the morning clearing out my mail spool - something that could easily eat up a full day if I let it. It's amazing how these "this will only take 5 minutes, tops" tasks add up, especially when there are about 100 of them.

Bob found the mysql replica has been falling behind a bit more than he thought it should, and after some poking around I found iptables getting in the way. So I did some reconfiguration on that system, rebooted it, and now let's see if it is operating any faster... This wasn't the crux of our mysql woes, but it may help a little bit (less chance the stats queries will rely on the master if the replica is always caught up). Actually as I write this I see we're in another difficult period. Eric was actually just up here and suggested a workaround for one of the queries that has been giving us the most headaches lately. We might implement that in the near future. We also should try throwing some of this new hardware at the problem (if we could ever get it working).

The dust is settling after the anniversary a bit - still haven't gotten any video from the students putting it all together. Dan, having spent some time in Arecibo recently, has new insight about the radar problems we've been having - so I may get yet another code rewrite on my plate in the near future. Hopefully this will be the final revision that will actually get completed and be used to clean up a huge backlog of dirty data (waiting to be processed). Jeff and I hope to also get some NTPCkr far enough along to present something to the public. I know I've been saying that a while.

- Matt

see comments

10 Jun 2009, 22:12:33 UTC
Playing around installing the new Fedora Core on my desktop today. So far so good. It seems any time anybody in any context mentions a specific flavor of linux this inspires discussion, usually in an incredulous tone, about why in god's name would you even consider using version x instead of version y, etc. I understand the pros and cons, and we're not going to change anytime soon, if ever. Personally I'm waiting for the day when operating systems disappear and we can all get back to work.

Still haven't gotten any of the Intel systems up and running for various reasons. I'm abandoning all of them for now. Very frustrating - every time I solve one problem another takes its place.

And the inability to collect data at Arecibo continues - the problem has been narrowed down to the (very old) EDT card working on a newer OS. The good folks at EDT are working on it (even though they don't even sell this card anymore, I don't think...).

- Matt

see comments

9 Jun 2009, 22:27:28 UTC
Well I employed my database code adjustments yesterday afternoon... and they seemed to have had a decidedly opposite effect than what we expected. So I reverted them back last night. Back to the drawing board on that front - I think we'll basically move from understanding the problem to eventually adding more hardware so it isn't a problem.

The key to that is getting hardware to work. Eric figured out the issues we were having on one of the newer Intel servers (the RAID controller card had to have a jumper moved around, even though I checked the jumpers already and they matched a similar card in a similar system that is working just fine). Of course, it's a hardware RAID controller, and it won't let me do JBOD, so I was forced to make 8 individual RAID groups, each containing one drive. This is annoying enough, but the RAID bios contains primitive enough mouse drivers that each step of pointing the mouse and clicking on the appropriate button took anywhere from 5 to 60 seconds. So it took me about 90 minutes to configure the RAID. Of course, we could have just stuck with using hardware RAID but for benchmark purposes we're comparing this system to one with similar software RAID. So there ya go.

Our BOINC server - one system that handles the website, all the alpha project stuff, etc. - has been having more and more problems as of late, all resulting in the CPU load spiralling out of control. We're in the process of getting another one of these new Intel servers up and running to replace this older server. Of course, we're hitting all kinds of other problems trying to boot the thing. More on that tomorrow if it's still offline.

Downloading FC11 today. All the mirrors are jammed.

And of course we had our weekly outage. No big developments there - Bob took care of all that. He did notice during our weekly science database backup that we had some corrupt database pages. This may be because of something else he discovered - the disk space made available for Astropulse had filled up sooner than expected. So he added more disk space to those tables.

- Matt

see comments

8 Jun 2009, 21:31:03 UTC
Dan and company are wrapping up their work at Arecibo and heading home today (I think). It was a painful weekend trying to get our data recorder working again (and installing the new SERENDIP V data recorder) but all is well, more or less. We even did some observations of the crab nebula (and its known pulse) which Josh then found in the data using Astropulse, providing a good end-to-end test. We'll send workunits using that data once we get that raw data up here. We ultimately found our SATA drive enclosures were a major part of the headache, and we're planning to replace those with USB enclosures... probably.

It was a painful weekend network-wise - the increased active user load (mixed with the lack of long Astropulse workunits to send out) means a lot more activity on the result table in the database, which means periods of mysql choking. We're adjusting some code to do "dirty reads" which may help conserve resources. For example, the count of the result table to determine the current size of the ready-to-send queue doesn't have to be 100% accurate, so locking the table to do such a query is overkill. We'll see if that works, or helps.
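For reference, a "dirty read" in mysql terms usually means running the query at the READ UNCOMMITTED isolation level. The sketch below shows how that could be requested through a Python DB-API cursor (the "%s" placeholder style is what the common mysql drivers use). It is only an illustration: the post doesn't say how the adjusted code does it, and the table name, column name and state value are assumptions. A recording stand-in cursor is used so the example runs without a database.

```python
READY_TO_SEND = 2  # assumed state value for unsent results; not given in the post


def approximate_ready_count(cursor):
    """Count ready-to-send results without insisting on a consistent view.

    READ UNCOMMITTED lets the count see uncommitted rows, so the number
    can be slightly off, which is acceptable for a queue-size estimate.
    """
    cursor.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")
    cursor.execute(
        "SELECT COUNT(*) FROM result WHERE server_state = %s", (READY_TO_SEND,)
    )
    (count,) = cursor.fetchone()
    return count


class RecordingCursor:
    """Stand-in cursor: records statements and returns a canned count."""

    def __init__(self, fake_count):
        self.statements = []
        self._fake_count = fake_count

    def execute(self, sql, params=()):
        self.statements.append(sql)

    def fetchone(self):
        return (self._fake_count,)


cursor = RecordingCursor(fake_count=12345)
count = approximate_ready_count(cursor)
print(count, cursor.statements)
```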

We hope to replace these database servers, or at least the mysql replica, with one of these new Intel servers. They have tons of CPUs and gobs of memory, but the disk controller doesn't work. Actually, that's unclear - we replaced the card with one we know works, and that wasn't behaving either. Until we can figure that out we're stuck with what we got.

- Matt

see comments

4 Jun 2009, 22:27:11 UTC
A day full of troubleshooting. Still trying to get one of these Intel servers up - everything in the system works except the hardware RAID. We got the new drives in the mail today, but still can't get into the RAID bios. We do have a card we know works in another Intel server which we'll swap in sometime but we're tabling this project in general for now...

That's because Dan and a bunch of the CASPAR students are down at Arecibo to install their new SERENDIP V data recorder. They'd like to test it while they are there, of course, which means comparing its functionality with our recorder, as well as do some observations of the Crab Nebula to run through Astropulse, etc. What does that mean for us? That means we really need to get our SETI@home data recorder SATA drives/enclosures working. They have been off line for well over a month now, but now that we have our own people with immediate access to the machine it's speeding up the debugging process. Still, there are plenty of mysteries that seem impossible to figure out. Jeff's frustration with SATA/USB/drivers/linux is palpable coming all the way from the other side of the room. In fact I just heard him tell the gang down there to install a new OS on the system (the current OS is ancient, and quite possibly the source of our woes).

Meanwhile Jeff and I continue to tinker with NTPCkr stuff. I've been trying to optimize the NTPCkr page, finding that it spends most of its time parsing the XML of the zillion multiplets (groups of similar signals) within each candidate. So at this late hour we may change how we divide the multiplets up into "barycentric" (tight in frequency space) and "non-barycentric" and just score them according to frequency tightness. This may not only yield far fewer multiplets, but they may be ranked better as far as how interesting they are. There's gonna be more tweaking/testing on that front.

- Matt

see comments

3 Jun 2009, 22:00:49 UTC
Today I started messing with one of the new Intel servers. We're still waiting on drives to ship before doing much with it, but at least it boots off of DVD. There are some other kinks to work out as well. I think we're going to call it "mork." We hope to at least replace sidious with this machine, and if we get the other servers working, then replace others. In general we always wish to reduce the hardware we need to maintain - i.e. have fewer machines doing more stuff. However, we'd like to do so without increasing our single points of failure (redundancy is nice). And given we never buy anything we have to generally stick to a "work with what you got" philosophy.

A small note about the front page "weekly outage" status - that's a line at the top of our project_news data file which is commented out. Every Tuesday morning I uncomment it (if I remember to) so people can see it, and hopefully later that day (if I remember to) I comment it back into oblivion. Sometimes I forget, or recovery is slow enough that I keep that warning there so people can get some idea why they're having trouble connecting. In any case, it's human controlled and therefore prone to error.

- Matt

see comments

2 Jun 2009, 23:29:04 UTC
Had the weekly outage today - the normal database/compression/cleanup stuff was by the book, however we took the time to address some other hardware issues. First and foremost, we replaced the failed drive on thumper. I was griping about this yesterday and how this means we'll have to reboot, which means we're forced to resync the root RAID devices. Well, that's happening now. I also upgraded the kernel on worf. That sort of went well - except upon coming back on line one of the spare drives was marked as failed. We're dealing with that now.

Coming out of these weekly outages has gotten painful given our increased rate of traffic lately, and these web queries that continue to clobber us. I try to aim these at the replica, which helps, but right after outages the replica is effectively offline for many hours as it is still busy recreating the giant tables. So I have to temporarily aim those web queries at the master, which makes recovery even slower. We gotta figure this all out, come up with a better weekly backup/reorg policy, or get that new replica server up and running sooner than later. We did order drives for it - should be here later in the week.

- Matt

see comments

1 Jun 2009, 22:27:24 UTC
Lots to talk about today. Let's start with the weekend: we had the usual drill of running out of raw data files for the Astropulse splitters to chew on. Due to file transfer speeds up from our off-site archival storage (NERSC) we can only put a few files up a day, which Astropulse goes through in no time. This isn't a big deal, but in order to regulate this a little better we adjusted the weights of the two applications so that the feeder gives 97% of its slots to multibeam, and 3% to Astropulse. This shouldn't change the current regular behavior, but will help smooth out the peak periods I think. There's still some BOINC logic changes that have to happen to keep Astropulse from taking over too many systems.
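As a rough illustration of what a 97/3 weighting means for the feeder, here is a sketch that splits a fixed number of slots in proportion to per-application weights. This is not the BOINC feeder's actual logic; the slot count of 100 and the largest-remainder rounding are assumptions made for the example.

```python
def allocate_slots(total_slots, weights):
    """Split slots in proportion to weights using the largest remainder
    method, so the per-app counts always add up to total_slots."""
    total_weight = sum(weights.values())
    exact = {app: total_slots * w / total_weight for app, w in weights.items()}
    slots = {app: int(share) for app, share in exact.items()}
    leftover = total_slots - sum(slots.values())
    # Give any slots lost to rounding to the apps with the largest remainders.
    by_remainder = sorted(exact, key=lambda app: exact[app] - slots[app], reverse=True)
    for app in by_remainder[:leftover]:
        slots[app] += 1
    return slots


# Hypothetical feeder with 100 slots and the weights from the post.
slots = allocate_slots(100, {"multibeam": 97, "astropulse": 3})
print(slots)
```

With these weights a 100-slot feeder would hold 97 multibeam results and 3 Astropulse results at a time, whatever is actually available to fill them.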

Some good news: Intel once again came through with a slew of donations - five servers to be exact. These are mostly test/used systems so three require some TLC to bring on line (a couple of those may be used as parts to boost up one of our current compute servers). However one of the remaining two will get our attention right away and become the new mysql replica server. I haven't confirmed the specs, but I've been told they each contain four 6-core CPUs and 64GB RAM. Intel would like us to do some benchmark tests right away, so expect a new server (or two) in the fold in the coming weeks. I guess I need to update the hardware donation page...

Of course, the release of Fedora Core 11 has slipped a couple times, but I hope to start a major wave of OS upgrades (or installs) next week as well.

The other big project is dealing with thumper - our science database server. We're replacing a bad drive tomorrow, which means rebooting it, which in turn means it will go through some painful RAID resync upon coming back up (due to its drive naming issues). We know we can fix this resync problem by reinstalling the OS, which we'll do when FC11 is out and we've tested a similar install on bambi (the secondary science database server) first. Once that's working, we'd like to re-RAID the data drives (from RAID5 to RAID10) to vastly speed up throughput (necessary for NTPCkr performance). But to do that we need to get all the raw data off first. And to do that we need to first install a kernel update on worf (the NAS from Overland Storage which we are beta testing) so we can safely move all our raw data there. Oy. So many ducks to get in a row. Anyway.. one step at a time...
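The chain of prerequisites above is a small dependency graph, and one valid order of operations can be worked out mechanically with a topological sort. The sketch below uses Python's standard graphlib module (Python 3.9 or later); the task names are paraphrased from the paragraph above, and treating them as a strict graph is a simplification.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
prerequisites = {
    "test FC11 install on bambi": {"FC11 released"},
    "reinstall OS on thumper": {"test FC11 install on bambi"},
    "move raw data to worf": {"kernel update on worf"},
    "re-RAID thumper data drives": {
        "reinstall OS on thumper",
        "move raw data to worf",
    },
}

# static_order() yields every task after all of its prerequisites.
order = list(TopologicalSorter(prerequisites).static_order())
for step, task in enumerate(order, start=1):
    print(step, task)
```

Several orders are valid (the worf work and the FC11 work are independent of each other), but the re-RAID always comes last.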

- Matt

see comments

28 May 2009, 20:37:47 UTC
Question: so what's up with the near time persistency checker (NTPCkr)? If the live web streaming were working last Thursday you would have seen the tail end of my and Jeff's talk where Jeff went into a little detail about the current status of things. Basically, we have some screws to tighten here and there, but the general thing is working. We're up against some database throughput issues which we hope to fix sooner than later, plus we are still tweaking the scoring algorithms. We hope to have a public page available soon where you can peer into the progress of things. Until then, here's version 0.0.1 of the NTPCkr FAQ.

It's becoming clearer that we need to adjust the weight of our applications so that we send out more SETI@home/multibeam workunits. We have things effectively set such that Astropulse work gets sent out as soon as it becomes available. This was partly to expedite getting as many Astropulse results back as possible (in the interest of getting that science done) but this is getting less and less possible given our resources and current participant demands. Things on this front may shift in the near future.

We've been near our bandwidth limit for the past day since unclogging the mysql database, providing more data for Astropulse to split, and our active user base going up about 15% over the past couple of weeks. This may account for recent upload/download difficulty. It looks like it's getting better, at least for the moment.

- Matt

see comments

27 May 2009, 22:07:00 UTC
Had a few more bandwidth woes early in the morning. Turns out this was due to the replica recovery yesterday - a lot of long queries were still being aimed at the master. I turned the replica on, which immediately helped (though it is about 10-15 hours behind and slowly catching up so some stats may seem a little screwy).

Before we figured that out Jeff and I were a bit stumped as we thought this had to do with Astropulse work availability. In the process of looking for clues we discovered that for a long time Astropulse had an extra defunct project sitting in our applications table. This meant the feeder was saving a third of its slots for a project that will never have any work. I fixed that. I don't think that was causing any major problems lately, but it sure wouldn't help them, either.

This morning I dusted off some code - a program that would fix our doubly-precessed signals. I was hoping some changes Eric had since made to the (incredibly arcane) database code would have fixed some long standing problems, but they didn't. This isn't Eric's fault - it's some garbage in the esql libraries that won't let me do updates to rows with user-defined types. This normally isn't a problem as we can insert signals just fine. Updating them, however, is the problem, at least using esql. So I'll shelve this project once again - in the meantime we have a patch of signals that we cannot use to find candidates as their coordinates are slightly wrong.

Oh yeah - people were asking: I'm not sure when video of our anniversary talks will become available. The students involved in the filming/editing are also working on SERENDIP V, and they're in a mad scramble to get that ready for deployment down at Arecibo next week.

- Matt

see comments

26 May 2009, 22:32:21 UTC
We're back after the long holiday/anniversary weekend. Phew! That was fun, and now we can get back to work on some outstanding projects.

First off it should be noted the weekend had some issues. For some reason the "forum preferences" table broke again, which wouldn't be that big a deal, except this messes up replication. I kicked it every few hours over the past couple of days which didn't help very much. So we're reloading the replica from scratch yet again. This'll take some time, so the recovery from today's regular outage may be particularly painful.

Meanwhile a random drive on thumper failed. No surprise - there are 48 drives in that thing. It's RAIDed, we're getting a spare from Sun, no big deal. Still, this will exercise our problems with rebooting thumper at this point - so this bumps up in priority our need to reinstall the OS on the thing.

I'm still trying to move data from our archives up here for Astropulse as fast as I can. We have over 100 files yet to transfer. I hope we get the data recorder back in working order before we use up all these files.

- Matt

see comments

20 May 2009, 21:47:21 UTC
Another short note just to check in. Good news is that I finally was able to get more than just 1 or 2 files up from HPSS for Astropulse to chew on. In fact, I got 4 files! Well, that's still not very much, but more are on the way. We'll really have to get crackin' on the data recorder issues once this week is through.

It also seems that we have a continuing problem with these difficult web queries clobbering us from time to time. I put a "hack" in place yesterday that I thought was helping, but Dave noted our problem may be from persistent mysql connections. Since php is embedded in apache, whenever it starts up it opens a database connection and keeps it open through multiple page requests. While we put explicit code to use the replica on the result pages, apparently php won't flip from master to replica (or vice versa) during these persistent connections, so we need better logic to handle all that. In the meantime it seems like we're in another ugly long query phase clogging the pipes. Still very annoying.
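One possible shape for that "better logic", sketched in Python rather than php and not taken from the actual web code: keep one persistent handle per server and choose between them for each query, instead of binding a worker to whichever server it connected to first. The class, its method and the string handles are all hypothetical.

```python
class QueryRouter:
    """Hold one persistent handle per server and pick one per query."""

    def __init__(self, master, replica):
        self._handles = {"master": master, "replica": replica}
        # Flip to False while the replica is rebuilding or too far behind.
        self.replica_available = True

    def handle_for(self, read_only):
        """Read-only queries go to the replica when it is usable;
        everything else (and every fallback) goes to the master."""
        if read_only and self.replica_available:
            return self._handles["replica"]
        return self._handles["master"]


# Strings stand in for real connection objects in this demo.
router = QueryRouter(master="master-conn", replica="replica-conn")
print(router.handle_for(read_only=True))    # result page lookups
print(router.handle_for(read_only=False))   # anything that writes

router.replica_available = False            # e.g. right after a weekly outage
print(router.handle_for(read_only=True))
```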

This is my last tech news item until next week, probably. Will be busy tomorrow with the big event and all.

- Matt

see comments

19 May 2009, 23:29:17 UTC
It's Tuesday, that means outage time (for database backup/compression/etc.). Today's outage was by the book, and we're recovering from that now. We're still sloooowly getting more data back up here from our archives at NERSC, though the Astropulse splitters are tearing through those pretty fast. We were also having continuing issues with loooong queries on the mysql master database. We thought we fixed that yesterday. Looks like we didn't. Dave and I have been poking around with that for a while.

Other than that, chipping away on NTPCkr stuff for Jeff, getting things in order for the big event on Thursday. Wow - I got exactly 48 hours from now to get my little talk straight.

- Matt

see comments

18 May 2009, 23:13:38 UTC
Happy Anniversary! Though we're officially celebrating later this week it was actually ten years ago yesterday that we launched this thing. We didn't know what to expect, and our ftp server was immediately clobbered by thousands of people simultaneously attempting to download the client. I remember a blur of chaos as we procured other ftp servers (and a remote mirror) that day. I still joke that we've been trying to catch up ever since.

The general workunit/result flow was a little weird lately. First, we ran out of data for Astropulse to process. The splitters kinda burned through a lot of these files - I'm wondering if there's something else going on - or maybe just data quality issues. We also updated some web code which broke our (temporary) master/replica code when looking up results via the web, so the database got clobbered again for a while. This morning Dave re-enacted these changes to use the replica and checked the code in. And once again we had a couple weird mounting issues - bruno was hung on bambi, lando was hung on thumper. This sudden rash of mounting problems is getting annoying if not worrisome. We had to reboot both bruno and lando, which I did this morning. I'm also pulling up some data from Arecibo to get Astropulse rolling again at least from time to time.

- Matt

see comments

14 May 2009, 20:40:07 UTC
We are quite preoccupied with anniversary stuff so we've been doing the bare minimum amount of systems administration to get by until after the event. Still, it should be mentioned we continue to have SATA/driver issues on our data recorder at Arecibo, and haven't collected new data for about a month now. While we have a pile of data yet to crunch readily available on disk, I started pulling up unanalyzed data from our offsite archives.

Before doing so I went through the whole data inventory rigamarole this morning. We have 1787 raw multi-beam data files (mostly all 50GB in size) archived, of which 338 haven't been split at all. However, a portion of these files were recorded before 2008, i.e. before we had a hardware radar blanking signal embedded in the data. So until we get my software radar blanker working (a project postponed until post-anniversary) we can't chew on these files without dealing with major radio frequency interference. This isn't a major problem: 1225 of the 1787 archived data files are from 2008 or later, and of these 249 have yet to be split. So we got plenty of numbers to crunch until we get the data recorder working again.
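For anyone who wants to check the inventory numbers, here is the arithmetic as a small Python sketch. The file counts come straight from the paragraph above; the 50 GB per file figure is approximate ("mostly all 50GB"), and the variable names are just for illustration.

```python
FILE_SIZE_GB = 50  # approximate: most, not all, raw files are this size

archived_total = 1787    # all archived multi-beam files
unsplit_total = 338      # archived files never split
archived_2008_on = 1225  # archived files recorded in 2008 or later
unsplit_2008_on = 249    # of those, files never split

# Files recorded before the hardware radar blanking signal existed.
pre_2008_total = archived_total - archived_2008_on
pre_2008_unsplit = unsplit_total - unsplit_2008_on

# Rough size of the backlog that can be split without the software blanker.
usable_backlog_tb = unsplit_2008_on * FILE_SIZE_GB / 1000

print(pre_2008_total, pre_2008_unsplit, usable_backlog_tb)  # 562 89 12.45
```

So roughly 89 of the unsplit files are waiting on the software radar blanker, and about 12 TB of unsplit data is usable right away.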

- Matt

see comments

13 May 2009, 19:24:37 UTC
No real server news today, but I'll respond to a couple things mentioned in the previous thread.

I said we have about 150 CPUs in our server fold. Of course, looking at the list of machines on the server status page you see about 40. First, this isn't a complete list - it only contains public facing or critical servers. We have a lot of other systems that are doing tangential tasks or behind-the-scenes stuff. We also have several appliances (like the NAS's) which contain multiple CPUs as well. Still, this number may be inflated a bit due to hyperthreading on some servers. I think the actual number of physical CPUs is still above 100 though. Plus, as I was calculating this just now I found that two of the CPUs on sidious have apparently died. This is no surprise - it's a used/experimental machine and had CPU issues since day one, which is why it is the replica mysql server and not the master.

The talk (which happens next week) should be viewable over the net after it happens. I don't think we're going to do live streaming or anything like that. We're going to meet and discuss early next week what our options are.

- Matt

see comments

12 May 2009, 21:32:39 UTC
Today's Tuesday, which means regular outage day for us. The project is already coming back to life as I write this sentence, though Bob still has some work to do to sync the beta replica database up again (a process which failed last week due to one of the tables unexpectedly needing repair).

I got a funny call out of the blue yesterday from a person who works at a music production facility in LA. They do a lot of CPU intensive work there, and were surprised to find a bunch of BOINC clients running on their systems slowing things down. I'm guessing a former employee (or current employee afraid to speak up) planted them on as many CPUs as possible. Anyway, I'm not sure how he got my number, and even less why he chose to call me of all people, especially since the clients were all apparently running Einstein@home. Nevertheless, I gave him some uninstall tips, and that was that.

Still working on the talk, which is slowly coming into shape. I'm trying to squeeze in 10 years' worth of digressions about work creation/distribution, databases, web sites, and networks, as well as back-end server war stories into about 20 minutes. It's been a trip down memory lane, and we're kind of kicking ourselves for not taking more pictures of our puny little setup back in the day. I can't believe we got this thing off the ground with 3 Sun Ultra 10's (all doubling as desktops for me, Jeff, and Dan) and 2 IPC's. Our current server closet contains about 150 CPUs, 100 TB of disk, and 150 GB of RAM.

- Matt

see comments

11 May 2009, 21:08:02 UTC
Over the weekend we hit a bit of a traffic "depression" - in other words we were sending out far less work than we should and so our outgoing bandwidth dropped. Why? Well, due to a single garbled astropulse file the astropulse assimilator was bailing, and so the queue was growing, and so workunits were staying on disk longer, and so we ran out of workunit storage, and so the splitters revved down. Eric kicked the assimilator in question yesterday, and we caught up more or less.

This morning I found bruno (the upload/BOINC general admin server) was having mounting problems similar to the ones thumper was having at the end of last week - it was hanging on a mount to anakin (the scheduling server) of all things. This didn't affect anything major, but the server status page was stuck since yesterday. Anyway this time I cut to the chase and rebooted the system, which helped, but the drive arrays are configured in such a way that requires human intervention on boot to get fully working again. No big deal, but some result uploads were failing for a minute or two there.

Jeff and I practiced the first rev of our anniversary talk this morning. We need to trim it down by 15 minutes. I guess there's a lot to talk about (nothing regular readers of these threads don't already know).

- Matt

see comments

7 May 2009, 22:03:43 UTC
I came in this morning and went about my normal chores, including checking the raw data pipeline. We have automated scripts to do most of the work, including one called "splitter_janitor" which finds files ready for deletion, takes some action, and mails me/Jeff the results. Well, I didn't get any mail. So I looked at the system in question, thumper, and found the script was hung. Some poking around led me to discover that thumper was having trouble mounting directories on server ewen (Eric's hydrogen study server, which actually crashed yesterday but came up again just fine). Well, other machines were mounting ewen just fine. So what gives?
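A script like this can hang forever because a stat on a wedged NFS mount may simply never return. One common way to guard against that, sketched below in Python, is to poke the mount from a child process and give up after a timeout. This is not how splitter_janitor actually works; the function name `mount_responds` and the 5 second default are assumptions made up for the example.

```python
import subprocess
import sys


def mount_responds(path, timeout_s=5.0):
    """Return True if a stat() of `path` completes within `timeout_s` seconds."""
    # Do the stat in a child process: a stat on a hung NFS mount can block
    # indefinitely, and the monitoring script itself should not hang with it.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck on a hard NFS mount may not die
        # until the server answers, so we deliberately do not wait for it.
        child.kill()
        return False
```

A janitor-style script could call this for each remote directory first and mail a warning instead of silently hanging.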

Sometimes the automounter needs a kick, so I restarted that. No dice. I restarted nfs/nfslock to no avail either. Hunh. Around this time I noticed the primary master science database, also on thumper, had gotten wedged. Great. Eric/Jeff were brought into the fold but nobody had any great ideas as to what was wrong and therefore how to fix it. We started killing processes one by one, including the database engine itself, which could only be stopped with a kill -9 (which isn't optimal, but informix has always been perfect recovering from such ugly shutdowns). With an empty process queue we still had mounting problems.

Normally one of the first things to try is a reboot as this is easy and usually works, but we were loath to reboot thumper since (as you might remember if you are an avid reader of these threads) its root RAID has some funkiness where, even if it's healthy, it will show up as degraded (and require a long resync) upon reboot. But we had no choice at this point, so we rebooted it, and sure enough the system booted just fine (and we could mount everything again). That's the good news, the bad news is that our fears were realized, and we're in the middle of another long painful root drive resync. The system is functional in the meantime, so really it's not that big a deal - it's just annoying, and perhaps a bit scary.

Well, that ate up my whole morning. Then moved onto my Powerpoint/PHP tasks until Bob noticed the science database load was strangely low. This led to more snooping around, finally finding that our system vader (where the assimilators run) was having trouble mounting bruno's disks (where the result files are). So we weren't inserting results, which explains the bored science database. I rebooted vader, which is much easier than thumper, and that broke another dam.

- Matt

see comments

6 May 2009, 20:39:57 UTC
We recovered fairly well after the outage, despite all the minor annoyances as of late. We still have to resync the beta database on the replica - turns out there was corruption in those tables that didn't get noticed until after we brought everything up again. Well, not so much corruption as a bit somewhere that told mysql to not bother dumping the beta database because it thinks there's corruption. So when I tried to rebuild the replica with the dump (when the beta project was back on line) and found the dump was zero length, I issued the proper repair statement and mysql responded "0 errors" but then was able to dump everything. Whatever. It's fine for now - and it is just the beta database, so we'll clean that up next week.
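The zero-length dump would have been cheap to catch before attempting the rebuild. A minimal sketch of that kind of sanity check in Python follows; the helper name `dump_looks_usable` and the `min_bytes` threshold are hypothetical, and a size check says nothing about whether the dump's contents are actually valid.

```python
import os


def dump_looks_usable(path, min_bytes=1):
    """Return True if the dump file exists and holds at least `min_bytes`."""
    try:
        return os.path.getsize(path) >= min_bytes
    except OSError:
        # Missing or unreadable file: treat it the same as an empty dump.
        return False
```

Running a check like this right after the dump step would flag the problem while the project is still down for the outage, rather than after everything is back up.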

As for fears of running out of data while we're waiting for the data recorder to get fixed: we still have plenty on line, and a few drives on the shelf full of data sent up from Arecibo as part of the last shipment they made before the SATA card went kaput. Plus we have a bunch (how much? not sure, but a lot) of data in our archives at HPSS which we haven't processed yet. So we're good for now, and maybe even a month or two.

As for those network graphs talked about in the previous thread: that particular graph is for a router down on campus which handles the tunneled traffic to/from our lab and destined for our router at the PAIX (where we hook up with our ISP bandwidth). So yeah, green shows "incoming" from the lab, which is what we see as "outgoing" i.e. downloads. And vice versa for the uploads. Of course, there's a tiny tiny bit of noise due to scheduler traffic which also goes over that link.

- Matt

see comments

5 May 2009, 21:42:36 UTC
There were indeed some weird lingering problems with the mysql database from this weekend. Some tables had bungled indexes. We think we cleaned that up during the usual weekly maintenance outage today. We also needed to regenerate the replica mysql database from scratch, so that'll be behind until later this evening (or tomorrow). The result pages may be out of whack until then. In fact, I just turned them off for now as they were eating too many resources.

By the way, we're still unable to collect data at Arecibo due to problems with the data recorder being unable to see the drives. Turns out the card we bought, which was an exact replacement of the previous card, is having driver issues. Why? Well, unbeknownst to us we weren't actually using the previous card - we were using a totally different card (i.e. one we didn't buy) this whole time. It's a mystery why the original card was swapped out and replaced with this third one, but we're kinda back at square one again. Sigh. Due to time zone/scheduling conflicts each iteration on this front takes about 24 hours (the staff at Arecibo is providing support for free, after all).

- Matt

see comments

4 May 2009, 22:27:44 UTC
The weekend was a little bumpy. The mysql database was showing signs of trouble Saturday. Eric was the only one paying attention at the time, so he restarted the database. Everything seemed fine, except he made some posts on the forum and then they all disappeared. This is still a mystery (the cause, the exact effects, and whether it is still a problem). Eric is trying to recreate and diagnose.

But we were still getting web scraped to death. I played a gig Saturday night, getting home around 1:30am. I noticed the lingering problems at that point and blocked a couple more IP addresses and kicked off the long queries. Things more or less recovered on their own after that (except for the validators, which I fixed in the morning).

So this is getting to be a regular problem, which I partially addressed this morning. I dug through the php code and quickly figured out how to get a couple of the offensive long queries to point at the replica database. This seemed to be quite helpful, but the replica is still behind due to the other problems mentioned above. So people are seeing about a day in the past when checking out their current results on our web site. It's confusing, but not the worst tragedy in the world, and it's a problem that will correct itself shortly. It'll all be caught up after the outage tomorrow.

To keep things interesting, we seem to be in the middle of a spate of weird workunits - ones where the data isn't kosher and therefore returning quickly. Eric is also on top of that one. In the meantime, our outgoing traffic is a bit pegged.

Less than three weeks until the anniversary. I'm getting my powerpoint together now. And I couldn't think of a worthy thread title theme this month, so how about apt titles for a change?

- Matt

see comments

30 Apr 2009, 21:21:40 UTC
We're officially three weeks away from the 10th anniversary celebration - I think Dave just put the official announcement of such on the front page. Jeff and I are bashing out all the details we can beforehand. I guess I will finally learn how to use powerpoint (at least the openoffice version).

So there were some splitters stuck after the outage so we ran out of work to send Tuesday night, but that got kicked back in line Wednesday morning. I wasn't involved with the outage and didn't notice until everything was better - I was taking the day off entertaining visiting family (which also explains the spotty nature of these current tech news items - sorry).

There are still lingering problems trying to record data at Arecibo. We sent them a new SATA card, which worked, but even though the part # was the same as the old card's, the connectors were different (I instead of L). Jeez. So we sent them the right cables. Now the drivers won't load - the system recognizes the card, but not the drive. What a headache.

Oh yeah. This is the last tech news item for the month, so after much anticipation (not) the thread title theme this month is revealed: names of cats I lived with throughout my life, some adored, some not so much. By far the best kitty ever was Normal (he and his littermates had Geek Love references as names). Our current cats (i.e. still alive and/or hanging around our house) are Olga (Alexei's sister) and Fner (Fnerina's feral half brother). Too bad our dog Laszlo - a purebred Doberman we recently rescued as an adult from the pound - still requires much effort in the ways of socialization, including reducing his desire to hunt down and eat smaller animals. We're working on it.

see comments

28 Apr 2009, 22:35:46 UTC
Busy busy busy, though not many fun adventures to report in the server realm. The weekend was fairly smooth, as was the regular database backup outage today. Bob went to the MySQL conference last week, so yesterday we discussed some plans for mysql upgrades, tweaks, etc. which we won't implement until the end of next month (i.e. after the anniversary). Of course, there was discussion about the Oracle buyout of Sun, and how that will affect the future of mysql. Apparently panic is unwarranted and we were reminded that the innodb engine, which is mostly what we use within mysql, was already partly an Oracle project. Anyway we shall see.

Jeff and I are continuing to spend our time doing what we can to get the NTPCkr rolling before the anniversary, as well as scraping together a talk to present about the general data pipeline (which we hope to end with the "unveiling" of the NTPCkr). Jeff's been hitting some execution efficiency hurdles (mostly involving many long database queries), but we discovered some more significant optimizations (mostly involving getting around having to query the database in the first place). These speed-ups require some logic changes, which then means fresh code walkthroughs. Extreme programming time.

- Matt

see comments

23 Apr 2009, 23:07:53 UTC
Today included more messing around with gnuplot and various web programming tasks. I also helped Dan format a pdflatex document. I'm kind of cursed with being really fast at working with these formatting markup languages, so such tasks get thrown onto the end of my work queue a lot.

I noticed we were having a network dip in the afternoon and found once again our web site was being DOS'ed. Somebody (or some robot) was scraping our site, completely ignoring our robots.txt file, etc. Quite infuriating. I wonder if it is officially unethical to make public IP addresses which exhibit this kind of foul behavior. The worrisome part is this kind of activity clobbers mysql (and thus the whole project), and last time this happened everything seemed to recover, and then the database crashed twice over the weekend. We shall see, I guess. It's recovering now.
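Spotting the offender usually comes down to counting requests per client in the web server log. Here is a rough Python sketch of that; it assumes the client address is the first whitespace-separated field of each log line (as in the common Apache log format), and the sample lines and addresses are invented for the example using reserved documentation ranges.

```python
from collections import Counter


def top_clients(log_lines, limit=3):
    """Count requests per client, assuming the address is the first field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts.most_common(limit)


# Made-up sample lines, not real traffic.
sample = [
    '192.0.2.10 - - [23/Apr/2009:14:00:01 -0700] "GET /results.php HTTP/1.1" 200 512',
    '198.51.100.7 - - [23/Apr/2009:14:00:01 -0700] "GET /index.php HTTP/1.1" 200 900',
    '192.0.2.10 - - [23/Apr/2009:14:00:02 -0700] "GET /results.php HTTP/1.1" 200 512',
    '192.0.2.10 - - [23/Apr/2009:14:00:02 -0700] "GET /results.php HTTP/1.1" 200 512',
    '198.51.100.7 - - [23/Apr/2009:14:00:03 -0700] "GET /forum.php HTTP/1.1" 200 700',
]

for address, hits in top_clients(sample):
    print(address, hits)
```

A heavy hitter at the top of that list is a candidate for a block rule, though a raw count alone can't tell a scraper from a busy proxy.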

- Matt

see comments

22 Apr 2009, 22:33:18 UTC
Looks like there were some beta project problems after the outage yesterday caused by a missing executable. That got replaced, and I think that everything should be okay now on that front. I heard rumors that regular users were seeing beta errors, but I'm hoping that was just confusion. I haven't heard anything since.

Other than that today was more or less a day of system/web plumbing. The web stuff I'm working on is becoming a major kludge due to time constraints. It's actually a conglomeration of C code and perl, php, and C-shell scripts. You know, whatever works. I'm a big fan of getting things working as soon as possible, then making it pretty later.

- Matt

see comments

21 Apr 2009, 22:16:04 UTC
Tuesday means weekly outage day for mysql database backup/compression. Since the replica got messed up during the duet of crashes over the weekend, we are using this backup today to recover the entire replica database from scratch right now. Should be ready to go in a few hours or so. I think the regular boinc stats xml dumps also broke over the weekend but those should be generating normally again now.

The secondary science database is also suffering some kind of malaise. Not sure what the deal is, but it's slowing down my NTPCkr web site development. I thought it was excess disk activity on the system (caused by writing a primary database backup image to one of its spare drives) wreaking havoc, so I waited for that to end, but still no dice. Had to stop/restart the engine and even then it went through some phase of vague recovery before I could access it again.

Finally got that replacement sata card for the datarecorder down at Arecibo. Jeff and I tested it in a system up here (mostly to make sure we didn't need to update its firmware) and I just put it in a box heading to Puerto Rico (along with a set of blank data drives). Hopefully it'll be a quick swap and we'll be back to recording data again.

Jeff and I are really getting into the mode of programming/development. I think we found a way to speed up the NTPCkr a little bit more this morning, which is always a good thing. I'm still mostly working on internal visualization tools (with some simultaneous thought to what the first rev of the publicly available pages may look like). Don't get too excited yet - it's mostly just a table of numbers.

- Matt

see comments

20 Apr 2009, 23:04:44 UTC
The mysql database crashed on Friday, then again on Saturday. The reasons are mysterious, though we've had similar crashes in the past - just not two in immediate succession like that. Most of the large, important tables (user, host, workunit, result) are using the innodb engine, while the many others (including team, forum preferences, posts, etc.) are using mysql's standard myisam engine. There's worry we may have lost a few rows in some of the myisam tables, though they seem to check out okay. The replica database, though, is in a confused state so we just shut it off for the time being. We're going to save any remaining cleanup for tomorrow's usual outage. As stated elsewhere, Jeff and I have adopted a policy of no-system-changes (except for emergencies) until after the anniversary. So as long as mysql continues to run well, we're not going to worry about this so much.

I know I write all these missives and therefore I get the brunt of the accolades (or otherwise) but Jeff/Bob pretty much took care of the entire mess above. I did log in on Sunday and cleaned up the server status page and the validators (which for some reason *have* to start on the command line, as opposed to the usual cron job which restarts stopped processes), but that's the usual drill (we're always logging in on nights/weekends to kick one process or another).

- Matt

see comments

16 Apr 2009, 21:39:09 UTC
Slow steady progress since the last tech news item. The science database continues to be massaged into shape from the past month of nastiness. It's working, but some indexes are still missing, and some queries are taking longer than we'd like. Sometime, probably next week, I'll turn the science status page updates back on - until then the numbers are old and/or flat out wrong.

We're narrowing down the cause of our data recorder woes to either the SATA card or the system itself. We're trying the former first. A new one is on order and we'll have to get it configured remotely (which is a lot easier than configuring a whole new system remotely).

We're also finding that we don't have the processing power we'd like. It seems like we lost a lot of active users over the past few months. I blame the recession. You could also blame Astropulse, I guess. In any case, we need more people. We're hoping the 10th anniversary buzz will help. And speaking of that, Jeff and I are putting all focus on the NTPCkr, just so we have something fun/new/interesting to present in time for any p.r. blitz. That means very little effort in systems/upgrades/etc. for the next 5-6 weeks. Simply don't have the time/manpower.

Sorry about the lull in tech news items. I was on vacation visiting 23 relatives. Many are under 5 years old, which meant a lot of them have colds, which meant I got sick immediately upon my return, earlier in the week.

- Matt

see comments

8 Apr 2009, 20:00:28 UTC
The science database choked last night. Nothing terrible - it was just unable to deal with the pulse index rebuilds as well as the usual outage recovery. So the assimilators got a little hung up for a while until the current index build was finished. It's still a mystery why this was as big an issue as it was - we've built indexes before on live, fully functional databases. Hmm. Apparently we have to be a little less cavalier about it.

Turning off a server for good always has unintended consequences. Shutting down milkyway yesterday caused mail from the web server to fail. A couple red herrings later I found the problem - the milkyway mail server replacement (clarke) wasn't configured to allow relaying from the web server machine. Easy squeezy problem to fix. Now reset password requests, forum moderation notices, private message alerts, etc. are being sent.

Spent way too much time hunting down the cause of a seg fault in my NTPCkr web page code. It's kinda hard when it's a C program that's being executed within a c-shell script, which in turn is being called by a php script, and which is all running under apache. It's frustrating when everything works on a command line, but not within apache. Anyway I finally figured it out, or at least got it working. The irony is this code was to produce a tiny close-up waterfall plot around any given signal (to immediately spot symptoms of RFI), and once it was running Jeff and I realized our database query logic was slightly wrong, and the correct logic would take too long to be of any use in a dynamically generated plot on the web anyway. Sigh. Looks like we'll have to batch job it or something like that.

- Matt

see comments

7 Apr 2009, 23:15:25 UTC
Outage day today. No big news there on the mysql backup/compression front. We're busy building indexes that were lost during the pulse table rebuild, so that's adding some load to the science database. That may slow splitters/assimilators down at points over the next few days. We shall see.

I did shut down server milkyway for good today, which was our last solaris system still running. This makes me sad. In general, I still prefer solaris over linux, for what that's worth. And I definitely have had much better luck with Sun hardware than with anything else.

Lost in radar/ntpckr coding, hence the short note today. Now I have to catch a bus...

- Matt

see comments

6 Apr 2009, 22:32:20 UTC
Much progress over the weekend on the science database front. The pulse table has successfully been rebuilt, we started up the assimilators, and the queue drained to zero. With the influx of resources the splitters revved up and more workunits went out. All was well until the logical log on thumper filled up. This is a log of transactions which is necessary for database replication, and given all the pulse table activity it's no surprise it did get clogged up with extra transactions. When the log fills, the database engines have no choice but to hold still until there's log space again. Jeff noticed the dip in the traffic graph and got that all sorted out.

Just now there was another dip in the traffic caused by some DOS'ing on our web site causing some mysql database overload. Damn robots skimming stats off our sites... I made a quick route rule to block the offending IP. This damaging effect was probably unintentional but still very annoying.

- Matt

see comments

2 Apr 2009, 22:44:31 UTC
The science database issues slowly get better. The root drives are now all sync'ed up, but as I mentioned before this is only a temporary condition. This will fail again upon next bootup. That's fine because this forces the issue of reformatting the data RAIDs on the system which is something I've been wanting to do for a year now - might as well reformat the whole system, root, data, and all. The pulse table continues to get populated and assimilators remain off - at least for another day. We're about to run out of workunit disk storage (again) so expect another workunit shortage period in the very near future. My new rough estimate for the pulse load to finish is sometime tomorrow, and then we can turn the assimilators on, and we will be back to normal (whatever that means).

One of the download servers (bane) has been having mounting issues the past few days, hence the locking-up of the server status page. I just rebooted the thing. Let's see if that holds.

Once again today was mostly a coding day. I've been annoyed by the radar blanking stuff, being as how the design has changed underneath me thus rendering a week (or two) of my effort moot. The old understanding was that we should only be seeing one type of radar at a time, but my output was showing this to be far from the case. Nevertheless once I got a quick handle on the fftw routines I made quick work of the correlation code and am already spotting radar quicker and more effectively. However a lot of graphing/threshold tweaking is in order before I can really start locking on and blanking.
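For readers wondering what FFT-based correlation looks like, here is a toy sketch in pure Python. It is not the radar blanker code: the real thing uses the fftw routines, while this uses a small hand-rolled radix-2 FFT so it runs with no extra libraries, and the pulse pattern is invented. It computes a circular cross-correlation via the standard identity corr = IFFT(FFT(signal) * conj(FFT(template))) and picks the lag with the strongest match.

```python
import cmath


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(values):
    """Inverse FFT using the conjugate trick on top of fft()."""
    n = len(values)
    conj = fft([v.conjugate() for v in values])
    return [v.conjugate() / n for v in conj]


def circular_cross_correlation(signal, template):
    """corr[lag] = sum over t of signal[(t + lag) % n] * template[t]."""
    spectrum = [s * t.conjugate() for s, t in zip(fft(signal), fft(template))]
    return [v.real for v in ifft(spectrum)]


# Invented pulse pattern: two pulses three samples apart.
n = 16
template = [0.0] * n
template[0] = 1.0
template[3] = 1.0
# The "recorded" signal is the same pattern delayed by 5 samples.
signal = [template[(t - 5) % n] for t in range(n)]

corr = circular_cross_correlation(signal, template)
best_lag = max(range(n), key=lambda lag: corr[lag])
print(best_lag)  # 5
```

Real data would need thresholds on the correlation peak, which is presumably where the graphing/threshold tweaking comes in.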

- Matt

see comments

1 Apr 2009, 22:01:27 UTC
Let's see.. we're *still* waiting for the RAID resync's to finish and likewise the pulse table rebuild. Another day or two? Meanwhile, I cleared off enough space on the workunit machine such that we can keep producing/sending out work. We still can't assimilate very much until the pulse table rebuild is over, but at least the people can do science and get credit. I'm worried about mysql bloat with the large result table (over 2 million waiting for assimilation), but we've been here many times before and lived.

Lost in the chaos of outage recovery yesterday was a bunch of "make science status page" processes piling up on top of each other, causing extra stress on the science database, and eventually making the splitters jam up. Oops. I killed all those this morning and that particular dam broke. Now that we're catching up on satisfying workunit demand I think we'll be maxed out traffic-wise for a while, which isn't the worst of problems (that means work *is* flowing as fast as we can send it).

Lots of code walkthroughs with Jeff today regarding the NTPCkr. It's getting to be a mature piece of code. Scoring mechanisms are almost all in place (though they still may need major tuning once we sift through enough real data). We're still concerned about our ability to actually keep it running "near time," i.e. will the database be able to handle the load? We shall see. A lot of database improvements to help this have unfortunately been blocked on the last couple of weeks' worth of problems with thumper.

Happy April Fool's Day! Don't believe anything anybody says! Actually that's good advice regardless of the day of the year.

see comments

31 Mar 2009, 22:48:04 UTC
Another Tuesday, another planned outage. We did the usual database compression and backup but it still took a long time as we're bloated with 2 million extra results waiting to be assimilated.

No big deal there, but of course we're still mired in the thumper projects. It's becoming a two-weeker (since the original crash the Friday before last). Remember we're fighting on two fronts: rebuilding the root drive RAID and rebuilding the pulse table. Starting with the former, all we (thought we) had left to do was install grub on one of the two bootable drives (even though the weird drive numbering causes grub to read the actual kernel image off a third, non-bootable drive). Before launching into that I rebooted the system just to make sure everything was working.

This system has very large ext3 file systems, and so I used tune2fs a while back to prevent a long (6-8 hour) forced file system check every 180 days (the default). Unbeknownst to us, it would *also* force a check every N mounts. So I was very displeased to find the system going through a round of forced checks when all I wanted to do was quickly reboot the thing. I was just going to let it go, but after a half hour I got sufficiently annoyed to just halt the check (gracefully) and re-tune2fs'ed to prevent this from happening again.
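For reference, tune2fs has two separate knobs here: `-i` sets the time interval between forced checks and `-c` sets the maximum mount count, and setting either to 0 disables that trigger. The Python sketch below just builds a command line that turns off both. The device name `/dev/sdX1` is a placeholder, the helper name is made up, and this is not necessarily the exact invocation used on thumper.

```python
def tune2fs_disable_checks(device):
    """Build a tune2fs command that turns off both kinds of periodic fsck."""
    return [
        "tune2fs",
        "-c", "0",  # max-mount-counts: 0 disables the every-N-mounts check
        "-i", "0",  # interval-between-checks: 0 disables the time-based check
        device,     # placeholder; substitute the real block device
    ]


print(" ".join(tune2fs_disable_checks("/dev/sdX1")))
# tune2fs -c 0 -i 0 /dev/sdX1
```

Disabling both means the file system is only checked when someone runs fsck by hand, so that becomes a chore to schedule during a planned outage.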

And upon coming up I was further displeased to find the only root drive (of the three) that appeared in the RAID was the one in the non-bootable slot. We're stumped as to why. Well, even though this RAID was seriously degraded, we powered down, did the planned drive swapping and brought the system up. Even though drives were swapped the only root drive this time in the RAID was the (new) one in the non-bootable slot. Fine. I'm pretty much of the opinion we need to reinstall the OS at this point to clean everything up, but until that happens we have some (oddly long) drive resyncs to un-degrade the RAID. Of course, this will all fail again upon next boot as far as I can tell.

Meanwhile, the pulse table reload that started yesterday failed last night. Since we have redundant database servers now, the informix engine is sensitive to anything that may bring the primary/secondary systems out of whack. This includes really long queries, like the one we started yesterday to copy 500 million pulses from one table to another. Back to square one. Jeff wrote a script that breaks this one query up into many smaller ones, thus hopefully circumventing any "long query" issues. We estimate this will be done Thursday sometime.
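The "many smaller queries" approach can be sketched like this. It is not Jeff's script: it uses Python's built-in sqlite3 as a stand-in for informix, and the table names (`pulse_old`, `pulse_new`), columns, and batch size are all invented for the example. The point is only the shape of the loop: copy one id range per statement and commit after each, so no single transaction runs long.

```python
import sqlite3


def copy_in_batches(conn, batch_size):
    """Copy rows from pulse_old to pulse_new one id range at a time."""
    low, high = conn.execute("SELECT MIN(id), MAX(id) FROM pulse_old").fetchone()
    if low is None:
        return 0  # source table is empty, nothing to copy
    batches = 0
    while low <= high:
        conn.execute(
            "INSERT INTO pulse_new (id, power) "
            "SELECT id, power FROM pulse_old WHERE id >= ? AND id < ?",
            (low, low + batch_size),
        )
        conn.commit()  # each batch is its own short transaction
        low += batch_size
        batches += 1
    return batches


# Tiny in-memory demo with 1000 fake rows instead of 500 million pulses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pulse_old (id INTEGER PRIMARY KEY, power REAL)")
conn.execute("CREATE TABLE pulse_new (id INTEGER PRIMARY KEY, power REAL)")
conn.executemany(
    "INSERT INTO pulse_old VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1, 1001)],
)
conn.commit()

batches = copy_in_batches(conn, batch_size=100)
copied = conn.execute("SELECT COUNT(*) FROM pulse_new").fetchone()[0]
print(batches, copied)  # 10 1000
```

A side benefit of batching is that a failure only loses the current batch, so the copy can resume from the last committed id instead of going back to square one.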

I did start up one assimilator - the trickery I mentioned yesterday (to let assimilation run alongside pulse table insertions) does work, however as the pulse table gets populated it eats up a lot of database locks, and the assimilator can barely get an insert in edgewise. In any case, I found a rich source of stuff to move off the workunit storage server, so at least that bottleneck will be temporarily alleviated.

Oh, yeah - end of the month, so that's the end of the current thread title theme. I think the only person who came close to describing the theme was QuietDad yesterday (apologies if others got it earlier). Anyway, the official theme was: Apple II hackers/game programmers who, as a budding young programmer myself in the 70's/80's, I thought were super heroes such that I fondly honor their names (real or otherwise). It takes a real game programmer to do *everything* - not just the game logic but also the design, the graphics, the animation, the sound, the music... and do it all in machine language (and 6 colors, including black and white, in 280x192 "hi-res" graphics).

- Matt

see comments

30 Mar 2009, 21:58:54 UTC
Monday, Monday. There was little done on the science database/pulse table problem over the weekend - we hit a couple snags so we tabled it until we were all here in the lab today. It looks like we're doing the big move successfully now (taking the 500 million pulses from the old table and inserting them into a new, better formatted table with more extent space). I was hoping that we'd be able to do some trickery to get assimilation flowing again simultaneously, but it looks like that isn't in the cards.

With the assimilator queue clogged we can't delete anything, which means we ran out of room to create new workunits, or at least enough to keep up with demand. Hang in there, folks. Work is on the way.

- Matt

see comments

26 Mar 2009, 20:25:02 UTC
So the focus is still on thumper, the science database/raw data server. Last night we finished resyncing all the root drives (a three drive mirror). We still have to do some swapping to install grub on the third and final drive - we'll do this during the outage next week. Until then we're officially resuming normal operations, at least at the server level. Phew. I started up several raw data transfer jobs since that's been backed up for a week.

Now we can turn our attention to the database. We're dumping the entire pulse table to a file so we can recreate the table in a larger set of db spaces. This is basically all you can do when you run out of extents - unload the table, then reload into new db spaces. I roughly estimate the unload will take at least 24 more hours.

Since we couldn't insert pulses until we got more extents, the assimilator queue grew fairly large. So why stop now? There's really no reason not to split/create new multibeam workunits - we can still insert workunits into the science database. So I started a single multibeam splitter if only to satisfy some workunit demand until we can assimilate again. Of course, if we can't assimilate, we can't delete - and we've been running low on space to store workunits. But being that we've been running only astropulse for a day that actually helped push a lot of ap workunits/results through the validation/assimilation/deletion queues, which in turn cleared up a fair amount of storage. So we're good for the moment, at least storage-wise (seems like even the one splitter is sensitive to the current heavy load on thumper).

Tomorrow is actually an official university holiday (the staff gets its one day of spring break). However, like always, Jeff, Eric, Bob and I will be poking and prodding at the servers remotely over the weekend.

- Matt

see comments

25 Mar 2009, 21:07:21 UTC
Mmm-kay. So where are we at with the science database...? The morning today was much like yesterday: me, Eric, and Jeff shouting over the deafening noise of the server closet, taking turns hunched over a monitor attached directly to thumper (the kvm monitor was having separate issues). Lots of reboots and unexpected (and unpleasant) results. Lots of thinking we found the problem only to reboot and (five minutes later) finding we were wrong, then having to reboot again off of DVD (taking another five minutes).

Basically our discussions were along the lines of: Why does the boot metadevice disappear when booting off of DVD? And why does the root metadevice disappear when coming up via grub? Didn't we resync these two drives yesterday? Oh look - the grub device map is referring to /dev/sdm, which was how the root drive was enumerated when there were only 24 drives in the system - it should be referring to /dev/sdy now that we have 48 - so this must be at least one of our problems! Nope. Changing that did nothing. Etc. etc. etc. etc.

Well, whatever. It's been a two-day-long game like a demented version of Towers-of-Hanoi - swapping drives, installing/reinstalling grub, resyncing devices, reconfiguring mdadm, then going back to step one and trying a different permutation. In hindsight it probably would have been easier to just install a new OS from scratch (though we would have had to recreate a web of informix configuration which also exists on the root drives). Right now the system is actually up (finally) and resyncing one mirror (again) and will have to sync another once that's finished. So we're offline for another day, and we haven't even gotten to the pulse table problems yet. I will still try to get Astropulse running in some form later on today/tonight.

Funny thing: Oliver and Bernd of Einstein@home have been visiting from Germany, collaborating with Dave on some general BOINC stuff. They left just a couple hours ago, but we did discuss how when SETI@home is having issues such as this, Einstein@home certainly gets a huge "bump" from the sudden influx of free CPU time. We joked how these thumper issues strangely coincided with their arrival last week.

Meanwhile, I'm back on radar blanking detail. We're now trying cross-correlations to match radar patterns using fftw.

- Matt

see comments

24 Mar 2009, 20:27:33 UTC
The good news is that our regular Tuesday maintenance outage today chugged along quickly, and without incident. The not-so-great news is that we are still fighting with thumper to get it running properly again.

Jeff, Eric, and I whipped up a cookbook yesterday of the 7 or 8 steps to get thumper's root drive mirrored. As of this morning we had only one working drive with root/boot on it, but it's the spare drive sitting in the /dev/sda slot. According to the BIOS, the root/boot drives have to be in slots #0 and #1, but thanks to non-linear disk controller labels on the backplane these drives show up in linux-land as /dev/sdy and /dev/sdac. Of course, you can only install grub on /dev/sd[a-d] which means lots of disk swapping and rebooting and resyncing.

However, we're still on step #2 right now, and it won't finish until later tonight. The three of us were huddled over thumper for almost three hours - a frustrating period of time starting with us rebooting thumper "just to make sure everything is working" and then it wouldn't mount the root drive because of underlying issues with the metadevice. This was all mysterious, and after poking this and that it got worse - we could only boot in recovery mode off of DVD, and we had to hack partition tables and change disk identifiers before we could see root again. That's where it's at now: we're syncing the one working drive with a new spare, a process that we thought would take less than an hour but will take five, apparently.

To add insult to injury our pulse table in the science database on thumper ran out of extents last night, which basically means the tables are full even though we have disk space available. So as if the above ordeal wasn't enough, we'll need an additional day or two to recreate (or at least hack at) the pulse table to add more extents. Long story short, don't expect SETI@home to be generating any new work or assimilating anything for a week (unless we're lucky). We'll at least try to keep Astropulse working during this time, so computers that can run Astropulse will be kept busy.

When it rains it pours, but we'll be back to normal again soon enough.

- Matt

see comments

23 Mar 2009, 19:30:51 UTC
We had a crazy weekend in database-land. First and foremost, we had issues with one of the root drives on thumper (the primary science database server, among other things). We didn't completely lose the drive, but smartd has been issuing complaints recently about bad sectors, and then the whole system crashed Thursday sometime in the early evening. While I was able to get the machine back up and RAID resyncing from home that night, the timing was such that poor Jeff and Eric had to deal with the fallout the next day without me (I was in Carmel playing spy music at a corporate party - things like the theme from "Get Smart").

The drive arrangement on thumper is a little bizarre. There are 48 drives that sit in a 12x4 grid, with drive #0 in the lower left corner. However, due to the ordering of the six disk controllers on the system, the root drives (a mirrored pair) show up as /dev/sdy and /dev/sdac. This gave us a bit of a headache when installing linux on this the first time a while ago. The root mirror has a dedicated spare, which by some coincidence happens to appear as /dev/sda.
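To make the odd device names a bit less mysterious: Linux sd names are effectively a letters-only counter (sda ... sdz, then sdaa, sdab, ...). The sketch below converts between a 0-based disk index and that name, assuming names are handed out sequentially in discovery order; the helper names are made up for this example.

```python
import string

def sd_name(index):
    """Map a 0-based disk index to a Linux-style name: 0 -> sda, 25 -> sdz, 26 -> sdaa."""
    letters = ""
    n = index + 1  # bijective base 26: there is no "zero" letter
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = string.ascii_lowercase[rem] + letters
    return "sd" + letters

def sd_index(name):
    """Inverse of sd_name: turn the letters after 'sd' back into a 0-based index."""
    n = 0
    for ch in name[2:]:
        n = n * 26 + (ord(ch) - ord("a") + 1)
    return n - 1

for name in ("sda", "sdy", "sdac"):
    print(name, "is disk index", sd_index(name))
```

By that counting, /dev/sdy and /dev/sdac are the 25th and 29th disks the kernel sees (indices 24 and 28), which is a long way from the first four names.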

Since we never really exercised an actual root drive failure on thumper, Eric and Jeff spent Friday lost in a maze of conundrums. For example, given that grub only recognizes the first four drives in a system (/dev/sd[a-d]) how were things working all along? After some head scratching and drive swapping they got thumper back on line. We still need to replace a drive or two, and those just arrived this morning. Another confusing game plan awaits us as we take what we learned and actually try to apply it. Short story: we need to make a three way mirror of the root drives, after installing grub on the spare by booting from DVD, etc. Honestly I still don't quite get it as I write this up but I'm hoping I will after we go through the whole procedure.

And then yesterday jocelyn (the primary mysql server) had some issues. Eric restarted it, and things seemed to clear up without much ado in due time. To be safe we'll do some sweeping data integrity checks on all our databases, probably during the regular outage tomorrow.

- Matt

see comments

19 Mar 2009, 20:44:53 UTC
Another work week is drawing to a close for me (I don't come in to the lab on Fridays - sometimes I work from home - sometimes not). The servers continue to hold on as long as we have the hardware/network resources available (when will they become unavailable? Hours? Days? Weeks? Months?). Yesterday I mostly worked on NTPCKr web programming - stuff for mostly internal use, but a "lite" version will be made public eventually. Why the "lite" version? It's not because we have something to hide - we just don't have the web server/database resources to handle the traffic. The hope is that the public version will at least have a regularly updated list (every hour?) of the current most interesting pixels on the sky, and you can click on them and see where they are in the sky, and get some sense of why they scored well (numbers of signals, they line up with stars/extrasolar planets, etc.). The internal version will have, among other things, additional clicks so we could pull a window of signals out of the database, plot them, and we can scan for RFI - you can see why this would add a big load on our servers. Nevertheless, we'll see what we can manage, and try to make as much information as possible available to everybody.

Today I spent way too long dealing with confusing subversion/trac configuration. Annoying. I guess I should be getting back to radar blanking (sigh).

- Matt

see comments

17 Mar 2009, 21:37:38 UTC
Hello again. Sorry about missing a couple days there. The end of last week I did write a tech news item that I neglected to post as I got suddenly very busy at the end of the day with random programming tasks, and yesterday I was lost in many meetings and other post-weekend catchup. So be it. Here I am now.

The end of last week I was at a stand-still with various projects, so I chipped away at neglected chores and other nagging annoyances. Like our new mail server's log filling up with cryptic automounter messages regarding a machine we haven't had on line in five years - I finally tracked this down to Eric's home-grown spam challenge script which made reference to this machine in its LD_LIBRARY_PATH. I also tried and failed to figure out why one of our systems, configured exactly like the others, refuses to acknowledge the lab-wide legato backup server. And I cleaned my keyboard for the first time ever (which was gross after years of eating at my desk, and this was probably not helping the lingering ant problem). Then I got lost in NTPCkr stub web page design.

Yesterday there was much discussion about radar. Dan, recently back from Arecibo, confirmed some things and had news about others. The radar blanking code I took over and improved upon had faulty logic, caused by some early misunderstandings (not mine) about how the radar behaved. Most of the radar we see is from the airport, and that's all the hardware blanker thwarts. However, there are 5 other patterns we detect, including the aerostat balloon radar. So one problem is that at times we're seeing a jumble of various radars, making it very difficult to "lock on" and blank them. I'm working on that now. One other point is that the radar frequencies are all pretty much out of our band (typically around 1.3GHz - we're looking around 1.42GHz), but nevertheless are so loud they jam our receivers. However, sometimes if certain projects call for it the Arecibo operators turn on a high pass filter so that the radar frequencies under ~1.4GHz are completely silent. When this happens (about 20-25% of the time) our data are incredibly clean, even without hardware blanking. Of course, since we're piggybacking we can't control when the filter is on, but we do keep track of it in our data headers. We might prioritize this cleaner data for astropulse, which is far more sensitive to radar than SETI@home.

Today had the usual outage for mysql database backup/compression. I took extra time while everything was quiet to move a lot of big files around the raw data storage server - that's mostly why we were slow to get out of the outage this time around, but at least now I can start emptying the latest shipment of drives from Arecibo. Speaking of drives, there was some discussion about that, too. We may start trying to partially send data over the net, if not completely. We thought this was impossible due to bandwidth constraints, but operators at Arecibo told us to give it a shot. This is low priority since, however annoying, the drives, their enclosures, and the shipping rigamarole works well enough right now.

In general the public-facing servers continue to behave themselves. It's been a good couple of weeks. I don't believe in jinxes so I don't mind saying as much. I will say that the workunit storage server is filling up again - a factor of astropulse actually performing well, and workunits sitting around a long time waiting to validate. If it does fill up we'll have to deal with it.

- Matt

see comments

11 Mar 2009, 20:43:03 UTC
Lots of machine rebooting today as Eric is getting his new hydrogen server online, and I'm finishing work on moving mail servers around. This shouldn't have affected the outside world. During all this Eric gave Jeff and me a quick tutorial on merged file systems. Wacky stuff.

Radar wise, I got some lengthy notes from Phil down at Arecibo. Turns out by far most of the radar we see is from the airport, which was news to me, and that's the only thing the hardware blanker checks for. Discussions will continue.

Dan, while at Arecibo earlier this week, replaced our non-working raw data drive enclosure with one we've been using up here. It's unclear whether this helped or not. We're learning that SATA drives (and enclosures/backplanes) aren't necessarily meant for excessive hot-swapping, and will fail after N "mating cycles." This may be what we're coming up against.

- Matt

see comments

10 Mar 2009, 22:45:02 UTC
Tuesday means weekly outage day. Nothing really interesting or scary today. The only sysadmin thing I did during the quiet time was moving mail service off one machine (which we plan to retire soon) onto another. Still have a couple steps to go on that front.

I should mention that we upgraded our network connection from our auxiliary lab to the server closet from 100Mbit to 1Gbit. In practice this meant simply replacing an old cheap switch with a new cheap one. This was mostly for the benefit of Eric and his new compute server, but on the side helps vader (which handles half the downloads and all the assimilators) and our other compute servers maul and marvin (all of which still sit in the other lab, awaiting room in the closet).

Finally stopped being sidetracked enough to work on radar blanking again today. I'm finding some data is very clean and would like to not enforce blanking if it seems unnecessary. E-mails were sent to the experts for advice.

- Matt

see comments

9 Mar 2009, 22:38:16 UTC
Happy Monday, everybody. It was a pretty smooth weekend, so not much to report there. Today I mostly took care of chores and the less glamorous/interesting side of systems administration. Eric bought a new server for his hydrogen projects. We needed to put it somewhere, so we decided to put it in our current auxiliary rack, which is currently sitting in our other lab waiting to replace one of the smaller (and less useful) racks in the closet. One of the download servers (vader) is actually in this auxiliary rack already. Anyway, we discovered that yet again the rails for this server are ever-so-slightly too big given the current rail configuration. Annoyed but determined, Eric and I put forth the effort of taking vader out of this rack (which is why it was offline for an hour there) and adjusting the stupid rails. Now everything fits. Good.

To answer PhonAcq's question ("Now what is on the agenda to improve things to the next level of performance??"), there is always some looking ahead to what we'll need soon. First up is more memory in our mysql server (jocelyn). When all is well it can easily handle a mixed bag of 2000 queries/sec, but during peaks or other crises it may start to page and cause massive disk i/o. Given the current memory configuration it'll be quite easy to add 4GB ram to the system, which will help. Of course we're simultaneously scanning different avenues of download/upload bandwidth increase. We still have yet to do the whole project of converting thumper's RAIDs from 5 to 10, which will boost science database (and likewise splitter/assimilator) performance. There's more, but that's a good start.

- Matt

see comments

5 Mar 2009, 22:01:59 UTC
Once again not much hardware/server stuff to report. I guess the ap_validator "2" is failing due to seg faults. A fact that is obscured on the server status page (due to automatic parsing of configuration files) is that the ap_validator "2" does strictly astropulse_v5 workunits, while ap_validator "1" validates older astropulse workunits. In any case, I warned Josh, he's looking into it, etc. Probably a broken result file/database entry is causing it to seg fault and quit before doing very much.

Today was mostly conceptualizing/programming again for me, though focused back on radar blanking stuff as I should really get this done. I'm getting bogged down with "ragged files" - where the chunks of data aren't neatly ordered, thus causing confusion about where the software/hardware blanking bits are. This usually isn't a problem, except when a particular raw data file is ragged at the top or bottom, and the chunk containing blanking information needed by adjacent chunks is actually at the end of the previous file, or at the beginning of the next, or nowhere to be found at all.

- Matt

see comments

5 Mar 2009, 0:26:41 UTC
Don't really have much to report today, tech-wise. The replica problems I mentioned yesterday ended up not being problems at all. There was some network security stuff I got bogged down with yesterday afternoon and again this morning - campus is ultra paranoid, so when they see what they think is nefarious activity (false positive or otherwise), or even potential security holes that haven't yet been exploited, you have to pretty much drop everything and act on it, which is fair enough.

I spent pretty much my entire day getting the ball rolling on the "visualization" of the NTPCkr output. Jeff has some code working which dumps out giant blobs of xml detailing the "current best" points on the sky. So I spent the day writing up some php which digests this xml and makes nice tables, plots, etc. It's all very basic so far, but it's a start.

We're getting large bursts of network activity at midnight every day now. Not sure what's up with that. Somebody's got a cronjob somewhere doing something.

- Matt

see comments

3 Mar 2009, 23:11:39 UTC
Usual outage day today (database backup/maintenance, etc.). Actually it would have been "usual" except that certain finagling by us in the background may have messed the replica up. That remains to be seen - if it needs intervention the fix would be trivial.

Oh look. Somebody updated web code. Pretty colors. I think I overheard Dave talking to Rom about new forum features. I have no idea what they are.

Helped Jeff walk through NTPCkr code this morning, tracking down bugs, etc. In essence the goal of this program is simple - to find groups of signals in our data that fall within a certain window of frequency/space but have been seen over multiple observations, and preferably near stars/planets. But it's actually quite complicated - there's a lot of set analysis/manipulation requiring chunks of dense code where bugs can hide if you're not careful. Plus there are always new "special cases" we find (or dream up before we find them) that we need to consider. In any case, we're pressing to get this thing rolling and producing non-zero results before the 10 year anniversary of the SETI@home launch in May.
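The "simple in essence" part can be sketched in a few lines. To be clear, this is not the NTPCkr code: the signal layout (sky pixel, frequency in Hz, observation id), the fixed frequency bins, and the thresholds are all assumptions made up for this illustration, and it ignores the star/planet weighting entirely.

```python
from collections import defaultdict

def persistent_groups(signals, freq_window_hz, min_observations=2):
    """Group signals by sky pixel and frequency bin, keeping only groups
    that were seen in at least min_observations distinct observations.

    signals: iterable of (pixel, freq_hz, observation_id) tuples (hypothetical layout).
    """
    groups = defaultdict(list)
    for pixel, freq_hz, obs_id in signals:
        freq_bin = int(freq_hz // freq_window_hz)
        groups[(pixel, freq_bin)].append((freq_hz, obs_id))
    result = {}
    for key, members in groups.items():
        distinct_obs = {obs for _, obs in members}
        if len(distinct_obs) >= min_observations:
            result[key] = members
    return result

# Made-up sample: pixel 7 shows nearly the same frequency in two observations.
sample = [
    (7, 1420000150, "obs_a"),
    (7, 1420000180, "obs_b"),
    (7, 1420300000, "obs_a"),   # same pixel, far away in frequency
    (9, 1420000150, "obs_a"),   # right frequency, seen only once in this pixel
]
for key, members in persistent_groups(sample, freq_window_hz=100).items():
    print(key, members)
```

Even this toy version hints at where the special cases come from: fixed bins miss signals that straddle a bin edge, so a real implementation needs something smarter than integer division.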

- Matt

see comments

2 Mar 2009, 23:01:18 UTC
Not much going on (SETI@home-wise) over the weekend. The fallout from those traffic woes over a week ago is pretty much all behind us (I think completely after we do the database compression tomorrow). The average temperature in the server closet has risen slightly, but we think this is mostly a function of current weather (it seems that during rainy/foggy periods the air conditioner is less efficient).

I did get another server online - something donated by Intel a while ago, but I only now found the time to set it up, add more memory, etc. It's going to be mostly used as a compute server for Eric's hydrogen study project, which is good for SETI as that means his IDL processes won't be competing with our NTPCkr/radar blanking tests.

We continue to have raw data drive enclosure problems. This time the set down at Arecibo is getting funky. Very hard to debug remotely.

- Matt

see comments

26 Feb 2009, 19:46:29 UTC
Random day today for me. Catching up on various documentation/sysadmin/data pipeline tasks. Not very glamorous.

The question was raised: Why don't we compress workunits to save bandwidth? I forget the exact arguments, but I think it's a combination of small things that, when added together, make this a very low priority. First, the programming overhead to the splitters, clients, etc. - however minor it may be it's still labor and (even worse) testing. Second, the concern that binary data will freak out some incredibly protective proxies or ISPs (the traffic is all going over port 80). Third, the amount of bandwidth we'd gain by compressing workunits is relatively minor considering the possible effort of making it so. Fourth, this is really only a problem (so far) during client download phases - workunits alone don't really clobber the network except for short, infrequent events (like right after the weekly outage). We might be actually implementing better download logic to prevent coral cache from being a redirect, so that may solve this latter issue. Anyway.. this idea comes up from time to time within our group and we usually determine we have bigger fish to fry. Or lower hanging fruit.

Oh - I guess that's the end of this month's thread title theme: names of lakes in or around the Sierras that I've been to.

- Matt

see comments

25 Feb 2009, 22:48:42 UTC
It looked like we got beyond the current deluge without too much intervention. Good. Then our bandwidth spiked again. Bad. But then it recovered once more. Good. Oh well, whatever. We're still just in "wait and see if it gets better on its own" mode around here - if we hit our bandwidth limits (and we understand why) there's not much else we can do.

Spent a chunk of the day tracking down current donation processing issues. What a pain. I really need to document the whole crazy donation system so other people around here can fix these problems when they arise. Maybe I'll do that later today. Other than that, just some data pipeline/sysadmin type stuff.

A note about the server status page: Every 10 minutes a BOINC script runs which does several things including: 1. start/restart servers that aren't running but should be, and 2. run a bunch of "task" scripts, like the one that generates the server status page. Since this status page script runs once every ten minutes, it is only a snapshot in time - not a continuum. It also could take several minutes to run its course, as it is scanning many heavily loaded servers. So the data towards the top of the page is representative of a minute or two earlier than the data towards the bottom. And server processes, like ap_validator, hiccup from time to time and get restarted every 10 minutes, then maybe process a few hundred workunits, but fail again a second before the status page checks its status. So even though it was running the past couple of minutes it shows up as "Not Running." In short, don't trust anything on that page at first glance.
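Here is a toy model of that last failure mode, just to show how far a single snapshot can be from the truth. The timings are invented (restart at t=0, crash at 599 seconds, status check half a second later) and the function names are hypothetical.

```python
def is_running(up_intervals, t):
    """True if t falls inside any [start, end) interval during which the process was up."""
    return any(start <= t < end for start, end in up_intervals)

def uptime_fraction(up_intervals, window_start, window_end):
    """Fraction of the window during which the process was actually up."""
    up = 0.0
    for start, end in up_intervals:
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            up += overlap
    return up / (window_end - window_start)

# Restarted at t=0 s, crashed at t=599 s, status script samples at t=599.5 s.
intervals = [(0.0, 599.0)]
snapshot = "Running" if is_running(intervals, 599.5) else "Not Running"
print("status page says:", snapshot)
print("actual uptime over the 10 minutes: %.1f%%"
      % (100 * uptime_fraction(intervals, 0.0, 600.0)))
```

In this made-up case the process was up for nearly the whole ten minutes, yet the one instant that got sampled reports it as down.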

- Matt

see comments

25 Feb 2009, 0:16:11 UTC
Had our weekly maintenance outage today, including the usual chores. I took the opportunity to replace a failed drive on one of our administrative file servers. I also issued the long-overdue final "shutdown" command on another administrative server, kang, which we no longer use. Many years ago, during the early days of SETI@home, several Sun representatives came by one day to discuss our progress. We thought it was just an informal touching-base kind of meeting, but they told us at the end they were going to donate a whole rack full of 6 state-of-the-art Sun servers and 2 disk arrays. Sun has always been nice to us, but this was completely unexpected. We eventually dubbed this the "k-rack" as we named every server after a sci-fi character starting with "k" (kang, kodos, kosh, klaatu, kryten, koloth). Well, kang was the last one to go - the end of an era. We're still using the rack itself, though - very useful.

Network bandwidth woes continue, moreso now that we're coming out of the weekly outage. Lots of discussion about this in the previous thread - let me see if I can wrap up all the major points quickly. There are three potential solutions to our bandwidth limitations that we are actively entertaining/researching with the related parties. They are: 1. get a full 1Gbit link up to our server closet (pros: zero migration, cons: time/cost - about $80K in parts/labor), 2. collocation on campus (pros: minimal cost/migration, cons: almost impossible nuisance having to administer from a distance), 3. have a third-party entity host/administer everything (pros: we can ditch sysadmin for once and get back to work, cons: major cost, major migration). Each of these solutions requires a major amount of "getting ducks in a row" (due to equipment policies, contract terms, general scheduling issues, etc.) - it's hardly just a money issue. Of course there are other options, too, like putting all efforts into final data analysis and shutting down SETI@home. One major issue is that our server closet (roughly 100 CPUs, 100 TB disk, 200 GB RAM) operates atomically - it's all or nothing. We can't just move one piece somewhere else. It's long and complicated - please don't make me explain why unless there's a free pitcher of beer involved.

- Matt

see comments

23 Feb 2009, 21:06:51 UTC
Our outbound traffic has been pegged since Friday. This may seem like only a download problem, but it even affects uploads, as the basic syn/ack handshaking packets on the upload server get dropped along with the rest of the download packets that can't make it through the dam.

After discussions with Eric and Jeff, here's what we gather is happening. We use coral cache to reduce our bandwidth needs. Coral cache is an easy-to-use, free, third-party system which does some nice distributed caching just by redirecting the right apache requests to their servers. For example, somebody wants to download the latest astropulse client, they go to our download server, and then they are redirected automatically to the coral cache server. The redirect is of the form such that, if the coral cache server hasn't done so already, it downloads the latest astropulse client from us, caches it, and then sends it to the requester. Once cached, it doesn't need to contact our servers again. So, in essence, all but one of the client download requests are served from sources outside our lab, thus saving us lots of bandwidth.

That brings us to problem 1. Many ISPs don't like redirects to third-party IPs. This is understandable. What happens in this case is a client downloads a new application, but instead of getting the actual executable they get a blob of HTML saying "this ISP doesn't like third party redirects," etc. Obviously the checksum of this HTML blob won't match the executable checksum, resulting in an application download checksum error. This has been a known problem. So we've been only using coral cache during the first couple of weeks after a new application is made available to reduce the pain of the download rush. A small fraction of our users will be inconvenienced by those redirect errors, but they'll get their clients in due time when coral cache is turned off after the initial "wave."
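The checksum failure described above can be sketched as follows. MD5 is used here purely as an example digest, and the byte strings are stand-ins for a real executable and a real ISP block page; none of this is taken from the BOINC client code.

```python
import hashlib

def digest(data):
    """Hex digest of a byte string (MD5 chosen only for the example)."""
    return hashlib.md5(data).hexdigest()

def verify_download(body, expected_digest):
    """True if the downloaded bytes match the checksum the server advertised."""
    return digest(body) == expected_digest

def looks_like_html(body):
    """Cheap hint that we got a web page instead of a binary."""
    head = body[:256].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")

# Stand-in payloads, made up for this sketch.
real_executable = b"\x7fELF" + bytes(range(256)) * 4
block_page = b"<html><body>This ISP doesn't like third party redirects</body></html>"

expected = digest(real_executable)
for label, body in (("direct download", real_executable), ("blocked redirect", block_page)):
    ok = verify_download(body, expected)
    print(label, "-> checksum ok" if ok else "-> checksum error",
          "(looks like HTML)" if looks_like_html(body) else "")
```

From the client's side both outcomes are just "a file arrived"; only the checksum reveals that the second one is a web page rather than the application.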

But then there's problem 2. An application download checksum error (a) doesn't cause exponential backoff and (b) causes all workunits also requested by this particular client to be errored out and resent. This is at least the behavior in older, yet still commonly used, boinc clients. Dave said most of that has been addressed, but if there are still bugs they'll be fixed.
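For context, exponential backoff just means the wait before the next retry doubles after each consecutive failure, up to some cap, so a client that keeps failing quickly stops hammering the server. A minimal sketch follows; the base delay, the cap, and the jitter range are invented for illustration and are not BOINC's actual values.

```python
import random

def max_delay(failures, base=60.0, cap=86400.0):
    """Upper bound on the wait (seconds) after `failures` consecutive failures."""
    exponent = min(failures, 32)  # keep 2**exponent small enough to stay finite
    return min(cap, base * (2 ** exponent))

def backoff_delay(failures, rng=random):
    """Pick a randomized wait so clients that failed together don't all retry together."""
    ceiling = max_delay(failures)
    return rng.uniform(ceiling / 2, ceiling)

for n in range(6):
    print("after", n, "failures wait up to", max_delay(n), "seconds")
```

Without something like this, a client that hits the same error on every attempt retries at full speed, which is how a small set of machines can produce an outsized share of the traffic.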

In any case, what we saw this weekend was a confluence of these two problems. This may not have been an issue before due to lighter traffic patterns, but we sure fell off the deep end this time. Maybe there was a small set of heavily active clients this time around causing most of the pain. And once the network gets pegged, all hell breaks loose, and it takes a while to heal itself.

Eric actually had most of this figured out before we arrived today, and already turned off coral cache. At least the broken redirects spiraling out of control would stop happening. He also adjusted the tcp settings on the upload server to help get those partially working again (instead of only 2% uploads getting through, now it's about 50%).

The plan is to let this current state of indigestion pass on its own, and if needed change some BOINC settings (if not also BOINC code) so that future coral cache attempts will be direct links as opposed to apache redirects.

- Matt

see comments

19 Feb 2009, 20:41:57 UTC
As we move toward the weekend we're sticking with the current raw data storage workarounds, which means servers are loaded heavier than we'd like, but at least data is still flowing. I wouldn't be surprised if there are network hiccups or if the assimilator queue swells during the weekend.

So far this morning lots of chores. Bob and I got a shipment of empty data drives bundled up to be sent to Arecibo. I finished getting the new CPU server configured (now me, Eric, Josh, and Jeff are in less competition for cycles). I made more strides towards retiring the last two Solaris machines. Honestly, depending on the development/production environment I'd still probably prefer Solaris over linux. So I'm sad to see these systems go, but they are both very old Sparc machines that we simply don't need anymore.

Late last week Eric, Jeff and I had a quick meeting to discuss current candidate scoring algorithms - we're pretty sure we'll have to tweak them as we go, but we're in enough agreement to get started implementing this part of the NTPCkr. Jeff's been all over that this week. I'm just now turning my focus back to actual development, too. My software radar blanker now agrees with the hardware blanker 90% of the time, which is a very good start. I can add an additional 5% just by adjusting thresholds, but the real test is to run software blanked data through the pipeline and see which workunits generate more RFI (the ones using hardware blanking or the ones using software blanking).
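For the curious, an agreement figure like that can be computed by comparing the two blanking bit streams sample by sample. The sketch below is only an illustration of the idea; the function name and the toy bit streams are made up, and the real comparison may well be done differently.

```python
def agreement_fraction(hardware_bits, software_bits):
    # Fraction of samples where both blankers say the same thing
    # (1 = radar on, 0 = radar off).
    if len(hardware_bits) != len(software_bits):
        raise ValueError("bit streams must be the same length")
    if not hardware_bits:
        raise ValueError("bit streams must not be empty")
    matches = sum(1 for h, s in zip(hardware_bits, software_bits) if h == s)
    return matches / len(hardware_bits)

# Toy data: the two streams disagree on exactly one sample out of ten.
hardware = [0, 0, 1, 1, 0, 0, 0, 1, 1, 0]
software = [0, 0, 1, 1, 0, 0, 1, 1, 1, 0]
print(f"{agreement_fraction(hardware, software):.0%}")  # prints 90%
```

Note that a high agreement number alone doesn't say which blanker is right when they disagree, which is why the pipeline test matters more.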

- Matt

see comments

18 Feb 2009, 23:39:35 UTC
Still having ups and downs with the raw data storage. Possibly a second disk failure. We'll get to the bottom of it soon enough. Traffic may be a bit rocky at times, but hopefully not so much. We also just noticed a drive failed on our upload backup storage. That RAID pulled in a spare without anybody realizing what happened until Jeff and I saw the little orange light in the closet today. We really need better monitoring tools. Actually, we have the tools - we just need time to implement them. Still, it's not a super-critical logical drive (it contains backup data from a separate RAID device) so we're not panicked trying to procure a new spare... yet.

I wish I had more positive things to report today. The details I'm failing to mention aren't all that fun either. Not my day today, I guess.

- Matt

see comments

17 Feb 2009, 23:42:50 UTC
Over the long (President's Day) weekend one of our storage servers had a headache. Not a big deal, and we got to the bottom of it today (pretty much just a RAID drive failure). We were able to get a workaround in place so we could start generating/saving workunits again, and will slowly transition back to normal over the next day or two. It has been a bit rocky the last few days because the workaround involves a different RAID with far less I/O throughput.

There's always a bright side during work transmission failures: we get to catch up on backlogged queues. So by the time we had our usual database compression/backup outage today the result table was relatively small, and therefore got packed down nice and tight. That's always helpful.

Spent most of the day with the fallout of the above, while also getting a couple systems configured for new duty - mostly administrative/CPU servers that will replace some older clunkers.

- Matt

see comments

12 Feb 2009, 20:20:58 UTC
Looks like "Astropulse V5" was finally released yesterday night. As far as I know so far, so good - work is going out, results are being validated. However, it seems like jocelyn (the master mysql database server) had a long period of mysterious pain overnight, and recovered on its own this morning. This happens from time to time on our mysql servers, perhaps due to its own nebulous data scrubbing, or perhaps due to lack of memory which is becoming more of a problem as the database continues to grow and less of it fits in RAM. Unless anybody out there has a couple Sun-qualified 2GB DIMMs that work in Sun v40z's kicking around, we're going to purchase a few. Currently the system has 28GB of RAM - 12 slots with 2GB DIMMs, the remaining 4 with 1GB. We hope to at least upgrade those four to 2GB. It is unclear whether or not our version of the v40z can take 4GB DIMMs (and go over 32GB total).

As for radar blanking, let me clear up the general picture.

Now that we are using the ALFA receiver (since 2006) we are susceptible to military radar, which causes many overflows in our SETI@home/astropulse analysis. The transmitter is aimed right at us approximately every 12 seconds, and then echoes bounce all over the mountains surrounding the telescope the rest of the time. Even the echoes cause us to overflow. The radar is fairly unpredictable - the military isn't very forthcoming about their transmission patterns, and when they are going to change to another pattern. Nevertheless, it is predictable enough: there are about 6 known "patterns" us civilians can lock on to.

Luckily, Arecibo solved this problem for us. They have a hardware device that broadcasts a bit letting all projects at the observatory know when it thinks the radar is on (1 for on, 0 for off). This we call the "hardware blanker" - and we inject this bit into an unused channel in our raw data. This has been quite helpful: when the bit is "1" we'd randomize the data, thus squashing the overflows. At least in theory - there were still three problems.

Problem 1: We only got the hardware blanker working sometime in 2007, so there is no such blanking information in the previous years' worth of data, thus rendering it fairly useless.

Problem 2: The hardware blanker sometimes isn't on like it should be, or even worse is mis-locked onto a wrong pattern and going out of phase with the actual radar, which also renders data quite useless.

Here's where my code comes in: The "software radar blanker." Actually, this is code/logic written by a summer student, Luke, and then I cleaned it up and (apparently so far) got it working. In short, the software radar blanker does a statistical analysis of the raw data - basically looking to see when we're blasted by radar and then trying to lock on to known patterns, and extrapolate from there. Luckily there's another free bit available in the raw data, so the ultimate plan is for raw data to come up here, go through the software radar blanker, and then process. The splitter will use the software and hardware radar blanker bits (exactly how is still up for discussion) to randomize the data. This brings us to...

Problem 3: The randomization shouldn't be totally random. Initially we were injecting white noise into the data when we were blanking. Turns out this causes edge effects and other artifacts during the client analysis. This noise was eventually shaped to fall in line with noise we'd expect to see from a quiet Arecibo. The exact mathematical details of this are left to others who aren't me. I was out of this loop.
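As a very rough illustration of "blank with noise that looks like the quiet data" (and nothing more than that), the sketch below replaces flagged samples with Gaussian noise whose mean and spread are measured from the unflagged samples. Everything here is an assumption: the function name is made up, the samples are treated as plain floats, and matching only mean and standard deviation is a much cruder stand-in for the real noise shaping, whose details are not covered in this post.

```python
import random
import statistics

def blank_samples(samples, blank_bits, seed=0):
    # Replace every sample whose blanking bit is 1 with noise drawn to
    # match the mean and spread of the samples whose bit is 0.
    quiet = [s for s, b in zip(samples, blank_bits) if b == 0]
    if not quiet:
        raise ValueError("need at least one unblanked sample to model the noise")
    mu = statistics.fmean(quiet)
    sigma = statistics.pstdev(quiet)
    rng = random.Random(seed)  # fixed seed only so the example is repeatable
    return [rng.gauss(mu, sigma) if b else s
            for s, b in zip(samples, blank_bits)]

# Toy data: two huge "radar" samples in the middle of a quiet stretch.
samples = [0.1, -0.2, 50.0, 60.0, 0.05, -0.1]
blank_bits = [0, 0, 1, 1, 0, 0]
cleaned = blank_samples(samples, blank_bits)
print(cleaned)
```

The untouched samples come back exactly as they went in; only the flagged ones are replaced, and the replacements sit at the same level as the quiet data instead of spiking.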

All the above was taking too long, so Josh actually implemented code in the astropulse client to reduce some of these radar problems until they are completely solved. He isn't radar "blanking" (which happens during workunit creation) as much as having the clients find stuff that is probably radar and treating it accordingly. For what it's worth, one of the CASPER guys, Andrew, has been having the same exact military radar problems with the pulsar data they've been collecting at the ATA, so he's been simultaneously working on his own radar mitigation techniques. Man, the earth is noisy.

In any case, I figure it'll be about a month of testing/tweaking before we're actually using the software blanker.
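For a feel of what "locking on to a known pattern" could mean, here is a toy sketch: given the times at which the data looked radar-blasted, try each candidate repetition period and keep the one whose predicted pulse grid sits closest to the observed hits. This is not Luke's or my actual logic. The function name, the candidate periods, and the hit times are invented, and real patterns are more complicated than a single fixed period.

```python
def best_pattern(hit_times, candidate_periods):
    # For each candidate period, anchor the grid on the first hit and
    # measure how far, on average, each hit lands from the nearest
    # predicted pulse. Smaller score means a better lock.
    best = None
    for period in candidate_periods:
        phase = hit_times[0] % period
        total_error = 0.0
        for t in hit_times:
            offset = (t - phase) % period
            total_error += min(offset, period - offset)
        score = total_error / len(hit_times)
        if best is None or score < best[2]:
            best = (period, phase, score)
    return best

# Toy data: hits roughly every 12 seconds with a little timing jitter.
hits = [3.0, 15.1, 26.9, 39.0, 51.0]
period, phase, score = best_pattern(hits, [11.0, 12.0, 13.0])
print(period, phase, round(score, 3))

# Once locked, extrapolate where the next few pulses should land.
next_pulses = [hits[-1] + period * k for k in range(1, 4)]
print(next_pulses)
```

The extrapolation step is the useful part: once a pattern is locked, the blanking bit can be set for stretches where the statistics alone are ambiguous.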

- Matt

see comments

11 Feb 2009, 23:01:41 UTC
Before releasing the astropulse application Eric had to add a couple fields to the result tables in the science database that are now necessary. These are large fields, and it's taking informix forever to update the table. The job was started 24 hours ago and is still chugging along. I guess it doesn't help that the assimilator queue is still rather large (though it is draining). So the release is delayed until this job finishes.

The radar blanking stuff I was whining about the other day has nothing to do with the astropulse release, in case there was some confusion about that. Josh and I are working on two completely separate and different forms of radar mitigation. Mine is to better clean up data before any splitting/analysis, Josh's is to deal with radar that squeaked through the first pass and made it all the way to the client. The good news is that I made significant progress on mine today.

- Matt

see comments

10 Feb 2009, 23:03:00 UTC
Today's Tuesday - that means weekly outage. Outside of the normal database backup/compression drill I went through the rigamarole of changing the user id of mysql on the master database server (and updated the ownership of all its files), if only for administrative ease now that it matches the same user id as all other instances of mysql here in our group.

I also decided to yum up several servers that were lagging behind since we have been getting ugly yet harmless kernel warning messages for a while now. Unfortunately, this general update included a buggy nfs package (which I knew was buggy months ago but assumed they must have fixed this by now) which then locked up one of our main file servers, thus grinding everything to a halt. It was an annoying hour or so trying to figure this all out, and ultimately the only solution was to fall back to an old version of nfs. Not sure why this nfs-utils package is *still* in the repositories.

Josh is working on getting another astropulse client out into the world today, and is fighting with the code signing machine as I type this sentence.

Here's another problem we've been having over the past couple weeks, and it doesn't seem to be getting better: ants. I typically don't take a lunch break, and just nosh all day by my computer during small cracks of time. Dave and Jeff are the same way, and have the next two desks adjacent to me. Even though we're on the third floor the ants finally found the mother lode of crumbs and unwashed utensils left on our desktops. There's not enough of them to find their exact point of entry nor plot their general plan of action. So throughout the day I've been mashing the little buggers as I spot them. Hopefully they'll just give up and disappear - meanwhile my work space is smelling more and more like formic acid.

- Matt

see comments

9 Feb 2009, 22:48:18 UTC
My Mondays are generally spent (a) figuring out what went wrong over the weekend (if anything), (b) cleaning up the data pipeline which has been running on its own for three days, and (c) preparing for or sitting in meetings. Today wasn't so different.

Between my radar blanking tests, Jeff's NTPCkr tests, Josh's astropulse development, and Eric's hydrogen studies we're suddenly finding ourselves woefully low on CPU/memory power. Sure, we have 100 CPUs in our closet, but I'm kind of a fuddy-duddy when it comes to running non-critical processes on our high-availability public facing machines. This is frustrating to others as these machines are the ones best suited for the testing/development we're doing. Luckily, we have one server, maul, which can never be a critical system as it has a test motherboard which would be fine except it intermittently loses contact with the keyboard. So this is our one CPU server which is now usually overloaded to the point of unusability.

We do have two machines coming to the rescue: One from Intel, actually donated around the same time as maul. We haven't gotten around to installing an OS on it until today. Why? Well, that means also needing an IP address for it. The university charges us monthly per IP address we use, so to conserve funds we've been keen on only bringing systems online we actually intend to use, preferably to replace a current system. The second machine is a similarly powerful one that we received from a private donor last week.. but the motherboard was DOA. At least that's our theory. We'll get that replaced soon. Both systems will go a long way towards reducing our current development/testing constraints - something we haven't been worried about too much over the past decade because we've been mostly in a mode of data collection/reduction instead of final data analysis... in case you haven't noticed. I'm happy this is changing (or at least portending to change).

- Matt

see comments

5 Feb 2009, 23:57:23 UTC
Spent a large chunk of the day actually programming, which is nice. It seems like the network bandwidth bottleneck part of our malaise over the past couple of weeks has finally gone away - we're back down to a floor of 60 Mbits/sec. However, the mysql database is still quite clogged up. Looks like as I type this sentence we're still having fits as the splitters/feeders/etc. can't get their queries through fast enough. I'm hoping the bandwidth drop means the excess results were all finally downloaded, which means in the next few days they'll return, and we can finally get them validated/assimilated/deleted and out of our hair.

There was a sweeping change in web code brought on line this afternoon. This broke web account authentication, making it impossible for people to log in. Oops. Not my bad - don't kill the messenger. Anyway it was fixed quickly enough.

- Matt

see comments

4 Feb 2009, 22:01:23 UTC
Moving on... We seem to have eventually recovered just fine from the replica resync, as well as the outage in general. Traffic is still very high, but at least just below the point of impossibility. The assimilator queue is indeed dropping, which is a good thing, as that means we're inching closer to removing all the excess workunits and results from the disk, as well as the database. We still seem to be dealing with the result indigestion I described two days ago, but this too is sloooowly getting better over time.

We've been having some load issues on the web server (thinman). There were no obvious signs of being DOS'ed or over-spidered; if anything it seemed like apache developed a memory leak. I yum'ed in the latest kernel, rebooted the machine (in case anybody noticed a 5 minute outage earlier today), and it looks okay at this point. Maybe just a simple case of reboot-itis.

Just found another potential problem with the radar blanking code. Sigh... (Don't worry - it's not a C++ issue).

- Matt

see comments

4 Feb 2009, 0:35:10 UTC
So then. We had our weekly outage today. We knew it would be a long one - the result table is bloated for various reasons so it took forever to compress. This may help get past this period of "indigestion" I mentioned in the previous thread, but there's no sign of it getting much better any time soon. Expect continuing network pain. Plus Bob is resync'ing the mysql replica, so that'll be behind a bit in the near term.

Quite often we recompile all the back-end servers with code thoroughly tested in beta and switch in these new versions in the public project during the outage. We did so today, and the splitters and assimilators all freaked out upon starting up this afternoon with library linking errors. What a hassle. It seems like our servers are slowly getting more and more out of sync, given some are 32-bit, some are 64-bit, some are running this rev of the OS, some are running that rev, some have this package installed, some don't, etc. and this is apparently becoming a problem. Like we have time to clean this all up.

<obnoxious rant>
I was having an offline discussion with a friend who insists that C++ is a vast improvement on C, and that C programmers who complain about C++'s major failings are living in the past or "just don't understand." I wouldn't mind the debate except C++ aficionados usually adopt a smug, condescending tone regarding C programmers that reminds me of Republicans describing Democrats. In any case there was a programming mystery today that ate up a man-hour of my and Jeff's time. If the object in question was just a struct it would have been painfully obvious. Instead the problem was obscured in vague assignment operator behavior. Does anybody have an actual, simple example of C++ code that is (a) easier to debug than analogous C code, (b) required less manpower to generate, and (c) will be forever useful and understood? I'm willing to be convinced, but it hasn't happened yet. Maybe it's just a different (and not necessarily better) kind of brain that loves C++, but I tend to think it stemmed from the evil part of our monkey mind that turns a blind eye toward unnecessary complication for everybody in the hope that things may be easier for ourselves later on. Or the other evil part of our monkey mind that foists contorted methodology on others as some sort of sick competition (which may be fun but is hardly productive). K&R = 200 pages. Stroustrup = 1000 pages. Is C++ really 500% better that it requires 500% the pages to describe? Nope. Case closed.
</obnoxious rant>

- Matt

see comments

2 Feb 2009, 21:54:21 UTC
Happy Monday everybody. I guess I should move on from the January thread title theme (odd little towns/places/features in southern Utah which I've been to during many nearly-annual backpacking/hiking adventures in the area - easily one of the best parts of the U.S.).

We did almost run out of data files to split (to generate workunits) over the weekend. This was due to (a) awaiting data drives to be shipped up from Arecibo and (b) HPSS (the offsite archival storage) was down for several days last week for an upgrade - so we couldn't download any unanalysed data from there until the weekend. Jeff got that transfer started once HPSS was back up. We also got the data drives, and I'm reading in some now.

The Astropulse splitters have been deliberately off for several reasons, including to allow SETI@home to catch up. We also may increase the dispersion measure analysis range which will vastly increase the scientific output of Astropulse while having the beneficial side effect of taking longer to process (and thus helping to reduce our bandwidth constraint woes). However, word on the street is that some optimizations have been uncovered which may speed Astropulse back up again. We shall see how this all plays out. I'm all for optimized code, even if that means bandwidth headaches.

Speaking of bandwidth, we seem to be either maxed out or at zero lately. This is mostly due to massive indigestion - a couple weeks ago a bug in the scheduler sent out a ton of excess work, largely to CUDA clients. It took forever for these clients to download the workunits but they eventually did, and now the results are coming back en masse. This means the queries/sec rate on mysql went up about 50% on average for the past several days, which in turn caused the database to start paging to the point where queries backed up for hours, hence the traffic dips (and some web site slowness). We all agreed this morning that this would pass eventually and it'll just be slightly painful until it does. Maybe the worst is behind us.

- Matt

see comments

29 Jan 2009, 23:25:26 UTC
The replica mysql database on sidious recovered more or less just fine. It may be ever so slightly out of sync with the master database. This means we'll probably rebuild it during the next weekly outage just to be sure.

The scheduling server was up and down yesterday afternoon and this morning. The scheduler CGIs have been segfaulting and adding core dumps caused the system to grind to a halt, needing a reboot. Turns out the problem wasn't in the CGI, but in apache itself (or the fastcgi module). This has been a problem in the past. We seem to have to tweak various apache parameters at random times, based on a chaotic, unpredictable equation involving current resources/demands, mysql health, network health, system health, various queue sizes, etc. Simply reducing the MaxClients to a much lower number caused the segfaults to disappear while still servicing all incoming requests.

We're running low on data to send out, and we're in a murky period where the weekend is rapidly approaching and we are still awaiting the latest shipment of raw data drives from Arecibo. We could pull up as-yet-unanalysed data from our archives, but the offsite storage archive (HPSS) is undergoing several upgrades and has been offline for days. We'll see how this all pans out...

- Matt

see comments

28 Jan 2009, 23:24:18 UTC
Last night sidious (mysql replica database server) rebooted itself. Yeah, we did just move this into the closet, so there's non-zero worry that something may have gotten injured in transit, or it's unhappy in its new home. On the flip side, our servers are rebooting themselves from time to time for no apparent reason except maybe high stress. I love all operating systems (this is sarcasm). Anyway, that meant mysql crashed ungracefully and has been recovering all day - however successful this recovery is remains to be seen. It is just the replica, so no big shakes, really.

And this afternoon we ran out of work to send out. This was due to our science database getting "brain freeze" which is what I'm calling it these days. If you run a wrongly formatted query the whole engine silently grinds to a halt, effectively blocking all splitter and assimilator access. I found and killed the errant queries and the dam burst. So yet again we're recovering from an unexpected semi-outage this week.

Regarding the setisvn server (from last thread)... I'm fully aware of the poor configuration of that virtual domain. Low on my priority list.

- Matt

see comments

27 Jan 2009, 22:40:49 UTC
Last night, due to the high traffic I was grousing about yesterday, the workunit storage filled and therefore no new work could be generated, so we ran out of stuff to send to clients. This cleared up on its own this morning, but then we started the regular weekly database maintenance outage, so we'll be in a bit of connectivity pain for a while.

During the outage I tested the stability of our secondary science database server (bambi). In other words: will it survive reboot without missing drives? It did. So that project is more or less done, and we'll start focusing on the primary science database server (thumper) next.

Even more exciting is that Jeff and I added a couple more servers to the closet today: sidious and casper. The latter is a multi-purpose machine used by the tangentially related CASPER project. The former is the replica mysql database. We were happy to finally get it out of our "test lab" and into the closet because it's big, noisy, and there's a chance its particular network hangups will be solved by moving it physically closer to its friends (all talking over one switch, as opposed to traversing at least three). We have only one major server left to move into the closet: vader. This is all good news but we're kind of maxed out on power usage in the closet, and need to do some breaker tests before adding anything else.

- Matt

see comments

26 Jan 2009, 23:17:39 UTC
Due to various bugs on the scheduler/client side of things some users have been getting far too much work to do. This results in excess workunit downloads which eat up our bandwidth and make it generally difficult for anything to happen, then queues start backing up, etc. The scheduler fix was already deployed late last week, and a client bug fix is in the works.

I have little to do with the above, and the problems should clear up on their own once traffic settles down. Today has been a catch-up-on-mundane-sys-admin tasks kind of day for me, which is fine once in a while.

- Matt

see comments

22 Jan 2009, 23:34:01 UTC
We continue to have problems mounting our raw data drives (which we fill down at Arecibo and drain up here). The symptoms are random, the error messages are random, and where these messages actually appear is random. Jeff and I are pretty much giving up trying to figure it out. We'll most likely remove as many moving parts as possible from the whole system and deal with continuing issues as they arise. Not sure who/what to blame. Linux? SATA? USB? The enclosures? The cables? The drives themselves?

I actually got the software radar blanker working. Whether or not the output it generates is worth anything remains to be seen, but at first glance it looks pretty good. The proof is when I run this on a whole file and make some workunits, and then see if these workunits explode.

- Matt

see comments

21 Jan 2009, 22:18:50 UTC
The secondary science database finally recovered. As we poke and prod at this new configuration we're still finding things we might have done differently, but we're planning to just seal it up and call this project done. Actual gains in speed/performance are to be tested.

As many of you regular/avid readers know the last release of the cuda client got a little messed up - people were getting checksum errors meaning the files were corrupted. Bob did the code signing procedure this last time around from his desktop machine which has recently had problems with its memory DIMMs. This is our best, albeit vague and unsatisfying, theory as to why a small subset of files got corrupted when simply copying from one directory to another.

Continuing progress on radar blanking and the NTPCkr. Jeff and I are anxious to get these projects done already.

- Matt

see comments

20 Jan 2009, 22:58:04 UTC
Welcome back from another long weekend - we had MLK Day off yesterday, and the whole country has been running a little late this morning. Things went mostly well in server land. The astropulse validator was (still) choking on various results so the backlog grew and thus the workunit storage filled up again for a minute there. That means the splitters halted, and we ran low on work to send out for half a day. Other than that, no major events.

Today we began the final stages of the secondary science database shuffle. We were a bit disappointed by the results at first, and did some more reconfiguration/testing before learning to not trust the output of iostat so much as the other evidence that shows we may have improved our peak science db throughput by 10x. Well maybe not so much - we'll see - if it's 2x I'll be psyched. More work tomorrow on that (the secondary is still catching up from being offline for 5+ days).

A followup on a recent story about our Overland Storage servers. I recently mentioned we hit an unexpected 4 TB file system limit on our workunit storage server (gowron). Turns out we actually hit a physical extent limit, and this will be fixed in the latest OS release. This is really just an academic point - we could only grow to 4.25 TB max anyway, given the number of drives. Thanks again to Overland for continued support.

- Matt

see comments

15 Jan 2009, 21:45:17 UTC
This morning I moved on to the next phase of the bambi RAID shuffle - destroying all current volumes and building a series of RAID1 mirrors in their wake. The initial sync will take until tomorrow. Sigh. We'll continue then.

Eric's server ewen (mostly used for studying interstellar hydrogen) crashed this morning. This should be a non-issue except due to various dependencies it hung some of our other servers. Upon restart it was having networking issues thanks to NetworkManager - something we try to uninstall on every system but apparently didn't on ewen. This is a piece of software that comes with linux distributions which, as far as I can tell, exists strictly to create random network problems to keep your workday interesting. In better news, Bob's desktop is working again. The problem was actually a bad internal SATA cable. Or at least things are working since removing it.

The ap_validator is still offline, mostly. It restarts every 10 minutes, maybe gets a few results done, then segfaults. The astropulse people (not me) are working on it. I know nothing beyond that.

- Matt

see comments

15 Jan 2009, 0:09:47 UTC
Today I started the process of reconfiguring the underlying RAID devices on the secondary science database server (bambi). I was able to scrape together enough spare drives within the system to make temporary space so I could shuffle things around. Given the amount of data each shuffle takes a long long time. In fact, we're kinda stuck on this project until tomorrow. Anyway... the database is sitting on three concatenated 6-drive RAID5's. Actually, given the way LVM is handling things it's mostly all on one 6-drive RAID5. Don't ask me why we set it up this way. The plan is to convert these 18 drives into a giant RAID10. More spindles, better striping, etc. and we can take the hit in usable storage.
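To put a number on that hit in usable storage, here is a quick back-of-the-envelope sketch. It uses the textbook rules (RAID5 gives up one drive's worth of space per array to parity, RAID10 gives up half to mirroring) and counts capacity in units of one drive, since the actual drive size isn't stated here. The function names are just for illustration.

```python
def raid5_usable(drives):
    # RAID5: one drive's worth of capacity per array goes to parity.
    if drives < 3:
        raise ValueError("RAID5 needs at least 3 drives")
    return drives - 1

def raid10_usable(drives):
    # RAID10: every drive is mirrored, so half the capacity is usable.
    if drives < 4 or drives % 2:
        raise ValueError("RAID10 needs an even number of drives, at least 4")
    return drives // 2

current = 3 * raid5_usable(6)   # three concatenated 6-drive RAID5's
planned = raid10_usable(18)     # one 18-drive RAID10
loss = 1 - planned / current

print(current, planned)         # usable capacity, in drives
print(f"{loss:.0%} less usable space")
```

So the conversion goes from 15 drives' worth of usable space to 9, which is the price for striping across all the spindles instead of mostly hammering one 6-drive array.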

Other than that, and messing around with Bob's desktop (which seems to have gotten a weird case of OS rot), I'm still elbow deep in programming. I hate C++ so very much but I admit the standard template library is helpful once you wrap your brain around it all.

- Matt

see comments

13 Jan 2009, 22:58:50 UTC
Typical weekly outage (for database cleanup/backup). During it Jeff and I did some more server closet reconfiguration - we consolidated all the Overland Storage stuff (servers gowron and worf, and their combined 16 TB of raw storage) into one rack, along with our router (that connects us to our private ISP separate from campus). This gave us enough room to (finally) add another UPS to the fold - which is good as older ones have been complaining/dying. Our UPS situation is far from optimal, but we're working with what we got. We also (finally) got server clarke into the closet, which will act as a much-needed build/compute server, among other things.

Steady progress is being made on both NTPCkr and radar blanking fronts - in fact I should be working on the latter. Tomorrow I may tackle the RAID re-configuration project on our secondary science server, which may vastly reduce i/o and therefore increase NTPCkr throughput.

- Matt

see comments

12 Jan 2009, 23:58:40 UTC
A rather quiet weekend, though the astropulse validator seems to have gotten locked up on something. Josh and Eric are looking into that. This morning was a little weird. An old UPS we were using as a glorified power strip just up and stopped working, thus removing power to various sundry items in our secondary lab which wouldn't have been a big deal but one of those items was a switch, so sidious and vader (and casper for that matter) disappeared from the network for a short while there. Nobody seemed to notice. In the afternoon Jeff and I plotted some physical server moves for tomorrow's outage. We'll see how much we get done - and as always we take small steps with these big projects.

Various cuda-related items were discussed in our server meeting today. A bug that was causing the triplet overflows was found, and the blue screen of death issue with slower nVidia boards is getting a workaround. New client and application releases in the near future should clear some of this up.

Back to work - which means plotting lots of radar data for me.

- Matt

see comments

8 Jan 2009, 22:26:13 UTC
I actually should be programming all day, but when I dive head first into such activity I have to take frequent breaks to let the CPU in my head cool off as I draw odd diagrams on the dry-erase board to solidify the logic and pseudo-code tumbling around my brain. During these moments of respite I may tend to more enjoyable things, like messing around with the raw data pipeline, or figuring out why, all of a sudden, we're not sending out any work.

The last thing was due to a problem we're seeing more and more around here. As we ramp up doing actual science we're hitting the science database with one-off queries that somewhere contain the phrase "order by." This seems to give informix fits when it's busy. Apparently we need to free up, or create, more resources so the db engine has more scratch space to do sorting. Otherwise it jams up in a slow, quiet manner, and nobody notices until we observe side effects - like the traffic graph dropping to zero. So we're looking into that general problem now.

- Matt

7 Jan 2009, 23:56:34 UTC
Now it's Wednesday, which usually means my focus should shift towards programming tasks. This actually hasn't happened in a while due to holiday schedules and other crises, but the radar blanking code really needs to be hammered into shape already. See the plans page for more info on that. Lots of mental paging-in of C++ programming trickery.

But this morning I was still busy with a bunch of things on my systems task list. Our informix replica server bambi was having fits with exporting/mounting so I had to go through the rigamarole of rebooting the system - which always seems to be the fastest way to fix things when things go awry. I also plugged away moving tons of data around our internal network for eventual filesystem rebuilding, tending to the raw data pipeline, etc. - the stuff I've been talking about for a while.

I've been using an old "Solaris 8" software box (coupled with the shell of a long-defunct external SCSI hard drive enclosure) as a stand for my desktop monitor, unaware how over the years the box has been slowly morphing out of square and sinking towards the left, thus slanting the screen more and more. That might explain the crick in my neck I've had the past six months. This unergonomic situation was finally pointed out today by fellow SSL sysadmin Robert. Anywho, I now have the monitor sitting on my shuttle enclosure, and even though it's perfectly level it seems to be slanting to the right. Talk about accommodation - my brain really got used to the old lean.

- Matt

7 Jan 2009, 0:06:20 UTC
It's Tuesday, so that means database maintenance outage - the usual drill. We are recovering from that now. During the downtime I added more space to the workunit storage - actually reaching an unexpected 4 terabyte logical limit on that volume. This isn't a big deal, and we converted the two drives we can't use on this volume into extra spares, which are always welcome. I also rolled up my sleeves and drew up a brand new power map of the closet, which was until now sorely out of date. After we get Dan to measure the current draw directly at the breakers we can start safely adding machines to the closet.

Over the holiday break, at least since I last posted anything, there was only one real incident. Our scheduling server went kaput and required a reboot. Dan and Eric actually took care of that as I was happily making a chunk of change playing a New Year's Eve gig at the time. The surprise outage had the benefit of reducing demand on our resources so we could finally drain our back-end queues, and we recovered nicely once everything was back up and running.

Jeff found the bug in the validator today that's been causing some confusion when comparing cuda vs. non-cuda processed workunits. He's working on the fallout/cleanup from all that while we're still trying to figure out why some cuda clients are overflowing on certain workunits.

By the way, welcome to 2009. I'm only now just getting back into the lab (was out of town between new year's day and yesterday). I have hopes of progress regarding UC Berkeley's SETI project in general.

- Matt

30 Dec 2008, 23:16:29 UTC
Yep, we had our usual Tuesday outage. Nothing special, except that the result table is vastly bloated due to the back-end queues being clogged for one reason or another. So the "compression" part of our outage took an extra hour (roughly). So be it. Hopefully the wheels were greased enough to continue letting these drain without much intervention on my or Jeff's part. In any case expect a slightly painful recovery as we continue to catch up. We're also pulling up a bunch more unanalyzed raw data to keep the splitters happy during the long weekend. Other than that, today was a lot of planning and preparing for various bigger projects to tackle once the holidays are over and we're all back in the lab - adding yet more workunit storage, reconfiguring database/raw data storage, adding more stuff to the closet, upgrading OSes, retiring older machines, bringing newer ones on line already. That's all well and good, except that Eric, Jeff, and I have three separate higher-priority tasks to tackle before anything else if possible. Those are (a) wrapping up all radar blanking efforts (we still get too many result overflows due to missed and therefore unblanked radar), (b) noise shaping (the noise we're injecting to reduce the effect of the radar is causing predictable and removable but nevertheless messy analysis artifacts), and (c) the NTPCker (the real-time candidate finder/reporter - so we might have something positive to mention come our 10th year anniversary in May).

That's it - the last tech news update (from me at least) for 2008. I'm already looking forward to 2009. Maybe we'll get some or all of the above done.

- Matt

29 Dec 2008, 23:56:24 UTC
One short holiday week is behind us, now here comes another one.

We did fairly well over the weekend, considering we were pretty much maxed out the whole time. The assimilator queue finally drained, thanks to splitters starting to chew on raw data files physically located on the new raw data storage server (as opposed to located on the same server as the science database), but also thanks to the validator queue falling behind.

In times of low resources we do have some knobs to turn to help squeeze more juice out of our embattled servers. Sometimes you have to roll up your sleeves (or, in this case, pull out a calculator) and determine which processes need which resources, and which are claiming too much. After some investigation it was clear this time around we were giving httpd too much - and this is a tunable we have to adjust every so often, depending on how many people are connecting at any given time, and for how long - otherwise you have too many httpd listeners hanging out doing nothing, eating up valuable memory/cpu. Anyway, long story short, I reduced the number of validators from 6 to 4, moved the validator logs to a different filesystem (to reduce i/o contention), and vastly reduced the number of httpd listeners. So far so good - that queue is draining (and therefore the assimilator queue is inflating again).
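
For the curious, the calculator exercise boils down to something like the Python sketch below. This is only an illustration, not the script or the numbers actually used: the function name and every figure in it (total memory, memory reserved for other processes, memory per listener) are hypothetical and would have to be measured on the real server.

```python
def max_httpd_listeners(total_mem_mb, reserved_mb, per_listener_mb):
    """Return how many httpd listeners fit in the memory left over
    after everything else on the box has taken its share.

    All three inputs are hypothetical; measure real values first.
    """
    if per_listener_mb <= 0:
        raise ValueError("per_listener_mb must be positive")
    available = total_mem_mb - reserved_mb
    if available <= 0:
        # Nothing left over: other processes already claim all memory.
        return 0
    return int(available // per_listener_mb)


if __name__ == "__main__":
    # Made-up example: 8 GB machine, 6 GB reserved for validators and
    # friends, roughly 20 MB resident per httpd listener.
    print(max_httpd_listeners(8192, 6144, 20))
```

The result is an upper bound, not a target. The right setting also depends on how many connections are open at once and for how long, which is why the tunable needs revisiting every so often.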

We will have the usual outage drill tomorrow, followed by another set of "days off."

- Matt

23 Dec 2008, 23:00:32 UTC
Today we had our weekly outage for mysql database backup, maintenance, etc. This week we are recreating the replica database from scratch using the dump from the master. This is to ensure that the crash last week didn't leave any secret lingering corruption. That's all happening now as I type this and the project is revving back up to speed.

Had a conference call with our Overland Storage connections to clean up a couple of cosmetic issues with their new beta server. That's been working well and is already half full of raw data. Once the splitters start acting on those files the other raw data storage server will breathe a major sigh of relief. I was also set to (finally) bump up the workunit storage space yesterday using their new expansion unit - but waited until their procedure confirmation today lest I do anything silly and blow away millions of workunit files by accident. The good news is that I increased this storage by almost a terabyte today, with more to come. We have officially broken that dam.

I also noticed this morning that the high load on bruno (the upload server) may be partially due to an old, old cronjob that checks "last upload" time and alerts us accordingly. This process was mounting the upload directories over NFS and doing long directory listings, etc., which might have been slowing down that filesystem in general from time to time. I cleaned all that up - we'll see if it has any positive effect.
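
For illustration only, here is one lighter-weight way a "last upload" check could be written in Python. It is not the cleanup actually applied. It assumes the upload area is a root directory with one level of subdirectories, and that each upload creates a new file, so the containing directory's modification time gets bumped. The function names and the directory layout are assumptions made for this sketch.

```python
import os
import time


def last_upload_time(upload_root):
    """Newest mtime (epoch seconds) among upload_root and its immediate
    subdirectories. A directory's mtime changes when a file is created
    in it, so this costs one stat per directory, not one per file."""
    newest = os.stat(upload_root).st_mtime
    with os.scandir(upload_root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                mtime = entry.stat(follow_symlinks=False).st_mtime
                newest = max(newest, mtime)
    return newest


def uploads_stalled(upload_root, max_age_seconds, now=None):
    """True if nothing under upload_root changed within max_age_seconds."""
    now = time.time() if now is None else now
    return (now - last_upload_time(upload_root)) > max_age_seconds
```

Run from cron on the upload server itself rather than over NFS, a check like this avoids long listings of the individual files, which is the part that was suspected of slowing the filesystem down.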

Jeff's been hard at work on the NTPCker. It's actually chewing on the beta database now in test mode. We did find that an "order by" clause in the code was causing the informix database engine to lock out all other queries. This may have been the problem we've been experiencing at random over the past months. Maybe informix needs more scratch space to do these sorts, and it locks the database in some kind of internal management panic if it can't find enough. Something to add to the list of "things to address in the new year."

- Matt

©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.