Please somebody explain, the DB problem is getting ridiculous
Author | Message |
---|---|
Ozgur Gurgey Send message Joined: 1 Jan 02 Posts: 25 Credit: 898,747 RAC: 0 |
Suppose you have a DB on an old, slow system and you want to move it to a newer, faster system... to upgrade it, so to speak. I think the worst you could do is: 1) Take the old one offline 2) Dump the old DB to the new platform AND (here comes the part) 3) Take the new one offline and continue on the old system, in the hope that the old one can cope with the old jobs (which it surely couldn't) plus the new job of replication... I don't know of course, which "tests" are needed, but, if the new one isn't already working on replication, the conflicts between the master and slave DB's will be much greater tomorrow than this morning. That means extra work... I would say the only real test is getting the new DB online and forgetting about the replication and other stuff... that would only make the problem more complicated. After the new DB is online you could still make a replica in the background and no one would ever notice. If tests on the hardware are needed, then we/they could wait a couple of days more and get the job done. That said, I know little about MySQL replication; I have done some replication myself on production DBs on MS SQL and Oracle... Please enlighten me... |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
> I don't know of course, which "tests" are needed, but, if the new one isn't > already working on replication, the conflicts between the master and slave > DB's will be much greater tomorrow than this morning. That means extra > work... On another thread, Matt Lebofsky said that the current server is not working well, but it is working. Work is being distributed. Given the way most people keep second-guessing the decisions made by the administrators, and how many seem to view the slightest interruption as a major tragedy, I'm sure they're being very cautious. Remember that last week the new server quit right in the middle of copying the database. So, while their procedure may be slow, it's also safe. |
Ozgur Gurgey Send message Joined: 1 Jan 02 Posts: 25 Credit: 898,747 RAC: 0 |
> On another thread, Matt Lebofsky said that the current server is not working > well, but it is working. Work is being distributed. Hmmm... if you mean this thread http://setiweb.ssl.berkeley.edu/forum_thread.php?id=11422 ("Are we back?looks like it."), I can see no complaint about the new server. It's just the old DB that is causing the problems. And there is no sign of work being distributed across the old and new servers. > Given the way most people keep second-guessing the decisions made by the > administrators, and how many seem to view the slightest interruption as a > major tragedy, I'm sure they're being very cautious. > > Remember that last week the new server quit right in the middle of copying the > database. > I know, of course, that criticizing is easy when you don't have the responsibility. And by no means is the situation a catastrophe. But truly, my question is not about the duration of the process. It is simply about the method. > So, while their procedure may be slow, it's also safe. It is absolutely slow and not necessarily safe. |
MattDavis Send message Joined: 11 Nov 99 Posts: 919 Credit: 934,161 RAC: 0 |
Thank you for your complaint. I'm sure that will fix the current DB issues. ----- |
Paul D. Buck Send message Joined: 19 Jul 00 Posts: 3898 Credit: 1,158,042 RAC: 0 |
> > So, while their procedure may be slow, it's also safe. > It is absolutely slow and not necessarily safe. The difficulty is that the current database is the reference. The new database is a copy, yet it is not clear that this copy is a good copy yet. And, though it is a copy, the "old" database has moved on, so the copy (good or bad) is no longer up to date. Until the database is validated there is no way that you want to change streams mid-horse .... So, they need to test for data integrity and data cleanliness, then bring it up to date, and THEN switch masters. To give you an idea of how hard this is, there is no way to tell for sure which name is good for a mailing address. So, as far as the database knows, "Paul D. Buck" and "Paul D. Duck" are equally valid names for a particular address. Similar issues exist with any database ... it is not easy making sure that the data is valid ... |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
> I know, of course, that criticizing is easy when you don't have the > responsibility. And by no means is the situation a catastrophe. But truly, > my question is not about the duration of the process. It is simply about the > method. I understand that you are questioning the methodology. They have a database server that is running and handling the live database. They have a new database server that is unproven, and some reason to believe that it may or may not be fully stable. They can make the new server a "replica" and that means the old server serves up the records, and the new server works to stay current. They can check the replica to see that it matches and make sure that it's tracking okay. Then make the new server the master, and it's done. In the meantime, the current master is still working (or we wouldn't have work and we wouldn't be posting in the forums). |
Ozgur Gurgey Send message Joined: 1 Jan 02 Posts: 25 Credit: 898,747 RAC: 0 |
[Quote] The initial migration of the database to the new server is complete. Tomorrow we will make this new server a database replica. If all goes well it will become the database primary in a few days. [Unquote] So, from this, other than for the "tests", the new server sits idle. @Paul; "The new database is a copy, yet it is not clear that this copy is a good copy yet" if the old one was "good" enough, the dump of that database is as good as the original one. At least theoretically... On the other hand, if you start to validate the new one, you could also think of a validator of the validation @Ned; "They have a new database server that is unproven, and some reason to believe that it may or may not be fully stable." You may have a point there, but the stability problem won't go away if you don't start the replication process immediately after the dump completes, because the old one is changing all the time: WUs are generated, sent, received, outdated, well, maybe not validated ;-). The problem is that the new one doesn't know anything about what happened today. "...or we wouldn't have work and we wouldn't be posting in the forums)" The shortage of generated WUs will continue, with people requesting more than normal because they know there are only a few (some tens of) WUs. That shortage was the reason for moving to a faster (better) system in the first place. Now the old one, as the publisher, must also deal with the extra load of replication. My point is, if there were hardware issues for the new server, then why dump? if you have issues on dumping process, use backups, try to simulate the real situation, but never try to dump a system, test for one day and on the other day try to catch up with the real database... just my opinion... |
John McLeod VII Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
> @Paul; > "The new database is a copy, yet it is not clear that this copy is a good > copy yet" > if the old one was "good" enough, the dump of that database is as good as the > original one. At least theoretically... On the other hand, if you start to > validate the new one, you could also think of a validator of the validation > The copy is not necessarily as good as the original. Since the original was active at the time the dump was being made, there is the possibility that a record was changing while it was being copied. It is possible (not supposed to happen, but it is still possible) for that record to be garbaged or deleted in the new DB. Data copies are not supposed to fail and generate garbage, but sometimes they do. Any data copy that is critical should be verified. BOINC WIKI |
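The kind of verification described in this post can be illustrated with a small sketch. It is only an illustration under stated assumptions, not how the SETI@home staff check their database: the two "tables" are plain Python dictionaries keyed by a hypothetical primary key, the sample rows are invented, and `row_digest` and `compare_tables` are made-up helper names.

```python
import hashlib


def row_digest(row):
    """Return a stable SHA-256 digest of one row (a dict of column -> value)."""
    # Sort the columns so the digest does not depend on dict ordering.
    canonical = "|".join(f"{col}={row[col]!r}" for col in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_tables(original, copy):
    """Compare two tables keyed by primary key.

    Returns three sets of keys: rows missing from the copy, rows that
    exist only in the copy, and rows present in both but with different
    contents.
    """
    missing = set(original) - set(copy)
    extra = set(copy) - set(original)
    changed = {
        key
        for key in set(original) & set(copy)
        if row_digest(original[key]) != row_digest(copy[key])
    }
    return missing, extra, changed


# Hypothetical data: row 2 was modified while the dump ran, row 3 was lost.
original = {
    1: {"name": "user_a", "credit": 100},
    2: {"name": "user_b", "credit": 250},
    3: {"name": "user_c", "credit": 75},
}
copy = {
    1: {"name": "user_a", "credit": 100},
    2: {"name": "user_b", "credit": 0},
}

missing, extra, changed = compare_tables(original, copy)
print("missing:", missing, "extra:", extra, "changed:", changed)
```

A check like this only says that the copy differs from the original at the moment of comparison. On a live system the original keeps changing, so some differences may simply be updates that arrived after the dump.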
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
> My point is, if there were hardware issues for the new server, then why dump? > if you have issues on dumping process, use backups, try to simulate the real > situation, but never try to dump a system, test for one day and on the other > day try to catch up with the real database... ... and my point is, if there are issues, it's better to find them now vs. after the new server goes live. This is a way to test it under load and prove the server (and the new database) before the change is made. |
Dave Mickey Send message Joined: 19 Oct 99 Posts: 178 Credit: 11,122,965 RAC: 0 |
Just some worrisome musings I have: If they have a damaged DB causing boatloads of extraneous IO, how does moving that DB to new HW help, other than causing extraneous IO at a higher rate? (Maybe the conversion from one system to another will be akin to a "reorg" in the days of long ago, before disk defrag utilities, where it was done via dump-to-tape, reformat, load-tape-to-disk; I dunno anything about MySQL.) How big can the waiting-for-validation queue get? Is there any near-term point where it might explode? (Putting, at this point, something like 10 million credits (250K X 40 each) at risk.) Oh, well, I guess I'll just crunch on, and we'll see..... Dave |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
> Just some worrisome musings I have: > > If they have a damaged DB causing boatloads of > extraneous IO, how does moving that DB to new HW > help, other than causing extraneous IO at a > higher rate? The term "garbage collection" in this context doesn't mean the database is in any way corrupt (doesn't contain garbage). What it means is that the system is short on memory, and it's trying to go through and find stuff that is allocated, in RAM, needs to update disk, etc. but isn't really needed -- that can be given back to the system and reallocated for other uses. It's trying to clean all of that up and get all of the miscellaneous bits of allocated memory pulled together into something big enough to be useful. ... and with the amount of RAM, disk subsystem bandwidth, etc. available, the current database just can't quite do it -- or if it can, it just hasn't been able to finish. So, new hardware with more RAM and faster disks will be a huge help. With enough RAM, it might not need to do garbage collection at all. |
Paul D. Buck Send message Joined: 19 Jul 00 Posts: 3898 Credit: 1,158,042 RAC: 0 |
> @Paul; > "The new database is a copy, yet it is not clear that this copy is a good > copy yet" > if the old one was "good" enough, the dump of that database is as good as the > original one. At least theoretically... On the other hand, if you start to > validate the new one, you could also think of a validator of the validation One of the things I learned with computers is that theory rarely matches real life experience. Also, to expand on what Ned said, if the system is bandwidth limited, new allocations that require clean-up are made as fast as old ones are retired (so to speak) so it is a never ending battle. Kinda like trying to clean up a garage ... |
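The "never-ending battle" can be shown with a toy queue model. This is a sketch with invented numbers, not a measurement of the SETI@home servers: `backlog_over_time` is a hypothetical helper, and the arrival and capacity figures are placeholders chosen only to show the effect.

```python
def backlog_over_time(arrivals_per_tick, capacity_per_tick, ticks, start=0):
    """Simulate a work backlog.

    Each tick, `arrivals_per_tick` new items arrive and at most
    `capacity_per_tick` items are cleared. Returns the backlog after
    every tick.
    """
    backlog = start
    history = []
    for _ in range(ticks):
        backlog = max(0, backlog + arrivals_per_tick - capacity_per_tick)
        history.append(backlog)
    return history


# Old hardware: clean-up only just keeps pace, so the backlog never shrinks.
old_hw = backlog_over_time(arrivals_per_tick=100, capacity_per_tick=100,
                           ticks=5, start=500)

# Faster hardware: spare capacity lets the backlog drain.
new_hw = backlog_over_time(arrivals_per_tick=100, capacity_per_tick=200,
                           ticks=5, start=500)

print("old:", old_hw)
print("new:", new_hw)
```

When capacity only equals the arrival rate, the existing backlog stays where it is forever. Only capacity above the arrival rate lets it drain, which is the case for faster hardware.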
Ozgur Gurgey Send message Joined: 1 Jan 02 Posts: 25 Credit: 898,747 RAC: 0 |
Well, it has been almost 48 hours since the migration process started. So far: no news on the front page, nothing in the technical news... The number of units waiting for validation has grown to well over 200K. If the replication process has started, it will be a daunting task for the old system to resolve the conflicts in the state of over 100K WUs, new users, deleted WUs, etc... But if it hasn't started already, the job is getting harder by the second. Having said that, I continue to crunch as normal; this thread is not meant to bash the admins. We are all on the same side |
ChristianB Send message Joined: 11 Jul 01 Posts: 139 Credit: 90,213 RAC: 0 |
Just want to say that the best thing for migration would be: Disconnect the Project from DB. Means no one can report or download work and the forum is also off BUT this would decrease(to zero) the external load on the old db system. Now they have time to copy and verify the new db-replica. Now you would say oh we can't post in the forum, we can't dl new work, we can't upload our work, we aren't getting credit BUT fact is we currently aren't getting work, we aren't getting credits, we can't dl new work. So just give the guys at berkeley time to manage the db-replica. These are my 2 cent(eurocent) and sorry for bad english. BOINC Doc | Team-Site | BOINC-Podcast |
Hans Dorn Send message Joined: 3 Apr 99 Posts: 2262 Credit: 26,448,570 RAC: 0 |
> Just want to say that the best thing for migration would be: Disconnect the > Project from DB. Means no one can report or download work and the forum is > also off BUT this would decrease(to zero) the external load on the old db > system. > Now they have time to copy and verify the new db-replica. Now you would say oh > we can't post in the forum, we can't dl new work, we can't upload our work, we > aren't getting credit BUT fact is we currently aren't getting work, we aren't > getting credits, we can't dl new work. > > So just give the guys at berkeley time to manage the db-replica. > > These are my 2 cent(eurocent) and sorry for bad english. > Hmm.... I guess by replicating a live system, the new server gets a fair amount of load. This way it will show any quirks that are present with the new setup. These would be my 0.02 credits :o) Regards Hans |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
> Hmm.... > > I guess by replicating a live system, the new server gets a fair amount of > load. > This way it will show any quirks that are present with the new setup. If I understand how a "replica" works, every update is performed on both servers. So, the old server is carrying the load, and the new server is subjected to the same load. |
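A rough way to picture this: the primary records every change in an ordered log, and the replica replays that log to catch up. The sketch below is a simplified model under that assumption, not MySQL's actual implementation; the `Primary` and `Replica` classes, the `catch_up` method and the work-unit keys are all hypothetical.

```python
class Primary:
    """Serves live traffic and records every write in an ordered log."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (key, value) writes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Stays current by replaying the primary's log in order."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already replayed

    def catch_up(self, primary, max_events=None):
        """Replay pending log entries and return the remaining lag."""
        pending = primary.log[self.applied:]
        if max_events is not None:
            # A slow replica may only manage part of the backlog per pass.
            pending = pending[:max_events]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)
        return len(primary.log) - self.applied


primary = Primary()
replica = Replica()

primary.write("wu_1", "sent")
primary.write("wu_1", "returned")
primary.write("wu_2", "sent")

lag = replica.catch_up(primary, max_events=2)
print("lag after partial replay:", lag)

lag = replica.catch_up(primary)
print("lag after full replay:", lag, "in sync:", replica.data == primary.data)
```

In this model the replica sees the same sequence of writes as the primary, so it is exercised by the same update load without serving any clients. The cost on the primary side is keeping the log and shipping it.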
Ozgur Gurgey Send message Joined: 1 Jan 02 Posts: 25 Credit: 898,747 RAC: 0 |
> > > Hmm.... > > > > I guess by replicating a live system, the new server gets a fair amount > of > > load. > > This way it will show any quirks that are present with the new setup. > > If I understand how a "replica" works, every update is performed on both > servers. > > So, the old server is carrying the load, and the new server is subjected to > the same load. > The basic idea behind replication is that there is a "publisher" and a "subscriber". The publisher has the most current data; either the publisher pushes the fresh data or the subscriber pulls it. Newly inserted records, and old records that have changed since the last replication, create "conflicts". The replication process tries to resolve these conflicts, but this has cascading effects. For example, consider a WU that has been sent to a user but not yet returned at the time of replication. After the last replication the publisher receives a result and tries to validate it; if the conditions are met, users get credit. All these steps must be "replayed" by the subscriber... So you have a second, less fresh database, but you can serve two different DBs at two different locations. Of course, this is done with a performance penalty. My complaints about the procedure is about this performance overhead. From my point of view, none of the BOINC projects are strong candidates for replication, because of the high level of concurrency demand. OK, you could use it as a backup, but why should you...at least with the hardware at hand, it's a vicious circle. The servers get drowned, the performance suffers yet more... My whole point at
Hans Dorn Send message Joined: 3 Apr 99 Posts: 2262 Credit: 26,448,570 RAC: 0 |
> > Of course, this is done with a performance penalty. My complaints about the > procedure is about this performance overhead. From my point of view, none of > the BOINC projects are strong candidates for replication, because of the high > level of concurrency demand. OK, you could use it as a backup, but why should > you...at least with the hardware at hand, it's a vicious circle. The servers > get drowned, the performance suffers yet more... > I guess we'll have to sit through it. Switching too early and having the new server crash would create a real mess. BTW the amount of available WUs seems to have increased today. I could get all my boxes back to work. Regards Hans |
Ozgur Gurgey Send message Joined: 1 Jan 02 Posts: 25 Credit: 898,747 RAC: 0 |
Addendum to my last post: The process runs at the pace of the slowest machine taking part... |
Dave Mickey Send message Joined: 19 Oct 99 Posts: 178 Credit: 11,122,965 RAC: 0 |
>BTW the amount of available WUs seems to have increased today. >I could get all my boxes back to work. My machines have been getting just enough of a trickle to stay at about their minimum cache level - usually they get to the low water mark and, after some retries, get one unit (which satisfies the cache for a while, but doesn't fill it to 2X). But I just looked, and there's something like 1900 units ready to send! First time it's been > 100 in many days now. Now if we could just get some credits granted - I've had none in better than a week, I think. woohoo! Dave |