Please somebody explain, the DB problem is getting ridiculous

Message boards : Number crunching : Please somebody explain, the DB problem is getting ridiculous



Ozgur Gurgey
Volunteer tester

Joined: 1 Jan 02
Posts: 25
Credit: 898,747
RAC: 0
Turkey
Message 77556 - Posted: 8 Feb 2005, 16:40:00 UTC

Suppose you have a DB on an old, slow system, and you want to move it to a newer, faster system; in other words, upgrade it...

I think the worst thing you could do is:

1) Take the old one offline
2) Dump the old DB to the new platform

AND (here comes the part)

3) Take the new one offline and continue on the old system in the hope that the old one can cope with the old jobs (which it surely couldn't) plus the new job of replication...

I don't know, of course, which "tests" are needed, but if the new one isn't already working on replication, the conflicts between the master and slave DBs will be much greater tomorrow than they were this morning. That means extra work...

I would say the only real test is getting the new DB online; forget about replication and the other stuff, which would only make the problem more complicated. After the new DB is online, you could still make a replica in the background and no one would ever notice. If hardware tests are needed, then we/they could wait a couple of days more and get the job done.

That said, I know little about MySQL replication, though I have set up replication myself on production databases running MS SQL and Oracle...

Please enlighten me...
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 77559 - Posted: 8 Feb 2005, 17:02:03 UTC - in response to Message 77556.  

> I don't know of course, which "tests" are needed, but, if the new one isn't
> already working on replication, the conflicts between the master and slave
> DB's will be much greater tomorrow than this morning. That means extra
> work...

On another thread, Matt Lebofsky said that the current server is not working well, but it is working. Work is being distributed.

Given the way most people keep second-guessing the decisions made by the administrators, and how many seem to view the slightest interruption as a major tragedy, I'm sure they're being very cautious.

Remember that last week the new server quit right in the middle of copying the database.

So, while their procedure may be slow, it's also safe.
Ozgur Gurgey
Volunteer tester

Joined: 1 Jan 02
Posts: 25
Credit: 898,747
RAC: 0
Turkey
Message 77562 - Posted: 8 Feb 2005, 17:27:39 UTC - in response to Message 77559.  

> On another thread, Matt Lebofsky said that the current server is not working
> well, but it is working. Work is being distributed.

Hmmm... if you mean this thread, http://setiweb.ssl.berkeley.edu/forum_thread.php?id=11422 ("Are we back?looks like it."), I see no complaint there about the new server. It's just the old DB that is causing the problems. And I see no sign of the work being split between the old and new servers.

> Given the way most people keep second-guessing the decisions made by the
> administrators, and how many seem to view the slightest interruption as a
> major tragedy, I'm sure they're being very cautious.
>
> Remember that last week the new server quit right in the middle of copying the
> database.
>
I know, of course, that criticizing is easy when you don't have the responsibility. And by no means is the situation a catastrophe. But truly, my question is not about the duration of the process; it is simply about the methodology.

> So, while their procedure may be slow, it's also safe.
It is absolutely slow and not necessarily safe.
MattDavis
Volunteer tester

Joined: 11 Nov 99
Posts: 919
Credit: 934,161
RAC: 0
United States
Message 77598 - Posted: 8 Feb 2005, 18:44:40 UTC

Thank you for your complaint. I'm sure that will fix the current DB issues.
-----
Paul D. Buck
Volunteer tester

Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 77625 - Posted: 8 Feb 2005, 19:50:38 UTC - in response to Message 77562.  

> > So, while their procedure may be slow, it's also safe.
> It is absolutely slow and not necessarily safe.

The difficulty is that the current database is the reference. The new database is a copy, yet it is not clear that this copy is a good copy yet. And, though it is a copy, the "old" database has moved on, so the copy (good or bad) is no longer up to date.

Until the database is validated there is no way that you want to change streams mid-horse ....

So, they need to test for data integrity and data cleanliness, then bring it up to date, and THEN switch masters.

To give you an idea of how hard this is, there is no way to tell for sure which name is good for a mailing address. So, as far as the database knows, "Paul D. Buck" and "Paul D. Duck" are equally valid names for a particular address.

Similar issues exist with any database ... it is not easy making sure that the data is valid ...
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 77666 - Posted: 8 Feb 2005, 23:38:04 UTC - in response to Message 77562.  


> I know, of course, that criticizing is easy when you don't have the
> responsibility. And by no means is the situation a catastrophe. But truly,
> my questioning is not about the duration of the process. It is simply the
> methodics.

I understand that you are questioning the methodology.

They have a database server that is running and handling the live database.

They have a new database server that is unproven, and some reason to believe that it may or may not be fully stable.

They can make the new server a "replica" and that means the old server serves up the records, and the new server works to stay current. They can check the replica to see that it matches and make sure that it's tracking okay.

Then make the new server the master, and it's done.

In the meantime, the current master is still working (or we wouldn't have work and we wouldn't be posting in the forums).
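Ned's "make it a replica, check that it matches and is tracking, then promote it" plan boils down to a promotion gate. A minimal sketch, assuming some monitoring step periodically records the primary's and replica's change positions plus the result of a data comparison; `ReplicaStatus`, `safe_to_promote` and the three-check threshold are hypothetical names and numbers, not anything the admins described.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    """One health check of the replica. Hypothetical fields, for illustration only."""
    primary_position: int  # last change written on the primary
    replica_position: int  # last change applied on the replica
    checksums_match: bool  # result of a separate data comparison
    errors: int            # replication errors seen since the previous check


def is_clean(status: ReplicaStatus) -> bool:
    """A check is clean when the replica has no lag, matching data and no errors."""
    return (status.replica_position == status.primary_position
            and status.checksums_match
            and status.errors == 0)


def safe_to_promote(history: list, required_clean_checks: int = 3) -> bool:
    """Promote only after the last N consecutive checks were all clean."""
    recent = history[-required_clean_checks:]
    return len(recent) == required_clean_checks and all(is_clean(s) for s in recent)


checks = [ReplicaStatus(100, 90, True, 0),   # still 10 changes behind
          ReplicaStatus(150, 150, True, 0),
          ReplicaStatus(200, 200, True, 0)]
print(safe_to_promote(checks))  # False: the oldest of the last three checks shows lag
checks.append(ReplicaStatus(260, 260, True, 0))
print(safe_to_promote(checks))  # True: three clean checks in a row
```

Requiring several consecutive clean checks, rather than one, is what "make sure that it's tracking okay" amounts to: a single snapshot with zero lag says little about whether the replica keeps up under load.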
Ozgur Gurgey
Volunteer tester

Joined: 1 Jan 02
Posts: 25
Credit: 898,747
RAC: 0
Turkey
Message 77673 - Posted: 9 Feb 2005, 0:29:51 UTC

[Quote]
The initial migration of the database to the new server is complete. Tomorrow we will make this new server a database replica. If all goes well it will become the database primary in a few days.
[Unquote]

So, judging from this, apart from the "tests", the new server sits idle.

@Paul;
"The new database is a copy, yet it is not clear that this copy is a good copy yet"
If the old one was "good" enough, the dump of that database is as good as the original one, at least theoretically... On the other hand, if you start validating the new one, you might as well think of a validator for the validation.

@Ned;
"They have a new database server that is unproven, and some reason to believe that it may or may not be fully stable."
You may have a point there, but the stability problem won't go away if you don't start replication immediately after the dump completes, because the old one is changing all the time: WUs are generated, sent, received, expired, and, well, maybe not validated ;-). The problem is that the new one doesn't know anything about what happened today.

"...or we wouldn't have work and we wouldn't be posting in the forums)"
The shortage of generated WUs will continue, with people requesting more than normal because they know only a few (some tens) of WUs are available. That was the reason for moving to a faster (better) system in the first place. Now, as the publisher, the old one must also deal with the extra load of replication.

My point is: if there were hardware issues with the new server, then why dump?
If you have issues with the dump process, use backups and try to simulate the real situation, but never dump a system, test for a day, and then try to catch up with the real database the next day...

Just my opinion...
John McLeod VII
Volunteer developer
Volunteer tester

Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 77710 - Posted: 9 Feb 2005, 2:29:17 UTC - in response to Message 77673.  

> @Paul;
> "The new database is a copy, yet it is not clear that this copy is a good
> copy yet"
> if the old one was "good" enough, the dump of that database is as good as the
> original one. At least theoretically... On the other hand, if you start to
> validate the new one, you could also think of a validator of the validation
>
The copy is not necessarily as good as the original. Since the original was active while the dump was being made, a record may have been changing at the moment it was copied. It is possible (not supposed to happen, but still possible) for that record to be garbled or deleted in the new DB. Data copies are not supposed to fail and generate garbage, but sometimes they do. Any critical data copy should be verified.
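John's advice to verify critical copies can be illustrated with a row-by-row checksum comparison. A minimal sketch, assuming both databases can be read into Python dicts keyed by primary key; `row_digest` and `diff_tables` are hypothetical helpers, not MySQL tools, and the sample rows are made up.

```python
import hashlib


def row_digest(row: tuple) -> str:
    """SHA-256 digest of a row's values (repr is stable for simple tuples of str/int)."""
    return hashlib.sha256(repr(row).encode("utf-8")).hexdigest()


def diff_tables(source: dict, target: dict) -> dict:
    """Compare two tables given as {primary_key: row_tuple} and report discrepancies."""
    missing = sorted(k for k in source if k not in target)  # never made it into the copy
    extra = sorted(k for k in target if k not in source)    # in the copy but not the source
    changed = sorted(k for k in source.keys() & target.keys()
                     if row_digest(source[k]) != row_digest(target[k]))
    return {"missing": missing, "extra": extra, "changed": changed}


# Row 2 was updated on the live server after it had been dumped;
# row 3 was created after the dump finished.
source = {1: ("wu1", "sent"), 2: ("wu2", "returned"), 3: ("wu3", "sent")}
target = {1: ("wu1", "sent"), 2: ("wu2", "sent")}
print(diff_tables(source, target))  # {'missing': [3], 'extra': [], 'changed': [2]}
```

Comparing digests rather than whole rows lets each server hash its own data so only the hashes need to be exchanged. Note that against a live source this check cannot tell a copy error from a row that simply changed after the dump, which is why such comparisons are normally made against a quiesced source or a known replication position.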



BOINC WIKI
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 77714 - Posted: 9 Feb 2005, 2:41:14 UTC - in response to Message 77673.  


> My point is, if there were hardware issues for the new server, then why dump?
> if you have issues on dumping process, use backups, try to simulate the real
> situation, but never try to dump a system, test for one day and on the other
> day try to catch up with the real database...

... and my point is, if there are issues, it's better to find them now vs. after the new server goes live.

This is a way to test it under load and prove the server (and the new database) before the change is made.
Dave Mickey

Joined: 19 Oct 99
Posts: 178
Credit: 11,122,965
RAC: 0
United States
Message 77718 - Posted: 9 Feb 2005, 2:46:13 UTC

Just some worrisome musings I have:

If they have a damaged DB causing boatloads of
extraneous IO, how does moving that DB to new HW
help, other than causing extraneous IO at a
higher rate?
(Maybe the conversion from one system to another will
be akin to a "reorg" from the days before disk defrag
utilities, when it was done via dump-to-tape, reformat,
load-tape-to-disk; I don't know anything about MySQL.)

How big can the waiting-for-validation queue get? Is
there any near-term point where it might explode?
(At this point, something like 10 million credits
(250K × 40 each) are at risk.)

Oh, well, I guess I'll just crunch on, and we'll see.....

Dave
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 77720 - Posted: 9 Feb 2005, 2:52:05 UTC - in response to Message 77718.  

> Just some worrisome musings I have:
>
> If they have a damaged DB causing boatloads of
> extraneous IO, how does moving that DB to new HW
> help, other than causing extraneous IO at a
> higher rate?

The term "garbage collection" in this context doesn't mean the database is in any way corrupt (doesn't contain garbage).

What it means is that the system is short on memory, and it's trying to go through and find things that are allocated in RAM (and may need to be written back to disk, etc.) but aren't really needed, so they can be given back to the system and reallocated for other uses. It's trying to clean all of that up and pull the miscellaneous bits of allocated memory together into something big enough to be useful.

... and with the amount of RAM, disk subsystem bandwidth, etc. available, the current database just can't quite do it -- or if it can, it just hasn't been able to finish.

So, new hardware with more RAM and faster disks will be a huge help. With enough RAM, it might not need to do garbage collection at all.
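The "pull the bits together into something big enough to be useful" part of Ned's explanation is the classic fragmentation-and-compaction idea. A toy sketch, assuming memory is a list of cells where `None` means free; this is a generic illustration, not how MySQL actually manages its memory, and whether that is what the old server was doing is Ned's reading of the situation.

```python
def free_cells(memory: list) -> int:
    """Total number of free (None) cells."""
    return sum(cell is None for cell in memory)


def largest_free_run(memory: list) -> int:
    """Length of the longest contiguous run of free cells."""
    best = run = 0
    for cell in memory:
        run = run + 1 if cell is None else 0
        best = max(best, run)
    return best


def compact(memory: list) -> list:
    """Slide all allocated cells to the front so the free space forms one block."""
    used = [cell for cell in memory if cell is not None]
    return used + [None] * (len(memory) - len(used))


memory = ["a", None, "b", None, "c", None, None, "d"]
print(free_cells(memory), largest_free_run(memory))  # 4 2: enough space in total, but scattered
packed = compact(memory)
print(largest_free_run(packed))  # 4: a 4-cell request now fits
```

More RAM helps in the same way Ned describes: with enough headroom, a large request fits without any compaction pass at all.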
Paul D. Buck
Volunteer tester

Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 77792 - Posted: 9 Feb 2005, 11:09:42 UTC - in response to Message 77673.  

> @Paul;
> "The new database is a copy, yet it is not clear that this copy is a good
> copy yet"
> if the old one was "good" enough, the dump of that database is as good as the
> original one. At least theoretically... On the other hand, if you start to
> validate the new one, you could also think of a validator of the validation

One of the things I learned with computers is that theory rarely matches real life experience.

Also, to expand on what Ned said, if the system is bandwidth limited, new allocations that require clean-up are made as fast as old ones are retired (so to speak) so it is a never ending battle. Kinda like trying to clean up a garage ...
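Paul's "never-ending battle" is a rate argument: if items needing clean-up pile up at least as fast as they can be cleaned, the backlog never shrinks. A minimal simulation with made-up per-tick numbers:

```python
def backlog_history(ticks: int, initial: int, arrivals: int, capacity: int) -> list:
    """Backlog after each tick: `arrivals` new items need clean-up every tick,
    and at most `capacity` items can be cleaned per tick."""
    backlog, history = initial, []
    for _ in range(ticks):
        backlog += arrivals                # new allocations that will need cleaning up
        backlog -= min(backlog, capacity)  # clean up as much as bandwidth allows
        history.append(backlog)
    return history


print(backlog_history(5, initial=100, arrivals=50, capacity=50))  # [100, 100, 100, 100, 100]
print(backlog_history(5, initial=100, arrivals=50, capacity=80))  # [70, 40, 10, 0, 0]
```

Only the second case, where clean-up capacity exceeds the arrival rate, ever drains; that is the argument for faster disks and more RAM.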
Ozgur Gurgey
Volunteer tester

Joined: 1 Jan 02
Posts: 25
Credit: 898,747
RAC: 0
Turkey
Message 77824 - Posted: 9 Feb 2005, 15:34:03 UTC

Well, it has been almost 48 hours since the migration process started.

So far:

No news on the front page, nothing in the technical news...

The number of units waiting for validation has grown to well over 200K.

If the replication process has started, resolving the conflicts will be a daunting task for the old system: the state of over 100K WUs, new users, deleted WUs, etc. But if it hasn't started yet, the job is getting harder by the second.
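The "harder by the second" point can be made concrete with a simple catch-up estimate. This is a sketch under assumed, hypothetical rates, not measured SETI@home figures: suppose changes arrive on the live (old) server at a steady rate λ, replication starts after a delay t_w, and the replica can apply changes at rate μ. The backlog B(t) shrinks only while t ≤ T, and only if μ > λ.

```latex
B_0 = \lambda\, t_w, \qquad
B(t) = B_0 - (\mu - \lambda)\, t, \qquad
T_{\text{catch-up}} = \frac{B_0}{\mu - \lambda} = \frac{\lambda\, t_w}{\mu - \lambda} \quad (\mu > \lambda)
```

For example, λ = 10 changes/s, μ = 30 changes/s and t_w = 24 h = 86,400 s give B_0 = 864,000 changes and T = 864,000 / 20 = 43,200 s, i.e. 12 hours. Every extra hour of delay adds to B_0, and if μ ≤ λ the replica never catches up at all.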

Having said that, I continue to crunch as normal; this thread is not meant to bash the admins. We are all on the same side.
ChristianB

Joined: 11 Jul 01
Posts: 139
Credit: 90,213
RAC: 0
Germany
Message 77853 - Posted: 9 Feb 2005, 18:54:43 UTC

Just want to say that the best approach to the migration would be to disconnect the project from the DB. That means no one can report or download work and the forum is off too, BUT it would reduce the external load on the old DB system to zero.
Then they would have time to copy and verify the new DB replica. Now you might say, "Oh, we can't post in the forum, we can't download new work, we can't upload our work, we aren't getting credit," BUT the fact is that right now we aren't getting work, we aren't getting credit, and we can't download new work.

So just give the guys at Berkeley time to manage the DB replica.

These are my 2 cents (euro cents), and sorry for my bad English.

BOINC Doc | Team-Site | BOINC-Podcast
Hans Dorn
Volunteer developer
Volunteer tester

Joined: 3 Apr 99
Posts: 2262
Credit: 26,448,570
RAC: 0
Germany
Message 77865 - Posted: 9 Feb 2005, 19:55:55 UTC - in response to Message 77853.  

> Just want to say that the best thing for migration would be: Disconnect the
> Project from DB. Means no one can report or download work and the forum is
> also off BUT this would decrease(to zero) the external load on the old db
> system.
> Now they have time to copy and verify the new db-replica. Now you would say oh
> we can't post in the forum, we can't dl new work, we can't upload our work, we
> aren't getting credit BUT fact is we currently aren't getting work, we aren't
> getting credits, we can't dl new work.
>
> So just give the guys at berkeley time to manage the db-replica.
>
> These are my 2 cent(eurocent) and sorry for bad english.
>

Hmm....

I guess by replicating a live system, the new server gets a fair amount of load.
This way it will show any quirks that are present with the new setup.

These would be my 0.02 credits :o)

Regards Hans

1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 77876 - Posted: 9 Feb 2005, 20:53:59 UTC - in response to Message 77865.  


> Hmm....
>
> I guess by replicating a live system, the new server gets a fair amount of
> load.
> This way it will show any quirks that are present with the new setup.

If I understand how a "replica" works, every update is performed on both servers.

So, the old server is carrying the load, and the new server is subjected to the same load.
Ozgur Gurgey
Volunteer tester

Joined: 1 Jan 02
Posts: 25
Credit: 898,747
RAC: 0
Turkey
Message 77894 - Posted: 9 Feb 2005, 22:30:59 UTC - in response to Message 77876.  

>
> > Hmm....
> >
> > I guess by replicating a live system, the new server gets a fair amount of
> > load.
> > This way it will show any quirks that are present with the new setup.
>
> If I understand how a "replica" works, every update is performed on both
> servers.
>
> So, the old server is carrying the load, and the new server is subjected to
> the same load.
>
The basic idea behind replication is that there is a "publisher" and a "subscriber". The publisher has the most current data; either the publisher pushes or the subscriber pulls the fresh data. Newly inserted records, and existing records that have changed since the last replication, create "conflicts". The replication process tries to resolve these conflicts, but this has cascading effects. For example, consider a WU that has been sent to a user but not yet returned at the time of replication. After that replication, the publisher receives a result and tries to validate it; if the conditions are met, users get credit. All of these steps must be "replayed" by the subscriber... So you end up with a second, less fresh database, but you can serve two different DBs at two different locations.
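The catch-up Ozgur describes can be sketched as a subscriber that pulls every change made after its last sync point and replays it in order. A minimal sketch, assuming a single writer (only the publisher changes data), so it sidesteps the conflict resolution he mentions; the class names, keys and the 40-credit value are hypothetical.

```python
class Publisher:
    """Hypothetical publisher: applies each change and logs it with an increasing sequence number."""

    def __init__(self):
        self.rows = {}
        self.changes = []  # list of (seq, key, value), oldest first

    def change(self, key, value):
        self.rows[key] = value
        self.changes.append((len(self.changes) + 1, key, value))

    def changes_since(self, seq: int) -> list:
        return [c for c in self.changes if c[0] > seq]


class Subscriber:
    """Hypothetical subscriber: pulls and replays, in order, every change after its last sync."""

    def __init__(self):
        self.rows = {}
        self.last_seq = 0

    def sync(self, publisher: Publisher) -> int:
        pulled = publisher.changes_since(self.last_seq)
        for seq, key, value in pulled:
            self.rows[key] = value
            self.last_seq = seq
        return len(pulled)


pub, sub = Publisher(), Subscriber()
pub.change("wu42", "sent")
sub.sync(pub)                          # subscriber now knows the WU was sent
pub.change("wu42", "result received")  # ...then the cascade happens on the publisher
pub.change("wu42", "validated")
pub.change("credit:user7", 40)
print(sub.rows)                        # {'wu42': 'sent'}: stale until the next sync
print(sub.sync(pub), sub.rows == pub.rows)  # 3 True
```

Every step of the cascade is applied a second time on the subscriber, which is the extra work being debated in this thread; where both sides accept writes, a real system also needs rules for which conflicting change wins.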

Of course, this comes with a performance penalty. My complaint about the procedure concerns this performance overhead. From my point of view, none of the BOINC projects are strong candidates for replication because of their high concurrency demands. OK, you could use it as a backup, but why would you? At least with the hardware at hand, it's a vicious circle: the servers get swamped, and performance suffers even more...


My whole point at
Hans Dorn
Volunteer developer
Volunteer tester

Joined: 3 Apr 99
Posts: 2262
Credit: 26,448,570
RAC: 0
Germany
Message 77896 - Posted: 9 Feb 2005, 22:37:46 UTC - in response to Message 77894.  

>
> Of course, this is done with a performance penalty. My complaints about the
> procedure is about this performance overhead. From my point of view, none of
> the BOINC projects are strong candidates for replication, because of the high
> level of concurrency demand. OK, you could use it as a backup, but why should
> you...at least with the hardware at hand, it's a vicious circle. The servers
> get drowned, the performance suffers yet more...
>

I guess we'll have to sit through it. Switching too early and having the new server crash
would create a real mess.

BTW the amount of available WUs seems to have increased today.
I could get all my boxes back to work.

Regards Hans

Ozgur Gurgey
Volunteer tester

Joined: 1 Jan 02
Posts: 25
Credit: 898,747
RAC: 0
Turkey
Message 77898 - Posted: 9 Feb 2005, 22:57:34 UTC

Addendum to my last post:

The process runs at the pace of the slowest machine taking part...
Dave Mickey

Joined: 19 Oct 99
Posts: 178
Credit: 11,122,965
RAC: 0
United States
Message 77924 - Posted: 10 Feb 2005, 0:33:40 UTC

>BTW the amount of available WUs seems to have increased today.
>I could get all my boxes back to work.

My machines have been getting just enough of a trickle to stay
at about their minimum cache level - usually they get to the
low water mark, and after some retries, get one unit (which
satisfies the cache for a while but doesn't fill it to 2X).
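Dave's low-water-mark behaviour resembles a simple hysteresis refill rule. A toy sketch, counting work in units and treating "2X" as a high-water mark twice the low one; both are assumptions for illustration, and this is not the BOINC client's actual work-fetch logic.

```python
def units_to_request(cached: int, low_water: int, high_water: int, available: int) -> int:
    """Ask for nothing while the cache is at or above low_water; below it, try to
    refill up to high_water, limited by how many units the server has ready."""
    if cached >= low_water:
        return 0
    return min(high_water - cached, available)


LOW, HIGH = 4, 8  # hypothetical: high-water mark is 2X the low-water mark
print(units_to_request(3, LOW, HIGH, available=1))     # 1: a trickle, cache stays near the minimum
print(units_to_request(3, LOW, HIGH, available=1900))  # 5: plenty ready, cache refills to 8
print(units_to_request(6, LOW, HIGH, available=1900))  # 0: above the low-water mark, no request
```

With only a unit or two ready to send, each request tops the cache up just past the low-water mark, which matches the "just enough of a trickle" pattern described above.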

But I just looked, and there's something like 1900 units ready
to send! First time it's been > 100 in many days. Now if we
could just get some credits granted; I've had none in more than
a week, I think.

woohoo!

Dave



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.