[OpenAFS-devel] Re: What is a quorum vote based on?

John_Morin@transarc.com John_Morin@transarc.com
Mon, 22 Jan 2001 14:14:29 -0500 (EST)


Hello:
A while back, Jeff Blaine and I had a discussion about how UBIK
works. Here it is for those with an interest in it. The chronology of
the email begins at the bottom.
	- John Morin.
	  Morin@Transarc.com


Excerpts from mail.ubik: 1-Nov-100 Re: What is a quorum vote b.. =>
Blaine@linus.mitre. (5395)

> Excerpts from mail: 1-Nov-100 Re: What is a quorum vote b.. Jeff
> Blaine@linus.mitre. (6122*)

> > On what _criteria_ does a database server base its quorum election vote?

> > I voted for Susan Smith in middle school for class president _because she
> > was cute and smart_.

> > See what I am getting at?

> "Cute and smart" is equivalent to being the sync-site or having the
> lowest IP address (in that order). Anyone claiming to be the sync-site
> gets my vote. Otherwise, anyone having the lowest IP address (including
> myself) will get my vote. In both cases, I first have to be asked for a
> vote in order to cast the vote.

> And I never declare myself the sync-site unless I collect a quorum of votes.

> Once I cast a vote though, another rule comes into play: I can't change
> my vote for at least 45 seconds - even if someone cuter and smarter or
> claiming to be the sync-site asks for my vote.
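
The voting rule described above can be sketched as a toy model. This is only an illustration, not the code in src/ubik/: the Voter class, its field names, and the 10.0.0.x addresses are all hypothetical, and the sketch compares a candidate only against the voter's own address rather than tracking every server it has heard from.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Optional

VOTE_LOCK_SECONDS = 45  # commitment period quoted in the message


@dataclass
class Voter:
    """Toy model of one db server's voting state (hypothetical, not ubik code)."""
    my_addr: IPv4Address
    voted_for: Optional[IPv4Address] = None
    voted_at: float = float("-inf")  # "never voted"

    def handle_vote_request(self, candidate: IPv4Address,
                            claims_sync_site: bool, now: float) -> bool:
        """Return True for a "yes" vote, False for "no vote"."""
        locked = now - self.voted_at < VOTE_LOCK_SECONDS
        # While committed, only the server already voted for may get a "yes".
        if locked and candidate != self.voted_for:
            return False
        # A sync-site claim wins; otherwise the candidate must not have a
        # higher address than this server (which would vote for itself).
        if not claims_sync_site and candidate > self.my_addr:
            return False
        self.voted_for = candidate
        self.voted_at = now
        return True


# Server C's view of Ex 1, with made-up addresses and timestamps.
a, b, c = (IPv4Address(f"10.0.0.{n}") for n in (1, 2, 3))
server_c = Voter(my_addr=c)
first = server_c.handle_vote_request(b, False, now=0.0)    # B is lower than C
second = server_c.handle_vote_request(a, False, now=10.0)  # still committed to B
third = server_c.handle_vote_request(a, False, now=50.0)   # commitment expired
print(first, second, third)  # True False True
```

A repeated request from the server already voted for renews the commitment, which is how a sync site would keep its votes in this model.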

> Ex 1:
> 3 servers: A B C (A has lowest IP address, then B, then C):

> - C starts and begins asking for votes (no one is there to hear).
> - B starts and begins asking for votes
> - B receives a vote request from C. B finds itself to have the lower IP
> address and so answers "no vote". B plans to vote for himself, but that
> vote isn't made at this stage.
> - C receives a vote request from B - so C votes "yes" for B. C now stops
> sending out its own vote requests and is committed to B for the next 45
> seconds.
> - A starts and begins asking for votes.
> - B receives a vote request from A, sees A is lower and votes for A. B
> now stops sending out its own vote requests and is committed to A for
> the next 45 seconds.
> - B then receives the "yes vote" from C. But B is already committed to
> A, so B ignores the "yes" vote (or responds with an error ??).
> - C receives a vote request from A, but C can't vote for A because C has
> voted for B. 45 seconds later, C hasn't heard from B and so changes its
> vote to A (A is the only one currently sending out vote requests).

> Ex 2:
> - C starts and begins asking for votes (no one is there to hear).
> - B starts and begins asking for votes
> - B receives a vote request from C. B finds itself to have the lower IP
> address and so answers "no vote". B plans to vote for himself, but that
> vote isn't made at this stage.
> - C receives a vote request from B - so C votes "yes" for B. C now stops
> sending out its own vote requests and is committed to B for the next 45
> seconds.
> - B receives the "yes" vote from C and so also votes for himself, finds
> he has a quorum (2 out of 3) and makes himself the syncsite.
> - A starts and begins asking for votes.
> - C receives a vote request from A and says "no vote" - I'm voting for B.
> - B receives a vote request from A and replies "I'm the sync site."
> - A defers to B as the sync site and so casts its vote for B. A now
> stops sending out vote requests.

> > From step 3 below "If another server tells server A that it is voting for
> > someone else..."  Why would that server vote for someone else?

> The other server would vote for someone else because either the server
> it had already voted for never became the sync site (see example 1) or
> died. The vote it had cast expires and the server is then free to try
> collecting votes for itself or vote for someone else.

> > Why would it 
> > vote for server A?  Why would it vote for anything?  What's its decision
> > process for casting its vote?

> I hope I've given you the correct info.

> > In the end, my ultimate concern is that I have a lowest IP address db server
> > across a WAN link from the main meat of our cell.

> Ah Ha! This is a much more interesting issue. You may want to think
> about putting two database servers on the side of the WAN where the most
> important work is being done.

> Let me just rattle a few personal thoughts off the top of my head ;-)

> AFS begins to break down (a little:-) within WANs. This is why some
> large, global customers have gone to many cells (and deal with the
> issues of keeping data consistent across cells) instead of one large
> global cell. The reason is that the UBIK databases are synchronized
> on each write and a write takes longer to do when the db machines become
> more geographically distant or more numerous.

> For example: Someone has an office in the UK and wants to put some AFS
> clients there. They do that and find the performance is bad because it
> goes to the servers in PA for data and information. You expect this from
> cell-to-cell but not within the same cell. So you decide to add a db
> server machine in the UK. This will begin to slow the db server's write
> performance down affecting everyone's performance. So you then add
> vlserver preferences and then a fileserver on the UK side for fast
> access - with fileserver preferences and ROs in PA for quick/easy
> snapshot and backup capability. The solution is there but it becomes
> harder to administer. We can even start delving into the specifics of
> each database server (but eventually it becomes a case by case study).

> Some fixes went into AFS 3.6 (patch 2 - so it's not in open AFS) that
> make the UBIK servers faster by doing updates to non-sync sites in
> parallel. This helps a lot but does it help enough? It was initiated by
> one site where they make more changes to the VLDB in a day than we
> do in a year :-) And they have 6 db server machines across different
> subnets.
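
To make the sequential-versus-parallel point concrete, here is a deliberately crude model. It assumes a write costs one round trip per non-sync site and nothing else, which is my simplification and not a description of the actual UBIK protocol; the function name and the round-trip times are invented for illustration.

```python
from typing import Sequence


def write_latency_ms(rtts_ms: Sequence[float], parallel: bool) -> float:
    """Time for the sync site to push one write to every non-sync site.

    Assumes one round trip per site: sequential updates add up,
    parallel updates wait only for the slowest site.
    """
    if not rtts_ms:
        return 0.0
    return max(rtts_ms) if parallel else sum(rtts_ms)


# Hypothetical cell: 6 db servers, so 5 non-sync sites, one across a WAN.
rtts = [2.0, 2.0, 2.0, 2.0, 90.0]
sequential = write_latency_ms(rtts, parallel=False)
in_parallel = write_latency_ms(rtts, parallel=True)
print(sequential, in_parallel)  # 98.0 90.0
```

Under this model, parallel updates remove the cost of having many servers, but a single distant server still sets the floor for every write.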

> 	- John Morin.

Excerpts from mail.ubik: 1-Nov-100 Re: What is a quorum vote b.. Jeff
Blaine@linus.mitre. (6122*)

> My question seems to be getting buried and misinterpreted.

> While I certainly appreciate the detailed explanation of the quorum
> creation process, and it's valuable to me, that's not what I've been
> trying to find out.

> On what _criteria_ does a database server base its quorum election vote?

> I voted for Susan Smith in middle school for class president _because she
> was cute and smart_.

> See what I am getting at?

> From step 3 below "If another server tells server A that it is voting for
> someone else..."  Why would that server vote for someone else?  Why would
> it vote for server A?  Why would it vote for anything?  What's its decision
> process for casting its vote?

> In the end, my ultimate concern is that I have a lowest IP address db server
> across a WAN link from the main meat of our cell.

> LOCATION "HQ" (heavy usage)              LOCATION "REMOTE" (light usage)
> Network 2.2.2.x (example)                Network 1.1.1.x (example)

>      fs-and-db-one
>      fs-and-db-two
>      fs-and-db-three    <-- WAN LINK -->  fs-and-db-four
>      fs-four                              fs-six
>      fs-five                              fs-seven

> I obviously almost never want fs-and-db-four to become the sync site.
> How do I enforce that policy?  How can I control the election process
> some?  Renumbering our networks is not an option...  What can I do?

> --On Wednesday, November 01, 2000 12:48 PM -0500 John_Morin@transarc.com 
> wrote:
> [snip]
> > How it works (The simple description):
> >
> > (1) When db server A comes up, it starts sending out requests to other
> > db servers to have them vote for Server A. Server A is trying to build a
> > quorum of db servers for itself. The other servers either respond or
> > not. In the process, server A collects votes and remembers who the
> > lowest IP address is.
> >
> > (2) If server A receives a vote request from someone else who has a
> > lower IP address, server A will stop sending out its own vote requests
> > and vote for the lower IP server. Once server A votes for another
> > server, it can't change its vote until a time limit has passed. Once the
> > time limit has passed, it then tries to collect votes for itself again. The
> > time limit may expire for a number of reasons (other server went down or
> > the other server voted for someone else).
> >
> > (3) If another server tells server A that it is voting for someone else,
> > then server A can't count the other in the quorum he is trying to build.
> > But server A continues to try to build a quorum by sending out vote
> > requests.
> >
> > (4) If another server tells server A it is voting for him, server A
> > knows he has the vote for the next X seconds. Server A asks himself whether,
> > if he votes for himself, he has a quorum (over half the votes -or-
> > half the votes and server A has lowest IP address). If so, he claims
> > himself the syncsite. Server A continues to send out vote requests to
> > constantly renew the vote commitment (always well before the time lapse).
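
The quorum test in step (4) can be written down directly. This is a minimal sketch of the rule as worded in the message, not the check in src/ubik/; the function and parameter names are made up for illustration.

```python
def has_quorum(yes_votes: int, total_servers: int,
               candidate_is_lowest: bool) -> bool:
    """Quorum rule from step (4).

    yes_votes includes the candidate's own vote for itself.
    A quorum is over half the votes, or exactly half when the
    candidate has the lowest IP address.
    """
    if 2 * yes_votes > total_servers:
        return True
    return 2 * yes_votes == total_servers and candidate_is_lowest


# Ex 2 from the discussion: B has its own vote plus C's, out of 3 servers.
three_server_cell = has_quorum(2, 3, candidate_is_lowest=False)

# Two db servers with one down: only the lowest-IP survivor gets a quorum.
lowest_survives = has_quorum(1, 2, candidate_is_lowest=True)
other_survives = has_quorum(1, 2, candidate_is_lowest=False)
print(three_server_cell, lowest_survives, other_survives)  # True True False
```

The exactly-half case is the same tie-break that others in this thread describe as the lowest IP address carrying 1.5 votes.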
> >
> > Eventually, you get to a steady state where one server is the sync-site
> > and periodically sending out vote requests while all the other servers
> > vote for it. You can see how the vote process tends to focus on the
> > servers with the lower IP addresses.
> >
> > Once a quorum is established, the sync-site checks for the latest
> > revision of the database in its quorum and distributes that. As new
> > servers enter the quorum, their databases are also checked and sync'ed.
> >
> > If a database server never enters a quorum, it does not mean the
> > database is useless; it can still service read requests. For example, an
> > authentication server on the other side of a broken network partition
> > will still allow users to authenticate.
> >
> > In conclusion, having an even or odd number of dbservers isn't the
> > issue. The issue is whether you have more than 2 db servers. If you have
> > more than 2, then losing a single dbserver means a sync-site will be
> > created. With less than 2, ....
> >
> > The code is in src/ubik/. Look at ubeacon_Interact() for how UBIK tries
> > to create a quorum. Also, when a UBIK server comes up, it goes through a
> > number of states to get started. Look at the "urecovery_state" variable
> > in src/ubik/recovery.c, urecovery_Interact(), to see how a UBIK server
> > gets on its feet.
> >
> >	 - John Morin.
> >	   AFS Developer
> >
> >
> > Excerpts from transarc.external.info-afs: 25-Oct-00 Re: What is a quorum
> > vote b.. Rob Porter@clarkson.edu (1576*)
> >
> >> I believe that in a 2 (even number) DB server environment, the lowest IP
> >> address has 1.5 votes, where the others (NOTE: this also includes those
> >> in odd-number environments) have 1 vote.
> >
> >> So, losing the lowest IP addressed server in a 2 server environment
> >> would render the DB read-only.
> >
> >> On Wed, 25 Oct 2000, Paul Blackburn wrote:
> >
> >> > Jeff,
> >> >
> >> > I believe it is the lowest working database server IP address.
> >> >
> >> > Practical experience with upgrading/rebooting
> >> > database servers seems to verify this.
> >> >
> >> >
> >> >
> >> > Caution: I once tried shutting down one db server
> >> > in a cell with two db servers. The result: mayhem.
> >> > With only one of two db servers running
> >> > instead of a quorum there was a quandary.
> >> > No sync site could be elected.
> >> >
> >> > This leads me to believe that it is better to
> >> > configure an odd number of AFS database servers
> >> > to improve the availability of your cell.
> >> >
> >> > Apart from the single db server case, this means
> >> > (with 3, 5, 7 or more) you are likely to have
> >> > enough working db servers for a quorum in
> >> > the event of database server outage.
> >> > --
> >> > cheers
> >> > paul                                  http://acm.org/~mpb
> >> >
> >> >
> >> > "Computers can figure out all kinds of problems, except the
> >> >  things in the world that just don't add up."
> >> >         --James Magary
> >> >
> >> >
> >> >
> >> > Jeff Blaine wrote:
> >> > > I can't find the answer to this question anywhere I've looked.
> >> > > On what criteria does a database server base its quorum election
> >> > > vote?
> >> >
> >> >
> >
> >> --
> >> Robert Porter <rwp@clarkson.edu>
> >> Systems and Network Engineer
> >> Campus Information Services, Clarkson University
> >
> >
> >