[OpenAFS-devel] Re: What is a quorum vote based on?
John_Morin@transarc.com
Mon, 22 Jan 2001 14:14:29 -0500 (EST)
Hello:
A while back, Jeff Blaine and I had a discussion about how UBIK
works. Here it is for those with an interest in it. The chronology of
the email begins at the bottom.
- John Morin.
Morin@Transarc.com
Excerpts from mail.ubik: 1-Nov-100 Re: What is a quorum vote b.. =>
Blaine@linus.mitre. (5395)
> Excerpts from mail: 1-Nov-100 Re: What is a quorum vote b.. Jeff
> Blaine@linus.mitre. (6122*)
> > On what _criteria_ does a database server base its quorum election vote?
> > I voted for Susan Smith in middle school for class president _because she
> > was cute and smart_.
> > See what I am getting at?
> "Cute and smart" is equivalent to being the sync-site or having the
> lowest IP address (in that order). Anyone claiming to be the sync-site
> gets my vote. Otherwise, anyone having the lowest IP address (including
> myself) will get my vote. In both cases, I first have to be asked for a
> vote in order to cast the vote.
> And I never declare myself the sync-site unless I collect a quorum of votes.
> Once I cast a vote though, another rule comes into play: I can't change
> my vote for at least 45 seconds - even if someone cuter and smarter or
> claiming to be the sync-site asks for my vote.
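The voting rule described above can be sketched roughly as follows. This is only an illustration, not the actual code in src/ubik/: the Voter class, its fields and handle_vote_request() are hypothetical names, and the sketch compares a candidate only against the voter's own address instead of tracking the lowest address heard so far.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Optional

VOTE_HOLD_SECONDS = 45  # the "at least 45 seconds" commitment described above


@dataclass
class Voter:
    """Hypothetical model of one db server deciding how to answer vote requests."""
    my_addr: IPv4Address
    voted_for: Optional[IPv4Address] = None
    voted_at: float = float("-inf")

    def handle_vote_request(self, candidate: IPv4Address,
                            claims_sync_site: bool, now: float) -> bool:
        """Return True for a "yes" vote, False for "no vote"."""
        committed = (self.voted_for is not None
                     and now - self.voted_at < VOTE_HOLD_SECONDS)
        # A cast vote cannot be moved until the hold time has passed,
        # even if the new candidate claims to be the sync-site.
        if committed and candidate != self.voted_for:
            return False
        # Assumption: without a sync-site claim, a candidate with a higher
        # address than mine gets "no vote" (I would rather vote for myself).
        if not claims_sync_site and candidate > self.my_addr:
            return False
        # Cast the vote, or renew it for the same candidate.
        self.voted_for = candidate
        self.voted_at = now
        return True
```

With three hypothetical addresses 10.0.0.1 (A), 10.0.0.2 (B) and 10.0.0.3 (C), a Voter for C that said "yes" to B refuses A until 45 seconds have passed and then accepts A.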
> Ex 1:
> 3 servers: A B C (A has lowest IP address, then B, then C):
> - C starts and begins asking for votes (no one is there to hear).
> - B starts and begins asking for votes
> - B receives a vote request from C. B finds itself to have the lower IP
> address and so answers "no vote". B plans to vote for himself, but that
> vote isn't made at this stage.
> - C receives a vote request from B - so C votes "yes" for B. C now stops
> sending out its own vote requests and is committed to B for the next 45
> seconds.
> - A starts and begins asking for votes.
> - B receives a vote request from A, sees A is lower and votes for A. B
> now stops sending out its own vote requests and is committed to A for
> the next 45 seconds.
> - B then receives the "yes vote" from C. But B is already committed to
> A, so B ignores the "yes" vote (or responds with an error ??).
> - C receives a vote request from A, but C can't vote for A because C has
> voted for B. 45 seconds later, C hasn't heard from B and so changes its
> vote to A (A is the only one currently sending out vote requests).
> Ex 2:
> - C starts and begins asking for votes (no one is there to hear).
> - B starts and begins asking for votes
> - B receives a vote request from C. B finds itself to have the lower IP
> address and so answers "no vote". B plans to vote for himself, but that
> vote isn't made at this stage.
> - C receives a vote request from B - so C votes "yes" for B. C now stops
> sending out its own vote requests and is committed to B for the next 45
> seconds.
> - B receives the "yes" vote from C and so also votes for himself, finds
> he has a quorum (2 out of 3) and makes himself the syncsite.
> - A starts and begins asking for votes.
> - C receives a vote request from A and says "no vote" - I'm voting for B.
> - B receives a vote request from A and replies "I'm the sync site."
> - A defers to B as the sync site and so casts its vote for B. A now
> stops sending out vote requests.
> > From step 3 below "If another server tells server A that it is voting for
> > someone else..." Why would that server vote for someone else?
> The other server would vote for someone else because either the server
> it had already voted for never became the sync site (see example 1) or
> died. The vote it had cast expires and the server is then free to try
> collecting votes for itself or vote for someone else.
> > Why would it
> > vote for server A? Why would it vote for anything? What's its decision
> > process for casting its vote?
> I hope I've given you the correct info.
> > In the end, my ultimate concern is that I have a lowest IP address db server
> > across a WAN link from the main meat of our cell.
> Ah Ha! This is a much more interesting issue. You may want to think
> about putting two database servers on the side of the WAN where the most
> important work is being done.
> Let me just rattle a few personal thoughts off the top of my head ;-)
> AFS begins to break down (a little:-) within WANs. This is why some
> large, global customers have gone to many cells (and deal with the
> issues of keeping data consistent across cells) instead of one large
> global cell. The reason is that the UBIK databases are synchronized
> on each write and a write takes longer to do when the db machines become
> more geographically distant or more numerous.
> For example: Someone has an office in the UK and wants to put some AFS
> clients there. They do that and find the performance is bad because it
> goes to the servers in PA for data and information. You expect this from
> cell-to-cell but not within the same cell. So you decide to add a db
> server machine in the UK. This will begin to slow the db server's write
> performance down affecting everyone's performance. So you then add
> vlserver preferences and then a fileserver on the UK side for fast
> access - with fileserver preferences and ROs in PA for quick/easy
> snapshot and backup capability. The solution is there but it becomes
> harder to administer. We can even start delving into the specifics of
> each database server (but eventually it becomes a case by case study).
> Some fixes went into AFS 3.6 (patch 2 - so it's not in open AFS) that
> make the UBIK servers faster by doing updates to non-sync sites in
> parallel. This helps a lot but does it help enough? It was initiated by
> one site where they make more changes to the VLDB in a day than we
> do in a year :-) And they have 6 db server machines across different
> subnets.
> - John Morin.
Excerpts from mail.ubik: 1-Nov-100 Re: What is a quorum vote b.. Jeff
Blaine@linus.mitre. (6122*)
> My question seems to be getting buried and misinterpreted.
> While I certainly appreciate the detailed explanation of the quorum
> creation process, and it's valuable to me, that's not what I've been
> trying to find out.
> On what _criteria_ does a database server base its quorum election vote?
> I voted for Susan Smith in middle school for class president _because she
> was cute and smart_.
> See what I am getting at?
> From step 3 below "If another server tells server A that it is voting for
> someone else..." Why would that server vote for someone else? Why would
> it vote for server A? Why would it vote for anything? What's its decision
> process for casting its vote?
> In the end, my ultimate concern is that I have a lowest IP address db server
> across a WAN link from the main meat of our cell.
> LOCATION "HQ" (heavy usage)         LOCATION "REMOTE" (light usage)
> Network 2.2.2.x (example)           Network 1.1.1.x (example)
> fs-and-db-one
> fs-and-db-two
> fs-and-db-three   <-- WAN LINK -->  fs-and-db-four
> fs-four                             fs-six
> fs-five                             fs-seven
> I obviously almost never want fs-and-db-four to become the sync site.
> How do I enforce that policy? How can I control the election process
> some? Renumbering our networks is not an option... What can I do?
> --On Wednesday, November 01, 2000 12:48 PM -0500 John_Morin@transarc.com
> wrote:
> [snip]
> > How it works (The simple description):
> >
> > (1) When db server A comes up, it starts sending out requests to other
> > db servers to have them vote for Server A. Server A is trying to build a
> > quorum of db servers for itself. The other servers either respond or
> > not. In the process, server A collects votes and remembers who the
> > lowest IP address is.
> >
> > (2) If server A receives a vote request from someone else who has a
> > lower IP address, server A will stop sending out its own vote requests
> > and vote for the lower IP server. Once server A votes for another
> > server, it can't change its vote until a time limit has passed. Once the
> > time limit has passed, it then tries to collect votes for itself again. The
> > time limit may expire for a number of reasons (other server went down or
> > the other server voted for someone else).
> >
> > (3) If another server tells server A that it is voting for someone else,
> > then server A can't count the other in the quorum he is trying to build.
> > But server A continues to try to build a quorum by sending out vote
> > requests.
> >
> > (4) If another server tells server A it is voting for him, server A
> > knows he has the vote for the next X seconds. Server A then asks itself
> > whether, if he votes for himself, he has a quorum (over half the votes -or-
> > half the votes and server A has the lowest IP address). If so, he declares
> > himself the syncsite. Server A continues to send out vote requests to
> > constantly renew the vote commitment (always well before the time lapse).
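The quorum test in step (4) can be written as a short check. This is a minimal sketch under assumptions, not the logic as coded in src/ubik/: has_quorum() and its parameters are hypothetical names, and every server is assumed to know the full list of db server addresses.

```python
from ipaddress import IPv4Address
from typing import Iterable


def has_quorum(me: IPv4Address,
               yes_voters: Iterable[IPv4Address],
               all_servers: Iterable[IPv4Address]) -> bool:
    """Would `me` have a quorum if it also voted for itself?"""
    servers = set(all_servers)
    # Count each "yes" voter once, plus my own vote.
    votes = len((set(yes_voters) & servers) | {me})
    total = len(servers)
    if 2 * votes > total:
        return True  # over half the votes
    # Exactly half the votes only counts if I have the lowest IP address.
    return 2 * votes == total and me == min(servers)
```

Addresses are compared as IPv4Address objects because plain string comparison would order "10.0.0.10" before "10.0.0.9". In a two-server cell this gives the lower-addressed server a quorum on its own vote alone, while the higher-addressed server needs both votes.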
> >
> > Eventually, you get to a steady state where one server is the sync-site
> > and periodically sending out vote requests while all the other servers
> > vote for it. You can see how the vote process tends to focus on the
> > servers with the lower IP addresses.
> >
> > Once a quorum is established, the sync-site checks for the latest
> > revision of the database in its quorum and distributes that. As new
> > servers enter the quorum, their databases are also checked and sync'ed.
> >
> > If a database server never enters a quorum, it does not mean the
> > database is useless; it can still service read requests. For example, an
> > authentication server on the other side of a broken network partition
> > will still allow users to authenticate.
> >
> > In conclusion, having an even or odd number of dbservers isn't the
> > issue. The issue is whether you have more than 2 db servers. If you have
> > more than 2, then losing a single dbserver means a sync-site will still be
> > created. With less than 2, ....
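To illustrate the "more than 2" point, the sketch below applies the quorum rule from step (4) to cells of different sizes and checks whether a sync-site can still be elected after any one db server is lost. It is only an illustration: survives_any_single_loss() is a hypothetical helper, servers are numbered 0..n-1 with 0 standing for the lowest IP address, and all surviving servers are assumed to end up voting for the same candidate.

```python
def survives_any_single_loss(n: int) -> bool:
    """True if a cell of n db servers keeps a quorum whichever one server fails."""
    for lost in range(n):
        up = [s for s in range(n) if s != lost]
        votes = len(up)           # best case: every surviving server agrees
        lowest_is_up = 0 in up    # server 0 stands for the lowest IP address
        ok = 2 * votes > n or (2 * votes == n and lowest_is_up)
        if not ok:
            return False
    return True


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, survives_any_single_loss(n))
```

Under these assumptions the check fails for cells of 1 or 2 servers and passes from 3 servers upward, whether the count is odd or even.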
> >
> > The code is in src/ubik/. Look at ubeacon_Interact() for how UBIK tries
> > to create a quorum. Also, when a UBIK server comes up, it goes through a
> > number of states to get started. Look at the "urecovery_state" variable
> > in src/ubik/recovery.c, urecovery_Interact(), to see how a UBIK server
> > gets on its feet.
> >
> > - John Morin.
> > AFS Developer
> >
> >
> > Excerpts from transarc.external.info-afs: 25-Oct-00 Re: What is a quorum
> > vote b.. Rob Porter@clarkson.edu (1576*)
> >
> >> I believe that in a 2 (even number) DB server environment, the lowest IP
> >> address has 1.5 votes, where the others (NOTE: this also includes those in
> >> odd-numbered environments) have 1 vote.
> >
> >> So, losing the lowest IP addressed server in a 2 server environment
> >> would render the DB read-only.
> >
> >> On Wed, 25 Oct 2000, Paul Blackburn wrote:
> >
> >> > Jeff,
> >> >
> >> > I believe it is the lowest working database server IP address.
> >> >
> >> > Practical experience with upgrading/rebooting
> >> > database servers seems to verify this.
> >> >
> >> >
> >> >
> >> > Caution: I once tried shutting down one db server
> >> > in a cell with two db servers. The result: mayhem.
> >> > With only one of two db servers running
> >> > instead of a quorum there was a quandary.
> >> > No sync site could be elected.
> >> >
> >> > This leads me to believe that it is better to
> >> > configure an odd number of AFS database servers
> >> > to improve the availability of your cell.
> >> >
> >> > Apart from the single db server case, this means
> >> > (with 3, 5, 7 or more) you are likely to have
> >> > enough working db servers for a quorum in
> >> > the event of database server outage.
> >> > --
> >> > cheers
> >> > paul http://acm.org/~mpb
> >> >
> >> >
> >> > "Computers can figure out all kinds of problems, except the
> >> > things in the world that just don't add up."
> >> > --James Magary
> >> >
> >> >
> >> >
> >> > Jeff Blaine wrote:
> >> > > I can't find the answer to this question anywhere I've looked.
> >> > > On what criteria does a database server base its quorum election
> >> > > vote?
> >> >
> >> >
> >
> >> --
> >> Robert Porter <rwp@clarkson.edu>
> >> Systems and Network Engineer
> >> Campus Information Services, Clarkson University
> >
> >
> >