[OpenAFS] Re: DB servers "quorum" and OpenAFS tools

Thu, 23 Jan 2014 14:57:49 -0500

On Thu, 2014-01-23 at 14:58 +0000, Peter Grandi wrote:

> My real issue was 'server/CellServDB' because we could not
> prepare ahead of time all 3 new servers, but only one at a time.
>
> The issue is that with 'server/CellServDB' update there is
> potentially a DB daemon (PT, VL) restart (even if the rekeying
> instructions hint that when the mtime of 'server/CellServDB'
> changes the DB daemons reread it) and in any case a sync site
> election.
>
> Because each election causes a "blip" with the client I would
> rather change the 'server/CellServDB' by putting in extra
> entries ahead of time or leaving in entries for disabled
> servers, to reduce the number of times elections are triggered.
> Otherwise I can only update one server per week...

There's not really any such thing as a "new election".  Elections happen
approximately every 15 seconds, all the time.  An interruption in
service occurs only when an election _fails_; that is, when no one
server obtains the votes of more than half of the servers that exist(*).
That can happen if not enough servers are up, of course, but it can also
happen when one or more servers that are up are unable to vote for the
ideal candidate.  Generally, the rule is that one cannot vote for two
different servers within 75 seconds, or vote for _any_ server within 75
seconds of startup.
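
The voting rule above can be sketched in a few lines of Python.  This is
only a simplified model of the rule as stated in this message, not the
actual ubik election code; the function and parameter names are made up
for illustration, and times are plain seconds.

```python
VOTE_LOCKOUT = 75  # seconds, as described above


def may_vote_for(candidate, now, started_at,
                 last_vote_for=None, last_vote_at=None):
    """Return True if a server may cast a vote for `candidate` at time `now`.

    Simplified model (an assumption, not the real implementation):
      * no votes at all within 75 seconds of startup
      * no vote for a *different* server within 75 seconds of the last vote
    """
    if now - started_at < VOTE_LOCKOUT:
        return False
    if (last_vote_for is not None
            and last_vote_for != candidate
            and now - last_vote_at < VOTE_LOCKOUT):
        return False
    return True
```

For example, a server that started at t=0 and voted for "db1" at t=100
may vote for "db1" again at t=115, but may not switch to "db2" until
t=175.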

As a practical matter, what this means when restarting database
servers for config updates is that you must not restart them all at the
same time.  You _can_ restart even the coordinator without causing an
interruption in service longer than the time it takes the server to
restart (on the order of milliseconds, probably).  Even though the
server that just restarted cannot vote for 75 seconds, that doesn't mean
it cannot run in _and win_ the election.  However, after restarting one
server, you need to wait for things to completely stabilize before
restarting the next one.  This typically takes 75 to 90 seconds, and
can be observed in the output of 'udebug'.  What you are looking for is
for the recovery state to be f or 1f, and for the coordinator to be
getting "yes" votes from every server you think is supposed to be up.
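
If you want to script that check, something like the following Python
sketch could work.  It takes the text printed by 'udebug' for the sync
site and looks for the two conditions described above.  The line
patterns it matches ("Recovery state ...", "Server (...)", "last vote
was yes") are assumptions about the output format, and the SAMPLE text
is hand-written for illustration rather than captured from a real
server, so check both against what your version of udebug prints.

```python
import re

# Hand-written illustrative text, NOT real captured udebug output.
SAMPLE = """\
Recovery state 1f
Server (10.0.0.2): (db 1390.1)
    last vote rcvd 3 secs ago, last vote was yes
Server (10.0.0.3): (db 1390.1)
    last vote rcvd 4 secs ago, last vote was yes
"""


def sync_site_stable(udebug_text, expected_peers):
    """True if recovery state is f or 1f and every expected peer voted yes."""
    state = re.search(r"Recovery state\s+([0-9a-fA-F]+)", udebug_text)
    if state is None or state.group(1).lower() not in ("f", "1f"):
        return False

    votes = {}       # peer address -> True if its last vote was "yes"
    current = None   # peer whose block of lines we are currently reading
    for line in udebug_text.splitlines():
        server = re.match(r"\s*Server\s*\(\s*([^)\s]+)\s*\)", line)
        if server:
            current = server.group(1)
            votes.setdefault(current, False)
            continue
        if current is not None:
            vote = re.search(r"last vote was (yes|no)", line)
            if vote:
                votes[current] = (vote.group(1) == "yes")

    return all(votes.get(peer, False) for peer in expected_peers)
```

With the sample above, sync_site_stable(SAMPLE, ["10.0.0.2", "10.0.0.3"])
returns True; it returns False if a peer you expect is missing or its
last vote was "no".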

Of course, you _will_ have an interruption in service when you retire
the machine that is the coordinator.  At the moment, there is basically
no way to avoid that.  However, if you plan and execute the transition
carefully, you only need to take that outage once.

(*) Special note:  The server with the lowest IP address gets an extra
one-half vote, but only when voting for itself.  This helps to break
ties when the CellServDB contains an even number of servers.

> Ideally if I want to reshape the cell from DB servers 1, 2, 3 to
> 4, 5, 6, I'd love to be able to do it by first putting in the
> 'server/CellServDB' all 6 with 4, 5, 6 not yet available, and
> only at the end remove 1, 2, 3. That does not play well (if one
> of the 3 live servers fails) with the "quorum" :-) so I went
> halfway.

This doesn't work because, with 6 servers in the CellServDB, to maintain
a quorum you must have four servers running, or three servers if one of
them is the one with the lowest address.  In fact, you can't even
transition "safely" from three to four servers, because once you have
four servers in your CellServDB, if the one with the lowest address goes
down before the new server is brought up, you'll have two out of four
servers up and no quorum.  
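
The arithmetic behind those numbers can be checked with a small Python
sketch.  It models only the best case, where every running server votes
for the same candidate and the lowest-address server (if up) is that
candidate and so collects its extra half vote.  The function name and
parameters are invented for this example.

```python
def has_quorum(total, up, lowest_is_up):
    """Best-case quorum check.

    total        -- number of servers listed in CellServDB
    up           -- number of those servers currently running
    lowest_is_up -- whether the lowest-address server is among them
    """
    votes = float(up)
    if lowest_is_up and up > 0:
        votes += 0.5  # extra half vote for the lowest-address server
    # The winner needs more than half of the servers that exist.
    return votes > total / 2


print(has_quorum(6, 4, False))  # four of six, lowest down
print(has_quorum(6, 3, True))   # three of six, including the lowest
print(has_quorum(6, 3, False))  # three of six, lowest down
print(has_quorum(4, 2, False))  # two of four, lowest down
```

The first two cases have a quorum (4 > 3 and 3.5 > 3); the last two do
not (3 is not more than 3, and 2 is not more than 2).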

However, you can safely and cleanly transition to and from larger
numbers of servers, one server at a time.  Just be sure that before you
start up a new server, every existing server has been restarted with a
CellServDB naming that server.  Similarly, make sure to shut a server
down before removing it from remaining servers' CellServDB files.
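
The ordering described above can be written down as a checklist
generator.  This Python sketch only produces the order of steps as plain
text; it does not run any commands, and the function names and host
names are hypothetical.  The "wait" step reflects the earlier advice to
let things stabilize between restarts.

```python
def plan_add_server(existing, new_server):
    """Steps for adding one server: update and restart every existing
    server first, one at a time, and only then start the new one."""
    steps = []
    for host in existing:
        steps.append(f"{host}: add {new_server} to CellServDB and restart")
        steps.append(f"{host}: wait until udebug shows the cell is stable")
    steps.append(f"{new_server}: start the database servers")
    return steps


def plan_remove_server(existing, old_server):
    """Steps for removing one server: shut it down first, then drop it
    from the remaining servers' CellServDB files, one at a time."""
    steps = [f"{old_server}: shut down the database servers"]
    for host in existing:
        if host == old_server:
            continue
        steps.append(f"{host}: remove {old_server} from CellServDB and restart")
        steps.append(f"{host}: wait until udebug shows the cell is stable")
    return steps


for step in plan_add_server(["db1", "db2", "db3", "db4"], "db5"):
    print(step)
```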

At one point, I believe I worked out a sequence involving careful use of
out-of-sync CellServDB files and the -ubiknocoord option (gerrit #2287)
to allow safely transitioning from 3 servers to 4.  However, this is not
recommended unless you have a deep understanding of the election code,
because it is easy to screw up and create a situation where you can have
two sync sites.

I also worked out (but never implemented) a mechanism to allow an
administrator to trigger a clean transition of the coordinator role from
one server to another _without_ a 75-second interruption.  I'm sure that
at some point we'll revisit that idea.

-- Jeff