[OpenAFS] Re: DB servers "quorum" and OpenAFS tools

Fri, 17 Jan 2014 17:43:59 -0500

On Fri, 2014-01-17 at 14:12 -0600, Andrew Deason wrote:

> time, so presumably if we contact a downed dbserver, the client will not
> try to contact that dbserver for quite some time.

To elaborate: the cache manager keeps track of every server, and
periodically sends a sort of "ping" to each server to find out which
servers are up.  So, it will discover a server is down even if you're
not using it.  And, other than the periodic pings, the cache manager
will never direct a request to a server it thinks is down.  So, failover
for the CM itself is automatic, persistent, and often completely
transparent.
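
As a rough illustration of the bookkeeping described above, here is a toy
model in Python.  This is not the cache manager's code; the class name, the
method names, and the probe callable are all hypothetical, and the real
implementation differs in many details.

```python
class ServerTracker:
    """Toy model of per-server up/down tracking (hypothetical, not OpenAFS code)."""

    def __init__(self, servers, probe):
        # probe is an assumed callable: probe(server) -> True if the server answered
        self.probe = probe
        # Every known server starts out marked up until a probe says otherwise
        self.up = {s: True for s in servers}

    def ping_all(self):
        # Periodic pass: probe every server, whether or not it is in use,
        # so a dead server is noticed before any request needs it
        for s in list(self.up):
            self.up[s] = self.probe(s)

    def pick(self):
        # Ordinary requests only go to a server currently believed to be up
        for s, is_up in self.up.items():
            if is_up:
                return s
        return None  # nothing is believed to be up


# Usage: db1 never answers, so after one ping pass requests go to db2
tracker = ServerTracker(["db1", "db2"], lambda s: s != "db1")
tracker.ping_all()
print(tracker.pick())
```

The point of the sketch is only that the down state is remembered between
requests, which is why the failover is persistent rather than rediscovered
on every call.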

The fileserver works a little differently, but also keeps track of which
server it is using, fails over when that server stops responding, and
generally avoids switching when it doesn't need to.

Ubik database servers all communicate among themselves, which is a
necessary part of the database replication mechanism.  That continues even
when one server is down, but in such a way that you'll never notice a
communication failure between dbservers, except in an unusual combination
of circumstances that can arise if a server goes down while you are
making a request that requires writing to the database.

> >   I have a single-host test OpenAFS cell with 1.6.5.2, and I
> >   have added a second IP address to '/etc/openafs/CellServDB'
> >   with an existing DNS entry (just to be sure) but not assigned
> >   to any machine: sometimes 'vos vldb' hangs for a while (105
> >   seconds), doing 8 attempts to connect to the "down" DB server;
> 
> I'm not sure how you are determining that we're making 8 attempts to
> contact the down server. Are you just seeing 8 packets go by? We can
> send many packets for a single attempt to contact the remote site.

Right.  Even though AFS communicates over UDP, which itself is
connectionless, Rx does have the notion of connections and includes a
full transport layer including retransmission, sequencing, flow control,
and exponential backoff for congestion control.  What you are actually
seeing is multiple retransmissions of a request, which may or may not be
the first packet in a new connection.  The packet is retransmitted
because the server did not reply with an acknowledgement, and the
intervals get longer because of exponential backoff, which is a key
factor in making sure that congested networks eventually get better
rather than only getting worse.
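
To show why the gaps between retransmissions grow, here is a minimal sketch
of a doubling backoff schedule in Python.  The function name, the 1-second
starting timeout, and the optional cap are assumptions chosen for
illustration; they are not Rx's actual timer values or algorithm.

```python
def retransmit_schedule(first_timeout, attempts, cap=None):
    """Return the time (in seconds) at which each copy of a packet is sent.

    Illustrative only: the timeout doubles after every unanswered send,
    optionally limited by cap.  Not Rx's real timer logic.
    """
    times = []
    t = 0.0
    timeout = first_timeout
    for _ in range(attempts):
        times.append(t)      # send (or resend) the packet now
        t += timeout         # wait this long for an acknowledgement
        timeout *= 2         # no reply: back off before the next try
        if cap is not None:
            timeout = min(timeout, cap)
    return times


# Eight sends of one request, all part of a single attempt
print(retransmit_schedule(1.0, 8))
# -> [0.0, 1.0, 3.0, 7.0, 15.0, 31.0, 63.0, 127.0]
```

With numbers like these, eight packets on the wire can span a minute or
more while still being one attempt to reach one server.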

-- Jeff