[OpenAFS] Re: DB servers "quorum" and OpenAFS tools

Andrew Deason adeason@sinenomine.net
Fri, 17 Jan 2014 14:12:20 -0600

On Fri, 17 Jan 2014 18:50:13 +0000
pg@afs.list.sabi.co.UK (Peter Grandi) wrote:

>   What rules do the OpenAFS tools use to contact one of
>   the DB servers?

Most of the time ("read" requests), we'll pick a random dbserver, and
use it. If contacting a dbserver fails for network reasons, we will try
to avoid that server unless we run out of servers.
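To check whether a particular dbserver is reachable at the Rx level at all, the `rxdebug` utility shipped with OpenAFS can probe the vlserver port (7003) directly; the hostname below is a placeholder:

```shell
# Probe the VL server's Rx port on one dbserver. If the server is up,
# this prints its version string; if it is down, the probe times out.
# "db1.example.com" is a placeholder for one of your dbservers.
rxdebug db1.example.com 7003 -version
```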

Possibly the big difference in the behavior you're seeing is that the
kernel clients only perform unauthenticated "read" operations on the
dbservers. Userspace tools like "vos" sometimes need to perform "write"
operations, which changes things somewhat.

For a write operation to the db, things are slightly different because
we must pick the "sync site"; we cannot access just any dbserver to
fulfill the request. If there are 3 or fewer dbservers, we pick randomly
like we do for read operations. If there are more than 3 sites, we ask
one of the sites who the sync site is, and then we contact the sync site
in order to perform the request. If the request is successful, we
remember who the sync site is, and keep using it until we get a network
error (or an "I'm not the sync site" error).
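If you want to see which server ubik currently considers the sync site, `udebug` pointed at the vlserver port reports each host's view of the quorum; the hostname is a placeholder:

```shell
# Dump ubik state from one dbserver's VL service (port 7003). The
# output reports whether that host believes it is the sync site, the
# recent vote times, and the current db version.
udebug db1.example.com 7003
```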

With userspace tools, we have no way of remembering which servers are
down between invocations, so each time a tool runs, it picks a random
server again. The kernel clients run for a much longer period of time,
so if a client contacts a downed dbserver, it can avoid that dbserver
for quite some time afterwards.

That's just for choosing a dbserver site, though; if you want to know
how long we take to fail to connect to a specific site:

>   I have a single-host test OpenAFS cell, and I
>   have added a second IP address to '/etc/openafs/CellServDB'
>   with an existing DNS entry (just to be sure) but not assigned
>   to any machine: sometimes 'vos vldb' hangs for a while (105
>   seconds), doing 8 attempts to connect to the "down" DB server;

I'm not sure how you are determining that we're making 8 attempts to
contact the down server. Are you just seeing 8 packets go by? We can
send many packets for a single attempt to contact the remote site. By
default, OpenAFS code tends to wait about 50 seconds for a site to
respond to a request. "vos" sets this to 90 seconds for most things (I
don't know why), during which period it will retry sending packets. 105
seconds is close enough that this should explain it; the timeouts are
not always exact, since we poll outstanding calls to see if they have
timed out.
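A quick way to confirm the stall length on your own cell is simply to time an unauthenticated query; the cell name is a placeholder:

```shell
# Measure how long a userspace query takes when one dbserver is down.
# "example.com" is a placeholder cell name.
time vos listvldb -cell example.com -noauth
```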

> The OpenAFS client caches seemed to cope well as expected, as in
> a cell with a "quorum" of 3 "up" DB servers, and 1 "down". I
> think the only consequence I noticed was sometimes 'aklog'
> taking around 15 seconds.

The kernel client will not notice changes to the CellServDB until you
restart it, or run 'fs newcell'. The client also usually doesn't need to
contact the dbservers very often; it could easily take an hour for you
to notice even if all of the dbservers were down. If the client hits a
downed dbserver, it will hang, too (at least around 50 seconds).
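For reference, picking up a CellServDB change without restarting the client looks something like this (cell and server names are placeholders):

```shell
# Tell the running kernel client about the cell's current dbserver
# list, replacing whatever it read from CellServDB at startup.
fs newcell -name example.com \
    -servers db1.example.com db2.example.com db3.example.com
```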

> However *some* backups started to hang and some AFS-volumes
> became inaccessible to all clients. The fairly obvious cause was
> that the cloning transaction instead of being very quick would
> not end, and cloning locks the AFS-volume.

I _think_ this is because we are hanging on allocating a new volume id
for the temporary clone. If you run with -verbose, do you see
"Allocating new volume id for clone of volume..." before it hangs?

We could possibly do that before we mark the volume as "busy", but then
we might allocate a vol id we never use, if the volume isn't usable.
Maybe that's better, though. Fixing that doesn't eliminate the hanging
behavior you're seeing, but it would mean the volume would be accessible
to clients while 'vos' is hanging.
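If the backups were driven by dumping the read/write volumes directly (which is what forces the temporary clone), re-running one dump by hand with -verbose should show whether it stalls at the id allocation step; the volume name and file path are placeholders:

```shell
# Full dump of a read/write volume; vos clones it internally, so with
# -verbose you can watch for the "Allocating new volume id for clone"
# message before any hang. Volume name and output path are placeholders.
vos dump -id home.user -time 0 -file /tmp/home.user.dump -verbose
```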

May I ask why you are not just dumping .backup volumes? You could create
the .backup volumes en masse with 'vos backupsys', and then you could
just 'vos dump' them afterwards. Performing work in big "bulk"
operations like that as much as possible would make the tools more
resilient to the errors you are seeing, since then the tool is a single
command and can remember which dbserver is down.
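A minimal sketch of that workflow, with placeholder volume and cell names:

```shell
# Create or refresh .backup clones for every volume whose name starts
# with "home", in a single vos invocation.
vos backupsys -prefix home -cell example.com

# Then dump the already-cloned .backup volumes; these dumps do not need
# to create a new clone, so the RW volumes stay accessible throughout.
vos dump -id home.user.backup -file /tmp/home.user.dump -cell example.com
```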

> With a curious attempt to open "$HOME/.AFSSERVER" (which did not
> exist); 'vos' also tries to open "/.AFSSERVER".

This is for rmtsys support with certain environments usually involving
the NFS translator. I assume this happens when 'vos' tries to get your
tokens in order to authenticate; if that's correct, it'll go away if you
run with -noauth or -localauth.
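For example, either of these avoids the token lookup entirely (and with it the .AFSSERVER probe):

```shell
# Unauthenticated query; fine for read-only operations:
vos listvldb -noauth

# Or, on a server machine that holds the cell's KeyFile, authenticate
# with the server key instead of user tokens:
vos listvldb -localauth
```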

Andrew Deason