[OpenAFS] Access an OpenAFS cell in LAN and WAN with dynamic DNS (DDNS) address

Benjamin Kaduk kaduk@MIT.EDU
Tue, 30 Aug 2016 22:43:55 -0400 (EDT)

On Fri, 26 Aug 2016, Karl-Philipp Richter wrote:

> Hi,
> On 25.06.2016 at 15:21, Jeffrey Altman wrote:
> > When the IP address changes there is a requirement that the
> > configuration be altered and the servers be restarted in order for that
> > new IP address to become available.
> >
> > The servers and the clients store the IP addresses.  The client in
> > particular caches volume location information for hours and must be
> > manually forced to refresh it with "fs checkvolumes" when the file
> > servers' IP address changes.
> Changing the IP addresses in `/etc/openafs/CellServDB` and
> `/etc/openafs/server/CellServDB` and restarting the fileserver and
> client and running `fs checkvolumes` doesn't help (even rebooting both).

Hmm, I would have thought rebooting the client would help.
In any case, the scenario that the original authors had in mind was when a
volume is moved by administrative action from one server to a different
server (and thus to a different IP address), not one of a single
fileserver getting a new address and the old address being unreachable.
Accordingly, the update mechanism tends to rely on "contact the old
address and get told to redo the lookup", so things behave less well when
the old address is unreachable (or worse, a different server that tries to
respond with "confusing" data).  But IIRC 'fs checkvolumes' is supposed to
sidestep the "get told to redo the lookup" step and just redo the lookup
anyway, so there is probably something different going on.

The database of volume locations is updated when a fileserver starts up
and registers its current address (the fileserver is identified by UUID).

> The server seems to keep track of old addresses and tries to contact
> them - I see

(That looks like AFS client output, not AFS server output.)

>     [  204.480062] afs: Lost contact with file server in
> cell richtercloud.de (code -1) (multi-homed address; other same-host
> interfaces maybe up)
>     [  204.480067] RXAFS_GetCapabilities failed with code -1
>     [  260.948077] afs: Lost contact with file server in
> cell richtercloud.de (code -1) (multi-homed address; other same-host
> interfaces maybe up)
>     [  318.428081] afs: Lost contact with file server in
> cell richtercloud.de (code -1) (multi-homed address; other same-host
> interfaces maybe up)
>     [  375.900096] afs: Lost contact with file server in
> cell richtercloud.de (code -1) (multi-homed address; other same-host
> interfaces maybe up)
>     [  433.380152] afs: Lost contact with file server in
> cell richtercloud.de (code -1) (multi-homed address; other same-host
> interfaces maybe up)
>     [  490.848098] afs: Lost contact with file server in
> cell richtercloud.de (code -1) (all multi-homed ip addresses down for
> the server)
> in `dmesg` for any address I ever entered in the client `CellServDB`.
> Changing the IP causes the volume to be broken (`ls: cannot access
> '/afs/richtercloud.de/': Connection timed out`) even after changing it
> back, rebooting and running `fs checkvolumes` and `fs checkservers`! It
> seems like the invalid addresses need to be added to `NetRestrict` in
> order to make the volume work again.

CellServDB is only read at AFS client startup; at runtime, fs newcell
can be used to update the list of database servers for a cell.
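
As a rough illustration of the runtime path, here is a minimal Python
sketch that reads CellServDB-style text and builds the matching
`fs newcell -name ... -servers ...` argument list.  The cell name
example.org, the 192.0.2.x addresses, and the helper names are made up for
the example; nothing is executed, and as far as I recall the change made
by fs newcell does not persist across a client restart unless CellServDB
is updated as well.

```python
# Hypothetical sample in CellServDB layout: ">cell  #comment" followed by
# one "address  #hostname" line per database server.
SAMPLE = """\
>example.org        #Example cell (made up)
192.0.2.10          #afsdb1.example.org
192.0.2.11          #afsdb2.example.org
"""


def parse_cellservdb(text):
    """Return {cell_name: [address, ...]} from CellServDB-formatted text."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else None
            if current is not None:
                cells[current] = []
        elif current is not None:
            cells[current].append(line.split()[0])
    return cells


def newcell_command(cell, servers):
    """Build (but do not run) the fs newcell invocation for one cell."""
    return ["fs", "newcell", "-name", cell, "-servers", *servers]


if __name__ == "__main__":
    for cell, servers in parse_cellservdb(SAMPLE).items():
        print(" ".join(newcell_command(cell, servers)))
```

The resulting list could be handed to subprocess.run() as root on a
machine with a running client; the sketch stops short of that on purpose.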

> I don't have the possibility to get a WAN IP for my mobile client, so
> it's behind a NAT as well. According to
> https://www.mail-archive.com/openafs-info@openafs.org/msg39090.html that
> shouldn't cause any problems (although I don't get why `fs
> setclientaddrs` exists, then, but that might be another topic).
> I wonder what `RXAFS_GetCapabilities failed with code -1` could mean.

code -1 is RX_CALL_DEAD (connection timed out; see
https://www.central.org/frameless/numbers/errors.html).  It just means
that there was nothing listening at the other end that cared to respond.
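
For reading such logs in bulk, a small Python sketch along these lines can
pull the numeric code out of each dmesg line and attach a name to it.  The
table is deliberately partial and written from memory of the Rx headers
(only -1 is confirmed above), and the function names are hypothetical, so
treat it as a starting point and check the values against the error list
linked above.

```python
import re

# Partial table of Rx error codes; -1 is the one discussed above, the
# other two are assumptions recalled from rx.h and worth double-checking.
RX_CODES = {
    -1: "RX_CALL_DEAD",
    -2: "RX_INVALID_OPERATION",
    -3: "RX_CALL_TIMEOUT",
}

# Matches both "(code -1)" and "failed with code -1" as seen in the log.
CODE_RE = re.compile(r"\(code (-?\d+)\)|failed with code (-?\d+)")


def rx_codes_in(line):
    """Return every numeric code mentioned in one log line."""
    return [int(a or b) for a, b in CODE_RE.findall(line)]


def describe(code):
    """Map a code to its symbolic name, if this partial table knows it."""
    return RX_CODES.get(code, "unknown (not in this partial table)")


if __name__ == "__main__":
    line = "[  204.480067] RXAFS_GetCapabilities failed with code -1"
    for code in rx_codes_in(line):
        print(code, describe(code))
```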

> I'm now experimenting with a script which updates the OpenAFS CellServDB
> for the server after a change of the external IP, creates a virtual
> network interface in the LAN where the server is with the same address
> of the external interface of the WAN gateway/WiFi router in order to try
> to trick the database scheme and setup forwarding for port 7000 to 7008
> and 7021 (all UDP) from the WiFi router to the connected interface to
> the server machine and from there to the virtual interface with
> `iptables` (e.g. `sudo iptables -A PREROUTING -t nat -i eth0 -p udp
> --dport 7021 -j DNAT --to [external IP]:7021`). The client (behind NAT
> and WAN) still fails to connect due to `afs: Lost contact with volume
> location server in cell richtercloud.de (code -1)` and
> `ls: cannot open directory '/afs/richtercloud.de/': Connection timed out`.
> Are there any plans to use name resolution in OpenAFS? It's a major
> technology that has existed for decades and for a reason. It'd make all our
> lives much easier.

I am unaware of any plans.  DNS lookups are already used in some cases,
with SRV or AFSDB records being used to perform the initial lookup of
dbserver addresses for a cell.  (This is controlled by the
now-misnamed -afsdb argument to afsd.)  But it sounds like you want the
cache manager to track TTLs and redo the lookup so as to get updates when
addresses change.  I only did a quick look (it seems the relevant
kernel-side code is afs_AFSDBHandler()), but we seem to only do the lookup
when first adding a cell, as is consistent with your experiences.  In
principle, it's only a "small matter of programming" to add such support
to the client, but as always, someone has to contribute the code or money
to fund the development of the code, or it's not going to happen.
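
To make the idea concrete, here is a minimal Python sketch of TTL-aware
caching: remember the answer of the SRV lookup together with its expiry
time, and redo the lookup once the TTL has run out.  This is not how the
cache manager is written today; the class, the resolve callback (assumed
to return a list of addresses plus a TTL in seconds), and the clock
parameter are all hypothetical.  The query name follows the
_afs3-vlserver._udp.<cell> convention from RFC 5864.

```python
import time


class SrvCache:
    """Cache dbserver addresses per cell and refresh them when the TTL expires."""

    def __init__(self, resolve, clock=time.monotonic):
        # resolve(name) -> (addresses, ttl_seconds); supplied by the caller,
        # e.g. a thin wrapper around whatever DNS library is available.
        self._resolve = resolve
        self._clock = clock
        self._entries = {}  # cell -> (expiry_time, addresses)

    def lookup(self, cell):
        now = self._clock()
        entry = self._entries.get(cell)
        if entry is not None and now < entry[0]:
            return entry[1]  # cached answer is still within its TTL
        # Expired or never seen: redo the SRV lookup and remember the result.
        addresses, ttl = self._resolve("_afs3-vlserver._udp." + cell)
        self._entries[cell] = (now + ttl, addresses)
        return addresses
```

The hard part in the real client is not this bookkeeping but deciding what
to do with existing server records and connections once the address list
changes.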

A big reason why there hasn't been motivation to add such live updating to
the client is that there are architectural issues on the server side if
machines are changing addresses.  The database servers use the ubik
synchronization protocol to present a reliable distributed database, but
part of the proof of correctness is that the participants in the consensus
algorithm are known in advance and do not change.  Proving the behavior
correct in the face of changing participant addresses would require a
great deal of effort and care, more than any group I know of is likely to
be able to contribute at present.  Furthermore,
some "easy"-seeming options, like adding database server UUIDs, would
require changing the intra-dbserver protocol, something that OpenAFS has
not done in a very long time, will only ever do at a major release
boundary, and prevents the use of new database servers in a quorum with
old ones.  There is a lot of inertia behind the current state of affairs,
as lousy as the user experience is in these kinds of networks.