[OpenAFS] Re: Best way to debug "Lost contact with file server"

Andrew Deason adeason@sinenomine.net
Mon, 27 Feb 2012 15:50:04 -0600

On Fri, 24 Feb 2012 02:51:48 -0800
Ken Elkabany <Ken@Elkabany.com> wrote:

> Currently, the servers are running 1.4.14 (will be upgraded to 1.6
> soon) on Ubuntu 10.04. The clients are running 1.6.0 on Ubuntu 11.10.
> The clients are not human users, but processes that are constantly
> pulling data from AFS.

Did the clients come from packages, or did you build them yourself? If
they came from a package, can you provide the exact version? You may get
better results with a newer 1.6 (such as 1.6.1pre3), as there are known
problems with 1.6.0, but if you're using a package, it probably has
several fixes on top of 1.6.0.

> What tools do I have at my disposal to debug this issue? What is the
> recommended approach to take?

Ideally, you would capture additional information automatically at the
moment the problem occurs, triggered by scanning syslog for the message
or by whatever other means you have of detecting it.

You can capture fstrace debug data somewhat easily, which provides a
good look at what is going on. If you want to, run as root:

fstrace clear cm
fstrace setlog cmfx -buffers 1024
fstrace sets cm -active

Then wait for the problem to occur. When it does, run:

fstrace dump cm > /tmp/some.log.file

Then run:

fstrace sets cm -inactive

to turn off the tracing.
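
If you want the dump to be kicked off by the syslog message itself, here
is a rough sketch of one way to do it. It assumes the "Lost contact with
file server" message ends up in /var/log/syslog (adjust LOG_FILE for your
setup) and that tracing has already been activated as above; the names
LOG_FILE, OUT_DIR and dump_on_match are just made up for illustration.

```shell
#!/bin/sh
# Assumption: the client's "Lost contact with file server" message
# reaches this file; change LOG_FILE to match your syslog configuration.
LOG_FILE=${LOG_FILE:-/var/log/syslog}
OUT_DIR=${OUT_DIR:-/tmp}

# Read log lines on stdin and dump the cm trace at each matching line.
dump_on_match() {
    while IFS= read -r line; do
        case "$line" in
            *"Lost contact with file server"*)
                # One dump file per match, named by local time.
                fstrace dump cm > "$OUT_DIR/fstrace.$(date +%Y%m%d-%H%M%S).log"
                ;;
        esac
    done
}

# Usage (as root, with the cm set already active):
#   tail -n 0 -F "$LOG_FILE" | dump_on_match
```

The trace buffer is only as large as what was given to 'fstrace setlog',
so the sooner the dump runs after the message, the better.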

If you cannot easily trigger something when the problem happens, you can
instead run that 'fstrace dump cm' command every few seconds, writing
each dump to a file named after the current timestamp. Then when the
problem occurs, go find the output from around that time.
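
Something along these lines would do for the periodic dump. This is only
a sketch: DUMP_DIR, INTERVAL and MAX_DUMPS are names I made up for the
script, not fstrace options, and you will want to keep an eye on disk
usage if you leave it running for long.

```shell
#!/bin/sh
# Hypothetical knobs for this script only; none of these are fstrace options.
DUMP_DIR=${DUMP_DIR:-/tmp/fstrace-dumps}
INTERVAL=${INTERVAL:-5}      # seconds between dumps
MAX_DUMPS=${MAX_DUMPS:-0}    # 0 means keep going until interrupted

# Write one snapshot of the cm trace to a file named by local time.
dump_once() {
    out="$DUMP_DIR/cm.$(date +%Y%m%d-%H%M%S).log"
    fstrace dump cm > "$out"
}

# Take snapshots every INTERVAL seconds, stopping after MAX_DUMPS if set.
dump_loop() {
    mkdir -p "$DUMP_DIR" || return 1
    count=0
    while :; do
        dump_once
        count=$((count + 1))
        if [ "$MAX_DUMPS" -gt 0 ] && [ "$count" -ge "$MAX_DUMPS" ]; then
            break
        fi
        sleep "$INTERVAL"
    done
}

# Usage (as root, with the cm set already active):
#   dump_loop
```

File names only have one-second resolution, so an INTERVAL below 1 would
overwrite dumps taken within the same second.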

Alternatively, you can also try to get a packet dump from the network at
around the time of the 'lost contact' message. Either that or the
fstrace information should provide a pretty clear picture of what's
going on.
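
For the packet dump, something like the following should work if you
have tcpdump available. I am assuming the usual AFS UDP ports (7000 for
the fileserver, 7001 for client callbacks, and the rest of the 7000-7009
range for the other services); CAP_FILE and capture_afs are just names
for this sketch.

```shell
#!/bin/sh
# Assumptions: tcpdump is installed and AFS traffic uses the conventional
# UDP ports 7000-7009. CAP_FILE is a hypothetical output location.
CAP_FILE=${CAP_FILE:-/tmp/afs.pcap}
AFS_FILTER='udp portrange 7000-7009'

capture_afs() {
    # -s 0 keeps whole packets; -C 100 -W 10 rotates through ten files
    # of roughly 100 MB each, so the capture can run until the problem hits.
    # $AFS_FILTER is left unquoted on purpose so it splits into words.
    tcpdump -i any -s 0 -C 100 -W 10 -w "$CAP_FILE" $AFS_FILTER
}

# Usage (as root): capture_afs
```

With rotation enabled, tcpdump appends a number to the file name, so
look for /tmp/afs.pcap0, /tmp/afs.pcap1 and so on.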

> Off-email question: If a volume has N read replicas, how do clients
> choose which one to use?

By default, it's effectively random. Technically the client also takes
into account the IP addresses of the client and server to try to
estimate how "close" it is to each fileserver, but it does so using
antiquated classful addressing techniques, so the estimate usually isn't
very useful.

You can view the preferences the client is using by running
'fs getserverprefs'. You can set your own preferences to override the
default semi-random ones with 'fs setserverprefs'. Servers with lower
numbers are preferred over the others.
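
As a rough example of pinning one replica server, with a made-up
hostname and rank (fs1.example.com and 1000 are placeholders, and
prefer_server is just a wrapper name for this sketch):

```shell
#!/bin/sh
# fs1.example.com and the rank 1000 are hypothetical; substitute one of
# your own replica servers. A lower rank means a more preferred server.
PREFERRED=${PREFERRED:-fs1.example.com}
RANK=${RANK:-1000}

prefer_server() {
    # Show the current ranks, set the rank for one server, show the result.
    fs getserverprefs
    fs setserverprefs -servers "$1" "$2"
    fs getserverprefs
}

# Usage (as root): prefer_server "$PREFERRED" "$RANK"
```

As far as I know the ranks live in the client's kernel memory, so you
would need to set them again after the client restarts.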

Andrew Deason