[OpenAFS] Re: "afs: Lost contact with file server" on the same machine?

Andrew Deason adeason@sinenomine.net
Sun, 14 Jun 2009 21:39:46 -0500

(Hmm, I'm not seeing my message the first time through; darn gmane.
Re-sending; apologies if this arrives twice)

On Sun, 14 Jun 2009 16:33:57 -0700
Adam Megacz <megacz@hcoop.net> wrote:

> Russ Allbery <rra@stanford.edu> writes:
> > This sounds identical to the problem that we were having with our
> > web servers that was mostly caused by CGI script tokens expiring
> > and then scripts continuing to try to access AFS until the file
> > server started throttling Rx connections.
> Can I get the fileserver to log a message indicating that it has decided
> to throttle connections from a host?

No, there's no log message that indicates that this is happening; the
throttling is also per-connection (or per-call) rather than per-host, as
I recall. It's unfortunate, since it makes identifying the problem more
difficult, but having logs for this would somewhat defeat the purpose of
the throttling. If you're triggering the throttling behavior, logging it
would almost certainly slow down the fileserver considerably.

The least disruptive way to see if it's happening is probably to
correlate kernel error messages with the problem, as Russ mentioned
earlier. You could also look at the traffic, as Derrick mentioned, for
changes in client traffic vs. server response rates, or look at patterns
in the number of rx aborts from `rxdebug -rxstats` over time.
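
As a rough sketch of that last idea: assuming you already pull the abort
counter out of each `rxdebug -rxstats` run yourself and record it with a
timestamp (the parsing isn't shown here, and the function names and the
"factor" cutoff are made up for illustration), something like this would
turn the samples into per-interval abort rates and flag the intervals
that stand out:

```python
from statistics import median
from typing import List, Tuple


def abort_rates(samples: List[Tuple[float, int]]) -> List[float]:
    """Aborts per second between consecutive (unix_time, abort_count) samples.

    The samples are assumed to come from periodic `rxdebug -rxstats` runs;
    extracting the counter from that output is left to the caller.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order timestamps
        delta = c1 - c0
        if delta < 0:
            # Counter went backwards; assume the server restarted and
            # the counter started again from zero.
            delta = c1
        rates.append(delta / elapsed)
    return rates


def spikes(rates: List[float], factor: float = 5.0) -> List[int]:
    """Indices of intervals whose rate is above factor * the median rate."""
    if not rates:
        return []
    baseline = median(rates)
    return [i for i, r in enumerate(rates) if r > 0 and r > factor * baseline]


if __name__ == "__main__":
    # Hypothetical samples taken once a minute: (seconds, abort counter).
    samples = [(0, 10), (60, 16), (120, 22), (180, 322), (240, 328)]
    rates = abort_rates(samples)
    print(rates)
    print(spikes(rates))
```

The intervals it flags are the ones to line up against the times the
clients reported losing contact. The 5x-median cutoff is arbitrary; pick
whatever separates your normal background abort rate from the bad
periods.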

Although, if it's tolerable, it may be easiest to just disable
throttling by passing "-abortthreshold 0" to the fileserver, and see if
the problem goes away. That's not really a long-term solution, though;
you should find out what's causing the aborts and fix that.

Andrew Deason