[OpenAFS] Re: Qu re tuning timeouts for failover between RO replicas

Andrew Deason adeason@sinenomine.net
Thu, 2 Dec 2010 13:22:11 -0600

On Thu, 2 Dec 2010 13:56:07 -0500 (EST)
"Thomas M. Payerle" <payerle@umd.edu> wrote:

> I am looking for a way to tune the timeout before failing over to
> another AFS server for replicated volumes, but cannot seem to find any
> suitable runtime parameters to tweak.  Do any such parameters exist?

"Sort of". You can do this without recompiling on Linux clients, but you
need to set it at startup.

The knob is /proc/sys/afs/rx_deadtime. You want to set it before the
client starts: after you load the kernel module but before you run afsd.
If you change it later, the new value only takes effect when a server
comes back up after being marked down (or something like that; I forget
the details).

This is also a rather coarse hammer, as this is a timeout value for all
RX network activity in the kernel.
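
As a rough sketch of that ordering (not a tested recipe), something like
the following could run from an init script after the kernel module is
loaded and before afsd is started. The helper name set_rx_deadtime and the
value 20 are made up for illustration, and I am assuming the value is in
seconds and that the file holds a single integer; only the /proc path
comes from the above.

```python
from pathlib import Path

# Path named above; it exists only once the OpenAFS kernel module is loaded.
RX_DEADTIME = Path("/proc/sys/afs/rx_deadtime")


def set_rx_deadtime(seconds: int, path: Path = RX_DEADTIME) -> int:
    """Write a new dead time (assumed to be in seconds) and return the value read back.

    Hypothetical helper: call it after loading the kernel module and
    before starting afsd. Needs root to write under /proc/sys.
    """
    if seconds <= 0:
        raise ValueError("dead time must be a positive number of seconds")
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; is the kernel module loaded?")
    path.write_text(f"{seconds}\n")
    # Read the value back so the caller can confirm the kernel accepted it.
    return int(path.read_text().split()[0])
```

Calling set_rx_deadtime(20) as root between the module load and afsd
would then be the whole change. Since the timeout applies to all RX
traffic, a value that is too low may mark merely slow servers as down, so
pick it with some care.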

> We have some replicated web servers serving data from replicated RO
> volumes.  If one of the servers hosting one of those volumes goes
> down, httpds which were pointing to that server's copy of the volume
> seem to get badly wedged.  I think it is because enough requests come
> in during the time it takes for AFS client on web host to realize the
> AFS server is down and move on to a replica that all available threads
> for apache are used, and apache just gets very unhappy.

If the situation never recovers while the fileserver is down, you also
_might_ be hitting a problem that is solved by
<http://gerrit.openafs.org/3339> and <http://gerrit.openafs.org/3340>.
But if you don't want to fiddle with trying patches, that's not really
helpful. Lowering rx_deadtime would also work around that issue, anyway.

Andrew Deason