[OpenAFS-devel] idle dead timeout processing in clients

Russ Allbery rra@stanford.edu
Thu, 08 Dec 2011 14:42:38 -0800


Simon Wilkinson <sxw@inf.ed.ac.uk> writes:

> The first possible cause is journalling filesystems. Many of these flush
> their journals to disk at regular intervals, blocking or reducing access
> to the filesystem during the journal flush. This block can be enough to
> cause the fileserver to start queuing incoming connections, and in a
> site that is finely balanced, may be enough to cause performance to
> stall. This was made considerably worse by the fileserver performing a
> sync() operation every 10 seconds. This is fixed in 1.6.0 - a 1.4.x
> release containing the fix has yet to appear.

I *think* we're currently running a file server with patches applied to
disable some of the sync() calls, but I may be misremembering.  I know
we've had this discussion before.
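
For anyone trying to picture why a short periodic stall matters so much on a finely balanced server, here is a toy model. It is not OpenAFS code. The function name, the one-second tick, and all the rates are made up for illustration; the only thing taken from the discussion is the idea of a server that stops servicing requests briefly at a regular interval.

```python
def simulate_backlog(arrivals_per_s, served_per_s, stall_every_s, stall_len_s, duration_s):
    """Toy one-second-tick model of a server that periodically stalls.

    Returns (peak_backlog, final_backlog), both measured at the end of a tick.
    All parameters are hypothetical; nothing here is measured from a real
    fileserver.
    """
    backlog = 0
    peak = 0
    for t in range(duration_s):
        # New requests arrive whether or not the server is stalled.
        backlog += arrivals_per_s
        # The server is stalled for the first stall_len_s seconds of each period.
        stalled = stall_every_s > 0 and (t % stall_every_s) < stall_len_s
        if not stalled:
            backlog -= min(backlog, served_per_s)
        peak = max(peak, backlog)
    return peak, backlog


# Spare capacity, no stalls: the queue never builds.
print(simulate_backlog(100, 150, 0, 0, 60))
# Spare capacity, 1 s stall every 10 s: the queue builds, then drains.
print(simulate_backlog(100, 150, 10, 1, 60))
# Finely balanced, same stall: every stall adds backlog that never drains.
print(simulate_backlog(100, 100, 10, 1, 60))
```

The third case is the interesting one: with no headroom, the work missed during each stall is never made up, so the backlog only grows.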

> The next cause is deadlocks between the client and the
> fileserver. The Linux dynamic vcaches code which was added in 1.4.10 is
> known to interact badly with fileserver callback breaks, especially in
> situations where the fileserver is under heavy load. There is a fix in
> 1.6.0, but we have yet to ship a 1.4.x release which contains it. You
> can also work around this particular problem by disabling dynamic
> vcaches in your clients.

The www.stanford.edu clients that are having problems are running with a
patch to not hold the lock that causes the deadlock condition with
callback breaks.
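
The general shape of that kind of deadlock, and of the "don't hold the lock" fix, can be sketched as below. This is a generic illustration and not the actual OpenAFS client code or the patch we run: `vcache_lock`, `fetch`, and the two helpers are hypothetical names, and a timeout stands in for what would be a real hang.

```python
import threading


def fetch(hold_lock_during_rpc, timeout=0.5):
    """Return True if the simulated RPC completes, False if it times out.

    Hypothetical model: the client makes an RPC, and the server will not
    answer until a callback break it sent back to the client has been
    handled. Handling the callback break needs the same client-side lock.
    """
    vcache_lock = threading.Lock()      # hypothetical client-side lock
    callback_done = threading.Event()

    def callback_break():
        # The callback break handler needs the client lock to proceed.
        if vcache_lock.acquire(timeout=timeout):
            vcache_lock.release()
            callback_done.set()

    def rpc():
        # The "server" issues a callback break and waits for it to finish
        # before it would reply to the client's request.
        handler = threading.Thread(target=callback_break)
        handler.start()
        completed = callback_done.wait(timeout)
        handler.join()
        return completed

    vcache_lock.acquire()
    if hold_lock_during_rpc:
        # Problem case: the lock is held across the blocking call.
        try:
            return rpc()
        finally:
            vcache_lock.release()
    # Workaround: drop the lock before making the blocking call.
    vcache_lock.release()
    return rpc()


print(fetch(hold_lock_during_rpc=True))   # times out
print(fetch(hold_lock_during_rpc=False))  # completes
```

In the first call neither side can make progress until the timeout fires. In the second, the callback break gets the lock immediately and the RPC completes.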

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>