[OpenAFS-devel] Cache inconsistency in client 1.4.8 and above

Mon, 4 May 2009 13:52:10 -0400

> Traces of the usual deadlocked suspects are attached. At that point, just
> about any process can deadlock, I suppose. Apparently, the system ceases to
> balance dirty pages (which appears plausible to me, but I have no experience
> with virtual memory implementations whatsoever).

OK, this brought back some memories... I think you're seeing a problem
with older kernels that was addressed by Peter Zijlstra's "per BDI
dirty threshold" patch set in kernel 2.6.24:
    http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=04fbfdc14e5f4

Note the mention of "deadlocks with stacked BDIs", which is exactly
the case for AFS when using a disk cache.  Before that patch set the
dirty threshold was global, so congestion on the AFS backing device
keeps processes from writing to other devices, including the ext2/3
device holding the disk cache.  The cache manager then can't make
progress in writing back its dirty data, and the congestion never
clears.

See for instance: https://bugzilla.redhat.com/show_bug.cgi?id=453811 -
a request to backport the patch set to 2.6.18 for RHEL 5.
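
If you want a quick way to tell whether a given client is likely
affected, something like the following rough sketch compares the
running kernel release against 2.6.24.  The function names are made up
for illustration, and the check only looks at the upstream version
number: a vendor kernel that has the patch set backported (as
requested in the bug above) would still be reported as lacking it.

```python
import platform
import re


def parse_kernel_release(release):
    """Return (major, minor, patch) from a string such as '2.6.18-128.el5'.

    Hypothetical helper for this sketch; a missing patch level counts as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % (release,))
    return tuple(int(part) if part else 0 for part in match.groups())


def has_per_bdi_dirty_limits(release):
    """True if the upstream version is at least 2.6.24.

    Assumption: only the version number is checked, so vendor backports
    to older kernels are not detected.
    """
    return parse_kernel_release(release) >= (2, 6, 24)


if __name__ == "__main__":
    running = platform.release()
    try:
        if has_per_bdi_dirty_limits(running):
            print("%s: per-BDI dirty thresholds expected" % running)
        else:
            print("%s: predates 2.6.24, stacked-BDI deadlock possible" % running)
    except ValueError as err:
        print(err)
```

For example, has_per_bdi_dirty_limits("2.6.18-128.el5") is False,
while has_per_bdi_dirty_limits("2.6.24") is True.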

It may well be that there's no way to work around this kernel problem
in the AFS code.

Marc