[OpenAFS] hangs on modifications to top-level directory (1.6.x)

Julian Bradfield jcb+afs@inf.ed.ac.uk
Wed, 28 Nov 2012 16:26:29 +0000


I am (we are) baffled by a problem that apparently affects only me, which
I've had for a long time, and which seems to be getting worse.

Symptom:
Every few minutes, all AFS access on my client desktop hangs when a
process creates or deletes a file in the top-level directory of my
home directory volume. (Most often, Emacs creating an auto-save record
file.)

This is fairly reliably reproducible, by sitting doing nothing for a
few minutes (5-7 minutes is usually enough), and then doing, say,
touch newfile
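
For anyone trying to reproduce this, the hang can be measured rather
than eyeballed. The following is only a sketch: time_touch is a
hypothetical helper name, and it assumes GNU date (for %N) and awk.

```shell
# Hypothetical helper: time a single file creation in the given directory
# and print the elapsed wall-clock time in seconds.
# Assumes GNU date (for the %N nanoseconds field) and awk.
time_touch() {
    dir=$1
    start=$(date +%s.%N)
    touch "$dir/newfile.$$"        # the operation that hangs
    end=$(date +%s.%N)
    rm -f "$dir/newfile.$$"        # clean up; not included in the timing
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
}

# Example use after an idle period (420 s = 7 minutes):
#   sleep 420; time_touch "$HOME"
```

A value of more than a second or so after the idle period would
indicate the hang; an immediate second call should be fast.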

Subsequent creations or removals appear to behave as normal until
another period of inactivity. However, setting up a loop
 while [ 1 ] ; do touch foobarbaz ; \rm foobarbaz ; sleep 60 ; done&
does NOT prevent the problem occurring, so it's more complex than just
that.

The problem does not happen with modifications to directories other
than the top-level one. It does also happen with ACL changes to the
top-level directory.


Environment:
Machines are running Scientific Linux 6, with openafs 1.6.1.
(Client kernel 2.6.32-279.11.1.el6.x86_64.)

Things we've tried:
The problem also occurs when accessing the files from a remote machine
(running 1.6.1a under OpenSUSE 11.4, with a 32-bit 2.6.31 kernel).
The problem also occurs when the files are created by a (suitably
permitted) user other than the volume owner.
We moved the volume to another server. No change.
We created a new volume, and I copied just my top-level files over,
symlinking to all the other directories, and switched my homedir to
the new volume. The effect is observed on both volumes.

We have not so far seen the effect on a volume owned by somebody else.

When accessing from home, I sniffed the network while doing a touch that
hung, and observed the following very slow request/reply. (Well,
fairly slow - the record hang I've measured so far is 46 seconds.)
(129.215.125.134 is the client, kraken is the server.)


20:22:28.986713 IP 129.215.125.134.afs3-callback > kraken.inf.ed.ac.uk.afs3-fileserver:  rx data fs call op#1574551967 (92)
20:22:29.139900 IP kraken.inf.ed.ac.uk.afs3-fileserver > 129.215.125.134.afs3-callback:  rx ack first 2 serial 0 reason delay (65)
20:22:30.311791 IP 129.215.125.134.afs3-callback > kraken.inf.ed.ac.uk.afs3-fileserver:  rx ack first 1 serial 0 reason ping (620)
20:22:30.365616 IP kraken.inf.ed.ac.uk.afs3-fileserver > 129.215.125.134.afs3-callback:  rx ack first 2 serial 19 reason ping response (65)
20:22:36.052770 IP kraken.inf.ed.ac.uk.afs3-fileserver > 129.215.125.134.afs3-callback:  rx data fs reply op#1574551967 (252)
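
The delay between the request and the reply in this capture can be
worked out from the first and last timestamps. A small sketch (elapsed
is a hypothetical helper name; it assumes both timestamps fall on the
same day):

```shell
# Hypothetical helper: seconds between two tcpdump-style timestamps
# (HH:MM:SS.ffffff), assuming both are from the same day.
elapsed() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, ":"); split(b, y, ":")
        printf "%.6f\n", (y[1]*3600 + y[2]*60 + y[3]) - (x[1]*3600 + x[2]*60 + x[3])
    }'
}

# Request (first line) to reply (last line) of the capture above:
elapsed 20:22:28.986713 20:22:36.052770    # prints 7.066057
```

So this particular call took about 7 seconds to be answered by the
fileserver.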



Any ideas, or suggestions for how else to find out what might be going
on, would be much appreciated. I haven't found anything suggestive by
searching.


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.