[OpenAFS-devel] Linux 2.6.12 kernel BUG at fs/namei.c:1189

Russ Allbery rra@stanford.edu
Sun, 11 Dec 2005 08:27:48 -0800


"chas williams - CONTRACTOR" <chas@cmf.nrl.navy.mil> writes:

> this mountpoint wouldn't happen to be mounted multiple times, would it?
> this really upsets the linux fs stack.

Not in this case, no.  However, I was accessing it through both the RO and
the RW paths.

> i could see a problem if you look up the mountpoint first in a different
> parent dir and then move it from a different directory to a new
> directory: the mountpoint's parent is not going to be the directory it's
> currently in.  i can't seem to duplicate this, though.  i think
> check_bad_parent() should catch this (except when the parent volid
> doesn't change).

> what afs version are you running?

1.4.0.

> can you be more specific about duplicating this problem?

This came up when I was cleaning tripwire reports.  The way we do tripwire
is to have one AFS volume per machine that holds the machine configuration
and the current tripwire database, all of which are mounted in a single
replicated directory.  I had been running tripwire on different machines
and copying new databases over into AFS, and in the process I ran across
various systems where the tripwire directory was mounted under the wrong
name (given the way we use tripwire, the mount point name has to match the
hostname).

Whenever I found a system where the mount point didn't match the hostname,
I'd switch to the read/write path and mv the mount point to the right
name, then release the volume.  When I did that, I got this kernel BUG and
a segfault from mv.  (Note that the AFS client had been running for some
time at this point, and at some point earlier I'd unloaded and reloaded it
to upgrade to a new build.)  After that happened, anything else
that touched the directory that holds all the mount points would block in
disk wait and I couldn't unload the AFS kernel module.

I rebooted the system, which cleared that up, and I could work in that
directory again.  But when I went back to doing the same thing, the first
two or three times I mv'd a mount point everything was fine, and then the
next time I got the segfault and the BUG again.  I was then very careful
not to touch that directory with any other process, and no processes are
in disk wait, but even though lsof reports no processes with open files in
AFS, I again can't unload the kernel module (lsmod shows three references
to it).

Note that the directory I was working in has several hundred mount points,
a few symlinks, and no other files (and it's the only directory in its
volume).  The size of the directory may have something to do with this, as
may switching between RO and RW access to the same volume.
The individual tripwire directories are not replicated, only the volume
that holds their mount points.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>