[OpenAFS-devel] mountpoints on linux

Fri, 29 Jun 2012 16:16:15 -0500

Hi all,

So, if you are not yet aware, the way we implement mountpoints on Linux
right now is not great. The current approaches have been known to be
iffy, but I think we've now come to the point where we have
unavoidable, easy-to-reproduce panics and/or deadlocks with the current
approach, so we need to change something. Gerrit 7600 and 7601 may give
a good impression of what's going on here; they were my first attempt at
a workaround fix, but they don't really work.

I've been discussing this with a few people, and I think Jeff Hutzelman
provided a good overview along with some options. With his permission,
it is reproduced below. I have some comments to go along with it,
covering the user-visible differences and the pros and cons of each
option, but I want to post it here by itself first so it's easier to
read.

> Unfortunately, what it boils down to is that the Linux kernel
> architecture assumes that a filesystem is a tree (that is, a
> connected, acyclic graph), and is incapable of correctly handling a
> filesystem like AFS which is in fact a directed graph with some
> restrictions(*).  This is their failing, not ours, and it is not
> limited to AFS.
> 
> The best we can do is come up with an internally-consistent mapping of
> the AFS filesystem onto a (possibly mutating) tree, and use that tree
> as Linux's view of the filesystem.  Mostly, this presents two
> problems:
> 
> 1) what do we do with cycles?
> 2) what do we do with nodes with multiple incoming edges?
> 
> Of course, these are essentially the same question, and I can think of
> several possible answers:
> 
>      A. Duplicate the entire subtree starting at the multiply-mounted
>         volume.  This means that changing one copy would have to result
>         in changing the other copies as well, and of course a change
>         from the server would have to be reflected in every copy.  That
>         means that the vnode->inode mapping would go from a simple
>         pointer to a list, and vnodes would require true reference
>         counting, independent of the refcount on any associated inode.
>         We might have to go to some effort to avoid cycles, just to
>         maintain sanity.  And, the fixed mapping between vnodes and
>         inode numbers probably goes right out the window.
>      B. Reparent the multiply-mounted volume each time it is accessed
>         via a new path.  This is what we've done since the
>         multiple-alias problem first arose.  It actually works fairly
>         well for users, but at times has been a pain to make work.
>         Cycles don't work, of course, since reparenting a volume below a
>         cycle would orphan the whole subtree, and the kernel VFS layer
>         won't let you do that (mostly).
>      C. Pretend like multiple mounts aren't allowed, and simply refuse
>         to follow additional mount points into a volume that already has
>         an associated dentry.  Users would not like this.
>      D. Treat every volume as a separate filesystem, like kafs does.
>         While this has some advantages, it also has some serious
>         disadvantages.  I also have a vague recollection of coming up
>         with a reason at one point why this model is fatally flawed.
>      E. Present additional mount points to the same volume as symbolic
>         links.  If I recall correctly, it is even possible to present
>         them as symlinks where the results of readlink(2) are not
>         actually consistent with what happens if you traverse the link,
>         so we need not be able to construct a path to the original mount
>         point (though of course we can, if it is still in the dentry
>         tree).
>      F. Present _all_ mount points as symbolic links, pointing at paths
>         in /afs/.:mount.
> 
> 
> (*) Most notably, only volume roots can actually have more than one
> incoming edge.
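
To make the two problem cases concrete, here is a rough sketch of a
volume graph with one multiply-mounted volume and one cycle, and of how
something like option E could flatten it into a tree. This is only an
illustration in Python, not OpenAFS code: the volume names, the MOUNTS
table, and the helpers incoming_edges() and to_tree() are all made up,
and "first mount point wins" is an assumption about how the real
directory would be chosen.

```python
from collections import defaultdict

# Hypothetical volume graph: volume -> {mount point name: target volume}.
# All volume names here are invented for illustration.
MOUNTS = {
    "root.afs": {"cell": "root.cell"},
    "root.cell": {"user": "user", "proj": "proj"},
    "user": {"alice": "user.alice"},
    "proj": {"alice-home": "user.alice"},  # second edge into user.alice
    "user.alice": {"top": "root.cell"},    # cycle back up to root.cell
}


def incoming_edges(mounts):
    """Map each volume to the (parent, name) mount points that target it."""
    incoming = defaultdict(list)
    for parent, entries in mounts.items():
        for name, target in entries.items():
            incoming[target].append((parent, name))
    return incoming


def to_tree(mounts, root, root_path="/afs"):
    """Flatten the graph into a tree, roughly in the style of option E.

    The first mount point this traversal reaches for a volume becomes a
    real directory edge. Any later mount point for the same volume is
    recorded as a symlink to the path of the first one. Cycles end up as
    symlinks too, since they always lead to an already-seen volume.
    """
    first_path = {root: root_path}
    tree_edges = []  # (path, volume) pairs presented as directories
    symlinks = []    # (path, link target path) pairs
    stack = [root]
    while stack:
        vol = stack.pop()
        for name, target in mounts.get(vol, {}).items():
            path = first_path[vol] + "/" + name
            if target in first_path:
                symlinks.append((path, first_path[target]))
            else:
                first_path[target] = path
                tree_edges.append((path, target))
                stack.append(target)
    return tree_edges, symlinks


if __name__ == "__main__":
    for vol, edges in incoming_edges(MOUNTS).items():
        if len(edges) > 1:
            print("multiple incoming edges:", vol, edges)
    tree_edges, symlinks = to_tree(MOUNTS, "root.afs")
    for path, vol in tree_edges:
        print("dir    ", path, "->", vol)
    for path, dest in symlinks:
        print("symlink", path, "->", dest)
```

Note that which mount point ends up as the "real" one depends purely on
traversal order in this sketch, which is the same kind of
order-of-access dependence that option B has.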

-- 
Andrew Deason
adeason@sinenomine.net