[OpenAFS-devel] Re: mountpoints on linux

Fri, 29 Jun 2012 16:49:46 -0500

On Fri, 29 Jun 2012 16:16:15 -0500
Andrew Deason <adeason@sinenomine.net> wrote:

>>  A. Duplicate the entire subtree starting at the multiply-mounted
>>     volume.

From a user perspective, I think this option has no downsides; it has
none of the theoretical user-visible problems or limitations that the
others have.

However, I think this is also the most complex option to implement, and
it consumes more memory. Possibly not a lot of memory; the actual AFS
structures could probably be shared and we'd just need a layer on top of
AFS vcaches that maps Linux inodes to AFS vcaches, etc. However, as jhutz
notes, this means that we now have multiple Linux inodes per file. So,
for example, everywhere in the code that uses AFSTOV would need to be
changed to loop through a list of values. At the very least that is a
lot of work; for some call sites, it may require nontrivial
restructuring of code to make it work (and this is platform-independent
code, so we may break other platforms trying to fix this).
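
To make the "loop through a list of values" point concrete, here is a
rough user-space sketch. It is only a model: model_vcache, model_inode,
model_attach_inode and model_invalidate_all are hypothetical names made
up for illustration, not OpenAFS or Linux kernel structures. The idea it
shows is that a caller which today touches the single inode behind a
vcache would instead have to walk every inode attached to it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; not the real struct vcache or struct inode. */
struct model_inode {
    int ino;                        /* made-up inode number */
    int stale;                      /* set when cached state must be dropped */
    struct model_inode *next_alias; /* next inode for the same vcache */
};

struct model_vcache {
    int fid;                        /* stand-in for the AFS file identifier */
    struct model_inode *inodes;     /* option A: a list instead of one inode */
};

/* Attach one more inode to a vcache, as a second mount path would. */
static void model_attach_inode(struct model_vcache *vc, struct model_inode *ip)
{
    assert(vc != NULL && ip != NULL);
    ip->next_alias = vc->inodes;
    vc->inodes = ip;
}

/*
 * What a former single-inode call site might turn into: visit every
 * inode for the vcache. Returns how many inodes were marked stale.
 */
static int model_invalidate_all(struct model_vcache *vc)
{
    int n = 0;
    struct model_inode *ip;

    for (ip = vc->inodes; ip != NULL; ip = ip->next_alias) {
        ip->stale = 1;
        n++;
    }
    return n;
}
```

Every such call site would also need to decide what locking protects
the list, which the sketch ignores entirely.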

So, due to the amount of work, at the very least I don't think this is a
short-term solution. But maybe it is the best thing to do...

>>  B. Reparent the multiply-mounted volume each time it is accessed
>>     via a new path.

As I've mentioned, this has problems. The details are a bit much to go
into right here, but briefly... we're not allowed to reparent something
while it's in use. So if we need to do that to perform an rmdir() or a
rename(), we get kinda stuck and have to return a bogus answer to Linux.
In some places, sanity checks will make us panic, and in others,
assumptions from the Linux code can cause us to deadlock.

>>  C. Pretend like multiple mounts aren't allowed, [...] Users would
>>  not like this.

Yeah.

>>  D. Treat every volume as a separate filesystem, like kafs does.

One of the disadvantages of this is that users can no longer 'mv'
mountpoints around; you need to have AFS knowledge to manipulate them.

Another issue is really more of an obstacle in the implementation, but
the interfaces to perform mounts and create new filesystems from within
the kernel are GPLONLY, so we can't use them. It is possible to make
afsd perform the mount from userspace (like the afsdb handler, and OS
X's userspace move helper), and in fact I have done a little work on
doing that, to see how well it can function. I believe this can work,
but it does require quite a bit more effort, and it seems pretty error
prone.

I think there are also two sub-options here: whether we bind-mount
/afs/foo/bar to /afs/.:mount, or whether we mount /afs/foo/bar as
AFS with some special option to mount a certain volume. The former is
what I was working on, just for ease of implementation; I'm not sure if
there's much of a practical difference between the two.

I feel like there are other disadvantages here; I don't think I covered
jhutz's reservations. There may be some performance concerns here, too,
once we start to access a lot of volumes at once, but I'm not sure how
much of a problem that is.

>>  E. Present additional mount points to the same volume as symbolic
>>     links. [...]
>>  F. Present _all_ mount points as symbolic links, pointing at paths
>>     in /afs/.:mount.

I think presenting these as actual symlinks is a no-go, since it's quite
a big user-visible change. With F, this also makes '..' no longer work
"correctly" ever. With E, it makes '..' not work correctly when there
are multiple mount points; I think that's more acceptable, since I don't
think they've ever worked correctly 100% of the time on Linux.

However, it is possible on Linux to have a directory, but give it a
follow_link function, so it behaves kinda like a symlink in that it gets
dereferenced before being accessed, and lets you point at an arbitrary
dentry like a regular symlink. kafs does this, and just mounts the
volume on the dir as the dereferencing operation.

I think it may work pretty well to have the first mtpt access appear as
a normal dir. Then any other mountpoints to the same volume appear as
dirs, but they dereference to the first mountpoint. So, option E, but
they appear to be dirs to user applications. '..' works for the cases
where you only access through a single mountpoint, and for other
accesses, it 'breaks' by pointing you to that original mountpoint.
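
The "first mountpoint wins" rule can be sketched in user space as
follows. This is only a model under assumptions of mine: the table,
its fixed sizes, and the names model_canon and model_resolve_mtpt are
hypothetical, and nothing here touches dentries or follow_link. It just
shows the bookkeeping: the first path seen for a volume becomes the
canonical one, and any later mountpoint for the same volume resolves to
that path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_VOLS 16   /* arbitrary limits for the sketch */
#define MODEL_PATH_LEN 128

/* Hypothetical record: which path was first used to reach a volume. */
struct model_canon {
    unsigned int volume;
    char path[MODEL_PATH_LEN];
};

static struct model_canon model_table[MODEL_MAX_VOLS];
static int model_count;

/*
 * Return the path that a mountpoint at `path` for `volume` resolves to.
 * The first access claims the canonical slot; later accesses through
 * other mountpoints get the canonical path back. Returns NULL if the
 * table is full or the path does not fit.
 */
static const char *model_resolve_mtpt(unsigned int volume, const char *path)
{
    int i;

    assert(path != NULL);
    for (i = 0; i < model_count; i++) {
        if (model_table[i].volume == volume)
            return model_table[i].path;
    }
    if (model_count == MODEL_MAX_VOLS || strlen(path) >= MODEL_PATH_LEN)
        return NULL;
    model_table[model_count].volume = volume;
    strcpy(model_table[model_count].path, path);
    return model_table[model_count++].path;
}
```

In this model, '..' from inside the volume would always lead back to
the parent of the canonical path, which matches the 'breaks' behavior
described above for the non-first mountpoints.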

I've been looking a bit today at implementing this approach; it seems
doable with the only non-Linux changes being a couple of small
interface changes to afs_lookup.

I welcome any comments or thoughts on this general subject or any
particulars up there, or attempts at any implementations. Especially
from people with more Linux VFS internals experience than me :)

-- 
Andrew Deason
adeason@sinenomine.net