Bug#143111: [OpenAFS-devel] Testing GNU findutils on AFS.... please!

Thu, 20 Mar 2008 20:06:38 -0400 (EDT)

On Thu, 20 Mar 2008, James Youngman wrote:

> This property is normally honoured by Unix (file-) systems because
> filesystems can only be mounted on subdirectories in any case.  Here's
> an example:

Right, so the link count of the parent counts the mounted-on directory,
not what's mounted there.

> That's not generally true for the reason illustrated above, but I'll
> assume your statement was about AFS specifically.

Correct.

> > You can't tell in advance which kind you'll see, so a heuristic
> >  like checking to see if the rule applies to the first directory you
> >  examine, which probably works fairly well for other filesystems, won't work
> >  with AFS.
>
> So is it the case that I can mount an AFS filesystem at
> /afs/mumble/foo/bar without there previously existing a directory
> /afs/mumble/foo?

Correct, in a way, but your phrasing suggests a conceptual disconnect.
One doesn't generally think of "an AFS filesystem", but of "_the_ AFS
filesystem" -- there is conceptually a single, global filesystem, which is
normally mounted at /afs.

An AFS mount point isn't a dynamic, client-side thing like a normal UNIX
filesystem mount; it is an object _in_ the filesystem which refers to
another volume.  This is presented mostly-transparently to userland as
just another directory.  Part of the "mostly" is the link count currently
under discussion.  The rest is a set of AFS-specific interfaces for
examining and manipulating mount points explicitly (*).

Note that David Howells has an alternate AFS client implementation, one
version of which is distributed with the Linux kernel.  He handles mount
points by presenting them to userland as empty directories, allowing any
AFS volume to be mounted on any directory, and providing automounter-like
functionality to trigger mounts as mount points are traversed.  However, I
have no idea whether he adjusts the link counts of the containing
directories to reflect mount points.  I would suspect not, since that
would mean that satisfying stat(2) on a directory would require fetching
metadata for all of its contents from the fileserver.

> At the moment, find (oldfind in 4.2.x and 4.3.x) relies on examining
> the results of stat(2) to figure out if it should turn off the leaf
> optimisation.  It makes this determination for every directory it
> searches.   Supposing find knows that AFS may be in use somewhere on
> the system, what is the highest performance way of determining if the
> link-count assumption will hold immediately within that directory?

I can't think of an efficient way.  It may be helpful to know that inode
numbers in AFS are always odd for true directories, and even for
everything else, including mount points (this began as an artifact of the
server implementation, but too much depends on it for it to change now).
Note that while a volume root is a directory, it is presented with the
inode number of the mount point.

Unfortunately, I'm not sure how that helps you.

> Is it feasible for example to assume that directories not
> (canonically) beginning with /afs/ (or matching the regex ///+afs/)
> simply cannot be on an AFS filesystem?

On a system running OpenAFS, it is generally safe to assume that AFS is
mounted in only one place, because we generally can't do anything else.
There was code for a while to allow multiple mounts on MacOS X, but it
used an approach that ISTR was unsafe to begin with and impossible in
later OS releases, so I don't believe it exists any longer.

On a system running OpenAFS, it is also safe to assume that all files in
AFS will have the same value for st_dev.

Unfortunately, these assumptions don't always hold on other clients.
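
On an OpenAFS client, the two assumptions above could be combined into a
check along these lines.  This is a minimal sketch: the function name is
hypothetical, the AFS root is passed in by the caller (conventionally
"/afs"), and it is not expected to be reliable on other clients.

```c
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of OpenAFS or findutils.
 *
 * Compares st_dev of `path` against st_dev of `afs_root`.
 * Returns 1 if they match, 0 if they differ, -1 if either stat call fails.
 * Assumes an OpenAFS client, where AFS is mounted in one place and every
 * file in AFS reports the same st_dev.
 */
static int shares_device_with(const char *path, const char *afs_root)
{
    struct stat root_st;
    struct stat path_st;

    if (stat(afs_root, &root_st) != 0)
        return -1;
    /* lstat so that a symlink is judged by where it lives, not its target */
    if (lstat(path, &path_st) != 0)
        return -1;

    return root_st.st_dev == path_st.st_dev ? 1 : 0;
}
```

For example, shares_device_with(dir, "/afs") could be evaluated once per
directory before deciding whether to trust the link count.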

On many platforms, you can expect the reported filesystem type to be
something meaningful.  I don't know whether this is the case for all
platforms or not.

> Now that I think about it, it would also be helpful to know what
> common Linux AFS clients put in struct dirent.d_type for AFS
> filesystem objects (files, directories, ...).

When we know the type, we fill it in.  Directories and mount points get
DT_DIR; files get DT_REG.  Currently symlinks always get DT_UNKNOWN, which
I consider a bug; objects whose type is genuinely not known also get
DT_UNKNOWN.

> How about other Unix
> systems which support both AFS and d_type?

MacOS X uses the same logic as Linux, described above.
It doesn't look like OpenAFS supports d_type on other platforms.

> I also understand that AFS
> ACLs can sometimes allow readdir() to return a directory entry
> without it being possible to lstat(2) said directory item.  Is this
> the case?

Yes.  It's possible to get an entry that corresponds to a directory on
which you have no permissions, or a mount point pointing at a volume that
doesn't exist, or for a FetchData call to return an error in response to
some problem on the fileserver.

> What goes into d_type for such items?

The same as in any other case - if we know the type, it gets filled in;
otherwise it gets set to DT_UNKNOWN.  Note that the type of a filesystem
object is immutable (so we don't care about cache freshness) and is not
considered to be covered by access controls, so the AFS client may provide
type information via d_type even when lstat(2) would fail.  On the other
hand, we won't make a call to the fileserver just to fill in d_type, so
it's possible to get DT_UNKNOWN when lstat(2) would succeed.  Note however
that if you follow the usual pattern of calling readdir and statting every
file in the directory, the stat results will usually be cached by the time
you ask for them (not always - we do this in the background, so you might
ask for something before we have it).

(*) There is often a third-party library available, either as -lkrbafs or
as -lkafs, which provides a fairly low-level but portable interface to
AFS-specific system calls.  There is also a set of libraries that come
with OpenAFS which could be used to do the same thing; these will work in
more situations (for example, across an AFS/NFS translator, if the NFS
client is configured correctly), but pull in a lot more code.  Feel free
to contact me offline if you want more information about these.

-- Jeff