[OpenAFS-devel] Are readdir() results in afs sorted?

Jeffrey Hutzelman jhutz@cmu.edu
Fri, 22 Oct 2004 12:31:31 -0400

On Friday, October 22, 2004 11:53:51 +0200 Martin MOKREJ=A9=20
<mmokrejs@ribosome.natur.cuni.cz> wrote:

>   while thinking which underlying filesystem to pickup for afs server
> partition, I came across ext2/3 email thread where is suggested to take
> advantage of tune2fs -O dir_index /dev/xyz
> using htrees, one needs first to do readdir(), then sort the
> files by inode number before trying to open or stat them.

I think there are several possible things you could be asking here, and=20
it's not clear which it is, so I'll try to answer them all.  It is=20
important to remember through this discussion that the structure of the=20
contents of AFS is not reflected in the structure of the data stored on=20
vice partitions.  AFS internally tracks files by a file ID (FID), which=20
consists of a volume ID, vnode number, and uniqifier.  The cache manager=20
refers to files and directories by their FID's, and the data stored on a=20
vice partition is arranged to make it easy for the fileserver to find a=20
file by its FID.

(1) Will using htree directories on a vice partition help to
    improve the performance of the fileserver?

Probably, though to what extent is unclear.  The fileserver arranges data=20
on vice partitions in a tree-like structure designed to limit the number of =

files in any one directory, to keep the lookup times down (because most=20
traditional filesystems store directories as a list of entries, lookups and =

insertions require a linear search and can be quite slow on directories=20
with many entries).  Using htrees may or may not have a noticeable effect=20
on the size directories used by the fileserver.

(2) Does the fileserver need to do a sort-after-readdir if you use htree
    directories on vice partitions?

No.  The comments about sorting in the threads you mentioned refer to=20
applications which already open a directory, read its entries using=20
readdir, and then stat or open every entry in the directory in the order in =

which they are returned by readdir.  For such applications, this technique=20
is reasonably fast on traditional filesystems, because readdir returns=20
directory entries in the order in which they occur in the directory, which=20
is approximately the order in which they were created.  This often=20
corresponds to the placement of inodes on the disk, so the order returned=20
by readdir is close to the optimal order for accessing those files without=20
lots of seeks.

When you use htree directories, the behaviour of readdir changes.  It now=20
traverses each hash bucket, returning entries as it finds them.  This is=20
_not_ the same as the order in which entries were created, and is nowhere=20
near to the optimal order for accessing the corresponding files.  In fact,=20
since entries in any given hash bucket are likely to be chained in the=20
order in which they are created, you are likely to seek across the disk as=20
many times as there are hash buckets, which is far from optimal.

Because of this, it is recommended that applications which scan entire=20
directories (like programs reading a maildir) read the entire list of files =

from readdir and then sort the results by inode number before starting to=20
access files.  On any filesystem whose structure is similar to that of ext2 =

(which, really, is most filesystems), this will make a big difference when=20
directories are hash tables, and will not generally have ill effects when=20
they are not.

However, all of this applies only to programs that use readdir to get a=20
list of files in a directory, and then access every file (or almost every=20
file).  It does not apply to programs like the AFS fileserver, which know=20
the name of the file they care about.  In other words, the advice Ted is=20
giving is not "use readdir"; it is "if you use readdir, sort the results".

(3) When you call readdir on a directory in AFS, do you get results in
    any particular order?  Is it useful for applications to sort
    readdir results by inode number for directories in AFS?

No, and probably not.  AFS's directories have always been hash tables, so=20
lookups by name are fast even on directories containing many files.=20
However, unlike ext2 with htree directories, AFS does not implement readdir =

by scanning down each hash bucket.  Instead, it scans the directory from=20
start to finish, returning the entries in the order in which they are=20
found, just like traditional filesystems.  Because the structure of AFS=20
directories requires that each entry live entirely within a single page,=20
large entries are sometimes allocated later in the directory even when=20
there is some space in an earlier page.  This means that as with=20
traditional filesystems, the order in which files are returned will not be=20
exactly the order in which they were created, but on directories with low=20
turnover it will be reasonably close.

For a traditional filesystem, sorting by inode number helps whether or not=20
directories are hash tables, because inode numbers correlate well with the=20
physical location of each file on the disk.

In AFS, this is not the case.  The inode numbers returned by AFS reflect=20
the vnode number of the file (among other things), which is the location of =

the file in a per-volume index maintained by the fileserver.  Entries in=20
vnode indexes are used on a first-available basis, and vnode numbers have=20
no relation to the physical location of the file on the server's disk.  So, =

sorting by inode number does not particularly help you.  However, it=20
probably won't hurt either, so don't feel you need to special-case AFS in=20
an application that sorts directories in order to get better performance=20
from ext2 with htrees.

(4) Will sort-after-readdir help on other filesystems?

On a filesystem with traditional structure, sort-after-readdir will help=20
significantly if readdir returns entries in random order, and may help=20
somewhat if readdir returns entries in the order in which they are located=20
in the directory.  I would expect this to cover ext2/ext3, ffs/ufs, and=20
probably xfs as well.

I don't know a lot about the structure of reiserfs, but it is not a=20
traditionally-structured filesystem and sorting directories by inode number =

is probably not helpful.  I can't say whether it would hurt.

-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
   Sr. Research Systems Programmer
   School of Computer Science - Research Computing Facility
   Carnegie Mellon University - Pittsburgh, PA