[OpenAFS-devel] What is needed to build an AFS fileserver on top of BTRFS?

Jeffrey Hutzelman jhutz@cmu.edu
Tue, 17 Dec 2013 13:37:21 -0500


On Tue, 2013-12-17 at 16:53 +0000, David Howells wrote:
> It has occurred to me and others that something like BTRFS could be a good fit
> to build an AFS fileserver directly on top of.  The question is what facilities
> would be needed from BTRFS to make this work?
> 
> So I thought I'd kick off a shopping list;-)
> 
>  (1) 64-bit data version numbers that increase monotonically with each write.
> 
>      Yes, this is likely to cause some performance degradation as it introduces
>      an ordering over data writes and metadata writes to a file.  Maybe writes
>      can be batched to improve performance?

Yes.  You need a distinct version number for each version of the file
that is visible to any client, but intermediate versions never seen by
any client do not need separate versions.  However, note that whenever a
client does a successful write, it modifies its local copy of the file and
assigns it the next version, so each write RPC must result in a new version
of the file.  There are some other complications here, but it's probably not
impossible to design a filesystem-provided version which can be used as
the AFS data version.
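
To make the versioning rule concrete, here is a minimal sketch in Python.
The class and method names are made up for illustration and are not part
of OpenAFS or btrfs; the only point is that all writes carried by one RPC
share a single new version, while every RPC gets a version of its own.

```python
class FileVersion:
    """Hypothetical model: the data version (DV) bumps once per store RPC."""

    def __init__(self):
        self.data = bytearray()
        self.data_version = 0  # 64-bit in the proposal; Python ints are unbounded

    def store_rpc(self, writes):
        """Apply every (offset, payload) write from one RPC, then bump the DV once."""
        for offset, payload in writes:
            end = offset + len(payload)
            if end > len(self.data):
                # Grow the file with zero bytes so the slice assignment fits.
                self.data.extend(b"\0" * (end - len(self.data)))
            self.data[offset:end] = payload
        # States in the middle of this RPC were never visible to any client,
        # so they do not need versions of their own.
        self.data_version += 1
        return self.data_version
```

A filesystem that batches writes internally would still have to expose
at least one version step per RPC for this to line up with what the
client assumes.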


>  (2) Storage for ACLs and AFS UIDs.  Having shareable ACLs might also
>      be useful.
> 
>      Xattrs would likely do for this.


I'm a bit confused about whether you're talking about btrfs as a storage
backend for, say, OpenAFS, or btrfs as the complete on-disk volume
representation.  In particular, OpenAFS storage backends needn't provide
storage for AFS-level vnode metadata, because that is stored in the
vnode indices.  They really only need to provide storage for the few
pieces of "key" data (volume/vnode/uniq) that bind a storage-layer inode
to the corresponding AFS-layer vnode, plus the DV.
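
As a rough sketch of how small that "key" data is, the following packs
it into a fixed-size blob of the sort that could live in an xattr.  The
layout is an assumption made up for this example (32-bit volume ID,
vnode number, and uniquifier, plus a 64-bit DV, in network byte order);
it is not an existing OpenAFS or btrfs format.

```python
import struct

# Hypothetical layout: volume, vnode, uniq as 32-bit unsigned, DV as 64-bit unsigned.
KEY_FORMAT = struct.Struct("!IIIQ")


def pack_key(volume, vnode, uniq, data_version):
    """Serialize the binding between a storage inode and an AFS vnode."""
    return KEY_FORMAT.pack(volume, vnode, uniq, data_version)


def unpack_key(blob):
    """Return (volume, vnode, uniq, data_version) from a packed blob."""
    return KEY_FORMAT.unpack(blob)
```

The whole record comes to 20 bytes per inode.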

On the other hand, if you're looking to provide the complete on-disk
representation, then you need to be able to "name" inodes by
(volume/vnode/uniq) instead of by a filename.  The fileserver needs to
be able to specify those properties, instead of a name, when creating an
inode (unless you're doing directory management), and it needs to be
able to look up inodes by that same tuple, efficiently, even if you
_are_ doing directory management.  Also bear in mind that the OpenAFS
fileserver's current on-disk directory representation is also the
on-the-wire representation, so even if you store directories in some
other way, it must be possible to produce the AFS protocol version
efficiently.

Also, if the fileserver is managing directories and/or volume cloning,
it needs to be able to manipulate the link count on inodes, including
having inodes be automatically deleted when the fileserver decrements
the link count to zero.
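
A toy model of the interface described in the last two paragraphs might
look like the sketch below.  Everything here (the class, its method
names, the in-memory dict) is hypothetical; a real backend would keep
this in on-disk metadata.  It only shows the three operations: create by
tuple, look up by tuple, and adjust the link count with automatic
deletion at zero.

```python
class InodeStore:
    """Hypothetical store whose inodes are named by (volume, vnode, uniq)."""

    def __init__(self):
        self._inodes = {}  # (volume, vnode, uniq) -> {"data": bytes, "nlink": int}

    def create(self, volume, vnode, uniq, data=b""):
        key = (volume, vnode, uniq)
        if key in self._inodes:
            raise FileExistsError(key)
        self._inodes[key] = {"data": data, "nlink": 1}
        return key

    def lookup(self, volume, vnode, uniq):
        # Raises KeyError if no such inode exists.
        return self._inodes[(volume, vnode, uniq)]

    def inc_link(self, key):
        self._inodes[key]["nlink"] += 1
        return self._inodes[key]["nlink"]

    def dec_link(self, key):
        inode = self._inodes[key]
        inode["nlink"] -= 1
        if inode["nlink"] == 0:
            # The inode goes away as soon as the last reference is dropped.
            del self._inodes[key]
        return inode["nlink"]
```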


>  (4) A 32-bit vnode number and 32-bit vnode uniquifier/generation number.
> 
>      These don't necessarily have to be stored by BTRFS directly but could
>      instead be in a separate database file that gets snapshotted also.

It turns out to make integrity checking and recovery a lot easier if the
volume/parent ID, vnode number, uniquifier, and data version are part of
the filesystem metadata, rather than being stored in a separate file.
This ensures that they don't get separated in some bad way during
filesystem repair, making it difficult/impossible to match
storage-filesystem-layer inodes with the corresponding AFS-layer vnodes.
We live with it today, because we have little choice with modern
filesystems that don't give us any place to store metadata, but if a
filesystem is going to be specifically designed to serve as a more
efficient/reliable AFS storage backend, it should store this sort of
thing in the filesystem metadata.
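
To illustrate why per-inode metadata helps recovery, here is a sketch of
the kind of cross-check it makes possible.  The data shapes are
assumptions for this example only (inodes as dicts, the vnode index as a
dict keyed by volume and vnode number); this is not how the OpenAFS
salvager is actually structured.

```python
def cross_check(inodes, vnode_index):
    """Compare identity stored on each inode against the vnode index.

    inodes: iterable of dicts with 'volume', 'vnode', 'uniq', 'dv' keys.
    vnode_index: dict mapping (volume, vnode) -> (uniq, dv).
    Returns (orphans, stale, missing).
    """
    orphans, stale, seen = [], [], set()
    for ino in inodes:
        key = (ino["volume"], ino["vnode"])
        entry = vnode_index.get(key)
        if entry is None or entry[0] != ino["uniq"]:
            # No vnode claims this inode (or a different generation does).
            orphans.append(ino)
            continue
        seen.add(key)
        if entry[1] != ino["dv"]:
            # Identity matches, but the data versions disagree.
            stale.append(ino)
    # Vnodes in the index with no backing inode at all.
    missing = sorted(set(vnode_index) - seen)
    return orphans, stale, missing
```

With the identity carried on the inode itself, each of these cases can
be detected from a plain scan, even if a separate mapping file has been
lost or damaged.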


>  (5) The ability to set the vnode number, vnode uniquifier and data version
>      number to specific values.  Necessary to clone volumes and restore
>      volume dumps.

Well, if the filesystem is going to handle volume snapshotting, then you
don't need to do clones, and vice versa.  However, yes, to handle
restores and some other operations, you need to be able to set the
volume ID, vnode number, and uniquifier of an inode -- but only when
_creating_ the inode; these properties are immutable once an inode is
created.
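
In code terms, the constraint is roughly the one sketched below: the
caller picks the identity when the inode is created (for example, while
restoring a dump), and nothing can change it afterwards.  The class
names are hypothetical, and leaving the data version mutable is my
assumption for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InodeIdentity:
    """Identity chosen by the caller at creation time; immutable afterwards."""
    volume: int
    vnode: int
    uniq: int


class Inode:
    def __init__(self, volume, vnode, uniq, data_version=0):
        self._identity = InodeIdentity(volume, vnode, uniq)
        self.data_version = data_version  # assumed to remain mutable

    @property
    def identity(self):
        # Read-only: there is deliberately no setter.
        return self._identity
```

Attempts to assign to `identity` or to any of its fields raise
AttributeError, so a restore has to supply the right values up front.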


-- Jeff