[OpenAFS] fine-grained incrementals?

Mike Fedyk mfedyk@matchmail.com
Fri, 25 Feb 2005 13:14:08 -0800


Jeffrey Hutzelman wrote:

>
>
> On Wednesday, February 23, 2005 05:13:52 PM -0800 Mike Fedyk 
> <mfedyk@matchmail.com> wrote:
>
>> Jeffrey Hutzelman wrote:
>>
>>> AFS does copy-on-write at the per-vnode layer.  Each vnode has
>>> metadata which is kept in the volume's vnode indices; among other
>>> things, this includes the identifier of the physical file which
>>> contains the vnode's contents (for the inode fileserver, this is an
>>> inode number; for namei it's a 64-bit "virtual inode number" which can
>>> be used to derive the filename). The underlying inode has a link count
>>> (in the filesystem for inode; in the link table for namei) which
>>> reflects how many vnodes have references to that inode.  When you
>>> write to a vnode whose underlying inode has more than one reference,
>>> the fileserver allocates a new one for the vnode you're writing to,
>>> and copies the contents.
>>
>>
>> OK, I get it now.  An inode fileserver uses the link count on the
>> underlying filesystem (ext3 for instance), and a namei server uses a
>> large file (or possibly block device) with an AFS specific filesystem
>> format.  Is that right?
>
>
> Not quite.  Both inode and namei fileservers store their data in 
> individual files on the local filesystem.  Each local file corresponds 
> to the contents of one vnode (file, directory, or symlink) in the AFS 
> filesystem, or to some particular kind of per-volume metadata (a 
> volume header or vnode index).  The difference between the two backends 
> lies largely in how those files are located by the fileserver.
>
> In an inode fileserver (the traditional model), the vnode index 
> contains the inode numbers of the underlying files for each vnode; the 
> inode numbers of the indices themselves are stored in the volume 
> header (the Vxxx.vol files at the top level of each vice partition).  
> These inodes have no regular directory entries which point to them; 
> they are manipulated via a set of special system calls provided by the 
> AFS kernel module.  In this model, the link counts on the underlying 
> inodes reflect the number of vnodes referring to that inode; when the 
> link count is decremented to zero, the inode is automatically freed by 
> the normal kernel filesystem code.
>
> In a namei fileserver, the underlying files are normal files in the 
> filesystem.  The vnode indices contain virtual "inode numbers" which 
> are used to compute the file's actual filename; we then open the files 
> by name. Since these are normal files on an unmodified local 
> filesystem, their link counts in the underlying filesystem represent 
> the number of actual links to them, which is always 1.  Information 
> about how many vnodes are using that file is stored in the "link 
> table", which is an additional per-volume metadata file.  This is the 
> only backend currently available on Linux.
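
If I follow, then every write has to start with a check something like
this (a compilable toy sketch -- the structures and the function
prepare_for_write() are my names for the idea, not actual OpenAFS code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins for the real structures. */
struct inode_ref {
    int  linkcount;      /* how many vnodes share this inode */
    char contents[64];   /* stand-in for the file data       */
};

struct vnode {
    struct inode_ref *data;   /* what the vnode index records */
};

/* Break the sharing left over from a clone before modifying a vnode:
 * allocate a fresh inode, copy the *entire* contents, and repoint
 * this vnode at the copy.  The clone keeps the old inode. */
static void prepare_for_write(struct vnode *v)
{
    if (v->data->linkcount > 1) {
        struct inode_ref *fresh = malloc(sizeof *fresh);
        memcpy(fresh->contents, v->data->contents,
               sizeof fresh->contents);
        fresh->linkcount = 1;
        v->data->linkcount--;
        v->data = fresh;
    }
}

int main(void)
{
    struct inode_ref shared = { 2, "original data" };  /* live + clone */
    struct vnode live = { &shared };

    prepare_for_write(&live);
    strcpy(live.data->contents, "new data");

    printf("live vnode:  %s\n", live.data->contents);  /* new data      */
    printf("clone vnode: %s\n", shared.contents);      /* original data */
    free(live.data);
    return 0;
}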

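And the namei lookup I'd picture roughly like this (another toy sketch;
the AFSIDat layout and the bit-slicing below are my guesses at the
flavor of it -- the real OpenAFS encoding and directory hashing are
certainly different):

#include <stdio.h>

/* Turn a 64-bit "virtual inode number" into a pathname.  No extra
 * lookup state is needed: the name is a pure function of the number.
 * The top bits pick subdirectory levels so that no single directory
 * gets enormous. */
static void namei_path(char *buf, size_t len, unsigned long long vino)
{
    snprintf(buf, len, "/vicepa/AFSIDat/%02llx/%02llx/%012llx",
             (vino >> 56) & 0xff,
             (vino >> 48) & 0xff,
             vino & 0xffffffffffffULL);
}

int main(void)
{
    char path[128];
    namei_path(path, sizeof path, 0x0123456789abcdefULL);
    printf("%s\n", path);   /* /vicepa/AFSIDat/01/23/456789abcdef */
    return 0;
}
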
It looks like namei fileservers can make clones much faster, since they 
only need to update the link table, which lives in a single file rather 
than being spread over the filesystem.  It's not as clean, though, since 
the link counting duplicates functionality between AFS and the 
underlying filesystem.  On the other hand, it would work on filesystems 
that don't have hard links at all (FAT16/32 for example -- though I 
wouldn't recommend that configuration...)
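
In other words, I'd picture a whole-volume clone as little more than
this (toy model; the real link table is a packed per-volume file on
disk, not an in-memory array):

#include <stdio.h>

#define NVNODES 4

/* Toy per-volume link table: one count per underlying data file. */
static int link_table[NVNODES] = { 1, 1, 1, 1 };

/* A clone copies no file data; every vnode in the new volume just
 * references the same underlying file, so the whole operation is one
 * pass over the (single) link table file. */
static void clone_volume(void)
{
    int i;
    for (i = 0; i < NVNODES; i++)
        link_table[i]++;
}

int main(void)
{
    int i;
    clone_volume();
    for (i = 0; i < NVNODES; i++)
        printf("file %d: %d referencing vnodes\n", i, link_table[i]);
    return 0;
}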

Also, this removes much of the usefulness of a clone for fine-grained 
backups, since the COW is done at the vnode/file level instead of at 
some internal block level (which I thought AFS had, but it turns out it 
doesn't).
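
Putting rough numbers on it (made-up sizes, comparing against the
hypothetical block-level COW that AFS turns out not to have):

#include <stdio.h>

int main(void)
{
    /* Append 100 bytes to a 2 GB file that is shared with a clone. */
    long long file_size  = 2LL * 1024 * 1024 * 1024;
    long long block_size = 4096;
    long long write_size = 100;

    /* Vnode/file-level COW (what AFS does): copy the whole file. */
    long long file_level = file_size;

    /* Block-level COW (what I had assumed): copy only touched blocks. */
    long long blocks = (write_size + block_size - 1) / block_size;
    long long block_level = blocks * block_size;

    printf("file-level COW duplicates:  %lld bytes\n", file_level);
    printf("block-level COW duplicates: %lld bytes\n", block_level);
    return 0;
}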

>
> There is no fileserver backend which stores data in a large file or 
> directly to a block device, and there never has been.  Such a thing 
> would be possible, but it's not clear that it would be superior to the 
> existing backends.

Never mind.  I had seen so much talk from the Coda people (Coda forked 
from AFS a while ago) about using a separate partition that I presumed 
it held the data too; it turns out to be only a large metadata file.

Mike