[OpenAFS] fine-grained incrementals?

Jeffrey Hutzelman jhutz@cmu.edu
Wed, 23 Feb 2005 18:26:02 -0500


On Wednesday, February 23, 2005 01:08:29 PM -0800 Mike Fedyk 
<mfedyk@matchmail.com> wrote:

> Jeffrey Hutzelman wrote:
>
>> On Wednesday, February 23, 2005 11:44:17 AM -0800 Mike Fedyk
>> <mfedyk@matchmail.com> wrote:
>>
>>>  1) r/w volumes pause all activity during a replication release
>>>  2) backups need to show only the blocks changed since the last backup
>>> (instead of entire files)
>>>  3) (it seems) volume releases copy the entire volume image instead of
>>> the changes since the last release
>>>
>>> It looks like all of these issues would be solved in part or completely
>>> with the introduction of "Multiple volume versions" (as listed on the
>>> openafs.org web site under projects).
>>>
>>> 1) would be solved by creating a clone before a release and releasing
>>> from that.
>>
>>
>> That already happens.  But the cloning operation takes time, and the
>> source volume is indeed busy during that time.
>
> Interesting.  Is there documentation on the AFS format and how it does
> COW?  I'm familiar with Linux LVM, so it should use similar concepts,
> except that doing COW at the filesystem level can be more powerful and
> complicated than at the block device layer.
>
> In LVM basically a snapshot/clone just requires a small volume for block
> pointers, and incrementing the user count on the PEs (physical extents).
> How does AFS do this, and why is it taking a noticeable amount of time
> (also what is the AFS equivalent of PE block size)?

There is not particularly much documentation on the on-disk structure of 
AFS fileserver data; you pretty much need to UTSL.

AFS does copy-on-write at the per-vnode layer.  Each vnode has metadata 
which is kept in the volume's vnode indices; among other things, this 
includes the identifier of the physical file which contains the vnode's 
contents (for the inode fileserver, this is an inode number; for namei it's 
a 64-bit "virtual inode number" which can be used to derive the filename). 
The underlying inode has a link count (in the filesystem for inode; in the 
link table for namei) which reflects how many vnodes have references to 
that inode.  When you write to a vnode whose underlying inode has more than 
one reference, the fileserver allocates a new inode for the vnode you're 
writing to and copies the contents.
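
To make that concrete, here is a minimal, self-contained C sketch of the 
bookkeeping just described.  The type and field names (data_file, 
vnode_meta, link_count) are invented for the example and are not the ones 
used in the OpenAFS source; the only point is the link-count check that 
triggers the copy on a write.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Toy model of a physical data file ("inode") shared by vnodes. */
    struct data_file {
        int  link_count;          /* how many vnodes reference this file */
        char contents[64];
    };

    /* Toy model of the per-vnode metadata kept in a vnode index. */
    struct vnode_meta {
        struct data_file *data;   /* identifier of the underlying file */
    };

    /* Copy-on-write: if the underlying file is shared, give this
     * vnode its own copy before modifying it. */
    static void vnode_write(struct vnode_meta *v, const char *newtext)
    {
        if (v->data->link_count > 1) {
            struct data_file *copy = malloc(sizeof *copy);
            copy->link_count = 1;
            strcpy(copy->contents, v->data->contents);
            v->data->link_count--;   /* drop our reference to the old file */
            v->data = copy;          /* point the vnode at its private copy */
        }
        strcpy(v->data->contents, newtext);
    }

    int main(void)
    {
        struct data_file shared = { .link_count = 2, .contents = "original" };
        struct vnode_meta rw  = { .data = &shared };  /* vnode in the R/W volume */
        struct vnode_meta bak = { .data = &shared };  /* same file, seen by a clone */

        vnode_write(&rw, "modified");             /* triggers the copy */
        printf("rw:  %s\n", rw.data->contents);   /* modified */
        printf("bak: %s\n", bak.data->contents);  /* original */
        return 0;
    }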

A cloned volume has its own vnode indices.  The cloning process basically 
involves creating new indices and incrementing the link count on all of the 
underlying inodes.  Unfortunately, usually you are either updating or 
removing an existing clone, which means decrementing the link counts on the 
inodes referenced by all of its vnodes, and possibly actually freeing the 
associated data.  On a
volume with lots of files, this turns out to be time-consuming.
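
A similarly hedged sketch of why that work is proportional to the number of 
files: destroying or replacing a clone walks its entire vnode index, 
decrementing the link count of each underlying data file and freeing any 
file whose count reaches zero, and building a clone walks the index again 
to bump every count back up.  The structures below are toy stand-ins, not 
the real OpenAFS ones.

    #include <stdio.h>

    #define NFILES 5

    /* Toy link table: one counter per underlying data file. */
    static int link_count[NFILES];

    /* A clone is just an index mapping its vnodes to underlying files. */
    struct clone {
        int nvnodes;
        int file_of_vnode[NFILES];
    };

    /* Removing (or replacing) a clone touches every vnode it contains:
     * each underlying file's link count is decremented, and files whose
     * count reaches zero are freed.  This per-file work is what makes
     * re-cloning a large volume slow. */
    static void clone_destroy(struct clone *c)
    {
        for (int i = 0; i < c->nvnodes; i++) {
            int f = c->file_of_vnode[i];
            if (--link_count[f] == 0)
                printf("freeing data file %d\n", f);
        }
    }

    /* Creating a clone also touches every vnode: copy the index entry
     * and increment the link count on the underlying file. */
    static void clone_create(struct clone *dst, const struct clone *src)
    {
        dst->nvnodes = src->nvnodes;
        for (int i = 0; i < src->nvnodes; i++) {
            dst->file_of_vnode[i] = src->file_of_vnode[i];
            link_count[src->file_of_vnode[i]]++;
        }
    }

    int main(void)
    {
        struct clone rw = { NFILES, {0, 1, 2, 3, 4} };
        for (int i = 0; i < NFILES; i++)
            link_count[i] = 1;        /* referenced by the R/W volume only */

        struct clone backup;
        clone_create(&backup, &rw);   /* O(files): bump every link count */
        clone_destroy(&backup);       /* O(files): drop every link count */
        return 0;
    }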


>>> 2) would be solved by creating a clone each time there is a
>>> backup and comparing it to the previous backup clone.  and 3) would be a
>>> similar process with volume releases.
>>
>>
>> This is not a bad idea.  Of course, you still have problems if the
>> start time for the dump you want doesn't correspond to a clone you
>> already have, but that situation can probably be avoided in practice.
>
> Yes, and that is why I said it would be a specific and separate clone
> just for the incremental backups, so the timing isn't based on the backup
> clone for instance.

Yes, but that doesn't necessarily solve the problem.  Real backup scenarios 
may involve multiple levels of backups, which necessitates multiple clones. 
And, situations can arise in which the clone you have does not have 
precisely the right date; for example, the backup system may lose a dump 
for some reason (a lost or damaged tape, a backup machine crash before 
syncing the backup database to disk, etc.).  In a well-designed backup 
system, these cases should be rare, but they will occur.

It should be noted that I do not consider the bu* tools that come with AFS 
to have anything to do with a well-designed backup system.

> An interesting thought would be to clone a replicated volume on another
> machine that has more, but slower space than a fileserver holding the
> active r/w volumes and running the backups from that machine to keep load
> down on the r/w volume fileserver.

An interesting idea, except that you forget how the replicated volume gets 
there - by a release process which involves creating a temporary release 
clone.

Replicated volumes are designed to provide load-balancing and availability 
for data which is infrequently written but either frequently read or 
important to have available (often both properties apply to the same data), 
like software.  They are not designed to provide backup copies of 
frequently-written data like user directories.  Within their design goals, 
they work quite well:

- volumes can be replicated over many servers to meet demand
- making the R/W unavailable for a while during a release is not a big
  deal because it is never accessed by normal users
- not having failover is unimportant because if the server containing
  your R/W volumes is down, then you have a fileserver down and should
  be fixing it instead of updating R/W volumes.



> The first step can just allow the admin to mount the clone volumes where
> they want them.  OAFS allows you to have volumes that are not connected
> to the AFS tree that users see, doesn't it?

Yes, you can have volumes that are not mounted anywhere.  The issue is how 
to name and refer to these clones.  The "normal" RO and BK clones appear in 
the same VLDB entry as the base volume, and the cache manager and other 
tools know to strip off the .backup or .readonly suffix to get the name 
they must look up in the VLDB.
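
For illustration, a small C sketch of that suffix convention; the helper 
below is invented for the example and is not the routine the cache manager 
actually uses.

    #include <stdio.h>
    #include <string.h>

    /* Strip a ".readonly" or ".backup" suffix in place to recover the
     * name of the base volume whose VLDB entry also describes the
     * clone.  Illustrative helper only; not the actual OpenAFS code. */
    static void strip_clone_suffix(char *vol)
    {
        static const char *suffixes[] = { ".readonly", ".backup" };
        size_t len = strlen(vol);

        for (int i = 0; i < 2; i++) {
            size_t slen = strlen(suffixes[i]);
            if (len > slen && strcmp(vol + len - slen, suffixes[i]) == 0) {
                vol[len - slen] = '\0';
                return;
            }
        }
    }

    int main(void)
    {
        char name[] = "user.jhutz.backup";
        strip_clone_suffix(name);
        printf("%s\n", name);         /* prints "user.jhutz" */
        return 0;
    }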


I guess what I'm trying to say is that the item in the roadmap is not "be 
able to have multiple clones of a volume", because we have had that for 
quite some time.  The roadmap item _is_ to have a user-visible volume 
snapshotting mechanism where you can find and access multiple snapshots of 
a volume.

-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
   Sr. Research Systems Programmer
   School of Computer Science - Research Computing Facility
   Carnegie Mellon University - Pittsburgh, PA