[OpenAFS] fine-grained incrementals?
Jeffrey Hutzelman
jhutz@cmu.edu
Wed, 23 Feb 2005 18:26:02 -0500
On Wednesday, February 23, 2005 01:08:29 PM -0800 Mike Fedyk
<mfedyk@matchmail.com> wrote:
> Jeffrey Hutzelman wrote:
>
>> On Wednesday, February 23, 2005 11:44:17 AM -0800 Mike Fedyk
>> <mfedyk@matchmail.com> wrote:
>>
>>> 1) r/w volumes pause all activity during a replication release
>>> 2) backups need to show only the blocks changed since the last backup
>>> (instead of entire files)
>>> 3) (it seems) volume releases copy the entire volume image instead of
>>> the changes since the last release
>>>
>>> It looks like all of these issues would be solved in part or completely
>>> with the introduction of "Multiple volume versions" (as listed on the
>>> openafs.org web site under projects).
>>>
>>> 1) would be solved by creating a clone before a release and releasing
>>> from that.
>>
>>
>> That already happens. But the cloning operation takes time, and the
>> source volume is indeed busy during that time.
>
> Interesting. Is there documentation on the AFS format and how it does
> COW? I'm familiar with Linux LVM, so it should use similar concepts,
> except that doing COW at the filesystem level can be more powerful and
> complicated than at the block device layer.
>
> In LVM basically a snapshot/clone just requires a small volume for block
> pointers, and incrementing the user count on the PEs (physical extents).
> How does AFS do this, and why is it taking a noticeable amount of time
> (also what is the AFS equivalent of PE block size)?
There is not much documentation on the on-disk structure of AFS fileserver
data; you pretty much need to UTSL (Use The Source, Luke).
AFS does copy-on-write at the per-vnode layer. Each vnode has metadata
which is kept in the volume's vnode indices; among other things, this
includes the identifier of the physical file which contains the vnode's
contents (for the inode fileserver, this is an inode number; for namei it's
a 64-bit "virtual inode number" which can be used to derive the filename).
The underlying inode has a link count (in the filesystem for inode; in the
link table for namei) which reflects how many vnodes have references to
that inode. When you write to a vnode whose underlying inode has more than
one reference, the fileserver allocates a new inode for the vnode you're
writing to and copies the contents into it.
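To make that copy-on-write step concrete, here is a minimal C sketch of the
logic described above. The structure and function names (my_inode, my_vnode,
prepare_vnode_for_write) are illustrative stand-ins, not the actual OpenAFS
fileserver code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-ins for the real on-disk structures. */
struct my_inode {
    int  link_count;       /* how many vnodes reference this physical file */
    char *data;            /* stand-in for the file contents */
};

struct my_vnode {
    struct my_inode *ino;  /* "which physical file holds my contents" */
};

/* Called before the fileserver modifies a vnode's contents. */
void prepare_vnode_for_write(struct my_vnode *vn)
{
    if (vn->ino->link_count > 1) {
        /* A clone still shares this inode: allocate a fresh one and copy. */
        struct my_inode *fresh = malloc(sizeof(*fresh));
        fresh->link_count = 1;
        fresh->data = strdup(vn->ino->data);
        vn->ino->link_count--;   /* the clones keep the old inode */
        vn->ino = fresh;         /* the r/w vnode now owns a private copy */
    }
    /* If link_count == 1, nothing else shares the inode; write in place. */
}

int main(void)
{
    struct my_inode shared = { 2, "original contents" }; /* r/w + one clone */
    struct my_vnode rw = { &shared }, clone = { &shared };

    prepare_vnode_for_write(&rw);
    printf("r/w links: %d, clone links: %d\n",
           rw.ino->link_count, clone.ino->link_count);   /* prints 1 and 1 */
    return 0;
}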
A cloned volume has its own vnode indices. The cloning process basically
involves creating new indices and incrementing the link count on all of the
underlying inodes. Unfortunately, you are usually updating or removing an
existing clone, which means decrementing the link counts on all of its
vnodes and possibly actually freeing the associated data. On a volume with
lots of files, this turns out to be time-consuming.
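As a rough illustration of why this scales with the number of files, here is
a sketch of the two loops involved. The names (vol_index, clone_volume,
purge_clone) are hypothetical, not the real volserver code, and error
handling is omitted:

#include <stddef.h>

struct inode_ref   { int link_count; };
struct vnode_entry { struct inode_ref *ino; };
struct vol_index   { struct vnode_entry *vnodes; size_t nvnodes; };

void free_inode_data(struct inode_ref *ino) { (void)ino; /* reclaim space */ }

/* Creating a clone: the new index points at the same inodes, and every
 * underlying inode's link count is bumped, one touch per file.
 * (Assumes clone->vnodes already has room for rw->nvnodes entries.) */
void clone_volume(const struct vol_index *rw, struct vol_index *clone)
{
    for (size_t i = 0; i < rw->nvnodes; i++) {
        clone->vnodes[i].ino = rw->vnodes[i].ino;
        clone->vnodes[i].ino->link_count++;
    }
    clone->nvnodes = rw->nvnodes;
}

/* Removing (or re-creating) an existing clone: every link count comes back
 * down, and any inode that drops to zero must actually be freed. */
void purge_clone(struct vol_index *clone)
{
    for (size_t i = 0; i < clone->nvnodes; i++) {
        if (--clone->vnodes[i].ino->link_count == 0)
            free_inode_data(clone->vnodes[i].ino);
    }
    clone->nvnodes = 0;
}

Both loops touch every vnode in the volume, so even though no file data is
copied, a release on a volume with a very large number of files takes
noticeable time.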
>>> 2) would be solved by creating a clone each time there is a
>>> backup and comparing it to the previous backup clone, and 3) would be a
>>> similar process with volume releases.
>>
>>
>> This is not a bad idea. Of course, you still have problems if the
>> start time for the dump you want doesn't correspond to a clone you
>> already have, but that situation can probably be avoided in practice.
>
> Yes, and that is why I said it would be a specific and separate clone
> just for the incremental backups, so the timing isn't tied to the backup
> clone, for instance.
Yes, but that doesn't necessarily solve the problem. Real backup scenarios
may involve multiple levels of backups, which necessitates multiple clones.
And, situations can arise in which the clone you have does not have
precisely the right date; for example, the backup system may lose a dump
for some reason (lost/damaged tape; the backup machine crashed before
syncing the backup database to disk, etc). In a well-designed backup
system, these cases should be rare, but they will occur.
It should be noted that I do not consider the bu* tools that come with AFS
to have anything to do with a well-designed backup system.
> An interesting thought would be to clone a replicated volume onto another
> machine that has more (but slower) space than the fileserver holding the
> active r/w volumes, and to run the backups from that machine to keep load
> down on the r/w fileserver.
An interesting idea, except that you forget how the replicated volume gets
there - by a release process which involves creating a temporary release
clone.
Replicated volumes are designed to provide load-balancing and availability
for data which is infrequently written but either frequently read or
important to have available (often both properties apply to the same data),
like software. They are not designed to provide backup copies of
frequently-written data like user directories. Within their design goals,
they work quite well:
- volumes can be replicated over many servers to meet demand
- making the R/W unavailable for a while during a release is not a big
deal because it is never accessed by normal users
- not having failover is unimportant because if the server containing
your R/W volumes is down, then you have a fileserver down and should
be fixing it instead of updating R/W volumes.
> The first step can just allow the admin to mount the clone volumes where
> they want them. OAFS allows you to have volumes that are not connected
> to the AFS tree that users see, doesn't it?
Yes, you can have volumes that are not mounted anywhere. The issue is how
to name and refer to these clones. The "normal" RO and BK clones appear in
the same VLDB entry as the base volume, and the cache manager and other
tools know to strip off the .backup or .readonly suffix to get the name
they must look up in the VLDB.
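For concreteness, here is a small sketch of that suffix-stripping convention;
the function name vldb_base_name is hypothetical, and this is not the cache
manager's actual code:

#include <stdio.h>
#include <string.h>

/* Derive the name to look up in the VLDB from a volume name like
 * "user.jhutz.readonly" or "user.jhutz.backup". */
void vldb_base_name(const char *volname, char *base, size_t len)
{
    const char *suffixes[] = { ".readonly", ".backup" };
    snprintf(base, len, "%s", volname);
    for (size_t i = 0; i < sizeof(suffixes) / sizeof(suffixes[0]); i++) {
        size_t blen = strlen(base), slen = strlen(suffixes[i]);
        if (blen > slen && strcmp(base + blen - slen, suffixes[i]) == 0) {
            base[blen - slen] = '\0';   /* strip the well-known suffix */
            break;
        }
    }
}

int main(void)
{
    char base[64];
    vldb_base_name("user.jhutz.readonly", base, sizeof(base));
    printf("%s\n", base);   /* prints "user.jhutz" */
    return 0;
}

Arbitrary additional clones have no such well-known suffix, which is exactly
why naming them is the hard part.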
I guess what I'm trying to say is that the item in the roadmap is not "be
able to have multiple clones of a volume", because we have had that for
quite some time. The roadmap item _is_ to have a user-visible volume
snapshotting mechanism where you can find and access multiple snapshots of
a volume.
-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
Sr. Research Systems Programmer
School of Computer Science - Research Computing Facility
Carnegie Mellon University - Pittsburgh, PA