[OpenAFS] fine-grained incrementals?

Mike Fedyk mfedyk@matchmail.com
Wed, 23 Feb 2005 16:05:35 -0800


Jeffrey Hutzelman wrote:

>
>
> On Wednesday, February 23, 2005 01:08:29 PM -0800 Mike Fedyk 
> <mfedyk@matchmail.com> wrote:
>
>> Jeffrey Hutzelman wrote:
>>
>>> On Wednesday, February 23, 2005 11:44:17 AM -0800 Mike Fedyk
>>> <mfedyk@matchmail.com> wrote:
>>>
>>>>  1) r/w volumes pause all activity during a replication release
>>>>  2) backups need to show only the blocks changed since the last backup
>>>> (instead of entire files)
>>>>  3) (it seems) volume releases copy the entire volume image instead of
>>>> the changes since the last release
>>>>
>>>> It looks like all of these issues would be solved in part or 
>>>> completely
>>>> with the introduction of "Multiple volume versions" (as listed on the
>>>> openafs.org web site under projects).
>>>>
>>>> 1) would be solved by creating a clone before a release and releasing
>>>> from that.
>>>
>>>
>>>
>>> That already happens.  But the cloning operation takes time, and the
>>> source volume is indeed busy during that time.
>>
>>
>> Interesting.  Is there documentation on the AFS format and how it does
>> COW?  I'm familiar with Linux LVM, so it should use similar concepts,
>> except that doing COW at the filesystem level can be more powerful and
>> complicated than at the block device layer.
>>
>> In LVM basically a snapshot/clone just requires a small volume for block
>> pointers, and incrementing the user count on the PEs (physical extents).
>> How does AFS do this, and why is it taking a noticeable amount of time
>> (also what is the AFS equivalent of PE block size)?
>
>
> There is not particularly much documentation on the on-disk structure 
> of AFS fileserver data; you pretty much need to UTSL.
>
> AFS does copy-on-write at the per-vnode layer.  Each vnode has 
> metadata which is kept in the volume's vnode indices; among other 
> things, this includes the identifier of the physical file which 
> contains the vnode's contents (for the inode fileserver, this is an 
> inode number; for namei it's a 64-bit "virtual inode number" which can 
> be used to derive the filename). The underlying inode has a link count 
> (in the filesystem for inode; in the link table for namei) which 
> reflects how many vnodes have references to that inode.  When you 
> write to a vnode whose underlying inode has more than one reference, 
> the fileserver allocates a new one for the vnode you're writing to, 
> and copies the contents.
>
> A cloned volume has its own vnode indices.  The cloning process 
> basically involves creating new indices and incrementing the link 
> count on all of the underlying inodes.  Unfortunately, usually you are 
> either updating or removing an existing clone, which means 
> decrementing the link counts on all of its vnodes, and possibly 
> actually freeing the associated data.  On a volume with lots of files, 
> this turns out to be time-consuming.

Thanks.

What is the concurrency mechanism (fsync, vnode locks in memory, etc.)?
Is there an fsync after each vnode operation, or are updates batched?
Is COW done per vnode, or is there some kind of vnode packing within a
block?
How big are the internal blocks in AFS (or is everything packed together
at the byte level)?

>
>
>>>> 2) would be solved by creating a clone each time there is a
>>>> backup and comparing it to the previous backup clone.  and 3) would 
>>>> be a
>>>> similar process with volume releases.
>>>
>>>
>>>
>>> This is not a bad idea.  Of course, you still have problems if the
>>> start time for the dump you want doesn't correspond to a clone you
>>> already have, but that situation can probably be avoided in practice.
>>
>>
>> Yes, and that is why I said it would be a specific and separate clone
>> just for the incremental backups, so the timing isn't based on the 
>> backup
>> clone for instance.
>
>
> Yes, but that doesn't necessarily solve the problem.  Real backup 
> scenarios may involve multiple levels of backups, which necessitates 
> multiple clones. And, situations can arise in which the clone you have 
> does not have precisely the right date; for example, the backup system 
> may lose a dump for some reason (lost/damaged tape; the backup machine 
> crashed before syncing the backup database to disk, etc).  In a 
> well-designed backup system, these cases should be rare, but they will 
> occur.

The use count for each vnode should be at least 16 bits, so having
multiple clones shouldn't be a problem.  You just keep a clone for each
backup and remove the extra clones on a full backup (really, the
management of the clones should be done by the backup system, just as
it would be with LVM).  All of the clones are equal, since they all
point to the same vnodes.  The use of each individual clone would be
based on timing (whether it matches the time of a full or incremental
backup, for instance).
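
That also suggests how an incremental could be derived: walk the vnode
indices of the old and new clones, and any vnode whose underlying inode
differs has been rewritten since the old clone was taken.  Something
like the following (again, made-up structures, nothing to do with the
actual dump code):

    /* Rough idea of diffing two clones to find changed vnodes.  The
     * real vnode index format is more involved than this. */
    #include <stdio.h>

    struct vnode_entry { int in_use; long inode; };

    static void diff_clones(const struct vnode_entry *old_idx,
                            const struct vnode_entry *new_idx,
                            int nvnodes)
    {
        for (int i = 0; i < nvnodes; i++) {
            if (!new_idx[i].in_use)
                continue;                 /* deleted or never allocated */
            if (!old_idx[i].in_use || old_idx[i].inode != new_idx[i].inode)
                printf("vnode %d changed, dump it\n", i);
            /* same inode in both clones => data unchanged, skip it */
        }
    }

    int main(void)
    {
        struct vnode_entry old_idx[3] = { {1, 10}, {1, 11}, {0, 0} };
        struct vnode_entry new_idx[3] = { {1, 10}, {1, 20}, {1, 21} };
        diff_clones(old_idx, new_idx, 3); /* vnodes 1 and 2 have changed */
        return 0;
    }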

>
> It should be noted that I do not consider the bu* tools that come with 
> AFS to have anything to do with a well-designed backup system.

Noted.

>
>> An interesting thought would be to clone a replicated volume on another
>> machine that has more, but slower space than a fileserver holding the
>> active r/w volumes and running the backups from that machine to keep 
>> load
>> down on the r/w volume fileserver.
>
>
> An interesting idea, except that you forget how the replicated volume 
> gets there - by a release process which involves creating a temporary 
> release clone.
>
> Replicated volumes are designed to provide load-balancing and 
> availability for data which is infrequently written but either 
> frequently read or important to have available (often both properties 
> apply to the same data), like software.  They are not designed to 
> provide backup copies of frequently-written data like user 
> directories.  Within their design goals, they work quite well:
>
> - volumes can be replicated over many servers to meet demand
> - making the R/W unavailable for a while during a release is not a big
>  deal because it is never accessed by normal users
> - not having failover is unimportant because if the server containing
>  your R/W volumes is down, then you have a fileserver down and should
>  be fixing it instead of updating R/W volumes.

I'm not mixing this thread with my previous message about using AFS for
HA and switching r/o to r/w on a r/w fileserver failure.  I was simply
thinking that it is a relatively common scenario to have a server that
wasn't set up for quick response but has a lot of space.

Creating the multiple clones on that system would let the backups
proceed without slowing the other systems, which have the faster disks,
faster processors, and more memory needed for 24/7 production.

>
>
>
>> The first step can just allow the admin to mount the clone volumes where
>> they want them.  OAFS allows you to have volumes that are not connected
>> to the AFS tree that users see, doesn't it?
>
>
> Yes, you can have volumes that are not mounted anywhere.  The issue is 
> how to name and refer to these clones.  The "normal" RO and BK clones 
> appear in the same VLDB entry as the base volume, and the cache 
> manager and other tools know to strip off the .backup or .readonly 
> suffix to get the name they must look up in the VLDB.
>
>
> I guess what I'm trying to say is that the item in the roadmap is not 
> "be able to have multiple clones of a volume", because we have had 
> that for quite some time.  The roadmap item _is_ to have a 
> user-visible volume snapshotting mechanism where you can find and 
> access multiple snapshots of a volume.

OK, that clears things up.  I was wondering why a COW system would have
a limit on the number of clones (unless it did something like use only
3 bits for the use count).
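
On the naming question, I'd guess the lookup just strips a known suffix
to find the base entry in the VLDB, so extending it to named snapshots
doesn't sound impossible.  Something along these lines (my guess at the
idea, not the cache manager's actual code):

    /* Map a volume name like "user.mike.backup" back to the base VLDB
     * name "user.mike".  Purely illustrative. */
    #include <stdio.h>
    #include <string.h>

    static const char *suffixes[] = { ".readonly", ".backup" };

    static char *base_volume_name(const char *name, char *base, size_t len)
    {
        size_t n = strlen(name);
        if (n >= len)
            n = len - 1;                   /* truncate overly long names */
        memcpy(base, name, n);
        base[n] = '\0';
        for (size_t i = 0; i < sizeof(suffixes) / sizeof(suffixes[0]); i++) {
            size_t slen = strlen(suffixes[i]);
            if (n > slen && strcmp(base + n - slen, suffixes[i]) == 0) {
                base[n - slen] = '\0';     /* chop the suffix off */
                break;
            }
        }
        return base;
    }

    int main(void)
    {
        char buf[64];
        printf("%s\n", base_volume_name("user.mike.backup", buf, sizeof buf));
        return 0;
    }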

Mike