[OpenAFS] Status of "vos shadow"

Wed, 20 Jun 2007 16:17:18 -0400

This is a quick note to discuss our experiences with shadows thus  
far. We'd hoped to be done long before now, but other work keeps  
getting in the way of pushing this forward. We are now in early  
pilot, and hope to have an initial set in production by end of summer.

We (well, Dan Hyde) found that the shadow code was largely complete.  
We did find one serious bug that could cause loss of the original
volume; I believe Dan has forwarded that fix to the group.

One of the biggest problems we bumped into was only semi-technical.  
It was the lack of definition of what a shadow *should* be as opposed  
to what a shadow is. We made decisions that suit us, but they  
necessarily reflect our intended use for shadows. Your mileage may  
vary, and we're certainly interested in and amenable to changes if  
the community comes to a decision on them.

Our purpose: disaster recovery by means of invisible replicated  
volumes. We envision a set of DR hosts with a shadow volume that  
replicates a production volume. If a host hard-fails and isn't likely  
to come back in a reasonable amount of time, we will go to the shadow  
server and promote the relevant volumes from shadow to production. At  
that time the vldb is modified to show the shadow host as the real  
host, and the on-server copy of the volume is changed from type  
'shadow' to type 'production' (handwave, handwave). "A reasonable  
amount of time" is site-dependent, of course.

Shadows do not appear in the vldb. Their existence is known only to  
the host which contains a particular shadow. Thus one might have many  
shadows, up to and including one on each vice partition in a cell.  
There is no required relationship of name, parenthood, etc., between a
shadow and the volume from which it was created. (For the rest of  
this note, we'll refer to the original volume as the parent, and a  
shadow of a parent as a child.)

Simple shadowing of a parent onto a non-existent child creates a new  
volume identical to the parent in all but name and visibility.  
Incrementally shadowing a parent onto a child brings the child up to
date with the parent, and is faster in proportion to how little has
changed.

Bad things you can do:

Shadowing a volume onto another volume's child results in a jumbled
and probably useless volume. We don't think it should be permitted,
but lacking a more extensive and better-defined child/parent
relationship we don't see a way to prevent it. Properly, that
relationship should be in the vldb, but that requires much more  
extensive changes than (a) we were willing to make and (b) we thought  
the community would accept without pre-agreement as to what that  
relationship would be.

Shadowing a shadow onto itself results in disaster. We have now  
forbidden that in the code.

Shadowing onto a production volume should and does fail. I don't  
recall if we had to modify the code for that, but if so, that'll be  
part of the patch when we release.
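
To make the three cases above concrete, here is a minimal sketch of
the checks in Python. It is not the actual vos code: the Volume record,
its field names, the 'production'/'shadow' labels, and the function
name are all hypothetical, and "onto itself" is assumed to mean "same
server, partition, and name".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Volume:
    # Hypothetical record; the real volume header holds much more.
    server: str
    partition: str
    name: str
    kind: str  # "production" or "shadow" (illustrative labels)


def check_shadow_target(source: Volume, target: Optional[Volume]) -> Optional[str]:
    """Return None if the shadow operation may proceed, else a reason.

    target is None when no volume exists yet at the destination.
    """
    if target is None:
        # Simple shadow onto a non-existent child: always allowed.
        return None
    same_place = (source.server, source.partition, source.name) == (
        target.server, target.partition, target.name)
    if same_place:
        return "refusing to shadow a volume onto itself"
    if target.kind == "production":
        return "refusing to shadow onto a production volume"
    # The remaining bad case, a shadow that belongs to a different
    # parent, cannot be caught here: nothing in this model records
    # which parent a shadow came from.
    return None


if __name__ == "__main__":
    parent = Volume("fs1", "a", "user.alice", "production")
    child = Volume("dr1", "a", "user.alice", "shadow")
    print(check_shadow_target(parent, child))
    print(check_shadow_target(child, child))
    print(check_shadow_target(child, parent))
```

The final comment is the point: without a recorded parent/child
relationship, the first case in the list stays undetectable.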

There is now a vos command which promotes a shadow to production. It  
does nothing to the parent, which will continue to exist on the  
original server/vice partition and could be re-promoted with the  
appropriate vos sync command.

When a shadow is created, there is a mark in its volume header which  
indicates it is a clone. During the salvage process shadows are  
handled properly. If I recall correctly, we had to make no changes to  
the salvager for this, but if shadows were to appear in the vldb that  
might be a different story.

I don't recall if you can have a shadow named after its parent on the  
same server and vice partition as the parent.

We found a great deal of code that implies a long-term relationship  
between parent and child was intended, but that code is clearly  
incomplete. Unfortunately it's incomplete to such a degree that it's  
not possible to tell what the author(s) intended that relationship to  
be.

More detail on our intended usage:

For every AFS server we have, we will have a shadow server. When a
volume is created on a server, a shadow is quickly created (by a
semi-automated process) on the designated shadow server. When a volume
is moved from one server to another, the shadow is removed from the old
shadow host and created on the new one. As often as we can manage
without affecting server performance (i.e., TBD), we will incrementally
refresh children from their parents.

When a disaster occurs (an entire server is lost and not recoverable  
in a reasonable amount of time), the shadow server is brought on  
line. Assuming we've done our job correctly, user volumes simply  
reappear with a new location. The content of those volumes is as
up-to-date as the most recent refresh of the shadow. Our
seat-of-the-pants guess is that we can refresh each shadow about 4
times a day without affecting overall performance.
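
As a back-of-the-envelope illustration of what that refresh rate means
for recovery, the arithmetic can be written out as below. It assumes
refreshes are evenly spaced through the day, which the note does not
promise; the function name is made up for the example.

```python
def refresh_interval_hours(refreshes_per_day: int) -> float:
    """Hours between refreshes, assuming they are evenly spaced."""
    if refreshes_per_day <= 0:
        raise ValueError("refreshes_per_day must be positive")
    return 24.0 / refreshes_per_day


if __name__ == "__main__":
    # With about 4 refreshes a day, a promoted shadow can be missing
    # roughly this many hours of changes, plus however long the last
    # refresh itself took to run.
    print(refresh_interval_hours(4))
```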

"A semi-automated process": it happens out of cron. A shadow server
gets the volume list for the host it's shadowing, and does the
creation/updating as needed. Since a shadow server knows what shadows
it's got (think 'vos listvol'), it can also delete shadows it
doesn't need any more. Note this means when a volume is moved, some
interesting race conditions may ensue. The easiest way to fix those  
race conditions is by putting the shadows into the vldb, but again,  
that is a bigger change than we wanted to put in without a broad  
agreement from the community.
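
The cron pass described above amounts to comparing two lists. A minimal
sketch in Python, under two loud assumptions: shadows carry the same
name as their parent (nothing requires this), and both volume lists
have already been gathered by some other means. The function name is
hypothetical and no vos commands are run here.

```python
def plan_shadow_sync(parent_volumes, local_shadows):
    """Compare the shadowed host's volumes with the local shadows.

    Returns three sorted lists of volume names:
      create  - parents with no shadow here yet
      refresh - parents that already have a shadow here
      delete  - local shadows whose parent is no longer on that host
    """
    parents = set(parent_volumes)
    shadows = set(local_shadows)
    create = sorted(parents - shadows)
    refresh = sorted(parents & shadows)
    delete = sorted(shadows - parents)
    return create, refresh, delete


if __name__ == "__main__":
    create, refresh, delete = plan_shadow_sync(
        ["user.alice", "user.bob", "user.carol"],   # on the shadowed host
        ["user.bob", "user.carol", "user.dave"],    # shadows held locally
    )
    print("create: ", create)
    print("refresh:", refresh)
    print("delete: ", delete)
```

The race mentioned above shows up in the delete list: a volume moved
between the two listings looks like one whose shadow is no longer
needed, even though its new shadow host may not have created a
replacement yet.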

Some fallout discovered while testing the above: there's no
real need to create a shadow at volume creation time; doing an  
incremental onto a non-existent shadow creates the shadow in exactly  
the same manner as doing a full shadow. Some might regard this as a  
bug; for the moment we're taking advantage of it.

Our new, second data center just went on line this week. With that in  
place, we can start the initial pilot work on shadows as disaster  
recovery.