[OpenAFS] Status of "vos shadow"
Steve Simmons
scs@umich.edu
Wed, 20 Jun 2007 16:17:18 -0400
This is a quick note to discuss our experiences with shadows thus
far. We'd hoped to be done long before now, but other work keeps
getting in the way of pushing this forward. We are now in early
pilot, and hope to have an initial set in production by end of summer.
We (well, Dan Hyde) found that the shadow code was largely complete.
We did turn up one serious bug that could cause loss of the original
volume; I believe Dan has forwarded that fix to the group.
One of the biggest problems we bumped into was only semi-technical:
the lack of any definition of what a shadow *should* be, as opposed
to what a shadow currently is. We made decisions that suit us, but they
necessarily reflect our intended use for shadows. Your mileage may
vary, and we're certainly interested in and amenable to changes if
the community comes to a decision on them.
Our purpose: disaster recovery by means of invisible replicated
volumes. We envision a set of DR hosts with a shadow volume that
replicates a production volume. If a host hard-fails and isn't likely
to come back in a reasonable amount of time, we will go to the shadow
server and promote the relevant volumes from shadow to production. At
that time the vldb is modified to show the shadow host as the real
host, and the on-server copy of the volume is changed from type
'shadow' to type 'production' (handwave, handwave). "A reasonable
amount of time" is site-dependent, of course.
Shadows do not appear in the vldb. Their existence is known only to
the host which contains a particular shadow. Thus one might have many
shadows, up to and including one on each vice partition in a cell.
There is no required relationship of name, parenthood, etc., between a
shadow and the volume from which it was created. (For the rest of
this note, we'll refer to the original volume as the parent, and a
shadow of a parent as a child.)
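To make that concrete, here's a hypothetical illustration (volume,
server, and partition names are all made up): the vldb lists only the
parent's site, while the shadow shows up only in a per-server listing
of the host that holds it.

    # The vldb knows only about the parent:
    vos listvldb -name user.alice
    # ...shows a single RW site, e.g. fs1.example.edu /vicepa

    # The shadow is visible only by asking its server directly:
    vos listvol shadow1.example.edu a
    # ...user.alice appears in this listing as well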
Simple shadowing of a parent onto a non-existent child creates a new
volume identical to the parent in all but name and visibility.
Incrementally shadowing a parent onto an existing child brings the
child up to date with the parent, and is a proportionately faster
operation.
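For concreteness, the two operations look roughly like this with the
current vos syntax (all names here are hypothetical):

    # Full shadow: creates a new, invisible copy of user.alice
    # on the shadow server; no vldb entry is made for it.
    vos shadow -id user.alice \
        -fromserver fs1.example.edu -frompartition a \
        -toserver shadow1.example.edu -topartition a

    # Incremental shadow: same command plus -incremental; only
    # changes since the last shadow operation are shipped.
    vos shadow -id user.alice \
        -fromserver fs1.example.edu -frompartition a \
        -toserver shadow1.example.edu -topartition a \
        -incremental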
Bad things you can do:
Shadowing a volume onto another volume's child results in a jumbled
and probably useless volume. We don't think it should be permitted,
but lacking a more extensive and better-defined child/parent
relationship we don't see a way to prevent it. Properly that
relationship should be in the vldb, but that requires much more
extensive changes than (a) we were willing to make and (b) we thought
the community would accept without pre-agreement as to what that
relationship would be.
Shadowing a shadow onto itself results in disaster. We have now
forbidden that in the code.
Shadowing onto a production volume should and does fail. I don't
recall if we had to modify the code for that, but if so, that'll be
part of the patch when we release.
There is now a vos command which promotes a shadow to production. It
does nothing to the parent, which will continue to exist on the
original server/vice partition and could be re-promoted with the
appropriate vos sync command.
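Roughly, a failover would look like the sketch below. The promote
step is our local vos extension, so its name and syntax here are
illustrative only, not final; the sync steps are stock vos commands.

    # Promote the shadow in place on the shadow server
    # (illustrative name; the real command ships with our patch):
    vos promote ...

    # Then make the vldb reflect the new site:
    vos syncvldb -server shadow1.example.edu
    vos syncserv -server shadow1.example.edu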
When a shadow is created, there is a mark in its volume header which
indicates it is a clone. During the salvage process shadows are
handled properly. If I recall correctly, we had to make no changes to
the salvager for this, but if shadows were to appear in the vldb that
might be a different story.
I don't recall if you can have a shadow named after its parent on the
same server and vice partition as the parent.
We found a great deal of code implying that a long-term relationship
between parent and child was intended, but that code is clearly
incomplete. Unfortunately it's incomplete to such a degree that it's
not possible to tell what the author(s) intended that relationship to
be.
More detail on our intended usage:
For every AFS server we have, we will have a shadow server. When a
volume is created on a server, a shadow is quickly created on the
designated shadow server (a semi-automated process; see below). When
a volume is moved from one server to another, the shadow is removed
from the old shadow host and created on the new host. As often as we
can manage without affecting server performance (i.e., TBD), we will
incrementally refresh parents to children.
When a disaster occurs (an entire server is lost and not recoverable
in a reasonable amount of time), the shadow server is brought on
line. Assuming we've done our job correctly, user volumes simply
reappear at a new location. The content of those volumes is as
up-to-date as the most recent refresh of the shadow. Our
seat-of-the-pants guess is that we can refresh each shadow about 4
times a day without affecting overall performance.
"A semi-automated process:" it happens out of cron. A shadow server
gets the volumes list for the host it's shadowing, and does the
creation/updating as needed. Since a shadow server knows what shadows
it's got (think 'vos listvol'), it also can duplicate shadows it
doesn't need any more. Note this means when a volume is moved, some
interesting race conditions may ensue. The easiest way to fix those
race conditions is by putting the shadows into the vldb, but again,
that is a bigger change than we wanted to put in without a broad
agreement from the community.
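A sketch of that cron job, under the same hypothetical names as
above (the real script also needs locking, the race handling just
described, and logic to skip backup/RO clones):

    #!/bin/sh
    # Refresh all shadows for one production server. Partition
    # 'a' is hardcoded here purely for illustration.
    PROD=fs1.example.edu
    SHADOW=shadow1.example.edu

    # -fast prints just the volume IDs on the server; keep only
    # the numeric lines in case of headers.
    vos listvol $PROD -fast | grep '^[0-9]' |
    while read vol; do
        # Refresh each volume's shadow on the shadow server.
        vos shadow -id "$vol" \
            -fromserver $PROD -frompartition a \
            -toserver $SHADOW -topartition a \
            -incremental
    done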
One bit of fallout discovered while testing the above: there's no
real need to create a shadow at volume creation time; doing an
incremental onto a non-existent shadow creates the shadow in exactly
the same manner as doing a full shadow. Some might regard this as a
bug; for the moment we're taking advantage of it.
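That is, with the same hypothetical names as above, this works even
when no shadow of user.alice exists yet on the target:

    # With no existing child, -incremental degrades to a full
    # copy and creates the shadow from scratch.
    vos shadow -id user.alice \
        -fromserver fs1.example.edu -frompartition a \
        -toserver shadow1.example.edu -topartition a \
        -incremental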
Our new, second data center just went on line this week. With that in
place, we can start the initial pilot work on shadows as disaster
recovery.