[OpenAFS] State of the Michigan shadow system (long)

Steve Simmons scs@umich.edu
Mon, 20 Dec 2010 14:46:38 -0500


Matt Benjamin alluded to this in other email on the info list; given
the state of our world it's a good idea to get the idea out to others.
"The state of our world" doesn't mean it's coming apart, just means
that we probably aren't going to be working on this for the
foreseeable future.

Dan Hyde and I were building a system at Michigan intended to allow
rapid disaster recovery for AFS, analogous to snapmirroring on NetApp
filers and similar devices. This note is a quick overview of where we
were going, how far we got, and why.

Problem: at Michigan, losing an AFS file server can make some or all
of the cell unusable (handwave, handwave on why). As the number of
servers increases, the likelihood of this becomes higher and higher.
We were looking for a way to minimize those losses, and glommed onto
an unfinished project called 'shadow volumes' to do it.

Shadow volumes have lots of theoretical capabilities; we were pushing
for one specific set of features in our implementation. Don't take our
work as representative of what the initial developers intended, or as
the only possible use for it. In classic open source fashion, our
development reflected scratching our own particular itches.

Credit where credit is due: Dan did all the heavy lifting on the code
and a lot of the test and operational deployment. And the original
work was done by someone whose name escapes me right at this second;
if time and energy permit I'll look that up and give that credit.

A shadow volume is a read-only remote clone of a primary volume. We
had to create some terminology here, and 'primary' is what we called
the real-time, in-use, r/w production volume. A remote clone closely
resembles a read-only replica of a volume, but differs in several
important respects.

First and foremost, it does not appear in the vldb. Thus there is no
possibility of the read-only copy coming into production. If it were
public like a r/o replica, it would generate all kinds of problems for
the day-to-day use of the volume. Our solution to this follows the
original developer: the only way to prevent use of the r/o was to not
have it appear in the vldb. Longer term there are better ways, but
this did the least violence to existing cells.

A shadow volume should retain a timestamp and name-or-id relationship
with the primary. This should enable something much like a release of
a replicated volume - incremental changes are quickly and easily
propagated to the shadow. We call that refreshing the shadow. As the
shadow is not in the vldb, this requires that the refresh be initiated
by something external to the vldb/primary. That code is complete and
works. This was running on a nightly basis in our cell with an
acceptably small amount of overhead - not much more than the nightly
backup snapshots. Big kudos to Dan on this.
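
For the curious, here is roughly what one refresh looks like if you
drive it with the stock 'vos shadow' subcommand and its -incremental
flag. The Python wrapper below is only an illustrative sketch; the
server, partition, and volume names are made up, and our actual
tooling differed in detail.

    # Illustrative sketch: incrementally refresh one shadow with the
    # stock "vos shadow" subcommand. Names here are hypothetical.
    import subprocess

    def refresh_shadow(volume, primary_server, primary_part,
                       shadow_server, shadow_part):
        """Push changes made on the primary since the last refresh."""
        subprocess.check_call([
            "vos", "shadow", "-id", volume,
            "-fromserver", primary_server, "-frompartition", primary_part,
            "-toserver", shadow_server, "-topartition", shadow_part,
            "-incremental",   # ship only what changed since last refresh
            "-localauth",     # run with the server key, e.g. from cron
        ])

    if __name__ == "__main__":
        refresh_shadow("user.example", "fs1.umich.edu", "a",
                       "fs1-shadow.umich.edu", "a")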

Shadow volumes can be detected only on the server on which they
reside. Modifications were made to vos listvol for that purpose. A bit
in the volume header was selected for distinguishing a shadow from a
primary volume; I believe that was the only modification made to the
volume header file. This work is also done.
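
As an illustration of how that might be consumed, here is a
hypothetical helper that scrapes the patched 'vos listvol -long'
output. Exactly how the patched listvol renders the shadow bit is an
assumption on my part (shown here as a stanza line containing the word
"shadow"), so treat this as a sketch only.

    # Hypothetical: find shadow volumes on a server by scraping the
    # patched "vos listvol -long" output. How the patch renders the
    # shadow bit is assumed (a line containing the word "shadow").
    import subprocess

    def list_shadows(server):
        out = subprocess.check_output(
            ["vos", "listvol", "-server", server, "-long", "-localauth"],
            universal_newlines=True)
        shadows, current = [], None
        for line in out.splitlines():
            parts = line.split()
            # A stanza header looks like: "<name> <id> RW <size> K On-line"
            if len(parts) >= 3 and parts[2] in ("RW", "RO", "BK"):
                current = parts[0]
            elif "shadow" in line.lower() and current:
                shadows.append(current)
                current = None
        return shadows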

A mechanism needs to be established such that a shadow volume can be
promoted (our term) to a primary. This mechanism would involve at
least two steps: flipping the shadow bit in the header file to
indicate the volume is a primary, and updating the vldb to indicate
the new location of the primary. This work is incomplete; I don't have
a feel for how much, if any, is done.
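
To make those two steps concrete, here is a sketch. Clearing the
shadow bit is exactly the unfinished piece, so it appears only as a
placeholder; re-registering the volume's location can be done with the
stock 'vos syncvldb'.

    # Sketch of the two-step promotion described above. Flipping the
    # header bit needs the unfinished shadow-aware tools, so it is a
    # placeholder; the vldb update uses the stock "vos syncvldb".
    import subprocess

    def clear_shadow_bit(volume, server, partition):
        """Placeholder: requires the patched volserver/vos."""
        raise NotImplementedError("shadow-aware tools not yet written")

    def promote_shadow(volume, shadow_server, shadow_part):
        # Step 1 (hypothetical tool): mark the volume as a primary.
        clear_shadow_bit(volume, shadow_server, shadow_part)
        # Step 2: point the vldb at the volume's new home.
        subprocess.check_call([
            "vos", "syncvldb", "-server", shadow_server,
            "-partition", shadow_part, "-volume", volume, "-localauth"])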

With these features, we could meet the minimum bar for our usage. We
could, in theory, disastrously lose an AFS server, promote the
shadows, and be back online in minutes. There would be data loss for
any changes which occurred between the last refresh and the promotion,
but this was judged preferable to having the cell down or
non-functional for hours or even days.

In our initial implementation, we were building AFS servers in pairs
with shadow servers. Each server in a pair was intended for only one
purpose - either all primary volumes, or all shadow volumes. This
isn't the only way to do it, but we selected this method for a couple
of reasons:

* It eased the tracking of where shadow volumes were, and enabled us
to easily find shadow volumes that might no longer be needed on a
given shadow server.
* It very much reflects the problem we're trying to solve: disastrous
loss of either (a) a file server or (b) an entire data center. A quick
ability to tell a server 'promote everything' (see the sketch after
this list) made for quick and accurate response in the face of not
having the shadow data in the vldb.
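
Given the helpers sketched earlier, 'promote everything' is little
more than a loop. Again, this is illustrative only and reuses the
hypothetical list_shadows() and promote_shadow() helpers from the
sketches above.

    # Illustrative: promote every shadow a server holds, using the
    # list_shadows() and promote_shadow() sketches from earlier.
    def promote_all(shadow_server, shadow_part="a"):
        for volume in list_shadows(shadow_server):
            promote_shadow(volume, shadow_server, shadow_part)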

To support this process, every night (or at whatever interval you
choose) the shadow servers would examine the primary volumes on their
paired server, and would create or refresh the shadows as needed. We
intended to update our provisioning process for volumes such that
shadows would automatically be created when a primary was created or
moved, but since the shadow servers caught any missing volumes
automatically, it was kind of low on the list.
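
Boiled down to its essentials, that nightly job might look something
like the sketch below. It reuses the refresh_shadow() and
list_shadows() sketches from earlier in this note; the list of primary
volume names is left as an input, and the single-partition assumption
is a simplification.

    # Rough shape of the nightly job on a shadow server: create or
    # refresh a shadow for every r/w volume on the paired primary.
    # A plain "vos shadow" (no -incremental) creates a missing shadow.
    import subprocess

    def create_shadow(volume, primary_server, primary_part,
                      shadow_server, shadow_part):
        subprocess.check_call([
            "vos", "shadow", "-id", volume,
            "-fromserver", primary_server, "-frompartition", primary_part,
            "-toserver", shadow_server, "-topartition", shadow_part,
            "-localauth",
        ])

    def nightly_refresh(primary_volumes, primary_server, shadow_server,
                        partition="a"):
        existing = set(list_shadows(shadow_server))
        for volume in primary_volumes:
            if volume in existing:
                refresh_shadow(volume, primary_server, partition,
                               shadow_server, partition)
            else:
                create_shadow(volume, primary_server, partition,
                              shadow_server, partition)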

Other things one could do with shadows:

I'll mention using shadows and their clones as part of a file restore
system. That's nice, but rather a pain in many ways. It's also largely
a workaround for the limitation of only having 7 clone slots
available. Having a significantly larger number of clones is a much
better solution, but that's outside the scope of this project.

Things envisioned but not yet followed through to an actual design:

* a vldb-like solution such that shadow(s) of a given primary could be
identified easily and moved/updated appropriately. In the best of all
worlds, this would be a part of the vldb, but that's a lot to wish for
* volume-sensitive and shadow-sensitive decisions on refresh
frequency. One might refresh critical data volumes quite often, less
critical ones rarely or not at all. One might refresh on-site shadows
frequently, off-site ones daily. (A sketch of one such policy follows
this list.)
* remote shadows become your long-term backup system. This would
require several features, most critically:
** the ability to have clones of shadows, one clone per daily backup,
say. Note this requires that refreshing a shadow also manage those
clones in some flexible way
** the ability to promote a shadow to a different name. This enables
the shadow and its clones to be made visible without taking the
production volume off-line.
* clones (in particular, .backup) of a primary should be refreshable
to a shadow, i.e., specified clones of the primary could be refreshed
to the shadow
* some way of mediating between incompatible operations, e.g., have
refresh operations either queue or abort cleanly if they would
interfere with other activities like volume moves, backupsys, etc.
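
To give the refresh-frequency idea a concrete shape: the policy could
be as simple as a table mapping volume-name patterns to intervals.
This is purely illustrative; nothing like it was implemented, and the
patterns and intervals are invented.

    # Purely illustrative: a per-volume refresh-frequency policy as a
    # table of volume-name patterns and refresh intervals in hours.
    import fnmatch

    REFRESH_HOURS = [
        ("user.*",    24),   # home directories: nightly
        ("proj.*",     4),   # critical project space: every few hours
        ("archive.*",  0),   # 0 means never refresh
    ]

    def refresh_interval(volume):
        for pattern, hours in REFRESH_HOURS:
            if fnmatch.fnmatch(volume, pattern):
                return hours
        return 24   # default: nightly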

Some open questions:

It was clear we were talking about volume families - a primary, its
clones, its shadows, their clones, etc., etc. Should you be able to
have shadows of shadows? We think so; refreshing multiple shadows of a
given volume shouldn't require hitting the primary multiple times nor
doing all those refreshes in lockstep. We need to establish a sort of
taxonomy of volumes with well-defined relationships. Dan and I came up
with a lot of ideas, but are very aware that we were reasoning in the
dark. Other sites might well have other needs that would affect this.
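
For what it's worth, here is one way those family relationships could
be written down. This is purely a thinking aid, not anything we
designed or implemented; the names are invented.

    # A thinking aid, not a design: one way to write down a "volume
    # family" (a primary, its clones, its shadows, and their clones).
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class VolumeNode:
        name: str
        kind: str                    # "primary", "shadow", or "clone"
        children: List["VolumeNode"] = field(default_factory=list)

    family = VolumeNode("user.example", "primary", [
        VolumeNode("user.example.backup", "clone"),
        VolumeNode("user.example shadow on fs1-shadow", "shadow", [
            VolumeNode("nightly clone 2010-12-19", "clone"),
        ]),
    ])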

I think we were sliding towards a transparent, upward-compatible
replacement of the vldb as well. Based purely on how I imagine the
vldb to work :-), it should be possible to add shadow data to it and
define some additional rpcs. Users of the old rpcs would only get the
data that was in the 'legacy' vldb; users of the new rpcs would get
shadow data as well. That's a door folks may not want opened yet, but
it seems a better choice than bolting a separate shadow-oriented vldb
to the side.


So that's where we are. I believe our latest shadow software is built
against 1.4.11, but could be wrong. If folks are interested, I'd be
happy to chat with Dan and we'll release the patches to interested
parties.

If folks think this is worth writing up in the afslore wiki as a
partial project, I'd be glad to take this note and shovel it in with
appropriate formatting.

Steve

