[OpenAFS] Re: State of the Michigan shadow system (long)

Steve Simmons scs@umich.edu
Mon, 20 Dec 2010 19:00:18 -0500

On Dec 20, 2010, at 3:29 PM, Andrew Deason wrote:

> On Mon, 20 Dec 2010 14:46:38 -0500
> Steve Simmons <scs@umich.edu> wrote:
>> A shadow volume is a read-only remote clone of a primary volume. We
>> had to create some terminology here, and 'primary' is what we called
>> the real-time, in-use, r/w production volume. A remote clone closely
>> resembles a read-only replica of a volume, but differs in several
>> important respects.
> By 'read-only' do you just mean in intended usage? I may be way off, but
> my memory of shadow volumes (as implemented in openafs.org code) is that
> they are virtually identical to the primary, and are not marked as
> RO volumes or anything like that in the underlying namei metadata. So, a
> fileserver could theoretically attach it and modify it, though it was
> intended that the lack of an entry in the vldb would prevent clients
> from accessing it.

Yes, 'read-only' is sloppy terminology on my part. 'Enforcement' of the
read-only nature was done by virtue of the shadow being invisible to
most things that access volumes.

>> First and foremost, it does not appear in the vldb. Thus there is no
>> possibility of the read-only copy coming into production.
> I understand this was probably the best way to do this at the time, but
> this alone does not prevent the volume from getting used. Since vldb
> results are cached by clients and an administrator could screw up vldb
> data somehow, it's possible for someone to access the wrong volume.


>> Shadow volumes could be detected only on the server on which they
>> reside. Modifications were made to vos listvol for that purpose. A bit
>> in the volume header was selected for distinguishing a shadow from a
>> primary volume; I believe that was the only modification made to the
>> volume header file. This work is also done.
> By "done" does this mean you just implemented it at umich, or it's in
> the openafs.org tree? Is the volume header bit you're referring to
> inService (or another existing flag), or did you use a separate field
> specifically for shadows?

That's how we implemented it, yes. I don't believe the source is in the
public openafs.org source tree anywhere, tho I think Dan Hyde has it
incorporated as a branch in his git archive. I'd ask him, but he's on
vacation this week.

I don't know off the top of my head which bit he used. In our
discussions at the time we used one of the reserved bits, but in full
knowledge that such might have to change when/if the time came to make
the implementation more public.
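
For anyone who wants the idea in concrete terms, here is a rough sketch
of what "a reserved bit marks a shadow" amounts to. The flag values and
function names below are made up for illustration; they are not the
actual OpenAFS volume header layout, and not necessarily the bit Dan
used.

```python
# Hypothetical flag values; these are NOT the real OpenAFS header bits.
IN_SERVICE = 0x01  # illustrative stand-in for an existing flag
SHADOW = 0x80      # illustrative stand-in for the reserved bit marking a shadow


def mark_shadow(flags: int) -> int:
    """Return flags with the shadow bit set and all other bits untouched."""
    return flags | SHADOW


def is_shadow(flags: int) -> bool:
    """True if the shadow bit is set."""
    return bool(flags & SHADOW)


def list_volumes(headers: dict[str, int]) -> list[str]:
    """Label each volume the way a shadow-aware listing might, sorted by name."""
    return [
        f"{name} {'shadow' if is_shadow(flags) else 'primary'}"
        for name, flags in sorted(headers.items())
    ]
```

Since the bit lives only in the on-disk header, a listing like this can
only be produced on the server holding the volume, which matches the
behavior described above.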

>> I think we were sliding towards a transparent upward-compatible
>> replacement of the vldb as well. Based purely on how I imagine the
>> vldb to work :-), it should be possible to add shadow data to it and
>> define some additional rpcs. Users of the old rpcs would only get the
>> data that was in the 'legacy' vldb, users of the new rpcs would get
>> shadow data as well. That's a door folks may not want opened yet, but
>> it seems a better choice than bolting a separate shadow-oriented vldb
>> to the side.
> I thought the bigger problem is not the compatibility of the
> client<->vlserver interface, but rather the vlserver<->vlserver
> interface; that is, the structure of the VL entries in ubik, since the
> structures don't have any spare fields (although LockAfsId is not
> used). You can probably play some games to keep enough compatibility
> with older vlservers, but it requires some thought.

Again, loose terminology on my part, 'cause I didn't want to drown folks
in detail. But since you were kind enough to ask:

Yeah, it's hard. One chunk of what makes it hard is that the vldb format
is fixed and there's little or no space to wedge new stuff into it.
Another complicating factor is this whole idea of volume families and
determining if, when and how we want to be tracking the inter-volume
relationships and dependencies. As a particular example, in our existing
implementation it's perfectly possible for shadow A' (A-prime) of volume
A to be overwritten as a shadow of volume B. Sometimes you want that:
B could be a shadow of A, and we're reducing overhead on A by refreshing
B from A'. In a sense, you might think of B as more properly A''. But
how should such relationships be detected, and what if any limitations
should be imposed on such refreshes? Lacking a good taxonomy of what a
shadow volume is and how it relates to the primary, we can't come up
with a good database definition to encode that. Lacking that definition,
we can't come up with a proposal that would allow shadow data to be
placed in the vldb in any upward-compatible way.
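
To make the A / A' / A'' business a bit more concrete, here is a toy
sketch of one way a volume family could be tracked. It assumes each
shadow records the single volume it was last refreshed from. The
`refreshed_from` mapping and the function name are hypothetical; nothing
like this exists in the current implementation.

```python
def family_root(refreshed_from: dict[str, str], volume: str) -> str:
    """Follow 'refreshed from' links until reaching a volume with no source.

    refreshed_from maps a shadow's name to the volume it was last
    refreshed from (a hypothetical record, not an existing vldb field).
    Raises ValueError if the links loop back on themselves.
    """
    seen = set()
    while volume in refreshed_from:
        if volume in seen:
            raise ValueError(f"refresh cycle at {volume}")
        seen.add(volume)
        volume = refreshed_from[volume]
    return volume
```

With {"A'": "A", "B": "A'"}, asking for the root of B walks B to A' to A,
which is the sense in which B is "really" A''. The cycle check is one
example of a limitation that might be imposed on refreshes.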

The decision to leave shadows outside of the vldb ultimately raises the
question of how to manage shadows and volume families, and IMHO is
acceptable only as a short-term measure.

Coming to the more specific vlserver-vlserver-ubik questions - yeah,
that's hard. If all we're thinking of is simple records that could
(please, god, please!) be shoehorned into the dbs, those are relatively
simple issues. I dunno if that's possible, tho. In addition, it ignores
any possible issues that may arise when the db is in a transitional
state - i.e., an incomplete subset of the volume family data has been
distributed and somebody makes a query about it. As far as I know, ubik
doesn't have any concept of atomic commits across multiple entries. That
makes processing volume families in any except the simplest ways very
hard.

If a more complex implementation is required, well, maybe huge
violence has to be done to the vldb format, the servers, and ubik. Maybe
we need to move to some other replicated system entirely. Maybe this is
a good argument for keeping the shadow data in a separate db, not unlike
the kind of system Russ built for extended data at Stanford (I believe
they use a MySQL db to track creation and manipulation of mount points,
etc). Maybe not. No matter what, it's not an implementation I'd want to
proceed with in the absence of a community decision that This Is The
Right Direction. So for now, shadows stay outside the vldb and non-vldb
processes are going to have to handle it.

Assuming there is some way to get shadow data into the vldb:

My seat-of-the-pants feeling is that an upwardly compatible db is
doable. Older clients and servers that don't understand shadows should
work perfectly fine. They will use the older RPCs to communicate, and as
such will never get presented with data about shadows. Those older rpcs
to do vos move, copy, delete, etc, don't require any knowledge of
shadow entries in the vldb. When those actions are requested using the
existing RPCs, shadow-enabled vldb manipulation code needs to handle
the relationships in whatever default way we define. As a bit of
precedent, current removal of a replication site from the list doesn't
cause the replica to be deleted. I can see (or rather, could live
with) similar behavior for shadows.
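
Roughly, the split between old and new RPCs could look like the sketch
below. This is purely illustrative: the `Entry` record, the `shadows`
field and all three function names are invented here and do not
correspond to any existing vlserver structure or RPC.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Toy stand-in for a vldb entry with a hypothetical shadow extension."""
    name: str
    sites: list[str]
    shadows: list[str] = field(default_factory=list)  # hypothetical new data


def legacy_lookup(db: dict[str, Entry], name: str) -> dict:
    """Old-style lookup: returns only the fields legacy callers know about."""
    e = db[name]
    return {"name": e.name, "sites": list(e.sites)}


def extended_lookup(db: dict[str, Entry], name: str) -> dict:
    """New-style lookup: same data plus the shadow list."""
    e = db[name]
    return {"name": e.name, "sites": list(e.sites), "shadows": list(e.shadows)}


def legacy_remove_site(db: dict[str, Entry], name: str, site: str) -> None:
    """Old-style site removal: edits the site list and leaves shadows alone."""
    db[name].sites.remove(site)
```

The default chosen here for the legacy path (shadows are left untouched)
mirrors the replication-site precedent; some other default could be
defined just as easily.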

Newer clients and vldb replication should use the newer RPCs. For things
like rename or delete, the newer client commands and the rpcs should
allow us to instruct the appropriate entity that removal of a volume
should or should not cause the shadow(s) to be removed, or shadows
renamed as part of volume renames, etc.
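
As a sketch of what "instruct the appropriate entity" might mean, the
newer calls could simply carry an explicit policy flag. Everything
below is hypothetical: the db is a toy dict mapping a primary's name to
the names of its shadows, and the flag names and the prefix-based naming
of shadows are assumptions made up for the example.

```python
def delete_volume(db: dict[str, list[str]], name: str,
                  remove_shadows: bool = False) -> list[str]:
    """Delete a primary; return the shadow names left behind (if any)."""
    shadows = db.pop(name)
    return [] if remove_shadows else shadows


def rename_volume(db: dict[str, list[str]], old: str, new: str,
                  rename_shadows: bool = False) -> None:
    """Rename a primary, optionally renaming shadows that share its prefix."""
    shadows = db.pop(old)
    if rename_shadows:
        shadows = [new + s[len(old):] if s.startswith(old) else s
                   for s in shadows]
    db[new] = shadows
```

Both flags default to leaving shadows alone, which is the behavior an
old-style caller that knows nothing about shadows would get.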

But all that is *way* ahead of the game. For now, we've gone with the
initial implementer's decision of shadows not being in the vldb.