[OpenAFS-devel] meaning of VNOVOL, VOFFLINE, etc.

Tom Keiser tkeiser@sinenomine.net
Fri, 4 May 2012 17:40:30 -0400


Hi,

As some of you already know, sites have recently run into troubles
caused by interpretation of various volume package special error
codes.  After looking at the Ed Zayas spec, and at how the Unix and
Windows clients interpret the various codes in master and OpenAFS 1.0,
I wanted to start a discussion about the slight redefinition of
protocol error handling semantics over the past decade.  According to
the Zayas VVL spec, the relevant error codes have the following
meanings:

- VSALVAGE:  volume needs to be salvaged

- VNOVOL:  the given volume is either not attached, doesn't exist, or
is not online

- VNOSERVICE: the volume is currently not in service

- VOFFLINE: the specified volume is offline, for the reason given in
the offline message field (a subfield within the volume field in struct
volser_trans)

- VBUSY: the named volume is temporarily unavailable, and the client
is encouraged to retry the operation shortly
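
For reference, the list above can be written down as a small lookup
table.  This is only an illustrative sketch in Python, not anything
taken from the spec or the OpenAFS tree: the numeric values are assumed
to match the volume package's error header and should be verified
against the source, the descriptions paraphrase the spec text quoted
above, and describe() is a hypothetical helper name.

```python
# Spec meanings of the volume error codes discussed above.
# ASSUMPTION: numeric values match the OpenAFS volume package error
# header; check the source tree before relying on them.
VSALVAGE = 101
VNOVOL = 103
VNOSERVICE = 105
VOFFLINE = 106
VBUSY = 110

# Descriptions paraphrase the Zayas spec wording quoted in this message.
SPEC_MEANING = {
    VSALVAGE: "volume needs to be salvaged",
    VNOVOL: "volume is not attached, doesn't exist, or is not online",
    VNOSERVICE: "volume is currently not in service",
    VOFFLINE: "volume is offline, for the reason in the offline message",
    VBUSY: "volume is temporarily unavailable; retry shortly",
}


def describe(code):
    """Return the spec meaning for a code, or a fallback string.

    Hypothetical helper, for illustration only.
    """
    return SPEC_MEANING.get(code, "not a volume error code listed here")
```

Note that, read this way, VNOVOL covers three distinct situations, which
is the ambiguity discussed below.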


By my reading of the above specification, VOFFLINE is strictly for use
when offlineMessage is set in the VolumeDiskData file, whereas VNOVOL
was intended to be the catch-all "it's not online" error code.
Indeed, OpenAFS 1.0 volume.c more-or-less follows the above rubric.
When working on DAFS many years ago, I tried to follow these
definitions (although, admittedly, I got it wrong in a number of
cases).

Now, I must concede that the definitions in the Zayas spec are not
terribly useful: they do not differentiate between "I don't have it",
and "I won't give it to you", which is typically the fundamental
question the CM is trying to answer.  In this strict sense, I much
prefer the way recent versions of the Windows CM utilize
VNOVOL/VOFFLINE as a means of satisfying the existence question.
However, as much as I like the cleanliness this approach provides, I
am concerned about the seeming divergence between our implementations
and our specification...
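
The difference between the two readings can be sketched roughly as
follows.  This is a simplified Python illustration under stated
assumptions, not the logic of any actual cache manager: the function
names and the Answer categories are hypothetical, the numeric values are
assumed to match the volume package's error header, and the mapping of
VNOVOL to "I don't have it" and VOFFLINE to "I won't give it to you" is
inferred from the description above rather than taken from the Windows
CM source.

```python
from enum import Enum

# ASSUMPTION: values match the OpenAFS volume package error header.
VSALVAGE, VNOVOL, VNOSERVICE, VOFFLINE, VBUSY = 101, 103, 105, 106, 110


class Answer(Enum):
    """Hypothetical categories a cache manager might care about."""
    NOT_HERE = "I don't have it"
    REFUSED = "I won't give it to you"
    RETRY = "try again shortly"
    UNAVAILABLE = "volume unavailable (no finer distinction)"
    OTHER = "not a volume error handled here"


def classify_legacy(code):
    """Legacy-style reading: little distinction between the codes."""
    if code == VBUSY:
        return Answer.RETRY
    if code in (VSALVAGE, VNOVOL, VNOSERVICE, VOFFLINE):
        return Answer.UNAVAILABLE
    return Answer.OTHER


def classify_existence(code):
    """Existence-style reading (assumed mapping, see lead-in)."""
    if code == VNOVOL:
        return Answer.NOT_HERE      # server does not hold the volume
    if code == VOFFLINE:
        return Answer.REFUSED       # server holds it but will not serve it
    if code == VBUSY:
        return Answer.RETRY
    if code in (VSALVAGE, VNOSERVICE):
        return Answer.UNAVAILABLE   # left undistinguished in this sketch
    return Answer.OTHER
```

Under the spec's wording a server may legitimately return VNOVOL for a
volume it holds but has not brought online, which the existence-style
reading would then misinterpret as "I don't have it".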

It's certainly possible that I'm not privy to protocol discussions
where it was decided that redefining VNOVOL, VNOSERVICE[*], and
VOFFLINE was ok (given that legacy CMs seem to make little distinction
between VOFFLINE, VNOVOL, VSALVAGE, VNOSERVICE, etc.).  If that is the
case, could someone provide more information from these discussions?

Obviously, the current mismatch in behavior between DAFS and the
Windows CM needs to be resolved posthaste.  That we already have a
wide deployment base of nodes in disagreement about the denotation of
certain critical error codes is troubling--to the point that
pragmatism may preclude us from strict adherence to the extant AFS-3
specification.

This leaves me with two questions:

1) is there something that OpenAFS can do to resolve this issue
without requiring any standards involvement?

2) if not, what is our stop-gap until we can fix this at the afs3-stds level?


With regard to (1), I have some patches that modify DAFS to behave
more like the Windows CM expects.  However, before I consider pushing
these patches to gerrit, I want to solicit opinions regarding these
underlying questions...

-- 
Tom Keiser
tkeiser@sinenomine.net


[*] I'm willing to grant up-front that our reappropriation of VNOSERVICE
does not require further discussion, as it was a previously-unused
error constant.