[OpenAFS-devel] meaning of VNOVOL, VOFFLINE, etc.

Jeffrey Altman jaltman@your-file-system.com
Fri, 04 May 2012 18:35:04 -0400



Tom:

The 1991 Zayas specifications are lacking in many regards.  For
starters, the Vxxx error codes are only defined for the Vol/VL RPCs and
not for the FS/CM RPCs.  The use of the Vxxx error codes in the FS/CM
RPCs is left undefined, and yet file servers report those errors to
cache managers.

I think it was 2004 or perhaps early 2005 when a large user was
concerned about VLDB scalability due to the introduction of tens of
thousands of Windows clients into the environment.  Each time a VNOVOL,
VMOVED, VOFFLINE, VSALVAGE or VNOSERVICE error was received, the Windows
client would query the VLDB and retry the request after 2 seconds.  If a
volume couldn't be served from a file server, this process would be
repeated.  This was exacerbated by the behavior of the Explorer Shell,
which reads the contents of the directories it displays while searching
for various metadata.  As a result the VLDB servers were struggling
under the load.  It wasn't going to be possible to make the VLDB servers
process more requests, so it was important to reduce the number of
requests that were sent.
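
To make the cost of that loop concrete, here is a minimal sketch of the
retry behavior described above.  It is a model for illustration, not the
Windows client's code: fetch_with_retry, vldb_lookup and max_attempts
are hypothetical names, and the numeric error values are as I recall
them from the OpenAFS volume package headers, so check them against your
source tree.

```python
# Error values as recalled from the OpenAFS volume package headers
# (an assumption; verify against your tree).
VSALVAGE, VNOVOL, VNOSERVICE, VOFFLINE, VMOVED = 101, 103, 105, 106, 111

RETRY_ERRORS = {VSALVAGE, VNOVOL, VNOSERVICE, VOFFLINE, VMOVED}
RETRY_DELAY_SECONDS = 2  # the 2 second delay mentioned above


def fetch_with_retry(fetch, vldb_lookup, max_attempts, sleep=lambda s: None):
    """Hypothetical model of the old loop: every volume error triggers
    a VLDB lookup, a delay, and another attempt.

    Returns (last_code, number_of_vldb_lookups).
    """
    code = None
    lookups = 0
    for _ in range(max_attempts):
        code = fetch()
        if code not in RETRY_ERRORS:
            return code, lookups
        vldb_lookup()          # one VLDB query per failed attempt
        lookups += 1
        sleep(RETRY_DELAY_SECONDS)
    return code, lookups


# A file server that never has the volume costs one lookup per attempt.
code, lookups = fetch_with_retry(lambda: VNOVOL, lambda: None, max_attempts=5)
print(code, lookups)
```

Multiply the lookup count by every directory the Explorer Shell touches
and by tens of thousands of clients, and the pressure on the VLDB
servers follows directly.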

The discussions that took place came to the conclusion that the
description of VNOVOL was ambiguous and that its meaning, based upon
usage, should be that the volume is not present.  With that
interpretation a client could restrict the number of VLDB lookups for a
volume.  I do not remember if these discussions took place at a
hackathon, a workshop, or on Zephyr.  Such use of the error codes made
no difference to deployed clients, since they acted on all of these
error codes in an identical fashion, nor did it result in a protocol
change, given existing use in the file server.
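
One way a client might use that interpretation is sketched below.  The
split into two sets is my assumption for illustration, not a statement
of what any client implements: codes that say the volume is not at this
server invalidate the cached location, while the others leave it valid
and only call for a later retry.  The names are hypothetical and the
numeric values are as I recall them from the OpenAFS headers.

```python
# Error values as recalled from the OpenAFS volume package headers
# (an assumption; verify against your tree).
VSALVAGE, VNOVOL, VNOSERVICE, VOFFLINE, VMOVED = 101, 103, 105, 106, 111

# Assumption: these codes mean "the volume is not at this server",
# so the cached location is stale and the VLDB must be consulted.
LOCATION_STALE = {VNOVOL, VMOVED}

# Assumption: these codes mean "the volume is here but cannot be served
# right now", so the cached location is still good.
PRESENT_BUT_UNAVAILABLE = {VSALVAGE, VNOSERVICE, VOFFLINE}


def needs_vldb_lookup(code):
    """Return True only when the error says the location is stale."""
    return code in LOCATION_STALE


def count_lookups(codes):
    """Count the VLDB lookups a sequence of error codes would cause."""
    return sum(1 for c in codes if needs_vldb_lookup(c))


# Four VOFFLINE replies followed by one VNOVOL cost a single lookup,
# where the older loop would have issued five.
print(count_lookups([VOFFLINE] * 4 + [VNOVOL]))
```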

Perhaps others can find a reference in Zephyr logs.  I no longer have
access to them.

Jeffrey Altman

On 5/4/2012 5:40 PM, Tom Keiser wrote:
> Hi,
>
> As some of you already know, sites have recently run into troubles
> caused by interpretation of various volume package special error
> codes.  After looking at the Ed Zayas spec, and how the Unix and
> Windows clients interpret the various codes in master and OpenAFS 1.0,
> I wanted to start a discussion about the slight redefinition of
> protocol error handling semantics over the past decade.  According to
> the Zayas VVL spec, the relevant error codes have the following
> meanings:
>
> - VSALVAGE:  volume needs to be salvaged
>
> - VNOVOL:  the given volume is either not attached, doesn't exist, or
> is not online
>
> - VNOSERVICE: the volume is currently not in service
>
> - VOFFLINE: the specified volume is offline, for the reason given in
> the offline message field (a subfield within the volume field in struct
> volser_trans)
>
> - VBUSY: the named volume is temporarily unavailable, and the client
> is encouraged to retry the operation shortly
>
>
> By my reading of the above specification, VOFFLINE is strictly for use
> when offlineMessage is set in the VolumeDiskData file, whereas VNOVOL
> was intended to be the catch-all "it's not online" error code.
> Indeed, OpenAFS 1.0 volume.c more-or-less follows the above rubric.
> When working on DAFS many years ago, I tried to follow these
> definitions (although, admittedly, I got it wrong in a number of
> cases).
>
> Now, I must concede that the definitions in the Zayas spec are not
> terribly useful: they do not differentiate between "I don't have it",
> and "I won't give it to you", which is typically the fundamental
> question the CM is trying to answer.  In this strict sense, I much
> prefer the way recent versions of the Windows CM utilize
> VNOVOL/VOFFLINE as a means of satisfying the existence question.
> However, as much as I like the cleanliness this approach provides, I
> am concerned about the seeming divergence between our implementations
> and our specification...
>
> It's certainly possible that I'm not privy to protocol discussions
> where it was decided that redefining VNOVOL, VNOSERVICE[*], and
> VOFFLINE was ok (given that legacy CMs seem to make little distinction
> between VOFFLINE, VNOVOL, VSALVAGE, VNOSERVICE, etc.).  If that is the
> case, could someone provide more information from these discussions?
>
> Obviously, the current mismatch in behavior between DAFS and the
> Windows CM needs to be resolved posthaste.  That we already have a
> wide deployment base of nodes in disagreement about the denotation of
> certain critical error codes is troubling--to the point that
> pragmatism may preclude us from strict adherence to the extant AFS-3
> specification.
>
> This leaves me with two questions:
>
> 1) is there something that OpenAFS can do to resolve this issue
> without requiring any standards involvement?
>
> 2) if not, what is our stop-gap until we can fix this at the afs3-stds
> level?
>
>
> With regard to (1), I have some patches that modify DAFS to behave
> more like the Windows CM expects.  However, before I consider pushing
> these patches to gerrit, I want to solicit opinions regarding these
> underlying questions...
>

