[OpenAFS-devel] Re: [OpenAFS] Volume root corruptions - anybody seen those?

Jeffrey Hutzelman jhutz@cmu.edu
Thu, 05 Jun 2008 16:55:14 -0400


--On Thursday, June 05, 2008 04:03:44 PM -0400 Dean Anderson <dean@av8.com> 
wrote:

> On Thu, 5 Jun 2008, Rainer Toebbicke wrote:
>>
>> 1. never use "salvager -vol",  always shut down the volserver before
>> (a challenge with bosserver restarting it), try gymnastics with "vos
>> lock", "vos offline", etc...  all looks pretty cumbersome.
>
> Err, you tell bosserver to shut it down using bos. This is pretty basic
> afs admin stuff.... I'm a little worried...

You can tell the bosserver to shut down the whole bnode, but you cannot 
tell it to shut down just the volserver part of an 'fs' bnode.  But that's 
OK; shutting down the whole volserver is a bit heavy-handed anyway. 
Hartmut's patch is on the right track -- the appropriate thing to do here 
is for the fssync service to provide the same protections for multiple 
fssync clients accessing the same volume that it does for a single client 
and the fileserver itself.

I worry, though, that the approach of simply denying any request for a 
volume that has already been taken offline by another fssync client might 
be a bit too simplistic.  While I can't see any specific problem here, it's 
worth thinking about.

A bigger concern is that the proposed patch applies only to 
FSYNC_NEEDVOLUME and not also to FSYNC_OFF.  That strikes me as incorrect 
-- the difference is that FSYNC_OFF means the volume must be taken offline 
and made unavailable.  Surely an FSYNC_NEEDVOLUME request and an FSYNC_OFF 
request should be mutually exclusive regardless of the order in which they 
arrive.
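To make the mutual-exclusion point concrete, here's a rough sketch -- 
hypothetical, not the actual fssync code; the class and return values are 
made up, only the FSYNC_NEEDVOLUME and FSYNC_OFF names come from the 
discussion above:

```python
# Hypothetical sketch, not OpenAFS code: a per-volume checkout table in
# which a second client's request -- FSYNC_NEEDVOLUME or FSYNC_OFF, in
# either order -- is refused while another fssync client holds the volume.

FSYNC_NEEDVOLUME = "NEEDVOLUME"  # requestor wants the volume handed over
FSYNC_OFF = "OFF"                # volume must go offline and stay unavailable

class FssyncState:
    def __init__(self):
        self.checkouts = {}  # volume id -> (client id, request type)

    def request(self, vol_id, client_id, req_type):
        holder = self.checkouts.get(vol_id)
        if holder is not None and holder[0] != client_id:
            return "DENIED"  # some other fssync client already has it
        self.checkouts[vol_id] = (client_id, req_type)
        return "GRANTED"

    def release(self, vol_id, client_id):
        if vol_id in self.checkouts and self.checkouts[vol_id][0] == client_id:
            del self.checkouts[vol_id]
```

Note that the denial happens for either request type, not just 
FSYNC_NEEDVOLUME -- which is exactly the property the patch as proposed 
would lack.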


> Hmm, we haven't seen these problems in 18+ years, so I'm just a little
> concerned about whether this is really a problem or not.

The fact that you have not observed a race does not mean that it doesn't 
exist.  Race conditions are often very dependent on usage patterns; it is 
entirely possible for you to never see such a problem even though others 
see it every day.  Additionally, if the probability of encountering the 
problem is low enough, you may experience it and not recognize it for what 
it is, especially if the underlying problem hasn't been noticed yet (which 
is not that uncommon -- concurrency is Hard).
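For illustration, the kind of bug in question is a check-then-act race.  
This toy sketch (nothing to do with the real code paths) forces the 
losing interleaving deterministically: every client checks before any 
client acts, so both conclude the volume is free:

```python
# Illustrative only: a check-then-act race.  Both clients observe the
# volume as free before either marks it taken, so both "attach" it --
# precisely the interleaving a per-volume checkout record prevents.

def interleaved_attach(num_clients=2):
    taken = False
    # Step 1: every client performs its check before any client acts.
    checks = [not taken for _ in range(num_clients)]
    # Step 2: every client that saw "free" proceeds to attach.
    attached = []
    for client, saw_free in enumerate(checks):
        if saw_free:
            taken = True
            attached.append(client)
    return attached

# Both clients attach even though only one should: returns [0, 1].
```

In real deployments the window between check and act is tiny, which is 
why one site can go years without hitting it while another hits it weekly.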


> Adding unnecessary stateful locks is just more opportunity to get things
> screwed up. I think the real question is why salvager would ever be run
> when volserver is both running and thinking the volume is online?  How
> did you get in that position? (and it isn't because of a lack of locks)

Because the single-volume mode of the salvager is intended to be run in 
that situation.


> The premise that volserver somehow works on volumes that are offline is
> probably where you went wrong, I think. My guess is that you don't
> understand the afs system administration and you aren't properly
> admin'ing the system so that the volumes are offline when you are
> running salvager.

Actually, it sounds like you don't understand the internals of the fssync 
service, which coordinates access to volumes between the fileserver and 
"volume utilities", including the volserver and certain standalone tools. 
When the salvager is run on a single volume, either directly or via 'bos 
salvage', the fileserver is _not_ shut down.  Instead, the salvager uses 
the fssync interface to request that the fileserver hand over the volume, 
and then to hand it back when it is done.  This is the same interface used 
by the volserver to take control of a volume to dump, restore, or clone it, 
or even to examine its header.

There are multiple types of fssync requests; some request that the 
fileserver completely take a volume offline while the requestor is using 
it, and others only ask that the fileserver make the volume read-only.  So 
yes, even though many 'vos' subcommands will not work on a volume which 
'vos listvol' shows as "offline", most of those commands do involve the 
volserver taking the volume offline while it works on it.
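A rough model of that handover, in case it helps -- again hypothetical, 
not the real fssync wire protocol.  Which request types count as "fully 
offline" versus "read-only" is an assumption of this sketch, and 
READONLY_REQUEST is a placeholder name; only FSYNC_NEEDVOLUME and 
FSYNC_OFF are named in this thread:

```python
# Hypothetical sketch, not the real fssync protocol: the fileserver keeps
# running while a volume utility (salvager, volserver, ...) checks a single
# volume out and later hands it back.

OFFLINE_REQUESTS = {"FSYNC_OFF", "FSYNC_NEEDVOLUME"}  # assumed for the sketch
READONLY_REQUESTS = {"READONLY_REQUEST"}              # placeholder name

class Fileserver:
    def __init__(self):
        self.volume_state = {}  # volume id -> "online"/"offline"/"readonly"

    def hand_over(self, vol_id, req_type):
        # Only this one volume changes state; the fileserver stays up.
        if req_type in OFFLINE_REQUESTS:
            self.volume_state[vol_id] = "offline"
        else:
            self.volume_state[vol_id] = "readonly"

    def hand_back(self, vol_id):
        # The utility is done (e.g. after a dump, restore, clone, or salvage).
        self.volume_state[vol_id] = "online"
```

The point being: "offline" in 'vos listvol' output is a transient, 
per-volume state negotiated over this interface, not evidence of 
misadministration.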


-- Jeff