[OpenAFS-devel] Re: [OpenAFS] Volume root corruptions - anybody seen those?

Rainer Toebbicke rtb@pclella.cern.ch
Thu, 5 Jun 2008 11:07:26 +0200


Sorry, I started this in openafs-info a few days ago, but this is 
probably the better place:

the problem: "salvager -part xxx -vol NNN" corrupts volumes
(BTW, even with -nowrite!)

the scenario:

1. the volserver takes a volume offline, and the salvager then takes 
the same volume offline without knowing about the volserver. This is 
possible because the partition lock only protects the VAttachVolume, 
and the "offline" request succeeds (of course) even if the volume is 
already offline.

2. the salvager and the volserver work on the same data: the first 
opportunity for a thorough mess.

3. assuming they did not collide (e.g. with salvager -nowrite), the 
salvager finishes and puts the volume back online while the volserver 
is still running. There is no "reference count" on offline/online. The 
fileserver sees this, the user jumps onto it, and we have a second 
opportunity to blow everything apart.
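
To illustrate step 3, here is a toy model of the race in Python. It is 
not OpenAFS code: the Volume class and its methods are made up for 
illustration, and the only assumption carried over from the scenario 
is that "offline" is a plain flag with no record of who set it.

```python
class Volume:
    """Toy model: 'offline' is a plain flag, not a count of holders."""

    def __init__(self):
        self.online = True

    def take_offline(self):
        # Succeeds even if the volume is already offline.
        self.online = False

    def put_online(self):
        # Nobody checks whether another process still needs it offline.
        self.online = True


vol = Volume()

vol.take_offline()      # volserver starts its operation
volserver_busy = True

vol.take_offline()      # salvager starts; nothing tells it about the volserver
vol.put_online()        # salvager finishes first and puts the volume back

# The volume is online again although the volserver is still working on it.
exposed = vol.online and volserver_busy
```

In this model "exposed" ends up true, which corresponds to the window 
in which the fileserver hands the volume to users too early.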


I see a number of alternatives:

1. never use "salvager -vol", always shut down the volserver 
beforehand (a challenge with bosserver restarting it), or try 
gymnastics with "vos lock", "vos offline", etc. It all looks pretty 
cumbersome.

2. salvage a single volume within the volserver, triggered by a "vos 
salvage" command; probably the most logical, analogous to "vos zap" et 
al. I haven't checked yet whether the code can be shared without 
problems; it looks like it needs some non-trivial Makefile and #ifdef 
gymnastics to get right. And of course, it changes the documented 
commands, with changes in bos and vos.

3. have the salvager create a transaction in the volserver, which 
takes care of offline/online. This would have to guard against 
deadlocks, as the salvager doesn't just salvage the volume itself but 
also the RW parent in the case of an RO or BK. It would also have to 
cope with the (usual) case that the volume is so damaged that it 
cannot be attached. Otherwise it is minimally intrusive, with no 
external changes.
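
As a rough sketch of what a "reference count" on offline/online could 
look like, whichever of 2. or 3. provides it, here is a small Python 
model. The OfflineHolds class and its method names are hypothetical 
and do not exist in OpenAFS; the sketch covers only the counting, not 
the deadlock question or the unattachable-volume case.

```python
import threading


class OfflineHolds:
    """Hypothetical registry: a volume is online only when nobody holds it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._holds = {}            # volume id -> number of current holders

    def acquire(self, vol_id):
        # Each party that needs the volume offline registers a hold.
        with self._lock:
            self._holds[vol_id] = self._holds.get(vol_id, 0) + 1

    def release(self, vol_id):
        # The volume comes back only when the last hold is dropped.
        with self._lock:
            count = self._holds[vol_id] - 1
            if count:
                self._holds[vol_id] = count
            else:
                del self._holds[vol_id]

    def is_online(self, vol_id):
        with self._lock:
            return vol_id not in self._holds


holds = OfflineHolds()
holds.acquire(536870912)    # volserver (the volume id is an arbitrary example)
holds.acquire(536870912)    # salvager
holds.release(536870912)    # salvager finishes first
still_offline = not holds.is_online(536870912)
```

Here the salvager finishing first no longer brings the volume back: 
"still_offline" stays true until the volserver releases its hold too.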


Before I go and implement 2. or 3., please give me your thoughts about 
the scenario if you care (I may have missed something and would have 
to look for another culprit; Derrick hinted at "-orphans attach", 
which I do not fully understand yet), and whether 2. is preferable 
to 3.


Cheers, Rainer


-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155