[OpenAFS-devel] Re: [OpenAFS] Volume root corruptions - anybody seen those?

Rainer Toebbicke rtb@pclella.cern.ch
Fri, 6 Jun 2008 10:10:22 +0200


Jeffrey Hutzelman wrote:

> anyway. Hartmut's patch is on the right track -- the appropriate thing 
> to do here is for the fssync service to provide the same protections for 
> multiple fssync clients accessing the same volume that it does for a 
> single client and the fileserver itself.
> 

Hartmut and I discussed this over the phone and disagreed on whether 
this "synchronization" is actually the fileserver's role.
He thinks it is and, fair enough, the existing code agrees with him.
His patch is so small that I'll shut up and go straight ahead and deploy it.

Just for the record, and on the grounds that even code that existed 
before 1993 (lol, I just checked it in afs3.2) should not be immune from 
questions about its design, I'll nevertheless point out:

. the fileserver has a hell of a job already and loading it with yet 
another function will hardly make it slimmer

. the fileserver is actually not concerned with managing volumes, certainly 
not all volumes which float around on the disks. Logically, it would 
then also have to manage synchronization of volumes it never cares 
about itself, such as the various temporary clones, R/W volids when 
the only one in sight is a R/O, etc.

. sooner or later a volume id will stay on the OfflineVolumes list for 
obscure reasons. You will have no way to find out why, requests for it 
will be denied, and you'll scratch your head for a while. Then you'll 
restart the file server...

. denying a request requires upstream code to handle the denial and retry
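
To illustrate the last point, here is a minimal Python sketch of what 
every caller has to carry once requests can be denied. It is not 
OpenAFS code: `VolumeBusy` and `with_retry` are hypothetical names 
invented for this example, and the backoff policy is an assumption.

```python
import time


class VolumeBusy(Exception):
    """Hypothetical error: the service denied the request for this volume."""


def with_retry(request, attempts=5, delay=0.5, sleep=time.sleep):
    """Call request() until it succeeds or the attempts are used up.

    request  -- zero-argument callable that may raise VolumeBusy
    attempts -- total number of tries before giving up
    delay    -- base delay in seconds, doubled after each denial
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return request()
        except VolumeBusy as exc:
            last_error = exc
            # Do not sleep after the final attempt; just give up.
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))
    # If the volume id never leaves the "offline" list, we end up here.
    raise last_error
```

If the volume id is stuck on the list for good, no amount of retrying 
helps: the caller only learns that the request kept being denied, not 
who holds the volume or why.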

A "design" solution would probably use mutex-like objects, perhaps even 
with administrative tools to inspect and manage them. The volserver's 
transactions have some nice aspects, in particular transparency. And 
in the long run I suspect the volserver could salvage volumes itself.
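
Purely as an illustration of the "mutex-like objects plus 
administrative tools" idea, here is a small Python sketch. It is an 
assumption about what such a design could look like, not a description 
of any existing OpenAFS component; the class `VolumeLockRegistry` and 
all of its methods are hypothetical.

```python
import threading
import time


class VolumeLockRegistry:
    """Per-volume exclusive locks that record who holds them."""

    def __init__(self):
        self._guard = threading.Lock()   # protects the table below
        self._held = {}                  # volume id -> (owner, acquired_at)

    def try_acquire(self, volid, owner):
        """Take the lock for volid; return False if someone else holds it."""
        with self._guard:
            if volid in self._held:
                return False
            self._held[volid] = (owner, time.time())
            return True

    def release(self, volid, owner):
        """Release the lock; only the recorded owner may do so."""
        with self._guard:
            holder = self._held.get(volid)
            if holder is None or holder[0] != owner:
                raise RuntimeError("volume %r is not held by %r" % (volid, owner))
            del self._held[volid]

    def inspect(self):
        """Administrative view: which volume is held by whom."""
        with self._guard:
            return {volid: owner for volid, (owner, _) in self._held.items()}

    def force_release(self, volid):
        """Administrative override for a lock that was left behind."""
        with self._guard:
            return self._held.pop(volid, None) is not None
```

The point of `inspect` and `force_release` is that a volume left 
locked for obscure reasons can be found and cleared without 
restarting anything.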

However, as Derrick points out, the original problem might still come 
from elsewhere (I fail to see how volserver/salvager/fileserver 
stepping on each other's toes could explain the peculiar vnode length 
field corruption we saw) so the "design" solution will have to wait.


-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155