[OpenAFS] Volume root corruptions - anybody seen those?

Rainer Toebbicke rtb@pclella.cern.ch
Tue, 3 Jun 2008 09:18:57 +0200


We've started a program of low rate preventive salvages of individual 
volumes (in -nowrite, per volume while fileserver is running, mainly 
to spot irregularities) and ran into the following problem:

on a few (1/1000) volumes the length of the volume root directory 
would be flagged as something astronomically big, to be corrected to 
6144, 8192 or other more modest multiples of 2k. And sure enough, 
when doing the real salvage the root would then be logically truncated 
(not to say completely messed up) and plenty of orphaned files would 
appear.

That all sounds logical for a corrupted directory, so the question is 
where it got corrupted. This is still OpenAFS 1.4.4.

When you convert that "BIG" number to hexadecimal you get the 
peculiar pattern 0x180000001800, or 0x080000000800, i.e. always the 
"final" multiple of 2K ORed into that same multiple shifted left by 32 
bits!

Looks like a misconversion from 32 bit into the vnode length field 
encoded into two afs_int32s, doesn't it?
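
To make the pattern concrete, here is a small sketch of the arithmetic. 
It is not OpenAFS code: the function names and the 2048 constant are 
my own illustration, and it only assumes that the bogus length is the 
correct 32-bit value duplicated into the upper 32 bits as described 
above.

```python
# Illustrative only: these helpers are hypothetical, not part of OpenAFS.
DIR_PAGE_SIZE = 2048  # the "2k" granularity mentioned above (assumption)


def split_length(length):
    """Split a 64-bit length into its (high, low) 32-bit halves."""
    return (length >> 32) & 0xFFFFFFFF, length & 0xFFFFFFFF


def mirrored(length32):
    """Build the bogus value: the 32-bit length ORed into itself << 32."""
    return (length32 << 32) | length32


def looks_mirrored(length):
    """True if both halves are equal, non-zero and a multiple of 2k."""
    high, low = split_length(length)
    return high != 0 and high == low and low % DIR_PAGE_SIZE == 0


if __name__ == "__main__":
    for bogus in (0x180000001800, 0x080000000800):
        high, low = split_length(bogus)
        print(hex(bogus), "high =", high, "low =", low,
              "mirrored:", looks_mirrored(bogus))
```

Splitting 0x180000001800 this way gives 6144 in both halves, and 
0x080000000800 gives 2048 in both, which matches the values the 
salvager wants to correct the length to.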

I have (of course) changed the name of the "length" field in vnode.h 
and recompiled everything just to check who would dare to use the 
vnode length field without going through the "official" macros, and 
sure enough found one place, for which I have a patch ready. But that 
one doesn't account for anything serious.

We've only started this on one server, which happens to be a 32-bit 
machine, so statistically irrelevant. (Actually, this is not quite 
true: in the past we did see a very small number of volumes where the 
volume header got corrupted on that same machine, and only "volinfo 
-fixheader" would be able to solve the problem, the salvager wouldn't).

My worry is that the "salvage" itself messes things up. After all, it 
needs to detach the volume from the file server in order to work on 
it, and there are probably places where this could easily go wrong. It 
also needs to lock out the volserver; whether that logic is watertight 
remains to be checked.

I actually considered the approach very careful compared to people who 
walk volume structures with scripts looking for mount points and the 
like and never run into problems... hence my surprise that we see 
corruptions.

Hints, observations, apropos, experiences, I-told-yous, 
seen-this-as-wells appreciated.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155