[OpenAFS] Volume root corruptions - anybody seen those?
Rainer Toebbicke
rtb@pclella.cern.ch
Tue, 3 Jun 2008 09:18:57 +0200
We've started a program of low-rate preventive salvages of individual
volumes (with -nowrite, per volume, while the fileserver is running,
mainly to spot irregularities) and ran into the following problem:
on a few (1/1000) volumes the length of the volume root directory
would be flagged as something astronomically big, to be corrected to
6144, 8192 or other more modest multiples of 2k. And sure enough,
doing the real salvage the root would then be logically truncated (not
to say completely messed up) and plenty of orphaned files would appear.
Sounds all logical for a corrupted directory, so the question is where
did it get corrupted. This is openafs 1.4.4 still.
When you convert that "BIG" number to hexadecimal you get the
peculiar pattern 0x180000001800, or 0x080000000800, i.e. always the
"final" multiple of 2K ORed with that same multiple shifted left by 32
bits!
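
The pattern can be checked mechanically. Below is a minimal Python
sketch (the helper names are made up for illustration; this is not
OpenAFS code) that splits a reported length into its high and low
32-bit halves and tests whether the low half is repeated in the high
half:

```python
def split_length(length):
    """Split a 64-bit length into (high, low) 32-bit halves."""
    high = (length >> 32) & 0xFFFFFFFF
    low = length & 0xFFFFFFFF
    return high, low


def looks_doubled(length):
    """True if the non-zero low half is repeated in the high half."""
    high, low = split_length(length)
    return low != 0 and high == low


# The two values quoted above.
for bogus in (0x180000001800, 0x080000000800):
    high, low = split_length(bogus)
    print(hex(bogus), "high =", high, "low =", low,
          "doubled =", looks_doubled(bogus))
```

For the two values quoted, both halves come out as 6144 and 2048
respectively, each a multiple of 2K.
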
Looks like a misconversion from 32 bit into the vnode length field
encoded into two afs_int32s, doesn't it?
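
A toy model of the suspected slip, purely as an assumption: this is
not the actual vnode.h layout or its macros, and all names are
invented. It keeps a length as two 32-bit words, with one correct
store and one hypothetical buggy store that puts the 32-bit value
into both words:

```python
MASK32 = 0xFFFFFFFF


def store_correct(length):
    """Store a length as (high, low) 32-bit words."""
    return (length >> 32) & MASK32, length & MASK32


def store_buggy(length32):
    """Hypothetical bug: the 32-bit value ends up in both words."""
    return length32 & MASK32, length32 & MASK32


def load(words):
    """Reassemble the 64-bit length from (high, low)."""
    high, low = words
    return (high << 32) | low


print(hex(load(store_correct(6144))))  # 0x1800
print(hex(load(store_buggy(6144))))    # 0x180000001800
```

A store of this shape would reproduce exactly the values seen, which
is why the pattern points at a conversion problem rather than at
random on-disk damage. It does not show where such a store would
actually happen.
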
I have (of course) changed the name of the "length" field in vnode.h
and recompiled everything just to check who would dare to
use the vnode length field without going through the "official"
macros, and sure enough found a place, for which I have a patch ready.
But that one doesn't account for anything serious.
We've only started this on one server, which happens to be a 32-bit
machine, so statistically irrelevant. (Actually, this is not quite
true: in the past we did see a very small number of volumes where the
volume header got corrupted on that same machine, and only "volinfo
-fixheader" would be able to solve the problem, the salvager wouldn't).
My worry is that the "salvage" itself messes things up: after all, it
needs to detach the volume from the file server in order to work on
it, and there are probably places where this could easily go wrong. It
also needs to lock out the volserver; whether that logic is
waterproof remains to be checked.
I actually considered the approach very careful compared to people who
walk volume structures with scripts looking for mount points and the
like and never run into problems... hence my surprise that we see
corruptions.
Hints, observations, apropos, experiences, I-told-yous,
seen-this-as-wells appreciated.
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985 Fax: +41 22 767 7155