[OpenAFS] Re: Offline volumes after upgrade to 1.6.1pre2

Andrew Deason adeason@sinenomine.net
Thu, 16 Feb 2012 10:15:35 -0600


On Thu, 16 Feb 2012 11:15:19 +0100
Åsa Andersson <spigg@csc.kth.se> wrote:

> Hello,
> 
> We upgraded our file servers from 1.6.0 to 1.6.1pre2 last Sunday (most of
> our clients are running version 1.6.0) and after that we have seen volumes 
> going offline and entries in the FileLog indicating they need salvaging. 
> 
> Typical FileLog-entries look like this:
> 
> ----------
> Mon Feb 13 09:38:02 2012 Fid 537126959.344.442066 has inconsistent
> length (index 573440, inode 524288); volume must be salvaged
> ----------

This is a new consistency check. It indicates that the vnode index
(which is consulted when someone on a client does e.g. a stat()) says
the file is 573440 bytes long, but the actual file data on disk only has
524288 bytes. Before this check, the fileserver would just serve the
first 524288 bytes, and client applications tended to just see NULs
after that (though the actual content may be undefined, I'm not sure).
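
As a rough illustration of what the check compares (a minimal sketch only,
not the fileserver's actual code; the function name check_length is made
up here, and the two numbers are taken from the FileLog entry quoted
above):

```python
def check_length(index_len: int, inode_len: int) -> bool:
    """Return True when the vnode index and the on-disk data agree.

    Hypothetical helper: the real fileserver check lives in C and may
    differ in detail; this only models the comparison described above.
    """
    return index_len == inode_len


# Lengths reported in the FileLog entry for Fid 537126959.344.442066.
index_len = 573440   # what the vnode index claims
inode_len = 524288   # what is actually stored on disk

consistent = check_length(index_len, inode_len)
missing = index_len - inode_len

if not consistent:
    print(f"inconsistent length: {missing} bytes missing on disk")
```

With the values from the log, the index claims 49152 bytes more than the
on-disk data actually holds.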

> or like this:
> 
> ----------
> Mon Feb 13 10:10:47 2012 fssync: breaking all call backs for volume 537126959
> Mon Feb 13 10:10:47 2012 ReadHeader: Failed to open volume info header file (volume=537126959, inode=2306942731429085183); errno=2
> Mon Feb 13 10:10:47 2012 VAttachVolume: Error reading diskDataHandle header for vol 537126961; error=101
> Mon Feb 13 10:10:47 2012 VAttachVolume: Error attaching volume /vicepa//V0537126961.vol; volume needs salvage; error=101

A special file for a volume just doesn't exist on disk. I think this
would be... /vicepX/AFSIDat/j/jUy+U/zzzz521u1+0
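
For reference, the directory part of that path can be derived from the
volume ID. The following is a sketch under assumptions: it assumes the
namei backend encodes numbers 6 bits at a time, least significant digit
first, using the alphabet shown, and that the first directory level is the
low 8 bits of the volume ID. The names B64, flip_base64 and volume_dir are
made up for this example. It does not attempt to derive the special file
name itself.

```python
# Assumed digit alphabet for the namei "flipped" base-64 encoding.
B64 = "+=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def flip_base64(value: int) -> str:
    """Encode a non-negative integer, least significant 6 bits first."""
    if value == 0:
        return B64[0]
    digits = []
    while value:
        digits.append(B64[value & 0x3F])
        value >>= 6
    return "".join(digits)


def volume_dir(partition: str, volume_id: int) -> str:
    """Build the assumed per-volume directory under AFSIDat."""
    first_level = flip_base64(volume_id & 0xFF)
    return f"{partition}/AFSIDat/{first_level}/{flip_base64(volume_id)}"


path = volume_dir("/vicepX", 537126959)
print(path)
```

For volume 537126959 this gives /vicepX/AFSIDat/j/jUy+U, which matches the
directory portion of the path above.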

I don't know what would cause that. Is this after a salvage? Or a
release/restore/etc? I would guess 537126961 is the BK volume for
537126959?
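
The reasoning behind that guess, as a small sketch: volume IDs for a new
volume are conventionally handed out as three consecutive numbers (RW,
then RO, then BK). That is only the usual allocation pattern, not a
guarantee, so the VLDB entry (e.g. from vos examine) is the authority. The
function name sibling_ids is made up here.

```python
def sibling_ids(rw_id: int) -> dict:
    """Guess the RO and BK IDs from an RW ID.

    Assumes the conventional consecutive allocation (RW, RW+1, RW+2);
    a particular site's VLDB may not follow it.
    """
    return {"RW": rw_id, "RO": rw_id + 1, "BK": rw_id + 2}


ids = sibling_ids(537126959)
is_probably_bk = ids["BK"] == 537126961
print(ids, is_probably_bk)
```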

> offline so far. Running salvage seems to fix the volumes.

I assume you are not running DAFS? (otherwise, these should be salvaged
automatically)

> Is 1.6.1pre2 detecting data corruption brought on by 1.6.0 and this is
> what we're seeing?

For the first one, I think that's possible. If the CoW corruption
results in files getting incorrectly truncated, that would certainly
cause the first, but I'm not sure if the specific corruption patterns
are known. Do you have any idea what file 537126959.344.442066 is?
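
In case it helps with tracking the file down, the Fid in the log message
breaks apart as volume.vnode.uniquifier. A minimal sketch (the names Fid
and parse_fid are made up here; the file/directory guess relies on the
usual AFS convention that directories get odd vnode numbers and files even
ones, which I'm treating as an assumption):

```python
from typing import NamedTuple


class Fid(NamedTuple):
    volume: int
    vnode: int
    unique: int


def parse_fid(text: str) -> Fid:
    """Split a 'volume.vnode.unique' string as printed in the FileLog."""
    volume, vnode, unique = (int(part) for part in text.split("."))
    return Fid(volume, vnode, unique)


fid = parse_fid("537126959.344.442066")

# Assumed convention: odd vnode numbers are directories, even ones files.
kind = "directory" if fid.vnode % 2 else "file"
print(fid, kind)
```

Here the vnode number 344 is even, so under that convention the Fid refers
to a regular file in volume 537126959.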

I don't think that's possible for the second one, though.

-- 
Andrew Deason
adeason@sinenomine.net