[OpenAFS] Re: Offline volumes after upgrade to 1.6.1pre2

Åsa Andersson spigg@csc.kth.se
Fri, 17 Feb 2012 16:50:24 +0100


On 12-02-16, Andrew Deason wrote:
> On Thu, 16 Feb 2012 11:15:19 +0100
> Åsa Andersson <spigg@csc.kth.se> wrote:
> 
> > Hello,
> > 
> > We upgraded our file servers from 1.6.0 to 1.6.1pre2 last Sunday (mos=
t of
> > our clients are running version 1.6.0) and after that we have seen vo=
lumes 
> > going offline and entries in the FileLog indicating they need salvagi=
ng. 
> > 
> > Typical FileLog-entries look like this:
> > 
> > ----------
> > Mon Feb 13 09:38:02 2012 Fid 537126959.344.442066 has inconsistent
> > length (index 573440, inode 524288); volume must be salvaged
> > ----------
> 
> This is a new consistency check. It indicates that the vnode index
> (which is consulted when someone on a client does e.g. a stat()) says
> the file is 573440 bytes long, but the actual file data on disk only ha=
s
> 524288 bytes. Before this check, the fileserver would just serve the
> first 524288 bytes, and client applications tended to just see NULs
> after that (though the actual content may be undefined, I'm not sure).
> 
> > or like this:
> > 
> > ----------
> > Mon Feb 13 10:10:47 2012 fssync: breaking all call backs for volume 5=
37126959
> > Mon Feb 13 10:10:47 2012 ReadHeader: Failed to open volume info heade=
r file (v>olume=537126959, inode=2306942731429085183); errno=2
> > Mon Feb 13 10:10:47 2012 VAttachVolume: Error reading diskDataHandle =
header fo>r vol 537126961; error=101
> > Mon Feb 13 10:10:47 2012 VAttachVolume: Error attaching volume /vicep=
a//V05371>26961.vol; volume needs salvage; error=101
> 
> A special file for a volue just doesn't exist on disk. I think this
> would be... /vicepX/AFSIDat/j/jUy+U/zzzz521u1+0
> 
> I don't know what would cause that. Is this after a salvage? Or a
> release/restore/etc? I would guess 537126961 is the BK volume for
> 537126959 ? 

Yes, 537126961 is the BK volume and this is after a salvage.
 
> > offline so far. Running salvage seems to fix the volumes.
> 
> I assume you are not running DAFS? (otherwise, these should be salvaged
> automatically)

Yes, that's correct, we're not running DAFS.

> > Is 1.6.1pre2 detecting data corruption brought on by 1.6.0 and this i=
s
> > what we're seeing?
> 
> For the first one, I think that's possible. If the CoW corruption
> results in files getting incorrectly truncated, that would certainly
> cause the first, but I'm not sure if the specific corruption patterns
> are known. Do you have any idea what file 537126959.344.442066 is?

It's a SQLite 3.x database.

> I don't think that's possible for the second one, though.

It seems the typical pattern (that we missed yesterday) is that we first 
get the "inconsistent length" error for the volume and then we get the other
error following a salvage run. 

In a few cases we've seen a second error message about inconsistent length
following shortly after the first (but after a salvage of the volume).
The first file it complains about below is also a SQLite 3.x database.

-------
Fri Feb 17 11:09:04 2012 Fid 537158894.970.216752 has inconsistent length (index 524288, inode 262144); volume must be salvaged
Fri Feb 17 11:11:28 2012 ReadHeader: Failed to open volume info header file (volume=537158894, inode=2307079891209682943); errno=2
Fri Feb 17 11:11:28 2012 ReadHeader: Failed to open volume info header file (volume=537158894, inode=2307079891209682943); errno=2
Fri Feb 17 11:11:28 2012 ReadHeader: Failed to open volume info header file (volume=537158894, inode=2307079891209682943); errno=2
Fri Feb 17 11:11:31 2012 fssync: breaking all call backs for volume 537158894
Fri Feb 17 11:11:40 2012 Fid 537158894.1.1 has inconsistent length (index 6144, inode 14336); volume must be salvaged
Fri Feb 17 11:41:27 2012 fssync: breaking all call backs for volume 537158894
Fri Feb 17 11:41:27 2012 ReadHeader: Failed to open volume info header file (volume=537158894, inode=2307079891209682943); errno=2
Fri Feb 17 11:41:27 2012 ReadHeader: Failed to open volume info header file (volume=537158894, inode=2307079891209682943); errno=2
Fri Feb 17 11:41:27 2012 ReadHeader: Failed to open volume info header file (volume=537158894, inode=2307079891209682943); errno=2
--------

Hopefully this means that we don't have any further corruption 
caused by 1.6.1pre2 going on.

Thanks for your help!

Åsa Andersson
School of Computer Science and Communication
Royal Institute of Technology


> -- 
> Andrew Deason
> adeason@sinenomine.net
> 
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info