[OpenAFS] Re: CopyOnWrite failure, leading to volume salvage

Thu, 27 Sep 2012 13:03:53 -0500

On Thu, 27 Sep 2012 17:00:36 +0100
Dameon Wagner <dameon.wagner@it.ox.ac.uk> wrote:

> #---8<-----------------------------------------------------------------
> STARTING AFS SALVAGER 2.4 (/usr/lib/openafs/salvager /vicepa 536874907)
> 5 nVolumesInInodeFile 160 
> Recreating link table for volume 536874907.

Oh, we do log something for this, hooray :)

Yes, so your link table for that volume went missing. Either it got
deleted, or something moved or renamed it. As I mentioned in my earlier
email, that would explain your symptoms.

A missing link table by itself should not cause corruption or data loss
here. It is easily reconstructed, but doing so requires scanning all of
the volume group's files, which is what the salvager does.

> CHECKING CLONED VOLUME 536892026.
> CHECKING CLONED VOLUME 536891895.
> CHECKING CLONED VOLUME 536891436.
> CHECKING CLONED VOLUME 536891016.

What are all of these volumes? Do they correspond to anything in the
vldb ('vos listvl 536892026'), or do you see them in logs?

If you have the volinfo tool, can you try to get any information from
these? ('volinfo /vicepa 536892026' as root on the fileserver) It might
give a name for the volumes.
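
If it saves some typing, here is a rough sketch for pulling the clone
IDs out of the salvager output and printing the two lookups above for
each one. The function names and the default partition are just for
illustration, and it only builds the command strings; it does not run
anything.

```python
import re

# Matches salvager lines such as "CHECKING CLONED VOLUME 536892026."
# (works on quoted mail text too, since it searches anywhere in the line)
CLONE_RE = re.compile(r"CHECKING CLONED VOLUME (\d+)")


def cloned_volume_ids(log_text):
    """Return cloned volume IDs from salvager output, in order, no duplicates."""
    ids = []
    for match in CLONE_RE.finditer(log_text):
        vol_id = match.group(1)
        if vol_id not in ids:
            ids.append(vol_id)
    return ids


def lookup_commands(vol_ids, partition="/vicepa"):
    """Build (but do not run) the vldb and volinfo lookups for each ID."""
    commands = []
    for vol_id in vol_ids:
        commands.append(f"vos listvl {vol_id}")
        commands.append(f"volinfo {partition} {vol_id}")
    return commands


if __name__ == "__main__":
    sample = (
        "CHECKING CLONED VOLUME 536892026.\n"
        "CHECKING CLONED VOLUME 536891895.\n"
    )
    for cmd in lookup_commands(cloned_volume_ids(sample)):
        print(cmd)
```

The volinfo commands still need to be run as root on the fileserver.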

> #---8<-----------------------------------------------------------------
> $ vos examine -id 536874907
> vhost.a071                        536874907 RW  372246072 K  On-line
>     $FILESERVER /vicepa 
>     RWrite  536874907 ROnly  536892662 Backup          0 
>     MaxQuota  524288000 K 
>     Creation    Tue Sep  9 16:08:54 2008
>     Copy        Wed Sep  1 09:16:37 2010
>     Backup      Never
>     Last Update Thu Sep 27 16:31:50 2012
>     187791 accesses in the past day (i.e., vnode references)
> 
>     RWrite: 536874907 
>     number of sites -> 1
>        server $FILESERVER partition /vicepa RW Site
> #---8<-----------------------------------------------------------------
> 
> I was a little surprised to see "Backup Never", as our backup systems
> logs show a successful backup just this morning (and previous days
> through the schedule too).  Let me know if any further information
> would be helpful/useful.

Do your backups actually use 'backup' volumes (that is, anything
involving 'vos backup' or 'vos backupsys')? Or could you describe how
your backups work?

What is confusing about that output is that the original error message
mentions a CoW error, but if you only have one RW volume by itself, you
shouldn't need to CoW. Do you expect an RO volume or a backup volume to
exist for this, or is it possible you were using 'vos dump -clone' or
anything like that around that time?

Or perhaps more generally, do you know of any volume operations that
were going on around the time of that first CopyOnWrite error? (either
VolserLog or your own logging would be helpful there)

-- 
Andrew Deason
adeason@sinenomine.net