[OpenAFS] Re: 'vos dump' destroys volumes?

Andrew Deason adeason@sinenomine.net
Mon, 26 Mar 2012 12:38:50 -0500


On Mon, 26 Mar 2012 17:25:04 +0200
Matthias Gerstner <matthias.gerstner@esolutions.de> wrote:

> I've recently been experiencing trouble during my backup of OpenAFS volumes.
> I perform backups using the
> 
> 'vos dump -server <server> -partition <partition> -clone -id <vol>'

I presume <vol> is an rw volume?

Just so you know, a more common way of doing this is to use 'vos
backupsys' and then back up the .backup volumes. Nothing 'wrong' with
what you're doing, but it's a less common way.
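
As a rough sketch of that approach (the server name, partition, dump
directory, and the DRY_RUN switch below are placeholders of mine, not
anything taken from your setup), something like this refreshes the
.backup clones and then dumps one of them:

```shell
#!/bin/sh
# Sketch only: SERVER, PARTITION and DUMPDIR are hypothetical placeholders.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
SERVER=${SERVER:-fs1.example.com}
PARTITION=${PARTITION:-a}
DUMPDIR=${DUMPDIR:-/backup/afs}
DRY_RUN=${DRY_RUN:-1}

# Either print the command line or run it, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Refresh the .backup clone of every rw volume on one partition.
backup_clones() {
    run vos backupsys -server "$SERVER" -partition "$PARTITION"
}

# Dump the .backup clone of one rw volume (given by name) to a file.
dump_backup_volume() {
    run vos dump -id "$1.backup" -file "$DUMPDIR/$1.dump"
}
```

You would call backup_clones once, then dump_backup_volume for each
volume name, and only set DRY_RUN=0 after checking the printed commands.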

> However, some days ago the backup of a specific volume failed with
> a bad exit code (255). My backup script thus stopped further processing.
> The affected volume went offline as a result and only showed up in
> 'vos listvol' as "couldn't attach volume ...".

What did volserver say in VolserLog when that happened? It should give a
reason as to why it could not attach.

> After running a salvage on the affected volume it was brought back
> online but most of the contained data was deleted due to a supposed
> corruption of the directory structure detected during salvage.

SalvageLog will say specifically why. Or SalsrvLog if you are running
DAFS; are you running DAFS?

> Attached is the VolserLog from the time when the last of the incidents
> occurred.

What was the volume id for the volume in question? Possibly 536879790 or
536879793?

> I'm currently running openafs 1.6.1 on Gentoo Linux with kernel
> version 3.2.1.

1.6.1 is not a version that exists yet (or at least, certainly did not
exist on Friday). What version is the volserver, and what version is
'vos'? (Running `strings </path/to/bin> | grep built` is a sure way to
tell.)
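
If it helps, that check can be wrapped in a small script so you can run
it over both binaries at once. This is only a sketch; the function name
is mine, and the binary paths are whatever you pass on the command
line, since install locations vary between distributions:

```shell
#!/bin/sh
# Print the build string(s) embedded in one binary.
built_string() {
    strings "$1" | grep built
}

# Check every path given on the command line.
for bin in "$@"; do
    if [ -r "$bin" ]; then
        printf '%s:\n' "$bin"
        built_string "$bin"
    else
        printf '%s: not readable\n' "$bin"
    fi
done
```

Run it with the full paths to your volserver and vos binaries as
arguments and compare the two version strings it prints.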

> Fri Mar 23 00:10:57 2012 1 Volser: Clone: Cloning volume 536879790 to new volume 536889517
> Fri Mar 23 00:16:04 2012 1 Volser: Delete: volume 536889517 deleted 
> Fri Mar 23 00:16:04 2012 1 Volser: Clone: Cloning volume 536879793 to new volume 536889518
> Fri Mar 23 00:16:06 2012 VDestroyVolumeDiskHeader: Couldn't unlink disk header, error = 2
> Fri Mar 23 00:16:06 2012 VPurgeVolume: Error -1 when destroying volume 536889517 header
> Fri Mar 23 00:16:06 2012 1 Volser: Delete: volume 536889517 deleted 
> Fri Mar 23 00:16:09 2012 1 Volser: Delete: volume 536889518 deleted 
> Fri Mar 23 00:16:09 2012 VDestroyVolumeDiskHeader: Couldn't unlink disk header, error = 2
> Fri Mar 23 00:16:09 2012 VPurgeVolume: Error -1 when destroying volume 536889518 header
> Fri Mar 23 00:16:09 2012 1 Volser: Delete: volume 536889518 deleted 
> Fri Mar 23 00:21:20 2012 trans 69 on volume 536889518 is older than 300 seconds
> Fri Mar 23 00:21:20 2012 trans 66 on volume 536889517 is older than 300 seconds

Hmm, are you sure 'vos dump' is the only thing you are running at the
time? (You're running more than one in parallel... how many do you run
at once?) This sequence of operations does not seem normal for just a
'vos dump'.

-- 
Andrew Deason
adeason@sinenomine.net