[OpenAFS] Re: 'vos dump' destroys volumes?

Kim Kimball dhk@ccre.com
Mon, 26 Mar 2012 13:17:14 -0600


Dumping the RW volume makes it "busy" for the duration of the dump. That makes the
volume unwritable, and any client that tries to write to it logs "afs: Waiting for
busy volume".

Dumping the .backup is not just a good practice, in my opinion. It is
the only sensible practice if keeping the volume writable is important. Large
volumes can take a long time to dump, and the RW volume stays busy the whole time.
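Here is a minimal sketch of the .backup approach, written in Python only for illustration. It builds the `vos backup` and `vos dump` command lines and, by default, just prints them (a dry run). The volume name and dump path are hypothetical placeholders. Check the flags shown (`-id`, `-time 0`, `-file`) against the vos man pages for your release.

```python
import shlex
import subprocess


def backup_then_dump_cmds(volume, dump_file):
    """Return the vos commands that refresh <volume>.backup and then dump it.

    `volume` (an RW volume name) and `dump_file` are hypothetical placeholders.
    """
    return [
        # Refresh the .backup clone; the RW volume is only busy for the clone.
        ["vos", "backup", "-id", volume],
        # Dump the .backup volume (-time 0 = full dump), leaving the RW writable.
        ["vos", "dump", "-id", volume + ".backup", "-time", "0", "-file", dump_file],
    ]


def run(cmds, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    for cmd in cmds:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(backup_then_dump_cmds("user.example", "/backup/user.example.dump"))
```

For many volumes, `vos backupsys` refreshes the .backup volumes in bulk, so only the dump step needs to run per volume.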

Identifying the software version that is actually running is better done with
"rxdebug". It's a nit, but the binaries on disk are not guaranteed to be the
same as what's running, so the "strings | grep" approach only tells you which
version the binary on disk is, not which version the running process is.
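This small Python sketch shows the difference between the two checks. `running_version_cmd` builds an `rxdebug <server> 7005 -version` command line, which asks the running volserver (default port 7005) for its version over Rx. `on_disk_built_strings` is a rough re-implementation of `strings <binary> | grep built`, and it only describes the file on disk. The server name and sample bytes are hypothetical.

```python
import re

VOLSERVER_PORT = "7005"  # default volserver port (the fileserver uses 7000)


def running_version_cmd(server):
    """Command line that queries the *running* volserver on `server` for its version."""
    return ["rxdebug", server, VOLSERVER_PORT, "-version"]


def on_disk_built_strings(data, min_len=4):
    """Rough equivalent of `strings <binary> | grep built` on raw bytes.

    Finds runs of printable ASCII at least `min_len` long and keeps those
    containing "built". This says nothing about what is currently running.
    """
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    return [r.decode("ascii") for r in runs if b"built" in r]


if __name__ == "__main__":
    print(" ".join(running_version_cmd("afs1.example.com")))
```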

It does look like more than one operation was in progress. A volume
delete isn't part of a volume dump.
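To check a VolserLog for this pattern, here is a Python sketch that flags volumes logged as deleted more than once and transactions reported as stale. The regular expressions are based only on the message wording in the excerpt quoted in this thread. Other volserver versions may word these messages differently, so treat the patterns as assumptions.

```python
import re
from collections import Counter

# Patterns taken from the VolserLog excerpt in this thread (assumed wording).
DELETE_RE = re.compile(r"Volser: Delete: volume (\d+) deleted")
STALE_RE = re.compile(r"trans (\d+) on volume (\d+) is older than (\d+) seconds")


def suspicious(lines):
    """Return (volumes deleted more than once, volumes with stale transactions)."""
    deletes = Counter()
    stale = []
    for line in lines:
        m = DELETE_RE.search(line)
        if m:
            deletes[m.group(1)] += 1
        m = STALE_RE.search(line)
        if m:
            stale.append(m.group(2))  # the volume id, in log order
    repeated = sorted(vol for vol, n in deletes.items() if n > 1)
    return repeated, stale
```

Run against the excerpt below, both temporary clones (536889517 and 536889518) show up as deleted twice and as holding stale transactions. That is consistent with more than one operation touching them.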
Kim


On 3/26/2012 11:38 AM, Andrew Deason wrote:
> On Mon, 26 Mar 2012 17:25:04 +0200
> Matthias Gerstner <matthias.gerstner@esolutions.de> wrote:
>
>> I've recently been experiencing trouble with my backups of OpenAFS volumes.
>> I perform backups using the
>>
>> 'vos dump -server <server> -partition <partition> -clone -id <vol>'
> <vol> I presume is an rw volume?
>
> Just so you know, a more common way of doing this is to use 'vos
> backupsys' and then backup the .backup volumes. Nothing 'wrong' with
> what you're doing, but it's a less common way.
>
>> However, some days ago the backup of a specific volume failed with
>> a non-zero exit code (255), so my backup script stopped further processing.
>> The affected volume went offline as a result and only showed up in
>> 'vos listvol' as "couldn't attach volume ...".
> What did volserver say in VolserLog when that happened? It should give a
> reason as to why it could not attach.
>
>> After running a salvage on the affected volume, it was brought back
>> online, but most of the contained data was deleted due to a supposed
>> corruption of the directory structure detected during the salvage.
> SalvageLog will say specifically why. Or SalsrvLog if you are running
> DAFS; are you running DAFS?
>
>> Attached is the VolserLog from the time when the last of the incidents
>> occurred.
> What was the volume id for the volume in question? Possibly 536879790 or
> 536879793?
>
>> I'm currently running openafs 1.6.1 on Gentoo Linux with kernel
>> version 3.2.1.
> 1.6.1 is not a version that exists yet (or at least, certainly did not
> exist on Friday). What version is the volserver, and what version is
> 'vos'? (Running `strings </path/to/bin> | grep built` is a sure way to
> tell.)
>
>> Fri Mar 23 00:10:57 2012 1 Volser: Clone: Cloning volume 536879790 to new volume 536889517
>> Fri Mar 23 00:16:04 2012 1 Volser: Delete: volume 536889517 deleted 
>> Fri Mar 23 00:16:04 2012 1 Volser: Clone: Cloning volume 536879793 to new volume 536889518
>> Fri Mar 23 00:16:06 2012 VDestroyVolumeDiskHeader: Couldn't unlink disk header, error = 2
>> Fri Mar 23 00:16:06 2012 VPurgeVolume: Error -1 when destroying volume 536889517 header
>> Fri Mar 23 00:16:06 2012 1 Volser: Delete: volume 536889517 deleted 
>> Fri Mar 23 00:16:09 2012 1 Volser: Delete: volume 536889518 deleted 
>> Fri Mar 23 00:16:09 2012 VDestroyVolumeDiskHeader: Couldn't unlink disk header, error = 2
>> Fri Mar 23 00:16:09 2012 VPurgeVolume: Error -1 when destroying volume 536889518 header
>> Fri Mar 23 00:16:09 2012 1 Volser: Delete: volume 536889518 deleted 
>> Fri Mar 23 00:21:20 2012 trans 69 on volume 536889518 is older than 300 seconds
>> Fri Mar 23 00:21:20 2012 trans 66 on volume 536889517 is older than 300 seconds
> Hmm, are you sure 'vos dump' is the only thing you are running at the
> time? (You're running more than one in parallel... how many do you run
> at once?) This sequence of operations does not seem normal for just a
> 'vos dump'.
>