[OpenAFS] Re: 'vos dump' destroys volumes?

Andrew Deason adeason@sinenomine.net
Tue, 27 Mar 2012 10:24:57 -0500


On Tue, 27 Mar 2012 14:01:04 +0200
Matthias Gerstner <matthias.gerstner@esolutions.de> wrote:

> The situation with the salvage was as follows: The affected volume
> was pretty large, containing about 160 gigabytes of data spread
> across 3.5 million files. During the salvage I saw a *lot* of log
> lines similar to this flying by:
> 
> '??/??/SomeFile' deleted.
> 
> After half an hour of seeing this, the volume was back online with
> less than 10 gigabytes of data remaining. So I figured the top-level
> directory structure had somehow been lost. Sorry that I can't
> provide the actual log any more.

Please save the log if it happens again. A corrupt directory object by
itself will not cause its children to be deleted unless you pass
'-orphans remove' to the salvager. The default, '-orphans ignore',
keeps orphaned data around, but it is effectively invisible until you
salvage with '-orphans attach'.
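For example, to re-attach the orphans on that volume (server,
partition and volume here are placeholders for your own values):

  bos salvage <server> <partition> <volume> -orphans attach

If I remember right, attached orphans show up in the volume root under
generated names like '__ORPHANFILE__.<vnode>.<uniq>' and
'__ORPHANDIR__.<vnode>.<uniq>', so you can move them back to where
they belong afterwards.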

> Seems I forgot to mention 'pre1':
> 
> # strings /usr/sbin/vos | grep built
> @(#) OpenAFS 1.6.1pre1 built  2012-01-24
> 
> Is it too risky to use the pre-release? I got used to running the
> unstable openafs packages so that I can keep up with recent Linux
> kernel versions.

That version is known to have issues with data corruption/loss, which
are fixed in pre4. I don't know if that's what you're hitting, though.
(You can also run a newer client with older servers just fine.)

I assume the volserver is running the same version? As Kim said, you
can check with 'rxdebug <server> 7005 -version'.
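For example (fs1.example.com is just a stand-in for your server):

  rxdebug fs1.example.com 7000 -version   # fileserver
  rxdebug fs1.example.com 7005 -version   # volserver

That will tell you whether the fileserver and volserver are also
running 1.6.1pre1, independent of what 'vos' on the client reports.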

> Now that you say it, it really does look like two things are running
> in parallel. But I can't think of how that could be happening. The
> backup script is supposed to dump one volume after another in a
> serial manner. And on this specific server the backup script is the
> only administrative AFS operation that is scheduled at all. Also,
> when I disable the backup job for a night, nothing shows up in the
> log at all.

If you turn on the volser audit log with
'-auditlog /usr/afs/logs/VolserLog.audit' or similar, you can see
specifically which operations were run, when, and by whom. Or turn up
the debug level with '-d 125 -log', and you'll see a lot more
information in VolserLog, interspersed with everything else.
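In case it helps, those flags go on the volserver 'parm' line of the
fs bnode in BosConfig; a sketch, assuming a transarc-style layout (the
fileserver and salvager lines stay whatever they already are):

  bnode fs fs 1
  parm /usr/afs/bin/fileserver <existing flags>
  parm /usr/afs/bin/volserver -d 125 -log -auditlog /usr/afs/logs/VolserLog.audit
  parm /usr/afs/bin/salvager
  end

Then restart the instance with 'bos restart <server> fs' so the
volserver picks up the new options.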

> However, I'm running two pairs of file and volume servers. Each
> machine performs a backup of its own volumes, and these backups run
> in parallel. But this shouldn't affect a single machine's log.

So, you just have two completely separate servers, and each one is
running a fileserver/volserver? Yeah, that shouldn't matter.

> I'm still seeing weird behaviour during my backups. Last night, for
> example, a dump was aborted with the following error message:
> 
> 'consealed data inconsistent'

That's "sealed data inconsistent". You can get this if your tokens
expired sometime during the process (I don't remember / just don't know
what causes that vs an 'expired' message). Do you have the output of
'vos' running with '-verbose' by any chance? How long are the backups
taking, and are you running this under a tool to refresh tokens?
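If the backup script runs on the fileserver itself, you can sidestep
token expiry entirely by running 'vos' with '-localauth' (it
authenticates with the server key, so it needs root on a server
machine; volume and file names here are placeholders):

  vos dump -id some.volume.backup -file /backup/some.volume.dump \
      -verbose -localauth

Otherwise, something like k5start from the kstart package can keep
tokens fresh around a long run; roughly (keytab and script paths are
placeholders):

  # get a ticket + AFS token from the keytab, re-check every 10 min
  k5start -U -f /etc/backup.keytab -t -K 10 -- /path/to/backup-script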

> However, the original volume in question remained intact this time.
> I'm attaching the VolserLog of this incident.

Hmm, did you forget to attach this?

-- 
Andrew Deason
adeason@sinenomine.net