[OpenAFS] Re: Salvaging user volumes

Andrew Deason adeason@sinenomine.net
Thu, 13 Jun 2013 11:33:25 -0500


On Wed, 12 Jun 2013 21:24:49 -0400
Garance A Drosihn <drosih@rpi.edu> wrote:

> On May 29th, "something happened" to two AFS volumes which are both
> on the same vice partition.  The volumes had been mounted fine, but
> suddenly they could not be attached.

I'm not quite sure what you mean by this; there isn't really any
"mounting" operation for volumes on the server side; the volume either
attaches or it fails to attach. We do mount the /vicep* partitions
containing volumes, though, of course.

> This happened at around 9:30pm, and as far as I know nothing
> interesting was happening at that time.  We found out about the
> problem the next time we went to run backups, because the volumes
> created by 'vos backup' were also corrupt, and that caused our backup
> process to hang.  (I am not responsible for our AFS backups, so that's
> about all I know about that part).

How are you determining these times? From this description, it sounds
like the problems with the backup run alerted you that something was
wrong, and you then looked in FileLog/VolserLog/etc. and saw errors
around those times. Is that what happened?
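
If it helps to pin the times down, here is a rough sketch of pulling
every log line that mentions the two volume IDs, so the earliest
complaint is easy to spot. The /usr/afs/logs directory and the
SalvageLog name are assumptions on my part (adjust for your install),
and the helper names are made up for illustration. It does a plain
substring match, so it can also pick up longer numbers that happen to
contain one of the IDs.

```python
from pathlib import Path


def lines_mentioning(log_text, volume_ids):
    """Return the log lines that contain any of the given volume IDs."""
    wanted = [str(v) for v in volume_ids]
    return [line for line in log_text.splitlines()
            if any(v in line for v in wanted)]


def scan_logs(log_dir, names, volume_ids):
    """Map each log file that exists to its lines mentioning the volumes."""
    hits = {}
    for name in names:
        path = Path(log_dir) / name
        if path.is_file():
            text = path.read_text(errors="replace")
            hits[name] = lines_mentioning(text, volume_ids)
    return hits


if __name__ == "__main__":
    # Assumed log directory and file names; change these for your site.
    found = scan_logs("/usr/afs/logs",
                      ["FileLog", "VolserLog", "SalvageLog"],
                      [537480981, 537444436])
    for name, lines in found.items():
        for line in lines:
            print(f"{name}: {line}")
```

The first hit in each log gives a lower bound on when the server
started complaining about those volumes.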

> So, before I get myself into too much trouble, what's the prudent
> thing to do here?  Should I just redo the salvage, with '-oktozap'?
> Or is there something else I should be doing?  And am I right in
> thinking that volumes shouldn't just show up as being corrupt like
> this?  Should I be looking harder for some kind of hardware problem?

Volumes shouldn't just show up as corrupt like that, yes. However, as
has been mentioned, 1.4.6 is pretty old; it contains known issues that
cause data corruption, and it has a few security issues. So, while
volumes shouldn't just show up as corrupt, I wouldn't find it terribly
surprising with a fileserver running that version.

As for what to do to get the volumes usable again, the most surefire way
is to restore them from backups. The volume 537480981 looks like it may
be completely gone; it's possible there is some dangling data hanging
around, but it looks like there's nothing left. Volume 537444436 seems
even more likely to be completely gone; the salvager cannot find any
data on that partition for that volume.

So, I assume you would want to remove those volumes, and 'vos restore'
them from a previous volume dump. You can try to 'vos zap' the volumes,
to just remove them from disk. If that complains about the volume
needing salvage or whatnot, you can try to force it with 'vos zap
-force'. If that fails to remove the volume (it shouldn't, but I'm not
sure about older versions...), you may need to work directly with the
vicepb contents. But we can deal with that if it becomes necessary.
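
To make the order of operations concrete, here is a small dry-run
sketch that only prints the commands in sequence; it does not run
anything. The server name, volume name, and dump file path are
placeholders I made up, so substitute your own. The forced zap is
listed as a fallback and is only needed if the plain zap refuses.

```python
import shlex


def zap_command(server, partition, volume_id, force=False):
    """Build a 'vos zap' command line as an argument list."""
    cmd = ["vos", "zap", "-server", server,
           "-partition", partition, "-id", str(volume_id)]
    if force:
        cmd.append("-force")
    return cmd


def restore_command(server, partition, name, dump_file):
    """Build a 'vos restore' command line reading from a dump file."""
    return ["vos", "restore", "-server", server,
            "-partition", partition, "-name", name, "-file", dump_file]


def recovery_plan(server, partition, volume_id, name, dump_file):
    """List the steps in order: zap, forced zap as fallback, then restore."""
    return [
        ("zap", zap_command(server, partition, volume_id)),
        ("zap, only if the first one refuses",
         zap_command(server, partition, volume_id, force=True)),
        ("restore", restore_command(server, partition, name, dump_file)),
    ]


if __name__ == "__main__":
    # Hypothetical server, volume name, and dump path, for illustration.
    plan = recovery_plan("fs1.example.com", "vicepb", 537480981,
                         "user.example", "/backups/user.example.dump")
    for label, cmd in plan:
        print(f"# {label}")
        print(shlex.join(cmd))
```

Printing the commands first and running them by hand seems safer here
than scripting the whole thing, since each step depends on how the
previous one went.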

-- 
Andrew Deason
adeason@sinenomine.net