[OpenAFS] Re: Salvaging user volumes

Thu, 13 Jun 2013 22:56:44 -0400

On 6/13/13 12:33 PM, Andrew Deason wrote:
 > On Wed, 12 Jun 2013 21:24:49 -0400
 > Garance A Drosihn<drosih@rpi.edu>  wrote:
 >
 >> On May 29th, "something happened" to two AFS volumes which are
 >> both on the same vice partition.  The volumes had been mounted
 >>  fine, but suddenly they could not be attached.
 >
 > I'm not quite sure what you mean by this; there isn't really any
 > "mounting" operation for volumes on the server side; the volume
 > either attaches or it fails to attach. We do mount the /vicep*
 > partitions containing volumes, though, of course.

A bad choice of words on my part.  I meant 'attached' not 'mounted'.
We last restarted these fileservers back in January, and these
volumes attached fine at that time.  And we've been recreating
the backup volumes three times a week ever since then, so I assume
the volumes remained attached and not corrupted.

There's also a report which runs every day to determine how much
disk space people are using (so we can charge them for the space).
One of those volumes appeared on all reports up to May 28th, and
was gone on May 29th.

 >> This happened at around 9:30pm, and as far as I know nothing
 >> interesting was happening at that time.

It turns out that the daily report runs between 9pm and 10pm, so that
report is almost certainly what triggered the error messages.

 > How are you determining these times? From this description, maybe
 > it sounds like the problems with the backup run alerted you to a
 > problem, and you looked in FileLog/VolserLog/etc, and saw errors
 > around those times. Is that what happened?

The first errors I noticed in any logs were these in FileLog:

Wed May 29 21:27:53 2013 Volume 537480983: couldn't reread volume header
Wed May 29 21:27:53 2013 VAttachVolume: Error reading diskDataHandle \
                          vol header /vicepb//V0537480983.vol; error=101

Wed May 29 21:27:02 2013 Volume 537444438: couldn't reread volume header
Wed May 29 21:27:02 2013 VAttachVolume: Error reading diskDataHandle \
                          vol header /vicepb//V0537444438.vol; error=101

And I knew our backups run at 5am.  So I assumed something else
must have happened at 9:30pm.  But now I see that's just when we
first ran into the problem due to our own procedure.
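
For anyone who wants to pull the same information out of a larger
FileLog, here is a minimal sketch of how the "couldn't reread volume
header" entries quoted above could be collected and ordered by time.
The regular expression and the function name find_header_errors are
my own assumptions, based only on the two log lines shown here; other
OpenAFS versions may format these lines differently, and the
timestamp parsing assumes English day and month names.

```python
import re
from datetime import datetime

# Assumed line shape, inferred only from the two FileLog entries quoted above.
HEADER_ERR = re.compile(
    r"^(?P<ts>\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}) "
    r"Volume (?P<vol>\d+): couldn't reread volume header"
)


def find_header_errors(lines):
    """Return (timestamp, volume_id) pairs for header reread failures,
    sorted by timestamp."""
    hits = []
    for line in lines:
        m = HEADER_ERR.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
            hits.append((ts, int(m.group("vol"))))
    return sorted(hits)


# The two entries from the FileLog excerpt above.
sample = [
    "Wed May 29 21:27:53 2013 Volume 537480983: couldn't reread volume header",
    "Wed May 29 21:27:02 2013 Volume 537444438: couldn't reread volume header",
]

for ts, vol in find_header_errors(sample):
    print(ts.isoformat(), vol)
```

Comparing the earliest timestamp against the schedule of local jobs
(the daily usage report, in our case) shows which procedure first
touched the damaged volumes.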

 >> So, before I get myself into too much trouble, what's the prudent
 >> thing to do here?  Should I just redo the salvage, with '-oktozap'?

 > So, I assume you want to remove those volumes, and 'vos restore'
 > them from a previous volume dump. You can try to 'vos zap' the
 > volumes, to just remove them from disk. If that complains about
 > the volume needing salvage or whatnot, you can try to force it
 > with 'vos zap -force'. If that fails to remove the volume (it
 > shouldn't, but I'm not sure about older versions...), you may
 > need to directly tinker with the vicepb contents. But, we can
 > deal with that just as it becomes necessary.

A plain 'vos zap' complained that the volumes needed to be
salvaged.  Adding -force resulted in:

[root]#  vos zap -server afsfs14 -partition vicepb \
                  -id 537444436 -backup -localauth -force
vos: forcibly removing all traces of volume 537444436, \
      please wait...failed with code 2.

[root]#  vos zap -server afsfs14 -partition vicepb \
                  -id 537480981 -backup -localauth -force
vos: forcibly removing all traces of volume 537480981, \
      please wait...failed with code 30.

I should note that we don't care at all about the contents of
these volumes.  I just want to make sure I don't trigger
damage to *other* volumes while trying to fix this.  And the
guy who is responsible for backups is anxious about this, as
apparently these damaged volumes cause the backup to hang in
some way.  I had intended to take this week as vacation, but
Murphy's law seems determined to prevent that!

At this point I'm tempted to try 'salvager -oktozap' based on the
documentation for it, but I'll wait to hear if that's the right
thing to do in this situation.

I should also note that this week we did finish a full backup of
the entire AFS cell, except for these two volumes.  So it should
be true that everything else is reasonably okay.  I hope to keep
it that way!

-- 
Garance Alistair Drosehn                =     drosih@rpi.edu
Senior Systems Programmer               or   gad@FreeBSD.org
Rensselaer Polytechnic Institute;             Troy, NY;  USA