[OpenAFS] Salvaging user volumes

Garance A Drosihn drosih@rpi.edu
Wed, 12 Jun 2013 21:24:49 -0400


Hi.

An odd situation has come up in our AFS cell, and I'm not sure
what I need to do to correct it.

On May 29th, "something happened" to two AFS volumes which are both
on the same vice partition.  The volumes had been attaching fine, but
suddenly they could not be attached.  This happened at around 9:30pm,
and as far as I know nothing interesting was going on at that time.
We found out about the problem the next time we went to run backups,
because the volumes created by 'vos backup' were also corrupt, and
that caused our backup process to hang.  (I am not responsible for
our AFS backups, so that's about all I know about that part).

Note that we have not restarted the file server process since
January, so I'm rather disturbed that these volumes suddenly
showed up as corrupted or unmountable.  From what I can tell,
neither of these volumes has been touched in years (though I
suppose they could have been modified on the day the problem
came up, I doubt it).  I've run 'vos listvol', and these are
the only volumes in the cell listed as "Could not attach volume".
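(For the record, the check was along these lines; the server name
below is a placeholder, not our actual file server:)

```shell
# List every volume on the file server and pick out any the
# file server could not attach.  Server name is hypothetical.
vos listvol fileserver.example.edu | grep "Could not attach"
```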

Depending on what AFS command I run, I'm told that the volume
cannot be mounted, or it is corrupt, or that it needs salvaging.

So I ran the salvager on each specific volume.  On the first volume,
the salvager printed a fair amount of detail about what it was doing.  My
email client has a nasty habit of totally reformatting useful info,
so here's a web page with the command output for the first volume:

http://homepages.rpi.edu/~drosehn/Temp/AfsSalvage-2013June-Vol1.txt

When I tried to do the same steps for the second volume, I got some
noticeably different results:

http://homepages.rpi.edu/~drosehn/Temp/AfsSalvage-2013June-Vol2.txt

This vice partition is on some SAN-attached storage, and of course
this happened while the experienced guy responsible for that was
on vacation.  But when he got back I asked him about the volume,
and he could find no errors connected to it.  I also found no
disk-hardware errors in the system-logs on the AFS file server.

So, before I get myself into too much trouble, what's the prudent
thing to do here?  Should I just redo the salvage, with '-oktozap'?
Or is there something else I should be doing?  And am I right in
thinking that volumes shouldn't just show up as being corrupt like
this?  Should I be looking harder for some kind of hardware problem?
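(In case it's useful context: the per-volume salvages above were
invocations roughly like the following, and the '-oktozap' variant is
the one I'm hesitant about, since as I understand it that flag lets the
salvager delete a volume it decides it cannot repair.  Server,
partition, and volume names here are placeholders:)

```shell
# Placeholder server/partition/volume names throughout.
# Per-volume salvage via the bosserver (what I ran originally):
bos salvage -server fileserver.example.edu -partition /vicepa \
    -volume 536871234 -showlog -localauth

# The variant I'm asking about: -oktozap authorizes the salvager
# to remove an unrepairable volume, so the data had better be
# restorable from backup dumps first.
bos salvage -server fileserver.example.edu -partition /vicepa \
    -volume 536871234 -oktozap -localauth
```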

(aside: this is the first time I've had to salvage any AFS volumes
in the few years that I've been responsible for our AFS cell, and
I can't remember any time in the last 12 years that a volume has
shown up in this state.)

-- 
Garance Alistair Drosehn                =     drosih@rpi.edu
Senior Systems Programmer               or   gad@FreeBSD.org
Rensselaer Polytechnic Institute;             Troy, NY;  USA