[OpenAFS] Stupid SysAdmin tricks with backup and restore

Joseph H Vilas jhv@oit.duke.edu
Tue, 29 Apr 2003 16:47:20 -0400


Background: I do the backups for our AFS cell.  User volumes are named
users.LOGIN, and backup volumes are mounted online and called
users.LOGIN.backup.  If I want to restore a user volume and don't want
to trounce the existing volume, I restore it to users.LOGIN.restored,
mount it in a canonical place, then do what's needed with the data
within.  Pretty straightforward.  We're using both OpenAFS and
Transarc's AFS in different places, and mostly running Solaris 2.8.
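
(The "mount it in a canonical place" part is nothing fancy, just an fs
mkmount; the path here is made up, but it's roughly

        fs mkmount -dir /afs/.OURCELL/restores/LOGIN -vol users.LOGIN.restored

followed by an fs rmmount of the same directory once I'm done picking
through it.)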

The stupid part:  When I restore to a .restored volume, I type
something like

        backup volrestore -server SOMESERVER -partition SOMEPARTITION
        -volume users.LOGIN -extension .restored -date 12/34/5678
        -portoffset ##

with reasonable parameterizations.  About a year ago I accidentally
typed

        backup volrestore -server SOMESERVER -partition SOMEPARTITION
        -volume users.LOGIN -extension .backup -date 12/34/5678
        -portoffset ##

You'd think that trying to restore data to a backup volume would just
fail, as the backup volume is readonly.  

That's not what happens.  Not at all.  What does happen is bad.  

The active users.LOGIN volume disappears.  A new users.LOGIN.backup
gets created, with the data from the tapes you're using.  In other
words, you just blew away some poor defenseless user's home
directory.  

So for some reason I've now done this about 3 or 4 times.  I've only
been doing this for 8 years, so you'd think I'd know how to do it
right, but whatever.  I did it again today.  

The trick: I remembered which server originally held the user's volume.
(If I hadn't been able to remember, I'd have rooted around in
/usr/afs/logs on our fileservers and tried to find the right server.)
I went into the vice partition that originally held the user's volume
and rooted around for volume ID numbers that looked like they might be
related to the volume in question.  I reasoned that since the volume
died in an abnormal fashion, there might be spoor around.
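
(In case anyone wants to play along at home, the rooting around was
nothing more exotic than something like this -- server, partition, and
ID are placeholders, and your log file names may vary:

        grep users.LOGIN /usr/afs/logs/VolserLog
        ls /vicepa/V*.vol
        vos listvol -server SOMESERVER -partition /vicepa

The volume headers show up as V<number>.vol files at the top of the
vice partition, and vos listvol asks the volserver directly what it
has on that partition, VLDB or no VLDB.)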

It turns out I was right.  For whatever reason, there was a volume
there with an ID awfully close to the original volume's.  These
aren't the actual numbers, but if the original volume was numbered
500000000, and the original backup volume was numbered 500000002, this
volume was numbered 500000004.  It's as if the volume were the backup
volume of a backup volume, if you will.  The cell's volserver didn't
have a clue about this volume in terms of a name, but knew about the
suspect volume ID.  I then did a vos dump of the suspect volume --
something like

        vos dump -id 500000004 -time 0 -file /tmp/LOGIN -server
        localhost -partition whatever

again with correct parameterization.  I then did

        vos restore -server SERVERNAME -partition SOMEPARTITION -name
        users.LOGIN -file /tmp/LOGIN

I could have done these both on one line with a pipe, but since I was
pretty far out on a limb as it was, I wanted to take tiny baby steps.
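
(For the brave: if I'm remembering the man pages right, vos dump
writes to stdout when you leave off -file, and vos restore reads from
stdin when you leave off its -file, so the one-liner would have looked
something like

        vos dump -id 500000004 -time 0 -server localhost -partition whatever |
            vos restore -server SERVERNAME -partition SOMEPARTITION -name users.LOGIN

but I wasn't feeling that lucky.)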

When I was done, I had successfully restored the user's volume from
this aberrant volume.  I don't know whether that volume was created
the last time backup was run, but I suspect from the numbering that it
was created as part of the errant process that destroyed the original
volume.  Either way, it was the best that could be done toward
restoring the state of the volume.  At least the date stamps in the
directories look good.  Interestingly, the date stamp on the aberrant
volume itself was from 1999.
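
(One more bit of cleanup I'd guess you want, though I haven't gone
back and checked exactly what the botched volrestore leaves behind:
the users.LOGIN.backup volume at this point presumably still holds the
old tape data, so re-cloning it from the freshly restored volume with

        vos backup -id users.LOGIN

seems like the prudent thing to do before the nightly backups come
around again.)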

OK, it was a stupid mistake.  But given the stupid mistake, it was a
pretty good way to recover.  And I've darn sure never seen this done
before, so I thought I'd share.  I don't think you'll find this
recovery method in any documentation anywhere.  :)

Joe

--
Joseph H Vilas  jhv@duke.edu  1-919-660-6902
Box 90132, Duke University, Durham, NC 27708-0132
...for me and my short attention span....