[OpenAFS] salvage removed .6M files!

Mike Polek mike@pictage.com
Mon, 01 Aug 2005 19:10:46 -0700


Hi, Steve,
   Did you run fsck on the hard drive (or whatever the
Sun equivalent is these days)? I had a similar problem
recently where a system lost power. It started up ok
and recovered using the ext3 journal, but my data was missing
after the salvage. After a few salvage attempts, my data
was still missing. I stopped the AFS fileserver, unmounted
the partitions, used fsck to check them all manually, and
sure enough the partition that was flaking had errors.
Once I cleaned that up, I salvaged again, and voila!...
my data reappeared.
   I recommend checking the underlying filesystem for
errors. It may be too late if you've already started
restoring data to the partition, but it may still be
useful for future reference.
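
For what it's worth, the sequence I described can be sketched as a
dry-run plan. This is only an illustration, not a tested recovery tool:
the server name, the mount-point-to-device mapping, the use of "fsck -f"
(which suits ext3 on Linux), and the reliance on /etc/fstab for remounting
are all assumptions of mine. On an inode-based fileserver the vendor fsck
may not be safe for /vicep partitions, so check the OpenAFS documentation
before adapting it.

```python
def recovery_plan(server, partitions):
    """Build the command sequence as argv lists; nothing is executed.

    server     -- fileserver host name (hypothetical value in the example)
    partitions -- dict mapping mount point (e.g. "/vicepa") to block device
    """
    # Stop the fileserver first so nothing touches the partitions.
    plan = [["bos", "shutdown", "-server", server, "-instance", "fs", "-wait"]]
    for mount_point, device in partitions.items():
        plan.append(["umount", mount_point])
        # Assumption: -f forces a full check even if the journal looks clean.
        plan.append(["fsck", "-f", device])
        # Assumption: the mount point is listed in /etc/fstab.
        plan.append(["mount", mount_point])
    # Salvage only after the underlying filesystem is known to be clean.
    for mount_point in partitions:
        plan.append(["bos", "salvage", "-server", server,
                     "-partition", mount_point])
    plan.append(["bos", "startup", "-server", server, "-instance", "fs"])
    return plan


if __name__ == "__main__":
    # Hypothetical host and device names, for illustration only.
    for cmd in recovery_plan("fs1.example.com", {"/vicepa": "/dev/sdb1"}):
        print(" ".join(cmd))
```

The plan is printed rather than run on purpose: each step should be
checked by hand before moving on to the next.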

OS: RedHat 9
Kernel: 2.4.30
AFS: 1.2.13

Mike Polek
Pictage, Inc.


  > ---- Original Message ----
  > From: rader

More information, fwiw...

  - SalvageLog.old indicates (the initial) salvaging started
    at 01:07:43

  - BosLog indicates that this salvage exited with signal 15
    (SIGTERM) at 05:00:38

  - SalvageLog indicates another salvage--the one that went
    awry--started at 05:00:38 and completed 06:44:41

  - bos getrestart reports the server should restart for
    new binaries at "5:00 am"

Is it possible that the "restart for new binaries" fired erroneously
and sent SIGTERM to the running bos salvage, leaving the volume in an
inconsistent state that caused the subsequent salvage to blow
chunks??  (I'm under the general impression that interrupting
salvages is a bad idea.)
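
A quick check of the arithmetic behind that suspicion, as a small sketch.
It assumes all four timestamps from the logs above fall on the same day;
the helper name is mine.

```python
from datetime import datetime


def elapsed(start, end, fmt="%H:%M:%S"):
    """Return the time between two same-day HH:MM:SS timestamps."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# First salvage: started 01:07:43, killed with signal 15 at 05:00:38.
first = elapsed("01:07:43", "05:00:38")
# Second salvage: started 05:00:38, completed 06:44:41.
second = elapsed("05:00:38", "06:44:41")

print(first)   # 3:52:55
print(second)  # 1:44:03
```

So the first salvage had been running for almost four hours when it was
killed, 38 seconds after the 5:00 am restart time, and the second one
finished in under half that time.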

At any rate, I've turned off the "restarts for new binaries at
5:00 am" thing.

steve
- - -
systems & network manager
high energy physics
university of wisconsin

  > ---- Original Message ----
  > From: rader
  >
  > One of our servers (Solaris 7 inode fileserver running 1.2.11) lost
  > power this morning and the resulting bos salvage on a large (50 GB)
  > volume removed about 600,000 files....  /usr/afs/logs/SalvageLog
  > reads, for example...
  >
  >  07/29/2005 06:19:26 dir vnode 87953: invalid entry: \
  >    ./cmsprod/cern/setup.sh (vnode 2258102, unique 14499243)
  >  07/29/2005 06:19:26 dir vnode 87953: ./cmsprod/cern/setup.sh \
  >    (vnode 2258102): unique changed from 14499243 to 0 -- deleted
  >
  > Does anybody have any suggestions about how to recover the lost
  > files??  (I'm restoring from tape now, but I'll still have the
  > busted volume around when I'm done.)
  >
  > steve
  > - - -
  > systems & network manager
  > high energy physics
  > university of wisconsin
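
To get a list of what the salvager removed, the "-- deleted" lines in
SalvageLog can be pulled out with a short script. This is a rough sketch
based only on the two log lines quoted above: the regular expression is
inferred from that sample, and the backslash line wraps are assumed to be
an artifact of the email rather than part of the real log, so the pattern
may need adjusting for other salvager versions.

```python
import re

# Pattern inferred from the single "-- deleted" line quoted in this thread.
DELETED = re.compile(
    r"dir vnode (?P<dir_vnode>\d+): (?P<path>\S+) "
    r"\(vnode (?P<vnode>\d+)\): unique changed from "
    r"(?P<old_unique>\d+) to (?P<new_unique>\d+) -- deleted"
)


def deleted_entries(log_text):
    """Return one dict per directory entry the salvager reported deleting."""
    # Undo backslash-newline wraps such as those in the quoted sample.
    joined = re.sub(r"\\\s*\n\s*", "", log_text)
    return [m.groupdict() for m in DELETED.finditer(joined)]


SAMPLE = r"""07/29/2005 06:19:26 dir vnode 87953: invalid entry: \
  ./cmsprod/cern/setup.sh (vnode 2258102, unique 14499243)
07/29/2005 06:19:26 dir vnode 87953: ./cmsprod/cern/setup.sh \
  (vnode 2258102): unique changed from 14499243 to 0 -- deleted
"""

for entry in deleted_entries(SAMPLE):
    print(entry["path"], entry["vnode"])
```

The resulting paths could then be compared against what comes back from
tape to confirm nothing is still missing.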