[OpenAFS] Salvager did not run automatically on solaris 9, 1.4.1-rc10

Jeffrey Hutzelman jhutz@cmu.edu
Thu, 13 Apr 2006 17:49:50 -0400


On Thursday, April 13, 2006 09:41:49 AM -0700 Renata Maria Dart 
<renata@slac.stanford.edu> wrote:

> Hi, we recently had a fileserver crash because of an ecache error.
> When the server came back up it had the further misfortune of a fibre
> channel adapter error which prevented the drives containing the vice
> partitions from coming back online.  Once those issues were dealt
> with, the system was again rebooted and came up with its vice
> partitions but did not salvage on its own...we had to run bos salvage
> manually to bring the volumes online.  This is a solaris 9 system
> running openafs 1.4.1-rc10.  There are 2 partitions on it and the fs
> process specifies 2 parallel salvage processes.  Unfortunately I was
> not there to see all the details when the system came back online and
> the admin who restored the system ran separate salvager commands for
> the 3 200gb volumes that live on the system and didn't preserve the
> original salvage logs.  Is it to be expected that salvager won't run
> automatically after such a sequence of events?  Another couple of
> pieces of information...I recently converted this system from inode to
> namei, it does not have 'enable-fast-restart' configured into it, and
> here are the entries from BosLog:

Under the circumstances you describe, yes, this is normal.

The bosserver forces a whole-server salvage any time the fileserver exits 
abnormally, or on startup if the fileserver was not shut down cleanly 
(there is a file in /usr/afs/local or wherever indicating the fileserver 
is "running"; if that file is present when the bosserver starts, it 
assumes an unclean shutdown).
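
In pseudo-C, the idea is roughly this.  It's a simplified sketch, not the 
actual bozo code, and the sentinel file name is made up for illustration:

    #include <stdio.h>
    #include <unistd.h>

    /* Illustrative name only; the real bosserver keeps its own
     * per-instance file under /usr/afs/local. */
    #define SENTINEL "/usr/afs/local/fileserver.running"

    /* At bosserver startup: if the sentinel survived the last run,
     * the fileserver was not shut down cleanly. */
    static int unclean_shutdown(void)
    {
        return access(SENTINEL, F_OK) == 0;
    }

    /* Create the sentinel when the fileserver starts... */
    static void mark_running(void)
    {
        FILE *f = fopen(SENTINEL, "w");
        if (f)
            fclose(f);
    }

    /* ...and remove it only on a clean shutdown. */
    static void mark_clean(void)
    {
        unlink(SENTINEL);
    }

    int main(void)
    {
        if (unclean_shutdown())
            printf("unclean shutdown detected: forcing salvage\n");
        mark_running();
        /* ... fileserver runs here ... */
        mark_clean();
        return 0;
    }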

On an inode-based server, fsck sets a flag if it makes any changes to a 
partition.  When the fileserver starts up, if it sees this flag on any 
partition, it immediately exits with an error, which causes the bosserver 
to force a salvage.  Since the needs-salvage flag is stored on the 
partition itself, it is set only when the partition is actually fsck'd and 
cleared only when it is salvaged.  The flag travels with the disk if you 
move it, and doesn't get touched if the disk is "missing", as it was 
during your fibre channel problem.
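
The startup check amounts to something like the following.  Again a 
sketch, not the real volume-package code; I believe the marker file 
OpenAFS looks for is named FORCESALVAGE in the partition root, but the 
surrounding logic here is simplified:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* If fsck flagged the partition as changed, exit with an error
     * so the bosserver notices and schedules a salvage. */
    static void check_partition(const char *part)  /* e.g. "/vicepa" */
    {
        char marker[256];
        snprintf(marker, sizeof(marker), "%s/FORCESALVAGE", part);
        if (access(marker, F_OK) == 0) {
            fprintf(stderr, "%s needs salvage; exiting\n", part);
            exit(1);
        }
    }

    int main(void)
    {
        check_partition("/vicepa");
        printf("/vicepa looks clean\n");
        return 0;
    }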

However, you switched to namei, which doesn't have that feature (and can't, 
since it doesn't use a modified fsck).  So on the first start after the 
crash, the bosserver forced a salvage, but there were no partitions to 
salvage, so nothing interesting happened.  Then the fileserver started up, 
someone noticed there were no partitions, and rebooted.  That reboot 
involved a clean shutdown of the fileserver, which meant there was no 
forced salvage on the next boot.
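
If you ever need to force a salvage by hand, "bos salvage" will do it; 
for example (fs1.example.com is a placeholder for your server):

    # salvage every volume on every partition of the server
    bos salvage -server fs1.example.com -all

    # or just one partition
    bos salvage -server fs1.example.com -partition /vicepa

On a namei server you can also, if I remember right, create a file named 
FORCESALVAGE at the top of a /vicep partition before restarting the 
fileserver; it will then exit with an error and the bosserver will run 
the salvager for you.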


I have to admit I'm a little curious why you switched from inode to namei 
on a Solaris server...

-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
   Sr. Research Systems Programmer
   School of Computer Science - Research Computing Facility
   Carnegie Mellon University - Pittsburgh, PA