[OpenAFS] server process hang on salvage attempt

Joseph H Vilas <jhv@oit.duke.edu>
Fri, 02 Jul 2004 13:44:52 -0400


Attempting to salvage a volume leaves the salvager and the volserver
hung.  The fileserver continues to run, but at very heavy CPU
utilization.  The only way out of the problem seems to be a restart
of the server processes.  The machine:

   delrey.acpub.duke.edu[230] uname -a
   SunOS delrey.acpub.duke.edu 5.9 Generic_112233-11 sun4u sparc SUNW,Ultra-60
   delrey.acpub.duke.edu[231] /usr/sbin/rxdebug localhost 7000 -version
   Trying (port 7000):
   AFS version:  OpenAFS 1.2.10 built  2003-11-17 
   delrey.acpub.duke.edu[232] /usr/sbin/rxdebug localhost 7005 -version
   Trying (port 7005):
   AFS version:  OpenAFS 1.2.10 built  2003-11-17

We're using namei.  

At first I thought this was a problem with a particular volume.  That
volume definitely had a problem, but once the server instances had
been restarted, vos move was still able to get it off the server.
Then a salvage of another, normal volume on the same partition caused
the same hang.  We didn't have time in our outage window to go through
another restart cycle, so I didn't try a salvage on another partition
or of a whole partition.

The salvager is getting this far:

   @(#) OpenAFS 1.2.10 built  2003-11-17 
   06/25/2004 20:31:38 STARTING AFS SALVAGER 2.4 (/usr/libexec/openafs/salvager /vicepc 536889470)

Then it hangs.  The volserver will answer a vos status request, but
any request for real information (like vos listvol) will not return.
It looks like the fssync thread in the fileserver is just not coming
back.  In any case, running the salvager reproduced the problem every
time, in 3-4 attempts.  The machine has otherwise been operating just
fine for about 2 1/2 months.
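
The split described above (vos status answers, vos listvol never
returns) can be checked without tying up a terminal.  What follows is
only a minimal sketch, assuming a POSIX shell: probe is a hypothetical
helper, not part of OpenAFS, and the 30-second limit in the usage
comments is an arbitrary choice.

```shell
#!/bin/sh
# probe LABEL SECONDS COMMAND [ARGS...]
# Hypothetical helper (not an OpenAFS tool): run COMMAND in the
# background and report whether it came back within SECONDS.
probe() {
    label=$1; limit=$2; shift 2
    "$@" >/dev/null 2>&1 &          # discard output; only the return matters
    pid=$!
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            kill "$pid" 2>/dev/null  # give up on the stuck request
            wait "$pid" 2>/dev/null
            echo "$label: no reply after ${limit}s"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$pid"                      # collect the command's exit status
    echo "$label: returned (exit status $?)"
}

# Example use against the local volserver (30s limit is an assumption):
#   probe "vos status"  30 vos status  -server localhost
#   probe "vos listvol" 30 vos listvol -server localhost
```

A "no reply" line for vos listvol alongside a "returned" line for vos
status would match the behaviour reported here; it says nothing about
why the request is stuck.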

I have a gcore from the fileserver while it was apparently
dysfunctional.  What other information could I provide that would be
of any help?


