[OpenAFS] hung directory

Dr A V Le Blanc <LeBlanc@mcc.ac.uk>
Fri, 25 Oct 2002 17:12:49 +0100


With respect to the problems of hangs, we've had a couple of nasty
problems here.  We have two fileservers running OpenAFS 1.2.7 on
IRIX 6.5, and one running OpenAFS 1.2.7 on a Linux 2.4.19 machine
(Debian woody).

One volume on one of the SGI fileservers somehow got itself
duplicated: that is, there were two volumes with the same
name on the same vice partition.  For some reason, one of
these volumes was always visible to users, got written to, and
contained all the valuable data; the other volume was always
backed up, and contained nothing.

Recently I noticed that a volume of this name was listed as off line.
I tried to salvage it, but it was still listed as off line.
So I moved it to a partition on another server.  This changed
everything: now the empty volume was the one that was both visible
and backed up.  We solved the problem by deleting the empty volume
from the new partition, and then doing 'vos syncv' for the specific
server, partition, and volume on the old disk.  This made the volume
with the data visible again, and I moved it to the new server.

Question: How can there be two volumes with the same name on
the same partition?

The server where this volume appeared has been having problems
recently.  Its load average goes up, sometimes to 5, sometimes to 10,
and it becomes very unresponsive.  Attempts to move or back up
volumes on the server may cause it to lose contact with clients.
Some vos commands may hang and be unkillable, in the sense that
even after being sent signal 9 (SIGKILL), they are still there
12 hours later; only a reboot can get rid of them.  Moreover,
attempts to move volumes often end up timing out.

Rebooting this system is perilous as well, since the salvage
operation usually takes at least 30 or 40 minutes, even when
the machine was shut down cleanly.  Once recently it took
4 hours.  Some problematic volumes on this machine have poor
access times; transfer rates of between 5 and 15 _megabytes_
per minute are not uncommon.  This is an Origin machine, with a
180 MHz IP27 processor.  The other SGI server is identical hardware
and identical software, but never shows this problem.

Questions: I'm afraid of running 'vos syncv' and 'vos syncs'
generally; I might lose more un-backed-up volumes with data
and keep the empty backup volumes, if there are more like this.
How can we identify potential problem volumes?  Also, what is
wrong with the machine that it is having these performance
problems?  There are no reports of hardware errors, which
usually do show up on SGI machines.  Finally, how can I
fix it?  There are almost 1300 volumes on this server,
3705 on the twin machine, and 1276 on the new Linux server.

     -- Owen
     LeBlanc@mcc.ac.uk