[OpenAFS] AFS server processes hanging on SG servers

Dr A V Le Blanc <LeBlanc@mcc.ac.uk>
Tue, 14 Jan 2003 12:05:25 +0000


I wrote on this subject some while ago, but there was no
real response: someone suggested there might be a problem
with the file system in which /usr/afs/db resides, but this
doesn't seem to be the case.  I now have more data about the
problem.

We have three database/file servers, one Linux machine and
two Silicon Graphics Origin 200s.  The Linux machine is
now running openafs-1.2.8, and the Silicon Graphics machines
are running openafs-1.2.7.  The Linux machine has not had
any problems with 1.2.6 or 1.2.7 or 1.2.8; the Silicon Graphics
machines are both showing problems, and also did so with 1.2.5 and
one other version (I no longer remember which).  The SG servers are
running Irix 6.4 on mips IP27 processors.

The problem is that various AFS server processes hang.  This causes
problems with ubik voting, as I reported on October 28 on this list,
and it also causes problems because the processes do not respond
to incoming requests.  So when an incoming authentication request
tries to contact the kaserver, it waits to time out before trying
a second db server, and this makes logins interminable when a
kaserver process has hung.  The hung processes appear to be in
the wait state, and they do not die when killed with a 'kill -9':
they can only be removed by rebooting the machine.  When the hung
process is volserver, this means all /vice partitions are corrupt
after the reboot, and the salvaging usually takes about 45 minutes.
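One way to notice a hung database server before users do would be to
probe each one with a hard timeout.  The sketch below is only an
illustration and is not something we run: it assumes Python and the
OpenAFS udebug command are available on a monitoring host, the host
name in the usage note is a placeholder, and the helper names
(udebug_command, probe) are hypothetical.  The ports are the standard
Ubik ports for each database server.

```python
import subprocess

# Standard Ubik ports for the OpenAFS database server processes.
UBIK_PORTS = {
    "ptserver": 7002,
    "vlserver": 7003,
    "kaserver": 7004,
    "buserver": 7021,
}


def udebug_command(host, service):
    """Build the udebug command line for one database service on one host."""
    return ["udebug", "-server", host, "-port", str(UBIK_PORTS[service])]


def probe(command, timeout_s=10.0):
    """Run a probe command and classify the outcome.

    Returns "ok" if the command exits with status 0, "timeout" if it does
    not finish within timeout_s seconds (the suspected-hang case), and
    "error" if it cannot be started or exits with a non-zero status.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except OSError:
        return "error"
    return "ok" if result.returncode == 0 else "error"
```

For example, probe(udebug_command("afsdb1.example.com", "kaserver"))
would report "timeout" if the probe itself got stuck waiting on the
server.  Note that volserver is not a Ubik server, so this probe does
not cover it.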

The following server processes have been hanging: kaserver, ptserver,
buserver, volserver.  I have not seen the fileserver or vlserver hang
yet; the bosserver doesn't hang, but it has frequently died without a
trace.  Recovering from that means killing the other server processes,
which doesn't seem to be possible without having to salvage all the
/vice partitions afterwards.

The hanging has taken place 8 times since the start of December,
four times on each machine.  Over Christmas there was a simultaneous
hang of the kaserver on one machine and the ptserver on the other.
Hanging can be detected by noticing the load average: these machines
rarely have a load average above 0.2 or 0.3; when a server process is
hung, the load average is 1.2 or 1.3; once there were two processes
hung at the same time, and the load average was over 2.
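The load-average rule of thumb above can be written down as a small
check.  This is a minimal sketch, assuming Python is available on the
server and that os.getloadavg() works on the platform; the 0.3 baseline
comes from the figures above, and the function names are made up for
illustration.

```python
import os

# Idle baseline taken from the observation above: the load average on
# these machines is rarely above 0.2 or 0.3 when nothing is hung.
BASELINE_MAX = 0.3


def estimate_hung(load_avg, baseline=BASELINE_MAX):
    """Estimate how many processes are hung from a load average.

    Each hung process was observed to add roughly 1.0 to the load, so the
    excess over the idle baseline is rounded to the nearest whole number.
    """
    excess = load_avg - baseline
    return max(0, round(excess))


def check_once():
    """Read the current load averages and report a suspected hang."""
    one_min, five_min, fifteen_min = os.getloadavg()
    suspected = estimate_hung(five_min)
    if suspected:
        print("load %.2f: about %d hung process(es) suspected"
              % (five_min, suspected))
    return suspected


if __name__ == "__main__":
    check_once()
```

This only says that something is stuck, not which server process it is;
a busy but healthy machine would trip the same check.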

Has anyone else seen any symptoms of problems like these?  Is
there anything we can do other than move the AFS service off of
the Silicon Graphics servers and onto Linux boxes?  (We wouldn't
have SGs at all but for some unfortunate internal politics.)

     -- Owen
     LeBlanc@mcc.ac.uk