[OpenAFS] AFS server processes hanging on SG servers

Paul Blackburn mpb@est.ibm.com
Tue, 14 Jan 2003 12:43:29 +0000

Hello Owen,

One workaround to consider: could you change your cell configuration
to have three dedicated AFS database servers running on Linux,
and use the SGI machines as dedicated fileservers?
paul                                  http://acm.org/~mpb

Dr A V Le Blanc wrote:

>I wrote on this subject some while ago, but there was no
>real response: someone suggested there might be a problem
>with the file system in which /usr/afs/db resides, but this
>doesn't seem to be the case.  I now have more data about the problem.
>We have three database/file servers, one Linux machine and
>two Silicon Graphics Origin 200s.  The Linux machine is
>now running openafs-1.2.8, and the Silicon Graphics machines
>are running openafs-1.2.7.  The Linux machine has not had
>any problems with 1.2.6 or 1.2.7 or 1.2.8; the Silicon Graphics
>machines are both showing problems and did so with 1.2.5 and
>another version, I no longer remember which.  The SG servers are
>running Irix 6.4 on mips IP27 processors.
>The problem is that various AFS server processes hang.  This causes
>problems with ubik voting, as I reported on October 28 on this list,
>and it also causes problems because the processes do not respond
>to incoming requests.  So when an incoming authentication request
>tries to contact the kaserver, it waits to time out before trying
>a second db server, and this makes logins interminable when a
>kaserver process has hung.  The hung processes appear to be in
>the wait state, and they do not die when killed with a 'kill -9':
>they can only be removed by rebooting the machine.  When the hung
>process is volserver, this means all /vice partitions are corrupt
>after the reboot, and the salvaging usually takes about 45 minutes.
>The following server processes have been hanging: kaserver, ptserver,
>buserver, volserver.  I have not seen the fileserver or vlserver hang
>yet; the bosserver doesn't hang but did frequently die without
>trace; this means killing the other server processes, which doesn't
>seem to be possible without having to salvage all the vice partitions.
>The hanging has taken place 8 times since the start of December,
>four times on each machine.  Over Christmas there was a simultaneous
>hang of the kaserver on one machine and the ptserver on the other.
>Hanging can be detected by noticing the load average: these machines
>rarely have a load average above .2 or .3; when a server process is
>hung, the load average is 1.2 or 1.3; once there were two processes
>hung at the same time, and the load average was over 2.
>Has anyone else seen any symptoms of problems like these?  Is
>there anything we can do other than move the AFS service off of
>the Silicon Graphics servers and onto Linux boxes?  (We wouldn't
>have SGs at all but for some unfortunate internal politics.)
>     -- Owen
>     LeBlanc@mcc.ac.uk
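
The load-average symptom described in the quoted message (normally .2 or .3,
about 1.2 or 1.3 with one hung process, over 2 with two) could be turned into
a rough automated check. What follows is only a minimal sketch, assuming
Python is available on the servers and that each hung process adds roughly 1
to the load average, as those figures suggest. The name estimate_hung and the
IDLE_LOAD constant are made up for illustration and are not part of OpenAFS.

```python
import os

# Upper end of the normal load average reported in the post (.2 or .3).
# This is an assumption taken from that description, not a measured value.
IDLE_LOAD = 0.3


def estimate_hung(load_avg, idle_load=IDLE_LOAD):
    """Roughly estimate how many processes are hung, given a load average.

    Assumes each hung process contributes about 1 to the load average
    on top of the usual idle load.
    """
    excess = load_avg - idle_load
    return max(0, round(excess))


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    one_minute, _, _ = os.getloadavg()
    hung = estimate_hung(one_minute)
    if hung:
        print("load %.2f: possibly %d hung server process(es)" % (one_minute, hung))
    else:
        print("load %.2f: looks normal" % one_minute)
```

Run from cron on each server, something like this would only flag a suspicious
load; it cannot tell which server process has hung, and any other sustained
load on the machine would produce a false alarm.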