[OpenAFS] AFS DB server locking up
Matthew Cocker
matt@cs.auckland.ac.nz
Mon, 15 Apr 2002 10:02:56 +1200
Hi
We recently had a major problem with our OpenAFS cell. As far as we can
tell (we are not 100% sure it was the only cause), it came from a hardware
fault: a MegaRAID RAID controller on a shared IRQ, plus a couple of failed
hard disks that the controller did not detect for some reason. While this
problem was not caused by AFS, it certainly had major effects on our cell.
When a user on the affected server tried to write to their volume, it would
cause a fileserver process to essentially lock up (on our Linux system the
process showed state D and never left it, and the load would start to
climb). AFS would continue to run (although that user's write would fail)
until enough fileserver processes were in this state, and then the server
would become unavailable. Unfortunately, once one fileserver process was in
this state, the only way to get AFS back was to power cycle the machine (AFS
could not be killed and the machine could not be shut down). Interestingly,
vos listvol would work for a while even though you could not access any
volumes on the server, but eventually vos listvol would fail as well.
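For anyone wanting to catch this early, the D state described above can be spotted by reading /proc/<pid>/stat on Linux. The following is only a rough sketch, not something we ran at the time: the helper names (parse_stat, find_d_state) and the "fileserver" name filter are my own assumptions, and a process can also sit in D briefly during normal disk I/O, so a single hit does not prove a hang.

```python
import os

def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    left, _, right = stat_text.rpartition(")")
    pid_text, _, comm = left.partition("(")
    return int(pid_text), comm, right.split()[0]

def find_d_state(proc_root="/proc", name_filter="fileserver"):
    """List (pid, comm) for processes in state D whose name contains name_filter.

    name_filter is an assumption: adjust it to whatever your fileserver
    binary is actually called on your system.
    """
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the stat file was unreadable
        if state == "D" and name_filter in comm:
            stuck.append((pid, comm))
    return stuck

if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(f"pid {pid} ({comm}) is in uninterruptible sleep (D)")
```

Run from cron every minute or so, a non-empty result that persists across several runs would be the cue to take the fileserver out of service before enough processes pile up.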
Unfortunately this outage on one fileserver would take down RW access to
the cell, as it was locking up the primary DB server (or loading it up, or
something). If you did not shut the affected fileserver down the moment it
started to have locked processes, the DB server would eventually start to
refuse people access to their RW volumes. This lasted for several hours
until I found out about the problem. A bos restart on the DB server fixed
the problem one time (once the affected fileserver had been turned off).
Another time I had to reboot the server (not power cycle it). There are no
log entries on the DB server reporting any obvious errors, and the udebug
output seemed to look OK, but I did not save any of it for analysis.
Interestingly, I have managed to induce a similar problem by screwing up the
routing on a fileserver so that it could not talk back to anyone, but I
don't know if it is the same problem, as the fileserver processes crashed at
about the same time I killed the routing table.
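Since no udebug output was kept, here is a minimal sketch of how it could be captured next time. It assumes udebug is on the PATH and that the DB server runs vlserver and ptserver on their usual ports (7003 and 7002). The hostname afsdb1.example.org, the log path, and the function names are placeholders of mine, not anything from our cell.

```python
import subprocess
import time

# Usual ubik ports for the AFS database services (assumed defaults).
UBIK_PORTS = {"vlserver": 7003, "ptserver": 7002}

def udebug_command(host, port):
    """Build the udebug command line for one database service."""
    return ["udebug", "-server", host, "-port", str(port), "-long"]

def snapshot(host, out_path, run=subprocess.run):
    """Append one timestamped udebug dump per service to out_path."""
    with open(out_path, "a") as out:
        for name, port in UBIK_PORTS.items():
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            try:
                result = run(udebug_command(host, port),
                             capture_output=True, text=True, timeout=30)
                body = result.stdout + result.stderr
            except (OSError, subprocess.TimeoutExpired) as exc:
                # Record the failure too: udebug hanging or being absent
                # is itself useful evidence during an outage.
                body = f"udebug failed: {exc}\n"
            out.write(f"=== {stamp} {name} port {port} ===\n{body}\n")

if __name__ == "__main__":
    # Hypothetical host and log path; replace with your own.
    snapshot("afsdb1.example.org", "/var/tmp/udebug-history.log")
```

Running this periodically from a machine other than the DB server would leave a record of the sync site and quorum state from before, during, and after the lockup.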
Is there anything we can change to make the DB server more robust in these
circumstances? Is it a timeout problem, or an open connection not being
killed?
Cheers
Matt Cocker