[OpenAFS] AFS DB server locking up

Mon, 15 Apr 2002 10:02:56 +1200

Hi
We have recently had a major problem with our OpenAFS cell which was caused 
(we are not 100% sure that it is the only cause) by a hardware problem with 
a MegaRAID RAID controller and a shared IRQ, plus a couple of failed hard 
disks that the RAID controller did not detect for some reason. While this 
problem was not caused by AFS, it certainly had major effects on our cell.

When a user on the affected server tried to write to their volume, it would 
cause a fileserver process to essentially lock up (on a Linux system the 
process showed state D and never changed back, and the load would start to 
climb). AFS would continue to run (although that user's write would fail) 
until enough fileserver processes went into this state, and then the server 
would become unavailable. Unfortunately, once one fileserver process was in 
this state, the only way to get AFS back was to power cycle the machine (AFS 
could not be killed and the machine could not be shut down). Interestingly, 
vos listvol would work for a while even though you could not access any 
volumes on the server, but eventually vos listvol would fail as well.
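
For anyone who wants to catch this state early, here is a rough sketch of 
how the D-state processes could be spotted on Linux by reading 
/proc/<pid>/stat. This is not something we had running at the time; the 
function names parse_stat and uninterruptible are made up for illustration, 
and it assumes a standard Linux /proc layout.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' instead of on whitespace.
    left, right = text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state


def uninterruptible(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    if not os.path.isdir(proc_root):
        return found  # not a Linux-style /proc, nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or stat was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible():
        print(pid, comm)
```

A process sitting briefly in D is normal during disk I/O, so a check like 
this would only be meaningful if the same fileserver PIDs keep showing up 
across repeated runs.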

Unfortunately this outage on one fileserver would take down RW access to 
the cell, as it was locking up the primary DB server (or loading it up, or 
something similar). If you did not shut the affected fileserver down the 
moment it started to have locked processes, the DB server would eventually 
start to refuse people access to their RW volumes. This lasted for several 
hours until I found out about the problem. A bos restart on the DB server 
fixed the problem once (after the affected fileserver had been turned off). 
Another time I had to reboot the server (not power cycle it). There are no 
log entries on the DB server reporting any obvious errors, and the udebug 
output looked OK, but I did not save any of it for analysis. Interestingly, 
I have managed to induce a similar problem by screwing up the routing on a 
fileserver so that it could not talk back to anyone, but I don't know if it 
is the same problem, as the fileserver processes crashed at about the same 
time I killed the routing table.
Is there anything we can change to make the DB server more robust in these 
circumstances? Is it a problem with a timeout, or with an open connection 
not being killed?

Cheers
Matt Cocker