[OpenAFS] Fileserver and bad I/O to underlying disk

Ryan C. Underwood nemesis@icequake.net
Mon, 27 Jun 2011 16:29:14 -0500


On one of my 1.6.0 fileservers I am having some intermittent (once every few
months) trouble with the disk array where for no discernible reason it starts
kicking disks out of a RAID5 until the RAID is offlined.  This results in many
"rejected I/O to offline device" kernel messages, and eventually the kernel
gives up on the disk entirely and the device node disappears from /dev.  I
power cycle the array and it comes back.  But so far I have had to also reboot
the server to straighten it out because the AFS fileserver cannot be recovered
for the following reason.

Other userspace programs doing I/O to the disk array fail out with -EIO
eventually and I can umount -f the other mounts.  Unfortunately, I have not
been able to figure out how to get rid of the fileserver processes so I can
umount -f the vice partitions that are still pointing to the dead device and
straighten everything out from there.  The fileserver process is in D state,
presumably wedged in I/O.  Sending it kill -9 has no effect.  Is there
something in the design of the fileserver that would prevent it from failing
and dying cleanly if something evil happens to the underlying data store?
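
For anyone trying to confirm the same symptom, here is a minimal sketch,
assuming a Linux /proc filesystem, of how the processes stuck in D state
could be listed.  The function names parse_state and uninterruptible_pids
are made up for illustration and are not part of OpenAFS or any tool.

```python
import os


def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so take the first field after the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def uninterruptible_pids(proc_root="/proc"):
    """Yield PIDs of processes currently in D (uninterruptible sleep) state."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                state = parse_state(f.read())
        except (OSError, IndexError):
            continue  # process exited during the scan, or stat was unreadable
        if state == "D":
            yield int(entry)


if __name__ == "__main__":
    for pid in uninterruptible_pids():
        print(pid)
```

This only reports which processes are in D state at the moment of the scan;
it does nothing to unwedge them.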

Sorry if this is a bit confusing, it's hard to explain what is going on.

Ryan C. Underwood, <nemesis@icequake.net>