[OpenAFS] fileserver threads stuck on solaris servers

Frederic Gilbert Frederic.Gilbert@inria.fr
Mon, 03 Feb 2003 15:58:34 +0100


We have a problem that has occurred four or five times in the last two weeks
on our DB and FS servers, which run OpenAFS 1.2.7 on Solaris 2.6.

The main symptom is that the fileserver processes get stuck: they stop
responding to requests on port 7001 (rxdebug <server> fails) and therefore
stop serving files. Running truss on the processes shows that they are
stuck in thread semaphore/signaling routines
(lwp_mutex_lock/unlock, lwp_sema_post/wait,...).
On the other hand, because the fileserver process is still alive, neither
bos status nor fs checkservers reports any server failure.
When the problem occurs, a large number of "ProbeUuid failed" messages, and
sometimes "volume callback for host xxxxxxxx.7001 failed" messages, appear
in FileLog.
So far, nothing of the sort has happened on our two Linux servers,
which run 1.2.7 on RH7.1.
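
Since bos status and fs checkservers keep reporting the server as healthy, the
hang has to be detected from outside by a probe with a hard timeout. What
follows is only a minimal sketch of such a check in Python, not something we
run in production. The function name probe and the 10-second default are
hypothetical choices, and the sketch assumes that rxdebug either hangs or
exits with a non-zero status when the fileserver does not answer.

```python
import subprocess


def probe(command, timeout_s=10.0):
    """Run a probe command and classify the result.

    Returns "ok" if the command exits with status 0, "failed" if it exits
    with a non-zero status or cannot be started, and "timeout" if it does
    not finish within timeout_s seconds.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except OSError:
        # The probe binary could not be executed (for example, not installed).
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


# Assumed usage, with a placeholder host name:
#   status = probe(["rxdebug", "afs1.example.org"])
#   if status != "ok":
#       ...alert an operator or schedule a restart...
```

Run from cron against each server, anything other than "ok" would flag a
fileserver that is alive but no longer answering.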

These servers have been running for a few months now, and did not show this
behavior until two weeks ago. We have not made any significant change to our
configuration in that time (apart from changing the AFS keys, if that helps...).

If this is related to a known OpenAFS problem, or if anyone has a suggestion,
we would be pleased to hear about it...

Thanks in advance,
Fred Gilbert.