[OpenAFS] fileserver threads stuck on solaris servers

Rainer Toebbicke rtb@pclella.cern.ch
Tue, 04 Feb 2003 10:19:29 +0100


Frederic Gilbert wrote:
> We have a problem that has occurred four or five times in the last two
> weeks, with DB and FS servers running 1.2.7 on Solaris 2.6.
> 
> The main symptom is that fileserver processes are kind of stuck, not
> responding any more to solicitations on port 7001 (rxdebug <server> fails),

The fileserver does not listen on port 7001 but on 7000 (7001 is the
callback port of the client cache manager).
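
A minimal sketch of probing the right port, assuming rxdebug is on the PATH
and accepts the port as its second argument. The helper name probe_cmd and
the script name in the usage line are made up for illustration; the function
only prints the command so it can be checked before running it.

```shell
#!/bin/sh
# Sketch: build an rxdebug probe for the ports discussed here.
# probe_cmd is a hypothetical helper, not part of OpenAFS.

FILESERVER_PORT=7000   # fileserver
CALLBACK_PORT=7001     # client cache manager (callback service)

# Print the rxdebug command for a host and port. -version asks the remote
# Rx endpoint for its version string, which serves as a cheap liveness check.
probe_cmd() {
    echo "rxdebug $1 $2 -version"
}

# Usage: ./probe.sh <server>
if [ $# -eq 1 ]; then
    probe_cmd "$1" "$FILESERVER_PORT"
fi
```

If the probe against 7000 times out while the process is still alive, that
matches the hang described in the original report.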

> hence not serving files any more. Running truss on the processes shows that
> they are stuck in some thread semaphore/signaling routines
> (lwp_mutex_lock/unlock, lwp_sema_post/wait,...).

When it's hung, run 'gcore <pid>' to get a core dump. In dbx, 'threads',
'thread -blockedby <threadid>' or 'thread -blocks <threadid>' might give you
a hint. This works best when everything is compiled and linked with '-g'. I
wrote a little dbx script that does a traceback of all threads, which you
could try, but I suspect it only works correctly with debugging information.
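
The sequence can be sketched roughly as follows. This is not the dbx script
mentioned above, only a dry-run helper that prints the steps. Assumptions:
the fileserver binary lives in /usr/afs/bin/fileserver (override with
FILESERVER_BIN), and gcore writes core.<pid> in the current directory. The
names core_file and debug_plan are made up for illustration.

```shell
#!/bin/sh
# Sketch only: print the debugging steps for a hung fileserver.
# Assumed, not from the post: the binary path below and the default
# gcore output name core.<pid>.
FILESERVER_BIN=${FILESERVER_BIN:-/usr/afs/bin/fileserver}

# Name of the core file that 'gcore <pid>' is assumed to write.
core_file() {
    echo "core.$1"
}

# Print the steps for a given pid instead of running them, so the
# sequence can be reviewed first. Lines starting with "(dbx)" are
# typed at the dbx prompt, not in the shell.
debug_plan() {
    pid=$1
    echo "gcore $pid"
    echo "dbx $FILESERVER_BIN $(core_file "$pid")"
    echo "(dbx) threads"
    echo "(dbx) thread -blockedby <threadid>"
    echo "(dbx) thread -blocks <threadid>"
}

# Usage: ./debug_plan.sh <pid>
if [ $# -eq 1 ]; then
    debug_plan "$1"
fi
```

Replace <threadid> with an id taken from the 'threads' listing; following
'-blockedby' from thread to thread should lead to whichever thread holds the
contended lock.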

(You mention 'truss'; I take it you're running the pthreaded fileserver,
since the 'lwp_' calls you mention come from the Solaris pthread
implementation and not from the non-pthreaded fileserver.)

> On the other hand, the fileserver process being alive, bos status or fs
> checkserv don't report any server failure.
> When the problem occurs, a great number of "ProbeUuid failed" and sometimes
> "volume callback for host xxxxxxxx.7001 failed" messages are found in
> FileLog.

The recovery process for those situations is indeed broken in that threads can
step on each other's toes. That particular problem will be alleviated in the
upcoming 1.2.9, but I doubt it's all over yet.


-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke        http://cern.ch/~rtb         rtb@mail.cern.ch  O__
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland   > |
Phone: +41 22 767 8985       Fax: +41 22 767 7155                     ( )\( )