[OpenAFS] fileserver threads stuck on solaris servers
Rainer Toebbicke
rtb@pclella.cern.ch
Tue, 04 Feb 2003 10:19:29 +0100
Frederic Gilbert wrote:
> We have a problem that occurred four or five times in two weeks now, with
> DB and FS servers, running 1.2.7 on Solaris-2.6.
>
> The main symptom is that fileserver processes are kind of stuck, not
> responding any more to solicitations on port 7001 (rxdebug <server> fails),
The fileserver does not listen on port 7001 but on 7000; 7001 is the cache manager (client) port.
> hence not serving files any more. Running truss on the processes shows that
> they are stuck in some thread semaphore/signaling routines
> (lwp_mutex_lock/unlock, lwp_sema_post/wait,...).
When it's hung, run 'gcore <pid>' to get a core dump. In dbx, 'threads',
'thread -blockedby <threadid>', or 'thread -blocks <threadid>' might give you
a hint. This works best when everything is compiled and linked with '-g'. I
wrote a little dbx script that prints a traceback of all threads, which you
could try, but I suspect it only works correctly with debugging information.
(You mention 'truss', and I take it you're running the pthreaded fileserver:
the 'lwp_' calls you mention come from the Solaris pthread implementation, not
from the non-pthreaded fileserver.)
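
Once 'threads' has listed the thread IDs, the per-thread queries can be
generated rather than typed by hand. A minimal Python sketch, assuming the IDs
look like t@1, t@2 as dbx prints them; the helper name dbx_queries is made up
for illustration, and it only emits the commands named above:

```python
def dbx_queries(thread_ids):
    """Build the dbx commands that ask, per thread, what blocks it and what it blocks."""
    cmds = ["threads"]  # list all threads first
    for tid in thread_ids:
        cmds.append(f"thread -blockedby {tid}")  # who holds what this thread waits for
        cmds.append(f"thread -blocks {tid}")     # which threads this one is holding up
    return cmds


# Hypothetical thread IDs, as read off the 'threads' listing.
print("\n".join(dbx_queries(["t@1", "t@2"])))
```

The printed lines can be pasted into a dbx session opened on the fileserver
binary and the core file written by gcore.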
> On the other hand, the fileserver process being alive, bos status or fs
> checkserv don't report any server failure.
> When the problem occurs, a great number of "ProbeUuid failed" and sometimes
> "volume callback for host xxxxxxxx.7001 failed" messages are found in
> FileLog.
The recovery process for those situations is indeed broken in that threads can
step on each other's toes. That particular problem will be alleviated in the
upcoming 1.2.9 but I doubt it's all over yet.
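
Since bos status does not notice the hang, counting those FileLog messages is
one way to spot it early. A rough Python sketch, assuming the message fragments
are exactly as quoted above and that FileLog is at /usr/afs/logs/FileLog (the
usual location, but check your installation); the function names are
illustrative only:

```python
import re
from collections import Counter

# Fragments taken from the messages quoted in this thread; the exact wording
# in FileLog may differ between releases, so treat these as assumptions.
PATTERNS = {
    "probe_uuid": re.compile(r"ProbeUuid failed"),
    "volume_callback": re.compile(r"volume callback for host \S+ failed"),
}


def count_failures(lines):
    """Count how many log lines match each failure pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def count_file(path="/usr/afs/logs/FileLog"):
    """Run count_failures over a log file; the default path is an assumption."""
    with open(path, errors="replace") as fh:
        return count_failures(fh)
```

Calling count_file() from cron and alerting when the counts jump between runs
would be one way to use it; what counts as "a great number" is site-specific.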
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke http://cern.ch/~rtb rtb@mail.cern.ch O__
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland > |
Phone: +41 22 767 8985 Fax: +41 22 767 7155 ( )\( )