[OpenAFS] [1.2.7] Strange file server meltdown

Rainer Toebbicke rtb@pclella.cern.ch
Fri, 13 Dec 2002 10:05:26 +0100


Russ Allbery wrote:
> Hello folks,
> 
> We're running OpenAFS 1.2.7 on Solaris 8, and are seeing an unusual
> problem.  Two of our file servers are periodically going into an
> apparently load-related meltdown around 3:30am to 4:00am at fairly
> unpredictable intervals.  We're having about one instance of this a week.
> 
...


We've seen a similar problem twice in the last two weeks: Solaris 8, OpenAFS 
1.2.7, and the fileserver reports '3252 calls waiting for a thread' and 
'2 threads are idle'. All clients on it hang, the system is at 100% CPU, and 
'bos restart' sends a message to FileLog but then nothing happens.

While the server was in that state I took a 'gcore', a snoop snapshot, and 
rxdebug output. I also ran truss: the server was *only* doing send/receive, 
with no disk I/O or anything else.

Luckily we always run everything compiled with '-g'. In the gcore, all threads 
except the usual 'maintenance' ones were waiting on host_glock_mutex, apart 
from the one which held it in h_TossStuff_r. I walked the host chain it was 
'tossing', but there was no obvious sign of a tight loop: the chain ended 
after a couple of dozen entries. A pity that I did not take another gcore a 
few seconds later.
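
The "does the chain end or loop back on itself" check described above can be 
sketched in C as follows. This is only an illustration of the idea, assuming a 
simple singly linked chain: 'struct node', 'chain_has_cycle' and 
'chain_length' are hypothetical names, not the actual OpenAFS host structures 
or functions, and on a real core file the walk would be done in the debugger.

```c
#include <stddef.h>

/* Hypothetical stand-in for one entry in a host chain;
 * NOT the real OpenAFS host structure. */
struct node {
    struct node *next;
};

/* Returns 1 if the chain starting at head loops back on itself,
 * 0 if it terminates in NULL (Floyd's two-pointer cycle check). */
int chain_has_cycle(const struct node *head)
{
    const struct node *slow = head;
    const struct node *fast = head;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;        /* advances one entry per step */
        fast = fast->next->next;  /* advances two entries per step */
        if (slow == fast)
            return 1;             /* pointers met: the chain is circular */
    }
    return 0;                     /* fast pointer ran off the end */
}

/* Returns the number of entries in the chain, or -1 if it is circular. */
long chain_length(const struct node *head)
{
    long n = 0;

    if (chain_has_cycle(head))
        return -1;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

A chain that ends after a couple of dozen entries, as seen here, would give 0 
from chain_has_cycle, so a spin would have to come from somewhere other than a 
corrupted 'next' pointer in that chain.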


Actually, we're now busy reverting to OpenAFS 1.2.6! Circumstantial evidence 
only - a number of problems appeared shortly after upgrading to 1.2.7:

1. this hang, twice

2. an assertion failure in host.c GetClient(). There is obviously a window for 
a race condition in h_GetHost_r. The question is whether this is related to 1. 
above.

3. TWO crashes each night around 4:30 in rxi_AttachServerProc() on 
queue_Remove(call) with call==NULL.

I'll post details about 2 & 3 above as soon as I am able to collect enough 
evidence.


In the meantime: how can I find out what deltas went into 1.2.7 after 1.2.6? 
Globally, I mean, not at the level of individual source files?

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke        http://cern.ch/~rtb         rtb@mail.cern.ch  O__
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland   > |
Phone: +41 22 767 8985       Fax: +41 22 767 7155                     ( )\( )