[OpenAFS] fileserver meltdown diagnostics

Nathan Neulinger nneul@umr.edu
Sun, 12 Dec 2004 10:29:02 -0600


What causes a fileserver thread to stay tied up indefinitely? We are
diagnosing an issue on one of our fileservers, where at least one client
is holding open connections like these:

Connection from host 131.151.99.183, port 7001, Cuid 858895d7/6dfc428
  serial 64,  natMTU 1260, security index 0, server conn
    call 0: # 1, state active, mode: error
    call 1: # 0, state not initialized
    call 2: # 0, state not initialized
    call 3: # 0, state not initialized

(about 10 of those)

Connection from host 131.151.99.183, port 7001, Cuid 9904b28f/6d935cc
  serial 2815,  natMTU 1260, security index 0, client conn
    call 0: # 88, state active, mode: receiving, flags: reader_wait, has_output_packets
    call 1: # 0, state not initialized
    call 2: # 0, state not initialized
    call 3: # 0, state not initialized

(and ONE of those...)
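
To watch how many of these pile up over time, the dump can be tallied per
host and call mode. This is only a minimal sketch: it assumes the dump has
been saved as text in exactly the layout shown above, and the function name
tally_active_calls is made up for illustration, not part of any AFS tool.

```python
import re
from collections import Counter

# Matches the "Connection from host ..." header line of each entry.
CONN_RE = re.compile(r"^Connection from host (\S+), port (\d+), Cuid (\S+)")

# Matches a "call N: # M, state S[, mode: X][, flags: ...]" line.
CALL_RE = re.compile(
    r"^\s*call (\d+): # (\d+), state ([^,]+?)"
    r"(?:, mode: ([^,]+))?(?:, flags: (.*))?\s*$"
)


def tally_active_calls(dump_text):
    """Count active calls per (host, mode) in a saved connection dump.

    Assumption: the text layout is the one shown in this post; other
    versions of the tool may print something different.
    """
    counts = Counter()
    host = None
    for line in dump_text.splitlines():
        conn = CONN_RE.match(line)
        if conn:
            host = conn.group(1)
            continue
        call = CALL_RE.match(line)
        if call and host is not None and call.group(3).strip() == "active":
            mode = (call.group(4) or "unknown").strip()
            counts[(host, mode)] += 1
    return counts
```

Running this against successive dumps and comparing the count for
(host, "error") would show whether the stuck calls from one client are
accumulating, which is the pattern suspected here.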



I believe another client may be intermittently causing the same problem,
so that all remaining threads get taken and the server goes into a
meltdown state.

What would cause these connections/threads not to be reclaimed? That is,
once they get into the error state, why aren't they being freed?

I have a FULL network trace of all traffic from this particular client to
this server while the problem is happening, but unfortunately not from
when it started. I will have that as soon as it melts down again, though.
(Not if, when. I expect it sometime in the next 2 hours. About 10 minutes
ago the idle thread count was steady at 6; it is now steady at 5. I expect
it to keep counting down until the server melts.)

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-6679
UMR Information Technology             Fax: (573) 341-4216