[OpenAFS] fileserver meltdown diagnostics

Nathan Neulinger nneul@umr.edu
Sun, 12 Dec 2004 19:36:55 -0600


On Sun, Dec 12, 2004 at 02:15:26PM -0500, Derrick J Brashear wrote:
> On Sun, 12 Dec 2004, Nathan Neulinger wrote:
> 
> >What causes a thread to get sucked up continually? We are diagnosing
> >an issue with one of our fileservers that has a problem with at least one
> >client that is holding open:
> 
> What server version? What does rxdebugging the client say? If it's a 
> platform where you can do so easily, can you get a backtrace from the 
> fileserver? I have an idea, particularly if it's older than 1.2.13 server.

Was running CVS from around March/April 2004 on the server, with a 1.3.7004
Windows client. Upgraded the server to 1.3.75 with no change in symptom.

The only odd thing I could see was that the client appeared to be sending a
gettime request to the server that was never answered, even though the client
and server were exchanging rx ack messages w/ Ping/PingResponse content.

What we determined (and verified by watching) was that this one client seemed
to slowly accumulate error connections to the file server (one every 10-14
minutes). Once the server hit 2 idle threads, the meltdown started.
Unfortunately, we don't have a failure case any more: the problem disappeared
instantly when we remotely logged into the client that was causing the issue,
and it has not recurred.
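
To give a rough feel for the timescale, here is a back-of-the-envelope sketch
(not something we ran during the incident). It assumes each stuck connection
ties up exactly one fileserver thread, and the starting idle count of 12 is a
made-up number for illustration; only the 10-14 minute interval and the
2-idle-thread point come from what we observed.

```python
def minutes_until_threshold(idle_now, threshold=2, interval_min=(10, 14)):
    """Estimate (fastest, slowest) minutes until the idle thread count
    drops to `threshold`, assuming one thread is lost per interval.

    idle_now     -- current idle thread count (hypothetical input)
    threshold    -- idle count at which trouble started (2 in our case)
    interval_min -- observed range between lost threads, in minutes
    """
    if idle_now <= threshold:
        # Already at or below the point where the meltdown began.
        return (0, 0)
    threads_to_lose = idle_now - threshold
    fastest, slowest = interval_min
    return (threads_to_lose * fastest, threads_to_lose * slowest)


# Hypothetical example: a server that currently shows 12 idle threads.
print(minutes_until_threshold(12))
```

With those assumed numbers the server would have somewhere between 100 and
140 minutes before reaching 2 idle threads, which is why the buildup was slow
enough to watch.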

At this point, any 7004 clients are being downgraded to 1.2.9b, since that is
what 99.9% of the rest of the Windows clients are running, and while there are
issues with 1.2.9b, they are at least known issues. (I don't have a whole
lot of information or involvement in the Windows client deployment, so I can't
really give much more detail there.)

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-6679
UMR Information Technology             Fax: (573) 341-4216