[OpenAFS] fileserver meltdown diagnostics
Nathan Neulinger
nneul@umr.edu
Sun, 12 Dec 2004 19:36:55 -0600
On Sun, Dec 12, 2004 at 02:15:26PM -0500, Derrick J Brashear wrote:
> On Sun, 12 Dec 2004, Nathan Neulinger wrote:
>
> >What causes a thread to get sucked up continually. We are diagnosing
> >an issue with one of our fileservers that has a problem with at least one
> >client that is holding open:
>
> What server version? What does rxdebugging the client say? If it's a
> platform where you can do so easily, can you get a backtrace from the
> fileserver? I have an idea, particularly if it's older than 1.2.13 server.
The server was running a CVS build from around March/April 2004; the client is
the 1.3.7004 Windows client. Upgrading the server to 1.3.75 made no change in
the symptom.
The only odd thing I could see was that the client appeared to be sending a
gettime request to the server that was never answered, even though the client
and server were exchanging rx ack messages with Ping/PingResponse content.
What we determined (and verified by watching) was that this one client seemed
to slowly accumulate error connections to the file server, one every 10-14
minutes. Once the server was down to 2 idle threads, meltdown started.
Unfortunately, we don't have a failure case any more: the problem disappeared
instantly when we logged into the offending client remotely, and it has not
recurred.
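For anyone wanting to watch for the same condition, here is a minimal sketch of
checking the idle-thread count from captured rxdebug output. It assumes the
output contains lines of the form "N threads are idle" and "N calls waiting for
a thread"; check that against what your rxdebug build actually prints. The
function names, the sample text, and the use of 2 as the alarm floor (taken
from the observation above) are illustrative assumptions, not part of OpenAFS.

```python
import re

# Assumed line formats in rxdebug output; verify against your own build.
IDLE_RE = re.compile(r"(\d+) threads are idle")
WAITING_RE = re.compile(r"(\d+) calls waiting for a thread")


def parse_thread_stats(text):
    """Return (idle_threads, calls_waiting); None where a line is absent."""
    idle = IDLE_RE.search(text)
    waiting = WAITING_RE.search(text)
    return (
        int(idle.group(1)) if idle else None,
        int(waiting.group(1)) if waiting else None,
    )


def near_meltdown(text, idle_floor=2):
    """True if idle threads are at or below the floor, or calls are queued.

    idle_floor=2 mirrors the point at which meltdown was observed above;
    it is a hypothetical default, not a documented threshold.
    """
    idle, waiting = parse_thread_stats(text)
    if idle is None:
        return False  # nothing recognisable to judge from
    return idle <= idle_floor or bool(waiting)


# Made-up sample text for illustration only, not a real capture.
SAMPLE = "0 calls waiting for a thread\n2 threads are idle\n"

if __name__ == "__main__":
    print(parse_thread_stats(SAMPLE), near_meltdown(SAMPLE))
```

Feeding it the saved output of an rxdebug run against the fileserver port every
few minutes would show the slow decline in idle threads before it gets critical.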
At this point, all 7004 clients are being downgraded to 1.2.9b, since that is
what 99.9% of the rest of the Windows clients are running, and while there are
issues with 1.2.9b, they are at least known issues. (I don't have much
information on or involvement in the Windows client deployment, so I can't
really give much more detail there.)
-- Nathan
------------------------------------------------------------
Nathan Neulinger EMail: nneul@umr.edu
University of Missouri - Rolla Phone: (573) 341-6679
UMR Information Technology Fax: (573) 341-4216