[OpenAFS-devel] "Lost contact with file server" problems

Lyle lws@o-o.yi.org
Sun, 21 Aug 2005 03:16:05 -0400


What Derrick said.

You have to leave a packet capture running continuously on the off chance
that this might happen.  You need not just the first error packet, but
also the last couple of RPCs just before it.  So you really want:
1.  A network monitor that implements stop triggers.  These used to be
rather expensive, but maybe ethereal finally implemented them?  I don't
know.
2.  A true broadcast network or the ability to tap your switch, so you
don't have to run monitor software directly on the fileserver.  Running
software on the client is not likely to be useful unless you can reliably
predict which system will be affected.
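
To make "stop trigger" concrete, here is a rough sketch of the idea, not
a description of how ethereal or any particular monitor does it: keep
only the most recent packets in a fixed-size ring, and freeze the ring
the moment a packet matches the trigger condition.  The class name, the
packet representation, and the is_trigger predicate are all hypothetical;
in practice the packets would come from whatever capture tool you use.

```python
from collections import deque


class TriggeredCapture:
    """Keep the most recent packets until a stop trigger fires."""

    def __init__(self, is_trigger, pre_trigger=1000):
        # is_trigger: hypothetical predicate, True for the "first error" packet.
        # pre_trigger: how many packets to retain, trigger packet included.
        self.is_trigger = is_trigger
        self.ring = deque(maxlen=pre_trigger)  # oldest packets fall off the front
        self.stopped = False

    def feed(self, packet):
        """Record one packet; return True once the trigger has fired."""
        if self.stopped:
            return True  # capture is frozen, later packets are ignored
        self.ring.append(packet)
        if self.is_trigger(packet):
            self.stopped = True
        return self.stopped

    def snapshot(self):
        """Return the retained packets, oldest first."""
        return list(self.ring)
```

You would call feed() for every packet seen on the tap and dump
snapshot() once feed() returns True.  The result ends with the packet
that tripped the trigger and includes the RPCs leading up to it, as long
as pre_trigger is large enough to cover that window.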

Wait a sec.  At this point, you're thinking you know which system will be
affected: it's this one at 192.168.18.34, right?  But what I'm asking is:
after you reboot that machine, and it comes back up and runs normally
for a while, which client will be next to experience this bug?  Is it
always the same one?  Even after reboots?  If so, that is new, useful, and
surprising information.

My experience was that the affected client would vary and not be
particularly reproducible, which means that you have to monitor a whole
lot of connections simultaneously, hence a tap on the switch.

Make sense?


-----Original Message-----
From: openafs-devel-admin@openafs.org
[mailto:openafs-devel-admin@openafs.org] On Behalf Of Derrick J Brashear
Sent: Sunday, August 21, 2005 1:42 AM
To: openafs-devel@openafs.org
Subject: Re: [OpenAFS-devel] "Lost contact with file server" problems


it needs to include the first error packet, i.e. the window where it
loses contact, to be useful

once it's down, that's not interesting

Derrick