[OpenAFS-devel] "Lost contact with file server" problems

Roland Kuhn rkuhn@e18.physik.tu-muenchen.de
Sun, 21 Aug 2005 20:55:00 +0200 (CEST)


Hi Lyle!

On Sun, 21 Aug 2005, Lyle wrote:

> What Derrick Said.
>
> You have to leave a packet capture running continuously on the off chance
> that this might happen...  Not just the first error packet, but the last
> couple of RPCs just before that.  So you really want
> 1.  a network monitor that implements stop triggers.  These used to be
> rather expensive, but maybe ethereal finally implemented them?  I don't
> know.
> 2.  a true broadcast network or the ability to tap your switch so you don't
> have to run monitor software directly on the fileserver.  Running software
> on the client is not likely to be useful unless you can reliably predict
> which system will be affected.
>
Well, I did it differently: I have eight potentially failing clients, and 
for each of them a tcpdump running on the fileserver that writes capture 
files of 100MB. I have also written a small Perl script which keeps only 
the last 10 files (about 1GB) for each client. As soon as the failure 
happens, the other seven clients will finish their jobs (which means up to 
3.5GB of traffic each) and then the complete system will (hopefully) be 
quiet. I reckon that the failed client will not produce 1GB of traffic once 
it is in the bad state, so the packets around the failure should still be 
in the files that are kept.
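
For illustration, the pruning step can be sketched roughly as follows. This 
is a minimal sketch in Python rather than the Perl script mentioned above, 
and it assumes a hypothetical naming scheme "<client>_<anything>.pcap" for 
the capture files; the function name prune_captures and the keep parameter 
are likewise made up for this example.

```python
import os
from collections import defaultdict


def prune_captures(directory, keep=10):
    """Delete all but the `keep` newest capture files per client.

    Assumption (hypothetical naming): files are called
    "<client>_<anything>.pcap", so the client is the part before the
    first underscore. Returns the sorted list of removed file names.
    """
    by_client = defaultdict(list)
    for name in os.listdir(directory):
        if not name.endswith(".pcap") or "_" not in name:
            continue  # not one of our capture files
        client = name.split("_", 1)[0]
        by_client[client].append(name)

    removed = []
    for names in by_client.values():
        # Oldest first by modification time; the name breaks ties.
        names.sort(
            key=lambda n: (os.path.getmtime(os.path.join(directory, n)), n)
        )
        # Everything except the last `keep` entries is surplus.
        surplus = names[:-keep] if keep > 0 else names
        for name in surplus:
            os.remove(os.path.join(directory, name))
            removed.append(name)
    return sorted(removed)
```

Run periodically, this keeps the file tcpdump is currently writing, since 
that one always has the newest modification time. If the installed tcpdump 
supports the -W option, combining it with -C (for example -C 100 -W 10) 
should give a similar rotating buffer without a separate script.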

> Wait a sec.  At this point, you're thinking you know which system will be
> affected, it's this one at 192.168.18.34, right?  But what I'm saying is --
> After you reboot that machine, and it comes back up and is running normally
> for a while, which client will be next to experience this bug?  Is it always
> the same one?  Even after reboots?  That is new, useful, and surprising
> information.
>
No, I cannot predict which one will fail. But I don't have to reboot: the 
immediate fix is to stop and restart the AFS client, but it also suffices 
to simply wait roughly two hours.

> My experience was that the affected client would vary and not be
> particularly reproducible, which means that you have to monitor a whole lot
> of connections simultaneously, hence a tap on the switch.
>
> Make sense?
>
Yes, if we had such a switch. I'm writing the tcpdumps to a different RAID 
than the one housing the processed data, and I'm hoping that this will not 
prevent the error from occurring. Let's see what I find tomorrow morning.

Ciao,
 					Roland