[OpenAFS-devel] rx abort, again

Harald Barth haba@pdc.kth.se
Wed, 05 Oct 2005 18:26:29 +0200 (MEST)


> I thought we had fixed the problem where a client ignores an rx abort from a
> fileserver, but now I'm seeing it on a client built from cvs head yesterday.

So I'm not seeing ghosts either. I don't have a nice trace, but the typical
lines in log/messages have reappeared on one client:

Sep 26 22:03:06 d10n03 kernel: afs: Lost contact with file server 130.237.232.195 in cell pdc.kth.se (all multi-homed ip addresses down for the server)
Sep 26 22:03:08 d10n03 kernel: afs: Tokens for user of AFS id 12020 for cell pdc.kth.se have expired
Sep 26 22:05:33 d10n03 kernel: afs: file server 130.237.232.195 in cell pdc.kth.se is back up (multi-homed address; other same-host interfaces may still be down)

# /usr/openafs/sbin/rxdebug juliana.pdc.kth.se 7001 -version
Trying 194.132.193.64 (port 7001):
AFS version:  OpenAFS 1.4.0-rc4 built  2005-09-15 

It is still better than before (the outage lasted only about 2 minutes,
and so far I have found it on only ONE of the 443 boxes this version
runs on), but the bug is not quite dead yet. I never got an answer to my
previous mail, in which I asked whether the patch should be changed to

   a) be more conservative when changing the connection status to "broken"
      and only change it in that direction

   b) tear down the rx connection right then and there
      (rxabort or something)
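
To make a) and b) concrete, here is a rough sketch of what I mean. It is
not taken from the OpenAFS tree: struct conn, conn_mark(), conn_replace()
and conn_on_abort() are hypothetical names made up for illustration, not
the real rx API, and I am assuming that "only change it in that direction"
means the status may move toward "broken" but never back on the same
connection.

```c
#include <stdbool.h>

/* Hypothetical states, ordered from healthy to broken. */
enum conn_status { CONN_OK = 0, CONN_SUSPECT = 1, CONN_BROKEN = 2 };

/* Hypothetical stand-in for a connection; NOT the real struct rx_connection. */
struct conn {
    enum conn_status status;
    unsigned int epoch;   /* bumped every time the connection is rebuilt */
    bool open;
};

/* (a) The status only ever moves toward "broken". A request to move it
 *     back toward "ok" on the same connection is ignored. */
void conn_mark(struct conn *c, enum conn_status s)
{
    if (s > c->status)
        c->status = s;
}

/* (b) Tear the old connection down and start a fresh one. Only a new
 *     connection (new epoch) is allowed to start out as CONN_OK again. */
void conn_replace(struct conn *c)
{
    c->open = false;      /* old connection is gone */
    c->epoch++;
    c->status = CONN_OK;
    c->open = true;       /* new connection takes its place */
}

/* What I would expect on an abort that tells us the connection is dead:
 * mark it broken, then replace it right away instead of waiting. */
void conn_on_abort(struct conn *c)
{
    conn_mark(c, CONN_BROKEN);
    conn_replace(c);
}
```

The point of keeping conn_mark() one-way is that a late or stray "looks
fine" event cannot hide a real failure. Recovery happens only by replacing
the connection, never by flipping the flag back. Whether an abort should be
treated as killing the whole connection or just the call is exactly the
open question below, so take this as the "we know it's broken" case only.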

> Shouldn't the client tear down the connection and start a new one?  Or does
> abort just apply to that call, not the entire connection?

That was about what I thought, too. If we know it's broken beyond mercy: Shoot.

> Trace is here:
> /afs/citi.umich.edu/user/rees/www/fs-abort.bin

Thanks, ethereal is my friend. 

Harald.