[OpenAFS-devel] "Lost contact with file server" problems

Jeffrey Hutzelman jhutz@cmu.edu
Fri, 26 Aug 2005 16:35:58 -0400


On Monday, August 22, 2005 16:52:29 -0400 Jeffrey Altman 
<jaltman@secure-endpoints.com> wrote:

> I'm sure there is code in the client that identifies expired tokens
> and removes them.   I just don't believe that code is associated in
> any way with the code that processes RXKADEXPIRED errors.

Well, I don't know what strangeness you might have in the Windows client. 
The traditional client _does_ discard a user's tokens when it gets any 
authentication error, including RXKADEXPIRED.

> I'm also suspicious of why the server has no code that specifically
> addresses RXKADEXPIRED errors if the client is allowed to send them
> to the server.

The client isn't specifically sending RXKADEXPIRED.  It is sending an abort 
because it received a packet on a connection that is in error.  Such 
aborts, whether sent by the client or server, _always_ contain the error 
code corresponding to the current error on the call.

The server doesn't need to _do_ anything special in response to this 
particular error.  It just needs to propagate the error back up the call 
chain, which it does, so that whatever procedure is handling this call gets 
an error on its next rx_Write (or similar I/O call) and aborts.  This is all 
perfectly normal.
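
To make that propagation concrete, here is a rough sketch in C.  It is a 
toy model, not the real Rx code: the srv_* names and the struct are made up 
for illustration, and the only thing borrowed from Rx is the convention 
that a write returns the number of bytes accepted, so a short count means 
the call is in error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a server-side call.
 * This is NOT the real struct rx_call. */
struct srv_call {
    int error;          /* 0 while healthy, else the code from the abort */
    size_t bytes_sent;  /* bytes accepted by srv_write so far */
};

/* Model of receiving an abort: record its error code on the call.
 * The first error wins; later aborts do not overwrite it. */
static void srv_receive_abort(struct srv_call *call, int code)
{
    assert(call != NULL);
    if (call->error == 0)
        call->error = code;
}

/* Model of a write: returns the number of bytes accepted.
 * A call that is in error accepts nothing. */
static size_t srv_write(struct srv_call *call, const char *buf, size_t len)
{
    assert(call != NULL);
    (void)buf;          /* the payload itself is irrelevant to the sketch */
    if (call->error)
        return 0;
    call->bytes_sent += len;
    return len;
}

/* Model of a procedure handling the call.  It has no special case for
 * any particular error; it simply gives up when a write comes up short
 * and hands the call's error back to its caller. */
static int srv_handler(struct srv_call *call)
{
    static const char reply[] = "data";

    if (srv_write(call, reply, sizeof(reply)) != sizeof(reply))
        return call->error;
    return 0;
}
```

The handler never looks for RXKADEXPIRED specifically.  Whatever code 
arrived in the abort is what it returns once the next write fails.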


Now, as Derrick noted, the RXKADEXPIRED is in fact not originating in the 
client, but in the _server_; the connection is in error because an abort on 
that connection was received two or three minutes earlier with an error 
code of RXKADEXPIRED.

The confusing thing is, once the connection is in error, why is the client 
ever sending a new request to the server?  The answer appears to be that 
rx_NewCall on a connection in error does not fail (not surprising; IIRC the 
assumption is that rx_NewCall always succeeds), but also does not propagate 
the connection's error state down to the call.  IMHO this is a bug.
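
As a rough illustration of the difference, here is a toy model in C.  The 
cli_* names, the structs and the error value are all hypothetical; this is 
not the real rx_NewCall, only a sketch of the behavior I believe we are 
seeing versus the behavior I would expect.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder error value for the sketch; NOT the real RXKADEXPIRED. */
#define CLI_ERR_EXPIRED 1

enum cli_mode { CLI_MODE_SENDING, CLI_MODE_ERROR };

/* Hypothetical, simplified stand-ins for a connection and a call. */
struct cli_conn {
    int error;              /* nonzero once an abort has been received */
};

struct cli_call {
    struct cli_conn *conn;
    enum cli_mode mode;
    int error;
};

/* Behavior as described above: the new call always starts out healthy,
 * even if the connection it belongs to is already in error. */
static void cli_new_call_ignoring_conn_error(struct cli_conn *conn,
                                             struct cli_call *call)
{
    assert(conn != NULL && call != NULL);
    call->conn = conn;
    call->error = 0;
    call->mode = CLI_MODE_SENDING;
}

/* Expected behavior: creating the call still "succeeds", but the call
 * inherits the connection's error, so the caller sees it on first use. */
static void cli_new_call_inheriting_conn_error(struct cli_conn *conn,
                                               struct cli_call *call)
{
    assert(conn != NULL && call != NULL);
    call->conn = conn;
    if (conn->error) {
        call->mode = CLI_MODE_ERROR;
        call->error = conn->error;
    } else {
        call->mode = CLI_MODE_SENDING;
        call->error = 0;
    }
}
```

With the first variant, a caller has no way to learn that the connection is 
dead until the server aborts the new call as well.  With the second, the 
error is visible on the call before anything is sent.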


If this is in fact the problem, I believe the patch below will make the 
client notice the error condition on the newly-created call.  There is 
still some question as to why the client did not react to the RXKADEXPIRED 
received in response to its _previous_ call.  Of course, there's _also_ the 
question as to why there was such a huge latency between the data packet on 
that call and the resulting abort.

-- Jeff


Index: rx.c
===================================================================
RCS file: /cvs/openafs/src/rx/rx.c,v
retrieving revision 1.83
diff -u -r1.83 rx.c
--- rx.c	19 Aug 2005 19:20:44 -0000	1.83
+++ rx.c	26 Aug 2005 20:31:19 -0000
@@ -1146,7 +1146,12 @@

     /* Client is initially in send mode */
     call->state = RX_STATE_ACTIVE;
-    call->mode = RX_MODE_SENDING;
+    if (conn->error) {
+        call->mode = RX_MODE_ERROR;
+        call->error = conn->error;
+    } else {
+        call->mode = RX_MODE_SENDING;
+    }

     /* remember start time for call in case we have hard dead time limit */
     call->queueTime = queueTime;