[OpenAFS-devel] "Lost contact with file server" problems

Derrick J Brashear shadow@dementia.org
Sun, 28 Aug 2005 16:30:41 -0400 (EDT)


From jhutz, try this:
--- rx.c        30 May 2005 04:55:26 -0000      1.82
+++ rx.c        28 Aug 2005 20:30:00 -0000
@@ -1146,7 +1146,11 @@

      /* Client is initially in send mode */
      call->state = RX_STATE_ACTIVE;
-    call->mode = RX_MODE_SENDING;
+    call->error = conn->error;
+    if (call->error)
+       call->mode = RX_MODE_ERROR;
+    else
+       call->mode = RX_MODE_SENDING;

      /* remember start time for call in case we have hard dead time limit */
      call->queueTime = queueTime;
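
To make the intent of the hunk above easier to see outside of rx.c, here is a minimal sketch of the same logic. The struct and constant names (demo_conn, demo_call, DEMO_MODE_*) are made up for illustration and are not the real rx definitions; only the two fields the patch touches are modelled.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the rx mode constants; the values are
 * arbitrary and not taken from rx.h. */
enum { DEMO_MODE_SENDING = 1, DEMO_MODE_ERROR = 2 };

/* Hypothetical, stripped-down stand-ins for struct rx_connection and
 * struct rx_call. */
struct demo_conn {
    int32_t error;   /* non-zero once the connection has seen an abort */
};

struct demo_call {
    int32_t error;
    int mode;
};

/* Mirrors the patched hunk: a new call inherits the connection's error,
 * and starts in error mode instead of send mode if that error is set. */
static void demo_new_call(struct demo_call *call, const struct demo_conn *conn)
{
    call->error = conn->error;
    if (call->error)
        call->mode = DEMO_MODE_ERROR;
    else
        call->mode = DEMO_MODE_SENDING;
}
```

With this, a call created on a connection that already carries an error never enters send mode, which is the behaviour the patch is meant to give.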


On Sat, 27 Aug 2005, Harald Barth wrote:

>>> Except you missed the abort from the server to the client 2 minutes earlier
>>>
>>> 05:43:40.773551 IP (tos 0x0, ttl  64, id 6836, offset 0, flags [none],
>>> length: 60) 192.168.18.2.7000 > 192.168.18.39.7001: [udp sum ok]  rx abort
>>> cid 1dd424ec call# 0 seq 0 ser 13 (32)
>
>
> I had a look at this with ethereal (with a recent ethereal you have to
> disable rudp in Analyze->Enabled Protocols). So far I found:
>
> Abort code is 151725569 = ntohl(0x01260b09) which should be
> htonl(call->error) from the server. Unfortunately that is
> only a "big number" to me :-( and does not ring a bell.
>
> I have been reading rx.c. In the client after receiving this
> abort packet we should end up in
>
>    /* Check for connection-only requests (i.e. not call specific). */
>    if (np->header.callNumber == 0) {
>        switch (np->header.type) {
>        case RX_PACKET_TYPE_ABORT:
>            /* What if the supplied error is zero? */
>            rxi_ConnectionError(conn, ntohl(rx_GetInt32(np, 0)));
>
> Then ...
>
> rxi_ConnectionError(register struct rx_connection *conn,
>                    register afs_int32 error)
> {
>    if (error) {
>        register int i;
>        MUTEX_ENTER(&conn->conn_data_lock);
>        if (conn->challengeEvent)
>            rxevent_Cancel(conn->challengeEvent, (struct rx_call *)0, 0);
>        if (conn->checkReachEvent) {
>            rxevent_Cancel(conn->checkReachEvent, (struct rx_call *)0, 0);
>            conn->checkReachEvent = 0;
>            conn->flags &= ~RX_CONN_ATTACHWAIT;
>            conn->refCount--;
>        }
>        MUTEX_EXIT(&conn->conn_data_lock);
>        for (i = 0; i < RX_MAXCALLS; i++) {
>            struct rx_call *call = conn->call[i];
>            if (call) {
>                MUTEX_ENTER(&call->lock);
>                rxi_CallError(call, error);
>                MUTEX_EXIT(&call->lock);
>            }
>        }
>        conn->error = error;
>        MUTEX_ENTER(&rx_stats_mutex);
>        rx_stats.fatalErrors++;
>        MUTEX_EXIT(&rx_stats_mutex);
>    }
> }
>
> I think in this case error != 0 but I think we should take care of
> error == 0 somehow (if it does happen at all).
>
> Then we call rxi_CallError(call, error) which sets the call's error
> status if it was not already set. Then the call is reset if there
> is not some kind of BUSY status.
>
> void
> rxi_CallError(register struct rx_call *call, afs_int32 error)
> {
>    if (call->error)
>        error = call->error;
> #ifdef RX_GLOBAL_RXLOCK_KERNEL
>    if (!(call->flags & RX_CALL_TQ_BUSY)) {
>        rxi_ResetCall(call, 0);
>    }
> #else
>    rxi_ResetCall(call, 0);
> #endif
>    call->error = error;
>    call->mode = RX_MODE_ERROR;
> }
>
> It does not seem that the connection is taken down, however. The client seems to try
> to use the connection again later, but is then kinda out of sync and sends aborts itself.
> If you filter in ethereal with "rx.cid == 500442348" you'll see.
>
>> Well, what does this mean? I'm no RX expert...
>
> I don't know either ;-) but I think the connections should be reaped a
> bit more aggressively after an abort, the rx-code should reestablish new
> ones in that case, shouldn't it?
>
> Or could there be some confusion if this packet is encrypted or not?
> security index == 2?
>
> I think I had a similar condition this afternoon on my laptop, but I
> hadn't any tcpdump running at the time, so I can't tell for sure.
>
> Harald.
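
The numbers quoted above can be checked with a few lines of C. This is only an arithmetic sketch: demo_swap32 is a made-up helper that reverses byte order by hand (which is what ntohl() does on a little-endian host), so the result does not depend on the machine it runs on.

```c
#include <stdint.h>

/* Hypothetical helper: reverse the byte order of a 32-bit value.
 * Equivalent to ntohl()/htonl() on a little-endian host. */
static uint32_t demo_swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) << 8)  |
           ((v & 0x00ff0000u) >> 8)  |
           ((v & 0xff000000u) >> 24);
}
```

demo_swap32(0x01260b09) gives 151725569, matching the abort code quoted above; read without swapping, 0x01260b09 is 19270409 in decimal. Likewise the cid 1dd424ec from the tcpdump line is 500442348 in decimal, which is the value used in the "rx.cid ==" filter.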