[OpenAFS-devel] "Lost contact with file server" problems

Harald Barth haba@pdc.kth.se
Sat, 27 Aug 2005 17:23:25 +0200 (MEST)


> > Except you missed the abort from the server to the client 2 minutes earlier
> >
> > 05:43:40.773551 IP (tos 0x0, ttl  64, id 6836, offset 0, flags [none], 
> > length: 60) 192.168.18.2.7000 > 192.168.18.39.7001: [udp sum ok]  rx abort 
> > cid 1dd424ec call# 0 seq 0 ser 13 (32)


I had a look at this with ethereal (with a recent ethereal you have
to disable rudp in Analyze->Enabled Protocols). So far I found:

Abort code is 151725569 = ntohl(0x01260b09) which should be
htonl(call->error) from the server. Unfortunately that is
only a "big number" to me :-( and does not ring a bell.
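
The byte-order arithmetic above can be checked with a small standalone
C sketch. The helper name swap32 is made up for illustration and is not
part of rx; it only shows that the two numbers are byte-swapped views
of the same 32-bit value, which is what ntohl() amounts to on a
little-endian host.

```c
#include <stdint.h>

/* Hypothetical helper (not from rx.c): reverse the byte order of a
 * 32-bit value, which is what ntohl()/htonl() do on a little-endian
 * host. */
static uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) << 8)  |
           ((v & 0x00ff0000u) >> 8)  |
           ((v & 0xff000000u) >> 24);
}
```

swap32(151725569) gives 0x01260b09, which is 19270409 in decimal.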

I have been reading rx.c. In the client, after receiving this
abort packet, we should end up in:

    /* Check for connection-only requests (i.e. not call specific). */
    if (np->header.callNumber == 0) {
        switch (np->header.type) {
        case RX_PACKET_TYPE_ABORT:
            /* What if the supplied error is zero? */
            rxi_ConnectionError(conn, ntohl(rx_GetInt32(np, 0)));

Then ...

rxi_ConnectionError(register struct rx_connection *conn,
                    register afs_int32 error)
{
    if (error) {
        register int i;
        MUTEX_ENTER(&conn->conn_data_lock);
        if (conn->challengeEvent)
            rxevent_Cancel(conn->challengeEvent, (struct rx_call *)0, 0);
        if (conn->checkReachEvent) {
            rxevent_Cancel(conn->checkReachEvent, (struct rx_call *)0, 0);
            conn->checkReachEvent = 0;
            conn->flags &= ~RX_CONN_ATTACHWAIT;
            conn->refCount--;
        }
        MUTEX_EXIT(&conn->conn_data_lock);
        for (i = 0; i < RX_MAXCALLS; i++) {
            struct rx_call *call = conn->call[i];
            if (call) {
                MUTEX_ENTER(&call->lock);
                rxi_CallError(call, error);
                MUTEX_EXIT(&call->lock);
            }
        }
        conn->error = error;
        MUTEX_ENTER(&rx_stats_mutex);
        rx_stats.fatalErrors++;
        MUTEX_EXIT(&rx_stats_mutex);
    }
}

I think in this case error != 0, but we should take care of
error == 0 somehow (if it happens at all).
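
One possible way to handle error == 0, as a minimal sketch only: map a
zero abort code to some nonzero error before it reaches the
"if (error)" test. Both the helper name and the placeholder value are
assumptions made for this example, not anything that exists in rx.c.

```c
#include <stdint.h>

/* Placeholder value chosen for this sketch only; real code would have
 * to pick a suitable rx error code. */
#define SKETCH_DEFAULT_ABORT_ERROR (-1)

/* Hypothetical helper: map an abort code of 0 to a nonzero error so
 * that a caller testing "if (error)" still treats the abort as fatal. */
static int32_t sketch_abort_error(int32_t wire_error)
{
    return wire_error != 0 ? wire_error : SKETCH_DEFAULT_ABORT_ERROR;
}
```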

Then we call rxi_CallError(call, error), which sets the call's error
status if it was not already set. The call is also reset, unless (in
the RX_GLOBAL_RXLOCK_KERNEL case) RX_CALL_TQ_BUSY is set.

void
rxi_CallError(register struct rx_call *call, afs_int32 error)
{
    if (call->error)
        error = call->error;
#ifdef RX_GLOBAL_RXLOCK_KERNEL
    if (!(call->flags & RX_CALL_TQ_BUSY)) {
        rxi_ResetCall(call, 0);
    }
#else
    rxi_ResetCall(call, 0);
#endif
    call->error = error;
    call->mode = RX_MODE_ERROR;
}
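
The "first error wins" part of this can be tried out with a toy model.
The struct and function below are stand-ins invented for this sketch,
and the reset and mode change are left out on purpose.

```c
#include <stdint.h>

/* Toy stand-in for struct rx_call; only the field needed here. */
struct toy_call {
    int32_t error;
};

/* Mirrors only the error bookkeeping of rxi_CallError: an error that
 * is already recorded on the call is kept, a new one is ignored. */
static void toy_call_error(struct toy_call *call, int32_t error)
{
    if (call->error)
        error = call->error;
    call->error = error;
}
```

Calling toy_call_error() twice with different codes leaves the first
code in place.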

It does not seem that the connection is taken down, however. The
client seems to try to use the connection again later, but is then
kinda out of sync and sends aborts itself. If you filter in ethereal
with "rx.cid == 500442348" you'll see.
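
The decimal number in that filter is just the cid from the tcpdump line
(cid 1dd424ec) read as hex. A small sketch of the conversion; the
helper name is made up for this example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse the hex cid as printed by tcpdump
 * ("cid 1dd424ec") into the number used in the ethereal filter. */
static uint32_t cid_from_hex(const char *hex)
{
    return (uint32_t)strtoul(hex, NULL, 16);
}
```

cid_from_hex("1dd424ec") gives 500442348.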

> Well, what does this mean? I'm no RX expert...

I don't know either ;-) but I think the connections should be reaped a
bit more aggressively after an abort; the rx code should establish new
ones in that case, shouldn't it?

Or could there be some confusion about whether this packet is
encrypted or not? security index == 2?

I think I had a similar condition this afternoon on my laptop, but I
didn't have tcpdump running at the time, so I can't tell for sure.

Harald.