[OpenAFS-devel] "Lost contact with file server" problems
Derrick J Brashear
shadow@dementia.org
Sun, 28 Aug 2005 16:30:41 -0400 (EDT)
From jhutz, try this:
--- rx.c 30 May 2005 04:55:26 -0000 1.82
+++ rx.c 28 Aug 2005 20:30:00 -0000
@@ -1146,7 +1146,11 @@
     /* Client is initially in send mode */
     call->state = RX_STATE_ACTIVE;
-    call->mode = RX_MODE_SENDING;
+    call->error = conn->error;
+    if (call->error)
+        call->mode = RX_MODE_ERROR;
+    else
+        call->mode = RX_MODE_SENDING;
     /* remember start time for call in case we have hard dead time limit */
     call->queueTime = queueTime;
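
To make the intent of the patch concrete, here is a minimal stand-alone sketch of the same logic. It is not rx code: the types, the function name model_new_call() and the MODEL_MODE_* constants are hypothetical stand-ins I made up for illustration. The only thing taken from the patch is the idea that a new call copies the connection's error and starts in error mode when that error is non-zero.

```c
/* Hypothetical stand-ins for the rx structures; not the real definitions. */
enum model_mode {
    MODEL_MODE_SENDING = 1,   /* assumed value, for illustration only */
    MODEL_MODE_ERROR = 2      /* assumed value, for illustration only */
};

struct model_conn {
    int error;                /* non-zero once the connection has been aborted */
};

struct model_call {
    int error;
    enum model_mode mode;
};

/*
 * Mirrors the patched logic: a new call inherits the connection's error,
 * and only enters send mode if the connection is still healthy.
 */
static void model_new_call(struct model_call *call, const struct model_conn *conn)
{
    call->error = conn->error;
    if (call->error)
        call->mode = MODEL_MODE_ERROR;
    else
        call->mode = MODEL_MODE_SENDING;
}
```

With conn.error == 0 the call comes up in send mode as before. Once the connection has recorded an abort, any call created on it afterwards starts out in error mode instead of trying to send on a dead connection.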
On Sat, 27 Aug 2005, Harald Barth wrote:
>>> Except you missed the abort from the server to the client 2 minutes earlier
>>>
>>> 05:43:40.773551 IP (tos 0x0, ttl 64, id 6836, offset 0, flags [none],
>>> length: 60) 192.168.18.2.7000 > 192.168.18.39.7001: [udp sum ok] rx abort
>>> cid 1dd424ec call# 0 seq 0 ser 13 (32)
>
>
> I had a look at this with ethereal (with a recent ethereal you have to
> disable rudp in Analyze->Enabled Protocols). So far I found:
>
> Abort code is 151725569 = ntohl(0x01260b09) which should be
> htonl(call->error) from the server. Unfortunately that is
> only a "big number" to me :-( and does not ring a bell.
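
The relation between the two numbers quoted above can be checked without depending on the byte order of the machine you run it on. The helper below is a sketch; swap32() is a name made up here, not an rx or libc function. It reverses the four bytes of a 32-bit value, which is what ntohl() does on a little-endian host.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value, independent of host endianness. */
static uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) << 8)  |
           ((v & 0x00ff0000u) >> 8)  |
           ((v & 0xff000000u) >> 24);
}
```

swap32(151725569) gives 0x01260b09, which is 19270409 in decimal, so the two values are byte-reversed forms of the same 32-bit word.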
>
> I have been reading rx.c. In the client after receiving this
> abort packet we should end up in
>
>     /* Check for connection-only requests (i.e. not call specific). */
>     if (np->header.callNumber == 0) {
>         switch (np->header.type) {
>         case RX_PACKET_TYPE_ABORT:
>             /* What if the supplied error is zero? */
>             rxi_ConnectionError(conn, ntohl(rx_GetInt32(np, 0)));
>
> Then ...
>
> rxi_ConnectionError(register struct rx_connection *conn,
>                     register afs_int32 error)
> {
>     if (error) {
>         register int i;
>         MUTEX_ENTER(&conn->conn_data_lock);
>         if (conn->challengeEvent)
>             rxevent_Cancel(conn->challengeEvent, (struct rx_call *)0, 0);
>         if (conn->checkReachEvent) {
>             rxevent_Cancel(conn->checkReachEvent, (struct rx_call *)0, 0);
>             conn->checkReachEvent = 0;
>             conn->flags &= ~RX_CONN_ATTACHWAIT;
>             conn->refCount--;
>         }
>         MUTEX_EXIT(&conn->conn_data_lock);
>         for (i = 0; i < RX_MAXCALLS; i++) {
>             struct rx_call *call = conn->call[i];
>             if (call) {
>                 MUTEX_ENTER(&call->lock);
>                 rxi_CallError(call, error);
>                 MUTEX_EXIT(&call->lock);
>             }
>         }
>         conn->error = error;
>         MUTEX_ENTER(&rx_stats_mutex);
>         rx_stats.fatalErrors++;
>         MUTEX_EXIT(&rx_stats_mutex);
>     }
> }
>
> I think in this case error != 0 but I think we should take care of
> error == 0 somehow (if it does happen at all).
>
> Then we call rxi_CallError(call, error), which sets the call's error
> status if it was not already set. Then the call is reset unless
> there is some kind of BUSY status.
>
> void
> rxi_CallError(register struct rx_call *call, afs_int32 error)
> {
>     if (call->error)
>         error = call->error;
> #ifdef RX_GLOBAL_RXLOCK_KERNEL
>     if (!(call->flags & RX_CALL_TQ_BUSY)) {
>         rxi_ResetCall(call, 0);
>     }
> #else
>     rxi_ResetCall(call, 0);
> #endif
>     call->error = error;
>     call->mode = RX_MODE_ERROR;
> }
>
> It does not seem that the connection is taken down, however. The client
> later seems to try to use the connection again, but is then kind of out
> of sync and sends aborts itself. If you filter in ethereal with
> "rx.cid == 500442348" you'll see.
>
>> Well, what does this mean? I'm no RX expert...
>
> I don't know either ;-) but I think the connections should be reaped a
> bit more aggressively after an abort, and the rx code should establish
> new ones in that case, shouldn't it?
>
> Or could there be some confusion if this packet is encrypted or not?
> security index == 2?
>
> I think I had a similar condition this afternoon on my laptop, but I
> didn't have tcpdump running at the time, so I can't tell for sure.
>
> Harald.