[OpenAFS] Heavy performance loss on gigabit ethernet

Systems Administration sysadmin@contrailservices.com
Tue, 17 Aug 2004 10:46:55 -0600


> The problem is in the calculation of the rtt (round trip time) and
> retransmit timeout. If the rtt is 0, then it is considered not
> initialized, and the timeout is set to 2 or 3 seconds (depending on
> whether the server is considered 'local' to the client), whereas if
> the rtt is low but non-zero, the timeout can drop as low as 0.35
> seconds. You can examine the rtt and timeout values of an rx server
> or client using the -peers switch of the rxdebug command.
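If I read that description right, the selection rule is something like the following sketch (illustrative C only, not the actual Rx source; the 2/3-second uninitialized fallback and the 0.35 s floor come from the description above, and the rtt + 4*rtt_dev smoothing term is my guess at the form):

```c
/* Sketch of the retransmit-timeout rule as described above.
 * Illustrative, NOT the real OpenAFS Rx code.  Times in milliseconds.
 * The rtt + 4*rtt_dev term is an assumption; the 0.35 s floor and the
 * 2/3 s fallback are taken from the quoted description. */
static int rx_retransmit_timeout_ms(int rtt_ms, int rtt_dev_ms,
                                    int server_is_local)
{
    if (rtt_ms == 0)                        /* rtt "not initialized" */
        return server_is_local ? 2000 : 3000;
    return 350 + rtt_ms + 4 * rtt_dev_ms;   /* floor of 0.35 s */
}
```

Plugging in the rtt 2 msec / rtt_dev 4 msec pair from the rxdebug output below gives 350 + 2 + 16 = 368 ms, which matches the 0.368 sec timeout it reports, so the guessed form at least fits the data; the rtt-0 peers showing timeout 2.000 sec fit the fallback branch.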

Hmmm, so unless there is enough latency in the loop, the client 
effectively goes deaf when a frame gets dropped by the ethernet 
media-to-media conversion (fiber to copper) - because the rtt was 
indistinguishable from 0, the timeout is set so high that the server 
and client get out of sync?

So if I understand this right, I can either induce some latency in the 
path to give the AFS protocol something to base the timeout settings 
on, or I have to change the algorithm that selects the timeout 
interval?  How can I test whether some kind of change in the RTT would 
fix this - is there any way to force a change in the timeout interval?

I'm not sure that this is the cause, however - rxdebug on a hung-up 
client shows a very short timeout of 0.368 sec, but the tcpdumps do 
not show that fast a requery from the client or server - more like 2 
seconds, which corresponds to the timeout settings reported for the 
server IP addresses in the output of rxdebug:

Trying 127.0.0.1 (port 7000):
Free packets: 579, packet reclaims: 0, calls: 934, used FDs: 44
not waiting for packets.
0 calls waiting for a thread
11 threads are idle
Connection from host 192.168.1.75, port 7001, Cuid b6e06fad/47ac3c4
   serial 10,  natMTU 1444, security index 0, client conn
     call 0: # 5, state dally, mode: receiving, flags: receive_done
     call 1: # 0, state not initialized
     call 2: # 0, state not initialized
     call 3: # 0, state not initialized
Done.
Peer at host 192.168.1.75, port 7001
         ifMTU 1444      natMTU 1444     maxMTU 5692
         packets sent 606        packet resends 22
         bytes sent high 0 low 121207
         bytes received high 0 low 21558
         rtt 2 msec, rtt_dev 4 msec
         timeout 0.368 sec
Peer at host 192.168.1.3, port 7002
         ifMTU 1444      natMTU 1444     maxMTU 1444
         packets sent 24 packet resends 0
         bytes sent high 0 low 2532
         bytes received high 0 low 384
         rtt 0 msec, rtt_dev 0 msec
         timeout 2.000 sec
Peer at host 192.168.1.3, port 7003
         ifMTU 1444      natMTU 1444     maxMTU 1444
         packets sent 2  packet resends 0
         bytes sent high 0 low 185
         bytes received high 0 low 16
         rtt 0 msec, rtt_dev 0 msec
         timeout 2.000 sec


On the client tcpdump shows:

10:37:01.221028 IP (tos 0x0, ttl  64, id 6264, offset 0, flags [none], 
length: 80) lightning.xxx.afs3-callback > turbine.xxx.afs3-fileserver: 
[udp sum ok]  rx data cid f224e074 call# 241 seq 1 ser 516 
<client-init>,<last-pckt> fs call fetch-data fid 536871033/575/1830 
offset 0 length 999999999 (52)
10:37:01.819413 IP (tos 0x0, ttl  64, id 6265, offset 0, flags [none], 
length: 80) lightning.xxx.afs3-callback > turbine.xxx.afs3-fileserver: 
[udp sum ok]  rx data cid f224e074 call# 241 seq 1 ser 517 
<client-init>,<req-ack>,<last-pckt> fs call fetch-data fid 
536871033/575/1830 offset 0 length 999999999 (52)
10:37:01.819721 IP (tos 0x0, ttl  64, id 18826, offset 0, flags [DF], 
length: 93) turbine.xxx.afs3-fileserver > lightning.xxx.afs3-callback: 
[udp sum ok]  rx ack cid f224e074 call# 241 seq 1 ser 269 <slow-start> 
first 2 serial 517 reason duplicate packet ifmtu 5692 maxmtu 1444 rwind 
32 maxpackets 4 (65)

The last entry, captured just as the client goes into zombie mode, is 
the one marked "reason duplicate packet".

Another interesting point - can I assume that the MTU values rxdebug 
reports are IP-layer MTUs (this is all UDP, after all)?  If so, how 
does the AFS binary think it can send 5692-octet packets when the 
interface is declared to have an MTU of 1500?  I know the client 
cannot handle that large an MTU because the intermediate hops on the 
ethernet connection are not gigabit capable.  Could the server be 
trying to send a bogus packet with 5692 data octets that is getting 
dropped on the floor due to an invalid MTU size?
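One thing worth noting on the MTU question: a UDP datagram larger than 
the link MTU is not automatically dropped - the IP layer normally 
fragments it, unless the DF bit is set (and the server's ack in the 
dump above does show flags [DF]).  As a sanity check of the sizes 
involved, here is a sketch of the standard IPv4 fragmentation 
arithmetic; the constants are ordinary IPv4/UDP header sizes, not 
anything read out of the OpenAFS source:

```c
/* IPv4 fragmentation arithmetic for a UDP datagram bigger than the
 * link MTU (DF bit clear).  Illustrative only; constants are the
 * standard IPv4 and UDP header sizes. */
enum { IP_HDR = 20, UDP_HDR = 8 };

/* Number of IPv4 fragments needed to carry udp_payload bytes of UDP
 * payload over a link with the given MTU. */
static int ipv4_fragments(int udp_payload, int mtu)
{
    int l3_payload = udp_payload + UDP_HDR; /* UDP header rides in frag 1 */
    int per_frag = (mtu - IP_HDR) & ~7;     /* offsets count 8-octet units */
    return (l3_payload + per_frag - 1) / per_frag;
}
```

A 5692-octet datagram would travel as 4 fragments on a 1500-MTU path, 
so it could in principle get through - but if any hop drops fragments, 
or the big packets go out with DF set like that ack did, the whole 
datagram is lost, which would look exactly like packets dropped on the 
floor.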