[OpenAFS] Heavy performance loss on gigabit ethernet
Systems Administration
sysadmin@contrailservices.com
Tue, 17 Aug 2004 10:46:55 -0600
> The problem is in the calculation of the rtt (round trip time) and
> retransmit timeout. If the rtt is 0, then it is considered not
> initialized,
> and the timeout is set to 2 or 3 seconds (depending on whether the
> server
> is considered 'local' to the client), whereas if the rtt is low, but
> non-zero, the timeout can drop as low as 0.35 seconds. You can examine
> the
> rtt and timeout values of an rx server or client using the -peers
> switch of
> the rxdebug command.
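If I read the quoted description right, the selection rule would look roughly like the sketch below. To be clear about what is guessed: the 0.35 s floor and the 2-3 s defaults come from the quoted text, but the `rtt + 4 * rtt_dev` term is my own inference (it happens to reproduce the 0.368 s I see further down for rtt 2 ms / rtt_dev 4 ms), and which default goes with 'local' is a guess. This is not the OpenAFS source:

```python
def rx_timeout(rtt_ms, rtt_dev_ms, local=True):
    """Sketch of the retransmit-timeout rule as I understand it.

    rtt_ms / rtt_dev_ms: smoothed round-trip time and deviation in
    milliseconds, as reported by rxdebug -peers; rtt 0 means "not
    yet initialized".  local: whether the server counts as local.
    """
    if rtt_ms == 0:
        # Uninitialized RTT: coarse 2 or 3 second default.
        # (Which value maps to 'local' is my assumption.)
        return 2.0 if local else 3.0
    # Inferred, not confirmed: 0.35 s floor plus rtt + 4 * rtt_dev.
    # For rtt=2 ms, rtt_dev=4 ms this gives 0.368 s, matching the
    # rxdebug output below.
    return 0.35 + (rtt_ms + 4 * rtt_dev_ms) / 1000.0
```

If that inference is right, even a millisecond or two of measured RTT is enough to pull the timeout from 2 s down to the 0.35 s neighborhood.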
Hmmm, so unless there is enough latency in the loop, the client
effectively goes deaf when a frame gets dropped by the ethernet
media-to-media conversion (fiber to copper)? Because the rtt was
indistinguishable from 0, the timeout is set so high that the server
and client get out of sync?
So if I understand this right, I can either induce some latency in the
path to give the AFS protocol something to base the timeout settings
on, or I have to change the algorithm that selects the timeout
interval? How can I test whether a change in the RTT would fix this -
is there any way to force a change in the timeout interval?
I'm not sure that this is the cause, however - rxdebug on a hung-up
client shows a very short timeout of 0.368 seconds, but the tcpdumps
do not show that fast a requery from the client or server - more like
2 seconds, which corresponds to the timeout settings reported for the
server IP addresses listed in the output of rxdebug:
Trying 127.0.0.1 (port 7000):
Free packets: 579, packet reclaims: 0, calls: 934, used FDs: 44
not waiting for packets.
0 calls waiting for a thread
11 threads are idle
Connection from host 192.168.1.75, port 7001, Cuid b6e06fad/47ac3c4
serial 10, natMTU 1444, security index 0, client conn
call 0: # 5, state dally, mode: receiving, flags: receive_done
call 1: # 0, state not initialized
call 2: # 0, state not initialized
call 3: # 0, state not initialized
Done.
Peer at host 192.168.1.75, port 7001
ifMTU 1444 natMTU 1444 maxMTU 5692
packets sent 606 packet resends 22
bytes sent high 0 low 121207
bytes received high 0 low 21558
rtt 2 msec, rtt_dev 4 msec
timeout 0.368 sec
Peer at host 192.168.1.3, port 7002
ifMTU 1444 natMTU 1444 maxMTU 1444
packets sent 24 packet resends 0
bytes sent high 0 low 2532
bytes received high 0 low 384
rtt 0 msec, rtt_dev 0 msec
timeout 2.000 sec
Peer at host 192.168.1.3, port 7003
ifMTU 1444 natMTU 1444 maxMTU 1444
packets sent 2 packet resends 0
bytes sent high 0 low 185
bytes received high 0 low 16
rtt 0 msec, rtt_dev 0 msec
timeout 2.000 sec
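For keeping an eye on whether a peer's rtt ever leaves 0, the rtt and timeout lines are easy to scrape out of that output. A quick sketch against the exact format shown above (the function and field names are my own):

```python
import re

# Pull (host, port, rtt_ms, timeout_s) tuples out of the
# "rxdebug <server> 7000 -peers" output format shown above.
PEER_RE = re.compile(r'Peer at host (\S+), port (\d+)')
RTT_RE = re.compile(r'rtt (\d+) msec')
TIMEOUT_RE = re.compile(r'timeout ([\d.]+) sec')

def parse_peers(text):
    peers, host = [], None
    for line in text.splitlines():
        m = PEER_RE.search(line)
        if m:
            # Start of a new peer block.
            host = (m.group(1), int(m.group(2)))
            rtt = None
        elif host:
            m = RTT_RE.search(line)
            if m:
                rtt = int(m.group(1))
            m = TIMEOUT_RE.search(line)
            if m:
                # timeout is the last field of a peer block.
                peers.append((host[0], host[1], rtt, float(m.group(1))))
                host = None
    return peers
```

Run periodically, that would show whether the 192.168.1.3 peers ever get a non-zero rtt measurement and a correspondingly shorter timeout.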
On the client, tcpdump shows:
10:37:01.221028 IP (tos 0x0, ttl 64, id 6264, offset 0, flags [none],
length: 80) lightning.xxx.afs3-callback > turbine.xxx.afs3-fileserver:
[udp sum ok] rx data cid f224e074 call# 241 seq 1 ser 516
<client-init>,<last-pckt> fs call fetch-data fid 536871033/575/1830
offset 0 length 999999999 (52)
10:37:01.819413 IP (tos 0x0, ttl 64, id 6265, offset 0, flags [none],
length: 80) lightning.xxx.afs3-callback > turbine.xxx.afs3-fileserver:
[udp sum ok] rx data cid f224e074 call# 241 seq 1 ser 517
<client-init>,<req-ack>,<last-pckt> fs call fetch-data fid
536871033/575/1830 offset 0 length 999999999 (52)
10:37:01.819721 IP (tos 0x0, ttl 64, id 18826, offset 0, flags [DF],
length: 93) turbine.xxx.afs3-fileserver > lightning.xxx.afs3-callback:
[udp sum ok] rx ack cid f224e074 call# 241 seq 1 ser 269 <slow-start>
first 2 serial 517 reason duplicate packet ifmtu 5692 maxmtu 1444 rwind
32 maxpackets 4 (65)
The last entry just as the client goes into zombie mode is the one
marked with "reason duplicate packet".
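Rather than eyeballing the requery interval, it can be computed directly from the capture timestamps. A small sketch, using the timestamps of the initial fetch-data (ser 516) and its retransmit (ser 517) from the trace above:

```python
from datetime import datetime

def gap_seconds(t1, t2):
    """Seconds between two tcpdump HH:MM:SS.ffffff timestamps
    (assumes both fall on the same day)."""
    fmt = '%H:%M:%S.%f'
    return (datetime.strptime(t2, fmt) - datetime.strptime(t1, fmt)).total_seconds()

# Gap between the first fetch-data packet and its retransmission:
retransmit_gap = gap_seconds('10:37:01.221028', '10:37:01.819413')  # 0.598385 s
```

Diffing every client retransmission this way over a longer capture would show which of the two timeout regimes (0.368 s vs 2 s) the client is actually operating under when it hangs.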
Another interesting point - can I assume that the MTU values rxdebug
reports are IP-level MTUs? If so, how does the AFS binary think it can
send packets of 5692 bytes when the interface is declared to have an
MTU of 1500? I know the client cannot handle that large an MTU because
the intermediate hops on the ethernet connection are not gigabit
capable. Could the server be trying to send a bogus packet with 5692
data octets which is getting dropped on the floor due to an invalid
MTU size?
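Since Rx runs over UDP, a 5692-byte datagram would not go out as one oversized ethernet frame anyway - it would be IP-fragmented at the 1500-byte interface MTU, and losing any single fragment silently discards the entire datagram. Back-of-the-envelope arithmetic for that, assuming IPv4 with a 20-byte header (no options) and an 8-byte UDP header:

```python
import math

def ipv4_fragments(udp_payload, mtu=1500, ip_hdr=20, udp_hdr=8):
    """Number of IPv4 fragments for a UDP datagram of `udp_payload`
    bytes over a link with the given MTU (no IP options assumed)."""
    datagram = udp_hdr + udp_payload      # bytes carried after the IP header
    per_frag = (mtu - ip_hdr) // 8 * 8    # non-final fragment data is 8-byte aligned
    return math.ceil(datagram / per_frag)

# A 5692-byte Rx jumbogram over a 1500-byte MTU link:
frags = ipv4_fragments(5692)  # 4 fragments; dropping any one loses all 5692 bytes
```

So if the path is dropping frames at the fiber-to-copper conversion, a large datagram split four ways is four times as exposed as a single-fragment one, which might explain why the problem shows up on the gigabit path in particular.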