[OpenAFS-devel] performance study
Rainer Toebbicke
rtb@pclella.cern.ch
Tue, 15 Feb 2005 12:04:11 +0100
Ken Hornstein wrote:
>
> I've not compared RX against sunrpc+gssapi, but I've compared it
> against TCP on higher speed networks. Between two OC-12 connected
> machines, I can get around 510-520 Mbits. The best I can do with RX
> between the same machines is around 140 Mbits (I know Hartmut has
> claimed that he sees full bandwidth at Gig-E speeds, but I've never
> seen that here).
>
I recently did some debugging in the RX protocol layer, trying to solve
performance problems, and have already fixed a few things:
1. Despite the stated intention of requesting an ACK on the last packet
sent in a chain, this does not always happen - in Murphy's terms, it
never happens when it would do any good. In my tests, once the windows
had opened reasonably, the last packet in a chain had its flags field
set to 0. The ACK required to release the next batch was therefore only
sent after a timeout (of about 0.3 seconds). The protocol continues to
work, but slowly.
Fixing this (a sketch of the idea follows below) brought the speed from
20-30 MB/s to >110 MB/s (memory to memory, on the LAN). The funny thing
is that Hartmut ran my program without any tweaks and got 114 MB/s
immediately. The only explanation I have is that he hit some sort of
"sweet spot" or "standing wave" on the two identical uni-processor
machines he tried it on.
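To make this concrete, the idea is roughly the following (a sketch only,
not the actual patch; the helper and variable names are mine, and the
transmit routine I have in mind is rxi_SendList() and friends):

    #include <rx/rx.h>

    /* Sketch: mark the final packet of a chain so the receiver
     * acknowledges it immediately instead of waiting for its
     * delayed-ACK timer (~0.3 s) to release the next window. */
    static void
    request_ack_on_last(struct rx_packet **list, int nPackets)
    {
        if (nPackets > 0)
            list[nPackets - 1]->header.flags |= RX_REQUEST_ACK;
    }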
2. When debugging with rx_debugFile, it would be nice if all packets
were printed. Alas, when sending a chain of packets, the "dpf" call is
placed outside the loop, so it traces only one of them - by Murphy's
law, the least interesting one.
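The fix is trivial; roughly the following (a sketch only, the list and
count variable names are assumptions):

    /* Move the trace into the loop so every packet of the chain shows
     * up in rx_debugFile, not just one of them: */
    for (i = 0; i < nPackets; i++) {
        struct rx_packet *p = list[i];
        dpf(("sending packet seq %d flags %d\n",
             p->header.seq, p->header.flags));
        /* ... hand p to the transmit path as before ... */
    }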
I'll submit the patches for those two issues as soon as I've dealt with
the last one:
once the first few ACKs arrive, RX doubles the packet size (the MTU).
Now, certain (wide-area) networks appear to support this only "sort of".
I have been given hints that Juniper routers start to drop packets
heavily in that case, ruining packet re-assembly. I wrote a small test
program that runs at 20 kB/s without any special parameters, but at a
satisfying 2.4 MB/s simply by calling rx_SetNoJumbo() (over the WAN, of
course; that is also the speed you get via TCP). What would be needed is
a mechanism to discover that the MTU increase led to a slow-down and to
recover from it - but I admit I don't quite understand yet how the
already existing code is meant to work.
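For reference, the test program does nothing special; the relevant part
looks roughly like this (error handling and the actual bulk-transfer RPC
omitted, and the host, port and service numbers are placeholders):

    #include <netinet/in.h>
    #include <rx/rx.h>
    #include <rx/rx_null.h>

    int
    main(void)
    {
        struct rx_securityClass *sc;
        struct rx_connection *conn;

        if (rx_Init(0) != 0)        /* 0: let RX pick a UDP port */
            return 1;
        rx_SetNoJumbo();            /* the one line that made the difference */

        sc = rxnull_NewClientSecurityObject();
        conn = rx_NewConnection(htonl(0x7f000001 /* placeholder host */),
                                htons(7000 /* placeholder port */),
                                1 /* placeholder service id */,
                                sc, 0);
        /* ... drive the bulk transfer over 'conn' and time it ... */
        rx_DestroyConnection(conn);
        rx_Finalize();
        return 0;
    }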
Another one: RX has an implicit limit on the send/receive window of 255
packets. For trans-atlantic traffic that is not much with a standard
MTU of 1416 (or so) bytes. In 1.3.x the window variables are 32-bit
numbers, but the offset for the ACKs is still a u_char, and that sits in
the packet. This could perhaps be fixed by sending several small ACK
packets.
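To put a number on "not much" (back-of-the-envelope only, assuming a
trans-atlantic round trip of roughly 100 ms):

    #include <stdio.h>

    int
    main(void)
    {
        const double window_pkts = 255.0;    /* u_char limit */
        const double mtu_bytes   = 1416.0;   /* standard MTU, as above */
        const double rtt_sec     = 0.1;      /* assumed ~100 ms RTT */

        double in_flight = window_pkts * mtu_bytes;    /* ~361 kB */
        double ceiling   = in_flight / rtt_sec / 1e6;  /* MB/s */
        printf("window %.0f kB -> ceiling ~%.1f MB/s at %.0f ms RTT\n",
               in_flight / 1e3, ceiling, rtt_sec * 1e3);
        return 0;
    }

So even a perfectly behaved link is capped at roughly 3-4 MB/s per call
at that distance, no matter how fat the pipe is.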
BTW:
RX over TCP might be interesting - if it scales! I'm a bit worried about
file servers doing poll() (forget select()) on 10000+ TCP connections...
Besides the fact that probably much more thought went into TCP than into
RX, the routers are also very likely better tuned for TCP.
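To illustrate the worry (toy code only, the 10000 is of course made up;
today RX multiplexes everything over a single UDP socket):

    #include <poll.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NCONN 10000    /* hypothetical number of client TCP connections */

    int
    main(void)
    {
        struct pollfd *fds = calloc(NCONN, sizeof(*fds));
        int i;

        for (i = 0; i < NCONN; i++) {
            fds[i].fd = -1;          /* stand-ins; a server would put sockets here */
            fds[i].events = POLLIN;
        }
        /* Every pass over the event loop re-submits the whole array and
         * the kernel re-scans all NCONN entries, even if only a handful
         * of connections are active. */
        if (poll(fds, NCONN, 0) < 0)
            perror("poll");
        free(fds);
        return 0;
    }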
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics (CERN) - Geneva, Switzerland
Phone: +41 22 767 8985 Fax: +41 22 767 7155