[OpenAFS-devel] performance study
Rainer Toebbicke
rtb@pclella.cern.ch
Tue, 15 Feb 2005 12:04:11 +0100
Ken Hornstein wrote:
>
> I've not compared RX against sunrpc+gssapi, but I've compared it
> against TCP on higher speed networks. Between two OC-12 connected
> machines, I can get around 510-520 Mbits. The best I can do with RX
> between the same machines is around 140 Mbits (I know Hartmut has
> claimed that he sees full bandwidth at Gig-E speeds, but I've never
> seen that here).
>
I recently did some debugging in the RX protocol layer, trying to solve
performance problems, and have already fixed a few things:
1. Despite the stated intention of requesting an ACK on the last packet
sent in a chain, this does not always happen - in Murphy's terms, it
never happens when it would do any good. In my tests, once the windows
had opened reasonably, the last packet in a chain had its flags field
set to 0. The ACK required to release the next batch was therefore only
sent after a timeout (of about 0.3 seconds). The protocol continues to
work, but slowly.
Fixing this (a sketch of the idea follows below) brought the speed from
20-30 MB/s to >110 MB/s (memory to memory, on the LAN). The funny thing
is that Hartmut ran my program without any tweaks and got 114 MB/s
immediately. The only explanation I have is that he hit some sort of
"sweet spot" or "standing wave" on the two identical uni-processor
machines he tried it on.
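To make this concrete, the idea is roughly the following (a sketch only,
not the actual patch; the helper and variable names are mine, and the
transmit routine I have in mind is rxi_SendList() and friends):

    #include <rx/rx.h>

    /* Sketch: mark the final packet of a chain so the receiver
     * acknowledges it immediately instead of waiting for its
     * delayed-ACK timer (~0.3 s) to release the next window. */
    static void
    request_ack_on_last(struct rx_packet **list, int nPackets)
    {
        if (nPackets > 0)
            list[nPackets - 1]->header.flags |= RX_REQUEST_ACK;
    }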
2. When debugging with rx_debugFile, it would be nice if all packets
were printed. Alas, when sending a chain of packets, the "dpf" call is
placed outside the loop, so it traces only one of them - by Murphy's
law, the least interesting one.
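The fix is trivial; roughly the following (a sketch only, the list and
count variable names are assumptions):

    /* Move the trace into the loop so every packet of the chain shows
     * up in rx_debugFile, not just one of them: */
    for (i = 0; i < nPackets; i++) {
        struct rx_packet *p = list[i];
        dpf(("sending packet seq %d flags %d\n",
             p->header.seq, p->header.flags));
        /* ... hand p to the transmit path as before ... */
    }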
I'll submit the patches for those two issues as soon as I've dealt with
the last one:
once the first few ACKs arrive, RX doubles the packet size (the MTU).
Now, certain (wide-area) networks appear to support this only "sort of".
I have been given hints that Juniper routers start to drop packets
heavily in that case, ruining packet re-assembly. I wrote a small test
program that runs at 20 kB/s without any special parameters, but at a
satisfying 2.4 MB/s simply by calling rx_SetNoJumbo() (over the WAN, of
course; that is also the speed you get via TCP). What would be needed is
a mechanism to discover that the MTU increase led to a slow-down and to
recover from it - but I admit I don't quite understand yet how the
already existing code is meant to work.
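For reference, the test program does nothing special; the relevant part
looks roughly like this (error handling and the actual bulk-transfer RPC
omitted, and the host, port and service numbers are placeholders):

    #include <netinet/in.h>
    #include <rx/rx.h>
    #include <rx/rx_null.h>

    int
    main(void)
    {
        struct rx_securityClass *sc;
        struct rx_connection *conn;

        if (rx_Init(0) != 0)        /* 0: let RX pick a UDP port */
            return 1;
        rx_SetNoJumbo();            /* the one line that made the difference */

        sc = rxnull_NewClientSecurityObject();
        conn = rx_NewConnection(htonl(0x7f000001 /* placeholder host */),
                                htons(7000 /* placeholder port */),
                                1 /* placeholder service id */,
                                sc, 0);
        /* ... drive the bulk transfer over 'conn' and time it ... */
        rx_DestroyConnection(conn);
        rx_Finalize();
        return 0;
    }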
Another one: RX has an implicit limit on the send/receive window of 255
packets. For trans-atlantic traffic that is not much with a standard
MTU of 1416 (or so) bytes. In 1.3.x the window variables are 32-bit
numbers, but the offset for the ACKs is still a u_char, and that sits in
the packet. This could perhaps be fixed by sending several small ACK
packets.
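To put a number on "not much" (back-of-the-envelope only, assuming a
trans-atlantic round trip of roughly 100 ms):

    #include <stdio.h>

    int
    main(void)
    {
        const double window_pkts = 255.0;    /* u_char limit */
        const double mtu_bytes   = 1416.0;   /* standard MTU, as above */
        const double rtt_sec     = 0.1;      /* assumed ~100 ms RTT */

        double in_flight = window_pkts * mtu_bytes;    /* ~361 kB */
        double ceiling   = in_flight / rtt_sec / 1e6;  /* MB/s */
        printf("window %.0f kB -> ceiling ~%.1f MB/s at %.0f ms RTT\n",
               in_flight / 1e3, ceiling, rtt_sec * 1e3);
        return 0;
    }

So even a perfectly behaved link is capped at roughly 3-4 MB/s per call
at that distance, no matter how fat the pipe is.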
BTW:
RX over TCP might be interesting - if it scales! I'm a bit worried about
file servers doing poll() (forget select()) on 10000+ TCP connections...
Besides the fact that probably much more thought went into TCP than into
RX, the routers are also very likely better tuned for TCP.
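To illustrate the worry (toy code only, the 10000 is of course made up;
today RX multiplexes everything over a single UDP socket):

    #include <poll.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NCONN 10000    /* hypothetical number of client TCP connections */

    int
    main(void)
    {
        struct pollfd *fds = calloc(NCONN, sizeof(*fds));
        int i;

        for (i = 0; i < NCONN; i++) {
            fds[i].fd = -1;          /* stand-ins; a server would put sockets here */
            fds[i].events = POLLIN;
        }
        /* Every pass over the event loop re-submits the whole array and
         * the kernel re-scans all NCONN entries, even if only a handful
         * of connections are active. */
        if (poll(fds, NCONN, 0) < 0)
            perror("poll");
        free(fds);
        return 0;
    }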
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics (CERN) - Geneva, Switzerland
Phone: +41 22 767 8985 Fax: +41 22 767 7155