[OpenAFS] 1.4.8, Rx Performance Improvements, and a Small Business Innovative Research grant

Rainer Toebbicke rtb@pclella.cern.ch
Fri, 3 Oct 2008 12:08:33 +0200


> As a result of these problems Rx was periodically not sending the
> anticipated acknowledgment packet which in turn resulted in a timeout
> and retransmission.  The Rx stack was also frequently finding itself
> out of free packets and was forced to block on a global lock while
> additional packets structures were allocated from the process'
> memory pool.  The end result was a performance improvement of greater
> than 9.5% when comparing the Rx performance of 1.4.8 over 1.4.7.
>
> Rough tests show that the 1.4.8 Rx stack is capable of 124MBytes/second
> over a 10Gbit link.  There is still a long way to go to fill a 10Gbit
> pipe but it is a start.  Now we are only off by one order of magnitude.
>

Having repeatedly dug into the RX code in the past (without spotting
those problems!), I am of course very interested and will try the new
code as soon as possible!

Just a few findings on RX from my previous (vain) attempts to make it
"lightning fast" - perhaps they trigger ideas for whoever is still
working on it or corrections from those who know better:

1. as latency grows when crossing routers or even public networks, the
default window of 32 packets is too small. On the other hand, the cost
of handling the transmission queue grows with n**2, and even fast
processors are quickly overwhelmed. Here's where "oprofile" is a
valuable tool. Some of this can be reduced with queue hints, by posting
retransmit events wisely, and by avoiding scans of the whole queue
in several places;

2. jumbograms are a pain: years ago we had a research network dropping
fragmented packets and spent weeks pinning that down. Currently we
suspect another one. Firewalls choke on them. They also increase
complexity for access lists in routers. And of course the probability
increases of having to retransmit the whole jumbogram because one
fragment got lost. What makes me frown is that it is apparently
faster to let the kernels split and reassemble jumbograms on the fly
than to have RX do it, with much more knowledge about the state;

3. the path for handling an ACK packet is very long: I measured on the
order of 10 microseconds on average on a modern processor. At over
100 MB/s you'd be handling ~50000 ACKs per second in a non-jumbogram
configuration and have hardly any time left to send out new packets. A
lot is spent waiting for the call lock: even when that one is
released quickly (which it isn't in the standard implementation, as
the code leisurely walks around with it for extended periods, but I
experimented with a "release" flag), the detour through the scheduler
slows things down dramatically. The lock structure should probably be
revisited to make contention between ACK receive and transmit threads
less likely;

4. slow start is implemented state-of-the-art; fast recovery, however,
looks odd to me (actually: "nonexistent", but I may be fooled by some
jumbogram smoke). When it comes to congestion avoidance, a lot of the
research that went into TCP in the last ten years is obviously
missing. I started experimenting with CUBIC in the hope that it helps
to reduce retransmits and keep a constant flow, let's see;

5. earlier this year we mentioned the handling of new calls, which is
again a quadratic problem due to the mixture of service classes. This
makes it impractical to allow for thousands of waiting calls, creating
a problem on a cluster with thousands of nodes.
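
To put rough numbers on points 1 and 3, here is a minimal Python sketch.
Only the 32-packet window, the ~10 microseconds per ACK and the ~50000
ACKs per second come from the text above. The payload size of 1400 bytes
per packet and the round-trip times are assumptions chosen for
illustration, not measurements, and the function names are hypothetical.

```python
# Back-of-the-envelope sketch, not a measurement.
# PAYLOAD_BYTES and the RTT values are assumptions for illustration.

PAYLOAD_BYTES = 1400        # assumed data bytes per non-jumbogram packet
WINDOW_PACKETS = 32         # default window mentioned in point 1
ACK_COST_SECONDS = 10e-6    # ~10 microseconds per ACK, from point 3
ACKS_PER_SECOND = 50000     # ACK rate quoted in point 3 for >100 MB/s


def window_limited_throughput(window_packets, payload_bytes, rtt_seconds):
    """Upper bound in bytes/s: at most one full window per round trip."""
    return window_packets * payload_bytes / rtt_seconds


def ack_cpu_fraction(acks_per_second, ack_cost_seconds):
    """Fraction of one CPU spent purely on processing incoming ACKs."""
    return acks_per_second * ack_cost_seconds


if __name__ == "__main__":
    # Point 1: a fixed window caps throughput as the round-trip time grows.
    for rtt_ms in (0.2, 1.0, 10.0, 100.0):
        limit = window_limited_throughput(
            WINDOW_PACKETS, PAYLOAD_BYTES, rtt_ms / 1000.0)
        print(f"RTT {rtt_ms:6.1f} ms -> at most {limit / 1e6:8.2f} MB/s")

    # Point 3: share of one CPU consumed by the ACK path alone.
    frac = ack_cpu_fraction(ACKS_PER_SECOND, ACK_COST_SECONDS)
    print(f"ACK handling: {frac:.0%} of one CPU")
```

Under these assumptions the window limit scales inversely with the
round-trip time, and 50000 ACKs at 10 microseconds each already take
about half of one CPU before any new packet is sent.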

With those observations... does rx-over-tcp look like a solution? On
the packet-transmission side probably, but the encapsulation very
likely still demands significant processing power. And running a
server with 10000 or 20000 TCP connections does not sound that obvious
either.

Voilà... my 0.02 €. Sorry for being verbose, I couldn't resist.


--=20
=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D=
-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=
=3D-=3D
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155