[OpenAFS] tcpoob timeline

Jeffrey Altman jaltman@your-file-system.com
Fri, 26 Oct 2012 18:40:38 -0400

On 10/26/2012 5:03 PM, Andrew Deason wrote:
> On Wed, 17 Oct 2012 10:45:05 +0200
> To provide a sense of ordering... rxgk standards work will definitely
> precede tcp oob, though rxgk implementation may or may not. After rxgk,
> some smaller/simpler standards docs may go through, but tcp oob may be
> the next 'bigger' one. But the ordering here is unsure; Mike Meffie
> should be clarifying some specifics of the new standards process within
> the next week. I expect that around that time is when we'll discuss the
> priority of which documents to look at; some people may disagree with my
> guessed priorities.
> Note that that is my thinking and my guesses for code being in the tree,
> not for a stable release. Release scheduling is such a question mark for
> me right now I can't even begin to guess for that.

I have significant concerns about the design of TCP OOB as it was
described at EAKC2012.


The argument in favor of a TCP based solution is that RX cannot go fast
enough.  Andrew's claim is that RX cannot use a window size greater than
43.75K because of the 32 packet window limitation in 1.6.  The fact is
that this limitation is not a protocol limitation but an implementation
limitation.  Andrew points to Simon Wilkinson's past talks on RX as a
justification for this restriction.  Of course, Simon's talks also
provide the road map for how to remove the bottlenecks and permit the
full window size of 256 packets to be used without performance
degradation.  When combined with Jumbograms a window can support up to
2MB of application data per call without any changes to the protocol.
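
The window figures above can be checked with simple arithmetic. The
sketch below is illustrative only: the 1400-byte per-packet payload and
the 8192-byte jumbogram payload are assumptions chosen because they
reproduce the 43.75K and 2MB figures; neither value is stated in this
message. The throughput ceiling is the usual window/RTT estimate and
ignores loss, acknowledgment overhead, and implementation costs.

```python
# Back-of-the-envelope check of the RX window sizes quoted above.
# ASSUMPTIONS (not stated in the message): 1400 bytes of application data
# per ordinary RX packet and 8192 bytes per jumbogram. These values were
# picked because they reproduce the 43.75K and 2MB figures.
PACKET_PAYLOAD = 1400
JUMBOGRAM_PAYLOAD = 8192


def window_bytes(packets, payload_bytes):
    """Application data that can be in flight for one call."""
    return packets * payload_bytes


def max_throughput_mbit(window, rtt_seconds):
    """Rough ceiling on per-call throughput: one full window per round trip."""
    return window * 8 / rtt_seconds / 1e6


small = window_bytes(32, PACKET_PAYLOAD)       # 32-packet window
large = window_bytes(256, JUMBOGRAM_PAYLOAD)   # 256-packet window, jumbograms

if __name__ == "__main__":
    print(f"32 packets:  {small / 1024:.2f} KiB")
    print(f"256 packets: {large / (1024 * 1024):.2f} MiB")
    for rtt_ms in (1, 10, 100):
        rtt = rtt_ms / 1000
        print(f"RTT {rtt_ms:>3} ms: "
              f"{max_throughput_mbit(small, rtt):8.1f} Mbit/s vs "
              f"{max_throughput_mbit(large, rtt):8.1f} Mbit/s")
```

Under these assumptions, a 10 ms round trip caps a single call at
roughly 36 Mbit/s with the 32-packet window, and the larger window
raises that ceiling by a factor of about 47.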

In fact, Simon's "AFS Performance" talk, slide 28,
showed some of the results of his hard work with RX UDP throughput
quadrupled since 1.4 and more than doubled since 1.6.  There is still
room for significant improvement beyond the numbers presented at the

As Andrew indicated in his talk, TCP OOB was a compromise driven by the
fact that spending the resources to fix the RX implementation was deemed
to be too time consuming and expensive.  TCP OOB was designed to be a
rapid development approach to obtaining higher throughput.  Andrew
achieved some impressive numbers in a closed environment with no
requirement for wire privacy on the OOB channel.

However, when we begin to consider standardization and inclusion of an
OOB mechanism in OpenAFS, we must provide for wire authentication and
privacy at least as strong as that provided by the RX UDP connection.
We must also consider the impact that TCP socket allocation will have on
the scalability of the file server and on fairness in delivering data to
all clients.

Between AFS2 and AFS3 a decision was made to switch to UDP because the
file servers could not maintain enough open TCP connections to serve all
of the clients.  While we might believe the days of TCP socket limits
are behind us, the number of file descriptors on a system does not scale
to the number of clients that may be actively connecting to an AFS file
server on public Internet deployments.   Adding a TCP connection per
active FetchData / StoreData operation can severely restrict the number
of clients a file server can communicate with simultaneously.
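
To make the descriptor concern concrete, here is a minimal sketch of the
budget involved, assuming one descriptor per OOB TCP connection and a
hypothetical number of descriptors reserved for everything else the
server has open. The function name and the reserve of 64 are
illustrative and are not taken from any OpenAFS code.

```python
import resource  # POSIX only


def oob_transfer_budget(fd_limit, reserved_fds=64):
    """Transfers that fit if each one holds one TCP socket (one descriptor).

    reserved_fds is a hypothetical allowance for logs, volume files, the
    RX UDP socket and so on; the real number depends on the server.
    """
    return max(fd_limit - reserved_fds, 0)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
    if soft != resource.RLIM_INFINITY:
        print("concurrent OOB transfers at soft limit:",
              oob_transfer_budget(soft))
```

With a limit of 1024 descriptors and the assumed reserve, only 960
FetchData / StoreData operations could hold an OOB connection at once,
regardless of how many clients the RX UDP side could otherwise serve.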

It might be interesting to the community to note that Microsoft, as part
of Server 2012 and Windows 8, has begun rolling out UDP based
equivalents to many of the Windows protocols that are frequently used
over the public Internet.  Some of these new protocols have been back
ported to Windows 7 SP1 and are being rolled out via Windows Update
starting today.  The reason for these new protocol implementations is
that research has shown that UDP based protocols can be faster than TCP
and perform significantly better over connections with large latencies
and packet loss.

While I believe there is a place for OOB transfer protocols in
constrained situations, it is my personal belief that OOB transfer
protocols are not an appropriate long term solution for general access
to the /afs name space.  Improving RX UDP to support high performance
data transfers is not just theoretical but has been demonstrated.

In a later reply, Matt asked about RX/TCP and what happened to it.
RX/TCP was designed to be a BEEP style protocol which included
bidirectional data flows.  After considerable effort was spent on
implementation there were significant problems.

First, the performance characteristics when managing multiple calls in
more than one direction over a single TCP connection were not as
impressive as one would want.  Research papers have documented the
problems caused by multiple layers of flow control and their negative
interactions.  Second, without a clean, well-defined RX API that hides
the implementation from the callers, it would not be possible to quietly
graft RX TCP into the existing cache managers and service.  Finally, the
file server would need to manage the binding of callback connections to
an appropriately secured in-bound TCP connection and manage the cases
where one did not exist.  In the end, the resources being spent on the
problem did not appear to be worth the benefits when compared to the
benefits that improving the RX UDP implementation will provide.  As a
result, YFSi has continued to devote on-going development resources to
profiling, protocol analysis, and performance optimization for the RX
UDP protocol.

It is our firm belief that for the vast majority of use cases, OOB
protocols are simply unnecessary.  That is not to say that
standardization work on an OOB solution should not proceed.  However, I
believe that the community would be better off not looking for a quick
fix and should instead focus efforts and resources on a top to bottom
analysis of the data flows.   This is what YFSi has spent the last five
years doing to great success.

Jeffrey Altman
