[OpenAFS] 1.4.8, Rx Performance Improvements, and a Small Business Innovative Research grant

Jeffrey Altman jaltman@secure-endpoints.com
Thu, 02 Oct 2008 19:11:31 -0400

In discussions during 2007 with the HEPiX community, it was made 
clear to the gatekeepers that identifying and correcting trouble spots 
within Rx was one of the most important areas that OpenAFS needed to
improve in order to maintain the existing deployments within that
community of users.  No one had the resources to put towards such a 
pursuit and it was suggested that OpenAFS apply for a United States
Small Business Innovative Research (SBIR) grant to fund the work.  

There are two problems with such an approach.  First, OpenAFS does
not legally exist.  Second, even when it does have a legal Foundation,
it would not be eligible for an SBIR grant due to its not-for-profit
status.  Since we did not have any other source of funding to perform
the work, it was suggested that one of the existing commercial support
companies submit a grant application.

In October 2007 I founded Your File System Inc. as a for-profit company 
that would be eligible to receive an SBIR grant and use the funding 
to accomplish two goals.  First, to benefit the OpenAFS community by
documenting the existing architectures and protocols used by OpenAFS
and by engaging in profiling and performance analysis that could be
used as input to the development of next generation implementations.
Second, because the company is receiving SBIR funding, to develop
a sustainable business model that could support the development of a 
next generation distributed storage system.  

The SBIR grant has provided funding for developer hours as well
as test equipment.  In particular, the SBIR grant has provided 
a 10Gbit/second network testbed which is being used for Rx profiling.

I am pleased to announce the first public benefit to the OpenAFS 
community as a result of the SBIR grant with the expectation 
that there will be much more to come in the future.  

There have been many efforts over the last five years to improve Rx.
Tom Keiser implemented per thread free packet queues to reduce the 
contention for the global lock protecting the free packet queue.  
Other work has been performed to reduce the dependency on global 
locks.  Rx hot threads have been implemented on a broader range of 
platforms.  Various bug fixes have been accepted as they have been
validated.  Yet even with all of this work, Rx has continued to
experience noticeable performance problems.  In November 2006 there was 
discussion regarding a 350ms hiccup that was experienced repeatedly
and was significantly hampering performance.  Several folks have
tried over the years to pin it down, without success.

Funded by the SBIR grant, there have been efforts over the last couple
of months to analyze Rx performance data from a number of sources.
Several symptoms were identified.  It was unclear whether they were
related to the hiccup, but they were worth investigating.  First, there
was a periodic out of memory error experienced in Windows test clients.
Second, there was a consistent lack of free packets.  Third, there was a
much larger number of retries than could be explained by packet loss on
the network.

What the investigations uncovered was a related set of problems,
some of which affect all implementations of Rx derived from the Transarc
implementation.  The problems fall into several categories:

   1. Resetting a Call object emptied packet queues without adding the 
      packets to the free packet queue.  rxi_ResetCall() would call 
      queue_Init() on queues with active rx_packets on them.  Once the 
      queues were cleared the packets were leaked and any acknowledgment 
      of receipt or transmission of other outgoing data would be lost.  
      Instead of initializing the queues the contents of the queues should 
      simply be freed either by a call to rxi_FreePackets() or by setting 
      the force flag on rxi_ClearTransmitQueue() and rxi_ClearReceiveQueue().
   2. Packets queued for transmission would not be sent.
      In rx.c there were two instances of RX_GLOBAL_RXLOCK_KERNEL which
      should have been AFS_GLOBAL_RXLOCK_KERNEL.  As a result of this
      oversight, rx_calls that were actively transmitting packets were
      reset prematurely and the outgoing packets were leaked.
   3. Packets would be leaked while read operations were progressing.
      rxi_ReadProc()/rxi_ReadProc32() failed to remove the currentPacket
      and put it on the call's iov queue when all of its data was read. 
      This resulted in the packet being lost either when the next read
      packet was fetched, when the next packet was transmitted, or when
      the call was reset. 
   4. The algorithm in OpenAFS which is used to allocate additional
      packets when there are no free packets was overly aggressive.  It
      was based on the overall number of packets that had been
      previously allocated, so each allocation would add a larger
      number of packets than the previous one. 
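
To make the first category concrete, here is a minimal sketch of the
difference between re-initializing a queue that still holds packets and
draining it back onto the free queue.  The toy_* types and functions are
hypothetical stand-ins written for this illustration.  They are not the
rx_queue.h macros or the actual rxi_ResetCall() code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for Rx structures; all names here are illustrative only. */
struct toy_packet {
    struct toy_packet *next;
};

struct toy_queue {
    struct toy_packet *head;
    size_t len;
};

static void toy_queue_init(struct toy_queue *q)
{
    q->head = NULL;
    q->len = 0;
}

static void toy_queue_push(struct toy_queue *q, struct toy_packet *p)
{
    assert(p != NULL);
    p->next = q->head;
    q->head = p;
    q->len++;
}

static struct toy_packet *toy_queue_pop(struct toy_queue *q)
{
    struct toy_packet *p = q->head;
    if (p != NULL) {
        q->head = p->next;
        p->next = NULL;
        q->len--;
    }
    return p;
}

/* Leaky reset: re-initializes a queue that may still hold packets.
 * Whatever was queued becomes unreachable and never returns to the pool. */
static void toy_reset_call_leaky(struct toy_queue *call_q)
{
    toy_queue_init(call_q);
}

/* Corrected reset: drains the call queue onto the free queue first. */
static void toy_reset_call_fixed(struct toy_queue *call_q,
                                 struct toy_queue *free_q)
{
    struct toy_packet *p;
    while ((p = toy_queue_pop(call_q)) != NULL)
        toy_queue_push(free_q, p);
}
```

With four packets in the pool and two of them queued on a call, the
corrected reset leaves four packets on the free queue.  The leaky reset
leaves the free queue short by however many packets were on the call.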

The side effects of these issues have been present in AFS for a very
long time and have been seen in both clients and servers.  Corrections
for these errors have been integrated into 1.5.53 and 1.4.8-pre1.

As a result of these problems Rx was periodically not sending the 
anticipated acknowledgment packet which in turn resulted in a timeout
and retransmission.  The Rx stack was also frequently finding itself
out of free packets and was forced to block on a global lock while
additional packet structures were allocated from the process's 
memory pool.  With these problems corrected, Rx performance improved by
more than 9.5% when comparing 1.4.8 to 1.4.7.  

Rough tests show that the 1.4.8 Rx stack is capable of 124MBytes/second
over a 10Gbit link.  There is still a long way to go to fill a 10Gbit
pipe but it is a start.  Now we are only off by one order of magnitude.
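
For anyone who wants to check the arithmetic behind "one order of
magnitude", here is a small helper.  It assumes decimal units
(1 MByte = 10^6 bytes, 1 Gbit = 1000 Mbit) and ignores protocol overhead.
The function name is made up for this example.

```c
#include <assert.h>

/* Fraction of a link's raw capacity used by a given payload rate.
 * Assumes decimal units and ignores header and protocol overhead. */
static double toy_link_utilization(double mbytes_per_sec,
                                   double link_gbits_per_sec)
{
    double mbits_per_sec;
    double link_mbits_per_sec;

    assert(link_gbits_per_sec > 0.0);
    mbits_per_sec = mbytes_per_sec * 8.0;
    link_mbits_per_sec = link_gbits_per_sec * 1000.0;
    return mbits_per_sec / link_mbits_per_sec;
}
```

Calling toy_link_utilization(124.0, 10.0) works out to 992 Mbit/second
against 10000 Mbit/second, or just under 10% of the link.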

Some might ask, "how is it that these bugs remained present in the OpenAFS
source tree for all of these years?"   The answer is quite simple.  "No
one ever thought to look for packet leaks."   Many organizations still 
perform weekly server restarts and so never noticed the memory leaks.
The rxdebug -rxstats output lists the free packet count, but no one ever 
thought that it was important to report the number of allocated packets.   
As a result no one noticed that the reason there were free packets available 
was that packets were constantly being allocated instead of recycled.
Over the years many individuals have noticed the extra resends.  It's just
that no one was able to identify why they were being sent.  The resends
did not prevent the system from functioning.  It was just slower than 
it should be.
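
The point about free versus allocated counts can be shown with a toy
model.  The structure and function names below are hypothetical and the
batch size is arbitrary.  This is not the rxdebug output format or the Rx
allocator, only a sketch of why the free count alone hides a leak.

```c
#include <assert.h>

/* Two counters for a toy packet pool; names are illustrative only. */
struct toy_pool_stats {
    unsigned long n_free;       /* packets currently on the free queue */
    unsigned long n_allocated;  /* packets ever allocated from memory */
};

/* Take one packet, allocating a fresh batch when the free queue is empty. */
static void toy_take(struct toy_pool_stats *s, unsigned long batch)
{
    assert(batch > 0);
    if (s->n_free == 0) {
        s->n_allocated += batch;
        s->n_free += batch;
    }
    s->n_free--;
}

/* Return one packet to the free queue.  A leaked packet never gets here. */
static void toy_give_back(struct toy_pool_stats *s)
{
    s->n_free++;
}
```

If every packet taken is given back, n_allocated stays at one batch.  If
packets are taken and never returned, n_free still looks reasonable at any
given moment because new batches keep arriving, while n_allocated climbs
without bound.  Only the second counter exposes the leak.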

As these changes become available for both clients and servers, I 
expect users to see much improved throughput, and several previously
unexplained server and client crashes will now be a thing of the past.

In order to get OpenAFS 1.4.8 released we need the assistance of the
community to test the pre-releases.  1.4.8-pre1 was announced yesterday.
The best way to move the release process along is for organizations 
that deploy OpenAFS to test the pre-releases and send e-mails to this
mailing list confirming what works.  Silence cannot be interpreted by
the gatekeepers as all is well.  

I look forward to your reports of success and to reporting further 
grant-funded contributions in the future.

Jeffrey Altman
