OpenAFS Master Repository branch, master, updated. openafs-devel-1_5_76-4549-gd03e3c3

Wed, 14 Jan 2015 10:29:29 -0500

The following commit has been merged in the master branch:
commit d03e3c392eee2bd9158c417f42bcbf2009dabfc3
Author: Andrew Deason <adeason@sinenomine.net>
Date:   Thu Sep 13 17:58:50 2012 -0500

    rx: Normalize use of some MTU-discovery fields
    
    When we store MTUs (peer->ifMTU, peer->natMTU, etc.), we store the
    maximum transport unit usable by RX, i.e., excluding the IP and UDP
    headers, but including the RX header.  Contrariwise, when we track the
    size of packets we've sent (conn->lastPacketSize, peer->maxPacketSize),
    we track logical packet lengths which exclude the RX header (and the IP
    and UDP headers).  However, the consumers of lastPacketSize and
    maxPacketSize were not always interpreting their values correctly as
    excluding the RX (and other) headers.
    
    Add comments to these fields in their respective structure definitions
    to help make clear what they contain (and the difference between them).
    Correct several checks which were using the wrong constant for
    correcting between these two worldviews (and the wrong sign).  Modernize
    the style of lines that are touched.
    
    The lastPacketSize and maxPacketSize variables are only accessed from
    five places: while sending packets, while processing acks, while sending
    acks, while handling growMTU events, and in rxi_CheckCall().  They are
    used to track the size of packets that have been successfully sent (and
    thus, indirectly, to control the size of packets we send).  The
    maxPacketSize is only set once we have received an ack from the peer for
    a packet of that size, and lastPacketSize tracks the size of a
    speculative packet larger than that size, until it is acked.
    
    When sending packets, we check if the size of the packet we are sending
    exceeds the recorded maxPacketSize, and if so, record that speculative
    size in the lastPacketSize variable, along with the sequence number of
    that packet in lastPacketSizeSeq.
    
    Correspondingly, when processing acks, if the packet tracked in
    lastPacketSizeSeq is being acked, we know that our speculative large
    packet was successfully received, and can attempt to update the recorded
    maxPacketSize for the peer.  This is done through an intermediate
    variable, 'pktsize', which holds either the value of lastPacketSize or
    lastPingSize, without adjustment for the size of any headers.
    
    The ack processing has a bit of code to handle the case where
    maxPacketSize has been reset to zero, initializing it to a small value
    which should be expected to always work.  The intention seems to have
    been to initialize maxPacketSize to the smallest permitted value (since
    RX_MIN_PACKET_SIZE is amount of data available to RX in the smallest
    permitted IP packet), but the old code was actually initializing
    maxPacketSize from zero to something a bit larger than the minimum, by
    RX_IPUDP_SIZE + RX_HEADER_SIZE.  This over-large initialization was
    mostly harmless, see below.  After this potential initialization,
    'pktsize' was incorrectly used to set a new ifMTU and natMTU for the
    peer.  It correctly determined that a packet larger than the previous
    maxPacketSize had been acked, but then set the peer's ifMTU and natMTU
    to smaller values than the acked packet actually indicates.  (It is
    careful to only increase the ifMTU, though.)  The actual peer->MTU is
    *not* updated here, but will presumably trickle through eventually via
    rxi_Resend() or similar.  It is possible that this code should be using
    rxi_SetPeerMtu() or similar logic, but that might involve locking issues
    which are outside the scope of this analysis.
    
    The over-large initialization of maxPacketSize (from zero) was
    fortuitously mostly harmless on systems using minimum-sized IP packets,
    since a correspondingly wrong check was used to detect if a new MTU
    invalidates this maxPacketSize, with the constants offsetting.
    Likewise, the checks in rxi_SendAck() had the wrong constants, but they
    offset to produce the correct boundary between small and large packets
    while trying to grow the MTU.  Unfortunately, the old behavior in the
    "small" case is not correct, and the grow MTU event would try to send a
    packet with more padding than was intended.  In networks allowing
    packets slightly larger than the minimum (but not much larger than the
    minimum), the old code may have been unable to discover the true MTU.
    
    In the main (MTU-related) logic of rxi_SendAck, a variable 'padbytes' is
    set to a small increment of maxPacketSize in the "small" case, and a
    larger increment of maxMTU in the "large" case.  There is a floor for
    padbytes based on RX_MIN_PACKET_SIZE, which ended up being larger than
    intended in the old code by approximately the size of the rx header.
    (Some of the adjustments performed are rather opaque, so the motivations
    are unclear.)
    
    The more interesting places where accesses to lastPacketSize and
    maxPacketSize happen are during the MTU grow events and in
    rxi_CheckCall().
    
    In rxi_CheckCall(), the packet size variables are only accessed if
    the connection has the msgsizeRetryErr flag set, the call is not timed
    out (whether for idleness or during active waiting), and the call has
    actually received data.  In this case, we conclude that sending packets
    failed and decrease the MTU.  The old code was quite broken in this
    regard, with a reversed sense of conditional for whether a speculative
    large packet was outstanding, and a rather large decrease in MTU size
    of effectively 128 + RX_HEADER_SIZE + RX_IPUDP_SIZE = 212, when only
    a decrease of 128 was intended.  The new code corrects the sense of
    the conditional and sets the decrease in MTU to the intended value of 128.
    
    With respect to MTU grow events, this change only touches
    rxi_SetPeerMtu(), to correct the conditional on whether the MTU update
    invalidates/overrides the cached maxPacketSize.  There is a window of
    values which could cause the old code to incorrectly fail to invalidate
    the cached maxPacketSize.  Values in this window could result in the old
    code being stuck on a bad MTU guess, but should not cause an actual
    failure to communicate.  That conditional zeroing of maxPacketSize is
    the only access to the PacketSize variables in rxi_SetPeerMtu().
    maxPacketSize is also checked in rxi_GrowMTUEvent(), but only against
    zero to avoid sending RX_ACK_MTU packets needlessly, so it is unaffected
    by the issue at hand.
    
    In summary, in certain network conditions, the old code could fail
    to find an optimum MTU size, but would always continue to operate.
    The new code is more likely to find a better MTU size, and the old
    and the new code should interoperate fine with both each other and
    themselves.
    
    [kaduk@mit.edu add a few missed cases; expound on analysis in commit message]
    
    Change-Id: I7494892738d38ffe35bdf1dfd483dbf671cc799a
    Reviewed-on: http://gerrit.openafs.org/8111
    Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
    Tested-by: BuildBot <buildbot@rampaginggeek.com>
    Reviewed-by: Daria Brashear <shadow@your-file-system.com>

 src/rx/rx.c      |   23 ++++++++++++-----------
 src/rx/rx_conn.h |    4 ++--
 src/rx/rx_peer.h |    6 +++---
 3 files changed, 17 insertions(+), 16 deletions(-)

-- 
OpenAFS Master Repository