[OpenAFS-devel] Path MTU discovery

Andrew Deason adeason@sinenomine.net
Fri, 14 Sep 2012 17:25:07 -0500


So today I've been pushing a bunch of changes related to ICMP error
handling and PMTU discovery. Most of these are just small bugfixes, and
I don't think they warrant much high-level discussion. (The current set
of patches ends with gerrit 8120.) However, the PMTU discovery stuff
still has issues, and I wanted to raise some discussion of it here.

Over the years I believe there have been a few different attempts at
implementing some form of PMTU discovery in Rx. I'm a little unclear on
what the current plan (or plans) is for how to do this, since there is
some code here and there in the tree that is not currently enabled. This
email describes the situation as I understand it and writes out some
thoughts, in the hope of soliciting comments or explanations.

So, as I understand it, PMTU discovery in Rx is impossible to do in the
'usual'/ideal sense: we cannot alter packet sizes on resend, because our
ACKs, window sizes, etc. are all packet-based. That is, say we sent a
packet of 1400 bytes with DF set, and we get back 'frag needed' with a
specified limit of 1000 bytes. We cannot then send a 1000-byte packet
and a 400-byte packet, since we already sent a 1400-byte packet with a
given sequence number, so it must remain the same size.

At that point, I can see a couple of things happening:

What seems to occur with the 'current' PMTU Linux code is that we just
keep trying to send that 1400-byte packet, and Linux eventually just
sends it with DF unset. All future packets sent to that same destination
are then also sent with DF unset (I assume this stops after some route
cache entry expires or something). However, since we did get the ICMP
'frag needed' message, we do update the peer MTU, so future messages are
sent with the smaller, e.g. 1000-byte size. But since they are sent with
DF unset, if the MTU changes again we don't notice, and fragmentation
occurs.

Furthermore, this is made worse because of some code that looks like it
tries to discover larger MTUs. This stuff in rxi_ReceiveAckPacket:

        peer->ifMTU=pktsize+RX_HEADER_SIZE;
        peer->natMTU = rxi_AdjustIfMTU(peer->ifMTU);

So, if we force through a (fragmented) 1400-byte packet, we get an ACK
for it, Rx concludes that it's fine to send 1400 bytes, and the peer MTU
is increased again. And since we keep sending fragmented packets, we
never lower it again, so the PMTU information is effectively not used.
In testing, this only seems to happen 'sometimes'; I assume this is due
to differences in packet arrivals/ordering, or when Linux decides to let
a too-large packet through. (I'm not sure if maybe we want to disable
this MTU increase for AFS_RXERRQ_ENV, or maybe prevent MTU increases for
some time limit after it was last decreased.)
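
The 'time limit' idea could look roughly like the sketch below. All
names here (demo_peer, demo_frag_needed, demo_ack_received,
DEMO_HOLD_SECS, DEMO_HEADER_SIZE) are hypothetical stand-ins, not
existing Rx code, the 600-second hold-down is an arbitrary value, and
the sketch ignores IP/UDP header overhead.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define DEMO_HEADER_SIZE 28   /* assumed stand-in for RX_HEADER_SIZE */
#define DEMO_HOLD_SECS   600  /* arbitrary hold-down period, an assumption */

struct demo_peer {
    int ifMTU;             /* current packet size limit for this peer */
    time_t lastDecrease;   /* when an ICMP error last lowered ifMTU; 0 = never */
};

/* An ICMP 'frag needed' arrived: only ever lower the MTU here, and
 * remember when we did so. */
static void demo_frag_needed(struct demo_peer *peer, int mtu, time_t now)
{
    assert(peer != NULL);
    if (mtu > 0 && mtu < peer->ifMTU) {
        peer->ifMTU = mtu;
        peer->lastDecrease = now;
    }
}

/* An ACK arrived for a packet with pktsize bytes of payload. Raise the
 * MTU only if no ICMP-driven decrease happened within the hold-down
 * period. Returns 1 if the MTU was raised, 0 otherwise. */
static int demo_ack_received(struct demo_peer *peer, int pktsize, time_t now)
{
    int candidate = pktsize + DEMO_HEADER_SIZE;

    assert(peer != NULL);
    if (candidate <= peer->ifMTU)
        return 0;   /* nothing to raise */
    if (peer->lastDecrease != 0 && now - peer->lastDecrease < DEMO_HOLD_SECS)
        return 0;   /* still in hold-down: the ACK may be for a fragmented packet */
    peer->ifMTU = candidate;
    return 1;
}
```

The point of the hold-down is that an ACK for a packet that was forced
through fragmented says nothing about the path MTU, so it should not
undo what the ICMP error just told us.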

The only other option I see, though, is to just kill the call as soon as
we get a 'frag needed', and the caller will just have to restart the
call. For networks that drop UDP fragments (which certainly have been
seen in many places), this is effectively what happens anyway, since the
call will never proceed. This doesn't seem too terrible, since MTU
changes should be rare. I'm not sure if this was the intended purpose
for RX_MSGSIZE?


My best guess for what may have been attempted in the code: when the
calling application defines an error via rx_SetMsgsizeRetryErr, we kill
the call immediately with an error (e.g. RX_MSGSIZE). Otherwise, we try
to force the 1400-byte packet through, and lower packet sizes back to
the discovered MTU as soon as we can. If fragments can't get through,
the call dies with a network error.
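
Written out as a sketch, that guessed policy is something like the
following. This is only my reading expressed as code, not what the tree
does; demo_conn, demo_decision, demo_on_frag_needed and DEMO_MSGSIZE_ERR
are hypothetical names, and the msgsizeRetryErr field merely stands in
for whatever rx_SetMsgsizeRetryErr stores.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MSGSIZE_ERR (-1000)  /* placeholder value, not the real RX_MSGSIZE */

enum demo_action { DEMO_ABORT_CALL, DEMO_KEEP_SENDING };

struct demo_conn {
    int msgsizeRetryErr;   /* 0 means the application did not set one */
};

struct demo_decision {
    enum demo_action action;
    int callError;         /* error to abort the call with, 0 if none */
    int peerMTU;           /* MTU to use for packets sent from now on */
};

/* Decide what to do when 'frag needed' arrives for an in-flight packet. */
static struct demo_decision
demo_on_frag_needed(const struct demo_conn *conn, int peerMTU, int reportedMTU)
{
    struct demo_decision d;

    assert(conn != NULL);

    /* Either way, packets created later use the smaller size. */
    d.peerMTU = (reportedMTU > 0 && reportedMTU < peerMTU) ? reportedMTU : peerMTU;

    if (conn->msgsizeRetryErr != 0) {
        /* The application opted in: kill the call now so it can restart. */
        d.action = DEMO_ABORT_CALL;
        d.callError = conn->msgsizeRetryErr;
    } else {
        /* Keep retrying the oversized packet and hope fragments get through. */
        d.action = DEMO_KEEP_SENDING;
        d.callError = 0;
    }
    return d;
}
```

In the second branch nothing here detects that fragments are being
dropped; the call would just time out through the normal path, which
matches the 'dies with a network error' case above.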

-- 
Andrew Deason
adeason@sinenomine.net