[OpenAFS-devel] Path MTU discovery

Simon Wilkinson sxw@your-file-system.com
Sat, 15 Sep 2012 11:06:37 +0100


>
> My best guess for what may have been attempted in the code: when the
> calling application defines an error via rx_SetMsgsizeRetryErr, we kill
> the call immediately with an error (e.g. RX_MSGSIZE). Otherwise, we try
> to force the 1400-byte packet through, and lower packet sizes back to
> the discovered MTU as soon as we can. If fragments can't get through,
> the call dies with a network error.

Hi Andrew,

I think your understanding of the PMTU code is roughly correct. It
pretty much matches what I worked out last time I looked at this.

One critical thing that I think your overview misses is that we have two
different types of MTU discovery. In my notes, I've taken to calling
these low and high MTU.

High MTU is where we attempt to discover whether the MTU of the link is
larger than the RX packet size. Code to do this has been in the tree for
a while - Derrick reworked this as part of the YFS grant work, but I
don't think he ever got something that worked. High MTU discovery uses
ICMP errors and the DF flag, and works in approximately the same way as
TCP PMTU discovery, with the exception (as you note) that we can't
resize existing RX packets.

When I looked at this last, my intention was to use high MTU discovery
as a means of safely enabling jumbograms. Rather than using jumbograms
to go over the known MTU (which causes fragmentation, and all of the
problems that jumbograms are known for), you'd use jumbograms to combine
RX packets to just below the discovered MTU. Doing this avoids all of
the problems of jumbograms, and means that we don't have to get into
creating oversize RX packets, which has its own pitfalls.
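
A minimal sketch of the arithmetic behind "combine RX packets to just
below the discovered MTU". Every size constant here is an assumption
made for illustration (IPv4 only, every packet in the jumbogram full
size), and packets_per_jumbogram is a hypothetical helper, not a
function in the RX source:

```python
# Sketch only: these sizes are illustrative assumptions, not values
# read from the RX source tree.
IP_UDP_OVERHEAD = 28   # assumed IPv4 (20) + UDP (8) header bytes
RX_HEADER = 28         # assumed RX header bytes, sent once per datagram
JUMBO_SUBHEADER = 4    # assumed header before each additional packet
JUMBO_BUFFER = 1412    # assumed data bytes carried per combined packet


def packets_per_jumbogram(discovered_mtu: int) -> int:
    """Return how many full-size RX packets fit in one datagram that
    stays at or below discovered_mtu, so the IP layer never fragments.

    A datagram carrying n packets is modelled as:
        IP_UDP_OVERHEAD + RX_HEADER
        + n * JUMBO_BUFFER + (n - 1) * JUMBO_SUBHEADER
    """
    usable = discovered_mtu - IP_UDP_OVERHEAD - RX_HEADER + JUMBO_SUBHEADER
    return max(0, usable // (JUMBO_BUFFER + JUMBO_SUBHEADER))


if __name__ == "__main__":
    for mtu in (1400, 1500, 9000):
        print(mtu, packets_per_jumbogram(mtu))
```

Under these assumed sizes a 1500-byte path carries a single packet per
datagram and a 9000-byte path carries six, all without fragmentation. A
result of zero means not even one full-size packet fits, which is the
low MTU case.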

Low MTU is where the MTU of the link is smaller than the RX packet size.
This is the case that Derrick discovered at the conference at UIUC and
wrote code to work around. Low MTU detection doesn't use the traditional
path MTU discovery code, but instead uses padded RX ping packets. If we
don't get a response to a ping packet of a certain size, then we resend
the ping with a lower size. When we eventually get a response, that's
the MTU of the link. This is the code that uses rx_SetMsgsizeRetryErr -
if that's registered, and we aren't making progress because of MTU, then
the call will be failed with that error, and the application can retry,
and thus get a smaller packet size.
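
The probing loop described above can be sketched as follows. This is a
simplified model, not the RX implementation: discover_low_mtu,
make_simulated_link and the list of candidate sizes are hypothetical,
and the "link" is a stand-in that answers a padded ping only if the
ping fits.

```python
from typing import Callable, Optional, Sequence


def discover_low_mtu(send_padded_ping: Callable[[int], bool],
                     candidate_sizes: Sequence[int]) -> Optional[int]:
    """Send padded pings from the largest candidate size downwards.

    send_padded_ping(size) must return True if a response arrived.
    Returns the first (largest) size that was answered, or None if
    no candidate size got through.
    """
    for size in sorted(candidate_sizes, reverse=True):
        if send_padded_ping(size):
            return size
    return None


def make_simulated_link(link_mtu: int) -> Callable[[int], bool]:
    """Hypothetical stand-in for the network: a ping is answered only
    when it is no larger than the link MTU."""
    def send_padded_ping(size: int) -> bool:
        return size <= link_mtu
    return send_padded_ping


if __name__ == "__main__":
    sizes = [1444, 1400, 1200, 1000, 576]  # assumed probe sizes
    print(discover_low_mtu(make_simulated_link(1300), sizes))
```

Note that the result is the largest probed size that was answered, so
it is a lower bound on the real link MTU whose precision depends on how
finely the candidate sizes are spaced.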

To my mind, keeping the two of these separate makes sense at present.
There are a lot of questions around support for setting the DF flag, and
getting the ICMP errors delivered to the RX stack, especially when that
stack is in userspace. The low MTU detection should work everywhere.
Last time I looked, low MTU had some issues - in particular, it was
using hard ACKs to determine whether a call was making progress, when
actually the presence of soft ACKs is sufficient (you don't care that
the packet has reached the application, just that it has been
successfully received by the network stack).
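
To illustrate the difference between the two progress checks, here is a
deliberately simplified model. AckState and both functions are
hypothetical, and each kind of acknowledgement is reduced to a single
"highest sequence number" counter, which is not how RX ACK packets are
actually laid out:

```python
from dataclasses import dataclass


@dataclass
class AckState:
    """Simplified snapshot of what the peer has acknowledged."""
    hard_acked: int  # highest seq delivered to the peer application
    soft_acked: int  # highest seq received by the peer's RX stack


def progress_hard_only(prev: AckState, cur: AckState) -> bool:
    """Progress test that only counts hard ACKs."""
    return cur.hard_acked > prev.hard_acked


def progress_any_ack(prev: AckState, cur: AckState) -> bool:
    """Progress test that also accepts soft ACKs: the packets crossed
    the network, which is all an MTU check needs to know."""
    return (cur.hard_acked > prev.hard_acked
            or cur.soft_acked > prev.soft_acked)


if __name__ == "__main__":
    # The peer's stack has received 4 more packets, but its
    # application has not read them yet.
    before = AckState(hard_acked=10, soft_acked=10)
    after = AckState(hard_acked=10, soft_acked=14)
    print(progress_hard_only(before, after))
    print(progress_any_ack(before, after))
```

With a slow reader on the far end, the hard-ACK-only test reports no
progress even though packets of the current size are clearly getting
through, so a call could be failed for MTU reasons when the path is
fine.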

It would be good to keep discussing this. Like most of RX, this code is
all a bit tangled, and I think discussing overall design intent is a
great way to make sure that the patches do what we all expect them to!

Cheers,

Simon.