[OpenAFS-devel] Path MTU discovery

Sat, 15 Sep 2012 13:48:49 -0400

On 09/15/2012 06:06 AM, Simon Wilkinson wrote:
>> My best guess for what may have been attempted in the code: when the
>> calling application defines an error via rx_SetMsgsizeRetryErr, we kill
>> the call immediately with an error (e.g. RX_MSGSIZE). Otherwise, we try
>> to force the 1400-byte packet through, and lower packet sizes back to
>> the discovered MTU as soon as we can. If fragments can't get through,
>> the call dies with a network error.
> Hi Andrew,
>
> I think your understanding of the PMTU code is roughly correct. It pretty much matches what I worked out last time I looked at this.
>
> One critical thing that I think your overview misses is that we have two different types of MTU discovery. In my notes, I've taken to calling these low and high MTU.
>
> High MTU is where we attempt to discover if the MTU of the link is larger than the RX packet size. Code to do this has been in the tree for a while - Derrick reworked this as part of the YFS grant work, but I don't think he ever got something that worked. High MTU discovery uses ICMP errors and the DF flag, and works in approximately the same way as TCP PMTU discovery, with the exception (as you note) that we can't resize existing RX packets.
>
> When I looked at this last, my intention was to use high MTU discovery as a means of safely enabling jumbograms. Rather than using jumbograms to go over the known MTU (which causes fragmentation, and all of the problems that jumbograms are known for), you'd use jumbograms to combine RX packets to just below the discovered MTU. Doing this avoids all of the problems of jumbograms, and means that we don't have to get into creating oversize RX packets, which has its own pitfalls.
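
A rough sketch of the packing arithmetic described above, in C. This is not the RX implementation; the function name and every parameter are hypothetical, and all sizes are supplied by the caller rather than taken from the RX headers. It only shows the idea of counting how many RX packets fit under a discovered path MTU so that the combined datagram never needs IP fragmentation.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many RX packets can be combined into one
 * datagram without exceeding the discovered path MTU.
 *
 *   path_mtu        - MTU found by high MTU discovery
 *   ip_udp_overhead - IP + UDP header bytes (28 for IPv4 without options)
 *   hdr_len         - wire header sent once per datagram (assumed value)
 *   payload_len     - data bytes carried by each combined packet (assumed)
 *   sub_hdr_len     - extra header for each packet after the first (assumed)
 *   max_pkts        - upper bound on packets per datagram (assumed)
 *
 * Returns 0 if not even one packet fits, which is the low MTU case.
 */
static unsigned
packets_per_datagram(unsigned path_mtu, unsigned ip_udp_overhead,
                     unsigned hdr_len, unsigned payload_len,
                     unsigned sub_hdr_len, unsigned max_pkts)
{
    unsigned avail, n;

    assert(payload_len > 0);

    if (max_pkts == 0 || path_mtu < ip_udp_overhead + hdr_len)
        return 0;
    avail = path_mtu - ip_udp_overhead - hdr_len;
    if (avail < payload_len)
        return 0;

    /* The first packet needs no sub-header. */
    avail -= payload_len;
    n = 1;

    /* Each further packet costs its payload plus a sub-header. */
    while (n < max_pkts && avail >= sub_hdr_len + payload_len) {
        avail -= sub_hdr_len + payload_len;
        n++;
    }
    return n;
}
```

With illustrative numbers (28 bytes of IP/UDP overhead, a 28-byte header, 1412-byte payloads, 4-byte sub-headers, at most 6 packets), a 1500-byte path carries one packet per datagram and a 9000-byte path carries six.
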
>
> Low MTU is where the MTU of the link is smaller than the RX packet size. This is the case that Derrick discovered at the conference at UIUC and wrote code to work around. Low MTU detection doesn't use the traditional path MTU discovery code, but instead uses padded RX ping packets. If we don't get a response to a ping packet of a certain size, then we resend the ping with a lower size. When we eventually get a response, that's the MTU of the link. This is the code that uses rx_SetMsgsizeRetryErr - if that's registered, and we aren't making progress because of MTU, then the call will be failed with that error, and the application can retry, and thus get a smaller packet size.
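
The step-down probing described above could be sketched roughly like this in C. Everything here is hypothetical: the callback type, the function names, and the start, floor and step values are placeholders, since the thread does not say which sizes or timeouts RX actually uses. The real code sends padded RX ping packets; the callback stands in for "send a ping of this size and wait for a response".

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical callback: send a padded ping of `size` bytes and report
 * whether a response arrived (nonzero) or the ping went unanswered (0).
 */
typedef int (*probe_fn)(size_t size, void *ctx);

/*
 * Walk down from `start` in `step`-byte decrements until a probe is
 * answered. Returns the first answered size, or 0 if nothing at or
 * above `floor` got through. The result is only accurate to within
 * one step.
 */
static size_t
low_mtu_probe(probe_fn probe, void *ctx,
              size_t start, size_t floor, size_t step)
{
    size_t size;

    assert(probe != NULL && step > 0 && floor > 0);

    for (size = start; size >= floor; size -= step) {
        if (probe(size, ctx))
            return size;
        if (size - floor < step)
            break;              /* next step would drop below floor */
    }
    return 0;
}

/* Simulated path for illustration: drops anything above *ctx bytes. */
static int
fake_path_probe(size_t size, void *ctx)
{
    size_t path_limit = *(const size_t *)ctx;

    return size <= path_limit;
}
```

For example, with a simulated path limit of 1050 bytes, probing from 1400 down to 500 in 100-byte steps tries 1400, 1300, 1200 and 1100 without success and settles on 1000.
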
>
> To my mind, keeping the two of these separate makes sense at present. There are a lot of questions around support for setting the DF flag, and getting the ICMP errors delivered to the RX stack, especially when that stack is in userspace. The low MTU detection should work everywhere. Last time I looked, low MTU had some issues - in particular, it was using hard ACKs to determine whether a call was making progress, when actually the presence of soft ACKs is sufficient (you don't care that the packet has reached the application, just that it has been successfully received by the network stack).
>
> It would be good to keep discussing this. Like most of RX, this code is all a bit tangled, and I think discussing overall design intent is a great way to make sure that the patches do what we all expect them to!
Is this already documented somewhere outside of the source code? Should 
this be in the wiki?

Jason