[OpenAFS-devel] Re: Path MTU discovery

Tue, 25 Sep 2012 13:16:21 -0500

On Sat, 15 Sep 2012 11:06:37 +0100
Simon Wilkinson <sxw@your-file-system.com> wrote:

> High MTU is where we attempt to discover if the MTU of the link is
> larger than the RX packet size. Code to do this has been in the tree
> for a while - Derrick reworked this as part of the YFS grant work, but
> I don't think ever got something that worked. High MTU discovery uses
> ICMP errors, the DF flag, and works in approximately the same way as
> TCP PMTU discovery, with the exception (as you note) that we can't
> resize existing RX packets. 

Here you are talking about enabling the Linux IP_MTU_DISCOVER
functionality, and the ICMP error queue stuff, correct? Maybe what you
describe was the intent of this, but that's certainly not all it does;
this method does detect when the PMTU decreases and we get an ICMP
response saying what our next fragmentation limit is. I don't see how
this ever increases the peer MTU.

Or are you also counting RX_ACK_MTU, lastPacketSize, lastPingSize, etc.
here?

> Low MTU is where the MTU of the link is smaller than the RX packet
> size. This is the case that Derrick discovered at the conference at
> UIUC and wrote code to work around. Low MTU detection doesn't use the
> traditional path MTU discovery code, but instead uses padded RX ping
> packets. If we don't get a response to a ping packet of a certain
> size, then we resend the ping with a lower size. When we eventually
> get a response, that's the MTU of the link. This is the code that uses
> rx_SetMsgsizeRetryErr - if that's registered, and we aren't making
> progress because of MTU, then the call will be failed with that error,
> and the application can retry, and thus get a smaller packet size.

So, this sounds like either RX_ACK_MTU, lastPacketSize, lastPingSize,
etc, or it sounds like the 'mtuout' label in rxi_CheckCall. One of
those, yes?
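
To make sure we mean the same thing, the step-down probing described in
the quoted paragraph could be sketched roughly as below. This is not the
rx implementation; probe_fn, find_working_size, fake_link_probe and the
fixed step size are all assumptions made for the example, and a callback
stands in for actually sending a padded ping and waiting for a reply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: send a padded ping of `size` bytes and return
 * nonzero if a response came back, zero if it did not. */
typedef int (*probe_fn)(int size, void *ctx);

/* Step the ping size down from `start` by `step` until a ping is
 * answered. Returns the first size that got a response, or 0 if no size
 * at or above `min_size` was answered. */
static int find_working_size(int start, int min_size, int step,
                             probe_fn probe, void *ctx)
{
    int size;

    assert(step > 0 && min_size > 0 && probe != NULL);
    for (size = start; size >= min_size; size -= step) {
        if (probe(size, ctx))
            return size;
    }
    return 0;
}

/* Stand-in for a real network: pings no larger than the link MTU stored
 * in *ctx are answered, anything bigger is silently dropped. */
static int fake_link_probe(int size, void *ctx)
{
    int link_mtu = *(int *)ctx;

    return size <= link_mtu;
}
```

With a step larger than one byte, the size found this way is a lower
bound on what the link can carry, not necessarily the exact MTU.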

> To my mind, keeping the two of these separate makes sense at present.
> There are a lot of questions around support for setting the DF flag,
> and getting the ICMP errors delivered to the RX stack, especially when
> that stack is in userspace.

For now, I'm only worried about receiving ICMP errors on Linux, since
that's the only platform I'm aware of that allows us to receive such
errors without receiving nearly all ICMP errors for the whole box. And
for Linux, this isn't difficult for userspace operations or anything, as
it is a normal unprivileged operation. (Maybe there are other methods on
other platforms for doing this, but I haven't looked into it.)

I think my immediate concern is what to do about lastPacketSize raising
the MTU after we have 'forced' a packet through via fragmentation that
is larger than the actual MTU; this appears to be the only issue
preventing the ICMP/IP_MTU_DISCOVER-based PMTU from working. Ideally I
would want to just not set lastPacketSize/etc. for an outgoing packet
that gets fragmented, but I don't think we have a way to determine that
under the current model.

What we could possibly do for Linux is to have two sockets open, one
that always sets DF and one that never sets DF, and choose between them
ourselves for each packet (Linux doesn't let you set this per-call, so
with a single socket we'd have to setsockopt every time we want to
switch... I think other platforms may let us set this per-call). What
we could then do is always send the MTU pings with DF set, and
everything else with DF not set, and only adjust MTU based on those MTU
pings.
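
A rough sketch of the two-socket idea on Linux, assuming IPv4 UDP
sockets. The struct and function names (df_sock_pair, open_df_pair,
pick_socket) are invented for this example. It also ignores the question
of how both sockets would end up sending from the same local port, which
a real implementation would have to sort out.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical pair of UDP sockets: one always sets DF, one never does. */
struct df_sock_pair {
    int df_fd;   /* IP_PMTUDISC_DO: DF always set */
    int nodf_fd; /* IP_PMTUDISC_DONT: DF never set */
};

/* Open a UDP socket with the given IP_MTU_DISCOVER mode.
 * Returns the descriptor, or -1 on failure. */
static int open_udp_pmtudisc(int mode)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &mode, sizeof(mode)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Open both sockets. Returns 0 on success, -1 (with nothing left open)
 * on failure. */
static int open_df_pair(struct df_sock_pair *sp)
{
    assert(sp != NULL);
    sp->df_fd = open_udp_pmtudisc(IP_PMTUDISC_DO);
    sp->nodf_fd = open_udp_pmtudisc(IP_PMTUDISC_DONT);
    if (sp->df_fd < 0 || sp->nodf_fd < 0) {
        if (sp->df_fd >= 0)
            close(sp->df_fd);
        if (sp->nodf_fd >= 0)
            close(sp->nodf_fd);
        return -1;
    }
    return 0;
}

/* MTU pings go out with DF set; everything else goes out without it. */
static int pick_socket(const struct df_sock_pair *sp, int is_mtu_ping)
{
    assert(sp != NULL);
    return is_mtu_ping ? sp->df_fd : sp->nodf_fd;
}
```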

Basing MTU decisions on both (MTU-specific pings and actual data
packets) seems error-prone: we want to push the data packets through by
any means, but the MTU pings should fail if any hop doesn't like the
packet size.
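
In other words, the bookkeeping I have in mind is something like the
following. It is only a sketch of the policy; peer_mtu_state, pkt_kind,
note_acked and note_icmp_mtu are hypothetical names and do not
correspond to existing rx structures or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packet classification for MTU bookkeeping. */
enum pkt_kind { PKT_DATA, PKT_MTU_PING };

/* Hypothetical per-peer state: the current MTU estimate in bytes. */
struct peer_mtu_state {
    int mtu;
};

/* A packet of `size` bytes was acknowledged. Only MTU pings (sent with
 * DF set) may raise the estimate; data packets may have been fragmented
 * on the way, so their size says nothing about the path MTU. */
static void note_acked(struct peer_mtu_state *p, enum pkt_kind kind, int size)
{
    assert(p != NULL);
    if (kind == PKT_MTU_PING && size > p->mtu)
        p->mtu = size;
}

/* An ICMP error reported a next-hop MTU. Lower the estimate if the
 * reported value is usable and smaller than what we have. */
static void note_icmp_mtu(struct peer_mtu_state *p, int reported)
{
    assert(p != NULL);
    if (reported > 0 && reported < p->mtu)
        p->mtu = reported;
}
```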

I'm somewhat thinking aloud here now; does this make sense?

-- 
Andrew Deason
adeason@sinenomine.net