[OpenAFS-devel] Re: Path MTU discovery

Derrick Brashear shadow@gmail.com
Tue, 25 Sep 2012 15:33:54 -0400


On Tue, Sep 25, 2012 at 2:16 PM, Andrew Deason <adeason@sinenomine.net> wrote:
> On Sat, 15 Sep 2012 11:06:37 +0100
> Simon Wilkinson <sxw@your-file-system.com> wrote:
>
>> High MTU is where we attempt to discover if the MTU of the link is
>> larger than the RX packet size. Code to do this has been in the tree
>> for a while - Derrick reworked this as part of the YFS grant work, but
>> I don't think ever got something that worked. High MTU discovery uses
>> ICMP errors, the DF flag, and works in approximately the same way as
>> TCP PMTU discovery, with the exception (as you note) that we can't
>> resize existing RX packets.
>
> Here you are talking about enabling the Linux IP_MTU_DISCOVER
> functionality, and the ICMP error queue stuff, correct?

No. This is the code that pads packets to find the size at which they stop
getting through.

> Maybe what you
> describe was the intent of this, but that's certainly not all it does;
> this method does detect when the pmtu decreases and we get an icmp
> response saying what our next frag limit is. I don't see how this ever
> increases the peer mtu.
>
> Or are you also counting RX_ACK_MTU,lastPacketSize,lastPingSize,etc
> here?

Yes.

>> Low MTU is where the MTU of the link is smaller than the RX packet
>> size. This is the case that Derrick discovered at the conference at
>> UIUC and wrote code to work around. Low MTU detection doesn't use the
>> traditional path MTU discovery code, but instead uses padded RX ping
>> packets. If we don't get a response to a ping packet of a certain
>> size, then we resend the ping with a lower size. When we eventually
>> get a response, that's the MTU of the link. This is the code that uses
>> rx_SetMsgsizeRetryErr - if that's registered, and we aren't making
>> progress because of MTU, then the call will be failed with that error,
>> and the application can retry, and thus get a smaller packet size.
>
> So, this sounds like either RX_ACK_MTU, lastPacketSize, lastPingSize,
> etc, or it sounds like the 'mtuout' label in rxi_CheckCall. One of
> those, yes?

Well, lastPacketSize/lastPingSize relate to both the low and the high case.
The mtuout case is low only.
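
To make the low-MTU description above concrete, here is a rough sketch of
the step-down idea: send a padded ping, and if nothing comes back, retry
with a smaller size until one is answered. This is not the RX code. The
names find_low_mtu, probe_fn and sim_probe are made up for illustration,
and the "path" is simulated by a callback instead of real pings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical probe callback (not an RX function): returns nonzero if a
 * padded ping of `size` bytes was answered by the peer. */
typedef int (*probe_fn)(int size, void *ctx);

/* Step the ping size down from `start` by `step` until a probe is answered
 * or the size would drop below `floor`. Returns the first answered size,
 * or -1 if no size in the range was answered. */
static int find_low_mtu(int start, int floor, int step,
                        probe_fn probe, void *ctx)
{
    int size;

    assert(probe != NULL);
    assert(step > 0);

    for (size = start; size >= floor; size -= step) {
        if (probe(size, ctx))
            return size;
    }
    return -1;
}

/* Simulated path for illustration only: any ping at or below the limit
 * stored in *ctx is answered, anything larger is silently dropped. */
static int sim_probe(int size, void *ctx)
{
    return size <= *(int *)ctx;
}
```

With a coarse step the result is only a lower bound on what the path will
carry: a path limit of 1340 with a step of 100 from 1500 settles on 1300.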

>> To my mind, keeping the two of these separate makes sense at present.
>> There are a lot of questions around support for setting the DF flag,
>> and getting the ICMP errors delivered to the RX stack, especially when
>> that stack is in userspace.
>
> For now, I'm only worried about receiving ICMP errors on Linux, since
> that's the only platform I'm aware of that allows us to receive such
> errors without receiving nearly all ICMP errors for the whole box. And
> for Linux, this isn't difficult for userspace operations or anything, as
> it is a normal unprivileged operation. (Maybe there are other methods on
> other platforms for doing this, but I haven't looked into it.)

I have a Solaris streams module to do it, but it's ugly.

> I think my immediate concern is what to do about lastPacketSize raising
> the MTU after we have 'forced' a packet through via fragmentation that
> is higher than the actual MTU; this appears to be my only issue
> preventing the ICMP/IP_MTU_DISCOVER-based pmtu from working. Ideally I
> would want to just not set lastPacketSize/etc for a packet that is going
> out that is fragmented, but I don't think we have a way to determine
> that under the current model.
>
> What we could possibly do for Linux is to have two sockets open, one
> which is set to always set DF, and one to never set DF, and we could
> choose ourselves (Linux doesn't let you set this per-call; we'd have to
> setsockopt every time we want to switch... I think other platforms may
> let us set this per-call). What we could then do is always send the MTU
> pings with DF set, and everything else with DF not set, and only adjust
> MTU based on those MTU pings.

That sounds like a reasonable approach, though I suspect it is portable
beyond just Linux, and the more places we can have it, the better.
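
A rough sketch of the two-socket idea on Linux, assuming IPv4 UDP and the
IP_MTU_DISCOVER socket option: one socket always sets DF and carries only
the MTU pings, the other never sets DF and carries everything else. The
struct and function names (mtu_sock_pair, open_udp_with_df, open_sock_pair,
pick_socket) are invented for illustration. Binding is left out, so this
says nothing about how the peer should treat two source ports.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Hypothetical pair of sockets, not an RX structure. */
struct mtu_sock_pair {
    int probe_fd;   /* DF always set: MTU pings only */
    int data_fd;    /* DF never set: all other traffic */
};

/* Open a UDP socket with the given Linux PMTU discovery mode
 * (IP_PMTUDISC_DO to always set DF, IP_PMTUDISC_DONT to never set it). */
static int open_udp_with_df(int pmtudisc)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER,
                   &pmtudisc, sizeof(pmtudisc)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Open both sockets; returns 0 on success, -1 on failure. */
static int open_sock_pair(struct mtu_sock_pair *p)
{
    p->probe_fd = open_udp_with_df(IP_PMTUDISC_DO);
    if (p->probe_fd < 0)
        return -1;
    p->data_fd = open_udp_with_df(IP_PMTUDISC_DONT);
    if (p->data_fd < 0) {
        close(p->probe_fd);
        return -1;
    }
    return 0;
}

/* MTU pings go out with DF set; everything else may be fragmented. */
static int pick_socket(const struct mtu_sock_pair *p, int is_mtu_ping)
{
    return is_mtu_ping ? p->probe_fd : p->data_fd;
}
```

The point of the split is that only packets sent on probe_fd would feed MTU
decisions, so a data packet that got through by being fragmented can never
raise the estimate.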

> Basing MTU decisions on both (MTU-specific pings and actual data
> packets) seems error prone, since the data packets we want to try and
> push through by all means, but the MTU ones should fail if any hop
> doesn't like the packet size.
>
> I'm somewhat thinking aloud here now; does this make sense?
>
> --
> Andrew Deason
> adeason@sinenomine.net



-- 
Derrick