[OpenAFS-devel] Re: Path MTU discovery

Derrick Brashear shadow@gmail.com
Tue, 25 Sep 2012 16:20:56 -0400


On Tue, Sep 25, 2012 at 4:07 PM, Andrew Deason <adeason@sinenomine.net> wrote:
> On Tue, 25 Sep 2012 15:33:54 -0400
> Derrick Brashear <shadow@gmail.com> wrote:
>
>> > Here you are talking about enabling the Linux IP_MTU_DISCOVER
>> > functionality, and the ICMP error queue stuff, correct?
>>
>> No. This is code which pads packets to discover when they stop being
>> passed.
>
> Okay, so in my mind, there are effectively three different MTU-related
> systems right now; agree? hi/lo/icmp(linux)

there's something which attempts the icmp thing for solaris also, which
is not my streams module, istr.

but yes, 3 different systems, at least in the code base.

> They're not all working or even enabled, and they may try to do similar
> things, but... I mean, in terms of code presence in the tree. It's just
> a little confusing, since each of them is kinda spread out, and they all
> use the same terminology and vars/fields and such ("MTU").

yes.

> I'm not complaining (though sure, it'd be nice to have this easier to
> understand); just trying to explicitly note this to be clear.
>
>> > So, this sounds like either RX_ACK_MTU, lastPacketSize,
>> > lastPingSize, etc, or it sounds like the 'mtuout' label in
>> > rxi_CheckCall. One of those, yes?
>>
>> well, the lastPacket/lastPing is related to low and high. the mtuout
>> case is low.
>
> lastPacket/lastPing can't lower the mtu, though, as I understand it.
> They can only raise it.

not per se. lastPacket/lastPing record the last size we were actually able
to send, and the downward tweak happens with the mtuout code.

>> > What we could possibly do for Linux is to have two sockets open, one
>> > which is set to always set DF, and one to never set DF, and we could
>> > choose ourselves (Linux doesn't let you set this per-call; we'd have
>> > to setsockopt every time we want to switch... I think other
>> > platforms may let us set this per-call). What we could then do is
>> > always send the MTU pings with DF set, and everything else with DF
>> > not set, and only adjust MTU based on those MTU pings.
>>
>> that sounds like a reasonable approach, though, I suspect this is more
>> portable than just Linux quite simply, and the more places we can have
>> it, the better.
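
For reference, a rough sketch of the two-socket idea on Linux, written in
Python rather than rx's C for brevity. The helper names (open_udp,
pick_socket) are hypothetical and not anything in the rx tree; the numeric
fallbacks are the Linux values from <linux/in.h>. It does not address how
both sockets would share the rx port.

```python
import socket

# Linux values from <linux/in.h>; not every Python build exports these names.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DONT = getattr(socket, "IP_PMTUDISC_DONT", 0)  # never set DF
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)      # always set DF


def open_udp(always_df):
    """Open a UDP socket whose DF behaviour is fixed once (Linux only)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    mode = IP_PMTUDISC_DO if always_df else IP_PMTUDISC_DONT
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, mode)
    return s


def pick_socket(df_sock, nodf_sock, is_mtu_probe):
    """MTU pings go out with DF set; everything else may be fragmented."""
    return df_sock if is_mtu_probe else nodf_sock
```

The point of fixing the mode per socket is to avoid a setsockopt call every
time the sender wants to switch between DF and non-DF packets.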
>
> Yes, and I'm trying to think of how to do this so Linux ICMP processing
> can still be incorporated, without having to have a completely separate
> "Linux implementation". I don't mean to exclude the others; just
> mentioning Linux since it seems harder there to specify DF-vs-not at a
> fine granularity.
>
>
> If I can try to draw out how this works / would work:
>
>  - "normally" every N seconds, we send out a padded DF ping a little
>    larger than the known path MTU. If we get a response or an ICMP frag
>    error, set the pmtu.
>
>  - After X seconds/packets of packet loss, we send out a padded DF ping
>    smaller than the known path MTU. If we get a response or ICMP frag
>    error, set the pmtu. If we don't get either after Y seconds, repeat
>    with smaller packets.

codify "packet loss", because here's where it gets exciting.

>  - All other packets clear DF.
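
The three bullets above could be modelled roughly like this. It is a sketch
only: the class name, the step/floor/loss_limit values and the byte sizes
are made up for illustration, "packet loss" is reduced to a bare counter
(which is exactly the part that still needs codifying), and timers are left
to the caller, who would call next_probe() again when a ping goes unanswered.

```python
from dataclasses import dataclass


@dataclass
class PeerMtu:
    pmtu: int            # current best guess at the path MTU, in bytes
    step: int = 64       # hypothetical probe increment
    floor: int = 576     # hypothetical lower bound for downward probes
    loss_limit: int = 3  # hypothetical "X packets of loss" threshold
    losses: int = 0
    probe: int = 0       # size of the outstanding DF ping, 0 if none

    def note_loss(self):
        self.losses += 1

    def next_probe(self):
        """Pick the size of the next padded DF ping."""
        if self.losses >= self.loss_limit:
            # Probe downward, shrinking from the last unanswered small probe.
            base = self.probe if 0 < self.probe < self.pmtu else self.pmtu
            self.probe = max(self.floor, base - self.step)
        else:
            # Normal case: try a little larger than the known path MTU.
            self.probe = self.pmtu + self.step
        return self.probe

    def on_ack(self):
        """The outstanding ping was answered, so that size gets through."""
        if self.probe:
            self.pmtu, self.losses, self.probe = self.probe, 0, 0

    def on_icmp_frag_needed(self, next_hop_mtu):
        """An ICMP frag-needed error reported the next-hop MTU."""
        self.pmtu, self.losses, self.probe = next_hop_mtu, 0, 0
```

All other traffic is assumed to go out with DF clear, so only these pings
ever move pmtu.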
>
> Currently the 'pings' are done as call events, which I think is really
> adding to the complexity. If we could do this per-peer (as has been
> suggested for the NATping stuff, too), I think it would make this easier
> to follow and would reduce overhead.

well, the nat ping stuff has been restructured to be different, but yeah.

> Rx ping acks iirc need to be tied to a call, though; would it be
> possible to use "version" packets for this again?

if we're sending them anyway, yes, but the goal there was to generate no extra
traffic in the base case.

> That could possibly be done with event objects tied to peers, but ever
> since the NAT-ping thing I've been wondering about a separate thread for
> handling peer processing. Since NAT-ping is potentially quite frequent
> (every 6 seconds or whatever for potentially hundreds or even thousands
> of peers), it seems like that alone is a lot of pressure on the event
> thread for just idle behavior. If we instead had a thread that just
> slept for 6 seconds and then traversed every peer for NAT-ping, we could
> handle other things along the way, like MTU processing.

we need a way to discover nat ping is unneeded and not send it, and marking such
a thing in the peer and then doing this seems reasonable.
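
A minimal sketch of that per-peer sweep, assuming a needs_nat_ping flag on
the peer and caller-supplied send_nat_ping/check_mtu callbacks. All of these
names are hypothetical, and how the flag gets cleared is left open here.

```python
import threading


class Peer:
    def __init__(self, name, needs_nat_ping=True):
        self.name = name
        self.needs_nat_ping = needs_nat_ping  # cleared once known unneeded


def sweep_once(peers, send_nat_ping, check_mtu):
    """One pass over all peers; returns how many NAT pings were sent."""
    sent = 0
    for peer in peers:
        if peer.needs_nat_ping:
            send_nat_ping(peer)
            sent += 1
        check_mtu(peer)  # MTU processing rides along on the same pass
    return sent


def peer_thread(peers, send_nat_ping, check_mtu, stop, interval=6.0):
    """Sleep for 'interval' seconds, sweep, repeat until 'stop' is set."""
    while not stop.wait(interval):
        sweep_once(peers, send_nat_ping, check_mtu)


def start_peer_thread(peers, send_nat_ping, check_mtu, interval=6.0):
    """Start the sweeper; call .set() on the returned Event to stop it."""
    stop = threading.Event()
    t = threading.Thread(
        target=peer_thread,
        args=(peers, send_nat_ping, check_mtu, stop, interval),
        daemon=True,
    )
    t.start()
    return stop, t
```

One sleeping thread replaces one timer event per peer, so idle cost on the
event thread no longer scales with the number of peers.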

when simon has a moment hopefully he will weigh in.

> So, thoughts/etc?
>
> --
> Andrew Deason
> adeason@sinenomine.net



-- 
Derrick