[OpenAFS-devel] Re: Path MTU discovery

Andrew Deason adeason@sinenomine.net
Tue, 25 Sep 2012 15:07:07 -0500


On Tue, 25 Sep 2012 15:33:54 -0400
Derrick Brashear <shadow@gmail.com> wrote:

> > Here you are talking about enabling the Linux IP_MTU_DISCOVER
> > functionality, and the ICMP error queue stuff, correct?
> 
> No. This is code which pads packets to discover when they stop being
> passed.

Okay, so in my mind, there are effectively three different MTU-related
systems right now: one that raises the MTU ("hi"), one that lowers it
("lo"), and the ICMP error handling (Linux only). Agree?

They're not all working or even enabled, and they may try to do similar
things, but I mean in terms of code present in the tree. It's just a
little confusing, since each of them is kinda spread out, and they all
use the same terminology and vars/fields and such ("MTU").

I'm not complaining (though sure, it'd be nice to have this easier to
understand); just trying to explicitly note this to be clear.

> > So, this sounds like either RX_ACK_MTU, lastPacketSize,
> > lastPingSize, etc, or it sounds like the 'mtuout' label in
> > rxi_CheckCall. One of those, yes?
> 
> well, the lastPacket/lastPing is related to low and high. the mtuout
> case is low.

lastPacket/lastPing can't lower the mtu, though, as I understand it.
They can only raise it.

> > What we could possibly do for Linux is to have two sockets open, one
> > which is set to always set DF, and one to never set DF, and we could
> > choose ourselves (Linux doesn't let you set this per-call; we'd have
> > to setsockopt every time we want to switch... I think other
> > platforms may let us set this per-call). What we could then do is
> > always send the MTU pings with DF set, and everything else with DF
> > not set, and only adjust MTU based on those MTU pings.
> 
> that sounds like a reasonable approach, though, I suspect this is more
> portable than just Linux quite simply, and the more places we can have
> it, the better.

Yes, and I'm trying to think of how to do this so Linux ICMP processing
can still be incorporated, without having to have a completely separate
"Linux implementation". I don't mean to exclude the others; just
mentioning Linux since it seems harder there to specify DF-vs-not at a
fine granularity.
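
To make the two-socket idea concrete, here is a rough sketch in Python
rather than Rx's C. The helper names (open_udp_socket, pick_socket) are
made up for illustration, the numeric fallbacks are the Linux values
from <linux/in.h>, and it ignores how two sockets would share the Rx
port:

```python
import socket

# Linux values from <linux/in.h>; hard-coded as a fallback because the
# Python socket module may not export these names.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DONT = getattr(socket, "IP_PMTUDISC_DONT", 0)  # never set DF
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)      # always set DF


def open_udp_socket(always_df):
    """Open a UDP socket whose DF behaviour is fixed at creation (Linux)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    mode = IP_PMTUDISC_DO if always_df else IP_PMTUDISC_DONT
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, mode)
    return s


def pick_socket(df_sock, nodf_sock, is_mtu_ping):
    """MTU pings go out with DF set; everything else goes out without."""
    return df_sock if is_mtu_ping else nodf_sock
```

The point of fixing the mode per socket is that the send path only has
to choose a socket, with no setsockopt call on every switch.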


If I can try to draw out how this works / would work:

 - "normally" every N seconds, we send out a padded DF ping a little
   larger than the known path MTU. If we get a response or an ICMP frag
   error, set the pmtu.
 
 - After X seconds/packets of packet loss, we send out a padded DF ping
   smaller than the known path MTU. If we get a response or ICMP frag
   error, set the pmtu. If we don't get either after Y seconds, repeat
   with smaller packets.
 
 - All other packets clear DF.
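
As a rough sketch of the per-peer state this implies (Python for
brevity, not Rx code): PeerMtu, STEP, MIN_MTU and the function names
are all hypothetical, and the N/X/Y timers are left to the caller. It
also assumes the ICMP frag error carries a usable next-hop MTU.

```python
from dataclasses import dataclass

STEP = 128      # hypothetical probe increment, in bytes
MIN_MTU = 576   # hypothetical floor for downward probing


@dataclass
class PeerMtu:
    pmtu: int             # currently known path MTU
    probe_size: int = 0   # size of the outstanding DF ping, 0 if none


def next_probe(peer, losing):
    """Pick the size of the next padded DF ping."""
    if losing:
        # Step down from the last failed downward probe, else from pmtu.
        below = 0 < peer.probe_size < peer.pmtu
        base = peer.probe_size if below else peer.pmtu
        size = max(MIN_MTU, base - STEP)
    else:
        size = peer.pmtu + STEP   # a little larger than the known pmtu
    peer.probe_size = size
    return size


def on_response(peer):
    """The ping was answered, so the path carries probe_size bytes."""
    if peer.probe_size:
        peer.pmtu = peer.probe_size
        peer.probe_size = 0


def on_icmp_frag(peer, next_hop_mtu):
    """An ICMP frag error reported next_hop_mtu for the path."""
    peer.pmtu = next_hop_mtu
    peer.probe_size = 0


def on_timeout(peer, losing):
    """Neither a response nor an ICMP error arrived in time."""
    if losing:
        return next_probe(peer, True)   # repeat with a smaller packet
    peer.probe_size = 0                 # upward probe lost: keep pmtu
    return 0
```

Keeping this state on the peer rather than on a call is what would let
a single timer or thread drive it.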

Currently the 'pings' are done as call events, which I think is really
adding to the complexity. If we could do this per-peer (as has been
suggested for the NATping stuff, too), I think it would make this easier
to follow and would reduce overhead.

Rx ping acks iirc need to be tied to a call, though; would it be
possible to use "version" packets for this again?

That could possibly be done with event objects tied to peers, but ever
since the NAT-ping thing I've been wondering about a separate thread for
handling peer processing. Since NAT-ping is potentially quite frequent
(every 6 seconds or whatever for potentially hundreds or even thousands
of peers), it seems like that alone is a lot of pressure on the event
thread for just idle behavior. If we instead had a thread that just
slept for 6 seconds and then traversed every peer for NAT-ping, we could
handle other things along the way, like MTU processing.
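
A rough sketch of what I mean by such a thread (again Python, not Rx
code; Peer, sweep_peers and peer_thread are hypothetical names, and the
counters just stand in for actually sending a NAT ping or running the
MTU logic):

```python
import threading

NAT_PING_INTERVAL = 6.0   # seconds; the "6 seconds or whatever" above


class Peer:
    def __init__(self, name):
        self.name = name
        self.nat_pings = 0    # stand-in for NAT pings sent
        self.mtu_checks = 0   # stand-in for MTU processing passes


def sweep_peers(peers, lock):
    """One pass over every peer: NAT-ping, then MTU processing."""
    with lock:
        snapshot = list(peers)   # hold the lock only while copying
    for peer in snapshot:
        peer.nat_pings += 1
        peer.mtu_checks += 1
    return len(snapshot)


def peer_thread(peers, lock, stop, interval=NAT_PING_INTERVAL):
    """Sleep for interval seconds, sweep, repeat until stop is set."""
    # Event.wait doubles as the sleep and returns True once stop is set.
    while not stop.wait(interval):
        sweep_peers(peers, lock)
```

The event thread then never sees per-peer idle work at all; the cost is
one wakeup per interval regardless of how many peers there are.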

So, thoughts/etc?

-- 
Andrew Deason
adeason@sinenomine.net