[OpenAFS-devel] 1.8.11pre1 client hanging on Linux 6.7
Jeffrey E Altman
jaltman@auristor.com
Mon, 29 Jan 2024 12:56:32 -0500
On 1/26/2024 1:53 PM, Michael Laß wrote:
> I captured the following traces and will comment inline on what I could
> find:
>
>
> Starting with a client running on Linux 6.6.13, trying to access
> /afs/desy.de:
> fstrace: https://homepages.upb.de/lass/openafs/6.6.13.fstrace
> pcapng: https://homepages.upb.de/lass/openafs/6.6.13.pcapng
>
> The packet trace (pcapng, can be opened with Wireshark) shows that the
> reply to fetch-data-64 (i.e., the directory listing) arrives in
> fragments (e.g., frames 127+128). Nevertheless, the reception of the
> packet is acknowledged in frame 131. In the end, everything works fine.
>
>
> Running the same scenario on Linux 6.7:
> fstrace: https://homepages.upb.de/lass/openafs/6.7_default-mtu_default-rxmaxmtu.fstrace
> pcapng: https://homepages.upb.de/lass/openafs/6.7_default-mtu_default-rxmaxmtu.pcapng
>
> The receiving side looks very similar: we still receive the reply to
> fetch-data-64 in fragments (frames 127+128, 129+130, etc.). However,
> the reception is never acknowledged by the client. The getdents64
> syscall hangs forever.
>
>
> Reducing the maximum RX MTU via -rxmaxmtu 1400 on Linux 6.7:
> fstrace: https://homepages.upb.de/lass/openafs/6.7_default-mtu_rxmtu-1400.fstrace
> pcapng: https://homepages.upb.de/lass/openafs/6.7_default-mtu_rxmaxmtu-1400.pcapng
>
> The reply to fetch-data-64 is not fragmented anymore because the RX
> packets are sufficiently small (frames 149-152). The reception is ACK'd
> in frame 154.
>
>
> It could be that the larger UDP packets are fragmented by my provider,
> as my IPv4 connection is realized via DS-Lite (a carrier-grade NAT
> [1][2]), which may reduce the MTU. This fragmentation may be key to
> reproducing this issue.
>
> Still, it worked fine with Linux 6.6, even when receiving fragmented
> responses, and it is not working anymore with Linux 6.7. I may start
> bisecting the Linux kernel changes between 6.6 and 6.7, but I fear that
> this will take weeks...
>
> Best regards,
> Michael
>
>
> [1] https://en.wikipedia.org/wiki/Carrier-grade_NAT
> [2] https://en.wikipedia.org/wiki/IPv6_transition_mechanism#Dual-Stack_Lite_(DS-Lite)
Dear openafs-devel list,
Michael and I spent some time over the weekend reproducing the behavior
with packet captures on both the client host and the fileserver host.
Michael's ISP connection is Vodafone DS-Lite, which transmits IPv4
traffic over a tunnel with a 1460 MTU. His LAN MTU is 1500, which
yields a preferred Rx MTU of 1444 (that is, 1444 bytes of data per Rx
DATA packet).
The OpenAFS cache manager advertises a willingness to accept packets as
large as 5692 bytes and up to four Rx packets in each datagram (aka
jumbograms).
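
For reference, the relationship between the Rx MTU figures above and
the on-wire IP datagram size is fixed header arithmetic. A minimal
sketch, assuming a 20-byte IPv4 header, an 8-byte UDP header, and the
28-byte Rx packet header:

    #include <stdio.h>

    /* Fixed per-packet overhead (assumes IPv4 without options and a
     * single Rx packet per datagram, i.e. no jumbograms). */
    #define IP_HDR   20
    #define UDP_HDR   8
    #define RX_HDR   28

    static int rx_mtu_from_ip_mtu(int ip_mtu)
    {
        return ip_mtu - IP_HDR - UDP_HDR - RX_HDR;
    }

    static int ip_mtu_from_rx_mtu(int rx_mtu)
    {
        return rx_mtu + IP_HDR + UDP_HDR + RX_HDR;
    }

    int main(void)
    {
        printf("LAN MTU 1500 -> Rx MTU %d\n", rx_mtu_from_ip_mtu(1500)); /* 1444 */
        printf("Rx MTU 1400  -> IP MTU %d\n", ip_mtu_from_rx_mtu(1400)); /* 1456 */
        return 0;
    }
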
When a desy.de fileserver replies to a FetchData RPC for the root
directory of the root.cell.readonly volume, it must return 6144 bytes.
This requires five DATA packets of sizes (1444, 1444, 1444, 1444, 368).
Of these five DATA packets, only the fifth can be transferred across
the tunnel without fragmentation.
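
As a worked example of how the 6144 bytes split across DATA packets at
an Rx MTU of 1444 (just the arithmetic, not the fileserver's actual
packetization code):

    #include <stdio.h>

    int main(void)
    {
        int remaining = 6144;   /* bytes of directory data to return     */
        int rx_mtu    = 1444;   /* bytes of data per full Rx DATA packet */
        int pkt       = 0;

        while (remaining > 0) {
            int chunk = remaining < rx_mtu ? remaining : rx_mtu;
            printf("DATA packet %d: %d bytes\n", ++pkt, chunk);
            remaining -= chunk;
        }
        /* Prints 1444, 1444, 1444, 1444, 368 -- five packets in total. */
        return 0;
    }
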
The Linux network stack attempts to emulate the behavior of IPv6 with
regard to the transmission of fragmented packets. In IPv6 only the
initial sender of a packet is permitted to fragment it.
Routers/switches along the path are not permitted to fragment. Instead,
any router/switch that cannot forward a packet because it is too large
for the next hop must return an ICMPv6 TOO_BIG packet documenting the
MTU of the next hop. Upon receipt of the ICMPv6 packet the sending host
updates a local path MTU cache, and the next time a packet larger than
the path MTU is sent along that path, the packet will be fragmented.
The way that Linux emulates the IPv6 behavior when using IPv4 is to set
the IP DONT_FRAGMENT flag on all outgoing packets. This prevents the
packets from being forwarded onto network segments with smaller MTUs and
is supposed to trigger the reply of an ICMP TOO_BIG (Fragmentation
Needed) message. This process is referred to as Path MTU Discovery. A
summary can be found at
https://erg.abdn.ac.uk/users/gorry/course/inet-pages/pmtud.html.
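
As an illustration of the mechanism only (this is not OpenAFS or
AuriStor code), a UDP sender on Linux can opt in to this DF-based
behavior and read back the kernel's cached path MTU roughly as follows;
the destination address and payload size are made-up values:

    #include <arpa/inet.h>
    #include <errno.h>
    #include <netinet/in.h>    /* IP_MTU_DISCOVER, IP_PMTUDISC_DO, IP_MTU */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            return 1;

        /* Ask the kernel to set DONT_FRAGMENT and perform PMTUD. */
        int pmtudisc = IP_PMTUDISC_DO;
        setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &pmtudisc, sizeof(pmtudisc));

        struct sockaddr_in peer;
        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port = htons(7000);                     /* fileserver port */
        inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr); /* example address */
        connect(fd, (struct sockaddr *)&peer, sizeof(peer));

        /* A 5692-byte datagram (the advertised maximum above) exceeds
         * the cached path MTU (initially the device MTU), so send()
         * fails with EMSGSIZE instead of being fragmented locally. */
        char buf[5692] = {0};
        if (send(fd, buf, sizeof(buf), 0) < 0 && errno == EMSGSIZE) {
            int mtu = 0;
            socklen_t len = sizeof(mtu);
            getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len);
            printf("kernel's cached path MTU: %d\n", mtu);
        }

        close(fd);
        return 0;
    }
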
The AuriStorFS fileservers begin each call with a congestion window set
to 4 packets. This permits four packets to be placed onto the wire.
In the case of fetching the desy.de root.cell.readonly root directory
there are five DATA packets. The first four are placed onto the wire
with the DONT_FRAGMENT flag. They cannot fit in the tunnel, so the
packets are dropped and ICMP TOO_BIG responses are returned.
However, it appears that not all ICMP packets are being received from
Vodafone, or perhaps there are two layers of tunneling. The first might
have an MTU of 1480 and the second an MTU of 1460.
Since no ACK has been received for the transmitted DATA packets, the
fileserver retransmits the DATA packets when the RTO expires and then
doubles the RTO. The retransmitted packets have the Rx REQUEST_ACK
flag set and the IP DONT_FRAGMENT flag set. Eventually an ICMP TOO_BIG
is delivered which advertises a small enough MTU, and the next RTO
retransmission will result in the DATA packets being sent fragmented.
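
The retransmission schedule is ordinary exponential backoff. A rough
sketch of the idea with illustrative timer values (not the fileserver's
actual timer code):

    #include <stdio.h>

    int main(void)
    {
        double rto     = 1.0;    /* illustrative initial RTO in seconds */
        double rto_max = 60.0;   /* illustrative upper bound            */
        int attempt;

        /* Each time the RTO expires without an ACK, retransmit the
         * unacknowledged DATA packets (with REQUEST_ACK and
         * DONT_FRAGMENT set) and double the timeout. */
        for (attempt = 1; attempt <= 6; attempt++) {
            printf("retransmission %d after %.0f s\n", attempt, rto);
            rto *= 2;
            if (rto > rto_max)
                rto = rto_max;
        }
        return 0;
    }
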
Once the DATA packets are fragmented into pieces small enough to fit
in the tunnel, they are delivered to the client host, where they are
reassembled and delivered to the cache manager's socket. As the DATA
packets with Rx REQUEST_ACK set are received, ACK packets are returned
to the fileserver. Once the cache manager ACKs the first DATA packet,
the fileserver can send the 5th DATA packet, which does not require
fragmentation, and the call completes. This pattern is observed in the
6.6.13.pcapng trace.
In the 6.7_default-mtu_default-rxmaxmtu.pcapng capture the pattern is
the same except that the cache manager never receives the DATA packets
and therefore never sends ACK packets. There appears to be a
regression introduced in the v6.7 kernel which prevents reassembly or
delivery of reassembled packets to the cache manager's Rx. A Linux
kernel bisect is required to determine which commit introduced the
behavior change.
Since there is no hard dead timeout configured for fileserver
connections, there is nothing on the client side to time out the
FetchData call. The cache manager and fileserver will exchange
PING/PING_RESPONSE packets, since those fit within the tunnel without
fragmentation, but the DATA packets will never make it through. As a
result Michael observes the "ls -l /afs/desy.de/" process blocking in
the syscall for what feels like forever. The AuriStorFS fileserver will
eventually kill the call when the DATA packets have been retransmitted
too often, but on the client side there are no timeouts that will
result in failure of the call.
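
For context, a hard dead time is simply an absolute ceiling on how long
a call may remain outstanding, regardless of whether keep-alive traffic
is still flowing. A generic sketch of the idea (not the Rx library's
implementation; the 120-second value is purely illustrative):

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    /* Illustrative ceiling on the lifetime of a single call. */
    #define HARD_DEAD_SECONDS 120

    /* Returns true if the call should be aborted even though PING/
     * PING_RESPONSE keep-alives are still being exchanged. */
    static bool call_hard_dead(time_t call_start, time_t now)
    {
        return (now - call_start) > HARD_DEAD_SECONDS;
    }

    int main(void)
    {
        time_t start = time(NULL);
        /* Pretend ten minutes have passed with no DATA progress. */
        time_t later = start + 600;

        if (call_hard_dead(start, later))
            printf("hard dead time exceeded: abort the FetchData call\n");
        return 0;
    }
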
When Michael configures afsd with -rxmaxmtu 1400, the cache manager
informs the fileserver that it will not accept any packets larger than
Rx MTU 1400 (IP MTU 1456). Because that is smaller than the tunnel's
MTU, all of the DATA packets can be delivered without fragmentation.
As presented at the AFS Tech Workshop in June 2023, the 2023 builds of
AuriStor Rx include Path MTU Discovery. AuriStor's PMTUD
implementation starts with an Rx MTU of 1144 and probes upwards from
there. No DATA packets are constructed that are larger than the
verified Path MTU. Therefore, all packets are delivered without
fragmentation. This permits AFS access from the Linux v6.7 kernel even
with the regression.
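
To make the probing idea concrete, here is a generic sketch of upward
MTU probing under made-up values (the probe step, the simulated path
limit, and the helper function are all hypothetical; this is not
AuriStor's actual implementation):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical probe: returns true if a probe packet carrying
     * 'rx_mtu' bytes of Rx data was acknowledged by the peer, i.e.
     * that size fits the path without fragmentation.  Here we simply
     * pretend the path accepts up to 1432 bytes of Rx data. */
    static bool probe_succeeds(int rx_mtu)
    {
        return rx_mtu <= 1432;
    }

    int main(void)
    {
        int verified = 1144;     /* conservative starting Rx MTU   */
        int step     = 144;      /* illustrative probe increment   */
        int ceiling  = 1444;     /* interface-derived upper bound  */

        /* Probe upward; DATA packets are only ever built at the
         * verified size, so nothing larger than the proven path MTU
         * is placed on the wire. */
        while (verified + step <= ceiling && probe_succeeds(verified + step))
            verified += step;

        printf("verified Rx MTU: %d\n", verified);
        return 0;
    }
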
Hopefully a bisect between v6.6 and v6.7 will identify the source of the
regression so that it can be fixed for v6.8 and then back-ported to one
of the v6.7 stable branches.
Jeffrey Altman