[OpenAFS-devel] 1.8.11pre1 client hanging on Linux 6.7

Jeffrey E Altman jaltman@auristor.com
Mon, 29 Jan 2024 12:56:32 -0500


On 1/26/2024 1:53 PM, Michael Laß wrote:
> I captured the following traces and will comment inline on what I could
> find:
>
>
> Starting with a client running on Linux 6.6.13, trying to access
> /afs/desy.de:
> fstrace: https://homepages.upb.de/lass/openafs/6.6.13.fstrace
> pcapng:  https://homepages.upb.de/lass/openafs/6.6.13.pcapng
>
> The packet trace (pcapng, can be opened with Wireshark) shows that the
> reply to fetch-data-64 (i.e., the directory listing) arrives in
> fragments (e.g., frames 127+128). Nevertheless, the reception of the
> packet is acknowledged in frame 131. In the end, everything works fine.
>
>
> Running the same scenario on Linux 6.7:
> fstrace: https://homepages.upb.de/lass/openafs/6.7_default-mtu_default-rxmaxmtu.fstrace
> pcapng:  https://homepages.upb.de/lass/openafs/6.7_default-mtu_default-rxmaxmtu.pcapng
>
> The receiving side looks very similar, we still receive the reply to
> fetch-data-64 in fragments (frames 127+128, 129+130, etc.). However,
> the reception is never acknowledged by the client. The getdents64
> syscall hangs forever.
>
>
> Reducing the maximum RX MTU via -rxmaxmtu 1400 on Linux 6.7:
> fstrace: https://homepages.upb.de/lass/openafs/6.7_default-mtu_rxmtu-1400.fstrace
> pcapng:  https://homepages.upb.de/lass/openafs/6.7_default-mtu_rxmaxmtu-1400.pcapng
>
> The reply to fetch-data-64 is not fragmented anymore because the RX
> packets are sufficiently small (frames 149-152). The reception is ACK'd
> in frame 154.
>
>
> It could be that the larger UDP packets are segmented by my provider,
> as my IPv4 connection is realized via DS-Lite (a carrier-grade NAT
> [1][2]), which may reduce the MTU. This segmentation may be key to
> reproducing this issue.
>
> Still, it worked fine with Linux 6.6, even when receiving fragmented
> responses, and it is not working anymore with Linux 6.7. I may start
> bisecting the Linux kernel changes between 6.6 and 6.7, but I fear that
> this will take weeks...
>
> Best regards,
> Michael
>
>
> [1] https://en.wikipedia.org/wiki/Carrier-grade_NAT
> [2] https://en.wikipedia.org/wiki/IPv6_transition_mechanism#Dual-Stack_Lite_(DS-Lite)

Dear openafs-devel list,

Michael and I spent some time over the weekend reproducing the behavior 
with packet captures on both the client host and the fileserver host.

Michael's ISP connection is Vodafone DS-Lite, which carries IPv4 
traffic over a tunnel with an MTU of 1460.  His LAN MTU is 1500, which 
yields a preferred Rx MTU of 1444 (that is, 1444 bytes of data per Rx 
DATA packet).

The OpenAFS cache manager advertises a willingness to accept packets as 
large as 5692 bytes and up to four Rx packets in each datagram (aka 
jumbograms).

When a desy.de fileserver replies to a FetchData RPC for the root 
directory of the root.cell.readonly volume, it must return 6144 bytes.  
This requires five DATA packets with payloads of 1444, 1444, 1444, 
1444, and 368 bytes.
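
For reference, the arithmetic works out as follows (a worked 
illustration only, not code from either implementation, assuming the 
standard 20-byte IPv4 header, 8-byte UDP header and 28-byte Rx packet 
header):

    #include <stdio.h>

    #define IP_HDR   20
    #define UDP_HDR   8
    #define RX_HDR   28
    #define OVERHEAD (IP_HDR + UDP_HDR + RX_HDR)   /* 56 bytes per DATA packet */

    int main(void)
    {
        int lan_mtu    = 1500;
        int rx_payload = lan_mtu - OVERHEAD;        /* 1444 bytes of data    */

        int reply = 6144;                           /* FetchData reply size  */
        int full  = reply / rx_payload;             /* 4 full DATA packets   */
        int tail  = reply - full * rx_payload;      /* 368-byte final packet */
        printf("%d x %d + %d = %d\n", full, rx_payload, tail, reply);

        /* The same overhead explains the -rxmaxmtu 1400 case further
         * below: Rx MTU 1400 + 56 = IP MTU 1456, which fits within the
         * 1460-byte tunnel. */
        printf("Rx MTU 1400 -> IP MTU %d\n", 1400 + OVERHEAD);
        return 0;
    }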

Of these five DATA packets only the 5th packet can be transferred across 
the tunnel without fragmentation.

The Linux network stack attempts to emulate the behavior of IPv6 with 
regard to the transmission of fragmented packets.  In IPv6 only the 
initial sender of a packet is permitted to fragment it.  Routers and 
switches along the path are not permitted to fragment.  Instead, any 
router or switch that cannot forward a packet because it is too large 
for the next hop must return an ICMPv6 TOO_BIG packet documenting the 
MTU of the next hop.  Upon receipt of the ICMPv6 the sending host 
updates a local path MTU cache, and the next time a packet larger than 
the path MTU is sent along that path, the packet will be fragmented.

The way that Linux emulates the IPv6 behavior for IPv4 traffic is to 
set the IP DONT_FRAGMENT flag on all outgoing packets.  This prevents 
the packets from being forwarded onto network segments with smaller 
MTUs and is supposed to trigger an ICMP TOO_BIG reply.  This process is 
referred to as Path MTU Discovery.  A summary can be found at 
https://erg.abdn.ac.uk/users/gorry/course/inet-pages/pmtud.html.
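
For illustration only (this is not OpenAFS code, and the address and 
port below are placeholders), this is roughly how a UDP sender on Linux 
opts into that behavior and reads back the kernel's cached path MTU:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        /* Ask the kernel to set DONT_FRAGMENT on every datagram and to
         * perform PMTUD.  With this setting a send() larger than the
         * cached path MTU fails with EMSGSIZE rather than being
         * fragmented locally. */
        int val = IP_PMTUDISC_DO;
        setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val));

        /* The cached path MTU can be read back for a connected socket. */
        struct sockaddr_in dst = { .sin_family = AF_INET,
                                   .sin_port   = htons(7000) };  /* fileserver port */
        dst.sin_addr.s_addr = htonl(0x7f000001);                 /* placeholder addr */
        connect(fd, (struct sockaddr *)&dst, sizeof(dst));

        int mtu = 0;
        socklen_t len = sizeof(mtu);
        getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len);
        printf("cached path MTU: %d\n", mtu);   /* drops after an ICMP TOO_BIG */
        return 0;
    }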

The AuriStorFS fileservers begin each call with a congestion window of 
four packets, which permits four packets to be placed onto the wire.  
In the case of fetching the desy.de root.cell.readonly root directory 
there are five DATA packets.  The first four are placed onto the wire 
with the DONT_FRAGMENT flag.  They cannot fit in the tunnel, so the 
packets are dropped and ICMP TOO_BIG packets are returned in response.
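
A much-simplified model of that window limit (illustrative only; 
send_data_packet is a made-up stub, not the fileserver's transmit 
path):

    #include <stdio.h>

    /* Hypothetical stub standing in for the real transmit path. */
    static void send_data_packet(int seq, int dont_fragment)
    {
        printf("send DATA %d (DF=%d)\n", seq, dont_fragment);
    }

    int main(void)
    {
        const int cwnd = 4, npackets = 5;
        int last_acked = 0;                  /* no ACKs received yet       */
        int next = 1;
        while (next <= npackets && next - last_acked <= cwnd) {
            send_data_packet(next, 1);       /* packets 1-4 go out with DF */
            next++;                          /* packet 5 waits for an ACK  */
        }
        return 0;
    }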

However, it appears that not all of the ICMP packets are received from 
the Vodafone network, or perhaps there are two layers of tunneling: the 
first might have an MTU of 1480 and the second an MTU of 1460.

Since no ACK has been received for the transmitted DATA packets, the 
fileserver retransmits them when the RTO expires and doubles the RTO.  
The retransmitted packets have the Rx REQUEST_ACK flag and the IP 
DONT_FRAGMENT flag set.  Eventually an ICMP TOO_BIG advertising a small 
enough MTU is delivered, and the next retransmission results in the 
DATA packets being sent fragmented.
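
Sketching that retransmission timing (again purely illustrative; the 
initial RTO value is an assumption and the real Rx code is far more 
involved):

    #include <stdio.h>

    int main(void)
    {
        double t = 0.0, rto = 1.0;           /* assumed 1s initial RTO      */
        for (int attempt = 1; attempt <= 5; attempt++) {
            t += rto;
            printf("t=%4.0fs: retransmit DATA 1-4 (REQUEST_ACK, DF set)\n", t);
            rto *= 2;                        /* RTO doubles on each timeout */
        }
        /* Once an ICMP TOO_BIG advertising a usable MTU arrives, the
         * next retransmission is sent fragmented and gets through.      */
        return 0;
    }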

Once the fragmented DATA packets are small enough to fit in the tunnel, 
they are delivered to the client host, where they are reassembled and 
delivered to the cache manager's socket.  As the DATA packets with Rx 
REQUEST_ACK set are received, ACK packets are returned to the fileserver.

Once the cache manager ACKs the first DATA packet, the fileserver can 
send the fifth DATA packet, which does not require fragmentation, and 
the call completes.  This pattern is observed in the 6.6.13.pcapng trace.

In the 6.7_default-mtu_default-rxmaxmtu.pcapng capture the pattern is 
the same except that the cache manager never receives the DATA packets 
and therefore never sends ACK packets.   There appears to be a 
regression introduced in the v6.7 kernel which prevents reassembly or 
delivery of reassembled packets to the cache manager's Rx.   A Linux 
kernel bisect is required to determine which commit introduced the 
behavior change.

Since there is no hard dead timeout configured for fileserver 
connections, there is nothing to time out the FetchData call from the 
client side.  The cache manager and fileserver will continue to 
exchange PING/PING_RESPONSE packets, since those fit within the tunnel 
without fragmentation, but the DATA packets will never make it through.  
As a result Michael observes the "ls -l /afs/desy.de/" process blocking 
in the syscall for what feels like forever.  The AuriStorFS fileserver 
will eventually kill the call when the DATA packets have been 
retransmitted too often, but there is no timeout on the client that 
will cause the call to fail.

When Michael configures afsd with -rxmaxmtu 1400, the cache manager 
informs the fileserver that it will not accept any packets larger than 
Rx MTU 1400 (IP MTU 1456).  That is smaller than the tunnel's MTU, 
which permits all of the DATA packets to be delivered without 
fragmentation.

As presented at the AFS Tech Workshop in June 2023, the 2023 builds of 
AuriStor Rx include Path MTU Discovery.  AuriStor's PMTUD 
implementation starts with an Rx MTU of 1144 and probes upwards from 
there.  No DATA packets larger than the verified path MTU are 
constructed, so all packets are delivered without fragmentation.  This 
permits AFS access from the Linux v6.7 kernel even with the regression.
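
To make the idea concrete, here is a toy version of upward probing (my 
own sketch, not AuriStor's actual algorithm; probe_acked and the probe 
sizes are invented):

    #include <stdio.h>

    /* Hypothetical helper: returns 1 if a padded probe carrying 'size'
     * bytes of Rx payload was acknowledged, 0 if it timed out. */
    static int probe_acked(int size)
    {
        return size <= 1404;   /* pretend the tunnel passes 1460-byte IP packets */
    }

    int main(void)
    {
        int verified = 1144;                           /* conservative starting Rx MTU */
        int candidates[] = { 1280, 1404, 1444, 4096 }; /* example probe sizes          */
        for (int i = 0; i < 4; i++) {
            if (probe_acked(candidates[i]))
                verified = candidates[i];              /* probe got through: raise MTU */
            else
                break;                                 /* keep the last verified size  */
        }
        printf("verified Rx MTU: %d\n", verified);     /* DATA packets never exceed it */
        return 0;
    }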

Hopefully a bisect between v6.6 and v6.7 will identify the source of the 
regression so that it can be fixed for v6.8 and then back-ported to one 
of the v6.7 stable branches.

Jeffrey Altman