[OpenAFS-devel] 1.8.11pre1 client hanging on Linux 6.7

Michael Laß lass@mail.upb.de
Fri, 16 Feb 2024 19:55:47 +0100


Dear all,

I want to give a final summary of this issue. The fix for the kernel
regression landed in Linux 6.8-rc4 [1] and 6.7.5 [2].

My current theory on why this issue did not affect most people is that
the broken code path was only taken if UDP checksum verification could
not be offloaded to the NIC. For me this was the case in two different
scenarios which were entirely independent of each other:

1. My test cell running in a VM on the same host as the client. Here,
the packets never hit any physical NIC, hence checksums had to be
verified in software by the kernel. This failed for all packets, which
is why communication was entirely impossible.

2. My DS-Lite internet connection, which reduces the IPv4 path MTU and
causes fragmentation of IP packets at the tunnel endpoint. Here,
checksums of any reassembled packets had to be checked in software by
the kernel. Hence, communication was generally possible, but "large"
packets (larger than the IPv4 path MTU) were dropped.
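
To illustrate what "verifying the checksum in software" means in both
scenarios, here is a minimal Python sketch of UDP checksum verification
over the IPv4 pseudo-header (RFC 768 / RFC 1071). It is not the kernel
code path that regressed; the function names, addresses and ports are
made up for illustration.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, complemented (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def pseudo_header(src: str, dst: str, udp_length: int) -> bytes:
    """IPv4 pseudo-header that is covered by the UDP checksum."""
    return (socket.inet_aton(src) + socket.inet_aton(dst)
            + struct.pack("!BBH", 0, socket.IPPROTO_UDP, udp_length))


def build_udp(src: str, dst: str, sport: int, dport: int,
              payload: bytes) -> bytes:
    """Build a UDP datagram (header + payload) with a valid checksum."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", sport, dport, length, 0)
    csum = internet_checksum(pseudo_header(src, dst, length)
                             + header + payload)
    if csum == 0:
        csum = 0xFFFF  # 0 on the wire means "no checksum" for IPv4
    return struct.pack("!HHHH", sport, dport, length, csum) + payload


def udp_checksum_ok(src: str, dst: str, datagram: bytes) -> bool:
    """Verify a received UDP datagram the way a software path would."""
    (stored,) = struct.unpack("!H", datagram[6:8])
    if stored == 0:
        return True  # sender did not compute a checksum
    # Summing over data that includes a correct checksum yields zero.
    return internet_checksum(
        pseudo_header(src, dst, len(datagram)) + datagram) == 0


if __name__ == "__main__":
    good = build_udp("192.0.2.1", "192.0.2.2", 7001, 7000, b"hello rx")
    bad = good[:-1] + bytes([good[-1] ^ 0x01])  # corrupt one payload bit
    print(udp_checksum_ok("192.0.2.1", "192.0.2.2", good))
    print(udp_checksum_ok("192.0.2.1", "192.0.2.2", bad))
```

A receiver that gets this verification wrong rejects every packet
checked in software, while packets already validated by the NIC are
unaffected. That matches the two scenarios above.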

Thanks again to Jeffrey who helped a lot in debugging this issue,
verifying its fix and finally making sure that the fix landed in 6.7.5.

Best regards,
Michael

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fe92f874f09145a6951deacaa4961390238bbe0d
[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.7.y&id=50d0dff3f706ff4a71df99b7526341ae9fa83e09

On Wednesday, 31.01.2024 at 17:32 +0100, Michael Laß wrote:
> Thank you Jeffrey for the detailed summary!
> 
> I finished bisecting the changes between Linux 6.6 and 6.7 and by now
> I think we were hunting multiple issues here.
> 
> 
> 1. A regression in Linux 6.7:
> 
> Bisecting led me to dc32bff195b45e8571c442954beee259e9500dac
> ("iov_iter, net: Fold in csum_and_memcpy()") being the first bad
> commit. With this change, my client cannot talk to my test cell at
> all; the cell runs in a VM on the same system. I think I spotted the
> mistake in that change and I just proposed a fix on the netdev
> mailing list:
> 
> https://lore.kernel.org/netdev/20240131155220.82641-1-bevan@bi-co.net/T/#u
> 
> With this patch applied on top of v6.7.2, access to my test cell
> works fine again. Note that this is likely not related to any MTU
> restrictions, as the traffic does not leave my home network.
> 
> 
> 2. Lost packets and significant delays due to MTU restrictions:
> 
> As Jeffrey explained, the MTU of my IPv4 connection is reduced to
> 1460 due to the tunneling over IPv6. When accessing a public cell
> with default settings, large reply packets are lost on their way. At
> some point in time (and for unknown reasons) the packets start to
> arrive in fragments. From there on, the connection works fine,
> although likely not optimally due to fragmentation overhead.
> 
> I think that this problem already affected me with earlier kernel
> versions, as an initial access always took quite a while. I can only
> assume that problem no. 1 additionally influenced my tests with
> public cells and made things even worse.
> 
> This issue can easily be fixed by passing `-rxmaxmtu 1404` to afsd.
> Knowing about my internet connection, I will use this flag in the
> future.
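
One way to arrive at 1404 from the 1460-byte path MTU is to subtract
the per-packet header overhead. The short Python sketch below assumes
an IPv4 header without options (20 bytes), a UDP header (8 bytes) and
a 28-byte Rx wire header. Both the Rx header size and the idea that
the value was derived this way are assumptions and are not stated in
this thread.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
UDP_HEADER = 8    # bytes
RX_HEADER = 28    # bytes, assumed size of the Rx wire header


def max_rx_payload(path_mtu: int) -> int:
    """Largest Rx payload that fits in one unfragmented IPv4 packet."""
    return path_mtu - IPV4_HEADER - UDP_HEADER - RX_HEADER


if __name__ == "__main__":
    print(max_rx_payload(1460))  # 1404
```

With a standard 1500-byte Ethernet MTU the same calculation gives 1444,
so a limit of 1404 keeps each Rx packet small enough to cross the
tunnel without fragmentation under these assumptions.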
> 
> 
> I will continue testing for a bit to see if there are any remaining
> issues.
> 
> Best regards,
> Michael