[OpenAFS-devel] 1.8.11pre1 client hanging on Linux 6.7

Michael Laß lass@mail.upb.de
Wed, 24 Jan 2024 19:49:02 +0100


Thanks Cheyenne for trying to reproduce this issue. We are both using
the exact same versions of the Linux kernel and OpenAFS, so the
difference in behavior is quite interesting. Unfortunately, I still
cannot really make sense of this problem. I am seeing two slightly
different failure modes:


1. When trying to access my test cell, which is actually a VirtualBox
VM running on the same machine as the client, `ls` hangs on the
following syscall:

openat(AT_FDCWD, "/afs/fritz.box", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY

Using Wireshark, I looked at the RX network traffic and it looks like
this:
https://homepages.upb.de/lass/openafs/RX_traffic_accessing_test_cell.png

So it looks like the server is sending a reply to the VLDB request
multiple times, because there is no acknowledgment from the client.


2. When trying to access a public cell, in this case desy.de, `ls` gets
past the `openat` syscall and hangs within getdents64:

openat(AT_FDCWD, "/afs/desy.de", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0755, st_size=6144, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 

Looking at the RX packets, the initial communication contains some
"Ack Delay" packets:
https://homepages.uni-paderborn.de/lass/openafs/RX_traffic_accessing_desy1.png

... and then seems to be stuck in a loop with "FS Reply"s and pings:
https://homepages.uni-paderborn.de/lass/openafs/RX_traffic_accessing_desy2.png
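
A hang like the two above shows up in an strace log as a call that was
entered but has no "= <result>" part. As a minimal sketch, assuming the
trace of a single process was saved to a file with `strace -o` (so the
lines carry no PID prefix), such calls could be listed as follows; the
function name `unfinished_syscalls` is made up for illustration:

```shell
# List syscalls in an strace log that were entered but never returned.
# Assumption: one syscall per line, no "[pid N]" prefix (no -f tracing).
unfinished_syscalls() {
    # Keep lines that look like "name(args...", then drop every line that
    # ends in a result such as " = 3", " = -1 ENOENT (...)", " = 0x7f..." or " = ?".
    grep -E '^[a-z_0-9]+\(' "$1" |
        grep -v -E ' = (-?[0-9]+|0x[0-9a-f]+|\?)( .*)?$'
}

# Usage (hypothetical log path):
#   unfinished_syscalls /tmp/ls.trace
```

For the second trace above this would print only the `getdents64(3, `
line.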


In this comparison, the server versions differ as well, which may
contribute to the difference in communication:

# fritz.box:
% rxdebug afs.fritz.box 7000 -version
Trying 192.168.178.230 (port 7000):
AFS version: OpenAFS 1.8.9-1-debian 2022-12-22

# desy.de:
% rxdebug 131.169.2.111 7000 -version
Trying 131.169.2.111 (port 7000):
AFS version: AuriStor 2021.05 built 2023-12-19

However, I saw behavior similar to desy.de with kth.se, which runs
OpenAFS 1.8.9.
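
To compare server versions across several cells in one go, the
`rxdebug` output shown above can be reduced to its version line. A rough
sketch, assuming the output format matches the two transcripts above;
`afs_version` is a hypothetical helper name:

```shell
# Extract the version string from `rxdebug <host> <port> -version`
# output read on stdin. Assumes a line of the form "AFS version: <text>".
afs_version() {
    awk -F': ' '/^AFS version:/ { print $2 }'
}

# Usage (needs rxdebug installed and network access to the servers):
#   for host in afs.fritz.box 131.169.2.111; do
#       printf '%s\t%s\n' "$host" "$(rxdebug "$host" 7000 -version | afs_version)"
#   done
```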


So far, I have not received any complaints from other Arch Linux users
who use my packages. So this may very well be an isolated issue that
only affects my system.

Best regards,
Michael


On Monday, 22.01.2024 at 09:19 -0700, Cheyenne Wills wrote:
> Typo in the instructions, sysrq_trigger -> sysrq-trigger, so the
> command is:
> 
>   `echo t | sudo tee /proc/sysrq-trigger`
> 
> The output will be in dmesg 
> 
> Having said that, I've not been able to duplicate the problem.  I did,
> however, notice a short hang when I lost contact with an AFS server
> that was not specified within the CellServDB.
> 
> 
> [ 2721.299326] afs: Lost contact with volume location server
> xxx.xx.xxx.xx in cell xxxxx.net (code -1)
> [ 2722.629824] afs: volume location server xxx.xx.xxx.xx in
> cell xxxxx.net is back up (code 0)
> 
> I'm running a vanilla 6.7 kernel with openafs-stable-1_8_x (which is
> 1.8.11pre1).
> 
>