[OpenAFS] OpenAFS client softlockup on highly concurrential file-system patterns (100% CPU in kernel mode)

Ciprian Dorin Craciun ciprian.craciun@gmail.com
Wed, 20 Nov 2019 19:17:54 +0200


On Wed, Nov 20, 2019 at 7:03 PM Mark Vitale <mvitale@sinenomine.net> wrote:
> Thank you for the backtraces.  I agree that 'gm' is the problematic thread;
> it appears to be stuck in rxi_WriteProc waiting for the Rx packet transmit window
> to advance.  That is, it's waiting for acknowledgments - probably from the fileserver.


It's true that the test was performed over wireless; however, I have
encountered the same behaviour over Gigabit LAN as well.
(This is a personal setup, including the server, the network and the
client, and there was little to no other load on any of them.)


> Unfortunately the rest of the backtrace seems muddled and so we can't tell exactly
> what the client was doing.  In fact, many of the backtraces are incomplete.

I haven't deleted anything from within any individual process's stack
trace.  However, I did remove the processes that have nothing to do with
AFS, or whose stack didn't contain `afs`.

(If you think it would be useful, I can privately send you the
complete, unedited output.)


> If I have some time later this week, I may try to reproduce this issue.
> However, there's no guarantee I will be able to do so, so it would be better
> if we could either obtain more information from your site, or if you could
> narrow the problem down to a simpler test case.

I'll try to reproduce this without the actual build system.  (Using
say `stat`, `cp` and `xargs`.)
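
Roughly what I have in mind is something like the following.  This is
only an untested sketch: the function name, the `src`/`dst` layout and
the default counts are placeholders made up for illustration, not taken
from the actual build system, and it assumes an `xargs` that supports
`-P` (as GNU findutils does).

```shell
# Hypothetical reproducer sketch: hit one directory tree with many
# concurrent `stat` and `cp` processes.  All names and defaults here are
# illustrative assumptions.
stress_tree() {
    dir="$1"              # directory to work in (e.g. somewhere under /afs)
    jobs="${2:-64}"       # number of parallel processes (assumed default)
    files="${3:-1000}"    # number of small files to create (assumed default)

    mkdir -p "$dir/src" "$dir/dst" || return 1

    # Create many small files in parallel.
    seq 1 "$files" |
        xargs -P "$jobs" -I{} sh -c 'echo "$1" > "$2/src/f$1"' _ {} "$dir"

    # Concurrent metadata lookups, one `stat` process per file.
    find "$dir/src" -type f -print0 |
        xargs -0 -P "$jobs" -n 1 stat > /dev/null

    # Concurrent copies, one `cp` process per file.
    find "$dir/src" -type f -print0 |
        xargs -0 -P "$jobs" -I{} cp {} "$dir/dst/"
}
```

It would be called as `stress_tree <directory> 64 1000`, with the
directory, the parallelism and the file count adjusted until the
lockup shows up (if it does at all with such a simple pattern).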


> Do you have FileLogs and/or fileserver audit logs for the time in question?

Yes, I do have access to them.

The following is the syslog output from the OpenAFS server, within
5 minutes of the stack trace I sent yesterday:
~~~~
FindClient: stillborn client 0x7fe9b0012dc0(77749fe8); conn 0x7fe9d800e390 (host 172.30.214.35:7001) had client 0x7fe9b00131d0(77749fe8)
FindClient: stillborn client 0x7fe9b00132a0(77749fec); conn 0x7fe9d800e660 (host 172.30.214.35:7001) had client 0x7fe9b0012dc0(77749fec)
FindClient: stillborn client 0x7fe9b0013030(77749fec); conn 0x7fe9d800e660 (host 172.30.214.35:7001) had client 0x7fe9b0012dc0(77749fec)
FindClient: stillborn client 0x7fe9b0012cf0(77749fec); conn 0x7fe9d800e660 (host 172.30.214.35:7001) had client 0x7fe9b0012dc0(77749fec)
~~~~

Nothing was logged under `/var/log/openafs` in that time frame.

The following are the arguments of `fileserver`:
~~~~
-syslog -sync always -p 4 -b 524288 -l 524288 -s 1048576 -vc 4096
-cb 1048576 -vhandle-max-cachesize 32768 -jumbo -udpsize 67108864
-sendsize 67108864 -rxmaxmtu 9000 -rxpck 4096 -busyat 65536
~~~~

(Yesterday, over wireless, I wasn't using jumbo frames; the day
before, when the same thing happened, I was using them.)

Ciprian.