[OpenAFS] OpenAFS client softlockup on highly concurrent file-system
patterns (100% CPU in kernel mode)
Ciprian Dorin Craciun
ciprian.craciun@gmail.com
Tue, 19 Nov 2019 13:53:59 +0200
A few days ago I encountered a very strange OpenAFS client issue that
basically manifests in two ways:
* either the processes accessing the file-system get "stuck" reading (or
perhaps opening) the files (although if one waits "long" enough, those
processes sometimes finally complete their job); in this case the CPU
doesn't go to 100%;
* or, if one tries to `SIGTERM` the stuck processes, the CPU goes to 100%
(on multiple cores) in kernel mode (again, if one waits long enough, the
system sometimes settles);
The usage pattern is as follows:
* it is a typical "build" scenario, where a `make`-like tool (in this case
`ninja`) heavily stats all files it knows about to find changed or missing
ones; (in my case there are about 90k files, all hosted on AFS; moreover
I suspect `ninja` tries to stat these on multiple threads;)
* there are a few processes that do CPU-bound tasks, each reading a file
(from AFS) and writing the output to another one (also on AFS); (changing
the concurrency level from 128 parallel processes down to 4 doesn't seem to
make much difference;)
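The stat-heavy part of this pattern can be approximated without `ninja`. The following is only a minimal sketch, not what `ninja` actually does: the helper names (`list_files`, `stat_one`, `stat_all`) and the thread count are assumptions made for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_files(root):
    """Collect the path of every file under root (hypothetical helper)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


def stat_one(path):
    """Stat a single path; return True on success, False on any OS error."""
    try:
        os.stat(path)
        return True
    except OSError:
        return False


def stat_all(paths, threads=8):
    """Stat all paths from a thread pool; return (ok_count, failed_count).

    The default of 8 threads is an assumption, not a value taken from ninja.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(stat_one, paths))
    ok = sum(results)
    return ok, len(results) - ok
```

Calling `stat_all(list_files(root))` with `root` pointing at a directory on AFS would issue the stats concurrently; whether this alone is enough to trigger the problem is untested.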
I was able to replicate this issue each time I tried to run the build and
send `SIGTERM`. After letting the whole build process run for a night, it
eventually completed.
My setup is as follows:
* OpenSUSE Tumbleweed, kernel 5.3.9-1-default, client package
`openafs-client` and `openafs-kmp-default` at `1.8.5_k5.3.9_1-1.3` as
provided by OpenSUSE;
* `afsd` parameters (neither the memory cache (on `tmpfs`) nor the disk cache
seems to help; nor does reducing the daemons from 4 to 1; encryption is off):
~~~~
-verbose -blocks 7864320 -chunksize 17 -files 524288 -files_per_subdir 128
-dcache 524288 -stat 524288 -volumes 128 -splitcache 90/10 -afsdb
-dynroot-sparse -fakestat-all -inumcalc md5 -backuptree -daemons 1
-rxmaxfrags 8 -rxmaxmtu 1500 -rxpck 4096 -nosettime
~~~~
~~~~
-verbose -memcache -blocks 1048576 -chunksize 17 -stat 524288 -volumes 128
-splitcache 90/10 -afsdb -dynroot-sparse -fakestat-all -inumcalc md5
-backuptree -daemons 1 -rxmaxfrags 8 -rxmaxmtu 1500 -rxpck 4096 -nosettime
~~~~
* the server is also on OpenSUSE (Leap 15.0), with the `openafs-server`
package at `1.8.0-lp150.2.2.1` as provided by OpenSUSE;
* I suspect that the issue may be due to the latest kernel version, because
I ran similar patterns a few weeks ago on an older kernel (still from the
`5.x` family) without this problem, but I can't say for sure;
I also tried the following:
* `fs flushall` seems to block just like the processes accessing the
file-system;
* the only way to "kill" the stuck processes is to disconnect the network,
and let them time out;
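One more thing that could be checked is which processes are blocked inside the kernel. The following is a minimal sketch, assuming Linux `/proc` and assuming the stuck processes show up in uninterruptible sleep (state `D`), which is not confirmed here. The function names are hypothetical.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited, or the line was not parseable
        if state == "D":
            found.append((pid, comm))
    return found
```

Running `blocked_processes()` while the build is stuck would show whether `ninja` and the worker processes are among the blocked ones.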
Any pointers on how to diagnose this?
Thanks,
Ciprian.