[OpenAFS] OpenAFS client softlockup on highly concurrential
file-system patterns (100% CPU in kernel mode)
Ciprian Dorin Craciun
ciprian.craciun@gmail.com
Tue, 19 Nov 2019 22:38:04 +0200
On Tue, Nov 19, 2019 at 5:10 PM Ciprian Dorin Craciun
<ciprian.craciun@gmail.com> wrote:
> > # echo t > /proc/sysrq-trigger
At the following link you can find an extract of `dmesg` after the
sysrq trigger.
https://scratchpad.volution.ro/ciprian/f89fc32a0bbd0ae6d6f3edbbc3ee111c/b9c3bc4f795bbe9e7eaca93b0a57bea0.txt
(I have filtered out processes that don't have `afs` in their name,
mainly because the full dump exposes all of my workstation's
processes. However, I can privately provide the complete file.)
The following is the stack trace of the process that gets stuck (it
took roughly 25 minutes to complete, and the hang does not depend on
the input file):
~~~~
gm S 0 27572 27562 0x80000000
Call Trace:
? __schedule+0x2be/0x6d0
schedule+0x39/0xa0
afs_cv_wait+0x10a/0x300 [libafs]
? wake_up_q+0x60/0x60
rxi_WriteProc+0x21d/0x410 [libafs]
? rxfs_storeUfsWrite+0x55/0xb0 [libafs]
? afs_GenericStoreProc+0x11a/0x1f0 [libafs]
? afs_CacheStoreDCaches+0x1a9/0x5b0 [libafs]
? afs_CacheStoreVCache+0x32c/0x680 [libafs]
? __filemap_fdatawrite_range+0xca/0x100
? afs_osi_Wakeup+0xb/0x60 [libafs]
? afs_UFSGetDSlot+0xf6/0x4f0 [libafs]
? afs_StoreAllSegments+0x725/0xc20 [libafs]
? afs_linux_flush+0x486/0x4e0 [libafs]
? filp_close+0x32/0x70
? __x64_sys_close+0x1e/0x50
? do_syscall_64+0x6e/0x200
? entry_SYSCALL_64_after_hwframe+0x49/0xbe
~~~~
On a second attempt (which also locked up) the following is the stack
trace, again only for the blocked process; the two traces look almost
identical:
~~~~
gm S 0 30548 30545 0x80004000
Call Trace:
? __schedule+0x2be/0x6d0
schedule+0x39/0xa0
afs_cv_wait+0x10a/0x300 [libafs]
? wake_up_q+0x60/0x60
rxi_WriteProc+0x21d/0x410 [libafs]
? rxfs_storeUfsWrite+0x55/0xb0 [libafs]
? afs_GenericStoreProc+0x11a/0x1f0 [libafs]
? afs_CacheStoreDCaches+0x1a9/0x5b0 [libafs]
? afs_CacheStoreVCache+0x32c/0x680 [libafs]
? __filemap_fdatawrite_range+0xca/0x100
? afs_osi_Wakeup+0xb/0x60 [libafs]
? afs_UFSGetDSlot+0xf6/0x4f0 [libafs]
? afs_StoreAllSegments+0x725/0xc20 [libafs]
? afs_linux_flush+0x486/0x4e0 [libafs]
? filp_close+0x32/0x70
? __x64_sys_close+0x1e/0x50
? do_syscall_64+0x6e/0x200
? entry_SYSCALL_64_after_hwframe+0x49/0xbe
~~~~
I can reliably trigger the issue (roughly 50% of the attempts) by just
doing the following (a shell sketch of these steps follows the list):
* remove a few output files (in my case ~15), which triggers the
rebuild of roughly twice that many targets;
* start the build with a maximum concurrency of 8 processes;
* all the processes execute similar jobs, with similarly sized inputs
and outputs and similar CPU time.
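For completeness, the reproduction looks roughly like this (the volume
path, file names and target counts below are illustrative assumptions,
not the actual build files):
~~~~
# illustrative reproduction sketch; the volume path and output names
# are assumptions, the actual build is driven by ninja rules
cd /afs/example.org/my-volume/build

# remove ~15 generated outputs; ninja will then rebuild roughly twice
# that many targets (the removed outputs plus their dependents)
ls outputs/*.png | head -n 15 | xargs rm --

# rebuild with at most 8 concurrent jobs, each running a similar `gm`
# command over similarly sized inputs
ninja -j 8
~~~~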
Based on `htop`, I would say that neither `ninja` (which does the
heavy `stat`-ing) nor `gm` (GraphicsMagick, an ImageMagick
alternative) is multi-threaded.
The build procedure involves the following AFS-related operations for
each output (a rough shell sketch follows the list):
* check if the output exists, and if so `rm` it;
* create an `output.tmp` file;
* move `output.tmp` over `output`.
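In shell terms, each job does roughly the following (the file names
and the `gm` arguments are illustrative, not the actual build
commands):
~~~~
# per-output AFS operations, roughly; names and arguments illustrative
if [ -e output.png ]; then
    rm -- output.png                                  # remove the stale output
fi
gm convert input.png -resize 50% png:output.png.tmp  # create the temporary output
mv -- output.png.tmp output.png                       # move it over the final name
~~~~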
No other processes are actively using AFS (except `mc` and a couple of
`bash` instances which have their `cwd` inside an AFS volume). (The
`[nodaemon]` process is a simple tool that uses
`prctl(PR_SET_CHILD_SUBREAPER)` to catch double-forking processes, and
it also has its `cwd` inside AFS.)
Hope it helps,
Ciprian.