[OpenAFS] OpenAFS client softlockup on highly concurrential file-system patterns (100% CPU in kernel mode)

Mark Vitale mvitale@sinenomine.net
Tue, 19 Nov 2019 15:06:36 +0000


Ciprian,

> On Nov 19, 2019, at 6:53 AM, Ciprian Dorin Craciun <ciprian.craciun@gmail.com> wrote:
>
> A few days ago I have encountered a very strange OpenAFS client issue that
> basically exhibits in two ways:
>
> * either the processes accessing the file-system get "stuck" reading (or
> perhaps opening) the files; (although if one waits "long" enough, sometimes
> those processes will finally complete their job;)  (in this case the CPU
> doesn't go to 100%;)
>
> * either if one tries to `SIGTERM` the stuck processes, the CPU goes to 100%
> (on multiple cores) in kernel mode;  (again, sometimes if one waits long
> enough, the system settles;)
>
<snip>
>
> Any pointers on how to diagnose this?

If you had a true soft lockup, there should be a kernel "soft lockup" warning in the syslog.
If you don't see anything there, you could try this while the hang is occurring:

# echo t > /proc/sysrq-trigger

This dumps the state and kernel backtrace of every task into the kernel log, which ends up in the syslog.
Either way, if you could put whatever you find in the syslog on a paste site of your
choice and post the link here, that would be helpful.
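As a minimal sketch, assuming a Linux host where the kernel log can be read with `dmesg` (usually as root), a small filter like the hypothetical `find_lockups` function below can pull soft-lockup and hung-task reports out of that log before pasting it. The match patterns are assumptions based on common kernel watchdog wording and may need adjusting for a given kernel version; the full sysrq dump is still worth pasting as-is.

```shell
# Hypothetical helper (not an OpenAFS tool): print kernel log lines that
# look like a soft-lockup report or a hung (blocked) task report.
# Reads the files given as arguments, or stdin when none are given.
# The patterns are assumptions and may differ between kernel versions.
find_lockups() {
    grep -iE 'soft lockup|blocked for more than [0-9]+ seconds' "$@"
}

# Example usage (as root, while the hang is occurring):
#   echo t > /proc/sysrq-trigger   # dump all task backtraces to the kernel log
#   dmesg | find_lockups           # show only the lockup / hung-task lines
```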


Regards,
--
Mark Vitale
Sine Nomine Associates
20 Years of Customer Success