[OpenAFS] OpenAFS client softlockup on highly concurrential file-system patterns (100% CPU in kernel mode)

Mark Vitale mvitale@sinenomine.net
Wed, 20 Nov 2019 17:03:36 +0000


Ciprian,

> On Nov 19, 2019, at 4:37 PM, Ciprian Dorin Craciun <ciprian.craciun@gmail.com> wrote:
>
> On Tue, Nov 19, 2019 at 10:38 PM Ciprian Dorin Craciun
> <ciprian.craciun@gmail.com> wrote:
>> At the following link you can find an extract of `dmesg` after the
>> sysrq trigger.
>>
>>  https://scratchpad.volution.ro/ciprian/f89fc32a0bbd0ae6d6f3edbbc3ee111c/b9c3bc4f795bbe9e7eaca93b0a57bea0.txt
>
>
> I forgot to mention that in this case the CPU didn't go up to 100%, in
> fact it was quite "quiet".  (The 100% CPU seems to happen only after a
> process "blocks" and I try to `SIGTERM` or `SIGKILL` it.)

Thank you for the backtraces.  I agree that 'gm' is the problematic thread;
it appears to be stuck in rxi_WriteProc waiting for the Rx packet transmit
window to advance.  That is, it's waiting for acknowledgments - probably from
the fileserver.  Unfortunately the rest of the backtrace seems muddled and so
we can't tell exactly what the client was doing.  In fact, many of the
backtraces are incomplete.  However, I can tell that all the cache manager
kernel threads (housekeeping et al) are in a normal/idle state.

I did some preliminary searches through the Linux kernel git repo for recent
changes in 5.3.9 and older, but didn't see anything that seemed relevant.

If I have some time later this week, I may try to reproduce this issue.
However, there's no guarantee I will be able to do so, so it would be better
if we could either obtain more information from your site, or if you could
narrow the problem down to a simpler test case.

Do you have FileLogs and/or fileserver audit logs for the time in question?

Thanks,
--
Mark Vitale
Sine Nomine Associates
20 Years of Customer Success