[OpenAFS] OpenAFS client softlockup on highly concurrential file-system patterns (100% CPU in kernel mode)

Mark Vitale mvitale@sinenomine.net
Wed, 20 Nov 2019 17:49:07 +0000


> On Nov 20, 2019, at 12:17 PM, Ciprian Dorin Craciun <ciprian.craciun@gmail.com> wrote:
>
>
>> Do you have FileLogs and/or fileserver audit logs for the time in question?
>
> Yes, I do have access to them.
>
> The following is the syslog output from OpenAFS server in a 5 minute
> time-window to the stacktrace sent yesterday:
> ~~~~
> FindClient: stillborn client 0x7fe9b0012dc0(77749fe8); conn
> 0x7fe9d800e390 (host 172.30.214.35:7001) had client
> 0x7fe9b00131d0(77749fe8)
> FindClient: stillborn client 0x7fe9b00132a0(77749fec); conn
> 0x7fe9d800e660 (host 172.30.214.35:7001) had client
> 0x7fe9b0012dc0(77749fec)
> FindClient: stillborn client 0x7fe9b0013030(77749fec); conn
> 0x7fe9d800e660 (host 172.30.214.35:7001) had client
> 0x7fe9b0012dc0(77749fec)
> FindClient: stillborn client 0x7fe9b0012cf0(77749fec); conn
> 0x7fe9d800e660 (host 172.30.214.35:7001) had client
> 0x7fe9b0012dc0(77749fec)
> ~~~~
>
> No information is present in `/var/log/openafs` in that timeframe.
>
> The following are the arguments of `fileserver`:
> ~~~~
> -syslog -sync always -p 4 -b 524288 -l 524288 -s 1048576 -vc 4096 -cb
> 1048576 -vhandle-max-cachesize 32768 -jumbo -udpsize 67108864
> -sendsize 67108864 -rxmaxmtu 9000 -rxpck 4096 -busyat 65536

I see some areas of concern here.  First of all, many of your parameters
indicate that you expect to run relatively high load through this fileserver.
Yet there are only -p 4 server threads defined.  The fileserver will
automatically increase this to the minimum of 6, but that still seems quite low.
This low thread count, combined with a very large -busyat value,
means that this fileserver will queue a very large backlog before returning
VBUSY to the client.  Is there a reason you need to keep the fileserver
thread count so low?  Would it be possible for you to increase it dramatically
(perhaps to 100) and try the test again?
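
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python.  It is not OpenAFS code: the function names are hypothetical, it
assumes every thread serves client calls, and the 10 ms service time is
purely an illustrative guess, not a measurement from your system.

```python
# Toy model of the backlog implied by -p and -busyat.
# Assumption: all threads serve client calls; names are hypothetical.

MIN_THREADS = 6  # minimum thread count the fileserver raises -p to (see above)


def effective_threads(requested):
    """Thread count after the fileserver applies its minimum."""
    return max(requested, MIN_THREADS)


def queued_per_thread(requested, busyat):
    """Waiting calls per thread at the point VBUSY would first be returned."""
    return busyat / effective_threads(requested)


def drain_seconds(requested, busyat, service_time_s):
    """Time to work off a full backlog, given an assumed per-call service time."""
    return busyat * service_time_s / effective_threads(requested)


if __name__ == "__main__":
    busyat = 65536          # from the fileserver arguments quoted above
    service_time_s = 0.01   # assumed 10 ms per call, illustrative only
    for p in (4, 100):
        print(
            f"-p {p}: {effective_threads(p)} threads, "
            f"{queued_per_thread(p, busyat):.0f} queued calls per thread, "
            f"{drain_seconds(p, busyat, service_time_s):.1f} s to drain"
        )
```

Under those assumptions, -p 4 leaves roughly 10,923 queued calls per thread
before VBUSY, versus about 655 with 100 threads.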

Regards,
--
Mark Vitale
mvitale@sinenomine.net