[OpenAFS-devel] fileserver problem

Thomas Mueller thomas.mueller@hrz.tu-chemnitz.de
Mon, 29 Oct 2001 14:51:45 +0100 (MET)


Hi all,

Today we had an outage of a fileserver (i386_linux22) running Red Hat 6.2
and OpenAFS-1.2.2.

Suddenly the load increased to 15 or even more. Clients stopped working.

I invoked "kill -TSTP" to increase the fileserver's log level and "kill -XCPU"
to get the dumps.

/usr/afs/logs/FileLog:
...
Mon Oct 29 11:48:53 2001 Host 853c6d8 used to support WhoAreYou, deleting.
Mon Oct 29 11:53:39 2001 Set Debug On level = 1
Mon Oct 29 11:53:40 2001 Set Debug On level = 5
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 5b3c6d86
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 5b3c6d86
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 5b3c6d86
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 5b3c6d86
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 5b3c6d86
...
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 6ec86d86
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 5b3c6d86
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 5b3c6d86
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 5b3c6d86
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 5b3c6d86
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 5b3c6d86
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 5b3c6d86
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 5b3c6d86
Mon Oct 29 11:53:40 2001 GSS: Delete longest inactive host 5b3c6d86
...

These lines repeated until I restarted the fileserver
(about 30000 times within 1 minute).
Only two IP addresses appear in them (0x866d3c5b and 0x866dc86e).
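
For reference, the host IDs in the log lines can be turned into dotted-quad
addresses. This is a minimal sketch, assuming the log prints the network-order
IPv4 address as a 32-bit integer on a little-endian machine, so the bytes
appear reversed (which is consistent with the two values given above). The
function name log_host_to_ip is made up for illustration.

```python
import socket
import struct

def log_host_to_ip(hex_host: str) -> str:
    """Convert a host ID as printed in FileLog (e.g. '5b3c6d86') to dotted-quad form.

    Assumption: the value is a network-order IPv4 address printed as a
    little-endian 32-bit integer, so the bytes appear in reverse order.
    """
    value = int(hex_host, 16)
    # Packing as little-endian restores the original network byte order.
    return socket.inet_ntoa(struct.pack("<I", value))

for host in ("5b3c6d86", "6ec86d86"):
    print(host, "->", log_host_to_ip(host))
# 5b3c6d86 -> 134.109.60.91
# 6ec86d86 -> 134.109.200.110
```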

/usr/afs/local/hosts.dump:
...
ip:5b3c6d86 port:22811 hidx:1321 cbid:63416 lock:ffffffff last:1004352554 active:1004352554 down:0 del:0 cons:2 cldel:0
         hpfailed:0 hcpsCall:1004351695 hcps [ -656 -212] [] holds: 3ef65b1000000000000 slot/bit: 0/1
...
ip:6ec86d86 port:22811 hidx:62 cbid:56533 lock:ffffffff last:1004352554 active:1004352554 down:0 del:0 cons:2 cldel:0
         hpfailed:0 hcpsCall:1004351271 hcps [ -656 -431 -212] [ 6ec86d86] holds: 1000101000000000000 slot/bit: 0/1
...

I noticed that all 1176 other entries in this file have the value "cbid:0".
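
A check like this can be scripted. The following is only a sketch, assuming
each hosts.dump entry starts with a line of the form shown above
("ip:... port:... hidx:... cbid:..."); the helper name hosts_with_callbacks is
made up, and the third sample line is a hypothetical entry added purely to
show the filtering.

```python
import re

# Matches the first line of a hosts.dump entry, e.g.
# "ip:5b3c6d86 port:22811 hidx:1321 cbid:63416 lock:ffffffff ..."
ENTRY_RE = re.compile(
    r"^ip:(?P<ip>[0-9a-f]+) port:(?P<port>\d+) .*?\bcbid:(?P<cbid>\d+)"
)

def hosts_with_callbacks(lines):
    """Yield (ip, port, cbid) for every entry whose cbid is nonzero."""
    for line in lines:
        m = ENTRY_RE.match(line)
        if m and int(m.group("cbid")) != 0:
            yield m.group("ip"), int(m.group("port")), int(m.group("cbid"))

sample = [
    "ip:5b3c6d86 port:22811 hidx:1321 cbid:63416 lock:ffffffff last:1004352554",
    "ip:6ec86d86 port:22811 hidx:62 cbid:56533 lock:ffffffff last:1004352554",
    # Hypothetical entry, not taken from the real dump:
    "ip:00000000 port:7001 hidx:1 cbid:0 lock:0 last:0",
]

for ip, port, cbid in hosts_with_callbacks(sample):
    print(ip, port, cbid)
```

To run it against the real file, pass an open file object for
/usr/afs/local/hosts.dump instead of the sample list.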

/usr/afs/local/clients.dump:
...
Host 5b3c6d86.22811 down = 0, LastCall Mon Oct 29 11:49:14 2001
    user id=42124,  name=nfu, sl=Authenticated till Tue Oct 30 09:50:03 2001
      CPS-5 is []
...
Host 6ec86d86.22811 down = 0, LastCall Mon Oct 29 11:49:14 2001
    user id=32766,  name=anonymous, sl=Not authenticated till No Limit
      CPS-2 is []
    user id=4799,  name=erm, sl=Authenticated till Tue Oct 30 12:55:58 2001
      CPS-8 is []
    user=anonymous, no current server connection
      CPS-2 is []
    user=afs_cron, no current server connection
      CPS-3 is []
    user=afs_cron, no current server connection
      CPS-3 is []
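
The "last:" values in hosts.dump and the "LastCall" times in clients.dump can
be cross-checked. This is a small sketch, assuming "last:" is a Unix timestamp
in seconds and the server clock was on MET (UTC+1), as the date header of this
mail suggests; the function name epoch_to_met is made up.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the fileserver logs in MET, i.e. UTC+1 on this date.
MET = timezone(timedelta(hours=1), "MET")

def epoch_to_met(seconds: int) -> str:
    """Format a Unix timestamp the way the fileserver logs print times."""
    return datetime.fromtimestamp(seconds, tz=MET).strftime("%a %b %d %H:%M:%S %Y")

# "last:1004352554" from both hosts.dump entries shown above
print(epoch_to_met(1004352554))
# Mon Oct 29 11:49:14 2001
```

Under these assumptions the value matches the LastCall time of both hosts,
which is a few minutes before the debug level was raised.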

/usr/afs/local/callback.dump was not written

We have seen this several times during the last few weeks
(I think since we started running OpenAFS-1.1.1 on this server),
but this time I was able to gather some logs.

Do you have any hints?

Thanks,
Thomas.
-- 
-----------------------------------------------------------------------
Thomas Müller, TU Chemnitz, Universitätsrechenzentrum, D-09107 Chemnitz
mail: Thomas.Mueller@hrz.tu-chemnitz.de
-----------------------------------------------------------------------