[OpenAFS] Solaris AFS client down - why does this happen

Mark Vitale mvitale@sinenomine.net
Mon, 25 Jan 2016 18:22:42 +0000


On Nov 4, 2015, at 11:50 AM, Karl Behler <karl.behler@ipp.mpg.de> wrote:

> Dear Mark and Ben,
>
> thanks for your response. We could not find which component in our system may have caused the "umount".
> But since then it never happened again. I think we will go over to a newer version of the client and then see what happens.

When you first reported this, I focused on the possible reasons for the shutdown, and inadvertently overlooked the panic/hang that you also reported. But recently while doing some Solaris testing I discovered a bug in the OpenAFS Solaris shutdown code. Unlike you, my shutdowns were intentional; but like you, I saw "Failed to flush vcache" messages and a panic after shutdown. At that point I remembered your email and realized you had probably encountered the same panic I did. However, the only way to be sure would be to look for the panic messages in your syslog. If you still have those (from back in October), it would be helpful to see them.
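
For anyone wanting to check their own logs for the same pattern, here is a minimal sketch in Python. It counts the "Failed to flush vcache" lines and collects any line mentioning a panic. The function names `scan_messages` and `scan_file` are made up for illustration, and since the exact panic text is not shown in this thread, matching the bare word "panic" is only an assumption and may catch unrelated lines.

```python
import re

# Pattern taken from the kernel messages quoted in this thread.
FLUSH_RE = re.compile(r"Failed to flush vcache (0x[0-9a-fA-F]+)")
# Assumption: the panic text is not quoted in the thread, so match the
# word "panic" case-insensitively and review the hits by hand.
PANIC_RE = re.compile(r"panic", re.IGNORECASE)


def scan_messages(lines):
    """Return (flush_failure_count, panic_lines) for an iterable of syslog lines."""
    flush_failures = 0
    panic_lines = []
    for line in lines:
        if FLUSH_RE.search(line):
            flush_failures += 1
        if PANIC_RE.search(line):
            panic_lines.append(line.rstrip("\n"))
    return flush_failures, panic_lines


def scan_file(path="/var/adm/messages"):
    """Scan a syslog file; the default path is the one named in the report."""
    with open(path, errors="replace") as fh:
        return scan_messages(fh)
```

Calling `scan_file()` on an affected host (or on a rotated copy of the log from the time of the event) returns the count and the candidate panic lines for manual review.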

Regardless, I'm able to duplicate the problem quite easily.

I've opened https://rt.central.org/rt/Ticket/Display.html?id=132689 and I am working on an upstream fix for this.

Regards,
--
Mark Vitale
Sine Nomine Associates


>> On Oct 16, 2015, at 10:46 AM, Karl Behler <karl.behler@ipp.mpg.de> wrote:
>>
>>> we experience unwanted "shutdown" events of our OpenAFS 1.6.9 clients under Solaris 10.
>>>
>>> Running this client since October last year without problems on ten Solaris desktop servers which reboot regularly on weekends, we recently had kind of crashes on nearly half of these servers in the middle of a week.
>>>
>>> The log file (/var/adm/messages) contains kernel messages which look like a shutdown which seems to be initiated by the afsd itself.
>>> (In the following log the real event starts at Oct 16 11:54:47)
>>>
>>> Oct 16 11:35:39 sxaug37 genunix: [ID 900631 kern.notice] afs: byte-range lock/unlock ignored; make sure no one else is running this program (pid 23006 (thunderbird-bin), user 13471, fid 1108706165.12934.344145).
>>> Oct 16 11:39:23 sxaug37 genunix: [ID 900631 kern.notice] afs: byte-range lock/unlock ignored; make sure no one else is running this program (pid 22054 (firefox-bin), user 6570, fid 1108604831.175334.13229850).
>>> Oct 16 11:49:23 sxaug37 last message repeated 1 time
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 146023 kern.notice] afs: WARM
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 510892 kern.notice] shutting down of: vcaches...
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x28e2f840
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x2924b960
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x28114c00
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x27d49000
>>> ... several hundred similar messages
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x2811dbc0
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x28a53c60
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x27e10460
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x289fad40
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 364168 kern.notice] BkG...
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 338304 kern.notice] CB...
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 543876 kern.notice] afs...
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 229921 kern.notice] CTrunc...
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 916331 kern.notice] AFSDB...
>>> Oct 16 11:54:47 sxaug37 genunix: [ID 196290 kern.notice] RxEvent...
>>> Oct 16 11:54:48 sxaug37 genunix: [ID 687192 kern.notice] UnmaskRxkSignals...
>>> Oct 16 11:54:48 sxaug37 genunix: [ID 346748 kern.notice] RxListener...
>>> Oct 16 11:54:48 sxaug37 genunix: [ID 890369 kern.notice] NetIfPoller...
>>> Oct 16 11:54:48 sxaug37 genunix: [ID 288918 kern.notice] WARNING: not all blocks freed: large 0 small 217
>>> Oct 16 11:54:48 sxaug37 genunix: [ID 646860 kern.notice]  ALL allocated tables...
>>> Oct 16 11:54:48 sxaug37 genunix: [ID 773001 kern.notice] done
>>> Oct 16 11:58:24 sxaug37 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_150401-28 64-bit
>>> Oct 16 11:58:24 sxaug37 genunix: [ID 282658 kern.notice] Copyright (c) 1983, 2015, Oracle and/or its affiliates. All rights reserved.
>>>
>>> Sometimes the system reboots immediately and sometimes the system stays in a state where all attempts to access AFS end with I/O Error.