[OpenAFS] Solaris AFS client down - why does this happen

Benjamin Kaduk kaduk@MIT.EDU
Wed, 21 Oct 2015 00:41:03 -0400 (EDT)


On Mon, 19 Oct 2015, Karl Behler wrote:

> Dear All,
>
> we experience unwanted "shutdown" events of our OpenAFS 1.6.9 clients under
> Solaris 10.
>
> We have run this client since October last year without problems on ten
> Solaris desktop servers, which reboot regularly on weekends. Recently, nearly
> half of these servers had crash-like events in the middle of the week.
>
> The log file (/var/adm/messages) contains kernel messages that look like a
> shutdown, seemingly initiated by afsd itself.

Not necessarily; I think this is the output regardless of what triggers
the shutdown.  In particular, there is no error-handling code in the
kernel module that triggers a shutdown if some class of error occurs.  So,
while I can imagine a few scenarios where this would "spontaneously"
happen, they all involve the interplay between several different
(hypothetical) bugs, and that is not what I would jump to as my first
conclusion.  So, maybe try to instrument umount calls on the systems in
question, I guess, to find out what is actually triggering the shutdown.
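
Before instrumenting anything, it may help to pin down exactly when each
shutdown began on each host, so the times can be compared across machines.
The following is only a minimal Python sketch: the function name and the
regular expression are illustrative assumptions, and the two marker strings
are simply taken from the log quoted below.

```python
import re

# Marker strings copied from the quoted /var/adm/messages excerpt.
SHUTDOWN_START = "afs: WARM"
FLUSH_FAILURE = "Failed to flush"

# Assumed syslog layout: "Mon DD HH:MM:SS host rest-of-message".
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+(.*)$")


def summarize_shutdowns(lines):
    """Return one (timestamp, host, flush_failures) tuple per shutdown start."""
    events = []
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip lines that do not look like syslog entries
        stamp, host, rest = match.groups()
        if SHUTDOWN_START in rest:
            # A new shutdown sequence begins on this host.
            events.append([stamp, host, 0])
        elif FLUSH_FAILURE in rest and events and events[-1][1] == host:
            # Count flush failures against the most recent shutdown.
            events[-1][2] += 1
    return [tuple(event) for event in events]
```

Passing an open handle on /var/adm/messages (or a concatenation of the files
from several hosts) to summarize_shutdowns() yields one tuple per event, which
can then be lined up against whatever the umount instrumentation records.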

-Ben

> (In the following log the real event starts at Oct 16 11:54:47)
>
> Oct 16 11:35:39 sxaug37 genunix: [ID 900631 kern.notice] afs: byte-range
> lock/unlock ignored; make sure no one else is running this program (pid 23006
> (thunderbird-bin), user 13471, fid 1108706165.12934.344145).
> Oct 16 11:39:23 sxaug37 genunix: [ID 900631 kern.notice] afs: byte-range
> lock/unlock ignored; make sure no one else is running this program (pid 22054
> (firefox-bin), user 6570, fid 1108604831.175334.13229850).
> Oct 16 11:49:23 sxaug37 last message repeated 1 time
> Oct 16 11:54:47 sxaug37 genunix: [ID 146023 kern.notice] afs: WARM
> Oct 16 11:54:47 sxaug37 genunix: [ID 510892 kern.notice] shutting down of:
> vcaches...
> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush
> vcache 0x28e2f840
> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush
> vcache 0x2924b960
> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush
> vcache 0x28114c00
> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush
> vcache 0x27d49000
> ... several hundred similar messages
> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush
> vcache 0x2811dbc0
> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush
> vcache 0x28a53c60
> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush
> vcache 0x27e10460
> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush
> vcache 0x289fad40
> Oct 16 11:54:47 sxaug37 genunix: [ID 364168 kern.notice] BkG...
> Oct 16 11:54:47 sxaug37 genunix: [ID 338304 kern.notice] CB...
> Oct 16 11:54:47 sxaug37 genunix: [ID 543876 kern.notice] afs...
> Oct 16 11:54:47 sxaug37 genunix: [ID 229921 kern.notice] CTrunc...
> Oct 16 11:54:47 sxaug37 genunix: [ID 916331 kern.notice] AFSDB...
> Oct 16 11:54:47 sxaug37 genunix: [ID 196290 kern.notice] RxEvent...
> Oct 16 11:54:48 sxaug37 genunix: [ID 687192 kern.notice] UnmaskRxkSignals...
> Oct 16 11:54:48 sxaug37 genunix: [ID 346748 kern.notice] RxListener...
> Oct 16 11:54:48 sxaug37 genunix: [ID 890369 kern.notice] NetIfPoller...
> Oct 16 11:54:48 sxaug37 genunix: [ID 288918 kern.notice] WARNING: not all
> blocks freed: large 0 small 217
> Oct 16 11:54:48 sxaug37 genunix: [ID 646860 kern.notice]  ALL allocated
> tables...
> Oct 16 11:54:48 sxaug37 genunix: [ID 773001 kern.notice] done
> Oct 16 11:58:24 sxaug37 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10
> Version Generic_150401-28 64-bit
> Oct 16 11:58:24 sxaug37 genunix: [ID 282658 kern.notice] Copyright (c) 1983,
> 2015, Oracle and/or its affiliates. All rights reserved.
>
> Sometimes the system reboots immediately and sometimes the system stays in a
> state where all attempts to access AFS end with I/O Error.
>
> Any idea what is happening and what to do?
>
> Best regards,
>
> Karl
>
> --
> Dr. Karl Behler
> CODAC & IT services ASDEX Upgrade
> phon +49 89 3299-1351 fax 3299-961351