[OpenAFS] Solaris AFS client down - why does this happen

Karl Behler karl.behler@ipp.mpg.de
Wed, 27 Jan 2016 19:40:55 +0100


Dear Mark,

sorry for my late answer; I'm still in a hurry. Unfortunately there are 
no log files left from October.
However, I should mention that we had several similar events during the 
last week (and from time to time before that) which coincide with 
shutdowns or end in shutdowns (I don't know which comes first).
I could prepare the message logs if you think it's worthwhile, but not 
before tomorrow.

Best regards,

Karl



On 25.01.16 19:22, Mark Vitale wrote:
> On Nov 4, 2015, at 11:50 AM, Karl Behler <karl.behler@ipp.mpg.de> wrote:
>
>> Dear Mark and Ben,
>>
>> thanks for your response. We could not determine which component in our system may have caused the "umount",
>> but since then it has never happened again. I think we will move to a newer version of the client and then see what happens.
> When you first reported this, I focused on the possible reasons for the shutdown, and inadvertently overlooked the panic/hang that you also reported.  But recently while doing some Solaris testing I discovered a bug in the OpenAFS Solaris shutdown code.  Unlike you, my shutdowns were intentional; but like you, I saw "Failed to flush vcache" messages and a panic after shutdown.   At that point I remembered your email and realized you had probably encountered the same panic I did.  However, the only way to be sure would be to look for the panic messages in your syslog.  If you still have those (from back in October), it would be helpful to see them.
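> In rough outline (this is only a simplified sketch with illustrative names, not the literal OpenAFS source), the shutdown path walks the vcache table and tries to flush every entry, warning when a flush fails; the panic presumably comes afterwards, when entries that could not be flushed are referenced again:
>
>     /* Simplified sketch of a shutdown-time vcache flush loop.
>      * The structure and helper names below are illustrative, not the
>      * actual OpenAFS identifiers. */
>     struct vcache {
>         struct vcache *next;
>         int refcount;           /* nonzero while the vnode is still in use */
>     };
>
>     extern struct vcache *vcache_list;
>     extern void afs_warn(const char *fmt, ...);
>
>     /* Hypothetical helper: returns 0 on success, nonzero if the entry
>      * could not be flushed (e.g. still referenced by an open file). */
>     static int try_flush_vcache(struct vcache *vc)
>     {
>         return (vc->refcount != 0) ? -1 : 0;
>     }
>
>     static void shutdown_vcaches(void)
>     {
>         struct vcache *vc;
>
>         for (vc = vcache_list; vc != NULL; vc = vc->next) {
>             if (try_flush_vcache(vc) != 0)
>                 afs_warn("Failed to flush vcache 0x%lx\n", (unsigned long)vc);
>             /* If entries that failed to flush are torn down anyway, a
>              * later reference to them can panic the machine. */
>         }
>     }
>
> Again, the names above are only for illustration; the point is simply that a failed flush is logged and the shutdown then proceeds regardless.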
>
> Regardless, I'm able to duplicate the problem quite easily.
>
> I've opened https://rt.central.org/rt/Ticket/Display.html?id=132689 and I am working on an upstream fix for this.
>
> Regards,
> --
> Mark Vitale
> Sine Nomine Associates
>
>
>>> On Oct 16, 2015, at 10:46 AM, Karl Behler <karl.behler@ipp.mpg.de> wrote:
>>>
>>>> we are experiencing unwanted "shutdown" events on our OpenAFS 1.6.9 clients under Solaris 10.
>>>>
>>>> We have been running this client since October of last year without problems on ten Solaris desktop servers, which reboot regularly on weekends. Recently, however, nearly half of these servers crashed in the middle of the week.
>>>>
>>>> The log file (/var/adm/messages) contains kernel messages that look like a shutdown initiated by afsd itself.
>>>> (In the following log the real event starts at Oct 16 11:54:47)
>>>>
>>>> Oct 16 11:35:39 sxaug37 genunix: [ID 900631 kern.notice] afs: byte-range lock/unlock ignored; make sure no one else is running this program (pid 23006 (thunderbird-bin), user 13471, fid 1108706165.12934.344145).
>>>> Oct 16 11:39:23 sxaug37 genunix: [ID 900631 kern.notice] afs: byte-range lock/unlock ignored; make sure no one else is running this program (pid 22054 (firefox-bin), user 6570, fid 1108604831.175334.13229850).
>>>> Oct 16 11:49:23 sxaug37 last message repeated 1 time
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 146023 kern.notice] afs: WARM
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 510892 kern.notice] shutting down of: vcaches...
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x28e2f840
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x2924b960
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x28114c00
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x27d49000
>>>> ... several hundred similar messages
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x2811dbc0
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x28a53c60
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x27e10460
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 159345 kern.notice] Failed to flush vcache 0x289fad40
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 364168 kern.notice] BkG...
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 338304 kern.notice] CB...
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 543876 kern.notice] afs...
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 229921 kern.notice] CTrunc...
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 916331 kern.notice] AFSDB...
>>>> Oct 16 11:54:47 sxaug37 genunix: [ID 196290 kern.notice] RxEvent...
>>>> Oct 16 11:54:48 sxaug37 genunix: [ID 687192 kern.notice] UnmaskRxkSignals...
>>>> Oct 16 11:54:48 sxaug37 genunix: [ID 346748 kern.notice] RxListener...
>>>> Oct 16 11:54:48 sxaug37 genunix: [ID 890369 kern.notice] NetIfPoller...
>>>> Oct 16 11:54:48 sxaug37 genunix: [ID 288918 kern.notice] WARNING: not all blocks freed: large 0 small 217
>>>> Oct 16 11:54:48 sxaug37 genunix: [ID 646860 kern.notice]  ALL allocated tables...
>>>> Oct 16 11:54:48 sxaug37 genunix: [ID 773001 kern.notice] done
>>>> Oct 16 11:58:24 sxaug37 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_150401-28 64-bit
>>>> Oct 16 11:58:24 sxaug37 genunix: [ID 282658 kern.notice] Copyright (c) 1983, 2015, Oracle and/or its affiliates. All rights reserved.
>>>>
>>>> Sometimes the system reboots immediately, and sometimes it stays in a state where all attempts to access AFS end with an I/O error.
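>>>> My assumption (only an assumption) is that in the second case the cache manager has already shut itself down while the mount point is still in place, so every operation on /afs is simply refused. In pseudo-C the effect would be roughly the following (illustrative names only, not the real client code):
>>>>
>>>>     #include <errno.h>
>>>>
>>>>     /* Illustrative only: once a global "shut down" flag is set, every
>>>>      * file-system entry point bails out with EIO, which is what user
>>>>      * programs report as an I/O error when they touch /afs afterwards. */
>>>>     static int afs_has_shut_down;   /* set at the end of the shutdown sequence */
>>>>
>>>>     static int afs_vnodeop_guard(void)
>>>>     {
>>>>         if (afs_has_shut_down)
>>>>             return EIO;     /* every access to /afs fails with an I/O error */
>>>>         return 0;           /* normal operation continues */
>>>>     }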


-- 
Dr. Karl Behler	
CODAC & IT services ASDEX Upgrade
phone +49 89 3299-1351, fax 3299-961351