[OpenAFS] File Server appears to stop responding

Randy Kemp rkemp@srhs.net
Thu, 23 Oct 2008 16:47:39 -0400


I'm still having this problem.  I've done some more debugging to try and
track down the cause.  It occurred again today and I used 'cmdebug' to
get info from the client and 'rxdebug' to get info from the server.  I
ran these from a machine independent of both the client and server in
question.  The interesting thing is that when I ran 'cmdebug' not only
did it take a very long time to execute and display a huge list of locks
but as soon as it completed both the client and server starting
functioning properly.  A subsequent run of 'cmdebug' displayed no
locks.  I suppose it's possible that this could be coincidental but this
problem has occurred countless times and until now has never resolved
itself and the client has always had to be restarted.  Is it possible
that 'cmdebug' could have triggered something in the client that caused
it to un-hang?  This is the first time I've used 'cmdebug' with it in
this state.

Clearly this is a client problem that puts the server in an inaccesable
state.  However,  it seems to me that there is also a problem with
fileserver since a single client can so easily cause it to be
inaccessable to all clients.  I'm now running 1.4.8pre2 on the server.

The output of 'cmdebug' is below.  I had to cut out the 'rxdebug' with
rxstats output because of the list's message size limit.  If anyone
would like to see that I can send it too.

$ cmdebug app1.thinland.srhs.net
Lock afs_xvcache status: (reader_waitingwriter_waitingupgrade_waiting,
write_locked(pid:21773 at:138), 207 waiters)
Lock afs_xserver status: (none_waiting, 1 read_locks(pid:0))
Lock afs_xvcb status: (writer_waiting, write_locked(pid:6465 at:273), 4
waiters)
** Cache entry @ 0x559fc000 for 2.536872292.672.7299 [phoenix.srhs.net]
    locks: (none_waiting, upgrade_locked(pid:30572 at:121))
        10654496 bytes  DV            0  refcnt     2
    callback 53c6ce00   expires 1224796783
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x36003000 for 2.536871938.2.206 [phoenix.srhs.net]
    locks: (upgrade_waiting, upgrade_locked(pid:2729 at:121), 2 waiters)
          396689 bytes  DV         1435  refcnt     2
    callback 53c6ce00   expires 1224782830
    2 opens     2 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x85d22700 for 2.536874302.714.5569 [phoenix.srhs.net]
    locks: (none_waiting, upgrade_locked(pid:18490 at:121))
           64554 bytes  DV            1  refcnt     2
    callback 53c6ce00   expires 1224796655
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x7487e380 for 2.536874044.2.206 [phoenix.srhs.net]
    locks: (writer_waitingupgrade_waiting, upgrade_locked(pid:2584
at:121), 6 waiters)
          385251 bytes  DV         1315  refcnt     2
    callback 53c6ce00   expires 1224788718
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x5e540700 for 2.536871620.402.4804 [phoenix.srhs.net]
    locks: (none_waiting, upgrade_locked(pid:16183 at:121))
        56025088 bytes  DV          561  refcnt     2
    callback 53c6ce00   expires 1224783726
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x69819a80 for 2.536873822.288.3237 [phoenix.srhs.net]
    locks: (none_waiting, upgrade_locked(pid:30568 at:121))
        54714368 bytes  DV          542  refcnt     2
    callback 53c6ce00   expires 1224796783
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0xccc5fa80 for 2.536874170.992.13816 [phoenix.srhs.net]
    locks: (upgrade_waiting, upgrade_locked(pid:3674 at:121), 5 waiters)
          514657 bytes  DV           43  refcnt     2
    callback 53c6ce00   expires 1224796527
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x7b104000 for 2.536874170.994.13854 [phoenix.srhs.net]
    locks: (none_waiting, upgrade_locked(pid:13553 at:121))
           64948 bytes  DV            1  refcnt     2
    callback 53c6ce00   expires 1224796655
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x0985f380 for 2.536874200.2.606 [phoenix.srhs.net]
    locks: (upgrade_waiting, upgrade_locked(pid:2725 at:121), 4 waiters)
          226343 bytes  DV          842  refcnt     2
    callback 53c6ce00   expires 1224782830
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x721d7700 for 2.536873201.2.606 [phoenix.srhs.net]
    locks: (none_waiting, upgrade_locked(pid:3482 at:121))
          304988 bytes  DV          941  refcnt     2
    callback 53c6ce00   expires 1224796655
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x3ddb6700 for 2.536876402.908.11304 [phoenix.srhs.net]
    locks: (none_waiting, upgrade_locked(pid:24115 at:121))
           62072 bytes  DV           10  refcnt     2
    callback 53c6ce00   expires 1224796655
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x7198d700 for 2.536873378.892.10147 [phoenix.srhs.net]
    locks: (writer_waitingupgrade_waiting, write_locked(pid:28527
at:66), 3 waiters)
          505958 bytes  DV           25  refcnt     2
    callback 53c6ce00   expires 1224796782
    2 opens     2 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x0c4b3000 for 2.536874302.27.24 [phoenix.srhs.net]
    locks: (none_waiting, write_locked(pid:18490 at:147))
            2048 bytes  DV          719  refcnt     2
    callback 53c6ce00   expires 1224796655
    0 opens     0 writers
    normal file
    states (0x1), stat'd
** Cache entry @ 0xe50f8380 for 2.536872292.2.606 [phoenix.srhs.net]
    locks: (writer_waitingupgrade_waiting, upgrade_locked(pid:4888
at:121), 12 waiters)
          462170 bytes  DV         1271  refcnt     2
    callback 53c6ce00   expires 1224782958
    2 opens     2 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x1cc66000 for 2.536874344.1.1 [phoenix.srhs.net]
    locks: (none_waiting, write_locked(pid:5920 at:135))
            4096 bytes  DV         1275  refcnt     2
    callback 53c6ce00   expires 1224797295
    0 opens     0 writers
    volume root
    states (0x1), stat'd
** Cache entry @ 0x820cea80 for 2.536876324.114.7527 [phoenix.srhs.net]
    locks: (none_waiting, upgrade_locked(pid:401 at:121))
             144 bytes  DV            0  refcnt     2
    callback 53c6ce00   expires 1224797295
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x6983ba80 for 2.536874368.31.229 [phoenix.srhs.net]
    locks: (none_waiting, write_locked(pid:3380 at:54))
               0 bytes  DV            0  refcnt     2
    callback 53c6ce00   expires 0
    0 opens     0 writers
    normal file
    states (0x0)
** Cache entry @ 0x884c0700 for 2.536873363.33.31 [phoenix.srhs.net]
    locks: (none_waiting, write_locked(pid:5660 at:54))
            9840 bytes  DV            0  refcnt     2
    callback 53c6ce00   expires 0
    0 opens     0 writers
    normal file
    states (0x0)
** Cache entry @ 0xd8d66380 for 2.536874755.690.5213 [phoenix.srhs.net]
    locks: (none_waiting, upgrade_locked(pid:27886 at:121))
           57968 bytes  DV           26  refcnt     2
    callback 53c6ce00   expires 1224796783
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0xb9979380 for 2.536876381.91.1631 [phoenix.srhs.net]
    locks: (none_waiting, write_locked(pid:13043 at:54))
            2048 bytes  DV          242  refcnt     3
    callback 53c6ce00   expires 1224782446
    0 opens     0 writers
    normal file
    states (0x0)
** Cache entry @ 0x14d7e000 for 2.536873405.1.1 [phoenix.srhs.net]
    locks: (none_waiting, write_locked(pid:5824 at:135))
            4096 bytes  DV         1173  refcnt     2
    callback 53c6ce00   expires 1224797295
    0 opens     0 writers
    volume root
    states (0x1), stat'd
** Cache entry @ 0xb2409700 for 2.536872703.712.7884 [phoenix.srhs.net]
    locks: (none_waiting, upgrade_locked(pid:28492 at:121))
            2382 bytes  DV            0  refcnt     2
    callback 53c6ce00   expires 1224797295
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x461bc000 for 2.536876402.1114.11312 [phoenix.srhs.net]
    locks: (none_waiting, upgrade_locked(pid:17206 at:121))
          121827 bytes  DV            1  refcnt     2
    callback 53c6ce00   expires 1224796783
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0xd8c99a80 for 2.536872703.734.7872 [phoenix.srhs.net]
    locks: (upgrade_waiting, upgrade_locked(pid:5823 at:121), 2 waiters)
          503437 bytes  DV           15  refcnt     2
    callback 53c6ce00   expires 1224796783
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x558ad700 for 2.536874470.2.606 [phoenix.srhs.net]
    locks: (upgrade_waiting, upgrade_locked(pid:5894 at:121), 3 waiters)
          250418 bytes  DV          894  refcnt     2
    callback 53c6ce00   expires 1224796782
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x3dd60a80 for 2.536876402.128.3198 [phoenix.srhs.net]
    locks: (none_waiting, upgrade_locked(pid:24115 at:121))
         1015808 bytes  DV         2032  refcnt     2
    callback 53c6ce00   expires 1224797295
    1 opens     1 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x0c45d000 for 2.536874512.2.206 [phoenix.srhs.net]
    locks: (writer_waitingupgrade_waiting, write_locked(pid:27196
at:66), 12 waiters)
          189750 bytes  DV          897  refcnt     2
    callback 53c6ce00   expires 1224796783
    3 opens     3 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0xd8d0e380 for 2.536873378.131.2007 [phoenix.srhs.net]
    locks: (none_waiting, write_locked(pid:25829 at:135))
            2048 bytes  DV          102  refcnt     2
    callback 53c6ce00   expires 1224784111
    0 opens     0 writers
    normal file
    states (0x1), stat'd
** Cache entry @ 0x8d02e380 for 2.536874695.884.9977 [phoenix.srhs.net]
    locks: (writer_waitingupgrade_waiting, write_locked(pid:3822 at:66),
14 waiters)
          512293 bytes  DV           48  refcnt     2
    callback 53c6ce00   expires 1224796527
    6 opens     6 writers
    normal file
    states (0x21), stat'd
** Cache entry @ 0x46096380 for 2.536874014.545.9379 [phoenix.srhs.net]
    locks: (none_waiting, write_locked(pid:22297 at:616))
            2048 bytes  DV            0  refcnt     2
    callback 00000000   expires 1224797295
    0 opens     0 writers
    normal file
    states (0x0)
** Cache entry @ 0x68187380 for 2.536874470.101.1024 [phoenix.srhs.net]
    locks: (none_waiting, write_locked(pid:29269 at:135))
            2048 bytes  DV           42  refcnt     2
    callback 53c6ce00   expires 1224784111
    0 opens     0 writers
    normal file
    states (0x1), stat'd
** Cache entry @ 0x46d07000 for 2.536874602.25.222 [phoenix.srhs.net]
    locks: (none_waiting, write_locked(pid:5932 at:54))
            2048 bytes  DV            0  refcnt     2
    callback 53c6ce00   expires 0
    0 opens     0 writers
    normal file
    states (0x0)
** Cache entry @ 0xcad4c000 for 2.536874512.137.2220 [phoenix.srhs.net]
    locks: (none_waiting, write_locked(pid:26552 at:135))
            2048 bytes  DV           62  refcnt     2
    callback 53c6ce00   expires 1224784111
    0 opens     0 writers
    normal file
    states (0x1), stat'd
** Cache entry @ 0xcad4c000 for 2.536874512.137.2220 [phoenix.srhs.net]
    locks: (none_waiting, write_locked(pid:26552 at:135))
            2048 bytes  DV           62  refcnt     2
    callback 53c6ce00   expires 1224784111
    0 opens     0 writers
    normal file
    states (0x1), stat'd


Randy Kemp wrote:
> It did it again as suspected.  The requested backtrace is attached. 
> It's inaccessible from any client (not just the app server) on all
> interfaces.  No one will need to use AFS or the app server for the
> remainder of the day so I will leave it in this state in case anyone
> comes up with some other bit of info that might be useful to retrieve.
>
> Randy Kemp wrote:
>> Derrick Brashear wrote:
>>> On Mon, Oct 13, 2008 at 9:52 AM, Randy Kemp <rkemp@srhs.net> wrote:
>>>  
>>>> I had users again today to test with.  The problem with fileserver
>>>> ceasing to respond and generating the "CallPreamble: Couldn't get CPS.
>>>> Too many lockers" error occurred again.
>>>>     
>>> To the app server, or to both machines? I assume, actually, only to
>>> the app server.
>>>   
>> The last time it occurred (when I sent the first message) it stopped
>> responding to all connected clients.  I did not think to test that this.
>>>> I'm now running fileserver with the following parameters, "-p 128
>>>> -b 512
>>>> -l 3072 -s 3072 -vc 3072 -cb 65536 -busyat 1536 -rxpck 1024 -nojumbo".
>>>> At the time that the problem started there were two physical clients
>>>> connected, one was a standalone workstation and the other was the
>>>> aforementioned application server with approximately 40 users logged
>>>> in.  All requests from said application server are now coming from a
>>>> single address to a single interface on the AFS server.
>>>>     It turns out that restarting AFS vs. rebooting the server does
>>>> not make
>>>> the problem go away as I previously thought.  It was purely
>>>> coincidental
>>>> that I had also restarted the application server last time.  What I
>>>> have
>>>> now discovered is that even rebooting the AFS server does not resolve
>>>> the problem (errors start immediately upon startup) and it now appears
>>>> that the problem is only resolved by restarting the client on the
>>>> application server.  The application server is running OpenAFS client
>>>> version 1.4.7 on Ubuntu Linux with kernel version 2.6.24.
>>>>     
>>>
>>> echo t > /proc/sysrq-trigger on the application server, as root, when
>>> the fileserver won't talk to it, collect the system log from e.g.
>>> /var/log/messages with the backtrace in it, and let us see that.
>>>   
>> I have about 90 users that will try to log in to the app server later
>> today so I'm sure it will do it again.  I'll do this and test it from
>> other clients if/when it happens.
-- 
Randy Kemp