[OpenAFS] File Server appears to stop responding

Randy Kemp rkemp@srhs.net
Mon, 13 Oct 2008 09:52:56 -0400


This is a multi-part message in MIME format.
--------------060203090901040706050704
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

I had users again today to test with.  The problem with fileserver
ceasing to respond and generating the "CallPreamble: Couldn't get CPS.
Too many lockers" error occurred again.

I'm now running fileserver with the following parameters, "-p 128 -b 512
-l 3072 -s 3072 -vc 3072 -cb 65536 -busyat 1536 -rxpck 1024 -nojumbo".
At the time that the problem started there were two physical clients
connected, one was a standalone workstation and the other was the
aforementioned application server with approximately 40 users logged
in.  All requests from said application server are now coming from a
single address to a single interface on the AFS server.

It turns out that restarting AFS vs. rebooting the server does not make
the problem go away as I previously thought.  It was purely coincidental
that I had also restarted the application server last time.  What I have
now discovered is that even rebooting the AFS server does not resolve
the problem (errors start immediately upon startup) and it now appears
that the problem is only resolved by restarting the client on the
application server.  The application server is running OpenAFS client
version 1.4.7 on Ubuntu Linux with kernel version 2.6.24.

I've attached the FileLog and I trimmed out the repeating errors.  Any
ideas?

Randy Kemp wrote:
> Harald Barth wrote:
>>> I've been experiencing a problem where fileserver appears to simply
>>> stop responding to requests. I currently only have one AFS server
>>> (OpenAFS version 1.4.7).
>>>     
>> OS?
>>   
> Linux Kernel version 2.6.25.  I was running 2.6.24 when this problem 
> started.
>> What parameters do you start the fileserver with?
> No parameters currently.  Perhaps that's part of my problem.  I've 
> done some research on the fileserver parameters and will see if 
> tweaking those improves things.
>>> I have to completely reboot the server to resolve it.     
>>
>> Hm. What is not freed correctly?
>>   
> Exactly, that's the strange part.
>>  
>>> This log was generated after I restarted fileserver and it still 
>>> wasn't working.
>>>     
>>
>> The "CallPreamble: Couldn't get CPS. Too many lockers" indicates that
>> the fileserver can not send any rx calls any more. The question now is
>> how it got into that state and why restarting (other than reboot) does
>> not clear it.
>>   
> Thanks for explaining that.  Now I just need to figure out why.  I 
> unfortunately won't have a large number of users to test with for a 
> couple of weeks.
>
-- 
Randy Kemp




--------------060203090901040706050704
Content-Type: text/plain;
 name="FileLog-20081013"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="FileLog-20081013"

Sun Oct 12 04:00:30 2008 File server starting
Sun Oct 12 04:00:30 2008 afs_krb_get_lrealm failed, using phoenix.srhs.net.
Sun Oct 12 04:00:30 2008 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Sun Oct 12 04:00:30 2008 Set thread id 126 for FSYNC_sync
Sun Oct 12 04:00:30 2008 FSYNC_sync: bind failed with (98), removed bogus /var/lib/openafs/fssync.sock
Sun Oct 12 04:00:30 2008 Partition /vicepa: attaching volumes
Sun Oct 12 04:00:30 2008 Partition /vicepa: attached 1863 volumes; 0 volumes not attached
Sun Oct 12 04:00:30 2008 Set thread id 127 for 'FiveMinuteCheckLWP'
Sun Oct 12 04:00:30 2008 Set thread id 129 for 'FsyncCheckLWP'
Sun Oct 12 04:00:30 2008 Set thread id 128 for 'HostCheckLWP'
Sun Oct 12 04:00:30 2008 Getting FileServer name...
Sun Oct 12 04:00:30 2008 FileServer host name is 'afs'
Sun Oct 12 04:00:30 2008 Getting FileServer address...
Sun Oct 12 04:00:30 2008 FileServer afs has address 10.141.4.2 (0x2048d0a or 0xa8d0402 in host byte order)
Sun Oct 12 04:00:30 2008 File Server started Sun Oct 12 04:00:30 2008
Mon Oct 13 07:48:59 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:49:17 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:49:35 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:49:35 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:49:54 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:49:54 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:49:54 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:49:54 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:50:48 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:50:48 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:50:48 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:50:48 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:50:49 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:50:49 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:50:49 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:50:49 2008 CallPreamble: Couldn't get CPS. Too many lockers
[trimmed repeating error]
Mon Oct 13 07:55:08 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:55:08 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:55:08 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:55:08 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:55:08 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:55:08 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:55:08 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:55:08 2008 CallPreamble: Couldn't get CPS. Too many lockers
Mon Oct 13 07:55:08 2008 Shutting down file server at Mon Oct 13 07:55:08 2008
Mon Oct 13 07:55:08 2008 Vice was last started at Sun Oct 12 04:00:30 2008

Mon Oct 13 07:55:08 2008 Large vnode cache, 3072 entries, 22 allocs, 49437 gets (4292 reads), 5933 writes
Mon Oct 13 07:55:08 2008 Small vnode cache,3072 entries, 2328 allocs, 77690 gets (4823 reads), 19130 writes
Mon Oct 13 07:55:08 2008 Volume header cache, 3072 entries, 49547 gets, 0 replacements
Mon Oct 13 07:55:08 2008 Partition /vicepa: 1941465536 available 1K blocks (minfree=0), Mon Oct 13 07:55:08 2008 1933775811 free blocks
Mon Oct 13 07:55:08 2008 With 512 directory buffers; 51616 reads resulted in 719 read I/Os
Mon Oct 13 07:55:08 2008 Total Client entries = 610, blocks = 9; Host entries = 6, blocks = 1
Mon Oct 13 07:55:08 2008 There are 610 connections, process size 28412
Mon Oct 13 07:55:08 2008 There are 6 workstations, 2 are active (req in < 15 mins), 0 marked "down"
Mon Oct 13 07:55:08 2008 VShutdown:  shutting down on-line volumes...
Mon Oct 13 07:55:16 2008 VShutdown:  complete.
Mon Oct 13 07:55:16 2008 File server has terminated normally at Mon Oct 13 07:55:16 2008

allenge 205 response 216 debug 0 params 0 unused 0 unused 0 unused 0 version 0 
   other send counters: ack 1122699, data 100014 (not resends), resends 174, pushed 0, acked&ignored 8182
   	(these should be small) sendFailed 0, fatalErrors 0
   Average rtt is 0.005, with 935 samples
   Minimum rtt is 0.000, maximum is 0.042
   222 server connections, 130 client connections, 10 peer structs, 260 call structs, 1 free call structs
14928 add CB, 16144 break CB, 0 del CB, 3042 del FE, 369 CB's timed out, 0 space reclaim, 4 del host
8719 CBs, 8719 FEs, (17438 of total of 65536 16-byte blocks)

--------------060203090901040706050704--