[OpenAFS] Re: fileserver process eat CPU

Jean-Marc Choulet jm130794@gmail.com
Tue, 10 Sep 2013 20:07:46 +0200


Le 10/09/2013 17:43, Andrew Deason a écrit :
> On Tue, 10 Sep 2013 07:57:03 +0200
> Jean-Marc Choulet <jm130794@gmail.com> wrote:
>
>> We have a question about the fileserver process. On our server openafs
>> (Debian squeeze), the fileserver process eats 80% CPU every 5 seconds.
> Are you running the openafs version from squeeze
> (1.4.12.1+dfsg-4+squeeze2)?
Yes, it is
>
> This is somewhat a guess, but: that version sill sync() every 10
> seconds; that's a bug that is fixed in later versions. That's every 10
> seconds, though, not every 5, and that should only really do anything
> noticeable if you have a lot of disk activity on the machine (whether or
> not it's caused by openafs). If you 'strace' the fileserver process, it
> should be pretty obvious if that's happening; you'll see sync() calls
> executing and taking a long time.
>
> For how long does it use 80% cpu?
Every 10 seconds
How long : about 10 seconds
>
>> No AFS clientis connected to the server
>> We restarted our server but it is always the same.
> If the above is the problem, you can patch the server to not do that
> (it's a very small patch), or you may be able to alter the underlying
> filesystem to make sync() calls less noticeable.
... it's easy for you, but not for us :)
>
> However, if that's not the problem, or just more generally to see
> "what is going on", there are a few different things you can do:
>
>   - Check FileLog and BosLog (or syslog if you log to syslog), or just
>     all of the *Log files. Just see if anything abnormal looks like it's
>     being logged. And of course, if there's anything getting logged every
>     5 seconds, it's probably relevant.
root@afs1:~# tail -f /var/log/openafs/*
==> /var/log/openafs/BackupLog <==
09/08/2013 04:01:14 Will allocate 400 ubik buffers
09/08/2013 04:01:14 Waiting for quorum election
09/08/2013 04:01:18 Have established quorum
09/08/2013 04:01:18 Ready to process requests at Sun Sep  8 04:01:18 2013

09/10/2013 07:38:38 Will allocate 400 ubik buffers
09/10/2013 07:38:38 Waiting for quorum election
09/10/2013 07:38:42 Have established quorum
09/10/2013 07:38:42 Ready to process requests at Tue Sep 10 07:38:42 2013


==> /var/log/openafs/BosLog <==
Tue Sep 10 07:38:38 2013: Server directory access is okay

==> /var/log/openafs/BosLog.old <==
Sun Sep  8 04:01:14 2013: Server directory access is okay
Sun Sep  8 22:00:03 2013: backupcell exited with code 0
Mon Sep  9 22:00:03 2013: backupcell exited with code 0
Tue Sep 10 07:38:21 2013: buserver exited on signal 15
Tue Sep 10 07:38:21 2013: vlserver exited on signal 15
Tue Sep 10 07:38:21 2013: ptserver exited on signal 15
Tue Sep 10 07:38:27 2013: fs:vol exited on signal 15
Tue Sep 10 07:38:38 2013: fs:file exited with code 0

==> /var/log/openafs/FileLog <==
Tue Sep 10 19:35:32 2013 CB: ProbeUuid for host 0x7f57bb0ef318 
(172.20.128.130:7001) failed -01
Tue Sep 10 19:37:32 2013 CB: ProbeUuid for host 0x7f57bb0ef318 
(172.20.128.133:7001) failed -01
Tue Sep 10 19:39:30 2013 CB: ProbeUuid for 0x7f57bb0ef318 
(172.20.128.130:7001) failed -01
Tue Sep 10 19:41:32 2013 CB: ProbeUuid for host 0x7f57bb0ef318 
(172.20.128.133:7001) failed -01
Tue Sep 10 19:43:36 2013 CB: ProbeUuid for 0x7f57bb0ef318 
(172.20.128.130:7001) failed -01
Tue Sep 10 19:45:32 2013 CB: ProbeUuid for host 0x7f57bb0ef318 
(172.20.128.133:7001) failed -01
Tue Sep 10 19:47:42 2013 CB: ProbeUuid for 0x7f57bb0ef318 
(172.20.128.130:7001) failed -01
Tue Sep 10 19:49:32 2013 CB: ProbeUuid for host 0x7f57bb0ef318 
(172.20.128.133:7001) failed -01
Tue Sep 10 19:51:48 2013 CB: ProbeUuid for 0x7f57bb0ef318 
(172.20.128.130:7001) failed -01
Tue Sep 10 19:53:10 2013 CB: ProbeUuid for host 0x7f57bb0ef318 
(172.20.128.133:7001) failed -01

==> /var/log/openafs/FileLog.old <==
Tue Sep 10 07:38:38 2013 File server has terminated normally at Tue Sep 
10 07:38:38 2013

ll 0 challenge 47 response 94 debug 0 params 0 unused 0 unused 0 unused 
0 version 0
    other send counters: ack 1523714, data 2017892 (not resends), 
resends 241, pushed 0, acked&ignored 486192
        (these should be small) sendFailed 0, fatalErrors 0
    Average rtt is 0.005, with 5138462 samples
    Minimum rtt is 0.000, maximum is 0.566
    0 server connections, 16 client connections, 4 peer structs, 32 call 
structs, 32 free call structs
146883 add CB, 90604 break CB, 91195 del CB, 21718 del FE, 5007 CB's 
timed out, 0 space reclaim, 72 del host
1 CBs, 1 FEs, (2 of total of 60000 16-byte blocks)

==> /var/log/openafs/PtLog <==
Tue Sep 10 07:38:38 2013 Using 172.20.128.247 as my primary address

==> /var/log/openafs/PtLog.old <==
Sun Sep  8 04:01:14 2013 Using 172.20.128.247 as my primary address

==> /var/log/openafs/SalvageLog <==
@(#) OpenAFS 1.4.12.1 built  2011-02-09
06/01/2013 22:00:24 STARTING AFS SALVAGER 2.4 (/usr/lib/openafs/salvager 
/vicepa 171082816 -tmpdir /tmp)
06/01/2013 22:00:24 1 nVolumesInInodeFile 32
06/01/2013 22:00:25 SALVAGING VOLUME 171082816.
06/01/2013 22:00:25 user.aabdulha_tmp (171082816) updated 10/11/2012 12:21
06/01/2013 22:00:25 totalInodes 746
06/01/2013 22:00:30 Salvaged user.aabdulha_tmp (171082816): 742 files, 
78092 blocks

==> /var/log/openafs/SalvageLog.old <==
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced 
file .__afs179A is deleted (vnode 86)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced 
file .__afsC629 is deleted (vnode 1106)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced 
file .__afsCE72 is deleted (vnode 902)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced 
file .__afs01D1 is deleted (vnode 908)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced 
file .__afsAFF7 is deleted (vnode 890)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced 
file .__afsD4CC is deleted (vnode 700)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced 
file .__afs096D is deleted (vnode 946)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced 
file .__afsCC1D is deleted (vnode 1016)
06/01/2013 14:56:24 Volume uniquifier is too low; fixed
06/01/2013 14:56:24 Salvaged user.aabdulha_tmp (171082816): 742 files, 
78092 blocks

==> /var/log/openafs/VLLog <==
Tue Sep 10 07:38:38 2013 Using 172.20.128.247 as my primary address
Tue Sep 10 07:38:38 2013 Starting AFS vlserver 4 (/usr/lib/openafs/vlserver)

==> /var/log/openafs/VLLog.old <==
Sun Sep  8 04:01:14 2013 Using 172.20.128.247 as my primary address
Sun Sep  8 04:01:14 2013 Starting AFS vlserver 4 (/usr/lib/openafs/vlserver)

==> /var/log/openafs/VolserLog <==
Tue Sep 10 10:45:25 2013 1 Volser: CreateVolume: volume 144602000 
(user.jmprudha) created
Tue Sep 10 10:45:27 2013 1 Volser: CreateVolume: volume 154567376 
(user.dyang2) created
Tue Sep 10 10:45:29 2013 1 Volser: CreateVolume: volume 144122616 
(user.adilou) created
Tue Sep 10 12:52:13 2013 1 Volser: CreateVolume: volume 159986776 
(user.airauld_tmp) created
Tue Sep 10 12:52:45 2013 1 Volser: CreateVolume: volume 144663544 
(user.poncot_tmp) created
Tue Sep 10 16:33:56 2013 1 Volser: CreateVolume: volume 142021512 
(user.wjacques_tmp) created
Tue Sep 10 16:35:18 2013 1 Volser: CreateVolume: volume 154334216 
(user.cbrenckle_tmp) created
Tue Sep 10 16:35:30 2013 1 Volser: CreateVolume: volume 154165960 
(user.rhasan_tmp) created
Tue Sep 10 16:35:57 2013 1 Volser: CreateVolume: volume 143848040 
(user.lnan_tmp) created
Tue Sep 10 16:36:07 2013 1 Volser: CreateVolume: volume 146104976 
(user.ihamed_tmp) created

==> /var/log/openafs/VolserLog.old <==
Mon Sep  9 16:58:54 2013 1 Volser: CreateVolume: volume 163242808 
(user.asalrajh) created
Mon Sep  9 16:58:55 2013 1 Volser: CreateVolume: volume 173085448 
(user.rhassan) created
Mon Sep  9 16:58:57 2013 1 Volser: CreateVolume: volume 158343912 
(user.asaif) created
Mon Sep  9 16:58:58 2013 1 Volser: CreateVolume: volume 159806320 
(user.mtourni3) created
Mon Sep  9 22:00:01 2013 1 Volser: Clone: Recloning volume 536870918 to 
volume 536870920
Mon Sep  9 22:00:01 2013 1 Volser: Clone: Recloning volume 536870945 to 
volume 536870947
Mon Sep  9 22:00:03 2013 1 Volser: Clone: Recloning volume 151039480 to 
volume 151039482
Mon Sep  9 22:00:03 2013 1 Volser: Clone: Recloning volume 536870975 to 
volume 536870977
Mon Sep  9 22:00:03 2013 1 Volser: Clone: Recloning volume 536871149 to 
volume 536871151
Mon Sep  9 22:00:03 2013 1 Volser: Clone: Recloning volume 536871152 to 
volume 536871154

==> /var/log/openafs/FileLog <==
Tue Sep 10 19:55:55 2013 CB: ProbeUuid for 0x7f57bb0ef318 
(172.20.128.130:7001) failed -01
Tue Sep 10 19:57:18 2013 CB: ProbeUuid for host 0x7f57bb0ef318 
(172.20.128.133:7001) failed -01
Tue Sep 10 20:00:01 2013 CB: ProbeUuid for 0x7f57bb0ef318 
(172.20.128.130:7001) failed -01
Tue Sep 10 20:00:38 2013 CB: WhoAreYou failed for host 0x7f57bb0ecfc0 
(172.20.128.103:7001), error 1
Tue Sep 10 20:00:38 2013 CB: WhoAreYou failed for host 0x7f57bb0ed710 
(172.20.128.188:7001), error 1
Tue Sep 10 20:00:40 2013 CB: WhoAreYou failed for host 0x7f57bb0ed5d8 
(172.20.128.190:7001), error 1
Tue Sep 10 20:00:42 2013 CB: WhoAreYou failed for host 0x7f57bb0ec9a8 
(172.20.128.192:7001), error 1
Tue Sep 10 20:01:07 2013 CB: RCallBackConnectBack failed for host 
0x7f57bb0eb148 (172.20.128.130:7001)
Tue Sep 10 20:01:07 2013 CB: Call back connect back failed (in break 
delayed) for Host 172.20.128.130:7001
Tue Sep 10 20:01:07 2013 BreakDelayedCallbacks FAILED for host 
172.20.128.130:7001 which IS UP.  Connection from 172.20.128.130:7001.  
Possible network or routing failure.
Tue Sep 10 20:01:07 2013 MultiProbe failed to find new address for host 
172.20.128.130:7001
Tue Sep 10 20:01:09 2013 CB: WhoAreYou failed for host 0x7f57bb0ef318 
(172.20.128.133:7001), error 1
Tue Sep 10 20:01:26 2013 CB: ProbeUuid for host 0x7f57bb0ef318 
(172.20.128.133:7001) failed -01