[OpenAFS] Re: fileserver process eat CPU
Jean-Marc Choulet
jm130794@gmail.com
Tue, 10 Sep 2013 20:07:46 +0200
Le 10/09/2013 17:43, Andrew Deason a écrit :
> On Tue, 10 Sep 2013 07:57:03 +0200
> Jean-Marc Choulet <jm130794@gmail.com> wrote:
>
>> We have a question about the fileserver process. On our server openafs
>> (Debian squeeze), the fileserver process eats 80% CPU every 5 seconds.
> Are you running the openafs version from squeeze
> (1.4.12.1+dfsg-4+squeeze2)?
Yes, it is
>
> This is somewhat a guess, but: that version sill sync() every 10
> seconds; that's a bug that is fixed in later versions. That's every 10
> seconds, though, not every 5, and that should only really do anything
> noticeable if you have a lot of disk activity on the machine (whether or
> not it's caused by openafs). If you 'strace' the fileserver process, it
> should be pretty obvious if that's happening; you'll see sync() calls
> executing and taking a long time.
>
> For how long does it use 80% cpu?
Every 10 seconds
How long : about 10 seconds
>
>> No AFS clientis connected to the server
>> We restarted our server but it is always the same.
> If the above is the problem, you can patch the server to not do that
> (it's a very small patch), or you may be able to alter the underlying
> filesystem to make sync() calls less noticeable.
... it's easy for you, but not for us :)
>
> However, if that's not the problem, or just more generally to see
> "what is going on", there are a few different things you can do:
>
> - Check FileLog and BosLog (or syslog if you log to syslog), or just
> all of the *Log files. Just see if anything abnormal looks like it's
> being logged. And of course, if there's anything getting logged every
> 5 seconds, it's probably relevant.
root@afs1:~# tail -f /var/log/openafs/*
==> /var/log/openafs/BackupLog <==
09/08/2013 04:01:14 Will allocate 400 ubik buffers
09/08/2013 04:01:14 Waiting for quorum election
09/08/2013 04:01:18 Have established quorum
09/08/2013 04:01:18 Ready to process requests at Sun Sep 8 04:01:18 2013
09/10/2013 07:38:38 Will allocate 400 ubik buffers
09/10/2013 07:38:38 Waiting for quorum election
09/10/2013 07:38:42 Have established quorum
09/10/2013 07:38:42 Ready to process requests at Tue Sep 10 07:38:42 2013
==> /var/log/openafs/BosLog <==
Tue Sep 10 07:38:38 2013: Server directory access is okay
==> /var/log/openafs/BosLog.old <==
Sun Sep 8 04:01:14 2013: Server directory access is okay
Sun Sep 8 22:00:03 2013: backupcell exited with code 0
Mon Sep 9 22:00:03 2013: backupcell exited with code 0
Tue Sep 10 07:38:21 2013: buserver exited on signal 15
Tue Sep 10 07:38:21 2013: vlserver exited on signal 15
Tue Sep 10 07:38:21 2013: ptserver exited on signal 15
Tue Sep 10 07:38:27 2013: fs:vol exited on signal 15
Tue Sep 10 07:38:38 2013: fs:file exited with code 0
==> /var/log/openafs/FileLog <==
Tue Sep 10 19:35:32 2013 CB: ProbeUuid for host 0x7f57bb0ef318
(172.20.128.130:7001) failed -01
Tue Sep 10 19:37:32 2013 CB: ProbeUuid for host 0x7f57bb0ef318
(172.20.128.133:7001) failed -01
Tue Sep 10 19:39:30 2013 CB: ProbeUuid for 0x7f57bb0ef318
(172.20.128.130:7001) failed -01
Tue Sep 10 19:41:32 2013 CB: ProbeUuid for host 0x7f57bb0ef318
(172.20.128.133:7001) failed -01
Tue Sep 10 19:43:36 2013 CB: ProbeUuid for 0x7f57bb0ef318
(172.20.128.130:7001) failed -01
Tue Sep 10 19:45:32 2013 CB: ProbeUuid for host 0x7f57bb0ef318
(172.20.128.133:7001) failed -01
Tue Sep 10 19:47:42 2013 CB: ProbeUuid for 0x7f57bb0ef318
(172.20.128.130:7001) failed -01
Tue Sep 10 19:49:32 2013 CB: ProbeUuid for host 0x7f57bb0ef318
(172.20.128.133:7001) failed -01
Tue Sep 10 19:51:48 2013 CB: ProbeUuid for 0x7f57bb0ef318
(172.20.128.130:7001) failed -01
Tue Sep 10 19:53:10 2013 CB: ProbeUuid for host 0x7f57bb0ef318
(172.20.128.133:7001) failed -01
==> /var/log/openafs/FileLog.old <==
Tue Sep 10 07:38:38 2013 File server has terminated normally at Tue Sep
10 07:38:38 2013
ll 0 challenge 47 response 94 debug 0 params 0 unused 0 unused 0 unused
0 version 0
other send counters: ack 1523714, data 2017892 (not resends),
resends 241, pushed 0, acked&ignored 486192
(these should be small) sendFailed 0, fatalErrors 0
Average rtt is 0.005, with 5138462 samples
Minimum rtt is 0.000, maximum is 0.566
0 server connections, 16 client connections, 4 peer structs, 32 call
structs, 32 free call structs
146883 add CB, 90604 break CB, 91195 del CB, 21718 del FE, 5007 CB's
timed out, 0 space reclaim, 72 del host
1 CBs, 1 FEs, (2 of total of 60000 16-byte blocks)
==> /var/log/openafs/PtLog <==
Tue Sep 10 07:38:38 2013 Using 172.20.128.247 as my primary address
==> /var/log/openafs/PtLog.old <==
Sun Sep 8 04:01:14 2013 Using 172.20.128.247 as my primary address
==> /var/log/openafs/SalvageLog <==
@(#) OpenAFS 1.4.12.1 built 2011-02-09
06/01/2013 22:00:24 STARTING AFS SALVAGER 2.4 (/usr/lib/openafs/salvager
/vicepa 171082816 -tmpdir /tmp)
06/01/2013 22:00:24 1 nVolumesInInodeFile 32
06/01/2013 22:00:25 SALVAGING VOLUME 171082816.
06/01/2013 22:00:25 user.aabdulha_tmp (171082816) updated 10/11/2012 12:21
06/01/2013 22:00:25 totalInodes 746
06/01/2013 22:00:30 Salvaged user.aabdulha_tmp (171082816): 742 files,
78092 blocks
==> /var/log/openafs/SalvageLog.old <==
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced
file .__afs179A is deleted (vnode 86)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced
file .__afsC629 is deleted (vnode 1106)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced
file .__afsCE72 is deleted (vnode 902)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced
file .__afs01D1 is deleted (vnode 908)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced
file .__afsAFF7 is deleted (vnode 890)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced
file .__afsD4CC is deleted (vnode 700)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced
file .__afs096D is deleted (vnode 946)
06/01/2013 14:56:17 dir vnode 119: special old unlink-while-referenced
file .__afsCC1D is deleted (vnode 1016)
06/01/2013 14:56:24 Volume uniquifier is too low; fixed
06/01/2013 14:56:24 Salvaged user.aabdulha_tmp (171082816): 742 files,
78092 blocks
==> /var/log/openafs/VLLog <==
Tue Sep 10 07:38:38 2013 Using 172.20.128.247 as my primary address
Tue Sep 10 07:38:38 2013 Starting AFS vlserver 4 (/usr/lib/openafs/vlserver)
==> /var/log/openafs/VLLog.old <==
Sun Sep 8 04:01:14 2013 Using 172.20.128.247 as my primary address
Sun Sep 8 04:01:14 2013 Starting AFS vlserver 4 (/usr/lib/openafs/vlserver)
==> /var/log/openafs/VolserLog <==
Tue Sep 10 10:45:25 2013 1 Volser: CreateVolume: volume 144602000
(user.jmprudha) created
Tue Sep 10 10:45:27 2013 1 Volser: CreateVolume: volume 154567376
(user.dyang2) created
Tue Sep 10 10:45:29 2013 1 Volser: CreateVolume: volume 144122616
(user.adilou) created
Tue Sep 10 12:52:13 2013 1 Volser: CreateVolume: volume 159986776
(user.airauld_tmp) created
Tue Sep 10 12:52:45 2013 1 Volser: CreateVolume: volume 144663544
(user.poncot_tmp) created
Tue Sep 10 16:33:56 2013 1 Volser: CreateVolume: volume 142021512
(user.wjacques_tmp) created
Tue Sep 10 16:35:18 2013 1 Volser: CreateVolume: volume 154334216
(user.cbrenckle_tmp) created
Tue Sep 10 16:35:30 2013 1 Volser: CreateVolume: volume 154165960
(user.rhasan_tmp) created
Tue Sep 10 16:35:57 2013 1 Volser: CreateVolume: volume 143848040
(user.lnan_tmp) created
Tue Sep 10 16:36:07 2013 1 Volser: CreateVolume: volume 146104976
(user.ihamed_tmp) created
==> /var/log/openafs/VolserLog.old <==
Mon Sep 9 16:58:54 2013 1 Volser: CreateVolume: volume 163242808
(user.asalrajh) created
Mon Sep 9 16:58:55 2013 1 Volser: CreateVolume: volume 173085448
(user.rhassan) created
Mon Sep 9 16:58:57 2013 1 Volser: CreateVolume: volume 158343912
(user.asaif) created
Mon Sep 9 16:58:58 2013 1 Volser: CreateVolume: volume 159806320
(user.mtourni3) created
Mon Sep 9 22:00:01 2013 1 Volser: Clone: Recloning volume 536870918 to
volume 536870920
Mon Sep 9 22:00:01 2013 1 Volser: Clone: Recloning volume 536870945 to
volume 536870947
Mon Sep 9 22:00:03 2013 1 Volser: Clone: Recloning volume 151039480 to
volume 151039482
Mon Sep 9 22:00:03 2013 1 Volser: Clone: Recloning volume 536870975 to
volume 536870977
Mon Sep 9 22:00:03 2013 1 Volser: Clone: Recloning volume 536871149 to
volume 536871151
Mon Sep 9 22:00:03 2013 1 Volser: Clone: Recloning volume 536871152 to
volume 536871154
==> /var/log/openafs/FileLog <==
Tue Sep 10 19:55:55 2013 CB: ProbeUuid for 0x7f57bb0ef318
(172.20.128.130:7001) failed -01
Tue Sep 10 19:57:18 2013 CB: ProbeUuid for host 0x7f57bb0ef318
(172.20.128.133:7001) failed -01
Tue Sep 10 20:00:01 2013 CB: ProbeUuid for 0x7f57bb0ef318
(172.20.128.130:7001) failed -01
Tue Sep 10 20:00:38 2013 CB: WhoAreYou failed for host 0x7f57bb0ecfc0
(172.20.128.103:7001), error 1
Tue Sep 10 20:00:38 2013 CB: WhoAreYou failed for host 0x7f57bb0ed710
(172.20.128.188:7001), error 1
Tue Sep 10 20:00:40 2013 CB: WhoAreYou failed for host 0x7f57bb0ed5d8
(172.20.128.190:7001), error 1
Tue Sep 10 20:00:42 2013 CB: WhoAreYou failed for host 0x7f57bb0ec9a8
(172.20.128.192:7001), error 1
Tue Sep 10 20:01:07 2013 CB: RCallBackConnectBack failed for host
0x7f57bb0eb148 (172.20.128.130:7001)
Tue Sep 10 20:01:07 2013 CB: Call back connect back failed (in break
delayed) for Host 172.20.128.130:7001
Tue Sep 10 20:01:07 2013 BreakDelayedCallbacks FAILED for host
172.20.128.130:7001 which IS UP. Connection from 172.20.128.130:7001.
Possible network or routing failure.
Tue Sep 10 20:01:07 2013 MultiProbe failed to find new address for host
172.20.128.130:7001
Tue Sep 10 20:01:09 2013 CB: WhoAreYou failed for host 0x7f57bb0ef318
(172.20.128.133:7001), error 1
Tue Sep 10 20:01:26 2013 CB: ProbeUuid for host 0x7f57bb0ef318
(172.20.128.133:7001) failed -01