[OpenAFS] Re: fileserver process eat CPU

Andrew Deason adeason@sinenomine.net
Tue, 10 Sep 2013 10:43:37 -0500


On Tue, 10 Sep 2013 07:57:03 +0200
Jean-Marc Choulet <jm130794@gmail.com> wrote:

> We have a question about the fileserver process. On our server openafs 
> (Debian squeeze), the fileserver process eats 80% CPU every 5 seconds. 

Are you running the openafs version from squeeze
(1.4.12.1+dfsg-4+squeeze2)?

This is somewhat a guess, but: that version will sync() every 10
seconds; that's a bug that is fixed in later versions. That's every 10
seconds, though, not every 5, and that should only really do anything
noticeable if you have a lot of disk activity on the machine (whether or
not it's caused by openafs). If you 'strace' the fileserver process, it
should be pretty obvious if that's happening; you'll see sync() calls
executing and taking a long time.
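
To make that check concrete, here is a rough sketch of one way to watch
for sync() calls. The helper name watch_sync and the DRY_RUN switch are
made up for illustration; the strace options are the standard ones
(-f to follow threads, -tt for timestamps, -T for the time spent in
each call). It assumes you run it as root with the fileserver's pid:

```shell
# Hypothetical helper: trace only sync() calls made by the given pid.
# With DRY_RUN=1 it prints the strace command instead of running it.
watch_sync() {
    pid=$1
    run=${DRY_RUN:+echo}
    # -f: follow threads, -tt: timestamps, -T: time spent in each call
    $run strace -f -tt -T -e trace=sync -p "$pid"
}
```

Calling 'watch_sync 1234' (with 1234 replaced by the real pid) and
watching for a few minutes should show whether sync() lines turn up at
a regular interval, and the number in <...> at the end of each line is
how long the call took.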

For how long does it use 80% cpu?

> No AFS client is connected to the server
> We restarted our server but it is always the same.

If the above is the problem, you can patch the server to not do that
(it's a very small patch), or you may be able to alter the underlying
filesystem to make sync() calls less noticeable.

However, if that's not the problem, or just more generally to see
"what is going on", there are a few different things you can do:

 - Check FileLog and BosLog (or syslog if you log to syslog), or just
   all of the *Log files. Just see if anything abnormal looks like it's
   being logged. And of course, if there's anything getting logged every
   5 seconds, it's probably relevant.

 - If there's nothing in FileLog, try turning up the debug level.
   Sending the fileserver process a TSTP signal will increase the
   debug level to 1, and sending another will set it to 5, then 25, then
   125. If it's at 125 and you don't see anything, that is rather
   unusual. Send it a HUP signal to reset the debug level to 0. The
   debug log may not be _too_ clear about what's going on, but you
   should at least be able to pick out individual FIDs or IP addresses.
   Ask here about the output if you are confused by it.

 - If you still don't know what's going on, you can try capturing a
   bunch of stacks from the fileserver process with the 'pstack'
   command. Squeeze only has 'pstack' for i386, but if you grab the
   pstack command from wheezy for amd64, it'll probably work (iirc it's
   just a script wrapping some gdb commands). Once you have that, run
   something like this while the fileserver is using a lot of cpu:
 
for i in {0..9} ; do date ; pstack 1234 ; sleep .1 ; done > /tmp/fileserver.pstack

   (with 1234 replaced by the fileserver's pid), and you'll probably
   need to share that output with a developer to interpret.
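
For the debug level step in the list above, a minimal sketch of sending
the signals might look like this. The function names and the DRY_RUN
switch are hypothetical; the only real commands involved are kill and
sleep. The one-second pause is an assumption on my part, there so that
repeated TSTP signals are not merged into one before the fileserver
handles them:

```shell
# Hypothetical helper: send TSTP to a pid the given number of times.
# Per the description above, each TSTP steps the level 1, 5, 25, 125.
# With DRY_RUN=1 it prints the kill commands instead of running them.
raise_debug() {
    pid=$1
    count=$2
    run=${DRY_RUN:+echo}
    i=0
    while [ "$i" -lt "$count" ]; do
        $run kill -TSTP "$pid" || return 1
        # Pause between real signals so they are delivered separately.
        [ -n "$run" ] || sleep 1
        i=$((i + 1))
    done
}

# Hypothetical helper: send HUP to reset the debug level to 0.
reset_debug() {
    run=${DRY_RUN:+echo}
    $run kill -HUP "$1"
}
```

So 'raise_debug 1234 4' would take the level up to 125, and
'reset_debug 1234' puts it back to 0 when you are done (again, 1234
stands in for the real fileserver pid).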

Note that if you share any of that log information or stacks publicly
(for example, on this list), it may contain volume ids, usernames, user
ids, and filenames. If any of that is private and you don't want to
share it, then don't share those logs, or scrub out the relevant info.
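
As a starting point for scrubbing, here is a small sketch that masks
anything that looks like an IPv4 address. The function name scrub_ips
is made up, and the pattern is deliberately crude: it only handles
dotted-quad addresses and does nothing about volume ids, usernames,
user ids, or filenames, so you would still need to read through the
result by hand:

```shell
# Hypothetical helper: replace dotted-quad IPv4 addresses on stdin
# with a fixed placeholder. Everything else passes through unchanged.
scrub_ips() {
    sed -E 's/[0-9]{1,3}(\.[0-9]{1,3}){3}/x.x.x.x/g'
}
```

Something like 'scrub_ips < FileLog > FileLog.scrubbed' would then give
you a copy to review before sending it anywhere.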

-- 
Andrew Deason
adeason@sinenomine.net