[OpenAFS] Prolonged period of blocked connections

Will Maier willmaier@ml1.net
Wed, 4 Feb 2009 15:38:52 -0600

Hi folks-

In the past, we've observed prolonged periods where one or more of
our servers would report more than 200 calls waiting for a thread.
This occurred again this morning and lasted for about four hours.
While the server reported the blocked calls, top showed that the
fileserver was pegged at >= 100% CPU and FileLog (with verbosity
increased via SIGTSTP) showed a huge number of SAFS_FetchStatus
calls (and very little else).
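
A tally along the following lines is one way to check that kind of breakdown. This is only a sketch: it assumes each logged RPC appears in FileLog as a token beginning with SAFS_, and the function name tally_rpcs is hypothetical, not part of OpenAFS.

```python
import re
from collections import Counter

# Assumption: every logged RPC shows up as a token starting with "SAFS_".
RPC_RE = re.compile(r"\bSAFS_\w+")

def tally_rpcs(lines):
    """Count how often each SAFS_* RPC name occurs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(RPC_RE.findall(line))
    return counts

# Usage (the path is whatever your FileLog location is):
#   with open(path_to_filelog, errors="replace") as log:
#       for name, n in tally_rpcs(log).most_common(10):
#           print(n, name)
```

Matching on the SAFS_ prefix rather than on a full line layout keeps the sketch independent of the exact FileLog format at a given debug level.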

During this time, I also noticed that the number of blocked calls
seemed to oscillate between 0 and ~220 over a period of about 100
seconds (with ~1300 total clients according to the hosts.dump file).
This made me wonder whether some component periodically clears the
backlog and, if so, whether that period is easily tunable.
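
The oscillation could be measured rather than eyeballed by sampling the backlog at a fixed interval. The sketch below assumes that `rxdebug <server> 7000 -noconns` prints a line of the form "N calls waiting for a thread"; the function names, the 5-second interval and the sample count are all hypothetical choices.

```python
import re
import subprocess
import time

# Assumption: rxdebug output contains a line like "217 calls waiting for a thread".
WAITING_RE = re.compile(r"(\d+) calls waiting for a thread")

def parse_waiting(text):
    """Return the waiting-call count found in rxdebug output, or None."""
    match = WAITING_RE.search(text)
    return int(match.group(1)) if match else None

def sample_backlog(server, port=7000, interval=5, count=60):
    """Poll rxdebug `count` times and return (timestamp, waiting_calls) pairs."""
    samples = []
    for _ in range(count):
        out = subprocess.run(
            ["rxdebug", server, str(port), "-noconns"],
            capture_output=True, text=True,
        ).stdout
        samples.append((time.time(), parse_waiting(out)))
        time.sleep(interval)
    return samples

def drain_intervals(samples):
    """Seconds between successive moments when the backlog falls back to 0."""
    drains = []
    prev = None
    for stamp, waiting in samples:
        if waiting is None:
            continue  # skip samples where the output could not be parsed
        if waiting == 0 and prev not in (None, 0):
            drains.append(stamp)
        prev = waiting
    return [later - earlier for earlier, later in zip(drains, drains[1:])]
```

If the intervals returned by drain_intervals cluster tightly around one value, the backlog is probably being cleared on a timer; a wide spread would suggest the oscillation simply follows client load.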

This condition tends to coincide with a large number of batch jobs
that, unfortunately, must get some of their shared libraries,
binaries and configuration/seed files from our AFS cell. We've done
as much as we can to limit the amount of data in AFS that these jobs
require, but we still observe blocked calls, especially when a large
number of jobs spin up at approximately the same time. It's also
possible that the jobs are overwhelming the clients' caches, which
could conceivably cause extra/spurious calls to the server. Is this
a possibility?

If the periodicity of the backlog's level is a red herring, is there
something else we might consider? See below for system details on
the file server. The clients all run Linux on 32- and 64-bit
machines connected to our servers via gigabit links.


$ uname -a
Linux 2.6.9-42.0.3.ELsmp #1 SMP Thu Oct 5 16:29:37 CDT 2006 x86_64 x86_64 x86_64 GNU/Linux

$ rpm -qa | grep openafs


[Will Maier]-----------------[willmaier@ml1.net|http://www.lfod.us/]