[OpenAFS] Prolonged period of blocked connections

Derrick Brashear shadow@gmail.com
Wed, 4 Feb 2009 16:42:05 -0500

On Wed, Feb 4, 2009 at 4:38 PM, Will Maier <willmaier@ml1.net> wrote:
> Hi folks-
> In the past, we've observed prolonged periods where one or more of
> our servers would report more than 200 calls waiting for a thread.
> This occurred again this morning and lasted for about four hours.

Can you run

bos status (fileserverhost) fs -long

and post that information?

However, lots of bugs that would affect this have been fixed since
1.4.1, which is ancient.

> While the server reported the blocked calls, top showed that the
> fileserver was pegged at >= 100% CPU and FileLog (with verbosity
> increased via SIGTSTP) showed a huge number of SAFS_FetchStatuses
> (and very little else).
> During this time, I also noticed that the number of blocked calls
> seemed to oscillate between 0 and ~220 over a period of about 100
> seconds (with ~1300 total clients according to the hosts.dump file).
> This made me wonder if there wasn't some component that was
> periodically clearing the backlog and, if so, if the period might be
> easily modifiable.
> This condition tends to coincide with a large number of batch jobs
> that, unfortunately, must get some of their shared libraries,
> binaries and configuration/seed files from our AFS cell. We've done
> as much as we can to limit the amount of data in AFS that these jobs
> require, but we still observe blocked calls, especially when a large
> number of jobs spin up at approximately the same time. It's also
> possible that the jobs are overwhelming the clients' caches, which
> could conceivably cause extra/spurious calls to the server. Is this
> a possibility?
> If the periodicity of the backlog's level is a red herring, is there
> something else we might consider?

Yes. Upgrade to OpenAFS 1.4.8.
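
To track the backlog over time instead of eyeballing it, something like
the following might help. This is only a minimal sketch and not part of
OpenAFS: it assumes rxdebug prints a line of the form "N calls waiting
for a thread" (the wording quoted above), that the fileserver listens on
port 7000, and that "rxdebug <host> <port> -noconns" is available on the
machine. The function names are made up for illustration.

```python
import re
import subprocess

# Assumed line format, based on the "calls waiting for a thread" wording
# quoted above; adjust the pattern if your rxdebug prints it differently.
WAITING_RE = re.compile(r"(\d+)\s+calls waiting for a thread")


def parse_waiting_calls(text):
    """Return the waiting-call count found in text, or None if absent."""
    match = WAITING_RE.search(text)
    return int(match.group(1)) if match else None


def sample(host, port=7000):
    """Run rxdebug once against host (assumed invocation) and parse it."""
    out = subprocess.run(
        ["rxdebug", host, str(port), "-noconns"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_waiting_calls(out)
```

Calling sample() every few seconds and logging the result with a
timestamp would show whether the ~100 second oscillation is real or an
artifact of when the counter happened to be checked.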