[OpenAFS] Re: Tuning the -daemons.

Andrew Deason adeason@sinenomine.net
Mon, 7 Feb 2011 11:21:23 -0600

On Mon, 7 Feb 2011 16:53:25 +0100
Jan Johansson <janj@it.su.se> wrote:

> Short version:
> How do I see how many of the "background daemons" that are in use?

We don't offer an easy way to see this. I believe you are on Linux, in
which case you can look at the process backtraces via 'echo t >
/proc/sysrq-trigger', and see which of the background daemon processes
are busy doing something. You also may be able to examine the
afs_brsDaemons variable if you are able to use a kernel debugger to
retrieve values from the running kernel. (That one will tell you how
many daemons are idle; afs_brsWaiting can tell you how many requests are
waiting for a free daemon.)
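
For the sysrq route, a rough Python sketch along these lines can pull the
relevant tasks out of a saved dump, much like 'grep -A'. It assumes you
have already triggered the dump and saved the kernel log as text, and
that the background daemons show up under a task name starting with
"afs_background". That name, the header-in-column-0 layout and the
number of trailing lines are all assumptions to check against your own
kernel, since the dump format differs between kernel versions.

```python
import re

# Optional dmesg-style "[12345.678901] " timestamp at the start of a line.
_TS_PREFIX = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def task_blocks(dump_text, task_name="afs_background", context=12):
    """Return one list of lines per task header starting with task_name.

    task_name and context are assumptions: check which name your kernel
    reports and how long its backtraces are before relying on this.
    """
    # Strip timestamps so the task name is at the start of header lines.
    lines = [_TS_PREFIX.sub("", ln) for ln in dump_text.splitlines()]
    blocks = []
    for i, line in enumerate(lines):
        # A task header is assumed to begin with the task name.
        if line.startswith(task_name):
            # Keep the header plus the next `context` lines of backtrace.
            blocks.append(lines[i:i + 1 + context])
    return blocks
```

Feeding it the saved log text and printing each returned block should
show one backtrace per background daemon; you then still have to read
the traces yourself to judge which daemons are sleeping and which are
in the middle of a request.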

But I doubt that the background daemons are your problem (or at least,
the only problem); they mostly only handle prefetching readaheads and
background writes (if you've configured storebehind mode), iirc.

> On an untuned Ubuntu with 5 users we saw an issue where the cache
> manager would freeze and start reporting 
> afs: Lost contact with file server AAA.BBB.CCC.133 in cell example.com
> at the same time the file server reports
> fileserver[1139]: CB: Call back connect back failed (in break
> delayed) for Host AAA.BBB.CCC.186:7001
> fileserver[1139]: BreakDelayedCallbacks FAILED for host
> AAA.BBB.CCC.186:7001 which IS UP.  Connection from
> AAA.BBB.CCC.186:7001.  Possible network or routing failure.
> sometimes it recovers after a while other times it needs a
> reboot.

I think the only time I've actually seen this before is when the
client's network is acting weird, although maybe it could also be the
client's callback servicing thread hanging. The fileserver message says
that we got a packet from a client at AAA.BBB.CCC.186:7001, but when we
tried to make an RPC back to AAA.BBB.CCC.186:7001, the client didn't
respond.

Does this happen for long enough that you can rxdebug or cmdebug the
client while it is happening? Does 'rxdebug <client> 7001' respond
during this hang?
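
If the hang is hard to catch by hand, something like the following
Python sketch could poll the client and log whether rxdebug got an
answer. It only runs 'rxdebug <client> 7001' as above; the function
names, the interval and the timeout are made up for illustration, and
treating a timeout or non-zero exit status as "no response" is an
assumption about how rxdebug behaves, so read the captured output
rather than trusting the flag alone.

```python
import subprocess
import time


def probe(cmd, timeout=10.0):
    """Run cmd once and return (responded, output).

    A timeout or a non-zero exit status is treated as "no response";
    that interpretation is an assumption, not documented behaviour.
    """
    try:
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, ""
    except FileNotFoundError:
        return False, "command not found: %s" % cmd[0]
    return result.returncode == 0, result.stdout


def watch(client, port="7001", interval=30.0, rounds=10, runner=probe):
    """Probe the client's callback port repeatedly; return the results."""
    results = []
    for _ in range(rounds):
        ok, out = runner(["rxdebug", client, port])
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        status = "responded" if ok else "NO RESPONSE"
        print("%s rxdebug %s %s: %s" % (stamp, client, port, status))
        if out:
            print(out.rstrip())
        results.append(ok)
        time.sleep(interval)
    return results
```

Calling watch() with the client's hostname from another machine while
the problem is occurring gives a timestamped record you can line up
against the fileserver log.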

Also, what platform and OpenAFS versions are the server and client
running?

Andrew Deason