[OpenAFS] Re: OpenAFS client cache overrun?

Mon, 2 Dec 2013 12:47:48 -0600

On Fri, 22 Nov 2013 16:28:04 -0500
Chris Garrison <ecgarris@iu.edu> wrote:

> The hosts' /usr/vice/etc/cacheinfo files look like this:
> 
>   /afs:/usr/vice/cache:7500000
[...]
> Something has been locking up the openafs client in the past month or
> so.  The cache will show as more and more full in "df" and then at
> some point, AFS stops answering, and any attempt to do a directory
> listing or to access a file results in a zombie process.

Sorry if you haven't received any information on this yet; I can't look
at this for too long right now, but I can try to provide a little
information.

Is /usr/vice/cache its own partition? And when you say the cache fills
up, do you mean the partition holding the cache fills up completely, or
just that usage climbs to about the configured ~7.5G?
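
If it helps, here is a rough Python sketch for checking both things at
once. The function names are made up for this example, and it assumes
the cacheinfo file has the single-line layout you quoted, with the third
field being the cache size in 1K blocks.

```python
import os

def parse_cacheinfo(line):
    """Split a cacheinfo line 'mountpoint:cachedir:blocks' into its fields."""
    mount, cache_dir, blocks = line.strip().split(":")
    return mount, cache_dir, int(blocks)

def fs_usage_kb(path):
    """Return (used_kb, total_kb) for the filesystem that holds path."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize // 1024
    used = (st.f_blocks - st.f_bfree) * st.f_frsize // 1024
    return used, total

def report(cacheinfo_path="/usr/vice/etc/cacheinfo"):
    """Compare the configured cache size with the filesystem it lives on."""
    with open(cacheinfo_path) as f:
        _mount, cache_dir, configured_kb = parse_cacheinfo(f.readline())
    used_kb, total_kb = fs_usage_kb(cache_dir)
    return {
        "cache_dir": cache_dir,
        # True only if the cache directory is itself a mount point.
        "own_partition": os.path.ismount(cache_dir),
        "configured_kb": configured_kb,
        "fs_used_kb": used_kb,
        "fs_total_kb": total_kb,
    }
```

Calling report() on an affected host while the cache looks full should
show whether usage is sitting near the configured size or near the size
of the whole filesystem.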

> What could cause that lockup? It's usually only on one host at a time,
> and seems like it will "move" from host to host, even returning to the
> same host in the same day after reboot once in awhile.

Presumably all accesses are waiting for something to get kicked out of
the cache, since the cache is full. But for whatever reason, the thread
for kicking stuff out of the cache is not doing that.

> To me, it feels like maybe someone is forcing a huge file through and
> running the machine out of cache. Though if that's so, I wonder why it
> only just started happening after all these years. If nothing else, it
> seems like something new is going on with the user end that's causing
> it.

It's someone either reading or writing a lot of data. At various points
in the past there have been problems when the cache is full of data and
we can't evict anything because it's "in use" or something like that.
More recently there were some fixes to cache eviction processing, but
I'm not sure whether they're relevant here, since I haven't seen a
description of the problem they were fixing. Those fixes are included in
1.6.6pre1, though, if you want to try that.

> Any help would be appreciated, anything from a fix by limiting
> something in the openafs client or the cache or ideas as to what
> someone could be doing. Because at this point, it's like a denial of
> service attack that's making lots of problems for us.

What you could get is an "fstrace" of the client while this problem is
going on (there are instructions on the list and elsewhere for how to
collect this, but ask if you need to), or get a stack trace of the
CacheTruncateDaemon process. The latter you can get by installing the
kernel debuginfo package, and then running 'crash' on the machine as the
problem is happening. Find the PID for the 'afs_cachetrim' process, and
run inside 'crash':

set <pid>
bt > /tmp/somefile

Or, if you don't want to bother or can't find the PID, or if you want to
be sure to capture _all_ possible relevant information, just run

foreach bt > /tmp/somefile

instead, which will capture the stack trace of every process; that'll
take a little more time and CPU.
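
If finding the PID is the annoying part, a sketch like the following
could do it on Linux by scanning /proc. The helper names are
hypothetical; the only thing it assumes is that the thread's command
name is 'afs_cachetrim' as mentioned above, which it reads from
/proc/<pid>/stat.

```python
import os

def comm_from_stat(stat_text):
    """Extract the command name from the contents of /proc/<pid>/stat."""
    # Format is "pid (comm) state ..."; comm may itself contain parentheses,
    # so take everything between the first "(" and the last ")".
    return stat_text[stat_text.find("(") + 1:stat_text.rfind(")")]

def find_pids_by_name(name, proc_root="/proc"):
    """Return the sorted PIDs whose command name equals name."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat"),
                      errors="replace") as f:
                stat_text = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        if comm_from_stat(stat_text) == name:
            pids.append(int(entry))
    return sorted(pids)

# Example: find_pids_by_name("afs_cachetrim")
```

Whatever PID that returns is what you would hand to 'set' inside
'crash'.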

You can also just run 'cmdebug localhost' to see what the hanging
processes are waiting on, but I assume that will just show them waiting
for cache items to be evicted. And 'cmdebug' itself may never complete
if the client is wedged hard enough.
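
Since 'cmdebug' may hang, one option is to wrap it in a timeout so that
a collection script doesn't get stuck too. This is only a sketch;
run_with_timeout is a made-up helper, and the only cmdebug invocation it
assumes is the 'cmdebug localhost' form above.

```python
import subprocess

def run_with_timeout(cmd, seconds):
    """Run cmd and return its combined output, or None if it timed out."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    try:
        out, _ = proc.communicate(timeout=seconds)
        return out
    except subprocess.TimeoutExpired:
        proc.kill()
        try:
            # Give the child a moment to exit, but never block for long
            # here in case it does not go away promptly.
            proc.communicate(timeout=1)
        except subprocess.TimeoutExpired:
            pass
        return None

# Example:
#   out = run_with_timeout(["cmdebug", "localhost"], 30)
#   if out is None, cmdebug did not finish within 30 seconds
```

A None result is itself a useful data point, since it suggests the
client is wedged badly enough that the stack trace is the better thing
to collect.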

-- 
Andrew Deason
adeason@sinenomine.net