[OpenAFS] Re: OpenAFS client cache overrun?

Kim Kimball dhk@ccreinc.com
Tue, 03 Dec 2013 13:16:18 -0700

Not too long ago a cache size of approx 2.5 GB was a maximum -- you 
might try reducing the configured cache size to 2.5GB.
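
For reference, a minimal sketch of what that change looks like in cacheinfo. It assumes the three colon-separated fields are the AFS mount point, the cache directory, and the cache size in 1 KB blocks, so the 7500000 quoted below is the ~7.5G cache and 2500000 would be roughly 2.5 GB. The helper names are hypothetical and only illustrate the arithmetic.

```python
def parse_cacheinfo(line):
    """Split a cacheinfo line into (mount point, cache dir, size in 1 KB blocks)."""
    mount, cache_dir, blocks = line.strip().split(":")
    return mount, cache_dir, int(blocks)


def with_cache_size(line, blocks):
    """Return the same cacheinfo line with a different size field."""
    mount, cache_dir, _ = parse_cacheinfo(line)
    return f"{mount}:{cache_dir}:{blocks}"


def size_in_bytes(line):
    """Configured cache size in bytes, assuming 1 KB (1024-byte) blocks."""
    return parse_cacheinfo(line)[2] * 1024


current = "/afs:/usr/vice/cache:7500000"
print(size_in_bytes(current))                # configured size in bytes
print(with_cache_size(current, 2_500_000))   # candidate line for a ~2.5 GB cache
```

The client would need to be restarted for a new cacheinfo value to take effect.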


On 12/2/2013 11:47 AM, Andrew Deason wrote:
> On Fri, 22 Nov 2013 16:28:04 -0500
> Chris Garrison <ecgarris@iu.edu> wrote:
>> The hosts' /usr/vice/etc/cacheinfo files look like this:
>>    /afs:/usr/vice/cache:7500000
> [...]
>> Something has been locking up the openafs client in the past month or
>> so.  The cache will show as more and more full in "df" and then at
>> some point, AFS stops answering, and any attempt to do a directory
>> listing or to access a file results in a zombie process.
> Sorry if you haven't received any information on this yet; I can't look
> at this for too long right now, but I can try to provide a little
> information.
> Is /usr/vice/cache its own partition? Do you mean cache usage fills up
> the partition the cache is on, or it just fills up to about the size the
> cache is configured to? That is, does it fill up the disk, or do you
> just mean it fills up the configured ~7.5G?
>> What could cause that lockup? It's usually only on one host at a time,
>> and seems like it will "move" from host to host, even returning to the
>> same host in the same day after reboot once in a while.
> Presumably all accesses are waiting for something to get kicked out of
> the cache, since the cache is full. But for whatever reason, the thread
> for kicking stuff out of the cache is not doing that.
>> To me, it feels like maybe someone is forcing a huge file through and
>> running the machine out of cache. Though if that's so, I wonder why it
>> only just started happening after all these years. If nothing else, it
>> seems like something new is going on with the user end that's causing
>> it.
> It's either someone reading or writing a bunch of data. At various
> points in the past there have been problems when the cache is full of
> data, and we can't evict stuff out of the cache because it's "in use" or
> something like that. More recently there were some fixes to some cache
> eviction processing, but I'm not clear on whether that's relevant since I
> haven't seen a description of the relevant problem it was fixing.  That
> is included in 1.6.6pre1, though, if you wanted to try that.
>> Any help would be appreciated, anything from a fix by limiting
>> something in the openafs client or the cache or ideas as to what
>> someone could be doing. Because at this point, it's like a denial of
>> service attack that's making lots of problems for us.
> What you could get is an "fstrace" of the client while this problem is
> going on (there are instructions on the list and elsewhere for how to
> collect this, but ask if you need to), or get a stack trace of the
> CacheTruncateDaemon process. The latter you can get by installing the
> kernel debuginfo package, and then running 'crash' on the machine as the
> problem is happening. Find the PID for the 'afs_cachetrim' process, and
> run inside 'crash':
>     set <pid>
>     bt > /tmp/somefile
> Or, if you don't want to bother or can't find the PID, or if you want to
> be sure to capture _all_ possible relevant information, just run
>     foreach bt > /tmp/somefile
> instead, which will capture the stack trace of every process; that'll
> take a little more time and CPU.
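
One possible way to find that PID, sketched under the assumption that this is a Linux client and that the kernel thread shows up in /proc/<pid>/comm under the name 'afs_cachetrim' mentioned above. The function name and the proc_root parameter are hypothetical; the parameter exists only so the lookup can be pointed at a different directory.

```python
import os


def find_pids_by_name(name, proc_root="/proc"):
    """Return the PIDs whose <proc_root>/<pid>/comm content equals `name`."""
    if not os.path.isdir(proc_root):
        return []
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            # The process went away (or is unreadable) between listing and open.
            continue
    return sorted(pids)
```

Calling find_pids_by_name("afs_cachetrim") on the affected host should then give the number to pass to 'set' inside 'crash'. An empty list means no thread by that name was found, in which case the 'foreach bt' route is the fallback.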
> You can also just run 'cmdebug localhost' to see what processes are
> hanging on, but I assume that will just show that they are hanging on
> waiting for cache items to be evicted. And running 'cmdebug' may not
> ever complete if the client is wedged hard enough.
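
Since 'cmdebug' may never return on a wedged client, one option is to run it under a time limit. The following is a minimal sketch using Python's standard subprocess module; the function name and the 30-second limit in the usage note are arbitrary choices, not anything recommended by OpenAFS.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its stdout, or None if it exceeds timeout_s seconds.

    Raises FileNotFoundError if the command is not installed or not on PATH.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so nothing lingers.
        return None
    return result.stdout
```

For example, run_with_timeout(["cmdebug", "localhost"], 30) would return whatever cmdebug printed, or None if it was still waiting after 30 seconds, which is itself a hint that the client is wedged.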