[OpenAFS] OpenAFS client cache overrun?

Chris Garrison ecgarris@iu.edu
Fri, 22 Nov 2013 16:28:04 -0500


Hello,

We have some RHEL 5.5 servers running openafs-client-1.6.1-1. There are
4 of them in round-robin DNS, with Apache and Samba sitting on top of
the OpenAFS filesystem.

The hosts' /etc/sysconfig/openafs files look like this:

  # OpenAFS Client Configuration
  AFSD_ARGS="-dynroot -fakestat-all -daemons 8 -chunksize 22"

The hosts' /usr/vice/etc/cacheinfo files look like this:

  /afs:/usr/vice/cache:7500000
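
For reference, here is a rough sketch of what those two settings work
out to. It assumes that -chunksize is the base-2 logarithm of the chunk
size in bytes and that the third cacheinfo field counts 1 KiB blocks.
The helper names are made up for illustration, and it needs Python 3,
which stock RHEL 5 does not ship.

```python
# Assumptions: -chunksize N means chunks of 2**N bytes, and the third
# field of cacheinfo is the cache size in 1 KiB blocks.

def parse_cacheinfo(line):
    """Split a cacheinfo line into (mount point, cache dir, 1K blocks)."""
    mount, cache_dir, blocks = line.strip().split(":")
    return mount, cache_dir, int(blocks)

def chunk_bytes(chunksize_exp):
    """Chunk size in bytes for a given -chunksize exponent."""
    return 1 << chunksize_exp

def chunks_that_fit(cache_blocks, chunksize_exp):
    """How many full-size chunks fit in a cache of cache_blocks KiB."""
    return (cache_blocks * 1024) // chunk_bytes(chunksize_exp)

mount, cache_dir, blocks = parse_cacheinfo("/afs:/usr/vice/cache:7500000")
print(chunk_bytes(22) // (1024 * 1024), "MiB per chunk")
print(round(blocks / (1024 * 1024), 2), "GiB of cache")
print(chunks_that_fit(blocks, 22), "full-size chunks fit")
```

With the values above that is 4 MiB per chunk and room for 1831
full-size chunks. This is only arithmetic on the configured numbers; it
says nothing about how the cache manager actually allocates chunks.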

I realize it's better for users to run the OpenAFS client for their own
OS, but we have a large base of users who insist on just mapping a
drive without installing a client. We have been running like this for
8+ years now; it's not a new setup.

Something has been locking up the openafs client in the past month or
so.  The cache will show as more and more full in "df" and then at some
point, AFS stops answering, and any attempt to do a directory listing or
to access a file results in a zombie process.
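
To put numbers on "more and more full", something like the following
could be run from cron to log usage of the cache filesystem over time.
It is a minimal sketch, assuming Python 3 is available on the host and
that /usr/vice/cache (from the cacheinfo line) is the directory to
watch. The percentage is computed the way "df" does it, from the
statvfs block counts.

```python
import os
import time

CACHE_DIR = "/usr/vice/cache"  # from the cacheinfo line in this post

def used_fraction(blocks, bfree, bavail):
    """Used fraction as df reports it: used / (used + available)."""
    used = blocks - bfree
    if used + bavail == 0:
        return 0.0
    return used / (used + bavail)

def sample(path=CACHE_DIR):
    """Return (unix timestamp, used fraction) for the filesystem at path."""
    stats = os.statvfs(path)
    return time.time(), used_fraction(stats.f_blocks, stats.f_bfree,
                                      stats.f_bavail)

if __name__ == "__main__":
    try:
        stamp, frac = sample()
        when = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(stamp))
        print(f"{when} {CACHE_DIR} {frac:.1%} used")
    except OSError as err:
        # The directory is missing or cannot be read on this host.
        print(f"could not stat {CACHE_DIR}: {err}")
```

A log of these samples from each of the four hosts would show whether
the cache fills gradually or jumps shortly before a lockup.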

The zombie processes mount up fast, the load on the machine skyrockets,
and the only solution seems to be to reboot.
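
A rough sketch of how the stuck processes could be counted by state,
to check whether they are true zombies (Z) or are sitting in
uninterruptible sleep (D), which is where processes blocked on
filesystem I/O usually show up. It reads the Linux /proc/<pid>/stat
files directly. The function names are illustrative, and it assumes
Python 3.

```python
import os
from collections import Counter

def parse_state(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat.

    The command name is in parentheses and may itself contain spaces or
    parentheses, so take the first field after the last ')'.
    """
    after_comm = stat_text[stat_text.rfind(")") + 1:]
    return after_comm.split()[0]

def count_states(proc_root="/proc"):
    """Count processes by state letter under proc_root."""
    counts = Counter()
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return counts  # no such directory, e.g. not running on Linux
    for name in entries:
        if not name.isdigit():
            continue  # only the numeric entries are processes
        stat_path = os.path.join(proc_root, name, "stat")
        try:
            with open(stat_path, errors="replace") as handle:
                counts[parse_state(handle.read())] += 1
        except (OSError, IndexError):
            continue  # process exited while we were reading it
    return counts

if __name__ == "__main__":
    counts = count_states()
    print("zombie (Z):", counts["Z"])
    print("uninterruptible sleep (D):", counts["D"])
```

Run during a lockup, the two counts would show which kind of stuck
process is piling up and driving the load.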

What could cause that lockup? It usually hits only one host at a time
and seems to "move" from host to host, once in a while even returning
to the same host on the same day after a reboot.

I doubled the cache size on these hosts, and that seemed to make the
lockups less frequent. But we had another one today. All the clients
were restarted on Sunday during a hardware upgrade on the SAN, so no
host had been running more than 3 days.

To me, it feels like someone may be forcing a huge file through and
running the machine out of cache. If that's so, though, I wonder why it
only started happening now, after all these years. If nothing else, it
seems like something new on the user end is causing it.

Any help would be appreciated, anything from a fix that limits
something in the OpenAFS client or the cache to ideas about what
someone could be doing. At this point it's acting like a denial of
service attack and causing a lot of problems for us.

Thank you,

Chris