[OpenAFS] OpenAFS client cache overrun?

Eric Chris Garrison ecgarris@iu.edu
Wed, 20 Nov 2013 16:47:44 -0500


Hello,

We have four RHEL 5.5 servers running openafs-client-1.6.1-1. They sit
behind round-robin DNS, with Apache and Samba on top of the OpenAFS
filesystem.

The hosts' /etc/sysconfig/openafs files look like this:

  # OpenAFS Client Configuration
  AFSD_ARGS="-dynroot -fakestat-all -daemons 8 -chunksize 22"

The hosts' /usr/vice/etc/cacheinfo files look like this:

  /afs:/usr/vice/cache:7500000
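
For scale, here is a quick arithmetic sketch of what those two settings
imply. It assumes the usual meanings: the third cacheinfo field is the cache
size in 1 KB blocks, and afsd's -chunksize is the base-2 log of the chunk
size in bytes. The function names are made up for illustration.

```python
def parse_cacheinfo(line):
    """Split a cacheinfo line into (mount point, cache dir, size in 1 KB blocks)."""
    mount, cache_dir, blocks = line.strip().split(":")
    return mount, cache_dir, int(blocks)


def summarize(cacheinfo_line, chunksize_exp):
    """Return cache size, chunk size, and how many whole chunks fit in the cache."""
    _, _, blocks = parse_cacheinfo(cacheinfo_line)
    cache_bytes = blocks * 1024          # assumption: blocks are 1024 bytes each
    chunk_bytes = 1 << chunksize_exp     # assumption: -chunksize N means 2**N bytes
    return {
        "cache_gib": cache_bytes / 2**30,
        "chunk_mib": chunk_bytes / 2**20,
        "full_chunks": cache_bytes // chunk_bytes,
    }


if __name__ == "__main__":
    print(summarize("/afs:/usr/vice/cache:7500000", 22))
```

Under those assumptions the cache is about 7.15 GiB, carved into 4 MiB
chunks, so roughly 1831 chunks fit at once.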

I realize it's better for users to run the OpenAFS client for their own
OS, but we have a large base of users who insist on just mapping a drive
without installing a client. We have been running like this for 8+ years;
it's not a new setup.

Something has been locking up the OpenAFS client over the past month or so.
The cache shows as more and more full in "df", and then at some point AFS
stops answering, and any attempt to list a directory or access a file
results in a zombie process.

The zombie processes mount up fast, the load on the machine skyrockets, and
the only solution seems to be to reboot.
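
One way to see what those stuck processes really are is to tally process
state codes. This is only a sketch: it assumes a procps-style "ps" where
"ps -eo stat=" prints one state string per process, with the first letter
being the state (Z for zombie, D for uninterruptible sleep, which is what
processes blocked on a hung filesystem usually show). The helper names are
hypothetical.

```python
import subprocess
from collections import Counter


def tally_states(ps_stat_output):
    """Count processes by the first letter of each 'stat' value (R, S, D, Z, ...)."""
    counts = Counter()
    for line in ps_stat_output.splitlines():
        stat = line.strip()
        if stat:                      # skip blank lines
            counts[stat[0]] += 1      # ignore modifiers such as 's', '+', '<'
    return counts


def current_states():
    """Run ps and tally the states of all processes on this host."""
    out = subprocess.check_output(["ps", "-eo", "stat="], text=True)
    return tally_states(out)

# Example use on a live host:
#   states = current_states()
#   print("D:", states["D"], "Z:", states["Z"])
```

A count of D-state processes that climbs while the Z count stays flat would
suggest the processes are blocked waiting on AFS rather than being true
zombies.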

What could cause that lockup? It usually hits only one host at a time, and
it seems to "move" from host to host, once in a while even returning to the
same host on the same day after a reboot.

I doubled the cache size on these hosts, and that seemed to slow the problem
down, but we had another lockup today. All the clients were restarted on
Sunday during a hardware upgrade on the SAN, so no host had been running
more than 3 days.

To me, it feels like maybe someone is forcing a huge file through and
running the machine out of cache. If that's so, though, I wonder why it only
just started happening after all these years. If nothing else, it seems like
something new on the user end is causing it.

Any help would be appreciated, whether it's a fix that limits something in
the OpenAFS client or the cache, or ideas as to what someone could be doing.
At this point it's acting like a denial of service attack and causing a lot
of problems for us.

Thank you,

Chris Garrison
Indiana University Research Storage


