[OpenAFS] OpenAFS client cache overrun?
Eric Chris Garrison
ecgarris@iu.edu
Thu, 06 Mar 2014 16:32:46 -0500
We upgraded the gateways mentioned in the original email to
openafs-client-1.6.6-0.pre1 a while back, since it included a bugfix for cache
overrun (thanks for the help, Derrick). For a while it seemed to have
worked: our AFS clients on the gateway hosts weren't locking up.
But the problem is back. We've had many lockups requiring a reboot over the
past month, usually in clusters, as if a user locks up one host and
then moves to another and locks that one up too.
After taking "smbstatus" snapshots each time it locks up, I've finally found
a common factor: large Outlook .pst files being locked:
One host:
4229 46516 DENY_ALL 0x7019f RDWR NONE
/afs/iu.edu/home/b/e/xxxxxx
xxxxxx@ads.iu.edu/BL-ECON-WY214-1/Data/C/Users/xxxxxx/Documents/Outlook
Files/Xxxx Xxxxxx E-Mail Archive (2006-2011) (2014_02_27 18_02_49 UTC).pst
Mon Mar 3 13:57:23 2014
12868 46516 DENY_ALL 0x7019f RDWR NONE
/afs/iu.edu/home/b/e/xxxxxx
xxxxxx@ads.iu.edu/BL-ECON-WY214-1/Data/C/Users/xxxxxx/Documents/Outlook
Files/Xxxx Xxxxxx E-Mail Archive (2006-2011) (2014_02_27 18_02_49 UTC).pst
Mon Mar 3 16:30:32 2014
30686 46516 DENY_ALL 0x7019f RDWR NONE
/afs/iu.edu/home/b/e/xxxxxx
xxxxxx@ads.iu.edu/BL-ECON-WY214-1/Data/C/Users/xxxxxx/Documents/Outlook
Files/Xxxx Xxxxxx E-Mail Archive (2006-2011) (2014_02_27 18_02_49 UTC).pst
Thu Mar 6 14:53:39 2014
On another host:
/home/b/e/xxxxxx
xxxxxx@ads.iu.edu/BL-ECON-WY214-1/Data/C/Users/xxxxxx/Documents/Outlook
Files/Xxxx Xxxxxx E-Mail Archive (2006-2011) (2014_02_27 18_02_49 UTC).pst
Mon Mar 3 15:21:53 2014
ecg-ss2:24849 46516 DENY_ALL 0x7019f RDWR NONE
/afs/iu.edu/home/b/e/xxxxxx
xxxxxx@ads.iu.edu/BL-ECON-WY214-1/Data/C/Users/xxxxxx/Documents/Outlook
Files/Xxxx Xxxxxx E-Mail Archive (2006-2011) (2014_03_06 18_11_27 UTC).pst
Thu Mar 6 14:44:26 2014
These entries are present on every host that has locked up, and it is even
the same .pst file each time: a 6.5 GB file. Our AFS client cache is 7 GB on
a 9 GB partition.
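The "common factor" check can be made repeatable by intersecting the sets of
locked files seen at each lockup. What follows is only a sketch: it assumes
each "smbstatus" snapshot has already been reduced to a set of file names
(parsing the real output is not shown), and the names common_locked_files and
snapshots, as well as the file names in the example, are hypothetical.

```python
def common_locked_files(snapshots):
    """Return the file names that appear in every snapshot."""
    sets = [set(s) for s in snapshots]
    if not sets:
        return set()
    return set.intersection(*sets)


# Placeholder names for illustration only, not real data.
snapshots = [
    {"archive.pst", "report.doc"},
    {"archive.pst", "notes.txt"},
    {"archive.pst"},
]
print(common_locked_files(snapshots))
```

Anything that survives the intersection was locked at every lockup, which
makes it a candidate cause rather than a proven one.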
I'm writing to the user to see if he's doing anything extraordinary.
Still looking for ideas. I haven't yet tried Kim Kaball's idea of lowering the
cache size to 2.5 GB. I may try that next, but I worry that it will hurt
performance too much.
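To put the sizes quoted above side by side, here is a small sketch of the
arithmetic. It only compares the file size with the current and the proposed
cache size; it does not model how the cache manager actually stores or evicts
data, and the function name file_to_cache_ratio is hypothetical.

```python
def file_to_cache_ratio(file_gb, cache_gb):
    """Return how large a file is relative to the cache (1.0 = same size)."""
    return file_gb / cache_gb


FILE_GB = 6.5  # size of the .pst file
for cache_gb in (7.0, 2.5):  # current cache size, proposed cache size
    ratio = file_to_cache_ratio(FILE_GB, cache_gb)
    print(f"cache {cache_gb} GB: file is {ratio:.2f}x the cache size")
```

With the current 7 GB cache the file is about 0.93 of the cache size; with a
2.5 GB cache it would be 2.6 times the cache size.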
Thanks!!!
Chris Garrison
Indiana University
UITS Research Storage
From: Chris Garrison <ecgarris@iu.edu>
Date: Wednesday, November 20, 2013 4:47 PM
To: "openafs-info@openafs.org" <openafs-info@openafs.org>
Subject: [OpenAFS] OpenAFS client cache overrun?
Hello,
We have some RHEL 5.5 servers running openafs-client-1.6.1-1. There are
4 of them in a round-robin DNS, with Apache and Samba sitting on top of
the OpenAFS filesystem.
The hosts' /etc/sysconfig/openafs files look like this:
# OpenAFS Client Configuration
AFSD_ARGS="-dynroot -fakestat-all -daemons 8 -chunksize 22"
The hosts' /usr/vice/etc/cacheinfo files look like this:
/afs:/usr/vice/cache:7500000
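For reference, here is a rough sketch of what those two settings work out to.
It assumes, as I understand the afsd documentation, that the third cacheinfo
field is a count of 1 KB blocks and that -chunksize is the base-2 logarithm of
the chunk size in bytes. The helper names parse_cacheinfo and chunk_bytes are
hypothetical, and the last figure is only how many full chunks would fit, not
the number of cache files afsd actually creates.

```python
import re


def parse_cacheinfo(line):
    """Split 'mountpoint:cachedir:blocks' into its three fields."""
    mount, cache_dir, blocks = line.strip().split(":")
    return mount, cache_dir, int(blocks)


def chunk_bytes(afsd_args):
    """Return the chunk size in bytes from '-chunksize N', or None if absent."""
    match = re.search(r"-chunksize\s+(\d+)", afsd_args)
    return 2 ** int(match.group(1)) if match else None


mount, cache_dir, blocks = parse_cacheinfo("/afs:/usr/vice/cache:7500000")
chunk = chunk_bytes("-dynroot -fakestat-all -daemons 8 -chunksize 22")
cache_bytes = blocks * 1024  # assumption: blocks are 1 KB each

print(f"cache size: {cache_bytes / 2**30:.2f} GiB")
print(f"chunk size: {chunk // 2**20} MiB")
print(f"full chunks that fit: {cache_bytes // chunk}")
```

Under those assumptions the cache is about 7.15 GiB, each chunk is 4 MiB, and
1831 full chunks fit.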
I realize it's better for users to all use the OpenAFS client for their own
OS, but we have a large base of users who insist on just mapping a
drive without installing a client. We have been running like this for 8+
years now; it's not a new setup.
Something has been locking up the OpenAFS client in the past month or so.
The cache shows as more and more full in "df", and then at some point
AFS stops answering, and any attempt to do a directory listing or to access
a file results in a zombie process.
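One way to watch that fill level without eyeballing "df" is to sample the
cache partition from a script. This is a minimal sketch using Python's
standard shutil.disk_usage; the 0.95 alert threshold and the function names
are hypothetical, and the path falls back to the current directory when
/usr/vice/cache does not exist so the example runs anywhere.

```python
import os
import shutil


def fill_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_alert(fraction, threshold=0.95):
    """Return True when the fill fraction reaches the (hypothetical) threshold."""
    return fraction >= threshold


if __name__ == "__main__":
    cache_dir = "/usr/vice/cache"
    path = cache_dir if os.path.isdir(cache_dir) else "."
    fraction = fill_fraction(path)
    print(f"{path}: {fraction:.1%} full, alert={should_alert(fraction)}")
```

Run from cron every few minutes, the output would give a timeline of how fast
the partition fills before a lockup.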
The zombie processes mount up fast, the load on the machine skyrockets, and
the only solution seems to be to reboot.
What could cause that lockup? It's usually only on one host at a time, and
it seems to "move" from host to host, once in a while even returning to the
same host on the same day after a reboot.
I doubled the cache size on these hosts, and that seemed to make the lockups
less frequent, but we had another one today. All the clients were restarted
on Sunday during a hardware upgrade on the SAN, so no host had been running
more than 3 days.
To me, it feels like maybe someone is forcing a huge file through and
running the machine out of cache. Though if that's so, I wonder why it only
just started happening after all these years. If nothing else, it seems like
something new is going on at the user end that's causing it.
Any help would be appreciated, whether that's a fix that limits something in
the OpenAFS client or the cache, or ideas as to what someone could be doing.
At this point, it's like a denial of service attack that's causing
lots of problems for us.
Thank you,
Chris Garrison
Indiana University Research Storage