[OpenAFS] OpenAFS client cache overrun?

Jonathan Billings jsbillin@umich.edu
Wed, 12 Mar 2014 11:29:27 -0400


You don't need to recompile to use 'crash'. You can install the debuginfo
kernel package, which is available in the -debuginfo channel for RHEL (you
might not be subscribed to it). It contains a kernel with debugging symbols.
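
As a rough sketch of what that looks like in practice: kernel-debuginfo
packages conventionally place the unstripped kernel under /usr/lib/debug, and
'crash' can be pointed at that file. The helper name debug_vmlinux_path and
the KERNEL_RELEASE override are made up for illustration, and the install
command in the comment assumes the debuginfo channel is already enabled.

```shell
#!/bin/sh
# Sketch only: locate the debuginfo vmlinux for a kernel release.
# debug_vmlinux_path is a hypothetical helper, not part of any package.
debug_vmlinux_path() {
    # $1 = kernel release string, e.g. the output of `uname -r`
    printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "$1"
}

# KERNEL_RELEASE is an illustrative override; default is the running kernel.
release=${KERNEL_RELEASE:-$(uname -r)}
vmlinux=$(debug_vmlinux_path "$release")

if [ -r "$vmlinux" ]; then
    # With no dump file given, crash inspects the live system.
    echo "run as root: crash $vmlinux"
else
    # Assumption: the debuginfo channel/repo is subscribed, so something
    # like `yum install kernel-debuginfo` can find a matching package.
    echo "missing $vmlinux; install the kernel-debuginfo package for $release"
fi
```

The debuginfo package has to match the running kernel version exactly, or
crash will refuse to load it.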


On Wed, Mar 12, 2014 at 11:20 AM, Eric Chris Garrison <ecgarris@iu.edu> wrote:

> A few things.
>
> 1 - The user claims they were merely storing the enormous .pst file, not
> accessing it from Outlook.
>
> 2 - The user claimed that any large file bigger than about 4GB would cause
> the lockup. We haven't been able to replicate it, but he crammed a few
> 10GB files through this morning and locked up one of our gateways as a
> demonstration. He has not made my day any brighter.
>
> Additional info: We were unable to reproduce this, but he mentioned that
> the test was conducted by copying from one AFS directory to another.
>
> Additional additional: If I didn't mention it before, this is all going
> over samba-on-OpenAFS. Yes, I know, users should be using the OpenAFS
> client rather than going through samba on a gateway. We have found it
> extremely difficult to get users to adopt this method, however, and have
> to try to make this work.
>
> 3 - I had enabled a 2GB cache bypass, and it seemed to have no effect
> whatsoever.
>
> 4 - I gathered what data I could. Looks like I can't use "crash" without a
> kernel recompile:
>
> This GDB was configured as "x86_64-unknown-linux-gnu"...(no debugging
> symbols found)...
>
> crash: /boot/vmlinuz-2.6.18-194.26.1.el5: no debugging data available
>
>
> cmdebug said this:
>
> [root@rgwb1 ~]# cmdebug localhost
> Lock afs_discon_lock status: (none_waiting, 21876 read_locks(pid:29278))
>
> [root@rgwb1 ~]# !ps
> ps -ef | grep 29278
> root     29278  4477  0 09:27 ?        00:00:00 smbd
> root     30101 29337  0 09:37 pts/3    00:00:00 grep 29278
>
> When I ran "top" I saw that the afs_cachetrim process was #1, but
> presumably wedged.
>
>
> I goosed /proc/sysrq-trigger and as promised, it dumped a lot of call
> trace info to the syslog. I'm looking through it, but am not sure what to
> look for. Nothing stands out, anyway.
>
> Chris
>
> On 3/7/14 3:51 PM, "Andrew Deason" <adeason@sinenomine.net> wrote:
> >Message: 4
> >To: openafs-info@openafs.org
> >From: Andrew Deason <adeason@sinenomine.net>
> >Date: Fri, 7 Mar 2014 15:51:23 -0600
> >Organization: Sine Nomine Associates
> >Subject: [OpenAFS] Re: OpenAFS client cache overrun?
> >
> >On Fri, 07 Mar 2014 13:51:06 -0500
> >Eric Chris Garrison <ecgarris@iu.edu> wrote:
> >
> >>I'll have to look for that message from Andrew to gather data if the
> >>problem crops up again.
> >
> >It's this message:
> >
> ><http://thread.gmane.org/gmane.comp.file-systems.openafs.general/34517/focus=34532>
> >
> >The easiest / most basic information to get is just the stack trace from
> >the daemon that is supposed to be trimming the cache back when it gets
> >full. That message contains the commands where you can get that
> >information via the 'crash' tool.
> >
> >Or, another way to get that information is by running this:
> >
> ># echo t > /proc/sysrq-trigger
> >
> >That will generate a ton of information to the kernel log, which you'd
> >need to sift through or give to someone else. But it's at least a lot
> >easier to set up and run.
> >
> >>Thanks also for the mention of AFS cache bypass, I think that may be a
> >>BIG help with this problem.
> >
> >'Cache bypass' I don't believe is considered the most stable of
> >features. It could indeed maybe help here, but I'd be looking out for
> >kernel panics.
> >
> >--
> >Andrew Deason
> >adeason@sinenomine.net
> >
>
>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
>
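
For sifting the sysrq-trigger output mentioned in the quoted messages, one
low-tech approach is to pull out each line that names the cache-trimming
task along with the lines that follow it, since the call trace is printed
right after the task's header line. This is only a sketch: show_task_trace
is a made-up helper name, and the log path in the usage line assumes syslog
writes kernel messages to /var/log/messages.

```shell
#!/bin/sh
# Sketch only: print every line mentioning a task, plus the lines after it.
# show_task_trace is a hypothetical helper built on plain grep -A.
show_task_trace() {
    # $1 = log file, $2 = task name, $3 = lines of trailing context (default 30)
    grep -A "${3:-30}" -- "$2" "$1"
}

# Example use (assumed log location):
#   show_task_trace /var/log/messages afs_cachetrim 40
```

The context count is a guess; raise it if the call trace looks cut off.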



-- 
Jonathan Billings <jsbillin@umich.edu>
College of Engineering - CAEN - Unix and Linux Support
