[OpenAFS] OpenAFS client cache overrun?
Jonathan Billings
jsbillin@umich.edu
Wed, 12 Mar 2014 11:29:27 -0400
You don't need to recompile to use 'crash'; you can use the debuginfo
kernel, which is available in the -debuginfo channel for RHEL (you might
not be subscribed to it). It provides a kernel with debugging symbols.
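
In the meantime, the sysrq-t dump you already captured can be filtered for the afs_cachetrim entry instead of reading the whole log by eye. Here is a rough sketch in Python. It assumes each task's entry begins with a header line of the form "task name, one-letter state, hex address (or 'running')", optionally preceded by a syslog prefix. That layout differs between kernel versions, so treat the pattern as a guess to adjust against your own log. The function name and the regular expression are my own illustration, not part of OpenAFS or the kernel.

```python
import re

# Assumed header shape for one task in "echo t > /proc/sysrq-trigger" output:
#   [syslog prefix] <task name> <state letter> <hex address | running> ...
# This is an assumption; check it against a few header lines in your log.
TASK_HEADER = re.compile(
    r"(?:^|\s)(\S+)\s+([RSDTZXW])\s+(?:[0-9a-fA-F]{8,16}|running)\s"
)


def extract_task_traces(log_lines, task_name):
    """Return one list of lines per entry whose task name equals task_name."""
    stanzas = []
    current = None
    for line in log_lines:
        match = TASK_HEADER.search(line)
        if match:
            # A new task entry starts here; collect it only if the name matches.
            if match.group(1) == task_name:
                current = []
                stanzas.append(current)
            else:
                current = None
        if current is not None:
            current.append(line.rstrip("\n"))
    return stanzas
```

To use it, pass the lines of the saved log and the task name, for example `extract_task_traces(open("/var/log/messages"), "afs_cachetrim")`, and print each returned entry. If the matching task is the last one in the dump, any unrelated kernel messages that follow it are included too, since the sketch only stops at the next task header.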
On Wed, Mar 12, 2014 at 11:20 AM, Eric Chris Garrison <ecgarris@iu.edu> wrote:
> A few things.
>
> 1 - The user claims they were merely storing the enormous .pst files, not
> accessing them from Outlook.
>
> 2 - The user claimed that any large file bigger than about 4GB would cause
> the lockup. We haven't been able to replicate it, but he crammed a few
> 10GB files through this morning and locked up one of our gateways as a
> demonstration. He has not made my day any brighter.
>
> Additional info: We were unable to reproduce this, but he mentioned that
> the test was conducted by copying from one AFS directory to another.
>
> Additional additional: If I didn't mention it before, this is all going
> over samba-on-OpenAFS. Yes, I know, users should be using the OpenAFS
> client rather than going through samba on a gateway. We have found it
> extremely difficult to get users to adopt this method, however, and have
> to try to make this work.
>
> 3 - I had enabled a 2GB cache bypass, and it seemed to have no effect
> whatsoever.
>
> 4 - I gathered what data I could. Looks like I can't use "crash" without a
> kernel recompile:
>
> This GDB was configured as "x86_64-unknown-linux-gnu"...(no debugging
> symbols found)...
>
> crash: /boot/vmlinuz-2.6.18-194.26.1.el5: no debugging data available
>
>
> cmdebug said this:
>
> [root@rgwb1 ~]# cmdebug localhost
> Lock afs_discon_lock status: (none_waiting, 21876 read_locks(pid:29278))
>
> [root@rgwb1 ~]# !ps
> ps -ef | grep 29278
> root 29278 4477 0 09:27 ? 00:00:00 smbd
> root 30101 29337 0 09:37 pts/3 00:00:00 grep 29278
>
> When I ran "top" I saw that the afs_cachetrim process was #1, but
> presumably wedged.
>
>
> I goosed /proc/sysrq-trigger and as promised, it dumped a lot of call
> trace info to the syslog. I'm looking through it, but am not sure what to
> look for. Nothing stands out, anyway.
>
> Chris
>
> On 3/7/14 3:51 PM, "Andrew Deason" <adeason@sinenomine.net> wrote:
> >Message: 4
> >To: openafs-info@openafs.org
> >From: Andrew Deason <adeason@sinenomine.net>
> >Date: Fri, 7 Mar 2014 15:51:23 -0600
> >Organization: Sine Nomine Associates
> >Subject: [OpenAFS] Re: OpenAFS client cache overrun?
> >
> >On Fri, 07 Mar 2014 13:51:06 -0500
> >Eric Chris Garrison <ecgarris@iu.edu> wrote:
> >
> >>I'll have to look for that message from Andrew to gather data if the
> >>problem crops up again.
> >
> >It's this message:
> >
> ><http://thread.gmane.org/gmane.comp.file-systems.openafs.general/34517/focus=34532>
> >
> >The easiest / most basic information to get is just the stack trace from
> >the daemon that is supposed to be trimming the cache back when it gets
> >full. That message contains the commands where you can get that
> >information via the 'crash' tool.
> >
> >Or, another way to get that information is by running this:
> >
> ># echo t > /proc/sysrq-trigger
> >
> >That will generate a ton of information to the kernel log, which you'd
> >need to sift through or give to someone else. But it's at least a lot
> >easier to set up and run.
> >
> >>Thanks also for the mention of AFS cache bypass, I think that may be a
> >>BIG help with this problem.
> >
> >'Cache bypass' I don't believe is considered the most stable of
> >features. It could indeed maybe help here, but I'd be looking out for
> >kernel panics.
> >
> >--
> >Andrew Deason
> >adeason@sinenomine.net
> >
>
>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
>
--
Jonathan Billings <jsbillin@umich.edu>
College of Engineering - CAEN - Unix and Linux Support