[OpenAFS] OpenAFS client cache overrun?

Wed, 12 Mar 2014 10:20:56 -0500

A few things.

1 - The user claims they were merely storing the enormous .pst files, not
accessing them from Outlook.

2 - The user claimed that any large file bigger than about 4GB would cause
the lockup. We haven't been able to replicate it, but he crammed a few
10GB files through this morning and locked up one of our gateways as a
demonstration. He has not made my day any brighter.

Additional info: We were unable to reproduce this, but he mentioned that
the test was conducted by copying from one AFS directory to another.

Additional additional: If I didn't mention it before, this is all going
over samba-on-OpenAFS. Yes, I know, users should be using the OpenAFS
client rather than going through samba on a gateway. We have found it
extremely difficult to get users to adopt this method, however, and have
to try to make this work.

3 - I had enabled a 2GB cache bypass, and it seemed to have no effect
whatsoever.

4 - I gathered what data I could. Looks like I can't use "crash" without a
kernel recompile:

This GDB was configured as "x86_64-unknown-linux-gnu"...(no debugging
symbols found)...

crash: /boot/vmlinuz-2.6.18-194.26.1.el5: no debugging data available

cmdebug said this:

[root@rgwb1 ~]# cmdebug localhost
Lock afs_discon_lock status: (none_waiting, 21876 read_locks(pid:29278))

[root@rgwb1 ~]# !ps
ps -ef | grep 29278
root     29278  4477  0 09:27 ?        00:00:00 smbd
root     30101 29337  0 09:37 pts/3    00:00:00 grep 29278
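
If the lockup recurs and the cmdebug output needs collecting repeatedly,
a small Python sketch along these lines could pull the lock name, count,
and pid out of a line shaped like the one above. It only handles that one
line shape; other cmdebug lock formats are not covered, and the names
LOCK_LINE and parse_read_lock are made up for illustration.

```python
import re

# Matches only the line shape shown in the cmdebug output above. Other
# cmdebug lock formats (write locks, waiters, etc.) are NOT handled.
LOCK_LINE = re.compile(
    r"^Lock (?P<name>\S+) status: "
    r"\((?P<waiting>\w+), (?P<count>\d+) read_locks\(pid:(?P<pid>\d+)\)\)$"
)


def parse_read_lock(line):
    """Return a dict describing a read-lock line, or None if no match."""
    m = LOCK_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "waiting": m.group("waiting"),
        "read_locks": int(m.group("count")),
        "pid": int(m.group("pid")),
    }
```

The returned pid can then be looked up with ps, as done by hand above.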

When I ran "top" I saw that the afs_cachetrim process was #1, but
presumably wedged.

I goosed /proc/sysrq-trigger and as promised, it dumped a lot of call
trace info to the syslog. I'm looking through it, but am not sure what to
look for. Nothing stands out, anyway.
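
For sifting the sysrq output, a Python sketch like the following might
help pull out just the block for one task, such as afs_cachetrim. It
assumes that each task's dump starts with a non-indented line beginning
with the task name, that the stack lines after it are indented or begin
with "Call Trace", and that syslog prefixes every line with text ending
in "kernel: ". Those format details are assumptions and may differ
between kernels and syslog setups; extract_task_trace and strip_prefix
are made-up helper names.

```python
import re

# ASSUMPTION: syslog lines look like "<date> <host> kernel: <message>".
# Adjust this pattern to the local syslog format.
SYSLOG_PREFIX = re.compile(r"^.*? kernel: ")


def strip_prefix(line):
    """Drop the assumed syslog prefix, keeping the kernel message."""
    return SYSLOG_PREFIX.sub("", line.rstrip("\n"), count=1)


def extract_task_trace(lines, task_name):
    """Return a list of blocks (lists of lines) dumped for task_name."""
    blocks = []
    current = None
    for raw in lines:
        msg = strip_prefix(raw)
        # ASSUMPTION: lines belonging to a task's dump are blank,
        # indented, or start with "Call Trace".
        is_continuation = (
            msg == "" or msg[0].isspace() or msg.startswith("Call Trace")
        )
        if current is not None and is_continuation:
            current.append(msg)
            continue
        # A non-indented line ends any block in progress.
        if current is not None:
            blocks.append(current)
            current = None
        # Start a new block if the first word is the task name.
        if msg.split(None, 1)[:1] == [task_name]:
            current = [msg]
    if current is not None:
        blocks.append(current)
    return blocks
```

Passing it the open log file (wherever syslog writes kernel messages,
commonly /var/log/messages on RHEL-style systems) and the name
"afs_cachetrim" should return only that daemon's entries, which is the
stack trace Andrew asked for.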

Chris

On 3/7/14 3:51 PM, "Andrew Deason" <adeason@sinenomine.net> wrote:
>Message: 4
>To: openafs-info@openafs.org
>From: Andrew Deason <adeason@sinenomine.net>
>Date: Fri, 7 Mar 2014 15:51:23 -0600
>Organization: Sine Nomine Associates
>Subject: [OpenAFS] Re: OpenAFS client cache overrun?
>
>On Fri, 07 Mar 2014 13:51:06 -0500
>Eric Chris Garrison <ecgarris@iu.edu> wrote:
>
>>I'll have to look for that message from Andrew to gather data if the
>>problem crops up again.
>
>It's this message:
>
><http://thread.gmane.org/gmane.comp.file-systems.openafs.general/34517/focus=34532>
>
>The easiest / most basic information to get is just the stack trace from
>the daemon that is supposed to be trimming the cache back when it gets
>full. That message contains the commands where you can get that
>information via the 'crash' tool.
>
>Or, another way to get that information is by running this:
>
># echo t > /proc/sysrq-trigger
>
>That will generate a ton of information to the kernel log, which you'd
>need to sift through or give to someone else. But it's at least a lot
>easier to set up and run.
>
>>Thanks also for the mention of AFS cache bypass, I think that may be a
>>BIG help with this problem.
>
>'Cache bypass' I don't believe is considered the most stable of
>features. It could indeed maybe help here, but I'd be looking out for
>kernel panics.
>
>--
>Andrew Deason
>adeason@sinenomine.net
>