[OpenAFS] Re: OpenAFS client cache overrun?

Fri, 14 Mar 2014 16:35:44 -0400

Hello again!

On 3/13/14 11:01 AM, "openafs-info-request@openafs.org"
<openafs-info-request@openafs.org> wrote:

>Message: 1
>To: openafs-info@openafs.org
>From: Andrew Deason <adeason@sinenomine.net>
>Date: Wed, 12 Mar 2014 11:04:03 -0500
>Organization: Sine Nomine Associates
>Subject: [OpenAFS] Re: OpenAFS client cache overrun?
>
>On Wed, 12 Mar 2014 10:20:56 -0500
>Eric Chris Garrison <ecgarris@iu.edu> wrote:
>
>>3 - I had enabled a 2GB cache bypass, and it seemed to have no effect
>>whatsoever.
>
>"cache bypass" doesn't do anything for writes, only for read operations.
>That probably wasn't clear, but I didn't know before if this was just
>something stuffing data into afs or reading/writing stuff, or what.

Yeah, we didn't know either until the user clarified. Too bad the bypass
doesn't help with writes.

>
>>cmdebug said this:
>>[root@rgwb1 ~]# cmdebug localhost
>>Lock afs_discon_lock status: (none_waiting, 21876 read_locks(pid:29278))
>
>To be clear, this just ran and then exited on its own, right? You didn't
>ctrl-C it or anything.

Yes, it exited on its own after a long time.

>
>>[root@rgwb1 ~]# !ps
>>ps -ef | grep 29278
>>root     29278  4477  0 09:27 ?        00:00:00 smbd
>>root     30101 29337  0 09:37 pts/3    00:00:00 grep 29278
>>When I ran "top" I saw that the afs_cachetrim process was #1, but
>>presumably wedged.
>>I goosed /proc/sysrq-trigger and as promised, it dumped a lot of call
>>trace info to the syslog. I'm looking through it, but am not sure what to
>>look for. Nothing stands out, anyway.
>
>You're looking for the stack trace for the afs_cachetrim process. Look
>in syslog for "afs_cachetrim", or its pid. Under that should be a trace
>of functions that indicates where we are in the code at that time.
>
>I would extract that, and the entry for a hanging process. So, maybe
>29278, or if anything hangs when touching anything in /afs, you could
>get the entry for that.

Oddly, there's nothing for afs_cachetrim.

Mar 13 10:16:59 rgwb1 smbd[29278]: [2014/03/13 10:16:59.762359,  1] smbd/service.c:1084(make_connection_snum)
Mar 13 10:16:59 rgwb1 smbd[29278]:   XXXXX (::ffff:XXX.XXX.XXX.XXX) connect to service projects initially as user XXXXXX (uid=349570, gid=100) (pid 29278)
Mar 13 10:17:11 rgwb1 smbd[29278]: [2014/03/13 10:17:11.703003,  1] smbd/service.c:1265(close_cnum)
Mar 13 10:17:11 rgwb1 smbd[29278]:   XXXXX (::ffff:XXX.XXX.XXX.XXX) closed connection to service projects

Mar 13 10:17:47 rgwb1 smbd[29278]: [2014/03/13 10:17:47.708467,  0] lib/util_sock.c:474(read_fd_with_timeout)
Mar 13 10:17:47 rgwb1 smbd[29278]: [2014/03/13 10:17:47.708545,  0] lib/util_sock.c:1441(get_peer_addr_internal)
Mar 13 10:17:47 rgwb1 smbd[29278]:   getpeername failed. Error was Transport endpoint is not connected
Mar 13 10:17:47 rgwb1 smbd[29278]:   read_fd_with_timeout: client 0.0.0.0 read error = Connection reset by peer.
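
Searching for the pid only turned up smbd's own log lines, as above. For
anyone repeating this, here is a rough Python sketch of how I would narrow
the search to kernel lines. The function name and the max_lines cap are my
own inventions, and it assumes the sysrq task dump lands in syslog as
consecutive " kernel: " lines starting with one that names the task or pid.
That may not match every kernel or syslog setup, and a block may run on into
the next task's entry.

```python
import re


def extract_task_blocks(syslog_text, needle, max_lines=40):
    """Collect runs of kernel log lines that start at a line mentioning `needle`.

    `needle` is a task name (e.g. "afs_cachetrim") or a pid given as a string.
    Only lines tagged " kernel: " are considered, so a daemon's own log lines
    that merely carry the same pid are skipped.
    """
    # Match the needle as a whole token, so "2927" does not match "29278".
    pattern = re.compile(r"(?<!\w)" + re.escape(needle) + r"(?!\w)")
    lines = syslog_text.splitlines()
    blocks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if " kernel: " in line and pattern.search(line):
            block = [line]
            j = i + 1
            # Assumption: the trace follows immediately as kernel lines.
            while (j < len(lines) and len(block) < max_lines
                   and " kernel: " in lines[j]):
                block.append(lines[j])
                j += 1
            blocks.append(block)
            i = j
        else:
            i += 1
    return blocks
```

Calling extract_task_blocks(open("/var/log/messages").read(), "afs_cachetrim")
would return a list of line blocks. An empty list means no kernel line named
the task, which matches what I saw here.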

>
>Or if you want to try to find "everything", just look for anything
>containing the string "afs".

I get just this kind of message during the last lockup:

Mar 13 10:17:32 rgwb1 kernel: afs: byte-range locks only enforced for processes on this machine (pid 15613 (smbd), user 673104).
Mar 13 10:19:32 rgwb1 kernel: afs: byte-range locks only enforced for processes on this machine (pid 15613 (smbd), user 673104).

But we get that other times too.
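
To check whether a message like that is specific to the lockup or just
background noise, something along these lines could tally the kernel "afs:"
messages in a log. This is only a sketch: count_afs_messages is a made-up
name, and the regexes assume the message layout shown above, with a trailing
"(pid ...)" detail that gets stripped so repeats from different processes
group together.

```python
import re
from collections import Counter

# Assumed layout: "<timestamp> <host> kernel: afs: <message>"
AFS_LINE = re.compile(r" kernel: afs: (?P<msg>.*)$")


def count_afs_messages(syslog_lines):
    """Count kernel 'afs:' messages, ignoring the per-process '(pid ...)' tail."""
    counts = Counter()
    for line in syslog_lines:
        match = AFS_LINE.search(line.rstrip("\n"))
        if match:
            # Strip a trailing "(pid ...)." so identical warnings group together.
            msg = re.sub(r"\s*\(pid .*\)\.?$", "", match.group("msg"))
            counts[msg] += 1
    return counts
```

Running it once over the lockup window and once over a quiet period, then
comparing the two counters, would show whether any afs message appears only
during the hang.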

>
>If you ever don't want to leave the system hanging while you examine it,
>but you want to capture information you can examine later, you can
>generate a core dump. If your system is set up to capture a core on crash
>(I'm not sure if this is the default... look at RHEL documentation, it
>should be something mentioning kdump or kexec), you can crash the system
>and you'll get a vmcore afterwards. To do this, send a 'c' to
>/proc/sysrq-trigger. That will of course crash the system and cause it
>to reboot, so don't do that if that's not what you want to happen.

Noted, will have to see what the defaults are for these systems.
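
For my own notes, a small Python sketch of the check I have in mind before
sending 'c' to sysrq. It assumes the kernel exposes
/sys/kernel/kexec_crash_loaded, which reads "1" when a crash kernel has been
loaded, and that I'm running as root. The function names are hypothetical,
and I have not verified this on our RHEL boxes yet.

```python
from pathlib import Path


def crash_kernel_loaded(path="/sys/kernel/kexec_crash_loaded"):
    """Return True if the kernel reports a loaded crash (kdump) kernel."""
    try:
        return Path(path).read_text().strip() == "1"
    except OSError:
        # File missing or unreadable: treat as "not configured".
        return False


def sysrq_crash(trigger="/proc/sysrq-trigger",
                loaded_path="/sys/kernel/kexec_crash_loaded"):
    """Crash the machine via sysrq 'c', but only if a crash kernel is loaded.

    WARNING: on a real system this panics and reboots the host immediately.
    """
    if not crash_kernel_loaded(loaded_path):
        raise RuntimeError("no crash kernel loaded; a vmcore would not be saved")
    Path(trigger).write_text("c")
```

The guard is there so the box doesn't get crashed for nothing when kdump
isn't set up. Both paths are parameters so the logic can be exercised against
scratch files first.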

Thanks,

Chris