[OpenAFS] openafs crash on linux: "Increase -stat parameter of afsd(VLRU cycle?)"

Christopher Allen Wing wingc@engin.umich.edu
Wed, 11 Dec 2002 11:49:11 -0500 (EST)


Hello,

On several heavily used servers around here, we have been getting kernel
crashes in the openafs kernel module.

The servers are running more or less vanilla Red Hat Linux 7.2, with Red
Hat's 2.4.9-34 kernel. (I will probably upgrade the kernel once openafs
works out of the box with kernels that no longer export sys_call_table, or
once I patch it by hand.)

We originally saw crashes with openafs-1.2.6 that would almost completely
kill the machine. All of user space died, although the machine would still
respond to pings and I could still reboot it (with "magic sysrq" on a
serial console). The oops was dumped to the console.


Here's a condensed Linux oops (openafs-1.2.6):

Unable to handle kernel paging request at virtual address ffffffe0
EIP is at afs_FlushVCBs [libafs-2.4.9-34-i686.mp] 0x70b
Call Trace: [<f89fb8b6>] afs_NewVCache [libafs-2.4.9-34-i686.mp] 0xfe
[<f89fd452>] afs_GetVCache [libafs-2.4.9-34-i686.mp] 0x122
[<f89f0ae4>] afs_dir_GetBlob [libafs-2.4.9-34-i686.mp] 0x18
[<f89f0c16>] DirHash [libafs-2.4.9-34-i686.mp] 0x116
[<f89f08d3>] afs_dir_LookupOffset [libafs-2.4.9-34-i686.mp] 0x6f
[<f8a06fb4>] afs_lookup [libafs-2.4.9-34-i686.mp] 0xa24
[<f8a2f732>] vcache2inode [libafs-2.4.9-34-i686.mp] 0x22
[<f89ffe56>] afs_AccessOK [libafs-2.4.9-34-i686.mp] 0x3a
[<f8a000f2>] afs_access [libafs-2.4.9-34-i686.mp] 0x186
[<f8a31408>] afs_linux_lookup [libafs-2.4.9-34-i686.mp] 0x68
[<c0151d99>] d_alloc [kernel] 0x19
[<c0148703>] real_lookup [kernel] 0x73
[<c0148f6e>] path_walk [kernel] 0x68e
[<f8852305>] __insmod_ext3_S.text_L45168 [ext3] 0x62a5
[<c01497fa>] __user_walk [kernel] 0x3a
[<c0145a14>] vfs_stat [kernel] 0x14
[<c0145fb1>] sys_stat64 [kernel] 0x11
[<c0145ea6>] sys_readlink [kernel] 0x76
[<c0145eb3>] sys_readlink [kernel] 0x83
[<c01176b0>] do_page_fault [kernel] 0x0
[<c01071ab>] system_call [kernel] 0x33


At other times it died in a similar way:

Unable to handle kernel NULL pointer dereference at virtual address 00000004
afs_FlushVCBs -> afs_global_lock -> osi_VM_FlushVCache -> dput (death)




Eventually, I upgraded one machine to openafs-1.2.7 as a test. Now this
machine doesn't die completely, but AFS just locks up and eventually takes
the machine down as every process with a reference to /afs goes into
uninterruptible sleep.

Because the machine stays up for a while, the oops now gets written to the
syslog and not just the console:


(condensed oops with openafs-1.2.7)

Increase -stat parameter of afsd(VLRU cycle?)
Unable to handle kernel paging request at virtual address ffffffff
EIP is at osi_Panic [libafs-2.4.9-34-i686.mp] 0x28
Call Trace: [<f89bcb95>] afs_NewVCache [libafs-2.4.9-34-i686.mp] 0xd9
[<f8a06d00>] __insmod_libafs-2.4.9-34-i686.mp_S.rodata_L2232 [libafs-2.4.9-34-i686.mp] 0x2420
[<f89be78a>] afs_GetVCache [libafs-2.4.9-34-i686.mp] 0x122
[<f89b1c5c>] afs_dir_GetBlob [libafs-2.4.9-34-i686.mp] 0x18
[<f89b1d1d>] DirHash [libafs-2.4.9-34-i686.mp] 0xa5
[<f89cf506>] afs_GetVolume [libafs-2.4.9-34-i686.mp] 0x1a
[<f89b1a4b>] afs_dir_LookupOffset [libafs-2.4.9-34-i686.mp] 0x6f
[<f89c8ad3>] afs_lookup [libafs-2.4.9-34-i686.mp] 0xa93
[<f89f404d>] afs_linux_dir_read [libafs-2.4.9-34-i686.mp] 0x495
[<f89f4494>] afs_linux_lookup [libafs-2.4.9-34-i686.mp] 0x68
[d_alloc+25/384] d_alloc [kernel] 0x19
[real_lookup+115/272] real_lookup [kernel] 0x73
[path_walk+1678/2352] path_walk [kernel] 0x68e
[open_namei+140/1696] open_namei [kernel] 0x8c
[vfs_stat+20/80] vfs_stat [kernel] 0x14
[filp_open+54/96] filp_open [kernel] 0x36
[getname+94/160] getname [kernel] 0x5e
[sys_open+54/224] sys_open [kernel] 0x36
[system_call+51/56] system_call [kernel] 0x33




I've captured several similar oopses, which I won't waste space
duplicating here. All of them, however, seem to die in the same piece of
AFS code: afs_lookup leading to afs_*VCache, and then the crash.
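
For anyone who wants to check that kind of claim against their own oopses,
here is a minimal sketch in Python that pulls the libafs frames out of a
condensed oops and lists the frames two oopses have in common. It assumes
every frame has the shape "symbol [module] 0xoffset", as in the traces
quoted above. The names afs_frames, common_frames, OOPS_126 and OOPS_127
are made up for this illustration, and the two samples are short excerpts
of the traces above, not the full oopses.

```python
import re

# Matches "symbol [module] 0xoffset". This is an assumption about the
# condensed oops format, based only on the traces quoted in this post.
FRAME_RE = re.compile(r"(\S+)\s+\[([^\]]+)\]\s+0x[0-9a-fA-F]+")


def afs_frames(oops_text):
    """Return the libafs symbols in the order they appear in the oops."""
    frames = []
    for line in oops_text.splitlines():
        for symbol, module in FRAME_RE.findall(line):
            # Keep only frames from the AFS module, and skip the
            # __insmod_* section markers, which are not real functions.
            if module.startswith("libafs") and not symbol.startswith("__insmod_"):
                frames.append(symbol)
    return frames


def common_frames(oops_a, oops_b):
    """Return the frames of oops_a that also appear in oops_b."""
    seen = set(afs_frames(oops_b))
    return [sym for sym in afs_frames(oops_a) if sym in seen]


# Excerpt of the openafs-1.2.6 oops.
OOPS_126 = """\
EIP is at afs_FlushVCBs [libafs-2.4.9-34-i686.mp] 0x70b
Call Trace: [<f89fb8b6>] afs_NewVCache [libafs-2.4.9-34-i686.mp] 0xfe
[<f89fd452>] afs_GetVCache [libafs-2.4.9-34-i686.mp] 0x122
[<f8a06fb4>] afs_lookup [libafs-2.4.9-34-i686.mp] 0xa24
[<c0151d99>] d_alloc [kernel] 0x19
"""

# Excerpt of the openafs-1.2.7 oops.
OOPS_127 = """\
EIP is at osi_Panic [libafs-2.4.9-34-i686.mp] 0x28
Call Trace: [<f89bcb95>] afs_NewVCache [libafs-2.4.9-34-i686.mp] 0xd9
[<f8a06d00>] __insmod_libafs-2.4.9-34-i686.mp_S.rodata_L2232 [libafs-2.4.9-34-i686.mp] 0x2420
[<f89be78a>] afs_GetVCache [libafs-2.4.9-34-i686.mp] 0x122
[<f89c8ad3>] afs_lookup [libafs-2.4.9-34-i686.mp] 0xa93
[d_alloc+25/384] d_alloc [kernel] 0x19
"""

if __name__ == "__main__":
    print(common_frames(OOPS_126, OOPS_127))
```

The comparison is by symbol name only and ignores offsets, since the
offsets differ between openafs releases even when the code path is the
same.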


At first I tried switching from an ext3 to an ext2 cache partition,
without any improvement. However, after trying openafs-1.2.7 and getting
the message "Increase -stat parameter of afsd", I am writing to ask if
this is just a configuration error.


Can someone advise me on how to choose -stat? Currently I am using the
Linux RPM package default of MEDIUM:

	MEDIUM="-stat 2000 -dcache 800 -daemons 3 -volumes 70"


The machines are interactive login servers with a large number of
simultaneous users (up to 200 logins, 1000+ processes, etc.)


Would you expect increasing -stat to fix this problem or do you think
there's an underlying bug here?


Thanks,

Chris Wing
wingc@engin.umich.edu