[OpenAFS-devel] Re: BUG: unable to handle kernel NULL pointer

Markus Suvanto markus.suvanto@gmail.com
Wed, 22 May 2013 22:57:44 +0300


Hello

I have not yet managed to reproduce this problem using a KVM virtual
machine with a btrfs file cache. But today my real hardware hung.
Unfortunately my kernel is compiled without CONFIG_KALLSYMS, so the
trace below is useless (this may be a different problem from the tmpfs
case, and not even related to openafs). There is, however, an
"INFO: rcu_sched self-detected stall on CPU" message, and if I remember
correctly I had the same problem last time, and it was fixed in commit
5c52f277108e2b97a14aba6539849f71bd554cd0.

My current configuration is
kernel: 3.9.2
openafs: 1.6.3pre2 (using memcache)
gcc (Gentoo 4.6.3 p1.13, pie-0.5.2) 4.6.3

I will recompile the kernel with CONFIG_KALLSYMS and try to get some
traces (it may take days to hang), and send them here if there is
anything openafs related.

-Markus

May 22 22:11:04 z600 kernel: retire_capture_urb: 2492 callbacks suppressed
May 22 22:11:07 z600 kernel: INFO: rcu_sched self-detected stall on CPU { 4}  (t=18000 jiffies g=19179142 c=19179141 q=206281)
May 22 22:11:07 z600 kernel: Pid: 5532, comm: gdu-notificatio Tainted: P      D    O 3.9.2 #5
May 22 22:11:07 z600 kernel: Call Trace:
May 22 22:11:07 z600 kernel: <IRQ>  [<ffffffff8109bc6b>] ? 0xffffffff8109bc6b
May 22 22:11:07 z600 kernel: [<ffffffff8104326f>] ? 0xffffffff8104326f
May 22 22:11:07 z600 kernel: [<ffffffff8107c076>] ? 0xffffffff8107c076
May 22 22:11:07 z600 kernel: [<ffffffff81056399>] ? 0xffffffff81056399
May 22 22:11:07 z600 kernel: [<ffffffff81056bd4>] ? 0xffffffff81056bd4
May 22 22:11:07 z600 kernel: [<ffffffff81022c83>] ? 0xffffffff81022c83
May 22 22:11:07 z600 kernel: [<ffffffff8132138a>] ? 0xffffffff8132138a
May 22 22:11:07 z600 kernel: <EOI>  [<ffffffff8105abef>] ? 0xffffffff8105abef
May 22 22:11:07 z600 kernel: [<ffffffff8131fc8d>] ? 0xffffffff8131fc8d
May 22 22:11:07 z600 kernel: [<ffffffff81131cdd>] ? 0xffffffff81131cdd
May 22 22:11:07 z600 kernel: [<ffffffff811324dc>] ? 0xffffffff811324dc
May 22 22:11:07 z600 kernel: [<ffffffff8113284c>] ? 0xffffffff8113284c
May 22 22:11:07 z600 kernel: [<ffffffff81131af9>] ? 0xffffffff81131af9
May 22 22:11:07 z600 kernel: [<ffffffff81133a32>] ? 0xffffffff81133a32
May 22 22:11:07 z600 kernel: [<ffffffff810f9673>] ? 0xffffffff810f9673
May 22 22:11:07 z600 kernel: [<ffffffff810504c4>] ? 0xffffffff810504c4
May 22 22:11:07 z600 kernel: [<ffffffff8103a5d2>] ? 0xffffffff8103a5d2
May 22 22:11:07 z600 kernel: [<ffffffff810f82b6>] ? 0xffffffff810f82b6
May 22 22:11:07 z600 kernel: [<ffffffff8103aa61>] ? 0xffffffff8103aa61
May 22 22:11:07 z600 kernel: [<ffffffff8103aad2>] ? 0xffffffff8103aad2
May 22 22:11:07 z600 kernel: [<ffffffff81320852>] ? 0xffffffff81320852
May 22 22:11:09 z600 kernel: retire_capture_urb: 2493 callbacks suppressed

2013/5/21 Marc Dionne <marc.c.dionne@gmail.com>:
> On Mon, May 20, 2013 at 4:41 PM, Markus Suvanto
> <markus.suvanto@gmail.com> wrote:
>> May 20 23:37:04 kvm1 kernel: RIP: 0010:[<0000000000000000>]  [<
>>   (null)>]           (null)
>> May 20 23:37:04 kvm1 kernel: RSP: 0018:ffff88011623ba50  EFLAGS: 00010246
>> May 20 23:37:04 kvm1 kernel: RAX: ffffffff813401c0 RBX:
>> ffff88011623bad8 RCX: 000000000001461e
>> May 20 23:37:04 kvm1 kernel: RDX: 0000000000014681 RSI:
>> ffffea0004375840 RDI: 0000000000000000
>> May 20 23:37:04 kvm1 kernel: RBP: ffffea0004375880 R08:
>> ffffea0004375880 R09: 0000000000013746
>> May 20 23:37:04 kvm1 kernel: R10: 000000000000095e R11:
>> 0000000000000000 R12: ffffea0004375840
>> May 20 23:37:04 kvm1 kernel: R13: ffff88011716f078 R14:
>> 0000000000000001 R15: 0000000000000000
>> May 20 23:37:04 kvm1 kernel: FS:  00007f4404ccc700(0000)
>> GS:ffff88011fc80000(0000) knlGS:0000000000000000
>> May 20 23:37:04 kvm1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> May 20 23:37:04 kvm1 kernel: CR2: 0000000000000000 CR3:
>> 00000001191a4000 CR4: 00000000000007e0
>> May 20 23:37:04 kvm1 kernel: DR0: 0000000000000000 DR1:
>> 0000000000000000 DR2: 0000000000000000
>> May 20 23:37:04 kvm1 kernel: DR3: 0000000000000000 DR6:
>> 00000000ffff0ff0 DR7: 0000000000000400
>> May 20 23:37:04 kvm1 kernel: Process ld (pid: 1599, threadinfo
>> ffff88011623a000, task ffff88011215d280)
>> May 20 23:37:04 kvm1 kernel: Stack:
>> May 20 23:37:04 kvm1 kernel:  ffffffffa0807a32 ffff880119a01300
>> 0000000018608b40 ffffc90001478800
>> May 20 23:37:04 kvm1 kernel:  ffffea0004375880 0000000000001000
>> ffff880118608b40 ffffc90001478800
>> May 20 23:37:04 kvm1 kernel:  ffff880119a01300 0000000000000001
>> ffffffffa08082c8 ffff88011fff8c00
>> May 20 23:37:04 kvm1 kernel: Call Trace:
>> May 20 23:37:04 kvm1 kernel:  [<ffffffffa0807a32>] ?
>> afs_linux_read_cache.isra.20+0x182/0x340 [libafs]
>> May 20 23:37:04 kvm1 kernel:  [<ffffffffa08082c8>] ?
>> afs_linux_fillpage+0x6d8/0x9b0 [libafs]
>> May 20 23:37:04 kvm1 kernel:  [<ffffffffa0796d95>] ?
>> afs_InitReq+0x85/0xf0 [libafs]
>> May 20 23:37:04 kvm1 kernel:  [<ffffffffa0809472>] ?
>> afs_linux_readpage+0x192/0x490 [libafs]
>> May 20 23:37:04 kvm1 kernel:  [<ffffffff810b1db4>] ?
>> add_to_page_cache_locked+0x84/0xe0
>> May 20 23:37:04 kvm1 kernel:  [<ffffffff810b2f64>] ?
>> generic_file_aio_read+0x204/0x6e0
>> May 20 23:37:04 kvm1 kernel:  [<ffffffffa0804fa2>] ?
>> afs_linux_aio_read+0xf2/0x290 [libafs]
>> May 20 23:37:04 kvm1 kernel:  [<ffffffff810d2e6e>] ? handle_pte_fault+0xae/0x9a0
>> May 20 23:37:04 kvm1 kernel:  [<ffffffff81100494>] ? do_sync_read+0x94/0xd0
>> May 20 23:37:04 kvm1 kernel:  [<ffffffff81100c9d>] ? vfs_read+0x16d/0x190
>> May 20 23:37:04 kvm1 kernel:  [<ffffffff81100e70>] ? sys_read+0x50/0xa0
>> May 20 23:37:04 kvm1 kernel:  [<ffffffff8132dea9>] ?
>> system_call_fastpath+0x16/0x1b
>> May 20 23:37:04 kvm1 kernel: Code:  Bad RIP value.
>> May 20 23:37:04 kvm1 kernel:  RSP <ffff88011623ba50>
>> May 20 23:37:04 kvm1 kernel: CR2: 0000000000000000
>> May 20 23:37:04 kvm1 kernel: ---[ end trace cd4f184e9e79af88 ]---
>>
>> -Markus
>
> Thanks for testing - for tmpfs, this is because the readpage()
> function is no longer implemented by tmpfs as of kernel 3.1, and some
> of the optimizations rely on this being available.  I have a
> workaround to detect this and bypass the affected code, but we'll have
> to see if it's something that people think is suitable for 1.6.3.
>
> I'll also see if I can test the btrfs case - not something I have
> tested recently.
>
> Marc
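
The RIP of 0000000000000000 and "Bad RIP value" in the quoted oops are
consistent with a call through a NULL function pointer, which fits
Marc's explanation that tmpfs no longer provides readpage(). The
following is only a minimal userspace sketch of that kind of guard, not
the actual OpenAFS workaround. All names in it (struct cache_ops,
struct page_buf, disk_readpage, fallback_read, read_cache_page) are
hypothetical stand-ins, not kernel or libafs identifiers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a page; not the kernel's struct page. */
struct page_buf {
    char data[16];
};

/* Hypothetical stand-in for a filesystem operations table.
 * The readpage slot may be NULL, as described for tmpfs. */
struct cache_ops {
    int (*readpage)(struct page_buf *page);
};

/* Example backend that does provide readpage. */
static int disk_readpage(struct page_buf *page)
{
    strcpy(page->data, "from-readpage");
    return 0;
}

/* Slower generic path, used when the fast path is unavailable. */
static int fallback_read(struct page_buf *page)
{
    strcpy(page->data, "from-fallback");
    return 0;
}

/* Check the function pointer before calling it. Calling a NULL
 * readpage directly would jump to address 0. */
int read_cache_page(const struct cache_ops *ops, struct page_buf *page)
{
    if (ops == NULL || ops->readpage == NULL)
        return fallback_read(page);
    return ops->readpage(page);
}
```

With an operations table whose readpage slot is NULL, read_cache_page()
takes the fallback path instead of dereferencing the pointer. With a
table that sets readpage, the fast path is used as before.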