[OpenAFS-devel] reliable crashing using -memcache

Stefaan stefaan.deroeck@gmail.com
Wed, 24 Aug 2005 13:09:45 +0200


Hi!

I have a 2.6.12-gentoo-r4 kernel on a single-CPU P4, with SMP (HT)
enabled and preemption disabled.  I'm running OpenAFS 1.3.87.

I start "afsd" with the parameters -memcache -chunksize 14 -afsdb
-dynroot, and I have the following /etc/openafs/cacheinfo:

  /afs:/usr/vice/cache:500000

(With a cache size of 50000 the problem doesn't occur, or at least not
as easily: I have seen errors with smaller cache sizes, but they may
well have been caused by something else.)

The console displays
"afsd: All AFS daemons started."
and then afsd waits forever.  Very shortly after that, I get a kernel
oops.  The machine doesn't hang, however.
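
For reference, here is a rough sketch of how many cache chunks the two
configurations imply. It assumes, based on the afsd documentation rather
than anything in this report, that -chunksize N means 2**N bytes per chunk
and that the cacheinfo size is in 1 KiB blocks. The function name is made
up for illustration, and the number of entries afsd really allocates in
-memcache mode may differ.

```python
# Assumptions (from the afsd documentation, not from this report):
#   -chunksize N  -> each chunk is 2**N bytes
#   cacheinfo     -> third field is the cache size in 1 KiB blocks
CHUNKSIZE_EXP = 14  # value passed to afsd in this report


def chunk_count(cache_kib: int, chunksize_exp: int = CHUNKSIZE_EXP) -> int:
    """Estimate how many chunks fit in a cache of cache_kib KiB."""
    chunk_bytes = 1 << chunksize_exp
    return (cache_kib * 1024) // chunk_bytes


if __name__ == "__main__":
    # 500000 is the failing configuration, 50000 the mostly working one.
    for cache_kib in (500000, 50000):
        n = chunk_count(cache_kib)
        print(f"{cache_kib} KiB cache -> {n} chunks ({n:#x})")
```

Under these assumptions the failing setup has 31250 (0x7a12) chunks, which
is the value of ecx in the register dump of the oops in this message. That
may be a coincidence, but it hints that the faulting code was iterating
over cache entries.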
In ps auxwf I find:

root     12829  0.0  0.0   2000   868 tty3     D+   12:56   0:00         \_ /usr/sbin/afsd -memcache -chunksize 14 -afsdb -dynroot
root     12833  0.0  0.0      0     0 tty3     Z<+  12:56   0:00             \_ [afsd] <defunct>
root     12834  0.0  0.0      0     0 tty3     Z+   12:56   0:00             \_ [afsd] <defunct>
root     12837  0.0  0.0      0     0 tty3     Z<+  12:56   0:00             \_ [afsd] <defunct>
root     12839  0.0  0.0      0     0 tty3     Z+   12:56   0:00             \_ [afsd] <defunct>
root     12842  0.0  0.0      0     0 tty3     Z+   12:56   0:00             \_ [afsd] <defunct>
root     12844  0.0  0.0   1996   860 tty3     D+   12:56   0:00             \_ /usr/sbin/afsd -memcache -chunksize 14 -afsdb -dynroot
root     12846  0.0  0.0      0     0 tty3     Z+   12:56   0:00             \_ [afsd] <defunct>
root     12848  0.0  0.0   1996   860 tty3     D+   12:56   0:00             \_ /usr/sbin/afsd -memcache -chunksize 14 -afsdb -dynroot
root     12850  0.0  0.0      0     0 tty3     Z+   12:56   0:00             \_ [afsd] <defunct>

and also:


root     12835  0.0  0.0      0     0 ?        S    12:56   0:00 [afs_rxlistener]
root     12836  0.0  0.0      0     0 ?        S    12:56   0:00 [afs_callback]
root     12838  0.0  0.0      0     0 ?        D    12:56   0:00 [afs_rxevent]
root     12840  0.0  0.0   1996   860 ?        Ss   12:56   0:00 /usr/sbin/afsd -memcache -chunksize 14 -afsdb -dynroot
root     12843  0.0  0.0      0     0 ?        D    12:56   0:00 [afsd]
root     12845  0.0  0.0      0     0 ?        D    12:56   0:00 [afs_checkserver]
root     12847  0.0  0.0      0     0 ?        S    12:56   0:00 [afs_background]
root     12849  0.0  0.0      0     0 ?        S    12:56   0:00 [afs_background]

The oops looks like this: (dmesg | ksymoops)
ksymoops 2.4.11 on i686 2.6.12-gentoo-r4.
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.6.12-gentoo-r4/ (default)
     -m /boot/kernel-2.6.12-gentoo-r4/System.map (specified)

Error (regular_file): read_ksyms stat /proc/ksyms failed
ksymoops: No such file or directory
No modules in ksyms, skipping objects
No ksyms, skipping lsmod
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
Machine check exception polling timer started.
SGI XFS with large block numbers, no debug enabled
ehci_hcd 0000:00:1d.7: debug port 1
Unable to handle kernel NULL pointer dereference at virtual address 00000147
f9b58c77
*pde = 00000000
Oops: 0000 [#1]
CPU:    1
EIP:    0060:[<f9b58c77>]    Tainted: P      VLI
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010206   (2.6.12-gentoo-r4)
eax: f9bd34d4   ebx: 000000d7   ecx: 00007a12   edx: 00000000
esi: 0000000a   edi: 00000000   ebp: 00000000   esp: cea27e30
ds: 007b   es: 007b   ss: 0068
Stack: c011cb47 cf372520 f6329300 32b7b53b 32b7b53b 0000006e cea27e68 c011cc9e
       cf372520 c1807558 d17226e0 d1722520 f5cca580 d1722648 00000004 000000d7
       00000000 00000009 c042aa62 00000000 00000002 00000001 00000000 cea27ea8
Call Trace:
 [<c011cb47>] recalc_task_prio+0x8e/0x155
 [<c011cc9e>] activate_task+0x90/0xa4
 [<c042aa62>] schedule+0x3c6/0xc81
 [<c042a514>] __down+0xcc/0xdb
 [<c011ed5a>] default_wake_function+0x0/0x12
 [<c0137a91>] remove_wait_queue+0x1a/0x4a
 [<f9ba772c>] afs_osi_SleepSig+0x150/0x1a7 [libafs]
 [<f9b5821a>] afs_CacheTruncateDaemon+0x0/0x456 [libafs]
 [<c011ed5a>] default_wake_function+0x0/0x12
 [<f9ba7819>] afs_osi_Sleep+0x96/0xbb [libafs]
 [<c010788c>] do_gettimeofday+0x1e/0xbf
 [<f9b58325>] afs_CacheTruncateDaemon+0x10b/0x456 [libafs]
 [<f9bac7b0>] afsd_thread+0x3d0/0x656 [libafs]
 [<f9bac3e0>] afsd_thread+0x0/0x656 [libafs]
 [<c0101401>] kernel_thread_helper+0x5/0xb
Code: 31 bd f9 7c ec 8b 84 24 74 01 00 00 85 c0 0f 8e 55 01 00 00 8b 0d 44 31 bd f9 e9 39 fb ff ff a1 64 31 bd f9 8b 1c b0 85 db 74 0b <66> 83 7b 70 00 0f 85 5a fb ff ff a1 e4 31 bd f9 80 e2 08 8b 3c


>>EIP; f9b58c77 <pg0+3959fc77/3fa45400>   <=====

>>eax; f9bd34d4 <pg0+3961a4d4/3fa45400>
>>esp; cea27e30 <pg0+e46ee30/3fa45400>

Trace; c011cb47 <recalc_task_prio+8e/155>
Trace; c011cc9e <activate_task+90/a4>
Trace; c042aa62 <schedule+3c6/c81>
Trace; c042a514 <__down+cc/db>
Trace; c011ed5a <default_wake_function+0/12>
Trace; c0137a91 <remove_wait_queue+1a/4a>
Trace; f9ba772c <pg0+395ee72c/3fa45400>
Trace; f9b5821a <pg0+3959f21a/3fa45400>
Trace; c011ed5a <default_wake_function+0/12>
Trace; f9ba7819 <pg0+395ee819/3fa45400>
Trace; c010788c <do_gettimeofday+1e/bf>
Trace; f9b58325 <pg0+3959f325/3fa45400>
Trace; f9bac7b0 <pg0+395f37b0/3fa45400>
Trace; f9bac3e0 <pg0+395f33e0/3fa45400>
Trace; c0101401 <kernel_thread_helper+5/b>

This architecture has variable length instructions, decoding before eip
is unreliable, take these instructions with a pinch of salt.

Code;  f9b58c4c <pg0+3959fc4c/3fa45400>
00000000 <_EIP>:
Code;  f9b58c4c <pg0+3959fc4c/3fa45400>
   0:   31 bd f9 7c ec 8b         xor    %edi,0x8bec7cf9(%ebp)
Code;  f9b58c52 <pg0+3959fc52/3fa45400>
   6:   84 24 74                  test   %ah,(%esp,%esi,2)
Code;  f9b58c55 <pg0+3959fc55/3fa45400>
   9:   01 00                     add    %eax,(%eax)
Code;  f9b58c57 <pg0+3959fc57/3fa45400>
   b:   00 85 c0 0f 8e 55         add    %al,0x558e0fc0(%ebp)
Code;  f9b58c5d <pg0+3959fc5d/3fa45400>
  11:   01 00                     add    %eax,(%eax)
Code;  f9b58c5f <pg0+3959fc5f/3fa45400>
  13:   00 8b 0d 44 31 bd         add    %cl,0xbd31440d(%ebx)
Code;  f9b58c65 <pg0+3959fc65/3fa45400>
  19:   f9                        stc
Code;  f9b58c66 <pg0+3959fc66/3fa45400>
  1a:   e9 39 fb ff ff            jmp    fffffb58 <_EIP+0xfffffb58>
Code;  f9b58c6b <pg0+3959fc6b/3fa45400>
  1f:   a1 64 31 bd f9            mov    0xf9bd3164,%eax
Code;  f9b58c70 <pg0+3959fc70/3fa45400>
  24:   8b 1c b0                  mov    (%eax,%esi,4),%ebx
Code;  f9b58c73 <pg0+3959fc73/3fa45400>
  27:   85 db                     test   %ebx,%ebx
Code;  f9b58c75 <pg0+3959fc75/3fa45400>
  29:   74 0b                     je     36 <_EIP+0x36>

This decode from eip onwards should be reliable

Code;  f9b58c77 <pg0+3959fc77/3fa45400>
00000000 <_EIP>:
Code;  f9b58c77 <pg0+3959fc77/3fa45400>   <=====
   0:   66 83 7b 70 00            cmpw   $0x0,0x70(%ebx)   <=====
Code;  f9b58c7c <pg0+3959fc7c/3fa45400>
   5:   0f 85 5a fb ff ff         jne    fffffb65 <_EIP+0xfffffb65>
Code;  f9b58c82 <pg0+3959fc82/3fa45400>
   b:   a1 e4 31 bd f9            mov    0xf9bd31e4,%eax
Code;  f9b58c87 <pg0+3959fc87/3fa45400>
  10:   80 e2 08                  and    $0x8,%dl
Code;  f9b58c8a <pg0+3959fc8a/3fa45400>
  13:   8b                        .byte 0x8b
Code;  f9b58c8b <pg0+3959fc8b/3fa45400>
  14:   3c                        .byte 0x3c


1 error issued.  Results may not be reliable.
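
As a sanity check on the decode above, the faulting address can be
recomputed from the register dump. This is only a sketch: the register
values are copied from the oops, the helper name is made up for
illustration, and the reading of the instructions before EIP relies on a
decode that ksymoops itself flags as unreliable.

```python
# Values copied from the register dump and disassembly in the oops above.
EBX = 0x000000D7             # base register of the faulting instruction
ESI = 0x0000000A             # index used by "mov (%eax,%esi,4),%ebx"
DISPLACEMENT = 0x70          # from "cmpw $0x0,0x70(%ebx)"
REPORTED_FAULT = 0x00000147  # "virtual address 00000147"


def effective_address(base: int, displacement: int) -> int:
    """32-bit effective address of a base+displacement memory operand."""
    return (base + displacement) & 0xFFFFFFFF


if __name__ == "__main__":
    fault = effective_address(EBX, DISPLACEMENT)
    print(f"computed fault address: {fault:#010x}")
    print(f"matches reported address: {fault == REPORTED_FAULT}")
    # If the pre-EIP decode is right, ebx was read from a table of
    # 4-byte entries at this index and byte offset.
    print(f"table index {ESI}, byte offset {ESI * 4:#x}")
```

The computed address matches the reported 00000147, so the fault comes
from reading a 16-bit field at offset 0x70 through ebx = 0xd7. If the
instructions before EIP are decoded correctly, ebx was loaded from slot 10
of a pointer table and passed the "test %ebx,%ebx" NULL check, which
suggests the slot held a small non-NULL value instead of a valid pointer.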


Cheers,
Stefaan