[OpenAFS-devel] OpenAFS on 2.4.26 ? OpenMosix ?

Jeffrey Hutzelman jhutz@cmu.edu
Wed, 15 Dec 2004 14:48:32 -0500


On Wednesday, December 15, 2004 14:02:26 -0500 Terry Gliedt <tpg@umich.edu> 
wrote:

>####### from /var/log/messages   Watch for line wraps
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000004
>  printing eip:
>   f8b73af8
>   *pde = 2bcc0001
>   *pte = 00000000
>   Oops: 0000
>   CPU:    2
>   EIP:    0010:[<f8b73af8>]    Tainted: PF
>   EFLAGS: 00010282
>   eax: 20003312   ebx: f8c4be14   ecx: ec6b5dfc   edx: 00000000
>   esi: f8c4c038   edi: ec6b5da0   ebp: ec6b5da0   esp: ecbbfe40
>   ds: 0018   es: 0018   ss: 0018
>   Process cp (pid: 3288, stackpage=ecbbf000)
>   Stack: f9417000 ecbbe000 00000000 f8c4be14 f8c4c038 ecbbfe90 ec6b5da0 f8b776b2
>          ec6b5da0 ec6b5dfc 00000002 ecbbfe90 c0360a00 ec71ad20 00000001 f9417000
>          ec6b5dfc f8c4c038 ec6b5dfc 0000ffff 0001e194 00000040 f8ba22c0 f8b78a00
>   Call Trace:    [<f8b776b2>] [<f8ba22c0>] [<f8b78a00>] [<c01611ed>] [<c0161a22>]
>     [<c01620c9>] [<c0162429>] [<c0153443>] [<c016c8d1>] [<c0155f88>] [<c01befd5>]
>     [<c01bf0df>] [<c010b8bc>]
>
>   Code: 39 42 04 0f 84 c7 00 00 00 e8 3a e7 ff ff 89 c5 50 8d 44 24

That's not surprising.  In all of the cases you described where a process
randomly segfaults, you should see output like that in /var/log/messages
or in dmesg output.  Many kinds of invalid operation (dereferencing a bad
pointer, for example), if user code performs them, cause the program to
exit on a signal like SIGSEGV or SIGBUS, and drop a core file.  In Linux,
if one of these things happens in kernel code, the process exits on SIGSEGV
(no core), and you get an "oops" message which contains information about
the state of the kernel at the time of the failure.  That's what the
message you quoted is.

Unfortunately, the oops message is not useful in its raw form.  All of the 
numbers you see in [<>] are actually addresses inside the kernel.  In order 
for the backtrace to be useful, these need to be converted to symbolic 
form.  This is usually done automatically by the logging software, if it 
can find the kernel symbol table, which is usually available in a file 
called "System.map".  Since the conversion did not happen automatically, 
you will need to either find and use ksymoops, or reconfigure the kernel 
logging software to do the translation, and then reproduce the problem 
again.
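
To illustrate what that conversion involves, here is a minimal Python
sketch (not ksymoops or klogd themselves) that pulls the [<...>] addresses
out of an oops and maps each one to the nearest preceding symbol in
System.map-style text.  The symbol names and map entries are made up for
illustration.  A real System.map has thousands of entries, and the f8...
addresses in the trace above most likely belong to loadable modules, which
System.map does not cover.

```python
import bisect
import re

# Hypothetical System.map excerpt: "address type name" per line.
# These names are invented; they are not real kernel symbols.
SAMPLE_MAP = """\
c0161000 T hypothetical_func_a
c0162000 T hypothetical_func_b
c0163000 A hypothetical_end
"""


def load_system_map(text):
    """Parse System.map-style text into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(addr, symbols):
    """Return "name+0xoffset" for the nearest symbol at or below addr.

    Returns None when addr lies below the first symbol or at/after the
    last one, since the map does not say where the last symbol ends.
    """
    starts = [a for a, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0 or i >= len(symbols) - 1:
        return None
    base, name = symbols[i]
    return "%s+0x%x" % (name, addr - base)


def trace_addresses(oops_text):
    """Extract the hex addresses that appear as [<xxxxxxxx>] in an oops."""
    return [int(m, 16) for m in re.findall(r"\[<([0-9a-fA-F]{8})>\]", oops_text)]


if __name__ == "__main__":
    symbols = load_system_map(SAMPLE_MAP)
    oops = "Call Trace: [<f8b776b2>] [<c01611ed>] [<c0162429>]"
    for addr in trace_addresses(oops):
        print("%08x  %s" % (addr, resolve(addr, symbols) or "(no symbol)"))
```

An address that falls outside the map is reported as unresolved rather
than being attributed to the last symbol, which is the situation you are
in for module code unless the module symbols are loaded as well.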

The simplest thing to do is to make sure that klogd is able to find the
System.map file (you can point it at one explicitly with -k), and that it
is not invoked with -x, which turns off address translation.  You will
probably get the best results by running klogd with -p, so it will reload
module symbol information when it sees an oops (otherwise it may not have
a complete set of symbols for openafs).


FWIW, I have not heard of anyone getting OpenAFS and OpenMosix to work 
together, even to the extent that you've reported so far.  We have had 
several reports of failures in the past, though...

-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
   Sr. Research Systems Programmer
   School of Computer Science - Research Computing Facility
   Carnegie Mellon University - Pittsburgh, PA