[OpenAFS-devel] Re: OpenAFS on 2.4.26 ? OpenMosix ?

Onime Clement onime@ictp.trieste.it
Fri, 17 Dec 2004 22:12:44 +0100 (CET)


Hi Terry,
Unfortunately, you did not say whether you have omfs enabled.
I have a cluster of about 40 machines (user nodes) running openafs and
openmosix, but with omfs disabled on all of them. The openmosix options I
use lock programs that do I/O to their originating node;
this fixed most of the perl problems for me.
I would suggest disabling omfs, if it is enabled.

Thanks
Clement Onime

>
>    1. Re: OpenAFS on 2.4.26 ? OpenMosix ? (Terry Gliedt)
>    2. Re: OpenAFS on 2.4.26 ? OpenMosix ? (Jeffrey Hutzelman)
>    3. Re: 1.3.75 on FC3 (Matthew N. Andrews)
>
> --__--__--
>
> Message: 1
> Date: Wed, 15 Dec 2004 14:02:26 -0500
> From: Terry Gliedt <tpg@umich.edu>
> Organization: Biostatistics
> To: openafs-devel@openafs.org
> Subject: Re: [OpenAFS-devel] OpenAFS on 2.4.26 ? OpenMosix ?
>
> The previous post has not been the last of the story.  We tried one more
> time, this time moving to a 2.4.27 kernel and OpenMosix
> patch-2.4.27-om-20041102.bz2. The OpenAFS code remained unchanged.
>
> We discovered that pinning a task to a particular processor allowed the
> tasks to run to completion. At the same time we discovered that using
> migrate to move a task from one processor to another (even between
> machines with identical hardware) resulted in a segmentation fault.
>
> Eventually we found three examples of code for testing. Two failed every
> time, sometimes quickly, sometimes not. One thing they had in common was
> the use of Perl. We speculated the problem was related to Perl threads,
> even though the Perl code was very simple and used no threads. I built
> my own version of Perl without thread support, and all the failures
> stopped.
>
> Some of you may not be surprised by this, but I sure was. Obviously
> there is something in using a thread-enabled Perl which just does not
> work in OpenMosix. In our experience migrating a task using a
> thread-enabled Perl will fail 100% of the time.
>
> We've replaced FC1 Perl and have a more stable environment. We enabled
> OpenAFS for this environment and have had pretty good success, but not
> complete success. Obtaining tokens at login behaves just as we wanted -
> we're out of the password business.
>
> Reading AFS data seems to be solid. We've not noticed any failures in
> the cache or in copying data (this is hardly a completely solid
> endorsement, but so far, so good).
>
> Writing into AFS volumes, however, is not always successful. Sometimes
> the program (e.g. cp) doing the writing will hit a segmentation fault.
> I've seen various other write failures that I think had to do with
> locking, but exactly what was going on was unclear.
>
> In one case I got a segmentation fault in cp and retried the command. The
> kernel got seriously 'sick'. In /var/log/messages I found the messages
> below. The machine became very unresponsive, to the point that I rebooted.
> Nasty!!
>
> The problem could possibly be in OpenMosix (whose mailing list I will
> also post to), but I thought I should tell you folks of my experience in
> case it rings a bell. If anyone is interested in pursuing this further I
> can probably arrange some testing. These problems all seem pretty common
> and can often be reproduced.
>
>
> ####### from /var/log/messages   Watch for line wraps
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000004
>   printing eip:
>   f8b73af8
>   *pde = 2bcc0001
>   *pte = 00000000
>   Oops: 0000
>   CPU:    2
>   EIP:    0010:[<f8b73af8>]    Tainted: PF
>   EFLAGS: 00010282
>   eax: 20003312   ebx: f8c4be14   ecx: ec6b5dfc   edx: 00000000
>   esi: f8c4c038   edi: ec6b5da0   ebp: ec6b5da0   esp: ecbbfe40
>   ds: 0018   es: 0018   ss: 0018
>   Process cp (pid: 3288, stackpage=ecbbf000)
>   Stack: f9417000 ecbbe000 00000000 f8c4be14 f8c4c038 ecbbfe90 ec6b5da0 f8b776b2
>          ec6b5da0 ec6b5dfc 00000002 ecbbfe90 c0360a00 ec71ad20 00000001 f9417000
>          ec6b5dfc f8c4c038 ec6b5dfc 0000ffff 0001e194 00000040 f8ba22c0 f8b78a00
>   Call Trace:    [<f8b776b2>] [<f8ba22c0>] [<f8b78a00>] [<c01611ed>] [<c0161a22>]
>     [<c01620c9>] [<c0162429>] [<c0153443>] [<c016c8d1>] [<c0155f88>] [<c01befd5>]
>     [<c01bf0df>] [<c010b8bc>]
>
>   Code: 39 42 04 0f 84 c7 00 00 00 e8 3a e7 ff ff 89 c5 50 8d 44 24
>
>
>
>
> Terry Gliedt wrote:
>> This is a followup on my experience with OpenAFS and OpenMosix. I moved
>> users' HOME to a local disk, rather than keeping it in AFS, and got everything
>> configured as I wanted. Then I opened the machines (one gateway + one
>> dedicated node in the cluster) to one user.
>>
>> She started a simulation which consisted of a Perl program driving a C
>> program, running it several tens of thousands of times. The program was
>> running on the remote cluster. This is a computationally heavy task with
>> very little I/O (typical for our world). The program was not
>> running in an AFS directory, but in a directory on a local disk.
>>
>> After ten minutes or so her task died with a segmentation fault. This
>> same software has been running on several dozen other machines for the
>> past several weeks,
>> so it's not her problem.  I disabled AFS in the rc.d scripts and
>> rebooted. The same tasks have been running for three days.
>>
>> I'm afraid there is some fairly basic interaction between OpenAFS and
>> OpenMosix. I have a small window of opportunity to get some debug
>> information if someone wants to pursue this - just give me the details
>> of what you need (and how to get them).
>>
>> Details:
>>
>>   Fedora Core 1
>>   2.4.26 kernel
>>   patch-2.4.26-om-20041102.bz2 for OpenMosix
>>   OpenAFS 1.3.73
>>
>>
>> Terry Gliedt wrote:
>>
>>> Miles Davis wrote:
>>>
>>>> On Tue, Nov 09, 2004 at 09:15:44AM -0500, Terry Gliedt wrote:
>>>>
>>>>> I can now confirm that the combination of a 2.4.26 kernel + 1.3.73
>>>>> OpenAFS works just fine. Adding OpenMosix immediately results
>>>>> in this symptom:
>>>>>
>>>>>  SSH with X11 forwarding to OpenMosix+OpenAFS machine
>>>>>  Observe messages about a failure to lock the .Xauthority file
>>>>>
>>>>> What apparently is happening is that as X11 attempts to add a new
>>>>> entry to .Xauthority, it creates .Xauthority-n and presumably does a
>>>>> move which fails. This results in the user's .Xauthority
>>>>> "disappearing". A simple 'mv .Xauthority-n .Xauthority' allows X11
>>>>> to work properly again.
>>>>>
>>>>> I presume this has something to do with locking, but that's just my
>>>>> guess. I've seen other strangeness in AFS behavior also which may be
>>>>> related (or not), however the ssh scenario I mention above has been
>>>>> my litmus test.
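
For anyone who wants to picture the sequence being guessed at here, below is a minimal Python sketch of a "write a new file, drop the old one, move the new one into place" update. It is only an illustration of how a failed move could leave the target missing while the "-n" file survives. Whether the X11 tools do exactly this is an assumption, and the function names are made up for the example.

```python
import os
import tempfile


def update_by_move(path, data):
    """Write data to '<path>-n', drop the old file, then move the new one in."""
    new_path = path + "-n"          # suffix chosen to mirror .Xauthority-n
    with open(new_path, "wb") as handle:
        handle.write(data)
    if os.path.exists(path):
        os.unlink(path)             # from here until the move, 'path' is gone
    os.rename(new_path, path)       # if this step fails, only '<path>-n' remains


def recover(path):
    """Manual repair, equivalent to 'mv <path>-n <path>'."""
    new_path = path + "-n"
    if os.path.exists(new_path) and not os.path.exists(path):
        os.rename(new_path, path)
        return True
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, ".Xauthority")
        update_by_move(target, b"first entry")
        # Simulate the reported state: new data written, old file gone,
        # move never completed.
        with open(target + "-n", "wb") as handle:
            handle.write(b"second entry")
        os.unlink(target)
        print(recover(target))
        with open(target, "rb") as handle:
            print(handle.read())
```

If the final move is the step that fails on the OpenMosix+OpenAFS machine, the state left behind would match the "disappearing" file and the manual mv that repairs it.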
>>>>
>>>>
>>>>
>>>>
>>>> I've had that happen several times on 1.3.73 clients, so it probably
>>>> has nothing to do with openMosix. I haven't tried 1.3.74 yet, but you
>>>> should probably give that a try.
>>>
>>>
>>>
>>> Well, I did, but that did not help. I really believe this is an
>>> interaction between OpenAFS and OpenMosix.  If I apply OpenAFS 1.3.73
>>> to a pure linux 2.4.26 kernel, AFS behaves as expected. Adding
>>> OpenMosix definitely causes the problem.  Thanks for the thought.
>>>
>>
>>
>
>
> --
> =============================================================
> Terry Gliedt     tpg@umich.edu       http://www.hps.com/~tpg/
> Biostatistics, Univ of Michigan  Personal Email:  tpg@hps.com
>
> --__--__--
>
> Message: 2
> Date: Wed, 15 Dec 2004 14:48:32 -0500
> From: Jeffrey Hutzelman <jhutz@cmu.edu>
> To: Terry Gliedt <tpg@umich.edu>, openafs-devel@openafs.org
> Subject: Re: [OpenAFS-devel] OpenAFS on 2.4.26 ? OpenMosix ?
>
>
>
> On Wednesday, December 15, 2004 14:02:26 -0500 Terry Gliedt
> <tpg@umich.edu> wrote:
>
>>####### from /var/log/messages   Watch for line wraps
>>
>> Unable to handle kernel NULL pointer dereference at virtual address 00000004
>>   printing eip:
>>   f8b73af8
>>   *pde = 2bcc0001
>>   *pte = 00000000
>>   Oops: 0000
>>   CPU:    2
>>   EIP:    0010:[<f8b73af8>]    Tainted: PF
>>   EFLAGS: 00010282
>>   eax: 20003312   ebx: f8c4be14   ecx: ec6b5dfc   edx: 00000000
>>   esi: f8c4c038   edi: ec6b5da0   ebp: ec6b5da0   esp: ecbbfe40
>>   ds: 0018   es: 0018   ss: 0018
>>   Process cp (pid: 3288, stackpage=ecbbf000)
>>   Stack: f9417000 ecbbe000 00000000 f8c4be14 f8c4c038 ecbbfe90 ec6b5da0 f8b776b2
>>          ec6b5da0 ec6b5dfc 00000002 ecbbfe90 c0360a00 ec71ad20 00000001 f9417000
>>          ec6b5dfc f8c4c038 ec6b5dfc 0000ffff 0001e194 00000040 f8ba22c0 f8b78a00
>>   Call Trace:    [<f8b776b2>] [<f8ba22c0>] [<f8b78a00>] [<c01611ed>] [<c0161a22>]
>>     [<c01620c9>] [<c0162429>] [<c0153443>] [<c016c8d1>] [<c0155f88>] [<c01befd5>]
>>     [<c01bf0df>] [<c010b8bc>]
>>
>>   Code: 39 42 04 0f 84 c7 00 00 00 e8 3a e7 ff ff 89 c5 50 8d 44 24
>
> That's not surprising.  In all of the cases you described where a process
> randomly seg faults, you should see output like that in /var/log/messages
> or in dmesg output.  There are a wide variety of bad things that, if user
> code does them, cause the program to exit on a signal like SIGSEGV or
> SIGBUS, and drop a core file.  In Linux, if one of these things happens in
> kernel code, the process exits on SIGSEGV (no core), and you get an "oops"
> message which contains information about the state of the kernel at the
> time of the failure.  That's what the message you quoted is.
>
> Unfortunately, the oops message is not useful in its raw form.  All of the
> numbers you see in [<>] are actually addresses inside the kernel.  In
> order for the backtrace to be useful, these need to be converted to
> symbolic form.  This is usually done automatically by the logging
> software, if it can find the kernel symbol table, which is usually
> available in a file called "System.map".  Since the conversion did not
> happen automatically, you will need to either find and use ksymoops, or
> reconfigure the kernel logging software to do the translation, and then
> reproduce the problem again.
>
> The simplest thing to do is to make sure that klogd is able to find the
> System.map file, and that it is not invoked with -x.  You will probably
> get the best results by running klogd with -p, so it will reload symbol
> table information when it sees an error (otherwise it may not have a
> complete set of symbols for openafs).
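
To make the address-to-symbol conversion concrete, here is a rough Python sketch of the lookup that klogd or ksymoops performs. It is not a replacement for either tool. It assumes the usual System.map line layout of "address type name". The sample map entries and the example_* names are invented for illustration and do not come from any real kernel. Treating everything past the last map entry as unresolved is a simplification of mine, standing in for code such as loadable modules that a static map does not describe.

```python
import bisect
import re

# Hypothetical System.map excerpt; names and layout are made up.
SAMPLE_MAP = """\
c0161000 T example_a
c0161800 T example_b
c0162000 T example_c
c0163000 A example_end
"""


def load_system_map(text):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return 'name+0xoffset' for the nearest symbol at or below address."""
    if not symbols or address > symbols[-1][0]:
        return None                 # past the end of the map: not resolvable here
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None                 # below the first symbol
    start, name = symbols[index]
    return "%s+0x%x" % (name, address - start)


def trace_addresses(oops_text):
    """Extract the bracketed [<hex>] addresses from oops text."""
    return [int(m, 16) for m in re.findall(r"\[<([0-9a-fA-F]{8})>\]", oops_text)]


if __name__ == "__main__":
    table = load_system_map(SAMPLE_MAP)
    trace = "Call Trace:    [<f8b776b2>] [<c01611ed>] [<c0161a22>]"
    for addr in trace_addresses(trace):
        print("%08x" % addr, resolve(table, addr))
```

With a real System.map substituted for the sample text, the entries that stay unresolved are the ones that need module symbol information, which is the gap the -p suggestion above is meant to close.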
>
>
> FWIW, I have not heard of anyone getting OpenAFS and OpenMosix to work
> together, even to the extent that you've reported so far.  We have had
> several reports of failures in the past, though...
>
> -- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
>    Sr. Research Systems Programmer
>    School of Computer Science - Research Computing Facility
>    Carnegie Mellon University - Pittsburgh, PA
>
>
> --__--__--
>
> Message: 3
> Date: Wed, 15 Dec 2004 17:36:46 -0800
> From: "Matthew N. Andrews" <matt@slackers.net>
> To: openafs-devel@openafs.org
> Subject: Re: [OpenAFS-devel] 1.3.75 on FC3
>
> d'oh,
>
> here's some more info on my problems with openafs on FC3 x86_64
>
> after looking at dmesg and slapping my forehead I see:
>
> libafs: Unknown symbol ia32_sys_call_table
>
> at this point I looked at acinclude.m4, and tried this patch to force the
> test for ia32_sys_call_table to fail:
>
> ---- cut here ----
> --- acinclude.m4        2004-12-13 11:40:42.000000000 -0800
> +++ acinclude.m4.no_ia32_sys_call_table 2004-12-15 16:31:22.093260576 -0800
> @@ -579,9 +579,7 @@
>                   if test "x$ac_cv_linux_config_modversions" = "xno" -o $AFS_SYSKVERS -ge 26; then
>                     AC_MSG_WARN([Cannot determine sys_call_table status. assuming it isn't exported])
>                     ac_cv_linux_exports_sys_call_table=no
> -                  if test -f "$LINUX_KERNEL_PATH/include/asm/ia32_unistd.h"; then
> -                    ac_cv_linux_exports_ia32_sys_call_table=yes
> -                  fi
> +                  ac_cv_linux_exports_ia32_sys_call_table=no
>                   else
>                     LINUX_EXPORTS_INIT_MM
>                     LINUX_EXPORTS_KALLSYMS_ADDRESS
> ---- cut here ----
>
> this then causes the make to fail when compiling the libafs module with
> these errors:
>
>   CC [M]  /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/AFS_component_version_number.o
>   CC [M]  /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.o
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c: In function `afs_init':
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c:453: warning: `interruptible_sleep_on' is deprecated (declared at include/linux/wait.h:290)
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c:462: error: `sys_exit' undeclared (first use in this function)
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c:462: error: (Each undeclared identifier is reported only once
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c:462: error: for each function it appears in.)
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c:463: error: `sys_open' undeclared (first use in this function)
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c:464: warning: assignment from incompatible pointer type
> make[6]: *** [/usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.o] Error 1
> make[5]: *** [_module_/usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP] Error 2
> make[5]: Leaving directory `/lib/modules/2.6.9-1.678_FC3smp/build'
> make[4]: *** [libafs.ko] Error 2
> make[4]: Leaving directory `/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP'
> make[3]: *** [linux_compdirs] Error 2
> make[3]: Leaving directory `/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs'
> make[2]: *** [libafs] Error 2
> make[2]: Leaving directory `/home/ma3d/rpmbuild/BUILD/openafs-1.3.76'
> make[1]: *** [build] Error 2
> make[1]: Leaving directory `/home/ma3d/rpmbuild/BUILD/openafs-1.3.76'
> make: *** [all] Error 2
>
> Is there a way to get the current openafs code to work on a machine which
> has neither ia32_sys_call_table, nor sys_call_table?
>
> -Matt
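
As a quick way to tell "symbol exists in the kernel" apart from "symbol is exported to modules", here is a small Python sketch that scans System.map-style text for __ksymtab_ entries. It assumes that exported symbols show up in the map with a __ksymtab_ prefix, which is how the kernel's export macros name their table entries. The sample map text and addresses are invented for illustration and do not describe the FC3 kernel. The helper names are mine.

```python
# Hypothetical System.map excerpt; addresses and contents are made up.
SAMPLE_MAP = """\
ffffffff80100000 T printk
ffffffff80400000 R __ksymtab_printk
ffffffff80500000 D sys_call_table
"""

PREFIX = "__ksymtab_"


def exported_symbols(system_map_text):
    """Collect names that have a __ksymtab_ entry in the map text."""
    exported = set()
    for line in system_map_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[2].startswith(PREFIX):
            exported.add(parts[2][len(PREFIX):])
    return exported


def export_report(system_map_text, wanted):
    """Map each wanted symbol name to True if it appears to be exported."""
    exported = exported_symbols(system_map_text)
    return {name: name in exported for name in wanted}


if __name__ == "__main__":
    wanted = ["printk", "sys_call_table", "ia32_sys_call_table"]
    for name, ok in export_report(SAMPLE_MAP, wanted).items():
        print(name, "exported" if ok else "not exported")
```

In the sample, sys_call_table is defined but has no export entry, so a module that references it would fail to load with an unknown-symbol error, just as one that references a symbol missing altogether would.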
>
>
> Matthew N. Andrews wrote:
>> hello,
>>
>> after getting 1.3.75 to compile on a dual processor x86_64 FC3 machine,
>> I am now stuck with a module that fails to load with the following
>> error:
>>
>> # insmod /usr/vice/etc/modload/libafs-2.6.9-1.678_FC3smp-amd64.ko
>> insmod: error inserting '/usr/vice/etc/modload/libafs-2.6.9-1.678_FC3smp-amd64.ko': -1 Unknown symbol in module
>>
>>
>> I remember others seeing this same error earlier on the list, but
>> couldn't find a reference to what the problem was then. Anyone have any
>> ideas?
>>
>> thanks for any help.
>>
>> -Matthew Andrews
>> _______________________________________________
>> OpenAFS-devel mailing list
>> OpenAFS-devel@openafs.org
>> https://lists.openafs.org/mailman/listinfo/openafs-devel
>>
>>
>
>
>
> --__--__--
>
>
>
> End of OpenAFS-devel Digest
>