[OpenAFS-devel] OpenAFS on 2.4.26 ? OpenMosix ?

Terry Gliedt tpg@umich.edu
Wed, 15 Dec 2004 14:02:26 -0500


The previous post has not been the last of the story.  We tried one more 
time, this time moving to a 2.4.27 kernel and OpenMosix 
patch-2.4.27-om-20041102.bz2. The OpenAFS code remained unchanged.

We discovered that pinning a task to a particular processor allowed the 
tasks to run to completion. At the same time we discovered that using 
migrate to move a task from one processor to another (even between 
machines with identical hardware) resulted in a segmentation fault.

Eventually we found three examples of code for testing. Two failed every 
time, sometimes quickly, sometimes not. One thing they had in common was 
the use of Perl. We speculated the problem was related to Perl threads, 
even though the Perl code was very simple and used no threads. I built 
my own version of Perl with no thread support, and all of the failures 
stopped.

Some of you may not be surprised by this, but I sure was. Obviously 
there is something in using a thread-enabled Perl which just does not 
work in OpenMosix. In our experience migrating a task using a 
thread-enabled Perl will fail 100% of the time.
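
For anyone who wants to check their own Perl before trying a migration, 
the interpreter can report whether it was built with thread support via 
its `usethreads` configuration value (`perl -V:usethreads`). The following 
is only a minimal Python sketch of that check; the function names are 
made up for illustration, and it assumes a `perl` binary is on the PATH.

```python
import subprocess


def parse_usethreads(output: str) -> bool:
    """Return True if `perl -V:usethreads` output reports a threaded build.

    The expected form is usethreads='define'; for a thread-enabled Perl
    and usethreads='undef'; for one built without threads.
    """
    value = output.strip().rstrip(";").partition("=")[2].strip("'\"")
    return value == "define"


def perl_is_threaded(perl: str = "perl") -> bool:
    """Ask the given perl binary (assumed to be on PATH) about its build."""
    result = subprocess.run(
        [perl, "-V:usethreads"], capture_output=True, text=True, check=True
    )
    return parse_usethreads(result.stdout)


if __name__ == "__main__":
    print("thread-enabled" if perl_is_threaded() else "no thread support")
```

Running it on each node before enabling migration would show which 
machines still carry the thread-enabled build.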

We've replaced FC1 Perl and have a more stable environment. We enabled 
OpenAFS for this environment and have had pretty good success, but not 
complete success. Obtaining tokens at login behaves just as we wanted - 
we're out of the password business.

Reading AFS data seems to be solid. We've not noticed any failures in 
the cache or in copying data (this is hardly a completely solid 
endorsement, but so far, so good).

Writing into AFS volumes, however, is not always successful. Sometimes 
the program doing the writing (e.g. cp) will segfault. I've seen 
various other write failures that I think had to do with locking, but 
exactly what was going on was unclear.
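
One way to try to reproduce the write failures is a loop that copies a 
file into the AFS directory with cp, checks how cp exited, and compares 
checksums afterwards. What follows is a hedged sketch in Python, not the 
procedure actually used here; the function names and the `copytest-N.dat` 
file names are hypothetical, and the destination directory is whatever 
AFS path is being tested.

```python
import hashlib
import os
import subprocess


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dest_dir, rounds=100):
    """Copy src into dest_dir with cp `rounds` times.

    Returns a list of (round, description) tuples, one per failure.
    """
    expected = sha256_of(src)
    failures = []
    for i in range(rounds):
        dest = os.path.join(dest_dir, f"copytest-{i}.dat")
        result = subprocess.run(["cp", src, dest])
        if result.returncode < 0:
            # A negative return code means cp was killed by that signal
            # number (11 is SIGSEGV on Linux).
            failures.append((i, f"cp killed by signal {-result.returncode}"))
        elif result.returncode != 0:
            failures.append((i, f"cp exited with status {result.returncode}"))
        elif sha256_of(dest) != expected:
            failures.append((i, "checksum mismatch"))
        if os.path.exists(dest):
            os.remove(dest)
    return failures
```

An empty list after many rounds would suggest the write path is behaving; 
any entry records which round failed and how.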

In one case I got a segmentation fault in cp and retried the command. 
The kernel got seriously 'sick'. In /var/log/messages I found the 
messages below. The machine became very unresponsive, to the point that 
I rebooted it. Nasty!!

The problem could possibly be in OpenMosix (whose mailing list I will 
also post to), but I thought I should tell you folks of my experience in 
case it rings a bell. If anyone is interested in pursuing this further I 
can probably arrange some testing. These problems all seem pretty common 
and can often be reproduced.


####### from /var/log/messages   Watch for line wraps

Unable to handle kernel NULL pointer dereference at virtual address 00000004
  printing eip:
  f8b73af8
  *pde = 2bcc0001
  *pte = 00000000
  Oops: 0000
  CPU:    2
  EIP:    0010:[<f8b73af8>]    Tainted: PF
  EFLAGS: 00010282
  eax: 20003312   ebx: f8c4be14   ecx: ec6b5dfc   edx: 00000000
  esi: f8c4c038   edi: ec6b5da0   ebp: ec6b5da0   esp: ecbbfe40
  ds: 0018   es: 0018   ss: 0018
  Process cp (pid: 3288, stackpage=ecbbf000)
  Stack: f9417000 ecbbe000 00000000 f8c4be14 f8c4c038 ecbbfe90 ec6b5da0 f8b776b2
         ec6b5da0 ec6b5dfc 00000002 ecbbfe90 c0360a00 ec71ad20 00000001 f9417000
         ec6b5dfc f8c4c038 ec6b5dfc 0000ffff 0001e194 00000040 f8ba22c0 f8b78a00
  Call Trace:    [<f8b776b2>] [<f8ba22c0>] [<f8b78a00>] [<c01611ed>] [<c0161a22>]
    [<c01620c9>] [<c0162429>] [<c0153443>] [<c016c8d1>] [<c0155f88>] [<c01befd5>]
    [<c01bf0df>] [<c010b8bc>]

  Code: 39 42 04 0f 84 c7 00 00 00 e8 3a e7 ff ff 89 c5 50 8d 44 24
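
As a rough cross-check of the oops above, the first bytes on the Code: 
line can be decoded by hand. Assuming the Code: bytes start at the 
faulting EIP (which is how 2.4-era i386 oops output is normally laid 
out), `39 42 04` is a CMP r/m32, r32 instruction with an 8-bit 
displacement. The Python sketch below decodes only that one instruction 
form; the helper name is hypothetical and this is not a general 
disassembler.

```python
# 32-bit general-purpose registers in x86 ModRM encoding order.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_cmp_rm32_r32(code: bytes) -> str:
    """Decode opcode 0x39 (CMP r/m32, r32) in its [reg+disp8] form only."""
    if code[0] != 0x39:
        raise ValueError("not opcode 0x39")
    modrm = code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only the [reg+disp8] form without SIB is handled")
    disp = int.from_bytes(code[2:3], "little", signed=True)
    return f"cmp [{REG32[rm]}{disp:+#x}], {REG32[reg]}"


if __name__ == "__main__":
    # First three bytes of the Code: line in the oops.
    print(decode_cmp_rm32_r32(bytes.fromhex("39 42 04")))  # cmp [edx+0x4], eax
    # edx is 00000000 in the register dump, so the access is at 0 + 4.
    print(hex(0x00000000 + 0x4))  # 0x4
```

With edx at 00000000, a read at offset 4 matches the reported fault 
address of 00000004, which is consistent with a field being read through 
a NULL structure pointer. It does not say which module or structure is 
involved.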




Terry Gliedt wrote:
> This is a followup on my experience with OpenAFS and OpenMosix. I moved 
> users' HOME directories to a local disk, rather than AFS, and got 
> everything configured as I wanted. Then I opened the machines (one 
> gateway + one dedicated node in the cluster) to one user.
> 
> She started a simulation which consisted of a Perl program driving a C 
> program, running it several tens of thousands of times. The program was 
> running on the remote cluster. This is a computationally heavy task with 
> very little I/O (typical for our world). The program was not running in 
> an AFS directory, but in a directory on a local disk.
> 
> After ten minutes or so her task segfaulted. This same software has 
> been running on several dozen other machines for the past several weeks, 
> so it's not her problem.  I disabled AFS in the rc.d scripts and 
> rebooted. The same tasks have been running for three days.
> 
> I'm afraid there is some fairly basic interaction between OpenAFS and 
> OpenMosix. I have a small window of opportunity to get some debug 
> information if someone wants to pursue this - just give me the details 
> of what you need (and how to get them).
> 
> Details:
> 
>   Fedora Core 1
>   2.4.26 kernel
>   patch-2.4.26-om-20041102.bz2 for OpenMosix
>   OpenAFS 1.3.73
> 
> 
> Terry Gliedt wrote:
> 
>> Miles Davis wrote:
>>
>>> On Tue, Nov 09, 2004 at 09:15:44AM -0500, Terry Gliedt wrote:
>>>
>>>> I can now confirm the combination of a 2.4.26 kernel + 1.3.73 
>>>> OpenAFS works just fine. Adding OpenMosix immediately results 
>>>> in this symptom:
>>>>
>>>>  SSH with X11 forwarding to OpenMosix+OpenAFS machine
>>>>  Observe messages about a failure to lock the .Xauthority file
>>>>
>>>> What apparently is happening is that as X11 attempts to add a new 
>>>> entry to .Xauthority, it creates .Xauthority-n and presumably does a 
>>>> move which fails. This results in the user's .Xauthority 
>>>> "disappearing". A simple 'mv .Xauthority-n .Xauthority' allows X11 
>>>> to work properly again.
>>>>
>>>> I presume this has something to do with locking, but that's just my 
>>>> guess. I've seen other strangeness in AFS behavior also which may be 
>>>> related (or not), however the ssh scenario I mention above has been 
>>>> my litmus test.
>>>
>>>
>>>
>>>
>>> I've had that happen several times on 1.3.73 clients, so it probably 
>>> has nothing to do with openMosix. I haven't tried 1.3.74 yet, but you 
>>> should probably give that a try.
>>
>>
>>
>> Well, I did, but that did not help. I really believe this is an 
>> interaction between OpenAFS and OpenMosix.  If I apply OpenAFS 1.3.73 
>> to a pure linux 2.4.26 kernel, AFS behaves as expected. Adding 
>> OpenMosix definitely causes the problem.  Thanks for the thought.
>>
> 
> 


-- 
=============================================================
Terry Gliedt     tpg@umich.edu       http://www.hps.com/~tpg/
Biostatistics, Univ of Michigan  Personal Email:  tpg@hps.com