[OpenAFS-devel] OpenAFS on 2.4.26 ? OpenMosix ?

Terry Gliedt tpg@umich.edu
Mon, 15 Nov 2004 08:59:06 -0500


This is a follow-up on my experience with OpenAFS and OpenMosix. I moved 
the users' HOME directories to a local disk rather than AFS and got 
everything configured as I wanted. Then I opened the machines (one 
gateway + one dedicated node in the cluster) to one user.

She started a simulation which consisted of a Perl program driving a C 
program, running it several tens of thousands of times. The program was 
running on the remote cluster. This is a computationally heavy task with 
very little I/O (typical for our world). The program was not running in 
an AFS directory, but in a directory on a local disk.

After ten minutes or so her task died with a segmentation fault. This 
same software has been running on several dozen other machines for the 
past several weeks, so the fault is not in her code. I disabled AFS in 
the rc.d scripts and rebooted. With AFS disabled, the same tasks have 
now been running for three days.

I'm afraid there is some fairly basic interaction between OpenAFS and 
OpenMosix. I have a small window of opportunity to get some debug 
information if someone wants to pursue this - just give me the details 
of what you need (and how to get them).

Details:

   Fedora Core 1
   2.4.26 kernel
   patch-2.4.26-om-20041102.bz2 for OpenMosix
   OpenAFS 1.3.73


Terry Gliedt wrote:
> Miles Davis wrote:
> 
>> On Tue, Nov 09, 2004 at 09:15:44AM -0500, Terry Gliedt wrote:
>>
>>> I can now confirm the combination of a 2.4.26 kernel + 1.3.73 
>>> OpenAFS works just fine. Adding OpenMosix immediately results in 
>>> this symptom:
>>>
>>>  SSH with X11 forwarding to OpenMosix+OpenAFS machine
>>>  Observe messages about a failure to lock the .Xauthority file
>>>
>>> What apparently is happening is that as X11 attempts to add a new 
>>> entry to .Xauthority, it creates .Xauthority-n and presumably does a 
>>> move which fails. This results in the user's .Xauthority 
>>> "disappearing". A simple 'mv .Xauthority-n .Xauthority' allows X11 to 
>>> work properly again.
>>>
>>> I presume this has something to do with locking, but that's just my 
>>> guess. I've seen other strangeness in AFS behavior also which may be 
>>> related (or not). However, the ssh scenario I mention above has been 
>>> my litmus test.
>>
>>
>>
>> I've had that happen several times on 1.3.73 clients, so it probably 
>> has nothing to do with openMosix. I haven't tried 1.3.74 yet, but you 
>> should probably give that a try.
> 
> 
> Well, I did, but that did not help. I really believe this is an 
> interaction between OpenAFS and OpenMosix. If I apply OpenAFS 1.3.73 to 
> a pure Linux 2.4.26 kernel, AFS behaves as expected. Adding OpenMosix 
> definitely causes the problem. Thanks for the thought.
> 
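
For anyone who wants to poke at the .Xauthority symptom quoted above 
without going through ssh, here is a rough sketch of a test that could 
be run in an AFS directory on an affected machine. It is only a guess 
at the sequence involved: it assumes the lock is taken by creating a 
scratch file and hard-linking it, and that the new file is written as 
"-n" and then renamed into place. The function name 
check_lock_and_rename and the scratch file name "xauth-test" are made 
up for illustration, and nothing here touches a real .Xauthority.

```python
import os
import sys
import tempfile


def check_lock_and_rename(directory, base="xauth-test"):
    """Try a hard-link lock followed by a write-then-rename in `directory`.

    `base` is a hypothetical scratch name, so a real .Xauthority is
    never modified. Returns a dict describing how far the sequence got.
    """
    target = os.path.join(directory, base)
    lock_src = target + "-c"  # assumed scratch file created for the lock
    lock_dst = target + "-l"  # assumed lock name, made by hard link
    temp = target + "-n"      # new copy, named as in the report
    result = {"lock_ok": False, "rename_ok": False, "error": None}
    try:
        # Step 1: take the lock by creating a file and hard-linking it.
        fd = os.open(lock_src, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        os.close(fd)
        os.link(lock_src, lock_dst)
        result["lock_ok"] = True

        # Step 2: write the new contents, then move them into place.
        with open(temp, "wb") as f:
            f.write(b"dummy entry\n")
        os.rename(temp, target)
        result["rename_ok"] = True
    except OSError as exc:
        # Record which step failed instead of aborting the test.
        result["error"] = "%s: %s" % (type(exc).__name__, exc)
    finally:
        # Always release the lock files so the test can be repeated.
        for path in (lock_src, lock_dst):
            try:
                os.unlink(path)
            except OSError:
                pass
    result["target_exists"] = os.path.exists(target)
    result["temp_exists"] = os.path.exists(temp)
    return result


if __name__ == "__main__":
    # Pass the directory to test (e.g. one in AFS) as the first argument;
    # with no argument a fresh temporary directory is used.
    where = sys.argv[1] if len(sys.argv) > 1 else tempfile.mkdtemp()
    print(where, check_lock_and_rename(where))
```

Comparing the output for an AFS directory and a local directory, with 
and without the OpenMosix patch, might show whether the link or the 
rename step is the one that fails.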


-- 
=============================================================
Terry Gliedt     tpg@umich.edu       http://www.hps.com/~tpg/
Biostatistics, Univ of Michigan  Personal Email:  tpg@hps.com