[OpenAFS-port-darwin] mv issues with panther and 1.2.10a ?

Jason Young itecs-openafs@engr.ncsu.edu
Thu, 08 Jan 2004 13:08:31 -0500


--On Wednesday, January 7, 2004 4:45 PM -0500 Everette Gray Allen
<Everette_Allen@ncsu.edu> wrote:

> I can consistently hang my G4 and G5 machines (have not tried G3) when
> OpenAFS 1.2.10a is installed by doing:
> cd /afs/wherever
> mv ./anyfilesizedoesnotmatter /tmp
> 
> anyone else seeing this problem?

Some additional information that may be of use:

This problem likely will manifest itself with other applications.  The
first provocation of my steady stream of curse words related to this
started when I tried to use SubEthaEdit to edit files in AFS.  On "save"
the system would completely freeze - not even a "please press the power
button" kernel panic message.

Watching how SubEthaEdit writes files with fs_usage produces:

17:20:50  stat            <file in /private/tmp>
17:20:50  open            <file in /private/tmp>
17:20:50  write          
17:20:50  fsync           
17:20:50  close           
17:20:50  rename          <file in /private/tmp to file in /afs>  
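
For anyone who wants to try this without SubEthaEdit, the same call
sequence can be approximated with a few lines of Python. This is only a
sketch of the traced pattern, not what SubEthaEdit actually runs; the
function name save_via_rename and the scratch file name are made up for
illustration. On a healthy system a rename() across two filesystems is
expected to fail with EXDEV rather than succeed, so the sketch reports the
errno name instead of raising.

```python
import errno
import os


def save_via_rename(tmp_dir, dest_path, data):
    """Write data to a scratch file in tmp_dir, then rename() it to dest_path.

    Follows the traced order: stat, open, write, fsync, close, rename.
    Returns None on success, or the errno name (e.g. "EXDEV") if the
    rename fails.
    """
    # Hypothetical scratch file name, chosen only for this sketch.
    tmp_path = os.path.join(tmp_dir, "rename-test.%d" % os.getpid())

    # stat: probe the scratch path (it normally does not exist yet)
    try:
        os.stat(tmp_path)
    except FileNotFoundError:
        pass

    # open / write / fsync / close (small payloads only; no short-write loop)
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)

    # rename: the last call in the trace before the freeze
    try:
        os.rename(tmp_path, dest_path)
    except OSError as exc:
        os.unlink(tmp_path)
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

Calling save_via_rename("/private/tmp", <some path under /afs>, b"test\n")
would exercise the same rename step; on an affected machine that may well
hang the box, so save your work first.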

An incomplete panic log is produced after the system reboots.

partial/relevant information:

panic(cpu 0): 0x300 - Data access
Latest stack backtrace for cpu 0:
      Backtrace:
         0x000833B8 0x0008389C 0x0001ED8C 0x000908C0 0x00093B8C 
Proceeding back via exception chain:
   Exception state (sv=0x296F4500)
      PC=0x002239F4; MSR=0x00009030; DAR=0xDEADD1F5; DSISR=0x40000000; LR=0x000C06AC;
      R1=0x172D3CD0; XCP=0x0000000C (0x300 - Data access)
      Backtrace:
         0x050A259C 0x000C8DE4 0x0023DD24 0x00093D20 0x002F0073 
         backtrace terminated - frame not mapped or invalid: 0xBFFFBBC0

   Exception state (sv=0x28484A00)
      PC=0x9002E18C; MSR=0x0200F030; DAR=0xE02FF000; DSISR=0x42000000; LR=0x90323F08;
      R1=0xBFFFBBC0; XCP=0x00000030 (0xC00 - System call)

Kernel version:
Darwin Kernel Version 7.2.0:
Thu Dec 11 16:20:23 PST 2003; root:xnu/xnu-517.3.7.obj~1/RELEASE_PPC

I walked through the steps at:

 http://developer.apple.com/technotes/tn2002/tn2063.html

Because the backtrace can't be completed (I guess because of the reboot or
some deeper issue - I'm at the limit of my knowledge of how useful it is to
troubleshoot a panic.log after the reboot), the usefulness of that document
is limited.

However, the few accessible addresses in the first line of the backtrace
were branch instructions in what appeared to be the rename() syscall.
I'm honestly not up to snuff on the architecture, and my gdb-style
debugging skills (if I ever had any) are too dusty, to know whether what I
was seeing actually correlates back to the problem at hand (that is,
whether the system code responsible for rename would be at the same memory
locations across the reboot if no kernel modules changed, and whether the
panic log was useful). Still, it seems like a strong indication that
Everette's mv tests, plus what I was seeing with SubEthaEdit, point to an
ongoing problem with rename().  Everette mentioned something that reminded
me of the Jaguar "cd foo, mkdir foo mv foo .." kernel panic - and it seems
possible that whatever internal changes were made to rename() to fix that
might be causing this.
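
If anyone wants to repeat the address lookup, the backtrace addresses and
exception-state registers can be pulled out of the panic log text
mechanically before feeding them to gdb. A rough Python sketch follows; the
function name parse_panic_log is made up for illustration, and the parsing
rules are inferred only from the excerpt quoted in this message, so other
panic.log layouts may not parse.

```python
import re

HEX = re.compile(r"0x[0-9A-Fa-f]+")
# Register fields look like "PC=0x002239F4;" (upper-case name, hex value).
REG = re.compile(r"\b([A-Z][A-Z0-9]*)=(0x[0-9A-Fa-f]+)")


def parse_panic_log(text):
    """Return (backtraces, registers) found in panic log text.

    backtraces: one list of integer addresses per "Backtrace:" section.
    registers: one dict of register name -> integer per "Exception state".
    """
    backtraces, registers = [], []
    in_backtrace = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Backtrace:"):
            backtraces.append([])
            in_backtrace = True
        elif stripped.startswith("Exception state"):
            registers.append({})
            in_backtrace = False
        elif in_backtrace and stripped.startswith("0x"):
            # Address rows that directly follow a "Backtrace:" header
            backtraces[-1].extend(int(h, 16) for h in HEX.findall(stripped))
        else:
            in_backtrace = False
            if registers:
                # Register rows belong to the most recent exception state
                for name, value in REG.findall(stripped):
                    registers[-1][name] = int(value, 16)
    return backtraces, registers
```

The first list returned holds the addresses from the first backtrace line,
which are the ones to look up in the kernel.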

I tried different "size" parameters in the afsd.options settings and also
tried -memcache, to mitigate or rule out disk-caching issues, but the
afsd.options settings have no effect on the crash.  I have also observed
SubEthaEdit writing a lot to /.vol/, which I know is for HFS+ use, but I
don't understand the relationship between /.vol/ and remote filesystems.

I did some limited wire sniffing (observational; I didn't keep the
captures) - enough that I'm reasonably sure nothing is going out on the
wire when it crashes, i.e. it's not getting something strange back from the
AFS servers or sending something strange.  I may need to verify this more
thoroughly though.

I have not tried any of the dev builds - this is all using 1.2.10a.  I
haven't tried any other remote filesystems.  

We now have three boxes (set up differently by different NCSU
organizations) on which we can reproduce the crash, so I've ruled out my
individual machine (or Everette's) as the cause.  But I'm somewhat at a
loss (in both experience and, probably, time) as to how to debug more
effectively through either OpenAFS or rename(), so that the darwin porters
- or, more likely, Apple - can fix things (or we can, if this is limited to
NCSU).  This would showstop any Panther+AFS deployments we can do in
Engineering, and I imagine it's going to heavily affect Everette and
overall NCSU lab use of Panther+AFS (and it is likely a heads-up for a lot
of the rest of you who either are seeing it or haven't seen it yet - unless
this is isolated to NCSU).

Jason

----------------------------------------------------
 Jason Young  # ITECS Systems Group Manager
 NC State University College of Engineering