[OpenAFS-port-darwin] Mystery Problem with directories on Tiger

Garance A Drosihn drosih@rpi.edu
Mon, 19 Mar 2007 17:32:30 -0400


At 1:38 PM -0400 3/19/07, Derrick J Brashear wrote:
>On Mon, 19 Mar 2007, Garance A Drosihn wrote:
>
>>The run of my ruby script finished on my intel-based Mac-mini.  It
>>checked almost 200,000 directories, found 2067 AFS volumes, and all of
>>those AFS volumes worked fine.  No getcwd() problems.
>
>Ok. Can you produce the getcwd problem with 1.5 on that machine?

So far, I have never seen the getcwd problem on the intel-based
Mac-mini.  It's a "Core Duo" machine, so two CPUs in one chip, but
it's the 32-bit-only version.  I do not have a Core 2 Duo machine
to test it on.  I have tried to reproduce the getcwd() problem with both
OpenAFS 1.4.2 and OpenAFS 1.5.1.  I *did* have one case where my
Mac-mini hung at shutdown after running the script, so my script
might still be tickling some bug when run on the Mac-mini, but the
getcwd() problem does not show up.

On my Dual-G5 machine, I have seen the getcwd() problem when running
OpenAFS 1.4.1, 1.4.2, and 1.5.1.  Both machines are running the same
version of MacOS (I've tested this with both 10.4.8 and 10.4.9, on
both machines).

As much as I didn't want to disrupt my regular work on my desktop,
curiosity got the better of me and I tried a few more tests.  I shut down
the Mac-mini and switched its network cable to my desktop.  The problem
still shows up when the desktop is using the same network cable and the
same network address as the Mac-mini, so it doesn't look like a
networking issue.

This problem has been particularly irksome to me, because it seemed
much more likely to show up when I was ready to do some serious work,
and never seemed to show up when I was just doing a quick check of
something in the same directories.  I kept telling myself I was just
imagining that, but it turns out there is something to it.

It turns out that if I point the script somewhere under /afs/rpi.edu,
then it is much less likely to have any problems.  But if I point it
into /afs/.rpi.edu, then the getcwd problem is much more likely to
happen.  And if I 'cd' into the /afs/.rpi.edu version of one of the
problematic AFS volumes, then I *will* see a problem at that volume
when I point the script at the /afs/rpi.edu version.  Now, when I'm
just checking some files in AFS, I probably always 'cd' into the
/afs/rpi.edu version, but when I'm ready to do serious work I'll 'cd'
to the /afs/.rpi.edu version, even for a volume where I don't need to,
just out of habit.

It is also interesting that none of the problematic AFS volumes are
replicated.  So, for all of them, the volume seen under /afs/rpi.edu
is exactly the same volume as seen under /afs/.rpi.edu.

Even armed with all the above info, I have not been able to trigger
the getcwd() problem on the Mac-mini, but I'll keep trying.

The script I use is a quickly thrown-together hack, with various
sections copied and pasted from other scripts.  So, it ain't pretty, but
if anyone else wants to try it, the script is at:

http://www.rpi.edu/~drosehn/openafs/getwd_AFS_vols

That URL is the script itself, so your web browser may simply
download it as a file.

You have to give it a directory to start at, and it will check that
directory and all directories below that one.  It will do an 'fs lq'
for every directory, and when it notices a new AFS volume it will
'cd' into the directory and then call the Dir.getwd of ruby (which
in turn calls getcwd).  The script will keep track of which calls fail.

Example:   getwd_AFS_vols --start-dir=/afs/rpi.edu/campus/print

although it would probably be best to point it into your local cell!
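For anyone who wants the gist without fetching the script, here is a rough Ruby sketch of the approach described above.  It is NOT the actual getwd_AFS_vols script: the helper names afs_volume_of and check_getwd are hypothetical, and the parsing of 'fs listquota' output (a header line, then a line whose first field is the volume name) is an assumption.  The directory-to-volume lookup is passed in as a callable so the walk can be exercised without AFS.

```ruby
require 'find'
require 'open3'

# HYPOTHETICAL helper, not taken from getwd_AFS_vols.
# ASSUMPTION: `fs listquota DIR` (abbreviated `fs lq`) prints one header
# line followed by a line whose first field is the volume name.
def afs_volume_of(dir)
  out, status = Open3.capture2('fs', 'listquota', dir)
  return nil unless status.success?
  data_line = out.lines[1]
  data_line && data_line.split.first
rescue SystemCallError
  nil # e.g. the `fs` command is not installed
end

# Walk every directory under start_dir.  The first time a directory is
# found on a volume not seen before, chdir into it and call Dir.getwd
# (which calls getcwd(3)); record any chdir or getwd call that raises.
# volume_of is any callable mapping a directory path to a volume name,
# or to nil to skip that directory.
def check_getwd(start_dir, volume_of)
  first_dir = {} # volume name => first directory seen on it
  failures = []  # [volume, directory, error message]
  Find.find(start_dir) do |path|
    next unless File.directory?(path)
    vol = volume_of.call(path)
    next if vol.nil? || first_dir.key?(vol)
    first_dir[vol] = path
    begin
      # The block form restores the previous working directory afterwards.
      Dir.chdir(path) { Dir.getwd }
    rescue SystemCallError => e
      failures << [vol, path, e.message]
    end
  end
  [first_dir, failures]
end
```

On an AFS client this might be run as check_getwd('/afs/rpi.edu/campus/print', method(:afs_volume_of)), which returns a hash of volume names to the first directory seen on each, plus a list of the calls that failed.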

-- 
Garance Alistair Drosehn            =   gad@gilead.netel.rpi.edu
Senior Systems Programmer           or  gad@freebsd.org
Rensselaer Polytechnic Institute    or  drosih@rpi.edu