[OpenAFS-port-darwin] Mystery Problem with directories on Tiger

Sun, 18 Mar 2007 23:28:04 -0400

At 10:49 PM -0500 3/10/07, Garance A Drosihn wrote:
>At 1:32 PM -0500 3/10/07, Derrick J Brashear wrote:
>>1) what's in syslog.log?
>>2) does it happen with 1.4.2?
>
>Right now I seem to have an invocation of bash where the problem
>does not go away, which made it easier to investigate.  I found
>that what fails is the system routine getcwd().  The code:
>
>	errno = 0;
>	cwdres = getcwd(bigbuf, BIGBUFLEN);
>	if (cwdres == 0) {
>		perror("call to getcwd() failed");
>		return (1);
>	}
>	printf("cur dir = %s\n", cwdres);
>
>will print out:
>      call to getcwd() failed: No such file or directory

>I had installed the OpenAFS-1.5.15 about 3 weeks ago, and have not
>rebooted my machine in the past two weeks.  For my next step, I'll
>install 1.4.2, reboot, and see if the problem shows up again.  If it
>does, I can continue investigating with the simple getcwd() program
>and the more elaborate ruby script.

Okay, I installed 1.4.2, and the problem still shows up.

First I installed 1.4.2.  As mentioned in the documentation, I hit the
issue where the installer says "You cannot continue. There is nothing
to install." So, I removed the OpenAFS package receipt, and I think I
also ran the un-installer.  I was then able to install okay, but the
machine hung when I rebooted.  On a hunch, I booted into single-user
mode and blew away the entire /var/db/openafs/cache directory, and
then re-created the directory.  After doing that, I had no trouble
booting up.  The nice side-effect of that is I know I started with a
clean AFS cache.

At first I didn't seem to have any trouble, but then I didn't do much
work in AFS for a few days.  On Friday I was ready to do some serious
work, and the two AFS volumes that I wanted to get at were showing
this problem with getcwd().  This time I opened several different
Terminal windows, and they were all showing the problem on the same
set of AFS volumes.

When I run my ruby script, I generally run it on a set of directories
which include 43 different AFS volumes.  Fairly often it is the same
set of 6 AFS volumes which have the problem with getcwd().  But at one
point on Friday the problem went away for the two AFS volumes that I
wanted to work in, even though it stayed around for the other four
volumes which I don't care about and haven't touched.
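
A check of this kind can be sketched in C.  This is not the ruby script
mentioned above, which is not shown here; it is only a minimal
illustration, assuming the check amounts to "chdir() into each directory
under a root and see whether getcwd() succeeds".  The names scan_tree,
scan_dir, scan_result and PATH_BUF are made up for the example.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define PATH_BUF 4096   /* assumed upper bound on path length */

/* Hypothetical result record for this sketch. */
struct scan_result {
    long checked;   /* directories we managed to chdir() into */
    long failed;    /* of those, how many had getcwd() fail */
};

/* Check one directory, then recurse into its subdirectories. */
static void scan_dir(const char *path, int home_fd, struct scan_result *res)
{
    char buf[PATH_BUF], child[PATH_BUF];
    struct dirent *ent;
    struct stat sb;
    DIR *dir;
    int n;

    if (chdir(path) == 0) {
        res->checked++;
        if (getcwd(buf, sizeof buf) == NULL) {
            res->failed++;
            fprintf(stderr, "getcwd() failed in %s: %s\n",
                    path, strerror(errno));
        }
        /* Go back to the starting directory so that relative
         * paths keep resolving the same way. */
        if (fchdir(home_fd) != 0)
            return;
    }

    dir = opendir(path);
    if (dir == NULL)
        return;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        n = snprintf(child, sizeof child, "%s/%s", path, ent->d_name);
        if (n < 0 || (size_t)n >= sizeof child)
            continue;               /* path too long for this sketch */
        /* lstat() so that symlinks to directories are not followed. */
        if (lstat(child, &sb) == 0 && S_ISDIR(sb.st_mode))
            scan_dir(child, home_fd, res);
    }
    closedir(dir);
}

/* Walk the tree under root; returns 0 on success, -1 if the
 * starting directory could not be opened. */
static int scan_tree(const char *root, struct scan_result *res)
{
    int home_fd;

    assert(root != NULL && res != NULL);
    res->checked = res->failed = 0;
    home_fd = open(".", O_RDONLY);
    if (home_fd < 0)
        return -1;
    scan_dir(root, home_fd, res);
    close(home_fd);
    return 0;
}
```

Calling scan_tree() with a root path and a struct scan_result fills in
the two counters and prints one line to stderr per directory where
getcwd() failed.  The sketch keeps one directory handle open per level
of recursion, so a very deep tree could run out of file descriptors.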

I then thought it'd be interesting to run the script on a much larger
set of directories.  That run found a few more AFS volumes which were
having the getcwd() problem.  Unfortunately, my main Mac also locked up
while running the script on that larger portion of our AFS cell.  Most
running processes were okay, but it seemed that anything new that I
started up would hang.  I was able to close all my active apps (and save
away any documents I had open), but then the machine completely froze
up when I tried to log out of that userid.  I had to do a forced reboot.

I usually have a lot going on, and lose a lot of context if I have to
reboot.  So, I brought up my Intel-based Mac mini.  (My main desktop is
a dual-CPU G5.)  It is also running Mac OS X 10.4.9, and is still running
OpenAFS 1.5.15.  I'm running my checking-script on the same large section
of our AFS cell, and so far it hasn't hit the getcwd() problem on any AFS
volume.  The script does seem to run slower on the Mac mini, but it has
already searched through 175,000 directories without hitting a problem.
I'm not sure how many AFS volumes that is, but I'm pretty sure it has
already checked about 50,000 more directories than had been checked
during the run which locked-up my desktop.

The Mac mini is set up with the same set of afsd options as my desktop,
namely:
  -afsdb -stat 2000 -dcache 800 -daemons 3 -volumes 70 -dynroot -fakestat
The Mac mini is on a different network than my desktop, which might be
significant because the desktop is on a 10-Mbit network, while the
Mac mini is on 100-Mbit.

So, the problem still shows up, and it still doesn't make a whole lot of
sense.  It is only the getcwd() call which fails.  I found that I can
work around the problem with 'open' and 'opendiff' commands if I just
fully-specify the filenames I want to edit or compare.  I can do this by
using the PWD variable that bash is keeping track of.  So, if the command
     opendiff afile bfile
fails, then I can simply use
     opendiff $PWD/afile $PWD/bfile
and it will work fine.
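
The same idea can be applied inside a program: try getcwd() first, and
fall back to $PWD only when it fails.  The following is a minimal
sketch, not code from any library; cwd_or_pwd is a made-up name.  It
trusts $PWD only if it is an absolute path and stat() reports the same
device and inode for it as for ".".  Whether those stat() values are
dependable on an AFS directory in the failing state is an assumption
that has not been tested here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: store the current directory in buf, preferring
 * getcwd() and falling back to the PWD environment variable.
 * Returns buf on success, NULL if neither source could be used.
 */
static char *cwd_or_pwd(char *buf, size_t size)
{
    struct stat dot, env;
    const char *pwd;

    assert(buf != NULL && size > 0);

    if (getcwd(buf, size) != NULL)
        return buf;

    /* getcwd() failed.  Only accept $PWD if it is absolute, fits in
     * the buffer, and names the same directory as ".". */
    pwd = getenv("PWD");
    if (pwd == NULL || pwd[0] != '/' || strlen(pwd) >= size)
        return NULL;
    if (stat(".", &dot) != 0 || stat(pwd, &env) != 0)
        return NULL;
    if (dot.st_dev != env.st_dev || dot.st_ino != env.st_ino)
        return NULL;

    strcpy(buf, pwd);   /* length was checked against size above */
    return buf;
}
```

Note that the shell's PWD is the path as the user typed it, so the
fallback result may contain symlinks that getcwd() would have resolved.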

The AFS volumes where I've seen the getcwd() problem don't seem to have
a lot in common.  They are spread out across a few different AFS
fileservers, for instance.

Obviously the problem is very repeatable on my desktop, but I'd like to
have it repeatable on some machine where I won't lose so much work if I
need to reboot.  So, I'll keep trying various combinations of things, to
see if I can pin down the problem some more.  I think this is a problem
at the OS-level, and not at the AFS-locking level.  I can always get to
the files in question, and can do anything I want to do with them just
as long as I don't need to call getcwd().

-- 
Garance Alistair Drosehn            =   gad@gilead.netel.rpi.edu
Senior Systems Programmer           or  gad@freebsd.org
Rensselaer Polytechnic Institute    or  drosih@rpi.edu