[OpenAFS] Re: Linux client, AFS homes, getcwd() failures, apparently deleted home directories

Richard Brittain Richard.Brittain@dartmouth.edu
Tue, 25 Jun 2013 23:23:30 -0400 (EDT)


On Tue, 25 Jun 2013, Andrew Deason wrote:

> On Tue, 25 Jun 2013 16:02:00 -0400 (EDT)
> Richard Brittain <Richard.Brittain@dartmouth.edu> wrote:
>
>> we have a strange problem on a large RHEL6 system with AFS home
>> directories.  I'm not even sure if the problem is in the AFS cache
>> manager or the kernel.
>
> It's (almost certainly) us. Anders noted a similar thing in jabber a
> week or so ago, and it's almost certainly due to the games we have to
> play with linux dentries.
...
> I'm not sure if our flushing commands will clear the relevant things
> here; you can try 'echo 3 > /proc/sys/vm/drop_caches' which _might_ help
> clear it up.

I remembered 'drop_caches' right after posting and tried it - no
luck.  I also tried gradually reducing the cache size with fs setcache to
try to force some directory metadata to be reloaded, but that didn't work
either.

I tried making an alternate mount point and putting the new path to the 
home directory in /etc/passwd.  Curiously, this still fails, because the 
shell still ends up with CWD set to the old, bad location, even though it 
got there by a different path.
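
For anyone who wants to check for the same symptom, here is a minimal
sketch of a probe.  It assumes a Linux client with /proc mounted, and the
names probe_cwd and looks_deleted are made up for illustration; neither is
part of OpenAFS.  It changes into a directory, asks for getcwd(), and reads
the /proc/self/cwd link, which the kernel marks with a trailing " (deleted)"
when it considers the directory entry unlinked.

```python
import os

# Suffix the Linux kernel appends to /proc/<pid>/cwd for an unlinked dentry.
DELETED_SUFFIX = " (deleted)"


def looks_deleted(link_target):
    """Return True if a /proc/self/cwd target carries the deleted marker.

    A directory whose real name ends in " (deleted)" would be a false
    positive; this is only a heuristic.
    """
    return link_target.endswith(DELETED_SUFFIX)


def probe_cwd(path):
    """chdir into path and report how the kernel sees the working directory."""
    result = {
        "path": path,
        "getcwd": None,
        "getcwd_error": None,
        "proc_cwd": None,
        "looks_deleted": None,
    }
    saved = os.open(".", os.O_RDONLY)  # lets us return here afterwards
    try:
        os.chdir(path)
        try:
            result["getcwd"] = os.getcwd()
        except OSError as exc:
            # This is the failure described in this thread.
            result["getcwd_error"] = exc.strerror
        try:
            target = os.readlink("/proc/self/cwd")
            result["proc_cwd"] = target
            result["looks_deleted"] = looks_deleted(target)
        except OSError:
            pass  # no /proc here; leave both fields as None
    finally:
        os.fchdir(saved)
        os.close(saved)
    return result


if __name__ == "__main__":
    # Probe the invoking user's home directory by default.
    print(probe_cwd(os.path.expanduser("~")))
```

Running it as an affected user, once per candidate path to the same home
directory, would show whether every path ends up at the same "deleted" entry.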

> Well, the only way to get more useful information out of this is to
> generate a vmcore of the machine while you're experiencing the problem,
> or to run the 'crash' command and examine the various in-memory
> structures with some specific commands. If you want to do something like
> that, please say so, and I or someone else can come up with the
> necessary info.
>
> Or if you happen to find out a certain access or directory pattern that
> creates this situation, that would help. I would assume that it is
> possible to reach those directories via different paths / by traversing
> different mountpoints, which is what may be causing the confusion.

This isn't a complete show-stopper yet.  It would be nice if I could
reproduce it on a smaller test machine, but so far it only shows up on our
largest research server, where we try hard to be conservative about
changes.  If this gets worse I'll get back to you about 'crash'.

Thanks,
    Richard
-- 
Richard Brittain,  Research Computing Group,
                    Computing Services, 37 Dewey Field Road, HB6219
                    Dartmouth College, Hanover NH 03755
Richard.Brittain@dartmouth.edu 6-2085