[OpenAFS] getcwd() error for RHEL 7.4 kernel

Garance A Drosehn drosih@rpi.edu
Fri, 17 Nov 2017 17:35:25 -0500


On 18 Oct 2017, at 19:21, Benjamin Kaduk wrote:

> On Tue, Oct 17, 2017 at 11:55:27AM -0400, Jacob Bonek wrote:
>>
>> This is a major issue that has caused us to have to stay at the latest
>> pre-RHEL 7.4 kernel for a long time now while this issue has existed.
>> This may be related to previous issues with getcwd() but something in
>> the RHEL 7.4 kernel seems to have made it much worse.

>> Has anyone else experienced this issue with RHEL 7.4? Is there anything
>> that we can do to narrow down what is causing this?
>
> I think we've seen another report or two, but it's always been hard to
> reproduce.  That said, with the specifics you've offered about the
> kernel version that introduced the issue, we've got a couple folks
> trying to reproduce in a controlled environment.

I'm seeing this (a little), but haven't had time to look into it.  But
here are some thoughts/observations:

I have three RHEL systems, all currently running:

kernel.x86_64             3.10.0-693.el7

They're all running the exact same build of OpenAFS, because I built
it on a different system, created RPMs, and installed the exact same
RPMs on all three systems.

kmod-openafs.x86_64       1.6.21-1.3.10.0_693.el7
openafs.x86_64            1.6.21-1.el7
openafs-client.x86_64     1.6.21-1.el7
openafs-docs.x86_64       1.6.21-1.el7
openafs-krb5.x86_64       1.6.21-1.el7

These are three remote-access machines for RPI users, so the intent is
that they should be exactly the same.  I'm sure there are some minor
changes, but at least for the kernel and openafs modules they are
definitely the same.

On one of them, if I log in to my userid and 'sudo bash', I get a lot
of messages like:

shell-init: error retrieving current directory: \
             getcwd: cannot access parent directories: No such file or directory
job-working-directory: error retrieving current directory: \
             getcwd: cannot access parent directories: No such file or directory

I've only seen this if the active working directory is my home directory.
It won't happen if I 'cd' into some sub-directory under my home directory
before I do the 'sudo bash'.
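
One way to check the same condition outside of bash is a small probe that
changes into a directory and calls getcwd() there.  This is only a minimal
sketch, assuming Python 3 is available on the affected machine; the function
name probe_getcwd is hypothetical and not part of OpenAFS or of the original
report.  Running it once as the normal user and once under 'sudo', against
both the home directory and a sub-directory, would show whether the failure
follows the directory or the change of user.

```python
import errno
import os
import sys


def probe_getcwd(directory):
    """Change into `directory` and report whether getcwd() succeeds there.

    Returns (True, path) on success or (False, errno_name) on failure.
    """
    previous = None
    try:
        previous = os.getcwd()
    except OSError:
        pass  # the starting directory may itself be affected
    os.chdir(directory)
    try:
        return True, os.getcwd()
    except OSError as exc:
        # ENOENT corresponds to bash's "No such file or directory" message.
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        if previous is not None:
            os.chdir(previous)


if __name__ == "__main__":
    # With no arguments, probe the directory the script was started from.
    for target in sys.argv[1:] or ["."]:
        print(target, probe_getcwd(target))
```

Passing the directories as arguments avoids relying on "~", which may expand
to a different home directory under 'sudo'.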

This seems to always happen on one of the three machines.  It never happens
on a second machine, and it *sometimes* happens on the third machine.  By
"sometimes", I mean that some days it never happens, but other days it seems
to happen all the time.  I have not seen the problem right at login, but only
if I do a 'sudo bash' while in my home directory at any time after I logged in.

Once I have done the 'sudo bash', I can then 'cd' into the home directory
of my original userid and there are no error messages.

These machines are used by maybe 100 different people.  I have not heard of
anyone who has seen these error messages when they log in, but we do have
some users who never report errors as long as they can get their work done.
And of course, I'm the only one who would be doing 'sudo' commands on these
machines.

I wonder if it has to do with the home directory being an AFS mount point
(as opposed to a standard directory somewhere inside an AFS volume), but I
have not had the time to do any tests of that idea.
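
A test of the mount-point idea could start by asking the cache manager
whether a given directory is a mount point, using the OpenAFS 'fs lsmount'
command.  The sketch below wraps that command; the function names are
hypothetical, and the two output formats it looks for are an assumption
that should be checked against the local 'fs' binary before trusting the
result.

```python
import subprocess


def is_afs_mount_point(lsmount_output):
    """Interpret the text printed by `fs lsmount <dir>`.

    Assumed formats (verify locally, they are not taken from the report):
      '<dir>' is a mount point for volume '#<volume>'
      '<dir>' is not a mount point.
    """
    text = lsmount_output.strip()
    if "is not a mount point" in text:
        return False
    return "is a mount point for volume" in text


def check_directory(directory):
    """Run `fs lsmount` on `directory` and classify its output."""
    result = subprocess.run(
        ["fs", "lsmount", directory],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # the "not a mount point" text may go to stderr
        universal_newlines=True,   # works on the Python 3.6 shipped for RHEL 7
    )
    return is_afs_mount_point(result.stdout)
```

Calling check_directory() on the home directory and on a sub-directory
beneath it would show whether the two cases from the description above
really differ in this respect.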

The fact that I don't see the same behavior on all three machines makes me
wonder if it has to do with how much the other users have been doing.  Maybe
they've used up more of the local AFS cache on some machines than others.
I haven't had the chance to reboot any of these machines for a few months
now, but I hope to do that over the long Thanksgiving weekend.  Given the
errors seen at some other sites, I probably won't upgrade the kernel or
version of OpenAFS until the semester break.
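
The cache-usage idea could be compared across the three machines with the
OpenAFS 'fs getcacheparms' command.  The following is a rough sketch only:
the function names are hypothetical, and the output format matched by the
regular expression is an assumption about how 'fs getcacheparms' reports
usage, so the parser returns None if the local output looks different.

```python
import re
import subprocess

# Assumed output format; adjust if the local `fs` prints something else.
_CACHE_RE = re.compile(
    r"AFS using (\d+) of the cache's available (\d+) 1K byte blocks"
)


def parse_cache_usage(getcacheparms_output):
    """Return (used_blocks, total_blocks, fraction_used), or None if unparsed."""
    match = _CACHE_RE.search(getcacheparms_output)
    if match is None:
        return None
    used, total = int(match.group(1)), int(match.group(2))
    fraction = used / total if total else 0.0
    return used, total, fraction


def current_cache_usage():
    """Run `fs getcacheparms` on this machine and parse what it prints."""
    result = subprocess.run(
        ["fs", "getcacheparms"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_cache_usage(result.stdout)
```

Recording the result on each machine at the moment the getcwd errors do or
do not appear would give some evidence for or against the cache theory.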

-- 
Garance Alistair Drosehn                =     drosih@rpi.edu
Senior Systems Programmer               or   gad@FreeBSD.org
Rensselaer Polytechnic Institute;             Troy, NY;  USA