[OpenAFS] getcwd() error for RHEL 7.4 kernel

Stephan Wiesand stephan.wiesand@desy.de
Thu, 16 Nov 2017 18:26:22 +0100

On Nov 16, 2017, at 07:06, Benjamin Kaduk wrote:

> On Wed, Nov 15, 2017 at 01:02:15PM -0500, Matt Vander Werf wrote:
>> Hello,
>> Are there any updates or progress on a potential fix for this issue?
>> Anything we can do to help figure things out?
> This topic was on the agenda for our release-team meeting yesterday.

Well, it has been for the last couple of weeks.

> If I remember correctly, multiple developers have gotten fairly
> reliable ways to reproduce the issue locally.
> It also seems that as a workaround, reverting
> https://gerrit.openafs.org/#/c/12451/ is likely to reduce the
> likelihood of triggering events.

Yes, but there's at least one known client configuration (small stat
cache, -disable-dynamic-vcaches) for which reverting that change
actually makes things worse.

I ran a number of tests again today. I was unable to trigger any issue
with the EL7.3 kernel, either with OpenAFS 1.6.20 or with the change in
question reverted. I was able to trigger the getcwd issue or the other
one (a git clone into AFS space failing right away) with the EL7.4
kernel, depending on circumstances. We do have a problem with the 7.4
kernel, with or without that change.

The following could be complete nonsense, so please correct me: the best
bet for sites getting desperate is probably to increase the minimum stat
cache size beyond typical actual use on the client.

- Stephan

>> We are running into more and more users encountering the issue on
>> we have updated, forcing us to have to downgrade the kernel on them yet as
>> well (including the system we were able to reproduce it on and test
>> before). Is there any other information we might provide before we do
> Given the assumption that developers are reproducing the same
> situation that you are, hopefully there is not a need for additional
> information from the production sites.
> Thanks,
> Ben