[OpenAFS] RHEL 7.5 beta / 3.10.0-830.el7.x86_64 kernel lock up

Stephan Wiesand stephan.wiesand@desy.de
Fri, 2 Mar 2018 10:14:48 +0100


Hello,

> On 2. Mar 2018, at 09:47, Anders Nordin <anders.j.nordin@ltu.se> wrote:
>
> Hello,
>
> Is there any progress on this issue?

Incidentally, Mark uploaded https://gerrit.openafs.org/12935 a couple of
hours ago. It's probably not final since it seems to cause build
failures on some older platforms. But it's certainly worth a try on
EL7.5 beta systems. It would also be interesting to know on which other
platforms it fails to build (or work).

> Can we expect a stable release for RHEL 7.5?

Once we have a change confirmed to fix the EL7.5 issue and not break
other platforms, yes. Whether it will be available quite in time for 7.5
GA is hard to say. You can help...
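
For anyone testing a candidate build, here is a minimal sketch of a check
for the directory-listing symptom quoted further down in this thread (`ls`
on /afs failing with "Not a directory"). It is not part of OpenAFS; the
function name `classify_listing` is made up for illustration, and it
assumes the failure reaches Python as ENOTDIR the same way it reaches ls.

```python
import errno
import os
import sys


def classify_listing(path):
    """Try to list `path` and report how the attempt ended.

    Returns "ok" if the directory could be read, "enotdir" if the call
    failed with ENOTDIR (the symptom reported for /afs on the 7.5 beta),
    or "other:<errno name>" for any other failure.
    """
    try:
        os.listdir(path)
    except OSError as exc:
        if exc.errno == errno.ENOTDIR:
            return "enotdir"
        return "other:" + errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


if __name__ == "__main__":
    # /afs is the mount point named in the report; pass another path to override.
    target = sys.argv[1] if len(sys.argv) > 1 else "/afs"
    print(target, classify_listing(target))
```

Running it against /afs and against a subdirectory you can already reach
by full path would show whether only listing is affected, as described in
the report.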

Best regards,

	Stephan


> Best regards,
> Anders
>
> -----Original Message-----
> From: openafs-info-admin@openafs.org [mailto:openafs-info-admin@openafs.org] On Behalf Of Benjamin Kaduk
> Sent: 9 February 2018 01:02
> To: Kodiak Firesmith <kfiresmith@gmail.com>
> Cc: openafs-info <openafs-info@openafs.org>
> Subject: Re: [OpenAFS] RHEL 7.5 beta / 3.10.0-830.el7.x86_64 kernel lock up
>
> On Wed, Feb 07, 2018 at 11:46:28AM -0500, Kodiak Firesmith wrote:
>> Hello again All,
>>
>> As part of continued testing, I've been able to confirm that the
>> SystemD double-service startup thing only happens to my hosts when
>> going from RHEL 7.4 to RHEL 7.5beta.  On a test host installed
>> directly as RHEL
>> 7.5beta, I get a bit farther with 1.6.22.2, in that I get to the
>> point where OpenAFS "kind of" works.
>
> Thanks for tracking this down.  The rpm packaging maintainers may want
> to try to track down why the double-start happens in the upgrade
> scenario, as that's pretty nasty behavior.
>
>> What I'm observing is that the openafs client Kernel module (built by
>> DKMS) loads fine, and just so long as you know where you need to go
>> in /afs, you can get there, and you can read and write files and the
>> OpenAFS 'fs' command works.  But doing an 'ls' of /afs or any path
>> underneath results in
>> "ls: reading directory /afs/: Not a directory".
>>
>> I ran an strace of a good RHEL 7.4 host running ls on /afs, and a
>> RHEL 7.5beta host running ls on /afs and have created pastebins of
>> both, as well as an inline diff.
>>
>> All can be seen at the following locations:
>>
>> works
>> https://paste.fedoraproject.org/paste/Hiojt2~Be3wgez47bKNucQ
>>
>> fails
>> https://paste.fedoraproject.org/paste/13ZXBfJIOMsuEJFwFShBfg
>>
>> diff
>> https://paste.fedoraproject.org/paste/FJKRwep1fWJogIDbLnkn8A
>>
>> Hopefully this might help the OpenAFS devs, or someone might know
>> what might be borking on every RHEL 7.5 beta host.  It does fit with
>> what other 7.5 beta users have observed OpenAFS doing.
>
> Yes, now it seems like all our reports are consistent, and we just
> have to wait for a developer to get a better look at what Red Hat
> changed in the kernel that we need to adapt to.
>
> -Ben
>
>> Thanks!
>> - Kodiak
>>
>> On Mon, Feb 5, 2018 at 12:31 PM, Stephan Wiesand
>> <stephan.wiesand@desy.de> wrote:
>>
>>>
>>>> On 04.Feb 2018, at 02:11, Jeffrey Altman <jaltman@auristor.com> wrote:
>>>>
>>>> On 2/2/2018 6:04 PM, Kodiak Firesmith wrote:
>>>>> I'm relatively new to handling OpenAFS.  Are these problems part
>>>>> of a normal "kernel release; openafs update" cycle and perhaps
>>>>> I'm getting snagged just by being too early of an adopter?  I
>>>>> wanted to raise the alarm on this and see if anything else was
>>>>> needed from me as the reporter of the issue, but perhaps that's
>>>>> an overreaction to what is just part of a normal process I just
>>>>> haven't been tuned into in prior RHEL release cycles?
>>>>
>>>> Kodiak,
>>>>
>>>> On RHEL, DKMS is safe to use for kernel modules that restrict
>>>> themselves to using the restricted set of kernel interfaces (the
>>>> RHEL KABI) that Red Hat has designated will be supported across
>>>> the lifespan of the RHEL major version number.  OpenAFS is not
>>>> such a kernel module.  As a result it is vulnerable to breakage
>>>> each and every time a new kernel is shipped.
>>>
>>> Jeffrey,
>>>
>>> the usual way to use DKMS is to either have it build a module for a
>>> newly installed kernel or install a prebuilt module for that kernel.
>>> It may be possible to abuse it for providing a module built for
>>> another kernel, but I think that won't happen accidentally.
>>>
>>> You may be confusing DKMS with RHEL's "KABI tracking kmods". Those
>>> should be safe to use within a RHEL minor release (and the SL
>>> packaging has been using them like this since EL6.4), but aren't
>>> across minor releases (and that's why the SL packaging modifies the
>>> kmod handling to require a build for the minor release in question).
>>>
>>>> There are two types of failures that can occur:
>>>>
>>>> 1. a change results in failure to build the OpenAFS kernel module
>>>>   for the new kernel
>>>>
>>>> 2. a change results in the OpenAFS kernel module building and
>>>>   successfully loading but failing to operate correctly
>>>
>>> The latter shouldn't happen within a minor release, but can across
>>> minor releases.
>>>
>>>> It is the second of these possibilities that has taken place with
>>>> the release of the 3.10.0-830.el7 kernel shipped as part of the
>>>> RHEL 7.5 beta.
>>>>
>>>> Are you an early adopter of RHEL 7.5 beta?  Absolutely, it's a beta
>>>> release and as such you should expect that there will be bugs and
>>>> that third party kernel modules that do not adhere to the KABI
>>>> functionality might have compatibility issues.
>>>
>>> The -830 kernel can break 3rd-party modules using non-whitelisted
>>> ABIs, whether or not they adhere to the "KABI functionality".
>>>
>>>> There was a compatibility issue with RHEL 7.4 kernel
>>>> (3.10.0-693.1.1.el7) as well that was only fixed in the OpenAFS
>>>> 1.6 release series this past week as part of 1.6.22.2:
>>>>
>>>> http://www.openafs.org/dl/openafs/1.6.22.2/RELNOTES-1.6.22.2
>>>
>>> Yes, and this one was hard to fix. Thanks are due to Mark Vitale for
>>> developing the fix and all those who reviewed and tested it.
>>>
>>>> Jeffrey Altman
>>>> AuriStor, Inc.
>>>>
>>>> P.S. - Welcome to the community.
>>>
>>> Seconded. In particular, the problem report regarding the EL7.5beta
>>> kernel was absolutely appropriate.

-- 
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany