[OpenAFS] RHEL 7.5 beta / 3.10.0-830.el7.x86_64 kernel lock up

Benjamin Kaduk kaduk@mit.edu
Thu, 8 Feb 2018 18:01:56 -0600


On Wed, Feb 07, 2018 at 11:46:28AM -0500, Kodiak Firesmith wrote:
> Hello again All,
> 
> As part of continued testing, I've been able to confirm that the SystemD
> double-service startup thing only happens to my hosts when going from RHEL
> 7.4 to RHEL 7.5beta.  On a test host installed directly as RHEL 7.5beta, I
> get a bit farther with 1.6.18.22, in that I get to the point where OpenAFS
> "kind of" works.

Thanks for tracking this down.  The rpm packaging maintainers may
want to try to track down why the double-start happens in the
upgrade scenario, as that's pretty nasty behavior.

> What I'm observing is that the openafs client Kernel module (built by DKMS)
> loads fine, and just so long as you know where you need to go in /afs, you
> can get there, and you can read and write files and the OpenAFS 'fs'
> command works.  But doing an 'ls' of /afs or any path underneath results in
> "ls: reading directory /afs/: Not a directory".
> 
> I ran an strace of a good RHEL 7.4 host running ls on /afs, and a RHEL
> 7.5beta host running ls on /afs and have created pastebins of both, as well
> as an inline diff.
> 
> All can be seen at the following locations:
> 
> works
> https://paste.fedoraproject.org/paste/Hiojt2~Be3wgez47bKNucQ
> 
> fails
> https://paste.fedoraproject.org/paste/13ZXBfJIOMsuEJFwFShBfg
> 
> 
> diff
> https://paste.fedoraproject.org/paste/FJKRwep1fWJogIDbLnkn8A
> 
> Hopefully this might help the OpenAFS devs, or someone might know what
> might be borking on every RHEL 7.5 beta host.  It does fit with what other
> 7.5 beta users have observed OpenAFS doing.

Yes, now it seems like all our reports are consistent, and we just
have to wait for a developer to get a better look at what Red Hat
changed in the kernel that we need to adapt to.
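
For anyone who wants to check whether a host shows the same symptom
without reading strace output, here is a minimal sketch in Python.  It
assumes the failure surfaces as ENOTDIR from the directory read (which is
what the "Not a directory" message from ls suggests) while a plain stat
of the same path still succeeds.  The helper name probe_directory is
made up for illustration and is not part of OpenAFS.

```python
import errno
import os


def probe_directory(path):
    """Return (stat_ok, listing_error) for path.

    stat_ok is True when a stat of the path succeeds and reports a
    directory.  listing_error is None when the directory can be read,
    otherwise the symbolic errno name (for example "ENOTDIR").
    """
    # os.path.isdir() does a stat() and returns False on any error.
    stat_ok = os.path.isdir(path)
    try:
        # os.listdir() opens the directory and reads its entries,
        # which is the step that 'ls' reports as failing.
        os.listdir(path)
        listing_error = None
    except OSError as exc:
        listing_error = errno.errorcode.get(exc.errno, str(exc.errno))
    return stat_ok, listing_error


if __name__ == "__main__":
    # "/afs" is the mount point mentioned in the report.
    stat_ok, listing_error = probe_directory("/afs")
    print("stat says directory:", stat_ok)
    print("listing error:", listing_error)
```

On a host with the problem described above, one would expect the stat
check to report a directory while the listing error is ENOTDIR.  A
healthy client should report a directory and no listing error.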

-Ben

> Thanks!
>  - Kodiak
> 
> On Mon, Feb 5, 2018 at 12:31 PM, Stephan Wiesand <stephan.wiesand@desy.de>
> wrote:
> 
> >
> > > On 04.Feb 2018, at 02:11, Jeffrey Altman <jaltman@auristor.com> wrote:
> > >
> > > On 2/2/2018 6:04 PM, Kodiak Firesmith wrote:
> > >> I'm relatively new to handling OpenAFS.  Are these problems part of a
> > >> normal "kernel release; openafs update" cycle and perhaps I'm getting
> > >> snagged just by being too early of an adopter?  I wanted to raise the
> > >> alarm on this and see if anything else was needed from me as the
> > >> reporter of the issue, but perhaps that's an overreaction to what is
> > >> just part of a normal process I just haven't been tuned into in prior
> > >> RHEL release cycles?
> > >
> > >
> > > Kodiak,
> > >
> > > On RHEL, DKMS is safe to use for kernel modules that restrict themselves
> > > to using the restricted set of kernel interfaces (the RHEL KABI) that
> > > Red Hat has designated will be supported across the lifespan of the RHEL
> > > major version number.  OpenAFS is not such a kernel module.  As a result
> > > it is vulnerable to breakage each and every time a new kernel is shipped.
> >
> > Jeffrey,
> >
> > the usual way to use DKMS is to either have it build a module for a newly
> > installed kernel or install a prebuilt module for that kernel. It may be
> > possible to abuse it for providing a module built for another kernel, but
> > I think that won't happen accidentally.
> >
> > You may be confusing DKMS with RHEL's "KABI tracking kmods". Those should
> > be safe to use within a RHEL minor release (and the SL packaging has been
> > using them like this since EL6.4), but aren't across minor releases (and
> > that's why the SL packaging modifies the kmod handling to require a build
> > for the minor release in question).
> >
> > > There are two types of failures that can occur:
> > >
> > > 1. a change results in failure to build the OpenAFS kernel module
> > >    for the new kernel
> > >
> > > 2. a change results in the OpenAFS kernel module building and
> > >    successfully loading but failing to operate correctly
> >
> > The latter shouldn't happen within a minor release, but can across
> > minor releases.
> >
> > > It is the second of these possibilities that has taken place with the
> > > release of the 3.10.0-830.el7 kernel shipped as part of the RHEL 7.5
> > > > beta.
> > >
> > > Are you an early adopter of RHEL 7.5 beta?  Absolutely, it's a beta
> > > release and as such you should expect that there will be bugs and that
> > > third party kernel modules that do not adhere to the KABI functionality
> > > might have compatibility issues.
> >
> > The -830 kernel can break 3rd-party modules using non-whitelisted ABIs,
> > whether or not they adhere to the "KABI functionality".
> >
> > > There was a compatibility issue with RHEL 7.4 kernel
> > > (3.10.0-693.1.1.el7) as well that was only fixed in the OpenAFS 1.6
> > > release series this past week as part of 1.6.22.2:
> > >
> > >  http://www.openafs.org/dl/openafs/1.6.22.2/RELNOTES-1.6.22.2
> >
> > Yes, and this one was hard to fix. Thanks are due to Mark Vitale for
> > developing the fix and all those who reviewed and tested it.
> >
> > > Jeffrey Altman
> > > AuriStor, Inc.
> > >
> > > P.S. - Welcome to the community.
> >
> > Seconded. In particular, the problem report regarding the EL7.5beta
> > kernel was absolutely appropriate.
> >
> > --
> > Stephan Wiesand
> > DESY - DV -
> > Platanenallee 6
> > 15738 Zeuthen, Germany
> >
> >
> >