[OpenAFS-devel] Linux kernel build warnings

Andrew Deason adeason@sinenomine.net
Mon, 13 Jan 2014 14:02:16 -0600


Russ reported an issue recently, and I think some of the details that
have emerged warrant more discussion (or at least visibility).

The issue was that the openafs linux kernel module was BUG'ing on every
access into AFS with:

[86886.421520] BUG: unable to handle kernel NULL pointer dereference at 0000000000000049
[86886.421528] IP: [<ffffffffa0de959f>] afs_linux_dentry_revalidate+0xf/0x430 [openafs]

This was on kernel 3.12, on Debian with a 32-bit x86 userspace and an
amd64 kernel. The cause of these BUGs was that the
DOP_REVALIDATE_TAKES_UNSIGNED configure test failed and the
DOP_REVALIDATE_TAKES_NAMEIDATA test succeeded, when it is supposed to
be the other way around on kernel 3.12. So we get an unsigned int for
flags, we interpret it as a 'struct nameidata *', and things explode,
which should not be surprising.
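
A minimal userspace sketch of why that mismatch shows up as a "NULL
pointer dereference" at a small address. struct fake_nameidata and both
function names are hypothetical stand-ins, and the member layout is an
assumption, not the real kernel layout. The code only does integer
arithmetic and never dereferences anything.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct nameidata. The real
 * structure has a different layout; only the idea matters here. */
struct fake_nameidata {
    void *first_member;
    void *second_member;
    unsigned int flags;
};

/* Address that a read of nd->flags would touch if the integer
 * 'flags_value' were misinterpreted as a 'struct fake_nameidata *'.
 * Integer arithmetic only; nothing is dereferenced. */
uintptr_t misread_address(unsigned int flags_value)
{
    return (uintptr_t)flags_value + offsetof(struct fake_nameidata, flags);
}

/* Nonzero if 'addr' lies in the first page, assuming 4096-byte pages. */
int in_first_page(uintptr_t addr)
{
    return addr < 4096;
}
```

Lookup flags are small integers, so under these assumptions the
misread address falls in the first page of memory. That is consistent
with the kernel describing the fault as a NULL pointer dereference at an
address like 0x49.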

The strange thing, though, is that the DOP_REVALIDATE_TAKES_UNSIGNED
test correctly succeeds if you have an amd64 userspace with the amd64
kernel, so normally there's no problem (DOP_REVALIDATE_TAKES_NAMEIDATA
still incorrectly succeeds for other reasons, but it's not noticed due
to how these symbols are used).  The difference is that (at least on my
test systems while reproducing this) with an x86 userspace, the kernel
build process does not pass -Wno-unused-but-set-variable during the
configure tests. Our configure test for DOP_REVALIDATE_TAKES_UNSIGNED
triggers this warning, so it fails; normally we don't notice, because
that warning is turned off.

Going one level deeper, the reason that happens with an x86 userspace
and not the amd64 userspace is that the Linux kernel build process
probes for acceptable -Wno-* options to gcc when the build runs. Probing
for -Wno-unused-but-set-variable fails because the build runs this to
see if the warning option exists:

gcc -D__KERNEL__ -Wall -Wundef [...] -Wno-unused-but-set-variable -c -x c /dev/null -o /path/whatever.o

On the amd64-userspace machine, this succeeds. On the x86-userspace
machine, this fails with:

$ gcc [...]
In file included from <command-line>:0:0:
/usr/include/stdc-predef.h:30:26: fatal error: bits/predefs.h: No such file or directory
 #include <bits/predefs.h>
                          ^
compilation terminated.
$

And this happens because the x86-userspace machine doesn't have the
amd64 libc headers. I can't even install them in Debian
(libc6-dev:amd64) without removing a bunch of x86 devel packages. But if
I run that gcc command with -nostdinc (as most gcc invocations are done
for the linux kernel), then it succeeds.


So, that's the explanation for the behavior. To fix this, there are a
few things that should be done:

First, it looks to me like the Linux kernel should be passing -nostdinc
to that gcc option probe. If someone more familiar with the Linux
kernel build process would like to take that to them, I would appreciate
that. (Or tell me I'm wrong and that's not broken.) The relevant code is
cc-disable-warning in Kbuild.include, but I'm not sure which flags
variable -nostdinc should go in, or how it should be involved there.

Next, our autoconf tests should be more robust: they should not generate
warnings, so that we don't depend on the Linux kernel build system
turning off certain warnings. I'm submitting a few changes for that, but
there may be more instances where this could be a problem.
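
As an illustration of the kind of test body at issue, here is a
userspace sketch. It is not the actual OpenAFS configure test. The
struct definitions are stand-ins for the kernel headers, and
SHOW_WARNING_PRONE_VARIANT, reval, probe_local, probe_global and
probe_ops are hypothetical names. The first variant only writes to a
local object, which gcc with -Wall typically reports under
-Wunused-but-set-variable. The second variant writes to a file-scope
object, which that warning does not cover.

```c
/* Stand-ins so this compiles in userspace; a real test would include
 * the kernel headers instead. */
struct dentry;
struct dentry_operations {
    int (*d_revalidate)(struct dentry *, unsigned int);
};

static int reval(struct dentry *d, unsigned int flags)
{
    (void)d;
    (void)flags;
    return 0;
}

#ifdef SHOW_WARNING_PRONE_VARIANT
/* Warning-prone shape: 'ops' is assigned to but never read. If the
 * build treats warnings as failures, this test fails for a reason that
 * has nothing to do with the signature being probed. */
int probe_local(void)
{
    struct dentry_operations ops;
    ops.d_revalidate = reval;
    return 0;
}
#endif

/* More robust shape: the object lives at file scope, so the assignment
 * still type-checks the signature without leaving a set-but-unused
 * local variable behind. */
struct dentry_operations probe_ops;

int probe_global(void)
{
    probe_ops.d_revalidate = reval;
    return 0;
}
```

Compiling with -Wall -DSHOW_WARNING_PRONE_VARIANT should show the
difference between the two shapes on a given compiler.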

Next, it would be really nice if we could enforce certain warnings
during our kernel module build. The problem that Russ reported would
have been only a build-time failure if we had -Werror'd some warnings,
since when that configure test fails the way it does, we get some
pretty serious warnings, such as our d_revalidate function pointer
pointing to the wrong function type. I'm not clear on how much control
we have over which warning flags get to the compiler during this, since
we go through the Linux kernel build system to build these.

And finally, it would be nice to have more control over the warnings
behavior during our autoconf tests, so we can be a little more sure of
what will happen. I don't immediately know if there's a way (or if it's
feasible) to force which flags we use. Ideally we would not be relying
on warnings behavior, but a lot of the stuff we want to test for is not
possible to detect via non-warning compiler errors.
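
For the narrower case of checking a function-pointer member's type, one
possible approach that does not depend on warning flags is sketched
below. This technique is not proposed in this message. It assumes that
GCC extensions (__typeof__, __builtin_types_compatible_p) and
_Static_assert are acceptable in a configure test. The struct
definitions are userspace stand-ins, and MEMBER_HAS_TYPE and the two
function names are hypothetical.

```c
/* Stand-ins for the kernel declarations so this compiles in userspace;
 * a real test would include the kernel headers instead. */
struct dentry;
struct nameidata;
struct dentry_operations {
    int (*d_revalidate)(struct dentry *, unsigned int);
};

/* 1 if 'member' of 'type' has a type compatible with 'expected', else 0.
 * Relies on GCC extensions; the operand of __typeof__ is not evaluated. */
#define MEMBER_HAS_TYPE(type, member, expected) \
    __builtin_types_compatible_p(__typeof__(((type *)0)->member), expected)

/* A mismatch here is a hard compile error rather than a warning. */
_Static_assert(MEMBER_HAS_TYPE(struct dentry_operations, d_revalidate,
                               int (*)(struct dentry *, unsigned int)),
               "d_revalidate does not take an unsigned int");

/* The same checks exposed as functions, for inspection at run time. */
int revalidate_takes_unsigned(void)
{
    return MEMBER_HAS_TYPE(struct dentry_operations, d_revalidate,
                           int (*)(struct dentry *, unsigned int));
}

int revalidate_takes_nameidata(void)
{
    return MEMBER_HAS_TYPE(struct dentry_operations, d_revalidate,
                           int (*)(struct dentry *, struct nameidata *));
}
```

With the stand-in declaration above, the first function returns 1 and
the second returns 0. Against real kernel headers, the result would
follow whatever signature that kernel declares.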

-- 
Andrew Deason
adeason@sinenomine.net