[OpenAFS] Heimdal on UNIX errno issue

Andrew Deason adeason@sinenomine.net
Sun, 4 May 2014 00:39:28 -0500


The following email is about an issue in the Heimdal libraries, and how
it affects OpenAFS fileservers. If you know that you don't use Heimdal
anywhere at your site, or if you only use Heimdal on Linux, then you are
not affected by this and you can probably skip this. If you use Heimdal
on Solaris or AIX (or any such "commercial Unix"), you probably do want
to read this.


Anyway, we've recently become aware that Heimdal on certain platforms
does not report errors correctly in threaded environments. This means
that, under certain conditions, certain functions in Heimdal's libraries
will return 0 (success) even when they have encountered an error and
bailed out. I believe this happens with all currently existing Heimdal
versions. At the time of this writing, the newest stable release of
Heimdal is 1.5.3, and the head of the 1.6 branch is
005f69c0cbe3538cbdb2f8808114b48995e0ca32.

I can confirm that the relevant issue does _not_ appear on Linux or
FreeBSD, but it _does_ appear on Solaris and AIX. I haven't tested every
version and variation of those platforms of course, but I don't expect
it to vary much between versions.

This issue is not OpenAFS-specific, and is an issue with Heimdal itself.
However, I believe this is of particular interest to OpenAFS, because
OpenAFS has not made libkrb5 calls from within threads until recently;
we started doing that in OpenAFS 1.6.5. So, this issue does not occur
with OpenAFS versions before 1.6.5, but it can happen with 1.6.5 and
later.

The most obvious way this issue manifests is that the fileserver can
crash very quickly if you are not using rxkad.keytab. This is the issue
reported in
<https://rt.central.org/rt/Ticket/Display.html?id=131852&user=guest&pass=guest>,
and is easily worked around by just creating an rxkad.keytab file.

However, if you are using rxkad.keytab with a problematic Heimdal
library, you won't see such a crash, but there still may be other
issues. Since the underlying problem is that errors are not reported
properly, there may be other issues in other areas of code that are less
obvious. Although I don't know of any particular issues, running the
fileserver in such an environment would make me personally very nervous;
I would consider such an environment to effectively be "undefined
behavior" at this point.

So, if you are on a problematic platform, I would personally recommend
avoiding running OpenAFS >= 1.6.5 servers with Heimdal right now. Either
downgrade OpenAFS, link to a different libkrb5 library, or don't link to
a libkrb5 library at all (not linking to libkrb5 removes rxkad-k5
support, so you can only use DES). And of course, if this issue concerns
you, contact your support vendor and/or go talk to Heimdal.

It should be noted that running Heimdal on those platforms is probably
very uncommon, so we (OpenAFS) aren't making a big fuss about it. (At
least Heimdal 1.5.3 and 1.5.2 don't build on those platforms without
modification.) But I wanted to at least send something to give anyone a
chance to notice this, if they are for some reason running such an
environment.

As for what OpenAFS will do about this, the current plan is that we'll
include a workaround in OpenAFS to avoid the crash when not using
rxkad.keytab, but nothing more. We could possibly detect the issue when
using rxkad.keytab as well, and turn off rxkad-k5 when we detect it, but
right now it doesn't seem "worth it" to do that (and to me personally,
doing that feels pretty ridiculous).


Some brief technical details:

Heimdal is not built with -mt/-pthread in CFLAGS, so anything that uses
'errno' doesn't work properly on Solaris/AIX. This is because _REENTRANT
changes the definition of 'errno' from a regular global int to a
function call to give a thread-specific storage location; so on
Solaris/AIX you get only the "main thread" errno when you reference
errno without -D_REENTRANT. On Linux/FreeBSD/others, errno is always
defined as using a function call, so it still works well enough even
without -pthread.
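
As a rough illustration of the errno behavior described above, here is
a minimal sketch that checks whether 'errno' refers to a different
storage location in a second thread than it does in the main thread.
The names errno_probe, probe_worker and errno_is_per_thread are made up
for this example and are not part of Heimdal or OpenAFS. On Linux I
would expect it to return 1; on an affected platform, compiled without
-D_REENTRANT, it would presumably return 0, going by the description
above.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper type: carries the main thread's errno address
 * into the worker and the comparison result back out. */
struct errno_probe {
    int *main_errno;   /* address of errno as seen by the main thread */
    int distinct;      /* set to 1 if the worker sees another address */
};

/* Runs in the second thread: compares its own &errno with the one
 * recorded by the main thread. */
static void *probe_worker(void *arg)
{
    struct errno_probe *p = arg;
    p->distinct = (&errno != p->main_errno);
    return NULL;
}

/* Returns 1 if errno is per-thread, 0 if both threads share a single
 * location, and -1 if the thread could not be created or joined. */
int errno_is_per_thread(void)
{
    struct errno_probe p = { &errno, 0 };
    pthread_t tid;

    if (pthread_create(&tid, NULL, probe_worker, &p) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return p.distinct;
}
```

Calling errno_is_per_thread() from main() and printing the result is
enough to try this out. The interesting comparison is building the same
file with and without the threading flags on the platform in question.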

The crash occurs because accessing rxkad.keytab fails with ENOENT, but
Heimdal errors out with the error code '0', leaving some pointers in a
structure set to NULL or something similarly weird/wrong. We see the
function return success, so we think everything is fine, and calling
subsequent libkrb5 functions on that structure segfaults.
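
To make that failure mode concrete, here is a small sketch of a caller
defending against a "success" return that leaves the output pointer
NULL. Everything in it is hypothetical: fake_keytab, fake_open and
checked_open are stand-ins invented for this example, not libkrb5
functions, and this is not necessarily what the OpenAFS workaround will
look like.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a library handle. */
struct fake_keytab {
    const char *name;
};

/* Hypothetical stand-in for a library call that can fail internally
 * but still report 0, because the error code it read was a stale 0.
 * When 'fail' is nonzero it leaves *out NULL and returns 0 anyway. */
static int fake_open(int fail, struct fake_keytab **out)
{
    static struct fake_keytab kt = { "example" };

    *out = NULL;
    if (fail)
        return 0;   /* the bug: an error path that reports success */
    *out = &kt;
    return 0;
}

/* Wrapper that refuses to trust a 0 return on its own: if the call
 * claims success but produced no handle, report an error instead. */
int checked_open(int fail, struct fake_keytab **out)
{
    int code = fake_open(fail, out);

    if (code == 0 && *out == NULL)
        code = EINVAL;   /* arbitrary nonzero code for the caller */
    return code;
}
```

A check like this only covers the one case where the damage is visible
as a NULL pointer; it does nothing for errors that are lost without
leaving such an obvious trace.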

See
<https://rt.central.org/rt/Ticket/Display.html?id=131852&user=guest&pass=guest>
for any more details, of course.

-- 
Andrew Deason
adeason@sinenomine.net