[OpenAFS] Re: openafs server on freebsd 8.2 amd64: bosserver coredump

Andrew Deason adeason@sinenomine.net
Mon, 4 Apr 2011 15:03:36 -0500


On Mon, 04 Apr 2011 10:39:37 +0200
Mark <mark@nl.simpc.com> wrote:

> Can do, its not in production yet. I did install 1.6.0pre4 on it, and
> that one runs fine btw.

Yeah, and actually, 1.6 testing is probably more worthwhile at this
point with a platform like that. You don't need to keep fiddling with
1.4, unless you want to, of course.

> Backtrace from the generated bosserver.core:
> 
> (gdb) bt
> #0  0x000000080077afcc in kill () from /lib/libc.so.7
> #1  0x0000000800779dcb in abort () from /lib/libc.so.7
> #2  0x000000000041389b in osi_Panic (msg=Variable "msg" is not available.) at rx_user.c:225

So, the panic message gets printed to stderr, but bosserver will
normally redirect that to /dev/null. You can see the panic message if
you run bosserver in the foreground with bosserver -nofork. I expect
you'll get something complaining like "rx packet not free".
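
If you'd rather keep a copy of the message than read it off the terminal, something like the sketch below should do. The helper name capture_stderr and the log path in the usage line are made up for illustration, and it assumes -nofork leaves stderr attached as described above.

```shell
# Run a command in the foreground and save whatever it writes to stderr.
# Usage: capture_stderr LOGFILE COMMAND [ARGS...]
# (capture_stderr is a hypothetical helper, not part of OpenAFS.)
capture_stderr() {
    log=$1
    shift
    # Only stderr is redirected; stdout stays where it was.
    "$@" 2>"$log"
}
```

For example: capture_stderr /tmp/bosserver.stderr /usr/afs/bin/bosserver -nofork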

On reproducing this myself, I see that the rx free packet queue is
getting corrupted when we're in the middle of libc. At first this seemed
very odd, but it just looks like our LWP stack is too small.

The code path where it gets corrupted is in rxkad: decode_generalized_time ->
generalizedtime2time -> timegm. timegm eventually calls a function to
load tz data, which seems to read from disk into stack space (the tzload
local variable u). That needs a rather large amount of stack (imo; it's
like 60k or so for that one frame, if I'm reading this right). So, it's
not too surprising that we crash, given that our stack for rx_Listener
should, I believe, be the LWP minimum of 48k.
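
To put rough numbers on that, here is the arithmetic as a small shell sketch. It assumes "k" means 1024 bytes, and the 60k frame size is only the rough estimate above, not a measured value.

```shell
# Sizes in bytes, taking 1k = 1024 bytes (an assumption).
lwp_min_stack=$((48 * 1024))   # LWP minimum stack: 49152
tzload_frame=$((60 * 1024))    # rough estimate for the tzload frame: 61440

if [ "$tzload_frame" -gt "$lwp_min_stack" ]; then
    # That one frame alone is bigger than the whole stack.
    echo "frame exceeds stack by $((tzload_frame - lwp_min_stack)) bytes"
fi
# prints "frame exceeds stack by 12288 bytes"
```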

So, if you want to give this a quick try, start bosserver like so:

AFS_LWP_STACK_SIZE=196608 /usr/afs/bin/bosserver

The problem should go away (it does for me). Of course, any other LWP
daemon will probably have the same problem, so if you want to run
dbserver processes on that machine, you'll need to do the same for them.
(fileserver and volserver use pthreads, so they should not be affected.)
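
If you end up doing this for several daemons, a tiny wrapper saves some typing. This is just a sketch; with_lwp_stack is a made-up name, and the only thing taken from above is the AFS_LWP_STACK_SIZE variable and the 196608 value (192k, i.e. four times the 48k minimum).

```shell
# Run a command with AFS_LWP_STACK_SIZE set only for that command.
# Usage: with_lwp_stack BYTES COMMAND [ARGS...]
# (with_lwp_stack is a hypothetical helper, not part of OpenAFS.)
with_lwp_stack() {
    size=$1
    shift
    # env sets the variable for the child process without touching
    # the calling shell's environment.
    env AFS_LWP_STACK_SIZE="$size" "$@"
}
```

For example: with_lwp_stack 196608 /usr/afs/bin/bosserver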

To me, this suggests that we just need to raise the minimum LWP stack
size for freebsd (or maybe fbsd 8, or something more specific?). I still
have no idea why this isn't a problem on 1.6/master, though, as I
thought we had similar stack sizes there, but I haven't looked too much
at that yet.

-- 
Andrew Deason
adeason@sinenomine.net