[OpenAFS] Re: Ubik problem

Marcus Watts mdw@umich.edu
Fri, 16 Apr 2010 16:33:40 -0400


Derrick Brashear <shadow@gmail.com> writes:
> Date:    Fri, 16 Apr 2010 12:43:30 EDT
> To:      Marcus Watts <mdw@umich.edu>
> cc:      openafs-info@openafs.org
> From:    Derrick Brashear <shadow@gmail.com>
> Subject: Re: [OpenAFS] Re: Ubik problem
> 
> It might actually be worth valgrinding.
> 
> On Fri, Apr 16, 2010 at 12:30 PM, Derrick Brashear <shadow@gmail.com> wrote:
> > On Fri, Apr 16, 2010 at 12:19 PM, Marcus Watts <mdw@umich.edu> wrote:
> >> Derrick Brashear <shadow@gmail.com> sent:
> >>
> >>> Date:    Thu, 15 Apr 2010 23:02:33 EDT
> >>> To:      Russ Allbery <rra@stanford.edu>
> >>> cc:      openafs-info@openafs.org
> >>> From:    Derrick Brashear <shadow@gmail.com>
> >>> Subject: Re: [OpenAFS] Re: Ubik problem
> >>>
> >>> On Thu, Apr 15, 2010 at 9:13 PM, Russ Allbery <rra@stanford.edu> wrote:
> >>> > Andrew Deason <adeason@sinenomine.net> writes:
> >>> >> Atro Tossavainen <atro.tossavainen+openafs@helsinki.fi> wrote:
> >>> >
> >>> >>> Derrick,
> >>> >>>
> >>> >>> > I'd suggest just using the IBM binary for the kaserver (and only the
> >>> >>> > kaserver) in your OpenAFS installation
> >>> >>>
> >>> >>> That's an interesting thought, but unfortunately it's nowhere near
> >>> >>> an option.  sunx86_ is quite simply not a supported platform for
> >>> >>> IBM AFS at all, even at 3.6 Patch 19 (August 2009).
> >>> >
> >>> >> Older OpenAFS releases could be another option, but I don't know how
> >>> >> useful of an answer that is. I'm not sure what could have caused that,
> >>> >> so I don't have a particular range in mind; maybe just earlier 1.4...
> >>> >> 1.4.9? 1.4.2?
> >>> >
> >>> > We were successfully running a 1.2.x version of kaserver on SPARC Solaris,
> >>> > and upgrading to 1.4.2 on Linux failed (albeit with different symptoms; it
> >>> > would just stop successfully giving out tickets for a while and then come
> >>> > back, regularly), so we stuck with 1.2.x on SPARC until we turned it off
> >>> > entirely.
> >>>
> >>> I'm pretty sure it "broke" between 1.2.11 and 1.4.1.
> >>>
> >>> --
> >>> Derrick
> >>
> >> Gah.  You made me drag out my kaserver notes!  Worse!  You made me
> >> *run* the thing!  Bad!  Bad!
> >>
> >> "broke" is a pretty vague description, so...
> >>
> >> From the previous descriptions, it sounds like there might be ubik sync issues.
> >
> > That's not what I was referring to. I think it's between ubik database
> > reads and the clients.
> >
> >> That could be caused either by problems in ubik, or unrelated problems
> >> that cause server crashes.  The reports do not include notes on any resulting
> >> core dumps, and the ubik problem reports clearly indicate another serious
> >> problem with server address determination.
> >>
> >> I experimented with building a version of 1.2.11, running it and using some
> >> of the diagnostic tools, followed by trying to run the resulting database with
> >> 1.4.12.  I certainly didn't thoroughly explore things.  I now have an interesting
> >> list of "problems".
> >>
> >> /1/ ubik_hdr.size got changed to be a short, not a long.  ntohl is wrong.  This
> >>        is in ubik proper as well as kaserver diagnostics.  Fortunately, this
> >>        doesn't seem to break too much.
> >> /2/ udebug address output byte swap issues.  Previously mentioned as fixed.
> >> /3/ kadb_check complains about a lot of stuff, and the output does not
> >>        make much sense.  A lot of this looks like endian issues, but
> >>        also I think this tool probably started as a temporary hack and
> >>        never well cleaned up.  The output was probably never really
> >>        "clean" in the first place.
> >> /4/ I never got kaserver to core dump (granted, I'm not pushing it real hard.)
> >>
> >> I think at least in some basic way, the kaserver in 1.4.12 still "works".
> >> So I am still curious as to what Derrick meant by "broke".
> >>
> >> possible generic action items,
> >> /1/ fix uhdr.size usage issues. (ntohs/htons not ntohl/htonl).
> >> /2/ fix kadb_check to produce correct output.  Should match on little
> >>        and big-endian machines.
> >> /3/ fix kadb_check to produce "better" output?
> >>
> >> For Atro Tossavainen, I think my recommendations are:
> >> /1/ can he only run one source version of kaserver on all db hosts (not a
> >>        mixed ibm/openafs env),
> >> /2/ can he resolve the server setup such that when udebug is
> >>        run, it only reports "correct" IP addresses?  (Ideally only
> >>        the primary, but the other interfaces should be ok so long
> >>        as packets sent through them get to the same place.)
> >> /3/ can he resolve time so that he never sees "last beacon sent -3 secs ago"?,
> >>        ubik does care, even more than kerberos, about time.
> >> /4/ can he resolve his keyfile reference such that he never gets
> >>        "unknown key version number"?
> >>        (My suspicion, he's got path issues between differently built binaries.)
> >
> > no, because i suspect 4 is the "real issue"
> >
> >
> > --
> > Derrick
> >
> 
> 
> 
> --
> Derrick
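
On the uhdr.size item quoted above (ntohs/htons, not ntohl/htonl), here is
a minimal sketch of why the width of the conversion matters.  The struct is
a hypothetical stand-in, not copied from the ubik source, and all of the
demo_* names are made up for illustration.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical header layout, for illustration only (not the real ubik_hdr). */
struct demo_hdr {
    uint32_t magic;
    uint16_t pad1;
    uint16_t size;              /* 16-bit field, stored in network byte order */
};

/* Store a host-order size into the 16-bit field. */
static void demo_set_size(struct demo_hdr *h, uint16_t size)
{
    h->size = htons(size);
}

/* Correct: a 16-bit field is converted with ntohs. */
static uint16_t demo_size_ok(const struct demo_hdr *h)
{
    return ntohs(h->size);
}

/* Wrong: treats the 16-bit field as if it were 32 bits wide.  On a
 * big-endian host this happens to give the right answer; on a
 * little-endian host the value ends up in the wrong bytes. */
static uint32_t demo_size_bad(const struct demo_hdr *h)
{
    return ntohl(h->size);
}
```

With a size of 64, demo_size_ok() returns 64 on any host, while
demo_size_bad() returns 64 only on big-endian hosts.  That kind of
asymmetry is easy to miss if the code is only ever tested on one
architecture.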

I tried valgrind.  Run #1 with stripped binaries - I got lots of "stuff".
First 5 complaints about uninitialized reads of size 4 come from various
bits of lwp.  Various later complaints are from deep inside of rx.
I didn't see anything that looked unique or specific to kaserver.
Generally speaking, valgrind didn't like anything lwp did at all.
I got tired of looking up symbols, so tried again with non-stripped binaries.
valgrind core dumps instantly.  The resulting core dump is badly damaged.
It might be worth trying again with a copy of kaserver built with "-g".
That shouldn't make a difference, but ...

So, yup, valgrind would be nice.  So far, it's not being
very helpful.

Andrew talks a bit about "errors that appear after the server's been
running for a while".  If this is a memory corruption problem, then
there is a good likelihood of random seg faults, possible core dumps,
and server restarts.  If there are no core dumps, then it should be
possible to reconfigure things so there are.  Those core dumps would
help a lot.  Call tracebacks for failure points would tell us what code
paths and data matter here.  Just knowing that the software is restarting
spontaneously (cat /var/log/openafs/BosLog ?) would help a lot.
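
One way to "reconfigure things" so core dumps are produced is to raise the
core-size resource limit before the server starts.  Normally that would be
done with the shell's ulimit in whatever starts the servers; the sketch
below shows the same idea in C, as it might appear in a small wrapper.
The function name is hypothetical, and this assumes a POSIX system; it
says nothing about where the core file ends up or whether other system
policy suppresses it.

```c
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit for this process
 * (and anything it later execs).  Returns 0 on success, -1 on failure.
 * Hypothetical helper, for illustration only. */
static int demo_enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    rl.rlim_cur = rl.rlim_max;  /* soft limit may go as high as the hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

If the hard limit itself is 0, this succeeds but still yields no core
files; the hard limit then has to be raised by a privileged parent.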

Some other problems that could cause intermittent behavior include:

/1/ flapping network routes.  We already know there are multiple addresses...
/2/ DNS.  Unlikely, but ubik likely depends on DNS.  If "host `hostname`"
	lists more than one IP address, round robin behavior in DNS
	might result in oddness.
/3/ dueling ubik masters.  Wasn't there once a problem with byte
	ordering and ubik ip address ranking?  We definitely
	have multiple code bases in the picture.  Dueling
	masters--uh, just don't go there.
/4/ roving ubik master selection.  With only 2 hosts, I don't quite
	understand how this could happen (but see previous).
	But since we know the key files aren't consistent, which
	machines can do "localauth" to others will definitely vary,
	which could easily look like "it stops working after a while".
	This is usually only a problem with > 2 db hosts.
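
To make the byte-ordering worry in /3/ concrete, here is a minimal sketch.
It assumes servers are ranked by numerically lowest IPv4 address, which is
my recollection of how ubik prefers a sync site and is not checked against
the source here.  The demo_* names are hypothetical.  The point is only
that comparing raw network-order words gives a different ranking on
little-endian hosts than comparing in host order.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Build a network-byte-order IPv4 address from four octets. */
static uint32_t demo_addr(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return htonl(((uint32_t)a << 24) | ((uint32_t)b << 16) |
                 ((uint32_t)c << 8) | (uint32_t)d);
}

/* Portable ranking: convert to host order before comparing.
 * Returns nonzero if x ranks below y. */
static int demo_lower_portable(uint32_t x_net, uint32_t y_net)
{
    return ntohl(x_net) < ntohl(y_net);
}

/* Non-portable ranking: compares the raw network-order words, so the
 * result depends on the endianness of the host doing the comparison. */
static int demo_lower_raw(uint32_t x_net, uint32_t y_net)
{
    return x_net < y_net;
}
```

For 10.0.0.2 and 10.0.1.1, demo_lower_portable() ranks 10.0.0.2 lower
everywhere, while demo_lower_raw() agrees on big-endian hosts and
disagrees on little-endian ones.  Two servers that disagree about the
ranking is exactly the sort of thing that could produce dueling masters.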

				-Marcus Watts