[OpenAFS] Re: Ubik problem

Marcus Watts mdw@umich.edu
Fri, 16 Apr 2010 12:19:45 -0400

Derrick Brashear <shadow@gmail.com> sent:

> Date:    Thu, 15 Apr 2010 23:02:33 EDT
> To:      Russ Allbery <rra@stanford.edu>
> cc:      openafs-info@openafs.org
> From:    Derrick Brashear <shadow@gmail.com>
> Subject: Re: [OpenAFS] Re: Ubik problem
> On Thu, Apr 15, 2010 at 9:13 PM, Russ Allbery <rra@stanford.edu> wrote:
> > Andrew Deason <adeason@sinenomine.net> writes:
> >> Atro Tossavainen <atro.tossavainen+openafs@helsinki.fi> wrote:
> >
> >>> Derrick,
> >>>
> >>> > I'd suggest just using the IBM binary for the kaserver (and only the
> >>> > kaserver) in your OpenAFS installation
> >>>
> >>> That's an interesting thought, but unfortunately it's nowhere near
> >>> an option.  sunx86_ is quite simply not a supported platform for
> >>> IBM AFS at all, even at 3.6 Patch 19 (August 2009).
> >
> >> Older OpenAFS releases could be another option, but I don't know how
> >> useful of an answer that is. I'm not sure what could have caused that,
> >> so I don't have a particular range in mind; maybe just earlier 1.4...
> >> 1.4.9? 1.4.2?
> >
> > We were successfully running a 1.2.x version of kaserver on SPARC Solaris,
> > and upgrading to 1.4.2 on Linux failed (albeit with different symptoms; it
> > would just stop successfully giving out tickets for a while and then come
> > back, regularly), so we stuck with 1.2.x on SPARC until we turned it off
> > entirely.
> I'm pretty sure it "broke" between 1.2.11 and 1.4.1.
> --
> Derrick

Gah.  You made me drag out my kaserver notes!  Worse!  You made me
*run* the thing!  Bad!  Bad!

"broke" is a pretty vague description, so...

From the previous descriptions, it sounds like there might be ubik sync issues.
That could be caused either by problems in ubik, or unrelated problems
that cause server crashes.  The reports do not include notes on any resulting
core dumps, and the ubik problem reports clearly indicate another serious
problem with server address determination.

I experimented with building a version of 1.2.11, running it and using some
of the diagnostic tools, followed by trying to run the resulting database with
1.4.12.  I certainly didn't thoroughly explore things.  I now have an interesting
list of "problems".

/1/ ubik_hdr.size got changed to be a short, not a long, so converting it
	with ntohl is wrong.  This is in ubik proper as well as the kaserver
	diagnostics.  Fortunately, this doesn't seem to break too much.
/2/ udebug address output has byte swap issues.  Previously mentioned as fixed.
/3/ kadb_check complains about a lot of stuff, and the output does not
	make much sense.  A lot of this looks like endian issues, but
	I also think this tool probably started as a temporary hack and
	was never cleaned up well.  The output was probably never really
	"clean" in the first place.
/4/ I never got kaserver to core dump (granted, I'm not pushing it real hard.)
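
To make item /1/ concrete, here is a minimal sketch of what goes wrong when a
16-bit field is converted with the 32-bit routine.  The struct and function
names are hypothetical stand-ins for illustration, not copied from the OpenAFS
source; the only assumption is that the size field is stored in network byte
order.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical header with a 16-bit size field (illustration only,
 * not the real ubik_hdr layout). */
struct demo_hdr {
    int32_t  magic;
    uint16_t pad1;
    uint16_t size;      /* stored in network byte order */
};

/* Correct: a 16-bit field is converted with ntohs. */
static uint16_t size_with_ntohs(const struct demo_hdr *h)
{
    return ntohs(h->size);
}

/* Wrong: the 16-bit value is widened to 32 bits and ntohl swaps all
 * four bytes.  On a little-endian host the meaningful bytes end up in
 * the upper half and are lost when the result is narrowed again.  On a
 * big-endian host ntohl is a no-op, so the bug stays hidden. */
static uint16_t size_with_ntohl(const struct demo_hdr *h)
{
    return (uint16_t)ntohl((uint32_t)h->size);
}
```

With size set to htons(64), size_with_ntohs returns 64 on any host, while
size_with_ntohl returns 64 only on big-endian hosts.  That would be consistent
with the problem going unnoticed on some platforms.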

I think at least in some basic way, the kaserver in 1.4.12 still "works".
So I am still curious as to what Derrick meant by "broke".

Possible generic action items:
/1/ fix uhdr.size usage issues. (ntohs/htons not ntohl/htonl).
/2/ fix kadb_check to produce correct output.  Should match on little
	and big-endian machines.
/3/ fix kadb_check to produce "better" output?
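
For action item /2/, one common way to get identical output on little- and
big-endian machines is to decode on-disk fields byte by byte rather than
reading them through host-order integers.  This is only a sketch under the
assumption that the fields are stored in network (big-endian) order; the
helper names are hypothetical and this is not how kadb_check is currently
written.

```c
#include <stdint.h>

/* Decode a 32-bit big-endian value from raw bytes.  The result is the
 * same regardless of the byte order of the host doing the decoding. */
static uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}

/* Decode a 16-bit big-endian value from raw bytes. */
static uint16_t get_be16(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | (unsigned)p[1]);
}
```

Because the shifts operate on values rather than on memory layout, there is no
ntohl/ntohs choice to get wrong: the field width is fixed by which helper is
called.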

For Atro Tossavainen, I think my recommendations are:
/1/ can he run only one source version of kaserver on all db hosts (not a
	mixed ibm/openafs env)?
/2/ can he resolve the server setup such that when udebug is
	run, it only reports "correct" IP addresses?  (Ideally only
	the primary, but the other interfaces should be ok so long
	as packets sent through them get to the same place.)
/3/ can he resolve time so that he never sees "last beacon sent -3 secs ago"?
	Ubik cares about time, even more than kerberos does.
/4/ can he resolve his keyfile reference such that he never gets
	"unknown key version number"?
	(My suspicion: he's got path issues between differently built binaries.)
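
On recommendation /3/: a negative "secs ago" figure is what falls out of
subtracting a timestamp that is later than the clock reading it is compared
against.  The sketch below assumes the figure is a plain difference of two
timestamps; the function names are hypothetical and this is not the udebug
code.

```c
#include <time.h>

/* Hypothetical helper: age of an event in seconds, as a signed value. */
static long secs_ago(time_t now, time_t event)
{
    return (long)difftime(now, event);
}

/* A negative age means the event is stamped later than "now", which
 * points at a clock that was stepped backwards or at two clocks that
 * disagree. */
static int timestamp_in_future(time_t now, time_t event)
{
    return secs_ago(now, event) < 0;
}
```

For example, with now = 1000 and a beacon stamped 1003, secs_ago gives -3,
matching the shape of the message quoted above.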

				-Marcus Watts