[OpenAFS] Re: Ubik problem

Derrick Brashear shadow@gmail.com
Fri, 16 Apr 2010 12:43:30 -0400


It might actually be worth valgrinding.

On Fri, Apr 16, 2010 at 12:30 PM, Derrick Brashear <shadow@gmail.com> wrote:
> On Fri, Apr 16, 2010 at 12:19 PM, Marcus Watts <mdw@umich.edu> wrote:
>> Derrick Brashear <shadow@gmail.com> sent:
>>
>>> Date:    Thu, 15 Apr 2010 23:02:33 EDT
>>> To:      Russ Allbery <rra@stanford.edu>
>>> cc:      openafs-info@openafs.org
>>> From:    Derrick Brashear <shadow@gmail.com>
>>> Subject: Re: [OpenAFS] Re: Ubik problem
>>>
>>> On Thu, Apr 15, 2010 at 9:13 PM, Russ Allbery <rra@stanford.edu> wrote:
>>> > Andrew Deason <adeason@sinenomine.net> writes:
>>> >> Atro Tossavainen <atro.tossavainen+openafs@helsinki.fi> wrote:
>>> >
>>> >>> Derrick,
>>> >>>
>>> >>> > I'd suggest just using the IBM binary for the kaserver (and only the
>>> >>> > kaserver) in your OpenAFS installation
>>> >>>
>>> >>> That's an interesting thought, but unfortunately it's nowhere near
>>> >>> an option.  sunx86_ is quite simply not a supported platform for
>>> >>> IBM AFS at all, even at 3.6 Patch 19 (August 2009).
>>> >
>>> >> Older OpenAFS releases could be another option, but I don't know how
>>> >> useful of an answer that is. I'm not sure what could have caused that,
>>> >> so I don't have a particular range in mind; maybe just earlier 1.4...
>>> >> 1.4.9? 1.4.2?
>>> >
>>> > We were successfully running a 1.2.x version of kaserver on SPARC Solaris,
>>> > and upgrading to 1.4.2 on Linux failed (albeit with different symptoms; it
>>> > would just stop successfully giving out tickets for a while and then come
>>> > back, regularly), so we stuck with 1.2.x on SPARC until we turned it off
>>> > entirely.
>>>
>>> I'm pretty sure it "broke" between 1.2.11 and 1.4.1.
>>>
>>> --
>>> Derrick
>>
>> Gah.  You made me drag out my kaserver notes!  Worse!  You made me
>> *run* the thing!  Bad!  Bad!
>>
>> "broke" is a pretty vague description, so...
>>
>> From the previous descriptions, it sounds like there might be ubik sync issues.
>
> That's not what I was referring to. I think it's between ubik database
> reads and the clients.
>
>> That could be caused either by problems in ubik, or unrelated problems
>> that cause server crashes.  The reports do not include notes on any resulting
>> core dumps, and the ubik problem reports clearly indicate another serious
>> problem with server address determination.
>>
>> I experimented with building a version of 1.2.11, running it and using some
>> of the diagnostic tools, followed by trying to run the resulting database with
>> 1.4.12.  I certainly didn't thoroughly explore things.  I now have an interesting
>> list of "problems".
>>
>> /1/ ubik_hdr.size got changed to be a short, not a long.  ntohl is wrong.  This
>>        is in ubik proper as well as kaserver diagnostics.  Fortunately, this
>>        doesn't seem to break too much.
>> /2/ udebug address output byte swap issues.  Previously mentioned as fixed.
>> /3/ kadb_check complains about a lot of stuff, and the output does not
>>        make much sense.  A lot of this looks like endian issues, but
>>        also I think this tool probably started as a temporary hack and
>>        never well cleaned up.  The output was probably never really
>>        "clean" in the first place.
>> /4/ I never got kaserver to core dump (granted, I'm not pushing it real hard.)
>>
>> I think at least in some basic way, the kaserver in 1.4.12 still "works".
>> So I am still curious as to what Derrick meant by "broke".
>>
>> possible generic action items,
>> /1/ fix uhdr.size usage issues. (ntohs/htons not ntohl/htonl).
>> /2/ fix kadb_check to produce correct output.  Should match on little
>>        and big-endian machines.
>> /3/ fix kadb_check to produce "better" output?
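
The ntohs/ntohl point in action item /1/ can be illustrated with a small sketch. It is written in Python only to make the arithmetic visible and is not OpenAFS code: it assumes a little-endian host, a 16-bit size field stored in network byte order, and an example value of 64. The helper names swap16 and swap32 are hypothetical stand-ins for what ntohs and ntohl do on such a host.

```python
# Sketch only: models a little-endian host reading a 16-bit size field that
# is stored in network (big-endian) byte order. All names are hypothetical.

def swap16(v):
    """Byte-swap a 16-bit value (what ntohs does on a little-endian host)."""
    return ((v & 0x00FF) << 8) | ((v & 0xFF00) >> 8)

def swap32(v):
    """Byte-swap a 32-bit value (what ntohl does on a little-endian host)."""
    return int.from_bytes((v & 0xFFFFFFFF).to_bytes(4, "little"), "big")

stored = (64).to_bytes(2, "big")               # size 64 on disk: bytes 00 40
host_short = int.from_bytes(stored, "little")  # the host loads it as 0x4000

via_ntohs = swap16(host_short)     # 16-bit swap recovers the stored value
via_ntohl = swap32(host_short)     # 32-bit swap of the widened short
truncated = via_ntohl & 0xFFFF     # what survives if stored back in a short

print(via_ntohs, hex(via_ntohl), truncated)
```

Under these assumptions the 16-bit swap gives back 64, while the 32-bit swap moves the value into the upper half of the word, so truncating the result to a short loses it. On a big-endian host both conversions are no-ops, which is one way such a mismatch can go unnoticed.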
>>
>> For Atro Tossavainen, I think my recommendations are:
>> /1/ can he only run one source version of kaserver on all db hosts (not a
>>        mixed ibm/openafs env),
>> /2/ can he resolve the server setup such that when udebug is
>>        run, it only reports "correct" IP addresses?  (Ideally only
>>        the primary, but the other interfaces should be ok so long
>>        as packets sent through them get to the same place.)
>> /3/ can he resolve time so that he never sees "last beacon sent -3 secs ago"?,
>>        ubik does care, even more than kerberos, about time.
>> /4/ can he resolve his keyfile reference such that he never gets
>>        "unknown key version number"?
>>        (My suspicion, he's got path issues between differently built binaries.)
>
> no, because i suspect 4 is the "real issue"
>
>
> --
> Derrick
>



--
Derrick