Fwd: [OpenAFS] Re: afs/cell transition procedure
Kendrick Hernandez
kendrick.hernandez@umbc.edu
Mon, 9 Sep 2013 07:10:05 -0400
On Fri, Sep 6, 2013 at 11:54 AM, Andrew Deason <adeason@sinenomine.net> wrote:
> On Fri, 6 Sep 2013 10:41:50 -0400
> Kendrick Hernandez <kendrick.hernandez@umbc.edu> wrote:
>
> > and I was able to generate a new disabled afs/cell principal with
> > strong encryption, extract it to the rxkad.keytab file and distribute
> > it to our file servers, and do the restarts.
>
> Can you provide the exact commands you used to generate the disabled
> principal? What KDC are you using? (just for thoroughness)
>
Sure. I'm using MIT krb5 1.10.3. On the KDC I fired up kadmin.local and ran
ank -randkey -allow_tix afs/umbc.edu
With our KDC config, the default enctypes created are
aes256-cts-hmac-sha1-96:normal
aes128-cts-hmac-sha1-96:normal
des3-cbc-sha1:normal
arcfour-hmac-md5:normal
> You didn't modify or remove the old KeyFile during any of this, correct?
>
That's correct.
>
> > After this I've noticed the following message repeated in the FileLog
> > for our servers:
> >
> > VL_RegisterAddrs rpc failed; will retry periodically (code=19270407, err=0)
>
> To be clear, this doesn't involve anything with the KDC. Usually I think
> that would be indicative of a mismatch of the keying data between that
> fileserver, and the dbservers. I'm not saying you somehow distributed
> different keys or something (there's probably something else happening),
> but that's just kind of what the servers think happened.
>
Gotcha. For the sake of completeness, I'll mention that our db servers are
currently running 1.4.15, whereas the rest of the fileservers are 1.6.5.
We've been in the process of migrating our AFS infrastructure from Solaris
and OpenAFS 1.4 to RHEL6 and OpenAFS 1.6, and our db servers are last up.
>
> > When I went to enable the new afs/cell principal and disable the old
> > one, I was able to log in to a server and get an afs/cell service
> > ticket, tokens, and access my afs volume. I could also do the same for
> > my afs "admin" principal, but when I went to perform a "vos release"
> > operation, I got an error about
>
> I'm not clear about what is happening at this point. Does the above
> VL_RegisterAddrs message keep on appearing?
>
Yes, they keep appearing in the file server logs.
>
> Does "access my afs volume" mean that you were able to do things that
> required authenticated access? (e.g. writing to your volume, or in
> general accessing files or directories that are not readable to
> system:anyuser)
>
Yes. I didn't test writing to the volume, but the ACL on my home directory
is restricted to system:administrators and my own account.
>
> > Could not lock the VLDB entry for the volume XXXXXXXX.
> > rxk: security object was passed a bad ticket
> > Error in vos release command.
> > rxk: security object was passed a bad ticket
>
> This again suggests that the keying material on the dbservers
> (specifically the vlserver) is different from the other servers.
>
> In this situation it would be useful to try to see if an authenticated
> 'vos status' command (or similar fileserver-only 'vos' command) works
> against a fileserver. If that works, but authenticated connections to
> the vldb do not, then something's wrong with the vlserver keying
> material.
>
This seems to be the case; more on that below.
>
> It may be helpful to list the contents of the rxkad.keytab on each
> server (with MIT ktutil 'list -e'), as well as the contents of the
> KeyFile (via 'asetkey list' or 'bos listkeys'). You should be able to
> see a mismatch pretty obviously yourself, but if you want, post the
> information to the list, with any actual key data removed. Do NOT share
> the actual keys; remember that 'asetkey list' does show the actual keys,
> so you must scrub the output before sharing it.
>
> Specifically I'm just curious about the kvnos in play, and maybe the
> enctypes (for the keys in rxkad.keytab).
>
The output of "bos listkeys" matches for all db and file servers, and the
kvno is 9, which matches the kvno of the old afs principal on the KDC:
    > bos listkeys db1.afs.umbc.edu
    key 9 has cksum XXXXXXXXXX
    Keys last changed on Tue May 8 16:53:14 2001.
    All done.
The output of ktutil 'list -e' also matches for all db and file servers,
and the kvno is 2, which matches the kvno of the new afs/cell principal:
    ktutil:  rkt /usr/afs/etc/rxkad.keytab
    ktutil:  list -e
    slot KVNO Principal
    ---- ---- ---------------------------------------------------------------------
       1    2 afs/umbc.edu@UMBC.EDU (aes256-cts-hmac-sha1-96)
       2    2 afs/umbc.edu@UMBC.EDU (aes128-cts-hmac-sha1-96)
       3    2 afs/umbc.edu@UMBC.EDU (des3-cbc-sha1)
       4    2 afs/umbc.edu@UMBC.EDU (arcfour-hmac)
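
Comparing this listing across many servers by eye is error-prone, so here is a rough sketch of how the comparison could be scripted. It assumes the 'list -e' output of each server has already been saved as text; the function names and the fs1.example.org host are hypothetical, and only kvno, principal, and enctype are compared, not the key bytes themselves.

```python
import re

# Matches entry lines such as:
#    1    2 afs/umbc.edu@UMBC.EDU (aes256-cts-hmac-sha1-96)
# Header, separator, and "ktutil:" prompt lines do not match and are skipped.
ENTRY_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S+)\s+\(([^)]+)\)\s*$")


def parse_ktutil_list(text):
    """Return a set of (kvno, principal, enctype) tuples from 'list -e' text."""
    entries = set()
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            _slot, kvno, principal, enctype = match.groups()
            entries.add((int(kvno), principal, enctype))
    return entries


def find_mismatches(outputs_by_host):
    """Map each host to the entries it lacks relative to the union of all hosts."""
    parsed = {host: parse_ktutil_list(text) for host, text in outputs_by_host.items()}
    union = set().union(*parsed.values())
    return {host: union - found for host, found in parsed.items() if union - found}


if __name__ == "__main__":
    complete = """\
slot KVNO Principal
---- ---- ------------------------------
   1    2 afs/umbc.edu@UMBC.EDU (aes256-cts-hmac-sha1-96)
   2    2 afs/umbc.edu@UMBC.EDU (aes128-cts-hmac-sha1-96)
"""
    # Hypothetical second host that is missing the aes128 entry.
    incomplete = "   1    2 afs/umbc.edu@UMBC.EDU (aes256-cts-hmac-sha1-96)\n"
    print(find_mismatches({"db1.afs.umbc.edu": complete,
                           "fs1.example.org": incomplete}))
```

An empty result only says that every host lists the same kvno and enctype pairs; it does not prove the key material itself is identical.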
>
> > This leads me to believe that our servers are still using the old
> > principal.
>
> It suggests to me that your dbserver processes specifically may not be
> using the new rxkad.keytab for accepting connections. If you can
> authenticate to the fileserver with strong crypto, but not to the vldb,
> then that would be explained by the dbservers not having new keys.
>
Ah, okay. I've also noticed that one of our db servers does not appear to
be synchronizing with the other two. Going back to your previous suggestion
of attempting "vos status", I re-enabled the new afs/cell principal and was
able to 'vos status' several of our fileservers. I then tried some 'vos
listvldb' operations which failed with the "rxk: security object was passed
a bad ticket" error. On a hunch I shut off the server processes for the db
server that's not syncing, and this time the vos operations worked. Very
strange.
>
> > Do I need to restart the afs fileserver processes after enabling the
> > new afs/cell principal?
>
> No; you just need to restart after deploying the rxkad.keytab file. And
> even if you don't restart, things don't break (as long as you have the
> KeyFile around); it just means you're still using the DES long-term
> keys, so you still have a security problem. It may be helpful to explain
> a little bit about how the server keys are used/updated:
>
> You don't need to restart the server process for that server to accept
> incoming connections with the new keying material. That is, if a client,
> or another server, tries to contact us with new keys, we don't need to
> be restarted. We just need the new keys in rxkad.keytab, and for
> CellServDB to be touched.
>
> What we need to restart for is to use the new keys to create outgoing
> connections. This is what the servers use to communicate with each
> other. 'Why' is beyond the scope of this paragraph, but fileservers do
> talk to other fileservers, and fileservers talk to dbservers, and
> dbservers talk to each other. Whenever they do that, they need some keys
> to make an authenticated connection. If you just update rxkad.keytab,
> they will not recreate connections with the new keys; they only load the
> keys for creating connections once at startup. There are some exceptions
> to that, but for the purposes of this migration, you can treat that as
> true.
>
So if the keys in rxkad.keytab were still disabled (meaning DISALLOW_ALL_TIX
was set on the principal) when I restarted the servers, would the servers
continue to use the old key in KeyFile for outgoing connections?
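
To make sure I have the incoming/outgoing distinction straight, here is a toy model of the behavior as described in the quoted explanation. It is not OpenAFS code: the Disk and ToyServer classes, the key labels, and the "preferred" field are all hypothetical, and the model says nothing about how a real server picks its outgoing key or about the KDC-side flag.

```python
class Disk:
    """Stand-in for a server's on-disk state (hypothetical structure)."""

    def __init__(self, keys, preferred):
        self.keys = set(keys)        # key labels present on disk
        self.preferred = preferred   # label used for new outgoing connections
        self.cellservdb_mtime = 0

    def deploy(self, label):
        """Add a new key and prefer it, without touching CellServDB."""
        self.keys.add(label)
        self.preferred = label

    def touch_cellservdb(self):
        self.cellservdb_mtime += 1


class ToyServer:
    """Toy model of the described key handling; not a real implementation."""

    def __init__(self, disk):
        self.disk = disk
        self.restart()

    def restart(self):
        # At startup both the incoming key set and the outgoing key are loaded.
        self._load_incoming()
        self.outgoing_key = self.disk.preferred

    def _load_incoming(self):
        self.incoming_keys = set(self.disk.keys)
        self.seen_mtime = self.disk.cellservdb_mtime

    def accepts(self, label):
        # Incoming keys are refreshed once CellServDB has been touched.
        if self.disk.cellservdb_mtime != self.seen_mtime:
            self._load_incoming()
        return label in self.incoming_keys


if __name__ == "__main__":
    disk = Disk({"KeyFile kvno 9"}, preferred="KeyFile kvno 9")
    server = ToyServer(disk)

    disk.deploy("rxkad.keytab kvno 2")
    disk.touch_cellservdb()
    print(server.accepts("rxkad.keytab kvno 2"))  # new key accepted, no restart
    print(server.outgoing_key)                    # outgoing key is unchanged

    server.restart()
    print(server.outgoing_key)                    # outgoing key picked up now
```

In this model the new key is accepted for incoming connections as soon as CellServDB is touched, while outgoing connections keep using the old key until the restart.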
>
> Does that help? That explanation doesn't explain what your issue is, but
> I hope it at least helps to explain what is supposed to occur.
>
I think I'm getting a better picture, thanks.
--
: Kendrick Hernandez
: UNIX Systems Administrator
: UNIX Systems and Infrastructure
: Division of Information Technology
: University of Maryland, Baltimore County