[OpenAFS] CellServDB file update...

Tue, 16 May 2006 02:20:21 -0400

At 01:19 AM 5/16/2006, Marcus Watts wrote:
> > From: Derrick J Brashear <shadow@dementia.org>
> > it cannot, actually. it can cause every server to need to be restarted
> > but the tokens in the client cache will still work after the new servers
> > start

This was not my experience, but I won't argue over this one.

> > also, i should pribably answer when i am more sober
>
>Uh, when you're more sober -- why exactly do you *need*
>to change the CellServDB on your fileservers?  This is
>definitely not a "normal operational thing" to do in the
>first place.

Because we are adding cell servers.  We had three, now we have five.  We 
will be removing two in the future.  We are changing the subnetting of the 
cell servers.  Definitely not "normal", but in this case needed.

If it weren't for the "bug" I don't think we would have had any issues.  We 
use a script that updates the file servers' CellServDB files; we always 
have.  I admit now that this was "bad admin'ing".  We had assumed 
that we could update the CellServDB files without the file servers picking 
up the changes until we performed a bos restart.  We simply forgot about 
the "bos addhost/removehost" commands.  Like you said, it isn't something 
you do every day.

>It's certainly nice if the software does something "right"
>automatically when the server side CellServDB get changed.  It sounds
>like Derrick did that, modulo a minor bug or so.  It would also be nice
>if the documentation at least described what was going to happen if it
>isn't going to be "nice" behavior.  Sounds like the documentation at
>least managed to identify that this was risky, even if it wasn't very
>clear about why this was a problem.  At that point, the onus would seem
>to be on the AFS administrators to try this out in advance in a test
>environment and see what was going to happen, before trying it for real
>and risking breaking things.

After being bitten by the "bug" (not knowing it was a bug at the time) and 
looking into the problem, we realized we had forgotten the "bos 
addhost/removehost" commands.  After reading more about them, I just 
wanted to verify that these commands are in fact "operational" and not 
merely an "administrative practice" update mechanism, i.e. merely a 
practice that admins should follow for the sake of future AFS command 
upgrades.

If "internally" the addhost/removehost commands do nothing more than "edit" 
the files themselves, like a text editor, then they are "currently" only 
practice policy.  It sounds to me instead that they actually do more than 
edit the files, because there are locking issues if the file server 
process tries to read the CellServDB files at the same moment you 
manually copy over them.  The addhost/removehost commands probably stop the 
file server process from reading the files, update them, then allow them to 
be read again.
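
To make the "copying over a file while something is reading it" problem 
concrete, here is a minimal Python sketch of the usual workaround: write 
the new contents to a temporary file in the same directory, then rename it 
into place.  This is only an illustration of the general hazard.  It is 
not a description of what bos does internally, and the function name is 
made up for the example.

```python
import os
import tempfile


def replace_file_atomically(path, new_text):
    """Replace `path` with `new_text` so that a concurrent reader sees
    either the complete old file or the complete new file, never a
    half-written one.  Hypothetical helper, for illustration only."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(new_text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        # Note: mkstemp creates the file with mode 0600; the original
        # file's ownership and permissions are not preserved here.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A plain copy, by contrast, truncates the target and rewrites it in place, 
so a process that opens the file mid-copy can read a partial server list.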

Well at least the "bug" caused a learning experience.  We are all more 
educated by our mistakes.  I've made plenty in my life.  I'm about 6 points 
away from being a god.  ;P

What did we learn?

      1.  The file server process actually reads the CellServDB file very 
often.
      2.  Never copy over the CellServDB file because of #1.
      3.  Use "bos addhost/removehost" commands to change the CellServDB files.
      4.  We should never have to "bos restart" a server to get it to see a 
new cell server if we use bos addhost...right?
      5.  Never do this on a Friday afternoon.  :)
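
For item 3, the update script could be reworked along these lines: instead 
of copying a new file over the old one, work out which database servers 
need to be added or removed and emit the matching bos commands.  This is a 
rough Python sketch, not our actual script.  The function name and the host 
names in the usage note are placeholders, and it only builds the command 
lines; it does not run them.

```python
def plan_bos_commands(server, current_hosts, desired_hosts):
    """Return the bos command lines (as argument lists) needed to move
    `server` from `current_hosts` to `desired_hosts`.

    Additions are listed before removals so the server never ends up
    knowing about fewer database servers than it needs.
    """
    current = set(current_hosts)
    desired = set(desired_hosts)
    commands = []
    for host in sorted(desired - current):
        commands.append(["bos", "addhost", "-server", server, "-host", host])
    for host in sorted(current - desired):
        commands.append(["bos", "removehost", "-server", server, "-host", host])
    return commands
```

Calling plan_bos_commands("fs1.example.com", ["db1", "db2", "db3"], 
["db1", "db2", "db3", "db4", "db5"]) yields two "bos addhost" command 
lines, one for db4 and one for db5.  Checking the result afterwards with 
"bos listhosts" on each server would be a sensible extra step.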

>Could you have changed other things like your KeyFile or ThisCell?  That
>would certainly result in tossing tokens.

No, we were very careful with the whole process otherwise.  Other than 
being bitten by the "bug", everything went ok.

Rodney