[OpenAFS] CellServDB file update...

Jeffrey Hutzelman jhutz@cmu.edu
Tue, 16 May 2006 12:47:34 -0400

On Tuesday, May 16, 2006 02:20:21 AM -0400 Rodney M Dyer <rmdyer@uncc.edu> wrote:

>> It's certainly nice if the software does something "right"
>> automatically when the server-side CellServDB gets changed.  It sounds
>> like Derrick did that, modulo a minor bug or so.  It would also be nice
>> if the documentation at least described what was going to happen if it
>> isn't going to be "nice" behavior.  Sounds like the documentation at
>> least managed to identify that this was risky, even if it wasn't very
>> clear about why this was a problem.  At that point, the onus would seem
>> to be on the AFS administrators to try this out in advance in a test
>> environment and see what was going to happen, before trying it for real
>> and risking breaking things.
> After being bitten by the "bug" (not knowing it was a bug at the time)
> and looking into the problem we realized we had forgotten the "bos
> addhost/removehost" commands.  Upon reading deeper about the
> addhost/removehost commands I just wanted to verify that these commands
> were in fact "operational" and weren't merely an "administrative
> practice" update mechanism, e.g. merely a practice that admins should
> follow for future AFS command upgrade purposes.
> If "internally" the addhost/removehost commands do nothing more than
> "edit" the files themselves, like a text editor, then they are
> "currently" only practice policy.  It sounds to me instead that they
> actually do more than edit the files, because you actually have locking
> issues if the file server process is trying to read the CellServDB files
> at the same time you would manually copy over them.  The
> addhost/removehost commands probably stop the file server process from
> reading the files, update them, then allow them to be read again.

You're both confused, at least to some extent.  The addhost/removehost 
commands are indeed just a convenience; they do nothing more than read, 
modify, and update the CellServDB file on the server in question.  If the 
documentation says "always use these commands; never update the CellServDB 
file by hand", it's because that ensures the file's contents will always be 
at least syntactically correct.  There's no chance, for example, of someone 
deciding that '#' is a comment introducer rather than a field separator, 
and either "leaving off the comment" or introducing a "comment line", both 
of which would make the file invalid (I believe our parser is more 
forgiving these days, but at the time that documentation was written it was 
very simplistic).
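To make the '#' point concrete, here is a minimal sketch in Python of a strict syntax check. It is not the OpenAFS parser. It assumes the usual two line shapes (">cellname  #description" and "address  #hostname"), and the name check_cellservdb and both patterns are invented for illustration; the real parser may accept more than this.

```python
import re

# Assumed line shapes (not taken from the OpenAFS sources):
#   >cellname   #description
#   a.b.c.d     #hostname
CELL_RE = re.compile(r"^>(\S+)\s*(?:#(.*))?$")
HOST_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s*#\s*(\S+)\s*$")


def check_cellservdb(text):
    """Return a list of (line_number, message) problems; empty means it looks valid."""
    problems = []
    seen_cell = False
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        if CELL_RE.match(line):
            seen_cell = True
        elif HOST_RE.match(line):
            if not seen_cell:
                problems.append((number, "host entry before any >cell line"))
        elif line.lstrip().startswith("#"):
            # '#' separates fields; it does not start a comment.
            problems.append((number, "'#' comment line is not a valid entry"))
        else:
            # Covers the "left off the comment" case: address with no '#hostname'.
            problems.append((number, "expected 'address #hostname'"))
    return problems
```

Running a hand-edited file through a check like this before installing it catches both mistakes described above: a "comment line" and a host entry with the '#hostname' field left off.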

The bug is not that things break if you edit the CellServDB instead of 
using addhost/removehost.  The bug is that things break if you change the 
CellServDB out from under a buggy server without restarting it.  This 
applies to all servers, not just the fileserver, and the actual bug is in 
the way the KeyFile is reread.  Changing the CellServDB file triggers it 
because that's the mechanism used to trigger re-reading all of the 
server-side config files (in fact, if you are changing the KeyFile, you 
_should_ use the bos commands or other tools that know how to do this, or 
else touch the CellServDB file yourself).
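"Touching" the file just means updating its timestamp without changing its contents. A minimal Python sketch follows; the helper name touch_config is made up, and the path you pass depends on where your installation keeps its server configuration.

```python
import os


def touch_config(path):
    """Set the file's access and modification times to now; contents are untouched."""
    os.utime(path, None)
```

This is the equivalent of running touch(1) on the file from a shell.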

>       1.  The file server process actually reads the CellServDB file very
> often.

No; only when it changes.  But it checks fairly often.
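The "check often, read only on change" pattern can be sketched as below. This is an illustration in Python, not the server's code; it assumes the check compares the file's modification time, and ConfigWatcher and its loader argument are hypothetical names.

```python
import os


class ConfigWatcher:
    """Re-read a file only when its modification time differs from the last load."""

    def __init__(self, path, loader):
        self.path = path
        self.loader = loader      # callable that parses the file's text
        self.mtime_ns = None      # mtime seen at the last successful load
        self.value = None         # result of the last load

    def poll(self):
        """Cheap check (one stat call); returns True only if the file was re-read."""
        current = os.stat(self.path).st_mtime_ns
        if current == self.mtime_ns:
            return False
        with open(self.path) as handle:
            self.value = self.loader(handle.read())
        self.mtime_ns = current
        return True
```

Calling poll() frequently costs one stat each time; the file is opened and parsed only after its timestamp moves, which is why touching the file is enough to force a reload in a scheme like this.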

>       2.  Never copy over the CellServDB file because of #1.

Well, don't edit it (or the KeyFile) in place - always rename a valid file 
into place.
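One way to "rename a valid file into place", sketched in Python. The function name install_file is made up. The temporary file is created in the same directory so that the rename stays on one filesystem, which is what makes it atomic on POSIX systems.

```python
import os
import tempfile


def install_file(path, data):
    """Write data to a temp file next to path, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())   # make sure the bytes are on disk first
        os.replace(tmp_path, path)      # readers see the old file or the new one
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A process that checks the file in the middle of the update never sees a half-written copy. Note that mkstemp creates the file with owner-only permissions, so adjust mode and ownership before the rename if the original file's differ.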

>       3.  Use "bos addhost/removehost" commands to change the CellServDB
> files.

Only if you want to.

>       4.  We should never have to "bos restart" a server to get it to see
> a new cell server if we use bos addhost...right?

Sort of.  The fileserver needs the CellServDB file to register itself with 
the VLDB, which only happens at startup, and to get information from the 
ptserver about clients' PTS group memberships.  It builds a set of 
connections to ptservers based on the contents of the CellServDB file at 
startup, and only rebuilds this list when a ptserver call fails.

In practice, this means you can ignore the issue.  If a ptserver call ever 
fails, the fileserver will eventually pick up the new list; if none fails, 
the old list is still working and it doesn't matter.
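The "build at startup, rebuild only on failure" behavior described above looks roughly like this. It is a sketch in Python, not the fileserver's code: ServerList, read_hosts, and connect are hypothetical names, and retrying the call immediately after the rebuild is an assumption.

```python
class ServerList:
    """Hold connections built from a host list; rebuild them only after a failed call."""

    def __init__(self, read_hosts, connect):
        self.read_hosts = read_hosts   # callable returning the current host list
        self.connect = connect         # callable turning a host into a callable connection
        self.connections = [connect(host) for host in read_hosts()]

    def call(self, request):
        try:
            return self._try_all(request)
        except ConnectionError:
            # Every known server failed: re-read the host list and try once more.
            self.connections = [self.connect(host) for host in self.read_hosts()]
            return self._try_all(request)

    def _try_all(self, request):
        last_error = ConnectionError("no servers configured")
        for connection in self.connections:
            try:
                return connection(request)
            except ConnectionError as error:
                last_error = error
        raise last_error
```

As long as the old servers keep answering, a changed host list is never consulted. The first failure is what causes the new list to be read, which matches the "you can ignore the issue" conclusion.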

>       5.  Never do this on a Friday afternoon.  :)

Well, that's definitely a good policy.

-- Jeff