[OpenAFS] Volume access problems when DB server downed

Erwin Broschinski broschi@id.ethz.ch
Thu, 07 Mar 2002 21:02:23 +0100 (MET)


Thanks a lot for your detailed answer. Some remarks of mine follow below:

On 07-Mar-2002 Craig_Everhart@transarc.com wrote:
| Excerpts from mail: 7-Mar-02 Re: [OpenAFS] Volume access.. Erwin
| Broschinski@id.eth (564*)
|> No, I left the other 4 Servers untouched - apart from a weekly restart of
|> bos.
| You didn't change the IP address in the other servers' CellServDB files?
|  How did you expect Ubik, which is a server-to-server protocol, to keep
| working?
  The server I took down came back up during maintenance, but with no AFS server
processes running and with a different name and IP address, to make sure nobody
out there would find any trace of the old DB server. Ubik worked as expected and
a new sync site was elected. It is the same situation as when a DB server simply
disappears: I had five of them and then four. Would you then have to edit the
CellServDB on the Ubik machines manually?
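
  To make the five-versus-four case concrete, here is a rough sketch of the
majority rule in Python. It assumes a plain "more than half of the configured
sites" test and ignores any tie-breaking Ubik may apply; the function name
has_quorum is made up for illustration.

```python
def has_quorum(configured_sites: int, reachable_sites: int) -> bool:
    """Simplified rule (assumption): a strict majority of the configured
    DB servers must be reachable. Real Ubik has extra tie-breaking details
    that are deliberately left out here."""
    return 2 * reachable_sites > configured_sites


if __name__ == "__main__":
    # Five DB servers are still listed in CellServDB, one of them is gone.
    for reachable in range(5, 0, -1):
        print(reachable, "reachable of 5:", has_quorum(5, reachable))
```

  Under this simplified rule, four reachable servers out of five configured
ones still form a majority, which matches a new sync site being elected.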

|> What happens, if any client or server directs a request to a server in its
|> CellServDB and finds that one dead? I assumed, it would try the next in the
|> list. But this was not observed in all cases.
| There are two issues.
| The client will indeed try the next one in its list, but there are
| things that only the sync site can do, and the different servers direct
| the client to contact the sync site in those cases, regardless of
| whether the sync site is in the client's CellServDB.  I'm not sure what
| the client will do when it is told to contact a server (the sync site)
| that isn't in its CellServDB, but you should avoid letting this happen.
  There was a functioning sync site at this time: one of the remaining four.

| The other issue is that the Ubik servers know about each other by virtue
| of all of them being listed in the CellServDB, and that a Ubik server
| reads the CellServDB at startup and not subsequently.  Given that Ubik
| servers talk to each other to communicate database versions and carry
| out quorum elections and generally run the DB maintenance protocols,
| each server needs to know about all of the other Ubik sites.  Thus it's
| particularly important not only to update the CellServDB on all DB
| servers when you change an IP address for one of them, but also to
| restart all the Ubik servers (vlserver, ptserver, kaserver if you use
| it, the backup server) after changing the CellServDB.
  I thought the reason for having more than one DB server is redundancy, meaning
that the others *automagically* take over the job if one server fails. And this
only works if I reconfigure the remaining servers manually? Good to know; I
will try this next time.
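
  For next time, a small check like the following might help to spot a DB server
whose CellServDB differs from the others before the Ubik processes are
restarted. It is only a sketch: it assumes the usual CellServDB layout (a
">cellname" line followed by one "address #hostname" line per DB server), and
the cell name, the labels db1/db2 and the addresses are invented examples.
Fetching the file from each server is left out.

```python
def parse_cellservdb(text: str) -> dict:
    """Return {cell name: set of DB server addresses} from CellServDB text."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing "#comment"
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                continue
            current = parts[0]
            cells.setdefault(current, set())
        elif current is not None:
            cells[current].add(line.split()[0])
    return cells


def stale_entries(copies: dict, cell: str) -> dict:
    """Map each label to the addresses it lists that not every copy lists."""
    per_host = {label: parse_cellservdb(text).get(cell, set())
                for label, text in copies.items()}
    common = set.intersection(*per_host.values()) if per_host else set()
    return {label: addrs - common
            for label, addrs in per_host.items() if addrs - common}


if __name__ == "__main__":
    # Hypothetical copies of the file as found on two DB servers.
    example = {
        "db1": ">example.org #Example cell\n192.0.2.1 #a\n192.0.2.2 #b\n192.0.2.3 #c\n",
        "db2": ">example.org #Example cell\n192.0.2.1 #a\n192.0.2.2 #b\n192.0.2.4 #c\n",
    }
    print(stale_entries(example, "example.org"))
```

  An empty result means all copies list the same DB servers for the cell; any
other result names the copies that need fixing before the restart.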

| There may be some measures in Ubik to try to accommodate administrators
| that don't do this, but they are not likely to be completely successful.
|               Craig