[OpenAFS-devel] volserver / replication question with older version of afs

Russ Allbery rra@stanford.edu
Thu, 02 Feb 2006 12:56:56 -0800


Josh Fiske <jfiske@clarkson.edu> writes:

> We have a cell with three older AFS servers (1.2.11).  They have been
> running great for quite some time.  However, twice in the past two weeks
> the volserver has stopped responding on one of the servers.  When this
> happens, if I do a 'bos status' on the server, it tells me that
> everything is running normally.  But, I know from trying to do a 'vos
> listvol' on the server, that things are not normal, because it times
> out.  Both times this has happened, the server that the volserver died
> on was the sync site for the cell.

The volserver or the vlserver?  I ask only because you mention sync
sites.  The hang you describe is one I know as a volserver problem, and
the volserver has no sync site; only the database servers, such as the
vlserver, do.

If you do mean volserver, this is a 1.2.11 bug.  I think it was fixed in
1.2.13; it's definitely fixed in 1.4.0.
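
Because 'bos status' only reports that the process is running, one way to
notice this kind of hang is to time a real volserver request.  What follows
is a minimal sketch and not something from the original report: the probe()
and check_volserver() helpers, the 30-second timeout, and the host name in
the usage note are assumptions made up for illustration, and it presumes the
vos binary is on the PATH.

```python
import subprocess


def probe(cmd, timeout):
    """Run cmd and classify the outcome as "ok", "error", or "timeout"."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The command did not finish in time, e.g. a wedged volserver.
        return "timeout"
    except OSError:
        # The command could not be started at all (missing binary, etc.).
        return "error"
    return "ok" if result.returncode == 0 else "error"


def check_volserver(server, timeout=30):
    """Hypothetical helper: time an unauthenticated 'vos listvol'."""
    return probe(["vos", "listvol", "-server", server, "-noauth"], timeout)
```

Calling check_volserver("afs1.example.edu") from a periodic job and alerting
on anything other than "ok" would catch the case where the process is alive
but no longer answers requests.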

> Also of note, we have quite a few volumes that are replicated.  When the
> volserver died on the sync site, the read-only replicas were no longer
> accessible.  If a read-only replica is unavailable on one server,
> shouldn't the client know to try one of the others?  I thought this was
> the whole point of replication.

Unfortunately, clients fail over when a server is completely off-line,
but they don't always fail over when the server still answers Rx pings
and nothing else.
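
The difference between the two cases can be shown with a deliberately
simplified toy model.  This is not the cache manager's actual logic; the
Site class and the pick_site() and read() functions are hypothetical names
invented here, and the model assumes a client that judges a site purely by
whether it answers a liveness probe.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    answers_ping: bool  # responds to the liveness probe
    serves_data: bool   # actually completes requests


def pick_site(sites):
    """Return the first site that looks alive, judging only by the probe."""
    for site in sites:
        if site.answers_ping:
            return site
    return None


def read(sites):
    """Attempt one read against whichever site pick_site() selects."""
    site = pick_site(sites)
    if site is None:
        return "no site available"
    if site.serves_data:
        return f"served by {site.name}"
    # The site looked alive, so the toy client never tried another one.
    return f"timed out on {site.name}"
```

In this model, read([Site("a", False, False), Site("b", True, True)])
is served by "b" because "a" fails the probe, while
read([Site("a", True, False), Site("b", True, True)]) times out on "a"
even though "b" is healthy.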

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>