[OpenAFS] Minor question on moving AFS db servers

Wed, 29 Oct 2014 19:02:10 +0100

On 10/29/2014 06:47 PM, Garance A Drosehn wrote:
> Hi.
>
> We have AFS db servers on some ancient hardware, and decided to move
> them to be virtual machines on much newer hardware.  I've moved one
> of them already, and the final result seems to be fine.  There was
> one minor oddity during the physical-to-virtual move which was a
> little worrisome, so I thought I'd ask if there was some other step
> that I should do.
>
> We have four machines running as AFS DB servers, and we're virtualizing
> only one of those per day.
>
> What I did was get a list of running AFS processes via 'bos status'.
> I then did a 'bos stop -wait' for each of those processes (kaserver,
> buserver, ptserver, vlserver, upclient, etc).  We then did the P2V copy
> to make a duplicate of the running system into a virtual-machine.
> After checking that copied system image, we disconnected the older
> hardware-based image from the network, brought up the VM copy, and
> I then 'bos start'-ed all the AFS processes which had been 'stop'-ed
> before the copy was done.  Once those AFS processes were running in
> the VM-based image, everything seems perfectly fine.
>
> The oddity is that during the time that the AFS processes were not
> running on either machine, AFS access on many of our AFS clients
> was pretty slow.  Everything worked, but much slower than normal.
> I'm pretty sure the delay was all in the lookup-step, and that if
> some AFS client already had a file open in AFS then I/O performance
> to that file was fine.
>
> Was there some step I should have done so all AFS clients would know
> that the DB server was gone, so they shouldn't wait around for replies
> from it?

I went through something similar; here is my understanding (corrections 
welcome!):
  AFS clients-as-in-the-kernel-module have a preferred VL server to 
talk to (see "fs getserverprefs -vlservers"), but should figure out after 
~60sec that it is gone and then switch to the next one (and not 
come back until they restart, or until their newly preferred DB server 
also becomes unreachable).
AFS clients-as-in-userspace-tools (vos examine, pts) contact a random 
DB server on each invocation, so in your case each run has a 1/4 chance 
of waiting on the stopped server (no "learning" across invocations).
And indeed, once a client has already located a particular volume, it 
should not notice the DB server outage.

AFAIK there is no gentle way to pre-announce "this one is going away". 
You could push a new CellServDB before every update, and run "fs 
setserverprefs -vlservers" to penalize the machine that is going away 
(or restart the AFS clients), but in our case we didn't do this.
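
For what it's worth, the setserverprefs idea could look roughly like the 
sketch below, run as root on each client before the DB server is stopped. 
The hostname afsdb1.example.org, the rank 65000, the APPLY switch and the 
penalize_cmd helper are all made up for illustration; only the two fs 
subcommands are real. Higher rank numbers mean "less preferred". As far 
as I know these ranks live only in the running cache manager, so they 
would have to be set again after a client restart.

```shell
#!/bin/sh
# Sketch: demote one DB server in this client's VL server preferences
# before that server is taken down. Hostname and rank are hypothetical.

DEPARTING="afsdb1.example.org"   # hypothetical: the DB server going away
RANK=65000                       # assumption: a high rank = least preferred

# Build the command line for a given server and rank (illustrative helper).
penalize_cmd() {
    printf 'fs setserverprefs -vlservers %s %s\n' "$1" "$2"
}

if [ "${APPLY:-0}" = "1" ]; then
    # Real run: needs root and a running AFS client on this machine.
    fs getserverprefs -vlservers                       # ranks before
    fs setserverprefs -vlservers "$DEPARTING" "$RANK"  # demote the server
    fs getserverprefs -vlservers                       # ranks after
else
    # Default: dry run, only print the command that would be executed.
    penalize_cmd "$DEPARTING" "$RANK"
fi
```

Without APPLY=1 the script only prints the command, which makes it easy 
to check what would be pushed to the clients before doing it for real.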

Cheers
jan