[OpenAFS] Minor question on moving AFS db servers
Jan Iven
jan.iven@cern.ch
Wed, 29 Oct 2014 19:02:10 +0100
On 10/29/2014 06:47 PM, Garance A Drosehn wrote:
> Hi.
>
> We have AFS db servers on some ancient hardware, and decided to move
> them to be virtual machines on much newer hardware. I've moved one
> of them already, and the final result seems to be fine. There was
> one minor oddity during the physical-to-virtual move which was a
> little worrisome, so I thought I'd ask if there was some other step
> that I should do.
>
> We have four machines running as AFS DB servers, and we're virtualizing
> only one of those per day.
>
> What I did was get a list of running AFS processes via 'bos status'.
> I then did a 'bos stop -wait' for each of those processes (kaserver,
> buserver, ptserver, vlserver, upclient, etc). We then did the P2V copy
> to make a duplicate of the running system into a virtual-machine.
> After checking that copied system image, we disconnected the older
> hardware-based image from the network, brought up the VM copy, and
> I then 'bos start'-ed all the AFS processes which had been 'stop'-ed
> before the copy was done. Once those AFS processes were running in
> the VM-based image, everything seems perfectly fine.
>
> The oddity is that during the time that the AFS processes were not
> running on either machine, AFS access on many of our AFS clients
> was pretty slow. Everything worked, but much slower than normal.
> I'm pretty sure the delay was all in the lookup-step, and that if
> some AFS client already had a file open in AFS then I/O performance
> to that file was fine.
>
> Was there some step I should have done so all AFS clients would know
> that the DB server was gone, so they shouldn't wait around for replies
> from it?
We went through something similar; here is my understanding
(corrections welcome!):
AFS clients-as-in-the-kernel-module will have a preferred VLserver to
talk to (fs getserverprefs -vlservers), but should figure out after
~60 sec that that one is gone and then switch to the next one (and not
come back until they restart, or until their newly-preferred DB server
is also unreachable).
AFS clients-as-in-userspace-tools (vos exa, pts) will contact a random
DB server each time, so in your case they have a 1/4 chance of waiting
(no "learning" over several invocations).
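To put a number on that 1/4, here is a minimal Python sketch. It assumes, as described above, that each invocation picks one DB server uniformly at random and remembers nothing between runs. The function names are made up for illustration and are not part of any AFS tool.

```python
from fractions import Fraction

def chance_of_stall(total_servers: int, down_servers: int) -> Fraction:
    """Chance that a single invocation picks a DB server that is down.

    Assumption: the tool picks uniformly at random among all DB servers.
    """
    if total_servers <= 0 or not 0 <= down_servers <= total_servers:
        raise ValueError("need 0 <= down_servers <= total_servers, total_servers > 0")
    return Fraction(down_servers, total_servers)

def chance_of_any_stall(total_servers: int, down_servers: int, invocations: int) -> Fraction:
    """Chance that at least one of several independent invocations stalls."""
    ok = 1 - chance_of_stall(total_servers, down_servers)
    return 1 - ok ** invocations

print(chance_of_stall(4, 1))         # 1/4
print(chance_of_any_stall(4, 1, 3))  # 1 - (3/4)^3 = 37/64
```

So a script that calls vos or pts a few times in a row is more likely than not to hit the stopped server at least once.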
And indeed, once a client has already located a particular volume, it
should not notice the DB server outage.
AFAIK there is no gentle way to pre-announce "this one is going away".
You could push a new CellServDB before every update, and run "fs
setserverprefs -vlservers" to penalize the machine that is going away
(or restart the AFS clients), but in our case we didn't do this.
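For what it's worth, a rough Python sketch of the "penalize" step, which only builds and prints the command line rather than running it. The hostname and the rank value are hypothetical. As far as I read the fs_setserverprefs man page, ranks run from 1 to 65534 and a higher rank means less preferred, and the command has to be issued as root on each client.

```python
import shlex
from typing import List

# Assumption: 65534 is the highest rank fs setserverprefs accepts,
# and higher ranks are less preferred by the Cache Manager.
MAX_RANK = 65534

def penalize_vlserver_cmd(host: str, rank: int = 65000) -> List[str]:
    """Build an 'fs setserverprefs' command that gives one VL server a poor rank."""
    if not 1 <= rank <= MAX_RANK:
        raise ValueError("rank must be between 1 and %d" % MAX_RANK)
    return ["fs", "setserverprefs", "-vlservers", host, str(rank)]

# Hypothetical hostname of the DB server that is about to go away.
cmd = penalize_vlserver_cmd("afsdb1.example.org")
print(shlex.join(cmd))
```

The printed line could then be pushed to the clients by whatever mechanism you already use; "fs getserverprefs -vlservers" afterwards should show whether the new rank took effect.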
Cheers
jan