[OpenAFS] Re: proper way to bring down a file server?

Andrew Deason adeason@sinenomine.net
Wed, 23 Feb 2011 14:09:22 -0600


On Wed, 23 Feb 2011 11:42:27 -0800
Jonathan Nilsson <jnilsson@uci.edu> wrote:

> I have of course tested moving volumes around, and I figured that you
> could replace a server simply by moving all the volumes off the server
> and shutting it down.  Then at your leisure, build a new system from
> scratch, bring it online and move volumes onto it.

Yes.

> I tried this, and it seems to have worked fine for most clients, but
> one user got "connection timed out" when trying to log in from 2
> different CentOS 32-bit clients (one 1.4.14 with dkms modules, and
> another 1.4.12 with the kmod rpm, all from openafs.org).  Other CentOS
> and Ubuntu clients did not have this problem.  Many other volumes were
> moved with no problems at the same time from the old server to the
> same new server as this user's unavailable volume.
> 
> This was fixed with "fs flushmount".  Is this supposed to be necessary
> after you "vos move" a volume?

No. At what point did this problem occur? While volumes were being
moved, after they had been moved, after you turned off the server, ... ?

Clients cache the location of volumes for about 2 hours. This is usually
fine, since if they get the location wrong the fileserver will tell the
client, and the client will look up the fresh location. But if you move
a volume off of a server and then immediately turn it off, a client may
be slow to find the new location, since it tries to contact the old,
downed, fileserver first. So you may benefit from leaving the old,
empty, fileserver online for a few hours, and then turning it off.
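
As a rough sketch of that order of operations (not something tested
against your cell): the script below only prints the commands rather
than running them. The server names and the partition are placeholders
I made up; users.glang is just the volume from your log. Note that
'bos shutdown' stops the server processes, not the machine itself.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of running them (dry run).
# Server names and partition are placeholders, not taken from this thread.
OLD_SERVER="old-fs.example.com"
NEW_SERVER="new-fs.example.com"
PARTITION="a"

# Print the 'vos move' command for one volume.
move_cmd() {
    echo "vos move -id $1 -fromserver $OLD_SERVER -frompartition $PARTITION -toserver $NEW_SERVER -topartition $PARTITION"
}

# Print the whole drain plan for a list of volume names.
drain_plan() {
    for vol in "$@"; do
        move_cmd "$vol"
    done
    # Check that nothing is left on the old server.
    echo "vos listvol -server $OLD_SERVER"
    # Give clients time to notice the new locations before shutting down.
    echo "# leave $OLD_SERVER running for a few hours, then:"
    echo "bos shutdown -server $OLD_SERVER -wait"
}

drain_plan users.glang
```

The wait before the last step is the important part; the rest is what
you already did.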

However, the client should recover from this on its own. I would expect
'fs checkvolumes' to resolve things more quickly, but 'fs flushmount'
may also help, as you've found.
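
On an affected client, that would look something like the sketch below.
The path is a made-up example; use whatever directory is the mount point
for the stuck volume. By default the script only prints the commands;
set DRY_RUN=0 on a real AFS client to run them.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default here) prints the commands;
# set DRY_RUN=0 on an AFS client to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Either print a command or execute it, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Ask the client to refresh its cached volume information, then flush
# the cached mount point information for one directory.
refresh_volume_location() {
    run fs checkvolumes
    run fs flushmount -path "$1"
}

# Hypothetical path, for illustration only.
refresh_volume_location /afs/example.com/user/someuser
```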

> Here is a curious entry in VolserLog on the new server which may be of
> interest (though there are other similar messages for the other
> volumes that I moved):
> 
> VolserLog:Tue Feb 22 15:58:13 2011 VAttachVolume: Failed to open
> /vicepa/V0536870955.vol (errno 2)
> VolserLog:Tue Feb 22 15:58:13 2011 1 Volser: CreateVolume: volume 536870955
> (users.glang) created

This is fine. When you do a volume move, 'vos' checks if the destination
already exists; that first message is just the "error" that the volume
does not already exist there.

> I also tried "fs checkservers" and the two problematic clients both
> reported "These servers unavailable due to network or server problems:
> athens.ss2k.uci.edu" - athens being the old server that I removed.
> All other clients seem happy and report "All servers are running."

Clients remember every fileserver they have contacted until the client
is restarted. If you move all volumes off of a server and turn the
server off, it stays in the client's list of fileservers. This is
generally harmless (at most cosmetic): the only effect is that the
client tries to 'ping' the downed fileserver when you run
'fs checkservers', and again during its periodic check of all the
fileservers it knows about.

> Am I supposed to remove athens from the VLDB with "vos changeaddr
> -oldaddr <athens IP> -remove"? I will build a new "athens" server, but
> am waiting for new hardware to arrive, so it may be a few weeks.

You don't need to do that, but it shouldn't hurt anything. When you
bring the new "athens" server online, it will tell the VLDB that it is
the new "athens" server, and will replace the existing entry.

-- 
Andrew Deason
adeason@sinenomine.net