[OpenAFS] proper way to bring down a file server?

Jeffrey Altman jaltman@secure-endpoints.com
Thu, 24 Feb 2011 03:22:19 +0100

On 2/23/2011 8:42 PM, Jonathan Nilsson wrote:
> Hello, first I'd like to say that I'm loving how AFS has simplified so
> many sysadmin tasks; thank you so much to all the AFS developers for
> making a great product that just keeps getting better!
> I have of course tested moving volumes around, and I figured that you
> could replace a server simply by moving all the volumes off the server
> and shutting it down.  Then at your leisure, build a new system from
> scratch, bring it online and move volumes onto it.
> I tried this, and it seems to have worked fine for most clients, but one
> user got "connection timed out" when trying to log in from 2
> different CentOS 32-bit clients (one 1.4.14 with dkms modules, and
> another 1.4.12 with the kmod rpm, all from openafs.org
> <http://openafs.org>).  Other CentOS and Ubuntu clients did not have
> this problem.  Many other volumes were moved with no problems at the
> same time from the old server to the same new server as this user's
> unavailable volume.
> This was fixed with "fs flushmount".  Is this supposed to be necessary
> after you "vos move" a volume?

The cache manager looks up volume location information and considers it
valid for approximately two hours, at which point the data is supposed to
be discarded.  When a volume is moved, the clients are not notified of
the move.  Instead, the next time the client attempts to contact the
file server it believes the volume is located on, the file server is
supposed to return either a VNOVOL (volume not attached) or VMOVED
(volume has been moved) error.  When the clients receive these errors,
they issue a new query to the volume location database servers and retry
the request.
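
As a rough illustration of the lookup-and-retry behavior described above,
here is a toy model in Python.  It is only a sketch, not OpenAFS code: the
class and function names (VolumeLocationCache, fetch, vldb_lookup,
contact_server) are hypothetical, and the two-hour lifetime is the
approximate figure given above.

```python
import time
from enum import Enum


class FsError(Enum):
    """Error names taken from the message; numeric codes are omitted."""
    VNOVOL = "volume not attached"
    VMOVED = "volume has been moved"
    VOFFLINE = "volume offline"


# Approximate lifetime of cached volume location data, in seconds.
LOCATION_TTL = 2 * 60 * 60


class VolumeLocationCache:
    """Hypothetical stand-in for the client's volume location cache."""

    def __init__(self, vldb_lookup, clock=time.monotonic):
        self._lookup = vldb_lookup   # callable: volume name -> server name
        self._clock = clock
        self._entries = {}           # volume name -> (server, fetched_at)

    def server_for(self, volume):
        entry = self._entries.get(volume)
        # Query the VLDB only when nothing is cached or the entry expired.
        if entry is None or self._clock() - entry[1] >= LOCATION_TTL:
            entry = (self._lookup(volume), self._clock())
            self._entries[volume] = entry
        return entry[0]

    def invalidate(self, volume):
        self._entries.pop(volume, None)


def fetch(cache, volume, contact_server):
    """contact_server(server, volume) returns data or an FsError."""
    result = contact_server(cache.server_for(volume), volume)
    if result in (FsError.VNOVOL, FsError.VMOVED):
        # These two errors trigger a fresh VLDB query and one retry.
        cache.invalidate(volume)
        result = contact_server(cache.server_for(volume), volume)
    # Any other error, VOFFLINE included, is returned without a new query.
    return result
```

In this model a VMOVED reply from the old server leads to exactly one extra
VLDB lookup, after which the request goes to the new server.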

I say "supposed to" because depending on which file server version is in
use it may contain one of the bugs that results in a VOFFLINE error
being returned instead of VNOVOL or VMOVED.  When a VOFFLINE error
occurs the clients are not supposed to issue a new query to the VLDB
servers.  In a similar vein, if the file server is inaccessible, the
client does not issue a new VLDB query.  Therefore, it is important that
file servers being vacated for maintenance not be shut down until after
the approximately two-hour window has passed.
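
The waiting rule can be written down as a small calculation.  The sketch
below assumes you record when the last "vos move" off the server finished;
the function names and the 15-minute safety margin are my own hypothetical
choices, and the two hours is the approximate window described above.

```python
from datetime import datetime, timedelta

# Approximate lifetime of cached volume location data on clients.
LOCATION_LIFETIME = timedelta(hours=2)
# Hypothetical extra margin, since the two-hour figure is approximate.
SAFETY_MARGIN = timedelta(minutes=15)


def earliest_shutdown(last_move_finished, margin=SAFETY_MARGIN):
    """Earliest time at which clients should no longer hold stale locations."""
    return last_move_finished + LOCATION_LIFETIME + margin


def safe_to_shut_down(last_move_finished, now=None):
    """True once the waiting window (plus margin) has elapsed."""
    if now is None:
        now = datetime.now()
    return now >= earliest_shutdown(last_move_finished)
```

For example, if the last move finished at 20:00, this sketch would have you
keep the old file server running until at least 22:15.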

That being said, it does not sound as if that is the problem you
encountered, since you indicate that "fs flushmount" solved the problem.
"fs flushmount" does not cause the volume location data to become
invalid.  I suspect the clients which experienced the problem actually
had bad mount point target data in the cache.  If you see this problem
in the future, issue "fs listmount" first to confirm that the mount
point is in fact referring to the correct volume.
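
If you want to script that check, something like the following might do.
It is a sketch under assumptions: it assumes the command prints a line of
the form shown in the comment, it uses the "fs lsmount -dir" spelling of
the command, and the helper names and any paths you pass in are
hypothetical.

```python
import re
import subprocess

# Assumed output form:
#   '<directory>' is a mount point for volume '#<volume>'
# where '#' marks a regular mount point, '%' a read/write one, and the
# volume name may be prefixed with '<cell>:'.
MOUNT_RE = re.compile(
    r"^'(?P<dir>.*)' is a mount point for volume "
    r"'(?P<type>[#%])(?:(?P<cell>[^:']+):)?(?P<volume>[^']+)'$"
)


def parse_lsmount(line):
    """Return a dict with dir, type, cell and volume, or None if no match."""
    match = MOUNT_RE.match(line.strip())
    return match.groupdict() if match else None


def mount_point_volume(path):
    """Run the command for one directory and return the target volume name."""
    out = subprocess.run(
        ["fs", "lsmount", "-dir", path],
        capture_output=True, text=True, check=True,
    ).stdout
    parsed = parse_lsmount(out)
    return parsed["volume"] if parsed else None


def mount_point_is_correct(path, expected_volume):
    """Compare the mount point's target with the volume you expect."""
    return mount_point_volume(path) == expected_volume
```

A mismatch between the reported volume and the one you expect would point
at the mount point rather than at stale volume location data.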

Jeffrey Altman
