[OpenAFS-devel] Re: OpenAFS cache manager cold vs warm shutdown

Andrew Deason adeason@sinenomine.net
Thu, 3 Jul 2014 19:23:17 -0500


On Thu, 3 Jul 2014 21:45:47 +0000
Mark Vitale <mvitale@sinenomine.net> wrote:

> (cross-posted to openafs-devel and port-solaris)

I don't think this has anything to do with Solaris, but I'll keep it on
the cc list for now so this doesn't just look responseless. (And in case
this came from some Solaris historical artifact or something.)

> The AFS Unix cache manager has two ways to shutdown:
> 
> - 'afsd -shutdown' requires that /afs already be umounted; it then
> sets afs_cold_shutdown=1 and calls afs_shutdown().
> 
> - 'umount /afs' also calls afs_shutdown() on most platforms.  Some of
> them set afs_cold_shutdown=1 first, while others do not.
> 
> If afs_shutdown() is called "warm" (afs_cold_shutdown==0), the
> shutdown logic skips the clearing and releasing of some resources.  I
> see no rhyme or reason to which resources AFS leaves unreleased.  Nor
> do I understand the (possibly historical) reason for why there is a
> distinction between cold and warm shutdown.

My understanding is that a WARM shutdown is via 'umount' (i.e., the
normal way), and a COLD shutdown is via afsd -shutdown. That is the
definition of those terms; that's all they mean (or maybe the defining
factor is "if /afs is mounted" or some other
similar-yet-technically-different distinction). Note that this (and
everything below) just comes from my own experience with the code, not
from any properly documented or otherwise authoritative source.

Normally you can never shut down via 'afsd -shutdown' while /afs is
mounted. The reason 'afsd -shutdown', and thus the COLD shutdown
procedure, exists at all is:

 - If mounting /afs fails in the middle of initialization, you can't
   umount /afs. So you can 'afsd -shutdown' instead to de-initialize
   some things that came up before afs was mounted.

 - You can run some of the client daemons without actually mounting /afs
   in "normal" operation. I'm not sure if I've ever done this, but I
   have a vague recollection from some code comment or post somewhere
   referencing running the nfs xlator PAG manager this way. Maybe there
   are other reasons, other ways.

Some cleanup procedures differ because during the normal WARM shutdown
procedure, we assume some daemons or other things will clean up after
themselves; but during a COLD procedure we're not sure if they're really
there or functioning properly, so there's some extra cleanup along the
way. I may not be remembering that properly and there may be mistakes;
you'll have to be more specific if you want more info.

I think Russ associates COLD shutdown procedures with brokenness
probably because that is what you'd see if something broke during
initialization. Init scripts usually try both ways of shutting down on
'stop':

umount /afs
afsd -shutdown

This is just to handle both cases. If AFS started successfully, the
'afsd -shutdown' would do nothing, because either the 'umount' would
succeed and we'd already be shut down, or the 'umount' failed and 'afsd
-shutdown' would refuse to run because /afs was still mounted.
Conversely, if AFS did not start successfully, the 'umount' would fail
because /afs is not mounted, and we'd try the 'afsd -shutdown'.
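
To make the "try both" logic concrete, here is a rough sketch of what
such a 'stop' action could look like. This is not taken from any real
init script; the function name afs_stop is made up, and I'm assuming
that both commands report success or failure through their exit status.
Real init scripts differ per platform.

```shell
#!/bin/sh
# Hypothetical 'stop' helper; afs_stop is an illustrative name only.
afs_stop() {
    # WARM path: works only if /afs is actually mounted.
    if umount /afs 2>/dev/null; then
        umount_ok=yes
    else
        umount_ok=no
    fi

    # COLD path: always attempted afterwards, as in the two-line
    # sequence above. Assumption: it exits non-zero when it refuses
    # to run or has nothing to do.
    if afsd -shutdown 2>/dev/null; then
        cold_ok=yes
    else
        cold_ok=no
    fi

    echo "umount=$umount_ok afsd_shutdown=$cold_ok"

    # Treat 'stop' as successful if either path worked.
    [ "$umount_ok" = yes ] || [ "$cold_ok" = yes ]
}
```

The function only reports which path worked; it does not try to decide
up front whether /afs is mounted, which matches the "just run both"
approach described above.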

For unsuccessful starts, in the past, our handling of errors during init
wasn't great (and still isn't now, but it used to be worse), so things
would be left in an inconsistent state, but wouldn't break in a panic or
whatever until we tried to shut down. So a lot of the time on Linux in
the past, whenever you saw a COLD shutdown, it was because something
broke during startup in a way we did not handle, and we panic'd on
shutdown. For WARM shutdowns, we started up successfully so there were
no such issues.

If you look at the platforms that set COLD shutdown during umount, you
just see AIX, obsd, and nbsd, and the 'COLD' shutdown was enabled when
support was added for a certain platform version. I would guess that
doing so is "incorrect", and someone just set it to work around some
problem and never mentioned it, and possibly didn't really know why it
was there. (Of course, I'm not one to talk, since most of what I'm
saying here is also just guessing.)

> Then there is the question of when it is safe to rmmod/modunload the
> libafs kernel module.  Does warm or cold shutdown affect the answer to
> this question?

It's supposed to be safe after either shutdown is finished. But there
can always be bugs with that, and the ability to stop the client at all
on Solaris is only a few years old (which is why some older mentions of
AFS on Solaris say you can't do it), and there has been a bug or two
already in its shutdown support in that time.

And since COLD shutdowns are rarer (from my understanding above), of
course the code path is less exercised and more prone to bugs.

-- 
Andrew Deason
adeason@sinenomine.net