[OpenAFS-devel] Re: bos killed fileserver before it was shut down cleanly.

Tue, 12 Oct 2010 13:26:54 -0500

On Tue, 12 Oct 2010 13:58:43 -0400
Jeffrey Hutzelman <jhutz@cmu.edu> wrote:

> However, for the "human admins want to see what's going on" problem,
> perhaps an RPC interface is better.  It should be a separate Rx
> service (though probably on the same port), and have at least one
> dedicated thread.  And for introspection, it may want to completely
> ignore locks and risk giving out bogus data rather than risking
> deadlock.

Well, that's the most reliable way to do this, sure. I'm just not sure
how much work we really want to expend on this / how many cases to
cover. Practically speaking, I think that, at least recently, any
deadlocks (or just "takes a long time shutting down" cases) on shutdown
have involved VOL_LOCK or H_LOCK, so just avoiding those would be fine
for almost everyone.

Maybe I'm just being lazy, though, and what you describe is the way it
should be done. Personally, I'm looking more at fixing the causes of
such slow shutdowns, at least in the short term.

> You could have the fileserver send periodic signals to its parent
> while shutting down.  Or, provide for an environment variable
> containing the number of a file descriptor over which periodic
> heartbeats should be sent.

My worry about this kind of approach is when there's some kind of bug
that causes the "shut down the volumes" loop to continue, but we don't
actually make any kind of progress. That is, we keep trying to offline
the same volume over and over again or something. I'd rather have
something that keeps a count of offlined volumes and the total number of
volumes (or something like that), so a shutdown with invalid counts can
be aborted immediately.
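
To make the counting idea concrete, here is a rough sketch of what I
mean. Everything in it (struct shutdown_progress, the progress_*
functions, the verdict names) is made up for illustration and does not
exist in the fileserver; it only shows how "no progress" and "invalid
counts" can be told apart from "still working".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical progress record; not an OpenAFS structure. */
struct shutdown_progress {
    unsigned total;      /* volumes attached when shutdown began */
    unsigned offlined;   /* volumes taken offline so far */
    unsigned last_seen;  /* value of 'offlined' at the previous check */
    unsigned stalled;    /* consecutive checks that saw no progress */
};

enum progress_verdict {
    PROGRESS_OK,         /* still working, counts look sane */
    PROGRESS_DONE,       /* every volume has been offlined */
    PROGRESS_STALLED,    /* too many checks without any progress */
    PROGRESS_INVALID     /* counts are impossible; abort immediately */
};

void progress_init(struct shutdown_progress *p, unsigned total)
{
    assert(p != NULL);
    p->total = total;
    p->offlined = 0;
    p->last_seen = 0;
    p->stalled = 0;
}

/* Called by the shutdown loop each time a volume goes offline. */
void progress_volume_offlined(struct shutdown_progress *p)
{
    assert(p != NULL);
    p->offlined++;
}

/* Called periodically by whoever is watching the shutdown. */
enum progress_verdict progress_check(struct shutdown_progress *p,
                                     unsigned max_stalled)
{
    assert(p != NULL);
    if (p->offlined > p->total)
        return PROGRESS_INVALID;
    if (p->offlined == p->total)
        return PROGRESS_DONE;
    if (p->offlined == p->last_seen) {
        if (++p->stalled >= max_stalled)
            return PROGRESS_STALLED;
    } else {
        p->stalled = 0;
        p->last_seen = p->offlined;
    }
    return PROGRESS_OK;
}
```

The point is that a loop that keeps retrying the same volume never bumps
the offlined count, so the checker notices the stall even though the
loop itself is still "alive".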

Of course, see above on dealing with unlikely corner cases that aren't
actually a problem... What you describe sounds easy, but it's going to
(potentially) screw up launching the fileserver process outside of
bosserver, which I like to be able to do (easier to attach a debugger on
startup that way).
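
For what it's worth, the file descriptor variant could be made harmless
for the run-it-by-hand case if a missing variable just turns the
heartbeat into a no-op. A minimal sketch, assuming a variable name I
invented (AFS_SHUTDOWN_HEARTBEAT_FD; nothing in bosserver sets it today)
and two hypothetical helpers:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical variable name, chosen only for this example. */
#define HEARTBEAT_ENV "AFS_SHUTDOWN_HEARTBEAT_FD"

/*
 * Return the descriptor named in the environment, or -1 if the variable
 * is unset or malformed (for example, a fileserver started by hand).
 */
int heartbeat_fd_from_env(void)
{
    const char *val = getenv(HEARTBEAT_ENV);
    char *end;
    long fd;

    if (val == NULL || *val == '\0')
        return -1;
    errno = 0;
    fd = strtol(val, &end, 10);
    if (errno != 0 || *end != '\0' || fd < 0 || fd > INT_MAX)
        return -1;
    return (int)fd;
}

/*
 * Send one heartbeat byte. With fd < 0 this does nothing and reports
 * success, so callers need no special case when no parent is listening.
 * Returns 0 on success (or no-op), -1 if the write failed.
 */
int heartbeat_send(int fd)
{
    const char beat = '.';
    ssize_t n;

    if (fd < 0)
        return 0;
    do {
        n = write(fd, &beat, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}
```

This still has the "heartbeats without progress" problem described
above; it only addresses the concern about launching outside bosserver.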

> > Or, as I've mentioned before, if the timeout code is just added to
> > the fileserver itself, this isn't a problem.
> 
> No; the idea is to KILL KILL KILL the fileserver (or any other server)
> if it doesn't shut down in a reasonable time.  That has to be done
> outside; a process that is hung isn't going to kill itself.

If the first thing we do on shutdown is spawn a thread that just
abort()s after N seconds, or N seconds after not receiving a signal or
whatever, it's hard for me to see what can go wrong with it. Of course
memory corruption and other "anything-can-happen" scenarios could screw
it up, but really...
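
Roughly what I have in mind, as a sketch with plain pthreads. None of
these names (struct watchdog, watchdog_start, watchdog_kick,
watchdog_stop) exist in OpenAFS, and the abort_on_expire flag is only
there so the thing can be exercised without dumping core. The watchdog
only ever takes its own private lock, so it does not depend on VOL_LOCK
or H_LOCK being available.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical watchdog state; none of these names exist in OpenAFS. */
struct watchdog {
    pthread_t thread;
    pthread_mutex_t lock;   /* private lock, never VOL_LOCK or H_LOCK */
    pthread_cond_t cond;
    unsigned timeout_secs;  /* N: seconds allowed between kicks */
    unsigned long kicks;    /* bumped by watchdog_kick() */
    int done;               /* set once shutdown finished cleanly */
    int expired;            /* set when the deadline passed */
    int abort_on_expire;    /* 0 only for dry runs and tests */
};

static void *watchdog_main(void *arg)
{
    struct watchdog *wd = arg;

    pthread_mutex_lock(&wd->lock);
    while (!wd->done) {
        unsigned long seen = wd->kicks;
        struct timespec deadline;
        int rc = 0;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += wd->timeout_secs;
        /* Sleep until kicked, stopped, or the deadline passes. */
        while (!wd->done && wd->kicks == seen && rc == 0)
            rc = pthread_cond_timedwait(&wd->cond, &wd->lock, &deadline);
        if (!wd->done && wd->kicks == seen) {
            /* N seconds went by with no kick: give up. */
            wd->expired = 1;
            if (wd->abort_on_expire)
                abort();
            break;
        }
    }
    pthread_mutex_unlock(&wd->lock);
    return NULL;
}

/* Start the watchdog; returns 0 or a pthread_create() error number. */
int watchdog_start(struct watchdog *wd, unsigned timeout_secs,
                   int abort_on_expire)
{
    wd->timeout_secs = timeout_secs;
    wd->kicks = 0;
    wd->done = 0;
    wd->expired = 0;
    wd->abort_on_expire = abort_on_expire;
    pthread_mutex_init(&wd->lock, NULL);
    pthread_cond_init(&wd->cond, NULL);
    return pthread_create(&wd->thread, NULL, watchdog_main, wd);
}

/* Report progress; restarts the N-second countdown. */
void watchdog_kick(struct watchdog *wd)
{
    pthread_mutex_lock(&wd->lock);
    wd->kicks++;
    pthread_cond_signal(&wd->cond);
    pthread_mutex_unlock(&wd->lock);
}

/* Shutdown finished cleanly; stop the watchdog and reap its thread. */
void watchdog_stop(struct watchdog *wd)
{
    pthread_mutex_lock(&wd->lock);
    wd->done = 1;
    pthread_cond_signal(&wd->cond);
    pthread_mutex_unlock(&wd->lock);
    pthread_join(wd->thread, NULL);
}
```

If watchdog_kick() is never called, this degrades to the simple "abort
after N seconds" case; calling it after each offlined volume gives the
"N seconds after not receiving a signal" behavior.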

I'm not saying to also remove an external timeout in bosserver, though.
Just that the fileserver itself could have a much finer-grained timeout
(adjusting for # of volumes, or the last internal heartbeat, etc) with
bosserver having a larger unconditional one.
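
As an illustration of "adjusting for # of volumes", something along
these lines would do. The function name, the parameters and the 10%
margin are all assumptions on my part, not existing fileserver or
bosserver options:

```c
#include <assert.h>

/*
 * Hypothetical helper: pick the fileserver's internal shutdown timeout.
 * It grows with the number of attached volumes but is capped about 10%
 * below the external (bosserver-side) timeout, so the internal timer
 * fires first and the external one remains the unconditional backstop.
 */
unsigned shutdown_timeout_secs(unsigned nvols, unsigned base_secs,
                               unsigned per_vol_msecs,
                               unsigned external_secs)
{
    unsigned long long t;
    unsigned cap;

    assert(external_secs >= 10);
    cap = external_secs - external_secs / 10;
    t = base_secs + (unsigned long long)nvols * per_vol_msecs / 1000;
    return t > cap ? cap : (unsigned)t;
}
```

With a 60 second base, 50 ms per volume and a one hour external timeout,
10000 volumes would get 560 seconds, and a very large volume count would
be clamped to 3240 seconds.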

-- 
Andrew Deason
adeason@sinenomine.net