[OpenAFS-devel] Re: bos killed fileserver before it was shut down cleanly.

Jeffrey Hutzelman jhutz@cmu.edu
Tue, 12 Oct 2010 15:29:49 -0400


--On Tuesday, October 12, 2010 01:26:54 PM -0500 Andrew Deason 
<adeason@sinenomine.net> wrote:

> My worry about this kind of approach is when there's some kind of bug
> that causes the "shutdown the volumes" loop to continue, but we don't
> actually make any kind of progress. That is, we keep trying to offline
> the same volume over and over again or something. I'd rather something
> that keeps a count of offlined volumes and total number of volumes (or
> something like that), so invalid counts can be aborted immediately.

Well, I didn't say what the heartbeat format is...
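
For instance, the heartbeat could carry exactly those counts. A minimal
sketch, assuming a hypothetical one-line "offlined/total" text format;
the format and the names Heartbeat, parse_heartbeat and check are made up
for illustration and are not an existing fileserver or bosserver interface:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Heartbeat:
    offlined: int  # volumes taken offline so far
    total: int     # volumes the fileserver expects to take offline


def parse_heartbeat(line: str) -> Heartbeat:
    """Parse a hypothetical 'offlined/total' heartbeat line."""
    offlined_s, total_s = line.strip().split("/")
    return Heartbeat(int(offlined_s), int(total_s))


def check(prev: Optional[Heartbeat], cur: Heartbeat) -> str:
    """Classify a heartbeat as 'invalid', 'stalled', or 'progress'."""
    # Counts that cannot be right can be rejected immediately.
    if cur.offlined < 0 or cur.total < 0 or cur.offlined > cur.total:
        return "invalid"
    # Assumption: the offlined count never goes backwards during shutdown.
    if prev is not None and cur.offlined < prev.offlined:
        return "invalid"
    # Alive, but no new volume offlined since the previous heartbeat.
    if prev is not None and cur.offlined == prev.offlined:
        return "stalled"
    return "progress"
```

Note that "stalled" here is only a classification; how long to tolerate
it, and whether to act on it at all, is left to whoever is watching.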

> Of course, see above on dealing with unlikely corner cases that aren't
> actually a problem... What you describe sounds easy, but it's going to
> (potentially) screw up launching the fileserver process outside of
> bosserver, which I like to be able to do (easier to attach a debugger on
> startup that way).

Oh, no; that's why I suggested an environment variable.  It allows the 
fileserver to work with a caller that isn't prepared to receive heartbeats, 
and it allows the bosserver to advertise the feature without breaking 
fileservers that don't support it.



> I'm not saying to also remove an external timeout in bosserver, though.
> Just that the fileserver itself could have a much finer-grained timeout
> (adjusting for # of volumes, or the last internal heartbeat, etc) with
> bosserver having a larger unconditional one.

I hate to be too aggressive.  Sometimes I have servers that take a long time 
to shut down because the disk is being flaky and maybe only 10% or fewer of 
disk operations succeed without some retryable error.  In such a situation, 
progress isn't made very quickly, but I'd sure hate for "something" to 
decide that we're not making fast enough progress and shoot the fileserver 
in the head before it can write everything out.

Really, what it boils down to is that in most such cases, if the server is 
alive, it's up to me to decide that it's taking too long.