[OpenAFS] Re: bos killed fileserver before it was shut down cleanly.

Andrew Deason adeason@sinenomine.net
Sun, 10 Oct 2010 21:00:44 -0500


On Sun, 10 Oct 2010 12:36:09 -0700
Russ Allbery <rra@stanford.edu> wrote:

> Adam Megacz <adam@megacz.com> writes:
>
> > Just curious, is this "stall" a bug in the fileserver, or something
> > which happens for a good reason?  If so, what is the reason?
> 
> It happens, in my experience, when there are hundreds of thousands of
> open callbacks, often to hosts behind NAT that are now unreachable and
> produce UDP timeouts.  The fileserver tries to break all those
> callbacks, which if left to run to completion can take many hours.

Unless I'm just out of it today and not comprehending, we don't break
callbacks on shutdown. Instead, for non-DAFS, we send InitCallBackState
to each client when we contact it after the restart. However, we do
wait for client traffic to stop, which includes at least some callback
breaks.

There have been bugs that cause hangs too, though. Notably, a few
versions leading up to 1.4.12 had issues that corrupted the host list,
which caused some deadlocks.

-- 
Andrew Deason
adeason@sinenomine.net