[OpenAFS-devel] Re: dealing with rxevent queue stalls

Andrew Deason adeason@sinenomine.net
Mon, 30 Sep 2013 13:33:29 -0500


On Mon, 30 Sep 2013 13:19:40 -0500
Andrew Deason <adeason@sinenomine.net> wrote:

> I don't think you should be avoiding this for master; no matter how
> much faster or better the code for processing rxevents is, if the
> rxevent handler only gets to run e.g. once every 20 seconds, events
> still only get to run once every 20 seconds. That can always happen
> in LWP if an LWP is not yielding for any reason, and it can happen
> with any threading implementation if there is a bug.

Hmm, also, one more thing that I don't think has come up. I think people
tend to wave away these problems by saying they are caused by hardware
problems, or disks being slow, etc. While I tend to agree that it's not
worthwhile to work around those to fix the local process we're running
in, the impact can be greater than that. For example, I believe one of
the situations that Mark described causes clients to keep contacting a
"sick" server and then getting connection timeouts.

If such weird rxevent behavior slows down or brings down a whole cell,
or a whole replicated volume, it's still our responsibility to handle it
even if it's caused by a hardware fault. If we don't have some way of
handling it, then such a failing fileserver process can be a single
point of failure for data availability, even if volume data is
replicated.
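
To make "some way of handling it" slightly more concrete: a process
could at least notice that its event queue is running late by comparing
when each event was due with when its handler actually ran. Below is a
minimal sketch in C. All names are hypothetical (this is not the rx
API), and it assumes both timestamps come from the same monotonic
clock, expressed in seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stall detector; none of these names exist in OpenAFS. */
struct stall_detector {
    double threshold_secs;  /* lateness beyond which an event counts as stalled */
    double worst_lag_secs;  /* largest lateness observed so far */
    unsigned stall_count;   /* number of events that ran later than the threshold */
};

void
stall_detector_init(struct stall_detector *d, double threshold_secs)
{
    assert(d != NULL);
    d->threshold_secs = threshold_secs;
    d->worst_lag_secs = 0.0;
    d->stall_count = 0;
}

/*
 * Record one event. due_secs is when the event was scheduled to fire and
 * ran_secs is when its handler actually ran. Returns true if the event
 * ran later than the configured threshold.
 */
bool
stall_detector_record(struct stall_detector *d, double due_secs,
                      double ran_secs)
{
    double lag;

    assert(d != NULL);
    lag = ran_secs - due_secs;
    if (lag < 0.0)
        lag = 0.0;              /* an early event is not late */
    if (lag > d->worst_lag_secs)
        d->worst_lag_secs = lag;
    if (lag > d->threshold_secs) {
        d->stall_count++;
        return true;
    }
    return false;
}
```

With an assumed threshold of 5 seconds, an event due at t=20 that runs
at t=40 would be flagged, which matches the "once every 20 seconds"
case quoted above. What the process should do once it sees that is a
separate question.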

I'm not sure how possible/serious such issues are with the behavior
described in this thread; I'll let Mark either correct me or advocate
for this line of reasoning if he wants.

-- 
Andrew Deason
adeason@sinenomine.net