[OpenAFS-devel] Re: dealing with rxevent queue stalls

Mark Vitale mvitale@sinenomine.net
Wed, 23 Oct 2013 15:23:21 +0000 (UTC)


I'm just getting back to this after a few weeks of distractions...

Andrew Deason <adeason <at> sinenomine.net> writes:
> 
> On Mon, 30 Sep 2013 13:19:40 -0500
> Andrew Deason <adeason <at> sinenomine.net> wrote:
> 
> > I don't think you should be avoiding this for master; no matter how
> > fast/better the code for processing rxevents is, if the rxevent
> > handler only gets to run, e.g., once every 20 seconds, it still
> > only gets events to run once every 20 seconds. That can always happen
> > in LWP if an LWP is not yielding for any reason, and it can happen
> > with any threading implementation if there is a bug. 
> 
> Hmm, also, one more thing I don't think has come up. I think people tend
> to wave away these problems when saying these are caused by hardware
> problems, or disks being slow, etc etc. While I tend to agree that it's
> not worthwhile to work around those to fix the local process we're
> running in, the impacts can be greater than that. For example, one of
> the situations that Mark described I believe causes clients to keep
> contacting a "sick" server, and then getting connection timeouts.
Yes, the case you are describing was initially reported as poor performance
for many clients contacting the "sick" fileserver.  Only a few of the
affected clients recognized that the server was "down" (due to delays and
timeouts caused by rxevent); but even those clients immediately marked the
server "back up" when they received undelayed responses to RXAFS_GetTime
probes.  So they would continue to contact the "sick" fileserver.
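
To make the flapping concrete, here is a minimal toy model of the client
behavior described above.  It is only a sketch: the class and function
names are hypothetical and this is not cache manager code.  It assumes
that regular RPCs time out while the lightweight probe is answered
promptly.

```python
# Toy model of the client-side up/down flapping described above.
# All names are hypothetical; this is not OpenAFS cache manager code.

class ServerState:
    def __init__(self):
        self.up = True
        self.transitions = []  # history of (event, resulting state)

    def on_rpc_timeout(self):
        # A regular RPC, delayed by the stalled rxevent queue, times out.
        self.up = False
        self.transitions.append(("rpc_timeout", "down"))

    def on_probe_reply(self):
        # A lightweight probe (e.g. RXAFS_GetTime) is answered without
        # delay, so the client marks the server up again.
        self.up = True
        self.transitions.append(("probe_reply", "up"))


def simulate(rounds):
    """Alternate between using the server and probing it."""
    state = ServerState()
    for _ in range(rounds):
        if state.up:
            state.on_rpc_timeout()   # client keeps using the sick server
        else:
            state.on_probe_reply()   # probe succeeds immediately
    return state


state = simulate(4)
print(state.transitions)
```

In this model the client never settles on "down", so it never has a
reason to fail over to another fileserver.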

> If such weird rxevent behavior slows down or brings down a whole cell,
> or a whole replicated volume, it's still our responsibility to handle it
> even if it's caused by a hardware fault. If we don't have some way of
> handling it, then such a failing fileserver process can be a single
> point of failure for data availability, even if volume data is
> replicated.
Agreed, that's essentially what happened in the "sick" fileserver case, and
why I realized this can really only be addressed on the server side; most
clients never realize that anything is wrong or that they should switch to
another fileserver.
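
For reference, the stall effect itself (the point quoted at the top: a
handler that only runs once every 20 seconds can only fire events once
every 20 seconds) can be modeled in a few lines.  This is a hypothetical
toy, not rx code; the function name and the fixed handler period are
assumptions made for illustration.

```python
import math

def fire_times(due_times, handler_period):
    """Return when each event actually fires if the event handler only
    runs at multiples of handler_period seconds (toy model)."""
    fired = []
    for due in due_times:
        # An event cannot fire before the first handler run at or
        # after its due time.
        run = math.ceil(due / handler_period) * handler_period
        fired.append(run)
    return fired


due = [1, 5, 19, 21]            # seconds at which events are due
actual = fire_times(due, 20)    # handler only runs every 20 seconds
delays = [a - d for a, d in zip(actual, due)]
print(actual)
print(delays)
```

However cheap the per-event processing is, each event in this model
waits for the next handler run.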

> I'm not sure how possible/serious such issues are with the behavior
> described in this thread; I'll let Mark either correct me or advocate
> for this line of reasoning if he wants.
Done.  Thanks, Andrew.

--
Mark Vitale
mvitale@sinenomine.net