[OpenAFS-devel] dealing with rxevent queue stalls

Tue, 24 Sep 2013 08:02:53 -0400

--089e011827aead197e04e71fea98
Content-Type: text/plain; charset=ISO-8859-1

On Mon, Sep 23, 2013 at 3:52 PM, Mark Vitale <mvitale@sinenomine.net> wrote:

> Recently I've been working on several problems with very different
> externals but a similar root cause:
>
> 1) While accessing a particular fileserver, AFS clients experience
> performance delays; some also see multiple "server down/back up" problems.
>   - root cause was a hardware bug on the fileserver that prevented timers
> from firing reliably; this unpredictably delayed any task in the rxevent
> queue, while leaving the rest of the fileserver function relatively
> unaffected.  (btw, this was a pthreaded fileserver).
>
> 2) Volume releases suffer from poor performance and occasionally fail with
> timeouts.
>   - root cause was heavier-than-normal vlserver load (perhaps caused by
> disk performance slowdowns); this starved LWP IOMGR, which in turn
> prevented LWP rx_Listener from being dispatched (priority inversion),
> leading to a grossly delayed rxevent queue.
>
> So in two very different situations, the rxevent queue was unable to
> process scheduled events in a timely manner, leading to very strange and
> difficult-to-diagnose symptoms.
>
> I'm writing this note to begin a discussion on possible ways to address
> this in OpenAFS.
>
> One possible approach is to implement some watchdog/sentinel code to
> detect when the rxevent queue is not working correctly; that is, when it's
> unable to run scheduled events in a timely manner.   Certainly rxevent
> can't watch itself; but rather than adding another thread as a watchdog, I
> chose to insert a sanity check into rxevent_Post().  This check essentially
> compares the current time (if supplied on the "now" parameter) with the
> scheduled time for the top rxevent on the queue.  If it's later than a
> certain threshold, then we know that the rxevent queue has fallen behind
> (is "sick") for some unknown reason.  At this point, I set a state flag
> which causes any new connections to abort (with timeout or busy, for
> example).  Both the threshold and reply could be configurable, similar to
> the current implementation of the -busyat thread-busy threshold and
> response.  After the rxevent queue is able to catch up with its scheduling
> work, the "sick" state is reset.  And lastly, warning messages could be
> written to the log to indicate that the rxevent queue is having
> difficulties and later has returned to normal.  I have some prototype code
> working in my test environment; it needs some work before it will be
> suitable for review.
>
>
I haven't profiled rxevent queue handling in about 5 years. With your code
does the queue appear "sick" on a normally functioning host if you assume
that threshold is 0 (if now is later than the top scheduled event, assume
it should have already fired)?

-- 
Derrick

--089e011827aead197e04e71fea98
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><br><div class=3D"gmail_extra"><br><br><div class=3D"gmail=
_quote">On Mon, Sep 23, 2013 at 3:52 PM, Mark Vitale <span dir=3D"ltr">&lt;=
<a href=3D"mailto:mvitale@sinenomine.net" target=3D"_blank">mvitale@sinenom=
ine.net</a>&gt;</span> wrote:<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex">Recently I&#39;ve been working on several pr=
oblems with very different externals but a similar root cause:<br>
<br>
1) While accessing a particular fileserver, AFS clients experience performa=
nce delays; some also see multiple &quot;server down/back up&quot; problems=
.<br>
=A0 - root cause was a hardware bug on the fileserver that prevented timers=
 from firing reliably; this unpredictably delayed any task in the rxevent q=
ueue, while leaving the rest of the fileserver function relatively unaffect=
ed. =A0(btw, this was a pthreaded fileserver).<br>

<br>
2) Volume releases suffer from poor performance and occasionally fail with =
timeouts.<br>
=A0 - root cause was heavier-than-normal vlserver load (perhaps caused by d=
isk performance slowdowns); this starved LWP IOMGR, which in turn prevented=
 LWP rx_Listener from being dispatched (priority inversion), leading to a g=
rossly delayed rxevent queue.<br>

<br>
So in two very different situations, the rxevent queue was unable to proces=
s scheduled events in a timely manner, leading to very strange and difficul=
t-to-diagnose symptoms.<br>
<br>
I&#39;m writing this note to begin a discussion on possible ways to address=
 this in OpenAFS.<br>
<br>
One possible approach is to implement some watchdog/sentinel code to detect=
 when the rxevent queue is not working correctly; that is, when it&#39;s un=
able to run scheduled events in a timely manner. =A0 Certainly rxevent can&=
#39;t watch itself; but rather than adding another thread as a watchdog, I =
chose to insert a sanity check into rxevent_Post(). =A0This check essential=
ly compares the current time (if supplied on the &quot;now&quot; parameter)=
 with the scheduled time for the top rxevent on the queue. =A0If it&#39;s l=
ater than a certain threshold, then we know that the rxevent queue has fall=
en behind (is &quot;sick&quot;) for some unknown reason. =A0At this point, =
I set a state flag which causes any new connections to abort (with timeout =
or busy, for example). =A0Both the threshold and reply could be configurabl=
e, similar to the current implementation of the -busyat thread-busy thresho=
ld and response. =A0After the rxevent queue is able to catch up with its sc=
heduling work, the &quot;sick&quot; state is reset. =A0And lastly, warning =
messages could be written to the log to indicate that the rxevent queue is =
having difficulties and later has returned to normal. =A0I have some protot=
ype code working in my test environment; it needs some work before it will =
be suitable for review.<br>

<br></blockquote><div><br></div>I haven&#39;t profiled rxevent queue handli=
ng in about 5 years. With your code does the queue appear &quot;sick&quot; =
on a normally functioning host if you assume that threshold is 0 (if now is=
 later than the top scheduled event, assume it should have already fired)?<=
br>
<br><div>=A0<br clear=3D"all"></div></div><br>-- <br>Derrick
</div></div>

--089e011827aead197e04e71fea98--