[OpenAFS-devel] Re: dealing with rxevent queue stalls

Andrew Deason adeason@sinenomine.net
Tue, 24 Sep 2013 10:27:44 -0500


On Mon, 23 Sep 2013 19:52:37 +0000
Mark Vitale <mvitale@sinenomine.net> wrote:

> 1) While accessing a particular fileserver, AFS clients experience
> performance delays; some also see multiple "server down/back up"
> problems.
>   - root cause was a hardware bug on the fileserver that prevented
>   timers from firing reliably; this unpredictably delayed any task in
>   the rxevent queue, while leaving the rest of the fileserver function
>   relatively unaffected.  (btw, this was a pthreaded fileserver).

In my opinion it's not worth it to work around this, unless there's some
way to address it that's easy and everyone agrees it's obviously
correct.

Logging it is definitely helpful and OK, though logging from rx is not
great. Are you logging through an existing mechanism (i.e. just printing
to stderr), or through some new mechanism for logging from rx?

> 2) Volume releases suffer from poor performance and occasionally fail
> with timeouts.
>   - root cause was heavier-than-normal vlserver load (perhaps caused
>   by disk performance slowdowns); this starved LWP IOMGR, which in
>   turn prevented LWP rx_Listener from being dispatched (priority
>   inversion), leading to a grossly delayed rxevent queue.

I'm not sure if I'm mistaken as to what this is about, or if I just find
this phrasing really confusing. I thought the issue here was just that
ubik processes (such as vlserver) use plain read() and write() calls to
read from and write to disk; so if those calls take a while, all LWPs
freeze, because we cannot preempt the LWP that is waiting on I/O. Is that
correct?
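
To make that reading concrete, here is a rough toy model of the behavior I
mean. It is only a sketch under stated assumptions: it is Python, not LWP
or rx code; the clock is simulated; and every name in it (Clock,
blocking_io, disk_worker, event_worker, run, event_lateness) is made up
for illustration. Generators stand in for LWPs, which only give up the
CPU when they yield voluntarily.

```python
# Toy model of cooperative (LWP-style) scheduling. Not OpenAFS code:
# all names are hypothetical and time is simulated, not measured.

class Clock:
    """Simulated wall clock, in seconds."""
    def __init__(self):
        self.now = 0.0

def blocking_io(clock, seconds):
    """Stand-in for a plain read()/write() that blocks the whole process."""
    clock.now += seconds  # no other task can run while this "blocks"

def disk_worker(clock, io_seconds):
    """Task that performs one slow disk operation, then yields once."""
    blocking_io(clock, io_seconds)
    yield

def event_worker(clock, due, fired):
    """Task that fires a timed event once its deadline has passed."""
    while clock.now < due:
        yield  # cooperative yield: let the other tasks run
    fired.append(clock.now)

def run(tasks, clock, tick=0.01):
    """Round-robin scheduler: each task runs until it yields on its own."""
    tasks = list(tasks)
    while tasks:
        for task in list(tasks):
            try:
                next(task)
            except StopIteration:
                tasks.remove(task)
        clock.now += tick  # simulated cost of one pass over the run queue

def event_lateness(io_seconds, due=0.1):
    """Return how many seconds after its deadline the event fired."""
    clock, fired = Clock(), []
    run([disk_worker(clock, io_seconds),
         event_worker(clock, due, fired)], clock)
    return fired[0] - due

if __name__ == "__main__":
    print("lateness with fast disk:", event_lateness(0.0))
    print("lateness with slow disk:", event_lateness(5.0))
```

With a fast disk the event fires within a scheduler pass or two of its
deadline; with a 5-second disk operation it fires only after the blocking
call returns, because nothing can preempt the task that is stuck in it.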

-- 
Andrew Deason
adeason@sinenomine.net