[OpenAFS] Chronic blocked connections on fileserver

Derrick Brashear shadow@gmail.com
Mon, 24 Sep 2007 10:37:23 -0400


On 9/24/07, Will Maier <willmaier@ml1.net> wrote:
>
> Hi, all-
>
> We've been having very acute and chronic periods during which one of
> our main fileservers shows large numbers of blocked connections.
> These periods do not (it seems) correlate with high system load,
> high network interface utilization, dropped packets, UDP errors,
> high I/O or other badness indicators that I'm accustomed to looking
> for.
>
> rxdebug shows up to 200-300 blocked connections during these
> periods, which last up to an hour or so after which the badness
> abates. Since this server hosts several critical volumes, including
> one in which many $PATH elements live, users notice these
> disruptions very quickly.
>
> We've tried our best to balance accesses between our three main
> servers and have moved several very active volumes off the
> misbehaving server. After the move, the server handles ~1 million
> volume accesses in an hour; our busiest server (which does not
> experience this problem) handles nearly three times as many
> accesses. rxdebug usually shows ~8 thousand active server and client
> connections on this server.
>
> No events in the FileLog correspond with the blocked connections. I
> do see regular ProbeUuid failures, but those are benign (right?).
>
> This server has a dual-core 3.00GHz Xeon CPU, 4GB RAM and a 1Gbps
> network connection. Its vice partitions are stored on a
> fibre-attached Xserve RAID array.
>
> What other information would help resolve this problem? Is there
> another aspect of the system that I should examine? What further
> steps might we take to try to resolve the issue?


A backtrace might help, but at first blush, the patch in OpenAFS RT ticket
19461 is probably what you want.
