[OpenAFS] idle dead timeout issue?

Jeffrey Altman jaltman@your-file-system.com
Wed, 04 Apr 2012 14:58:00 -0400

On 4/4/2012 2:00 PM, Jack Neely wrote:
> On Tue, Apr 03, 2012 at 11:21:10AM -0400, Jeffrey Altman wrote:
>> On 4/3/2012 11:13 AM, Jack Neely wrote:
>>> Folks,
>>> What's the status of the idle dead timeout issue?  We are continuing to
>>> have issues with 1.6.1pre2.  I've seen a lot of git activity and am
>>> wondering if the idle dead issue has been resolved at this point.
>>> Thanks,
>>> Jack
>> What are the issues and why do you believe they are idle dead related?
> Because Russ told me so.  ;-)
>     https://lists.openafs.org/pipermail/openafs-info/2012-January/03752=
> See the bottom of the email.

Indeed read the bottom of the mail.  Russ identifies two issues:

 1. idle dead timeouts

 2. vnode lock contention tying up all of the server threads

1.6.1pre2 contained a fix for the idle dead issues.   It does not
contain a fix for the vnode lock contention issues in the file server.
Nor does it contain a fix for client side "rx busy" packet processing
that was added in 1.6.1pre4.

> I've grabbed 1.6.1, read the release notes and saw some notes that
> probably apply to this situation.  I'm still unclear if the OpenAFS
> folks believe this issue is solved or just better.  In any case there's
> nothing like tossing it on one of the web servers and giving her a spin

That depends on what is meant by "this issue".

The client side idle dead processing issue is fixed to the extent that
we are going to fix it.   A request to a file server for a
non-replicated object will use a timeout of 20 minutes.  If your file
servers are not responding in that time period (unless your data is
offline in a hierarchical storage management system) then you have bigger
problems.

If the request is against a replicated object (read/only) then a shorter
timeout is used but a retry will not be sent to the same file server
that failed to respond in a timely manner.
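
As a rough sketch of the policy just described (this is not the cache
manager's code; the names Request, pick_timeout and next_replica are
made up for illustration, and the read-only timeout is left as a
parameter because only "shorter" is stated above):

```python
from dataclasses import dataclass, field

NON_REPLICATED_TIMEOUT_S = 20 * 60  # the 20 minutes mentioned above


@dataclass
class Request:
    replicated: bool                              # True for a read/only object
    servers: list                                 # candidate file servers
    timed_out: set = field(default_factory=set)   # servers that did not answer in time


def pick_timeout(req, replicated_timeout_s):
    """Return the timeout in seconds for this request.

    replicated_timeout_s is a placeholder supplied by the caller; the
    actual value used for read/only objects is not given here.
    """
    return replicated_timeout_s if req.replicated else NON_REPLICATED_TIMEOUT_S


def next_replica(req):
    """For a replicated object, return a server that has not yet timed out.

    Returns None once every replica has failed to respond in time.
    """
    for server in req.servers:
        if server not in req.timed_out:
            return server
    return None
```

A caller would add a server to req.timed_out whenever a call to it
exceeds pick_timeout(), then ask next_replica() where to send the retry.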

If "this issue" refers to the limited thread pool size in OpenAFS file
servers and the fact that all of the threads can become blocked on a
single vnode, then the answer is 'no'.  This is not fixed.

If "this issue" refers to the fact that a client can receive the
response to a call and issue the next call on the same call channel
before the file server has finished cleaning up the prior call which
results in an RX BUSY packet being sent to the client, the answer is
'no'.  There is a race in the file server that has existed since the
beginning of time.   Simon and I have researched several solutions but
none of them are appealing because they all introduce additional
contention with the file server's rx listener thread which slows down
the file server's ability to process incoming packets.

If "this issue" refers to the fact that the UNIX client was failing to
unconditionally retry requests when an RX BUSY packet is received, the
answer is 'yes'.
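
To make the BUSY case concrete, here is a toy model of the retry
behaviour.  It is only a sketch: FakeChannel and call_with_retry are
hypothetical names, not anything in the RX code, and the "cleanup" is
modelled as a simple counter of refused attempts.

```python
BUSY = "BUSY"


class FakeChannel:
    """Hypothetical stand-in for one call channel on a file server."""

    def __init__(self, cleanup_ticks):
        # Number of attempts that will be refused while the server is
        # still cleaning up the prior call on this channel.
        self.cleanup_ticks = cleanup_ticks

    def call(self, payload):
        if self.cleanup_ticks > 0:
            self.cleanup_ticks -= 1
            return BUSY
        return f"ok:{payload}"


def call_with_retry(channel, payload):
    """Retry for as long as the server answers BUSY.

    Returns (reply, attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        reply = channel.call(payload)
        if reply != BUSY:
            return reply, attempts
```

With FakeChannel(2), the first two attempts come back BUSY and the third
one succeeds.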

If "this issue" refers to something else, please specify what you
mean.

The fact is that I cannot root cause your performance problems from the
data provided in this e-mail thread.   No one can.  People can make a
guess as to what the issues might be and that is all.

> Performance appears better compared to our other web servers, slightly.
> However, we are still getting periods of time where AFS takes multiple
> seconds to 30 seconds to respond.  Then suddenly, all hanging AFS
> transactions return at the same time.
> See the graph.  The Y-axis is the number of httpd processes, the X-axis
> is the number of seconds past 13:00 today.  (Data gathered from the httpd
> logs of how long each request took.)
>     http://www4.ncsu.edu/~jjneely/web-apr4-1325.pdf

So you have more httpd processes issuing requests to /afs than the
maximum number of file server threads.  If those threads block (perhaps
waiting for callbacks on a client to break or on other threads waiting
for a callback to complete), everything will halt until the active
threads can finish their task.
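
A small simulation shows why the stalled requests all come back
together.  This is a sketch using a generic Python thread pool, not the
file server; the thread and request counts are arbitrary, and a single
Event stands in for whatever the server threads are blocked on.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def simulate(server_threads, client_requests):
    """Return (requests finished while blocked, requests finished after release)."""
    gate = threading.Event()  # stands in for the blocking condition

    def handle(i):
        gate.wait()           # a worker holding a request blocks here
        return i

    with ThreadPoolExecutor(max_workers=server_threads) as pool:
        # Requests beyond server_threads sit in the queue; they cannot
        # start until one of the blocked workers is released.
        futures = [pool.submit(handle, i) for i in range(client_requests)]
        done_before, _ = wait(futures, timeout=0.2)
        gate.set()            # the blocking condition clears
        done_after, _ = wait(futures)
    return len(done_before), len(done_after)


print(simulate(server_threads=4, client_requests=10))  # prints (0, 10)
```

Nothing completes while the workers are blocked, and every outstanding
request completes as soon as they are released.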

Jeffrey Altman
