[OpenAFS-devel] idle dead timeout processing in clients

Simon Wilkinson sxw@inf.ed.ac.uk
Thu, 8 Dec 2011 14:14:01 +0000


It's pretty clear that there are a number of issues, in both the 1.4.x
and 1.6.x series, that cause extremely poor performance with "busy"
fileservers or clients. Some of these have been there for a long time;
others come from changes made over the life of the 1.4.x and 1.6.x
series. In particular, what seems to happen is that servers (and, in
some cases, whole cells) will quite happily scale up to a particular
load. However, once that load is exceeded, instead of degrading
gracefully, things will just jam up completely.

The first possible cause is journalling filesystems. Many of these flush
their journals to disk at regular intervals, blocking or reducing access
to the filesystem during the journal flush. This block can be enough to
cause the fileserver to start queuing incoming connections and, at a
site that is finely balanced, may be enough to cause performance to
stall. This was made considerably worse by the fileserver performing a
sync() operation every 10 seconds. This is fixed in 1.6.0; a 1.4.x
release containing the fix has yet to appear.
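
To illustrate why a short periodic block matters so much at a finely
balanced site, here is a toy model in Python. It is only a sketch, not
fileserver code: the function name, the notion of a "tick", and all of
the numbers are assumptions made up for illustration. It assumes a
fixed number of requests arrive per tick, a fixed number can be served
per tick, and nothing is served during the first `stall` ticks of each
period.

```python
def backlog_with_stalls(ticks, arrivals, capacity, period, stall):
    """Toy model (hypothetical): return (peak_backlog, final_backlog).

    Each tick, `arrivals` requests arrive. The server handles up to
    `capacity` requests per tick, except during the first `stall`
    ticks of every `period` ticks, when it is blocked (for example,
    by a journal flush) and handles nothing.
    """
    backlog = 0
    peak = 0
    for now in range(ticks):
        backlog += arrivals
        if now % period >= stall:
            # Server is not blocked this tick: drain what it can.
            backlog -= min(capacity, backlog)
        peak = max(peak, backlog)
    return peak, backlog


if __name__ == "__main__":
    for stall in (0, 1, 2):
        peak, final = backlog_with_stalls(100, 9, 10, 10, stall)
        print(f"stall={stall}: peak backlog {peak}, final backlog {final}")
```

In this model, with 9 arrivals per tick against a capacity of 10, a
one-tick block every ten ticks is only just absorbed: the backlog peaks
at 9 and is drained by the end of each period. A two-tick block leaves
10 requests behind every period, so the backlog never clears and keeps
growing.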

The next cause is deadlocks between the client and the fileserver. The
Linux dynamic vcaches code, which was added in 1.4.10, is known to
interact badly with fileserver callback breaks, especially in
situations where the fileserver is under heavy load. There is a fix in
1.6.0, but we have yet to ship a 1.4.x release which contains it. You
can also work around this particular problem by disabling dynamic
vcaches in your clients.

The idle dead code (present since 1.4.8, made considerably worse in
1.6.0) then exacerbates any performance problems that you may be seeing.
If the client hasn't received a response from the server in a (small)
number of seconds, it gives up on the request and tries a different
server. However, if the server hasn't responded because it is
overloaded, or because it is waiting for a callback break, then the
client's request will still be queued on the server, either taking up a
valuable thread or sitting amongst the "calls waiting for a thread". The
server will then (eventually) process the packets which it has received
and attempt to perform the operation requested by the client, which has
long since gone away. In some experimental situations, idle dead can
actually lead to exponential load increases on the fileserver as clients
pound on a particular busy server.
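
The wasted work described above can be sketched with another toy model
in Python. Again, this is not OpenAFS code: the function, its
parameters and the numbers are hypothetical. It assumes a FIFO queue, a
server that handles a fixed number of calls per tick, and clients that
give up after `timeout` ticks and re-send the call to the same busy
server (a simplification of "tries a different server"). The server
cannot tell that a queued call has been abandoned, so it still spends
capacity on it.

```python
from collections import deque


def simulate(ticks, arrivals, capacity, timeout, retry):
    """Toy model (hypothetical): return (useful, wasted, queued).

    useful: calls served while the client was still waiting
    wasted: calls served after the client had already given up
    queued: calls still waiting on the server at the end
    """
    queue = deque()  # issue tick of every call queued on the server
    useful = 0
    wasted = 0
    for now in range(ticks):
        # New calls issued by clients during this tick.
        queue.extend([now] * arrivals)
        if retry:
            # Calls that have just reached the timeout are abandoned by
            # their clients and re-sent. The stale copy stays queued.
            expired = sum(1 for issued in queue if now - issued == timeout)
            queue.extend([now] * expired)
        # The server works through the queue in arrival order.
        for _ in range(min(capacity, len(queue))):
            issued = queue.popleft()
            if retry and now - issued >= timeout:
                wasted += 1  # the client gave up on this copy long ago
            else:
                useful += 1
    return useful, wasted, len(queue)


if __name__ == "__main__":
    for retry in (False, True):
        useful, wasted, queued = simulate(100, 4, 3, 5, retry)
        print(f"retry={retry}: useful={useful} wasted={wasted} queued={queued}")
```

While arrivals stay below capacity, the timeout never fires and the
retry setting makes no difference. Once arrivals exceed capacity, the
server still completes the same number of calls per tick, but some of
them are for clients that have already given up, and the queue ends up
longer than it would have been had the clients simply waited. This
simple model only shows the direction of the effect; it makes no
attempt to reproduce the exponential growth seen in the experiments.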

Hope that's of some use...

Simon.