[OpenAFS-devel] fileserver loop

Nickolai Zeldovich kolya@MIT.EDU
Wed, 21 Aug 2002 15:18:57 -0400


> Last night we started seeing blocked connections above 50 across four of
> our AFS servers; one that holds read-write volumes with one replica and
> three pure replica servers.

The symptoms sound really similar to the asymmetric client lossage.
That bug is fixed in the mainline of OpenAFS (and the cells at MIT
have been running with this patch for a while now), but it doesn't
look like it was pulled up to the 1.2.x branch.  The problem comes
up when some client is able to send packets to the server, but the
server is unable to send packets back to the client (because of a
firewall, or some other misconfiguration).  This ties up the server's
worker threads for a long time as the server tries to contact the
client.  If the client sends new requests sufficiently often (e.g.
the Windows AFS client, whose timeouts are much lower than those
of the UNIX client), the server runs out of worker threads.
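
To make the arithmetic of that failure mode concrete, here is a rough
toy model in Python.  It is only a sketch, not fileserver code: the
function name, the pool size, the retry interval and the hold time are
all made-up illustrative values, not actual OpenAFS or Rx parameters.
It assumes each request from the half-reachable client pins one worker
for a fixed hold time while the server tries to reach the client.

```python
def time_until_exhausted(workers, request_interval, hold_time, horizon=3600):
    """Toy model: return the time (in seconds) at which every worker is
    tied up by one half-reachable client, or None if that never happens
    within `horizon` seconds.  All parameters are hypothetical."""
    if request_interval <= 0:
        raise ValueError("request_interval must be positive")
    busy_until = []  # release times of workers currently stuck
    t = 0
    while t <= horizon:
        # Workers whose attempt to reach the client has timed out are freed.
        busy_until = [release for release in busy_until if release > t]
        # The client's new request grabs a free worker, if there is one.
        if len(busy_until) < workers:
            busy_until.append(t + hold_time)
        if len(busy_until) >= workers:
            return t  # no worker left for anyone else
        t += request_interval
    return None
```

With these invented numbers, time_until_exhausted(4, 5, 60) returns 15
(a client retrying every 5 seconds pins all four workers almost at
once), while time_until_exhausted(4, 30, 60) returns None, because a
client that retries slowly enough never holds more than two workers at
a time.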

If you're interested, the deltas on the mainline for this bugfix
are:

  rx-protect-servers-from-half-reachable-clients-20020119
  rx-cleanup-deadlock-and-refcnt-leak-20020121
  better-protection-against-asymmetric-clients-20020222
  minor-rx-lock-cleanup-20020330
  clear-attachwait-flag-20020403

-- kolya