[OpenAFS-devel] idle dead timeout processing in clients

Wed, 30 Nov 2011 12:53:03 -0500

Since before OpenAFS, Rx has included idle peer detection that was
activated only in the servers.  This idle peer detection is referred to
in the code as the idle dead timeout.  Idle peer detection is an
important mechanism for preventing denial of service attacks against the
AFS infrastructure.

AFS clients are just as susceptible to bad behavior as the file servers.
 A file server can respond to the client with keep alive packets for an
extended period of time for a variety of reasons.  The file server could:

  a. be severely overloaded and unable to process all
     incoming requests in a timely fashion. (requests waiting
     for threads.)

  b. have a partition whose underlying disk (or iSCSI, etc) is
     failing and all I/O requests on that device are blocking.

  c. have a large number of threads blocking on a single vnode
     and cannot process requests for other vnodes as a result.

  d. be retrieving the requested data from a hierarchical storage
     management system.

  e. be malicious.

From 2003 until the present there has been a gradual move towards
activating idle peer detection in clients, with idle peer detection
arriving on the master branch (and in Windows 1.5.x clients) in the
Spring of 2008 and in the 1.6.0 release this past Summer.

The motivating factor for Unix clients was protection against case (b)
and improved fail over to a .readonly replica when cases (a) and (c)
occur.  For Windows, the motivating factor was ensuring that all AFS
RPCs would terminate within the SMB timeout period (45 seconds) in order
to avoid the SMB client tearing down its SMB connection.

Unfortunately, client side idle peer detection is the root cause behind
Stephan Wiesand's bug report [RT#130327] in which a large number of
clients writing to a single volume begin to time out, mark servers down,
and fail to complete store operations.

Derrick, Simon and I spent the last week analyzing the problem.  Here is
our analysis:

1.  The use of the RX_CALL_DEAD error to indicate an idle peer
    does not provide enough information to the cache manager for
    it to respond in a sensible manner.  RX_CALL_DEAD is an
    indication that the peer is not responding and should be marked
    "down" until the next server probe.

2.  Idle peer detection is only safe to use when the object that is
    being accessed is known to be replicated and is not stored in a
    HSM.

2a. The mere existence of idle peer detection breaks HSM deployments
    because the file server can be expected to take an extremely long
    time to retrieve some data.  Perhaps hours in some edge cases.

2b. Data changing RPCs require that callbacks be broken.  The timeout
    for a callback break is the hard dead timeout which is 2 minutes.
    The timeout for an idle peer is 1 minute.  Any clients that are
    waiting for an RPC to complete against a vnode on which a callback
    is pending can end up marking the server down prior to the
    completion of the RPC.

2c. When multiple clients issue data changing RPCs against a single
    vnode there are increasingly longer completion times.  When the
    queue of pending requests is long enough, idle peer detection trips,
    causing the server to be marked down and the RPC to fail due to
    lack of a replica to fail over to.

3.  When a replica is available and it is known to not be backed
    by an HSM, the use of idle peer detection is a win.  Unfortunately,
    the client has no knowledge of the backing store; nor should it.
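
The timing interaction in 2b and 2c can be sketched with a toy model.
The 60 and 120 second values come from the text above; the function
names and the strict in-order queue are assumptions made purely for
illustration and are not OpenAFS code.

```c
#include <assert.h>
#include <stdbool.h>

/* Timeouts quoted in this message, in seconds. */
enum {
    SKETCH_IDLE_DEAD_SECS = 60,   /* idle peer timeout */
    SKETCH_HARD_DEAD_SECS = 120   /* hard dead timeout; bounds a callback break */
};

/*
 * Hypothetical model: data changing RPCs against one vnode are served
 * strictly in order and each takes service_secs to complete.
 * queue_position 0 is the request at the head of the queue.
 * Returns how long that request waits until it completes.
 */
int sketch_wait_secs(int queue_position, int service_secs)
{
    assert(queue_position >= 0 && service_secs >= 0);
    return (queue_position + 1) * service_secs;
}

/* True when the client's idle timer fires before the RPC completes. */
bool sketch_idle_trips(int wait_secs)
{
    return wait_secs >= SKETCH_IDLE_DEAD_SECS;
}
```

In this model a callback break that runs to the full hard dead timeout
always outlasts the idle timer, and with 10 second stores the seventh
queued writer already waits 70 seconds, so its idle timer fires even
though the server is healthy.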

And our conclusions:

1.  Protecting against a failed disk, partition, etc. must be done
    on the file server.  Only the file server knows whether keep
    alives are being sent while a pending I/O has failed to return.

2.  Only the client knows whether there are replicas that can be
    used to fail over to.  A client can implement idle peer detection
    but only for RPCs that are issued against replicated volumes
    for which there is an available replica.

3.  For all RPCs against volumes without replicas, idle peer detection
    must be disabled.

4.  The AFS protocol was not designed for use with HSMs.  An RPC that
    must be held in a keep alive state while the HSM retrieves the
    necessary data blocks a limited file server resource (the worker
    thread).  Instead the file server should return a VBUSY indicating
    that the object is being retrieved.

    For new RPCs a better model can be implemented where the file
    server issues a callback when the object is available. This
    avoids having the clients poll the server. This is a problem
    for the AFS3 stds group to address.  In the meantime, it is
    recommended that sites that deploy AFS backed by HSM not
    store replicated volumes in the HSM.

5.  Protecting against a malicious server is hard.  There is no idle
    peer timeout value that can be set that won't cause some legitimate
    workload to fail.  As a result, at this time we cannot implement
    such protection.

6.  Idle peer detection in the client must never result in the file
    server being marked "down".  That is the impetus behind
      http://gerrit.openafs.org/#change,6128
    which permits a new locally generated error RX_CALL_IDLE to be
    reported to the cache manager when the idle peer detection has
    triggered.

7.  Idle peer timeouts increase the load on the file servers due to
    an increased likelihood that the client will re-use a call channel
    that the file server considers in use.
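
Conclusions 2, 3 and 6 amount to a small decision rule on the client.
The following is a minimal sketch of that rule, not the cache manager
code: the enum and function names are hypothetical, and the two error
values merely stand in for RX_CALL_DEAD and RX_CALL_IDLE.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for RX_CALL_DEAD and RX_CALL_IDLE (not the real values). */
enum sketch_error { SKETCH_CALL_DEAD, SKETCH_CALL_IDLE };

/* What the client does after a call fails with one of those errors. */
enum sketch_action { SKETCH_MARK_SERVER_DOWN, SKETCH_TRY_OTHER_REPLICA };

/*
 * Conclusions 2 and 3: arm the idle timer only when some other replica
 * of the volume is available to fail over to.
 */
bool sketch_use_idle_timeout(int other_replicas_available)
{
    assert(other_replicas_available >= 0);
    return other_replicas_available > 0;
}

/*
 * Conclusion 6: an idle timeout must never mark the server down.
 * Only a dead peer is marked down until the next server probe.
 */
enum sketch_action sketch_on_error(enum sketch_error err)
{
    switch (err) {
    case SKETCH_CALL_IDLE:
        return SKETCH_TRY_OTHER_REPLICA;
    case SKETCH_CALL_DEAD:
    default:
        return SKETCH_MARK_SERVER_DOWN;
    }
}
```

Because the idle timer is armed only when a replica exists, the
SKETCH_TRY_OTHER_REPLICA action always has somewhere to go.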

For 1.6.1 we propose that:

1.  Since client side idle peer detection is inherently broken,
    it be disabled entirely on Unix clients.

2.  Since Windows clients must support idle peer detection to address
    the SMB timeout issue, idle peer detection will be activated only
    for SMB initiated requests.  A registry option will be provided
    to permit a cell to be configured in no-HSM mode.  For such a
    cell, idle peer timeouts will be active only when a replica
    is known to be available.  This is possible for Windows
    because of the existence of the registry based CellServDB.

Jeffrey Altman
