[OpenAFS] connection timed out, how long is the timeout?

Jeffrey Altman jaltman@auristor.com
Sun, 4 Feb 2018 17:21:16 -0500

This is a cryptographically signed message in MIME format.

Content-Type: multipart/mixed;
Content-Language: en-US

This is a multi-part message in MIME format.
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

On 2/4/2018 7:29 AM, Jose M Calhariz wrote:
> I am chasing the root problem in my infra-structure of afsdb and
> afs-fileservers.  Sometimes my afsdb loses quorum in the middle of a
> vos operation or the Linux clients time out talking to the
> file servers.  To help diagnose the problem I would like to know how
> long is the timeout and if I can change the time out connections in
> the Debian clients and for the vos operations.
> The core of my infra-structure are 4 afsdb running Debian 9, and using
> OpenAFS from Debian 1.6.20, on a shared virtualization platform.  The
> file-servers running Debian 9 and using OpenAFS from Debian, 1.6.20,
> are VMs in dedicated hosts for OpenAFS on top of libvirt/KVM.


There is unlikely to be a single problem but since I'm procrastinating
and curious I decided to perform some research on your cell.  This
research is the type of analysis that AuriStor performs on behalf of our
support customers.  Many of the problems you are experiencing with
OpenAFS are likely due to or exacerbated by architectural limitations
that are simply not present in AuriStorFS.

Your cell has four db servers afs01 through afs04 with associated IP
addresses that rank the servers from afs01 through afs04.  therefore
afs01 is the preferred coordinator (sync site) and if its not running
afs02 will be elected.  Given there are four servers it is not possible
for afs03 or afs04 to be elected.

There are of course multiple independent ubik database services (vl, pt,
and bu) and it is possible for quorum to exist for one and not for others=

The vl service is used to store volume location information as well as
fileserver/volserver location information.  vl entries are modified when
a fileserver restarts, when a vos command locks and unlocks an entry, or
creates, updates or deletes an entry.   Its primary consumer is the afs
client which queries volume and file server location information.

The pt service stores user and group entries.  pt entries are modified
by pts when new user entries are created, modified or deleted; and when
groups are created, modified or deleted; or when group membership
information is modified.  The primary consumer is the fileserver which
queries the pt service for user and host current protection sets each
time a client establishes an rxkad connection to the fileserver.

The vl and pt services are of course ubik services.  Therefore each
vlserver and ptserver process also offers the ubik disk and vote
services which are critical.  The vote service is used to hold
elections, distribute current database version info, and maintain
quorum.  The disk service is used to distribute the database, update the
database, and maintain database consistency.  It should be noted that
the vote service is time sensitive in that packets that are used to
request votes from peers and the responses only have a limited valid

Some statistics regarding your vl service.  Each server is configured
with 16 LWP threads.  afs03 and afs04 have both failed to service calls
in a timely fashion since the last restart.  If those failures were vote
or disk calls then the coordinator would mark afs03 and afs04 as
unreachable, force a recovery operation, and if both were marked down
across an election could result in lose of quorum.

Since the last restart afs01 has processed 1894352 vl transactions,
afs02 1075698 transactions, afs03 2059186 transactions, and afs04
1403592 transactions.  That will provide you some idea of the load
balancing across your cache managers. The coordinator of course is the
only one to handle write transactions; the rest are read transactions.

For the pt service the transaction counts are afs01 1818212, afs02
1619962, afs03 1554918, and afs04 1075620.  Roughly on par with the vl
service load.  Like the vl service each server has 16 LWP threads.
However, unlike the vl service the pt service is not keeping up with the
requests.  Since the last restart all four servers have failed to
service incoming calls in a timely manner thousands of times each.

The pt service failing to be responsive is a problem because it has
ripple effects on the file servers.  The longer it takes a fileserver to
query the CPS data the longer it takes to accept a new connection from a
cache manager.

The ubik services in all versions of OpenAFS prior to the 1.8 branch
have been built as LWP (cooperatively threaded) processes.  There is
only a single thread in the process that swaps context state.  The rx
threads (listener, event, ...), the vote, disk, and application (vl, pt,
bu, ...) contexts are swapped in either upon a blocking event or a
yield.  Failure of a context to yield blocks other activities including
reading packets, processing requests, etc.  Like AuriStorFS the OpenAFS
1.8 series converts the ubik services (vl, pt, bu) to native threading.
This will permit the vote and disk services and the rx threads
(listener, event,...) to operate with greater parallelism.  Unlike
AuriStorFS, the OpenAFS implementation still relies to a large extent on
global locks for thread safety so all of the resource contention
remains.  Still, 1.8 will be much less vulnerable to vote packets being
delayed beyond their validity.  The vote and disk timeouts cannot be
adjusted because they are a core part of the protocol.

Each of the fileserver bnodes is configured as such

  dafileserver -L -b 8192 -vc 32768 -s 65536 -l 16384 -p 256 \
   -udpsize 16777216 -cb 1048576 -rxpck 4000

  davolserver -p 16 -udpsize 16777216

The fileserver and volserver use pthreads and not LWP.  There above
states that there should be a total of 256 fileserver threads (including
rx and other background threads) and 16 volserver threads.   The
fileservers are also configured with one million callback entries each.

It might be the case that some fileservers in the cell are firewalled
from the internet.  Of those that are visible here are some important

                Too few      	Max GUCB	Max FStat	Wait-for
                CBs    		seconds		seconds		threads

  afs11          yes		112		101		no
  afs12          yes		110		150		no
  afs13		 yes		111		 97		no
  afs14		 yes		112		155		no
  afs15		 yes		 80		166		no
  afs16		 yes		112		106		no
  afs17          yes		118		118		no
  afs18          yes		105		123		no
  afs19          no		 56		 53		yes
  afs20          no		 32		 85		no
  afs21 	 no		 72		 63		yes
  afs22		 no		 56		 29		no=09
  afs23		 yes		 64		 29		no
  afs24		 yes		106		 84		no
  afs25		 no 		 32		 60		no
  afs26		 no		 64		120		yes

The first column "too few CBs" indicates whether the number of callbacks
requested by clients exceeds the size of the callback table.  When the
callback table is too small thrashing occurs because the fileserver is
forced to invalidate existing callbacks in order to allocate the new
one. Those invalidations require the fileserver to notify the impacted
cache manager whose callback promise is going be broken prior to
expiration even though there was no change to the file or directory.

The second column is the maximum number of seconds it took the
fileserver to process a GiveUpCallBacks call.  GUCBs is issued when the
cache manager's vcache is too small and it attempts to notify the
fileserver that it no longer requires the callback promise because its
flushing the object from the cache in order to make room for something
else that is likely to require a callback promise.

The third column is the maximum number of seconds it took to acquire the
current metadata for the requested object and register a callback promise=

The fourth column is an indication of whether or not the fileserver has
experienced the situation in which it received a call from a cache
manager that could not be processed because all of the available worker
threads were blocked waiting for other operations to complete.

Before I answer your question about timeouts let me explain the workflow
necessary to process a call issued by a cache manager on a new rx

 1. CM issues call to file server on new connection.

 2. if the call is protected by rxkad, the FS issues an rxkad
    challenge to the client

 3. if a challenge was received, the CM issues an rxkad response

 4. FS finds an available worker thread to process the call

 5. FS starts statistics clock

 6. FS issues TellMeAboutYourself call to CM with the FS's

 7. CM replies to TMAY call with its uuid and capabilities

 8. if the UUID was previously associated with another endpoint
    (ipv4 address and port) or if the endpoint was associated
    with another UUID, the FS issues ProbeUuid calls to confirm
    whether or not there is a conflict.

 9. If received, CM responds with a yes or no answer depending
    upon whether or not the Uuid matches.

10. FS obtains the necessary volume and vnode locks (if any)

11. For a StoreData call, the FS receives the rest of the data
    and writes it to disk

12. FS issues callback notifications to other cache managers
    impacted by the call

13. FS allocates or updates a new callback promise if necessary
    for the caller.

14. FS, if necessary, issues callback invalidation to affected
    CM which might be the same CM as issued the call that is
    being processed.

15. CM either replies or not to the callback notification. If not,
    the notification is added to the delayed callback queue for the

16. FS releases any volume and vnode locks.

17. FS updates call statistics.

18. FS completes the call to the CM with success or failure.

For any rx connection there are three timeout values that can be set on
both sides of the connection.

 1. Connection timeout.  How long to wait if no packets have been
    received from the peer.

 2. Idle timeout.  How long to wait if ping packets are received
    but no data packets have been received.  This is usually set
    only on the server side of a connection.

 3. Hard timeout. How long is the call permitted to live before it is
    killed for taking too long even if data if flowing slowly.

The defaults are a connection timeout of 12 seconds, an idle timeout of
60 seconds on the server side, and no hard dead timeout.

A CM typically sets a 50 second connection timeout and no idle or hard
timeout on calls to the FS.

The FS sets a 50 second connection timeout and 120 second hard timeout
on calls to the CM callback service; except for the ProbeUuid calls
which are assigned a connection time of 12 seconds.

The FS connections to the PT service use the defaults.

I selected the GiveUpCallBacks call statistics because that call doesn't
require any volume or vnode locks, nor can it involve any notifications
to other CMs.  Long timeouts for GUCBs means one or more of the following=

 a. this is the first call on a new connection and the CM's one and
    only callback service thread is not responding to the FS promptly

 b. this is the first call on a new connection and the connection
    endpoint and the CM's UUID do not match and there is a conflict to

 c. this is the first call on a new connection and the FS's two CPS
    queries to the protection service take a long time or timeout if
    the selected ptserver stops responding to ping ack packets.

 c. the FS's host table / callback table lock is in use by other threads
    and this thread cannot make progress

The FetchStatus call is similar except that it can also block waiting
for Volume and Vnode locks which might not be released until callback
notifications are issued.

So what are the potential bottlenecks that can result in extended delays
totally tens or hundreds of seconds?

1. The single callback service thread in the cache manager which is
   known to experience soft-deadlocks.

2. The responsiveness of the ptservers to the file servers.

3. Blocking on callback invalidations due to the callback table being
   too small.

4. Network connectivity between the FS and both PT servers and CMs.

Its time for the Super Bowl so I will send off this message as is.
Perhaps it will be useful.

Jeffrey Altman
AuriStor, Inc.

Content-Type: text/x-vcard; charset=utf-8;
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;

fn:Jeffrey Altman
org:AuriStor, Inc.
adr:Suite 6B;;255 West 94Th Street;New York;New York;10025-6985;United St=
title:Founder and CEO
note;quoted-printable:LinkedIn: https://www.linkedin.com/in/jeffreyaltman=
	Skype: jeffrey.e.altman=3D0D=3D0A=3D


Content-Type: application/pkcs7-signature; name="smime.p7s"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="smime.p7s"
Content-Description: S/MIME Cryptographic Signature