[OpenAFS] connection timed out, how long is the timeout?

Jeffrey Altman jaltman@auristor.com
Sun, 4 Feb 2018 17:21:16 -0500



On 2/4/2018 7:29 AM, Jose M Calhariz wrote:
> I am chasing the root problem in my infra-structure of afsdb and
> afs-fileservers.  Sometimes my afsdb loses quorum in the middle of a
> vos operation or the Linux clients time out talking to the
> file servers.  To help diagnose the problem I would like to know how
> long is the timeout and if I can change the time out connections in
> the Debian clients and for the vos operations.
>[...]
> The core of my infra-structure are 4 afsdb running Debian 9, and using
> OpenAFS from Debian 1.6.20, on a shared virtualization platform.  The
> file-servers running Debian 9 and using OpenAFS from Debian, 1.6.20,
> are VMs in dedicated hosts for OpenAFS on top of libvirt/KVM.

Jose,

There is unlikely to be a single problem but since I'm procrastinating
and curious I decided to perform some research on your cell.  This
research is the type of analysis that AuriStor performs on behalf of our
support customers.  Many of the problems you are experiencing with
OpenAFS are likely due to or exacerbated by architectural limitations
that are simply not present in AuriStorFS.

Your cell has four db servers, afs01 through afs04, with associated IP
addresses that rank the servers from afs01 through afs04.  Therefore
afs01 is the preferred coordinator (sync site), and if it is not running
afs02 will be elected.  Given that there are four servers, it is not
possible for afs03 or afs04 to be elected.
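
To make the election arithmetic concrete, here is a simplified model of
the rule above (real ubik also gives the lowest-IP site an extra half
vote as a tiebreaker; the ranks here are hypothetical):

```python
def majority(n):
    # ubik quorum needs strictly more than half of all configured sites
    return n // 2 + 1

def can_be_elected(rank, n):
    # Each site votes for the lowest-IP (lowest-ranked) server it can
    # reach, so the server at `rank` (1 = lowest IP) only wins when all
    # lower-ranked servers are down, leaving n - (rank - 1) voters.
    return n - (rank - 1) >= majority(n)

print([can_be_elected(r, 4) for r in (1, 2, 3, 4)])
# → [True, True, False, False]: only afs01 or afs02 can be sync site
```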

There are of course multiple independent ubik database services (vl, pt,
and bu) and it is possible for quorum to exist for one and not for others.

The vl service stores volume location information as well as
fileserver/volserver location information.  vl entries are modified when
a fileserver restarts, when a vos command locks and unlocks an entry, or
when it creates, updates, or deletes an entry.  Its primary consumer is
the afs client, which queries volume and file server location information.

The pt service stores user and group entries.  pt entries are modified
by pts when new user entries are created, modified or deleted; and when
groups are created, modified or deleted; or when group membership
information is modified.  The primary consumer is the fileserver which
queries the pt service for user and host current protection sets each
time a client establishes an rxkad connection to the fileserver.

The vl and pt services are of course ubik services.  Therefore each
vlserver and ptserver process also offers the ubik disk and vote
services which are critical.  The vote service is used to hold
elections, distribute current database version info, and maintain
quorum.  The disk service is used to distribute the database, update the
database, and maintain database consistency.  It should be noted that
the vote service is time sensitive in that packets that are used to
request votes from peers and the responses only have a limited valid
lifetime.

Some statistics regarding your vl service: each server is configured
with 16 LWP threads.  afs03 and afs04 have both failed to service calls
in a timely fashion since the last restart.  If those failures were vote
or disk calls, the coordinator would mark afs03 and afs04 as
unreachable and force a recovery operation; if both were marked down
across an election, the result could be loss of quorum.

Since the last restart afs01 has processed 1894352 vl transactions,
afs02 1075698 transactions, afs03 2059186 transactions, and afs04
1403592 transactions.  That will provide you some idea of the load
balancing across your cache managers. The coordinator of course is the
only one to handle write transactions; the rest are read transactions.

For the pt service the transaction counts are afs01 1818212, afs02
1619962, afs03 1554918, and afs04 1075620.  Roughly on par with the vl
service load.  Like the vl service each server has 16 LWP threads.
However, unlike the vl service the pt service is not keeping up with the
requests.  Since the last restart all four servers have failed to
service incoming calls in a timely manner thousands of times each.

The pt service failing to be responsive is a problem because it has
ripple effects on the file servers.  The longer it takes a fileserver to
query the CPS data the longer it takes to accept a new connection from a
cache manager.

The ubik services in all versions of OpenAFS prior to the 1.8 branch
have been built as LWP (cooperatively threaded) processes.  There is
only a single thread in the process that swaps context state.  The rx
threads (listener, event, ...), the vote, disk, and application (vl, pt,
bu, ...) contexts are swapped in either upon a blocking event or a
yield.  Failure of a context to yield blocks other activities, including
reading packets, processing requests, etc.  Like AuriStorFS, the OpenAFS
1.8 series converts the ubik services (vl, pt, bu) to native threading.
This will permit the vote and disk services and the rx threads
(listener, event,...) to operate with greater parallelism.  Unlike
AuriStorFS, the OpenAFS implementation still relies to a large extent on
global locks for thread safety so all of the resource contention
remains.  Still, 1.8 will be much less vulnerable to vote packets being
delayed beyond their validity.  The vote and disk timeouts cannot be
adjusted because they are a core part of the protocol.
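
A toy model of this cooperative scheduling, using illustrative tick
costs rather than OpenAFS timing constants: contexts switch only at an
explicit yield, so one long application call delays every beacon
response by its full duration.

```python
def beacon_gaps(vote_cost, app_cost, rounds):
    # Round-robin between the vote context and one application context.
    # The application (vl/pt) call runs to completion before yielding,
    # so the vote service answers a beacon only once per full round.
    clock, vote_times = 0, []
    for _ in range(rounds):
        vote_times.append(clock)   # vote context answers a beacon
        clock += vote_cost
        clock += app_cost          # application call runs with no yield
    return vote_times

print(beacon_gaps(vote_cost=1, app_cost=1, rounds=4))    # → [0, 2, 4, 6]
print(beacon_gaps(vote_cost=1, app_cost=100, rounds=4))  # → [0, 101, 202, 303]
```

In the second run the gaps between beacon responses grow with the cost
of the non-yielding application call; once that gap exceeds the vote
validity window, peers treat the site as unreachable.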

Each of the fileserver bnodes is configured as follows:

  dafileserver -L -b 8192 -vc 32768 -s 65536 -l 16384 -p 256 \
   -udpsize 16777216 -cb 1048576 -rxpck 4000

  davolserver -p 16 -udpsize 16777216

The fileserver and volserver use pthreads, not LWP.  The flags above
specify a total of 256 fileserver threads (including rx and other
background threads) and 16 volserver threads.  The fileservers are also
configured with one million callback entries each.

It might be the case that some fileservers in the cell are firewalled
from the internet.  Of those that are visible here are some important
details.

                Too few   Max GUCB   Max FStat   Wait-for
                CBs       seconds    seconds     threads

  afs11         yes       112        101         no
  afs12         yes       110        150         no
  afs13         yes       111         97         no
  afs14         yes       112        155         no
  afs15         yes        80        166         no
  afs16         yes       112        106         no
  afs17         yes       118        118         no
  afs18         yes       105        123         no
  afs19         no         56         53         yes
  afs20         no         32         85         no
  afs21         no         72         63         yes
  afs22         no         56         29         no
  afs23         yes        64         29         no
  afs24         yes       106         84         no
  afs25         no         32         60         no
  afs26         no         64        120         yes

The first column, "too few CBs", indicates whether the number of
callbacks requested by clients exceeds the size of the callback table.
When the callback table is too small, thrashing occurs because the
fileserver is forced to invalidate existing callbacks in order to
allocate new ones.  Those invalidations require the fileserver to notify
the impacted cache manager whose callback promise is going to be broken
prior to expiration, even though there was no change to the file or
directory.
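
The thrashing can be sketched as a fixed-size table with forced
eviction; this is a simplified model, not the OpenAFS data structure.

```python
from collections import OrderedDict

# A toy callback table: once the table is full, every new promise
# forces the fileserver to break a live promise for some client even
# though the file never changed.
class CallbackTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.table = OrderedDict()   # (client, fid) -> promise
        self.premature_breaks = 0

    def grant(self, client, fid):
        key = (client, fid)
        if key in self.table:
            self.table.move_to_end(key)
            return
        if len(self.table) >= self.capacity:
            self.table.popitem(last=False)  # evict the oldest promise
            self.premature_breaks += 1      # costs a CallBack RPC to that CM
        self.table[key] = "promise"

t = CallbackTable(capacity=2)
for fid in range(5):                # 5 distinct files, room for only 2
    t.grant("cm1", fid)
print(t.premature_breaks)           # → 3 promises broken before expiration
```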

The second column is the maximum number of seconds it took the
fileserver to process a GiveUpCallBacks call.  GUCB is issued when the
cache manager's vcache is too small: the CM notifies the fileserver
that it no longer requires the callback promise because it is flushing
the object from the cache in order to make room for something else that
is likely to require a callback promise.

The third column is the maximum number of seconds it took to acquire the
current metadata for the requested object and register a callback promise.

The fourth column is an indication of whether or not the fileserver has
experienced the situation in which it received a call from a cache
manager that could not be processed because all of the available worker
threads were blocked waiting for other operations to complete.

Before I answer your question about timeouts let me explain the workflow
necessary to process a call issued by a cache manager on a new rx
connection.

 1. CM issues call to file server on new connection.

 2. if the call is protected by rxkad, the FS issues an rxkad
    challenge to the client

 3. if a challenge was received, the CM issues an rxkad response

 4. FS finds an available worker thread to process the call

 5. FS starts statistics clock

 6. FS issues TellMeAboutYourself call to CM with the FS's
    capabilities

 7. CM replies to TMAY call with its uuid and capabilities

 8. if the UUID was previously associated with another endpoint
    (ipv4 address and port) or if the endpoint was associated
    with another UUID, the FS issues ProbeUuid calls to confirm
    whether or not there is a conflict.

 9. If received, CM responds with a yes or no answer depending
    upon whether or not the Uuid matches.

10. FS obtains the necessary volume and vnode locks (if any)

11. For a StoreData call, the FS receives the rest of the data
    and writes it to disk

12. FS issues callback notifications to other cache managers
    impacted by the call

13. FS allocates a new callback promise, or updates an existing one,
    if necessary for the caller.

14. FS, if necessary, issues callback invalidations to affected
    CMs, which might include the same CM that issued the call being
    processed.

15. CM either replies or not to the callback notification. If not,
    the notification is added to the delayed callback queue for the
    CM.

16. FS releases any volume and vnode locks.

17. FS updates call statistics.

18. FS completes the call to the CM with success or failure.
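
As a back-of-the-envelope model of the workflow above: the first call
on a new connection pays several round trips before any application
work begins.  The millisecond costs below are illustrative placeholders,
not measurements, chosen only to show which steps multiply under load.

```python
# Per-step costs (ms) for the first call on a new rx connection.
happy_path_ms = {
    "rxkad challenge/response":        50,
    "wait for a worker thread":         0,  # grows when all threads block
    "TellMeAboutYourself round-trip":  50,
    "ProbeUuid conflict resolution":    0,  # only on endpoint/UUID conflict
    "CPS queries to the ptserver":    100,  # grows when the pt service lags
    "volume/vnode locks":               0,  # grows under lock contention
    "callback invalidations":           0,  # grows when the CB table thrashes
}
print(sum(happy_path_ms.values()), "ms")    # → 200 ms in the happy path
```

Each "grows" entry is a place where the delays measured earlier in this
message can push a nominally sub-second call into tens of seconds.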


For any rx connection there are three timeout values that can be set on
both sides of the connection.

 1. Connection timeout.  How long to wait if no packets have been
    received from the peer.

 2. Idle timeout.  How long to wait if ping packets are received
    but no data packets have been received.  This is usually set
    only on the server side of a connection.

 3. Hard timeout.  How long the call is permitted to live before it is
    killed for taking too long, even if data is flowing slowly.


The defaults are a connection timeout of 12 seconds, an idle timeout of
60 seconds on the server side, and no hard timeout.

A CM typically sets a 50 second connection timeout and no idle or hard
timeout on calls to the FS.

The FS sets a 50 second connection timeout and a 120 second hard timeout
on calls to the CM callback service, except for the ProbeUuid calls,
which are assigned a connection timeout of 12 seconds.

The FS connections to the PT service use the defaults.
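
The timeout values above can be collected into a small table for
reasoning about them, using only the numbers quoted in this message
(the field and constant names are mine, not rx symbols):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RxTimeouts:
    connect_secs: Optional[int]  # no packets at all received from the peer
    idle_secs: Optional[int]     # pings arrive, but no data packets
    hard_secs: Optional[int]     # absolute lifetime of the call

RX_DEFAULTS        = RxTimeouts(12, 60, None)   # idle set server-side only
CM_TO_FS           = RxTimeouts(50, None, None) # cache manager -> fileserver
FS_CALLBACK_TO_CM  = RxTimeouts(50, None, 120)  # FS -> CM callback service
FS_PROBEUUID_TO_CM = RxTimeouts(12, None, None) # only the 12s connect is stated
FS_TO_PTSERVER     = RxTimeouts(12, 60, None)   # uses the defaults
```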

I selected the GiveUpCallBacks call statistics because that call doesn't
require any volume or vnode locks, nor can it involve any notifications
to other CMs.  Long timeouts for GUCB mean one or more of the following:

 a. this is the first call on a new connection and the CM's one and
    only callback service thread is not responding to the FS promptly

 b. this is the first call on a new connection and the connection
    endpoint and the CM's UUID do not match and there is a conflict to
    resolve

 c. this is the first call on a new connection and the FS's two CPS
    queries to the protection service take a long time, or time out
    because the selected ptserver stops responding to ping ack packets

 d. the FS's host table / callback table lock is held by other threads
    and this thread cannot make progress

The FetchStatus call is similar except that it can also block waiting
for Volume and Vnode locks which might not be released until callback
notifications are issued.

So what are the potential bottlenecks that can result in extended delays
totaling tens or hundreds of seconds?

1. The single callback service thread in the cache manager which is
   known to experience soft-deadlocks.

2. The responsiveness of the ptservers to the file servers.

3. Blocking on callback invalidations due to the callback table being
   too small.

4. Network connectivity between the FS and both PT servers and CMs.

It's time for the Super Bowl so I will send off this message as is.
Perhaps it will be useful.

Jeffrey Altman
AuriStor, Inc.

