OpenAFS Master Repository branch, master, updated. openafs-devel-1_9_2-156-g3155f47
Gerrit Code Review
gerrit@openafs.org
Tue, 25 Feb 2025 11:45:29 -0500
The following commit has been merged in the master branch:
commit 3155f472e036d8d93afc644baa256a98052d0331
Author: Mark Vitale <mvitale@sinenomine.net>
Date: Mon Aug 22 16:34:34 2022 -0400
viced: Avoid blocking in multi_Rx
Currently in MultiBreakCallBack_r(), if a callback break fails, we can
h_Lock_r() a host in the body of our multi_Rx() loop for
multi_RXAFSCB_CallBack. This violates the locking rules in host.h
(specifically, the note above struct host->z.callback_rxcon), because we
must not wait on h_Lock_r() while we have a call open on
host->z.callback_rxcon.
Even though the RXAFSCB_CallBack call for the current host has ended
when we h_Lock_r, there are other calls in the multi_Rx for other hosts
that may still be open, so this violates the locking rules.
This can result in a deadlock. Consider this simplified example:
- Thread T1 is inside MultiBreakCallBack_r(), where we have an rx call
open on callback_rxcon for host H2. We are waiting for h_Lock_r on
host H1.
- Thread T2 is inside CallPreamble -> BreakDelayedCallBacks_r ->
XCallBackBulk_r -> RXAFS_CallBack(). h_Lock_r is held for
host H1, and we're waiting to create an rx call on callback_rxcon for
host H1, but its call channels are full.
- Thread T3 is inside MultiBreakCallBack_r(), breaking
callbacks on multiple hosts. It has created an rx call on host H1, and
is waiting to create an rx call on callback_rxcon host H2, but its
call channels are full.
In this example:
- Thread T1 is waiting for T2 to release h_Lock_r(H1)
- Thread T2 is waiting for T3 to end its call on H1's callback_rxcon
- Thread T3 is waiting for T1 to end its call on H2's callback_rxcon
Since all of these threads are waiting on each other, this results in a
deadlock and no progress can be made; the threads will be stuck forever.
This is a simplified example; in real scenarios where this occurs, each
callback_rxcon has 4 call channels, so there are 4 threads that must be
holding open an rx call on the relevant conns. But the general situation
is the same; the graph of dependencies is clearly circular.
To avoid this, change MultiBreakCallBack_r() to not do any blocking
operations in the body of its multi_Rx(), and defer any complex
processing to after the multi_Rx() has ended.
Also change our other multi_Rx() calls in
MultiBreakCallBackAlternateAddress_r() and
MultiProbeAlternateAddress_r() to avoid blocking operations in
multi_Rx(). These don't have the same deadlock as the scenario with
MultiBreakCallBack_r() described above (since they create their own
connections, instead of using host->z.callback_rxcon), but this may
help avoid other similar issues with complex operations inside
multi_Rx(), and may also help reduce contention on rx call channels,
since the calls can finish more quickly.
Document the potential for issues like this in rx_multi.h, and recommend
that callers avoid slow or complex processing in the multi_Rx() body.
Many thanks to Marcio Barbosa, who assisted with characterizing and
reproducing the deadlock under lab conditions.
[adeason@sinenomine.net: Rephrased commit message and some comments.]
Change-Id: Ie926042cab9af78eb10720c7c5aa465b9204392a
Reviewed-on: https://gerrit.openafs.org/15123
Reviewed-by: Mark Vitale <mvitale@sinenomine.net>
Reviewed-by: Michael Meffie <mmeffie@sinenomine.net>
Reviewed-by: Cheyenne Wills <cwills@sinenomine.net>
Reviewed-by: Andrew Deason <adeason@sinenomine.net>
Tested-by: BuildBot <buildbot@rampaginggeek.com>
src/rx/rx_multi.h | 47 ++++++++++++++++
src/viced/callback.c | 153 ++++++++++++++++++++++++++++++---------------------
2 files changed, 138 insertions(+), 62 deletions(-)
--
OpenAFS Master Repository