OpenAFS Master Repository branch, master, updated. openafs-devel-1_9_2-156-g3155f47

Gerrit Code Review gerrit@openafs.org
Tue, 25 Feb 2025 11:45:29 -0500


The following commit has been merged in the master branch:
commit 3155f472e036d8d93afc644baa256a98052d0331
Author: Mark Vitale <mvitale@sinenomine.net>
Date:   Mon Aug 22 16:34:34 2022 -0400

    viced: Avoid blocking in multi_Rx
    
    Currently in MultiBreakCallBack_r(), if a callback break fails, we can
    h_Lock_r() a host in the body of our multi_Rx() loop for
    multi_RXAFSCB_CallBack. This violates the locking rules in host.h
    (specifically, the note above struct host->z.callback_rxcon), because we
    must not wait on h_Lock_r() while we have a call open on
    host->z.callback_rxcon.
    
    Even though the RXAFSCB_CallBack call for the current host has ended
    when we h_Lock_r, there are other calls in the multi_Rx for other hosts
    that may still be open, so this violates the locking rules.
    
    This can result in a deadlock. Consider this simplified example:
    
    - Thread T1 is inside MultiBreakCallBack_r(), where we have an rx call
      open on callback_rxcon for host H2. We are waiting for h_Lock_r on
      host H1.
    
    - Thread T2 is inside CallPreamble -> BreakDelayedCallBacks_r ->
      XCallBackBulk_r -> RXAFS_CallBack(). h_Lock_r is held for
      host H1, and we're waiting to create an rx call on callback_rxcon for
      host H1, but its call channels are full.
    
    - Thread T3 is inside MultiBreakCallBack_r(), breaking
      callbacks on multiple hosts. It has created an rx call on host H1, and
      is waiting to create an rx call on callback_rxcon host H2, but its
      call channels are full.
    
    In this example:
    
    - Thread T1 is waiting for T2 to release h_Lock_r(H1)
    
    - Thread T2 is waiting for T3 to end its call on H1's callback_rxcon
    
    - Thread T3 is waiting for T1 to end its call on H2's callback_rxcon
    
    Since all of these threads are waiting on each other, this results in a
    deadlock and no progress can be made; the threads will be stuck forever.
    
    This is a simplified example; in real scenarios where this occurs, each
    callback_rxcon has 4 call channels, so there are 4 threads that must be
    holding open an rx call on the relevant conns. But the general situation
    is the same; the graph of dependencies is clearly circular.
    
    To avoid this, change MultiBreakCallBack_r() to not do any blocking
    operations in the body of its multi_Rx(), and defer any complex
    processing to after the multi_Rx() has ended.
    
    Also change our other multi_Rx() calls in
    MultiBreakCallBackAlternateAddress_r() and
    MultiProbeAlternateAddress_r() to avoid blocking operations in
    multi_Rx(). These don't have the same deadlock as the scenario with
    MultiBreakCallBack_r() described above (since they create their own
    connections, instead of using host->z.callback_rxcon), but this may
    help avoid other similar issues with complex operations inside
    multi_Rx(), and may also help reduce contention on rx call channels,
    since the calls can finish more quickly.
    
    Document the potential for issues like this in rx_multi.h, and recommend
    that callers avoid slow or complex processing in the multi_Rx() body.
    
    Many thanks to Marcio Barbosa, who assisted with characterizing and
    reproducing the deadlock under lab conditions.
    
    [adeason@sinenomine.net: Rephrased commit message and some comments.]
    
    Change-Id: Ie926042cab9af78eb10720c7c5aa465b9204392a
    Reviewed-on: https://gerrit.openafs.org/15123
    Reviewed-by: Mark Vitale <mvitale@sinenomine.net>
    Reviewed-by: Michael Meffie <mmeffie@sinenomine.net>
    Reviewed-by: Cheyenne Wills <cwills@sinenomine.net>
    Reviewed-by: Andrew Deason <adeason@sinenomine.net>
    Tested-by: BuildBot <buildbot@rampaginggeek.com>

 src/rx/rx_multi.h    |  47 ++++++++++++++++
 src/viced/callback.c | 153 ++++++++++++++++++++++++++++++---------------------
 2 files changed, 138 insertions(+), 62 deletions(-)

-- 
OpenAFS Master Repository