[AFS3-std] rxgk-afs: moving SetCallBackKey to a separate document?

Thomas Keiser tkeiser@gmail.com
Mon, 4 Mar 2013 19:56:38 -0500


On Fri, Mar 1, 2013 at 11:34 PM, Chaskiel Grundman <cg2v@andrew.cmu.edu> wrote:
>
> > Unprotected callback channels also permit Denial of Service attacks
> > against the cache manager because any IP address can send the cache
> > manager RPCs that invalidate the contents of the cache.
>
> The rxgk callback protection described in the document does not prevent that. In particular:
>
> >   Only RPCs issued over an rxgk protected connection should receive
> >   rxgk protected callbacks
>
> And in any event, why can't the attacker just send RXAFSCB_InitCallBackState3 and invalidate the cache that way? There is no way to require that call be protected (think fileserver restart where the state save/load didn't work).
>

Indeed.  I have been contemplating that issue off-and-on for a long
time.  I don't think it's intractable, yet it is non-trivial to
resolve.  Once we resume discussion of RPC refresh, it will become
possible to devise alternate methods of detecting a loss of state on
the server.

One potential RPC refresh mechanism I have pondered is, upon receipt
of an IN flag bit requesting such behavior, to return a magic value in
the AFSCallBack OUT param.  The idea would be to communicate that the
operation's return parameters have no coherence guarantees--rather
than synchronously invoking reverse calls over an unsecured
channel--thereby securely inviting the client to asynchronously
establish a context via SetCallBackKey.
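
A minimal sketch of that idea, written in Python for brevity.  The flag
bit, the magic value, and every function name below are hypothetical;
none of them exist in the current RXAFS protocol, and the field names
only loosely follow the AFSCallBack structure.

```python
from dataclasses import dataclass

# Hypothetical constants: neither value is defined by any AFS-3 specification.
FLAG_NO_INSECURE_CALLBACK = 0x1    # IN flag bit requesting the new behavior
CB_TYPE_NO_COHERENCE = 0xFFFFFFFF  # magic value returned in the callback type


@dataclass
class AFSCallBack:
    # Field names loosely follow the AFSCallBack OUT parameter.
    callback_version: int
    expiration_time: int
    callback_type: int


def server_fetch_status(flags: int, has_secure_cb_channel: bool) -> AFSCallBack:
    """Server side: decide which callback promise to return."""
    if (flags & FLAG_NO_INSECURE_CALLBACK) and not has_secure_cb_channel:
        # No promise is made, so no reverse call over an unsecured
        # channel is ever required for this client.
        return AFSCallBack(0, 0, CB_TYPE_NO_COHERENCE)
    return AFSCallBack(1, 3600, 1)  # ordinary promise; values are illustrative


def client_handle_reply(cb: AFSCallBack) -> str:
    """Client side: choose the follow-up action for a reply."""
    if cb.callback_type == CB_TYPE_NO_COHERENCE:
        # Results are usable but carry no coherence guarantee; the client
        # establishes a context asynchronously.
        return "schedule SetCallBackKey"
    return "cache with callback promise"
```

The point of the sketch is only that the server signals the missing
guarantee inside the protected reply, and the client reacts on its own
schedule.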

Of course, such semantics would prove troublesome for mechanisms that
likely need to ensure binding to a known client, e.g., future locking.
This then raises the old question of how to securely fail such calls...

An idea, which I remember discussing with someone long ago, involved
working around the unsecured abort problem with OUT unions (e.g., one
leg simply returns an error code, while another returns the normal OUT
parameters).  Alas, this would noticeably increase implementation
complexity.
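
A rough sketch of the OUT-union shape, again in Python.  All names are
hypothetical, and it assumes the reply payload receives the
connection's protection whereas an abort does not.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ErrorLeg:
    # Leg taken on failure: only an error code travels in the reply.
    code: int


@dataclass
class ResultLeg:
    # Leg taken on success: stands in for the normal OUT parameters.
    status: dict


# Discriminated union, analogous to an XDR union switched on a result tag.
Reply = Union[ErrorLeg, ResultLeg]


def fetch_status(authorized: bool) -> Reply:
    """Hypothetical RPC that always completes and never aborts the call."""
    if not authorized:
        return ErrorLeg(code=13)  # illustrative error code only
    return ResultLeg(status={"length": 0})


def handle(reply: Reply) -> str:
    # Every caller must now branch on the leg; this is the added complexity.
    if isinstance(reply, ErrorLeg):
        return f"failed with code {reply.code}"
    return "ok"
```

Each RPC converted this way needs both legs defined and every call
site taught to check the discriminant, which is where the extra
implementation cost comes from.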

It's a shame: secure call aborts would make a lot of this simpler.
However, I'll readily concede that that problem is simply beyond our
time constraints.


>
> I also find this notion that callback revocations could be used in an amplification attack silly. The CM is not going to respond to every RXAFSCB_CallBack() with RXAFS_FetchStatus(). It will only do that the next time that AFS vnode is touched by a client.


Yes, for the vast majority of vnodes within a cell's namespace, there
is little temporal correlation between a callback break and the
subsequent FetchStatus.  However, in addition to the issues to which
Jeff already alluded, there is another class of workload where your
intuition is at odds with the data.  Given that most CMs within the
cell agree on the root of the path traversal digraph, the cumulative
(across all CMs referencing that cell) vnode reference frequency
distribution tends to have a significant tail of outliers.  This tail
largely contains vnodes within a small diameter of the root, primarily
as a result of path-to-FID lookups.  CallBack to FetchStatus
amplification is particularly troublesome when there exist volumes
(particularly within a small graph diameter of the root) whose
topology contains much fan-out.  In such cases, even if only a small
percentage of the traversals through that subgraph are hot, the net
result will be a substantial--and, moreover, temporally
local--amplification following the originating break.
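
A back-of-the-envelope model of that effect, as a sketch only.  The
formula and its parameters are my own simplification for illustration,
not something derived from the xstat data; every input is made up.

```python
def expected_refetches(clients: int, hot_fraction: float,
                       vnodes_per_traversal: int) -> float:
    """Expected FetchStatus calls shortly after one break (toy model).

    clients              -- CMs holding callbacks on the affected subgraph
    hot_fraction         -- share of those CMs that re-traverse it soon after
    vnodes_per_traversal -- vnodes each hot traversal must revalidate
    """
    return clients * hot_fraction * vnodes_per_traversal


def amplification_factor(hot_fraction: float,
                         vnodes_per_traversal: int) -> float:
    """FetchStatus calls generated per callback break delivered."""
    return hot_fraction * vnodes_per_traversal
```

Under this toy model the factor exceeds one as soon as the fan-out
outweighs the share of idle clients, even when hot_fraction is small.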

The xstat time-series data I've had the privilege of visualizing (when
I worked for Sine Nomine, particularly at one client who superimposed
vos release events for critical volumes on top of the time series)
justify looking for ways to mitigate this problem.  Much
like Jeff, I think XCB would be tremendously useful in this regard,
modulo the need to develop heuristics to mitigate the notification
tailing problem--whereby (disinterested) clients continue to receive
notification streams, due to the difficulty in reaching distributed
consensus on when notification load crosses over from beneficial to
deleterious.

Cheers,

-Tom