[OpenAFS] Re: Ubik trouble

Jeffrey Hutzelman jhutz@cmu.edu
Tue, 14 Jan 2014 10:24:09 -0500


On Mon, 2014-01-13 at 23:22 -0600, Andrew Deason wrote:
> On Mon, 13 Jan 2014 12:32:12 -0500
> Jeffrey Hutzelman <jhutz@cmu.edu> wrote:
> 
> > A worse situation arises when server A makes an RPC to server B, but the
> > best route from server B back to the original source address goes via a
> > different interface than the request came in on.  In this situation, the
> > kernel will assign the wrong source address to server B's outgoing
> > reply, which may cause Rx on server A to drop it on the floor.
> 
> But we ignore the source address when the multihoming bit is set in the
> epoch.

Unfortunately, this behavior has changed a few times.  There are
actually several tests:

- On a client-mode connection, the source address is always ignored.  This
  actually should have the effect of making small requests like votes always
  work.  But for some reason it doesn't.

- Both the source address and port are ignored if the epoch multihome bit is
  set.  This happens on both client- and server-mode connections, except that
  for a period of about 2 years starting in 2004, it happened on client-mode
  connections only.
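
To make the two tests above concrete, here is a rough sketch of the matching
logic in C.  This is not the actual Rx code: the struct, the function name,
and the field names are all made up for illustration.  It also assumes that
the multihome flag is the high bit of the epoch, and that the port is still
compared on a client-mode connection when that bit is clear.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the multihome flag is the high bit of the 32-bit epoch. */
#define EPOCH_MULTIHOME_BIT 0x80000000u

/* Hypothetical types; the real Rx structures are different. */
enum conn_type { CONN_CLIENT, CONN_SERVER };

struct conn {
    enum conn_type type;   /* which side of the connection we are */
    uint32_t epoch;        /* connection epoch */
    uint32_t cid;          /* connection ID */
    uint32_t peer_host;    /* peer address we expect packets from */
    uint16_t peer_port;    /* peer port we expect packets from */
};

/* Return true if an incoming packet should be accepted for connection c. */
static bool
packet_matches_conn(const struct conn *c, uint32_t epoch, uint32_t cid,
                    uint32_t src_host, uint16_t src_port)
{
    /* Epoch and connection ID must always match. */
    if (c->epoch != epoch || c->cid != cid)
        return false;

    /* Multihome bit set: ignore both source address and port. */
    if (epoch & EPOCH_MULTIHOME_BIT)
        return true;

    /* Client-mode connection: ignore the source address.
     * Assumption: the port is still checked in this case. */
    if (c->type == CONN_CLIENT)
        return c->peer_port == src_port;

    /* Otherwise both the address and the port must match. */
    return c->peer_host == src_host && c->peer_port == src_port;
}
```

With rules like these, a reply arriving from an unexpected address is still
accepted on a client-mode connection, and on any connection whose epoch has
the multihome bit set.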

So you're right; the exact scenario I described, where a packet is
dropped by the calling client due to a mismatch of the server's address,
shouldn't happen.  The practical effect of this is that it is possible
for voting to work fine, because that's a single-round-trip operation,
while larger calls such as transferring a database update fare less well
(or less consistently).


> But all processes (that use rxkad) set the multihoming bit. Unless you
> are talking about something else? I don't even see where a process would
> manually set or clear the multihoming bit, unless it manually set the rx
> epoch, and nobody does that. The 'switch' is always flipped (or always
> not flipped, I assume, if you go back far enough).

rx sets the multihome bit by default only in kernel mode.  In user mode,
it is not set.  As it turns out, you're right -- the multihome bit is
also set by rxkad, not only for the current connection but for all
future connections, whenever a new connection is set up.  That code has
been there since AFS 3.1, but I've never noticed it before in all that
time.

This is rather significant, because it means that, except for that
two-year period 10 years ago, we should never have this sort of
multi-homing problem.  Ever.  And yet that clearly has not been the
case.  Blargh.



OK; sorry, Harald.  It seems I can't explain what you've seen after all.

-- Jeff