[OpenAFS-devel] Interesting Fileserver client-lockout bug (patch included)

Derek Atkins warlord@MIT.EDU
21 Mar 2001 23:14:34 -0500


I noticed an interesting fileserver bug, and with the help of JHutz
tracked it down and hereby submit a patch to fix it.  The bug is also
relevant in Transarc/IBM AFS, and a similar patch would solve the
problem there, too.

First, some background: The AFS Fileserver tries really hard to keep
track of all the "interfaces" of a client.  Generally this is for a
multi-homed client, so that the server realizes that you are the same
client when you come from multiple addresses.  However, this also
winds up applying to a mobile host whose IP address changes over time.

When the Fileserver sees a "new" address, it asks the client for its
Uuid and, if that Uuid already exists, it adds this new address to the
list of interfaces for the existing host.  However, it keeps a
callback connection open to the original address.
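
To make the bookkeeping above concrete, here is a minimal sketch in C.
It is not the fileserver's actual data structure: the struct, its
fields, and sketch_add_interface are hypothetical names invented for
illustration.  The point it shows is that learning a new address
extends the interface list but leaves the cached callback target alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_IFACES 8

/* Hypothetical, simplified stand-in for the per-host record. */
struct sketch_host {
    unsigned char uuid[16];              /* identity reported by the client */
    uint32_t addrs[SKETCH_MAX_IFACES];   /* every address seen for this Uuid */
    int naddrs;                          /* number of valid entries in addrs */
    uint32_t callback_addr;              /* where the cached callback
                                          * connection points */
};

/* Record a newly seen address for a host whose Uuid is already known.
 * Returns 0 on success (or if the address was already listed) and -1
 * if the table is full.  callback_addr is deliberately NOT updated,
 * mirroring the behaviour described above. */
int sketch_add_interface(struct sketch_host *h, uint32_t addr)
{
    int i;

    assert(h != NULL);
    for (i = 0; i < h->naddrs; i++) {
        if (h->addrs[i] == addr)
            return 0;                    /* already known */
    }
    if (h->naddrs >= SKETCH_MAX_IFACES)
        return -1;
    h->addrs[h->naddrs++] = addr;
    return 0;
}
```

After a call such as sketch_add_interface(&h, new_addr), the host has
two interfaces but callback_addr still names the first one.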

Here's the problem: Assume the client has callbacks registered with
the server and then disappears from the network.  While the client is
off the net, someone else makes a change that causes that callback to
be broken.  The fileserver can't reach the client, so the break gets
added to the delayed callback list.  The logic is such that the
fileserver will not process any requests from a client while there are
outstanding delayed callbacks to that client.

Now, if the client comes back on the same IP Address, everything works
fine.  The fileserver uses the cached callback connection and the
callbacks are cleared successfully.  However, if the client returns to
the network under a different address, this new address is added to
the existing host structure and then the delayed callbacks are
attempted.  Unfortunately, the fileserver is still using the (now
invalid) cached connection to the old IP address, so the delayed break
fails.  Therefore, this
client is locked out of the fileserver until:

	1) the fileserver reboots,
	2) the client returns to the original IP Address, or
	3) all the callbacks time out on their own.

This patch will fix this problem.  When the client makes a request and
the fileserver tries to break the delayed callbacks, if the breaking
fails then the fileserver will attempt to find a 'working' interface
by probing all the host interfaces for one that responds with the
correct Uuid.  If that succeeds then it resets the cached callback
connection and then breaks the delayed callbacks, thereby regaining
the connection to the client and proceeding with the proper cleanup
before the original request is completed.
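
The recovery sequence just described can be sketched as a small
standalone C model.  This is an illustration under stated assumptions,
not the patched fileserver code: every name here (sketch_host,
sketch_recover, the probe callback) is hypothetical, and the probe
callback stands in for "this address answers with the correct Uuid".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_IFACES 8

/* Hypothetical, simplified per-host record. */
struct sketch_host {
    uint32_t addrs[SKETCH_MAX_IFACES];   /* known interfaces */
    int naddrs;
    uint32_t callback_addr;              /* cached callback target */
    int delayed_breaks;                  /* pending delayed callback breaks */
};

/* Hypothetical probe: nonzero if addr responds with the expected Uuid. */
typedef int (*sketch_probe_fn)(uint32_t addr);

/* Example probe for demonstration: only 10.0.0.2 is reachable. */
int sketch_probe_new_addr_only(uint32_t addr)
{
    return addr == 0x0a000002u;
}

/* Deliver pending breaks over the cached connection; 0 on success. */
static int sketch_break_delayed(struct sketch_host *h, sketch_probe_fn up)
{
    if (!up(h->callback_addr))
        return -1;                       /* cached connection is dead */
    h->delayed_breaks = 0;
    return 0;
}

/* Probe the other interfaces; on a hit, reset the cached target. */
static int sketch_probe_alternates(struct sketch_host *h, sketch_probe_fn up)
{
    int i;

    for (i = 0; i < h->naddrs; i++) {
        if (h->addrs[i] != h->callback_addr && up(h->addrs[i])) {
            h->callback_addr = h->addrs[i];
            return 0;
        }
    }
    return -1;
}

/* Try the break; on failure, look for a working interface and retry.
 * Returns 0 if the delayed breaks were delivered, -1 otherwise. */
int sketch_recover(struct sketch_host *h, sketch_probe_fn up)
{
    assert(h != NULL && up != NULL);
    if (sketch_break_delayed(h, up) == 0)
        return 0;
    if (sketch_probe_alternates(h, up) != 0)
        return -1;                       /* client stays locked out */
    return sketch_break_delayed(h, up);
}
```

With a host that knows both 10.0.0.1 and 10.0.0.2 but caches 10.0.0.1,
sketch_recover(&h, sketch_probe_new_addr_only) switches the cached
target to 10.0.0.2 and clears the pending breaks.  If no interface
answers, it returns -1 and the request is refused as before.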

This patch is against OpenAFS 1.0.3, and Jeff and I both:

	a) were able to reproduce the problem reliably, and
	b) tested that this patch actually corrects the problem.

Please let us know if you have more questions.

-derek

--- src/viced/afsfileprocs.c-orig	Tue Mar  6 16:38:58 2001
+++ src/viced/afsfileprocs.c	Wed Mar 21 18:28:21 2001
@@ -314,7 +314,18 @@
     else if (thost->hostFlags & VENUSDOWN) {
       if (BreakDelayedCallBacks_r(thost)) {
 	ViceLog(0,("BreakDelayedCallbacks FAILED for host %08x which IS UP.  Possible network or routing failure.\n",thost->host));
-	code = -1;
+	if ( MultiProbeAlternateAddress_r (thost) ) {
+	  ViceLog(0, ("MultiProbe failed to find new address for host %x.%d\n",
+		      thost->host, thost->port));
+	  code = -1;
+	} else {
+	  ViceLog(0, ("MultiProbe found new address for host %x.%d\n",
+		      thost->host, thost->port));
+	  if (BreakDelayedCallBacks_r(thost)) {
+	    ViceLog(0,("BreakDelayedCallbacks FAILED AGAIN for host %08x which IS UP.  Possible network or routing failure.\n",thost->host));
+	    code = -1;
+	  }
+	}
       }
     } else {
        code =  0;

-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       warlord@MIT.EDU                        PGP key available