[OpenAFS] [1.2.7] Strange file server meltdown

Nickolai Zeldovich kolya@MIT.EDU
Thu, 12 Dec 2002 23:42:58 -0500


Russ Allbery <rra@stanford.edu> wrote:

> (2) From the rxdebug output, I can see that there are several clients that
>     have multiple connections open to the server.  Over time, as the
>     server continues to have problems, they all continue opening more
>     connections (about 20 per system over the course of ten minutes).  All
>     of the clients that have this property appear to be Windows machines;
>     none of the Unix clients seem to be doing this.

Out of curiosity, could you send the rxdebug output that pertains to
the problem client?  In particular, I'm interested in whether, for each
server connection that's attached, there is a client connection from the
fileserver to the client that is waiting for the client to reply.

>     This sounds like a bug in the Windows client.  I remember previous
>     list traffic mentioning something about Windows having nasty timeouts,
>     and this seems to confirm that.  Are there any thoughts on ways to
>     deal with this?  (Is this a tunable parameter somewhere, for example?)

The timeouts in the Windows client are in fact significantly lower than
those in the Unix clients.  I don't think they are tunable, but they are
#defines in src/WINNT/afsd/cm_conn.h:

  #define CM_CONN_CONNDEADTIME            20
  #define CM_CONN_HARDDEADTIME            40

(as opposed to 50 and infinity on the Unix clients respectively).  These
parameters let the Windows client create connections just a bit faster
than the server times them out, resulting in the accumulation you are
seeing.

> (4) Once the server goes into this failure mode, I would have expected
>     clients accessing replicated volumes on that server to fall over to
>     other replica sites, but they don't.  From the client perspective, the
>     server connection ends up in waiting_for_process for basically
>     forever.  Some client processes seem to just wait forever for it;
>     others seem to time out, but that timeout doesn't apparently turn into
>     a recognition that the file server is down, and the next time the same
>     volume is accessed, the client goes back to waiting on that file
>     server again.

The problem is that the server keeps responding to the Rx pings, so the
client doesn't think the server is down, just slow.  And the Unix clients
don't set a hard dead timeout by default, so they keep trying forever.  On
my own client builds, I use this change, which makes things time out much
faster:

--- afs.h	2002/10/16 03:58:15	1.35
+++ afs.h	2002/12/13 04:42:18
@@ -83,8 +83,8 @@
 #define	AFS_LRALLOCSIZ 	4096	    /* "Large" allocated size */
 #define	VCACHE_FREE	5
 #define	AFS_NRXPACKETS	80
-#define	AFS_RXDEADTIME	50
-#define AFS_HARDDEADTIME        120
+#define	AFS_RXDEADTIME	10
+#define AFS_HARDDEADTIME 30
 
 struct sysname_info {
   char *name;

--- afs_conn.c	2002/10/16 03:58:16	1.10
+++ afs_conn.c	2002/12/13 04:42:18
@@ -226,9 +226,7 @@
 	AFS_GUNLOCK();
 	tc->id = rx_NewConnection(sap->sa_ip, aport, service, csec, isec);
 	AFS_GLOCK();
-        if (service == 52) { 
-           rx_SetConnHardDeadTime(tc->id, AFS_HARDDEADTIME);
-       }
+        rx_SetConnHardDeadTime(tc->id, AFS_HARDDEADTIME);
 
 
 	tc->forceConnectFS = 0;	/* apparently we're appropriately connected now */
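
To illustrate why a hard dead time matters when the server still answers
pings, here is a minimal sketch.  It is a toy model, not the Rx code: the
names call_timers, check_call and seconds_until_giveup are hypothetical, and
it assumes that the dead time bounds silence from the peer, that the hard
dead time bounds the total duration of a call, and that 0 means "no hard
dead time".

```c
#include <assert.h>

/* Hypothetical toy model of the two timeouts; this is not the Rx code. */
enum call_state { CALL_OK, CALL_DEAD, CALL_HARD_DEAD };

struct call_timers {
    int dead_time;       /* seconds of silence tolerated (cf. AFS_RXDEADTIME) */
    int hard_dead_time;  /* cap on total call duration; 0 = no cap (assumed) */
};

/*
 * Classify a call at time `now` (in seconds), given when it started and
 * when the peer was last heard from (a data packet or a ping reply).
 */
static enum call_state
check_call(const struct call_timers *t, int start, int last_heard, int now)
{
    assert(start <= last_heard && last_heard <= now);
    if (t->hard_dead_time > 0 && now - start > t->hard_dead_time)
        return CALL_HARD_DEAD;
    if (now - last_heard > t->dead_time)
        return CALL_DEAD;
    return CALL_OK;
}

/*
 * Simulate a server that answers a ping every `ping_interval` seconds but
 * never completes the call.  Returns the second at which the client gives
 * up, or -1 if it is still waiting after `limit` seconds.
 */
static int
seconds_until_giveup(const struct call_timers *t, int ping_interval, int limit)
{
    int now, last_heard = 0;

    for (now = 1; now <= limit; now++) {
        if (now % ping_interval == 0)
            last_heard = now;   /* ping answered: the peer looks alive */
        if (check_call(t, 0, last_heard, now) != CALL_OK)
            return now;
    }
    return -1;
}
```

In this model, timers of { 50, 0 } (dead time 50, no hard dead time) never
give up on a server that answers a ping every 10 seconds, while { 10, 30 }
gives up once the call has run for more than 30 seconds.  A server that
stops answering entirely is caught by the ordinary dead time either way.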

-- kolya