[OpenAFS-devel] Windows Broken Pipes

Lantzer, Ryan lantzer@umr.edu
Mon, 9 Jun 2003 16:46:33 -0500


I am including a couple of patches based on OpenAFS 1.2.9 which I think
provide a possible solution to the problem with the Windows Client being
unable to see a volume after it was (briefly) marked as "missing" while
trying to talk to an AFS server. With the patched code, if a volume is
"missing" after talking to a particular server, that server is marked
offline for that volume, and the failed operation should be retried
without any added delay. The operation should then be tried against each
server for that volume until the operation succeeds, or until all of the
servers for that volume are marked offline. If all of the servers are
marked offline, the client waits 5 seconds, refreshes the server list
for that volume (with cm_ForceUpdateVolume), and indicates that the
operation should be retried. If all instances of that volume continue to
be "missing", the operation should eventually time out. But as long as
there is at least one server for that volume, the volume will not get
flagged with CM_ERROR_NOSUCHVOLUME, which would appear to make the
volume permanently unavailable until "fs checkvolumes" (or an
equivalent) is called.
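
For illustration only (this is not part of the patch): a minimal,
standalone sketch of the retry behaviour described above. Every name in
it (srv_status, vol_server, refresh_servers, try_volume, max_refreshes)
is hypothetical and does not exist in the OpenAFS source. The 5 second
sleep is left out, and a refresh limit stands in for the request
eventually timing out.

```c
/* Hypothetical stand-ins for the client's per-volume server state. */
enum srv_status { SRV_NOT_BUSY, SRV_OFFLINE };

struct vol_server {
    const char *name;        /* label for the server, illustration only */
    enum srv_status status;  /* offline means "skip for this volume" */
    int has_volume;          /* 1 if this server would serve the volume */
};

/* Stand-in for the server list refresh (the patch uses
 * cm_ForceUpdateVolume for this): every server becomes eligible again. */
static void refresh_servers(struct vol_server *s, int n)
{
    int i;
    for (i = 0; i < n; i++)
        s[i].status = SRV_NOT_BUSY;
}

/* Returns the index of the server that served the request, or -1 if
 * every server kept reporting the volume as missing and the refresh
 * limit was reached (standing in for the operation timing out). */
static int try_volume(struct vol_server *s, int n, int max_refreshes)
{
    int refreshes = 0;

    for (;;) {
        int i;

        for (i = 0; i < n; i++) {
            if (s[i].status == SRV_OFFLINE)
                continue;            /* already failed for this volume */
            if (s[i].has_volume)
                return i;            /* operation succeeded */
            /* Volume "missing" here: mark only this server offline and
             * move on to the next one without any added delay. */
            s[i].status = SRV_OFFLINE;
        }

        /* Every server is offline for this volume. */
        if (refreshes++ >= max_refreshes)
            return -1;
        /* The patch sleeps 5 seconds at this point before refreshing. */
        refresh_servers(s, n);
    }
}
```

Calling try_volume on a list where the first server lacks the volume and
the second has it returns index 1 and leaves the first server marked
offline, which is the "try the next server" case from the description.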

Ryan Lantzer


src\WINNT\afsd\cm.h

--- cm.h.orig	2001-04-30 01:48:02.000000000 -0500
+++ cm.h	2003-06-09 10:50:31.000000000 -0500
@@ -245,5 +245,6 @@
 #define CM_ERROR_BADNTFILENAME		(CM_ERROR_BASE+37)
 #define CM_ERROR_BUFFERTOOSMALL		(CM_ERROR_BASE+38)
 #define CM_ERROR_RENAME_IDENTICAL	(CM_ERROR_BASE+39)
+#define CM_ERROR_ALLOFFLINE		(CM_ERROR_BASE+40)
 
 #endif /*  __CM_H_ENV__ */


src\WINNT\afsd\cm_conn.c

--- cm_conn.c.orig	2003-03-14 13:49:00.000000000 -0600
+++ cm_conn.c	2003-06-09 12:54:00.000000000 -0500
@@ -134,6 +134,15 @@
 	if (reqp->flags & CM_REQ_NORETRY)
 		goto out;
 
+	/* if all servers are offline, mark them non-busy and start over */
+	if (errorCode == CM_ERROR_ALLOFFLINE) {
+		osi_Log0(afsd_logp, "cm_Analyze passed CM_ERROR_ALLOFFLINE.");
+		thrd_Sleep(5000);
+		/* cm_ForceUpdateVolume marks all servers as non_busy */
+		cm_ForceUpdateVolume(fidp, userp, reqp);
+		retry = 1;
+	}
+
 	/* if all servers are busy, mark them non-busy and start over */
 	if (errorCode == CM_ERROR_ALLBUSY) {
 		cm_GetServerList(fidp, userp, reqp, &serversp);
@@ -164,23 +173,35 @@
 		long oldSum, newSum;
 		int same;
 
-		/* Back off to allow move to complete */
-		thrd_Sleep(2000);
+		/* Log server being offline for this volume */
+		osi_Log4(afsd_logp, "cm_Analyze found server %d.%d.%d.%d marked offline for a volume",
+			((serverp->addr.sin_addr.s_addr & 0xff)),
+			((serverp->addr.sin_addr.s_addr & 0xff00)>> 8),
+			((serverp->addr.sin_addr.s_addr & 0xff0000)>> 16),
+			((serverp->addr.sin_addr.s_addr & 0xff000000)>> 24));
+		/* Create Event Log message */
+		{
+			HANDLE h;
+			char *ptbuf[1];
+			char s[100];
+			h = RegisterEventSource(NULL, AFS_DAEMON_EVENT_NAME);
+			sprintf(s, "cm_Analyze: Server %d.%d.%d.%d reported volume %d as missing.",
+				((serverp->addr.sin_addr.s_addr & 0xff)),
+				((serverp->addr.sin_addr.s_addr & 0xff00)>> 8),
+				((serverp->addr.sin_addr.s_addr & 0xff0000)>> 16),
+				((serverp->addr.sin_addr.s_addr & 0xff000000)>> 24),
+				fidp->volume);
+			ptbuf[0] = s;
+			ReportEvent(h, EVENTLOG_WARNING_TYPE, 0, 1009, NULL,
+				1, 0, ptbuf, NULL);
+			DeregisterEventSource(h);
+		}
 
-		/* Update the volume location and see if it changed */
-		cm_GetServerList(fidp, userp, reqp, &serversp);
-		oldSum = cm_ChecksumServerList(serversp);
-		cm_ForceUpdateVolume(fidp, userp, reqp);
+		/* Mark server offline for this volume */
 		cm_GetServerList(fidp, userp, reqp, &serversp);
-		newSum = cm_ChecksumServerList(serversp);
-		same = (oldSum == newSum);
-
-		/* mark servers as appropriate */
 		for (tsrp = serversp; tsrp; tsrp=tsrp->next) {
 			if (tsrp->server == serverp)
 				tsrp->status = offline;
-			else if (!same)
-				tsrp->status = not_busy;
 		}
 		retry = 1;
 	}
@@ -312,8 +333,10 @@
 	lock_ReleaseWrite(&cm_serverLock);
 	if (firstError == 0) {
 		if (someBusy) firstError = CM_ERROR_ALLBUSY;
-		else if (someOffline) firstError = CM_ERROR_NOSUCHVOLUME;
-		else firstError = CM_ERROR_TIMEDOUT;
+		else if (someOffline) firstError = CM_ERROR_ALLOFFLINE;
+		else if (serversp) firstError = CM_ERROR_TIMEDOUT;
+		/* Only return CM_ERROR_NOSUCHVOLUME if there are no servers for this volume */
+		else firstError = CM_ERROR_NOSUCHVOLUME;
 	}
 	osi_Log1(afsd_logp, "cm_ConnByMServers returning %x", firstError);
         return firstError;
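
For illustration only (not part of the patch): the last hunk changes
which error cm_ConnByMServers falls back to when no earlier error was
recorded. The sketch below restates that precedence as a standalone
function. The enum values and the name pick_first_error are hypothetical;
the real codes are the CM_ERROR_* macros from cm.h.

```c
/* Hypothetical codes standing in for the CM_ERROR_* macros in cm.h. */
enum pick_err {
    ERR_ALLBUSY = 1,
    ERR_ALLOFFLINE,
    ERR_TIMEDOUT,
    ERR_NOSUCHVOLUME
};

/* Mirrors the patched fallback order:
 *   1. some server was busy         -> all busy
 *   2. some server was offline      -> all offline (was: no such volume)
 *   3. the volume has servers       -> timed out
 *   4. the volume has no servers    -> no such volume */
static int pick_first_error(int some_busy, int some_offline, int have_servers)
{
    if (some_busy)
        return ERR_ALLBUSY;
    if (some_offline)
        return ERR_ALLOFFLINE;
    if (have_servers)
        return ERR_TIMEDOUT;
    return ERR_NOSUCHVOLUME;
}
```

With this ordering, "no such volume" is only reachable when the server
list for the volume is empty, which matches the comment added in the
hunk.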


-----Original Message-----
From: James Peterson  james@abrakus.com
Sent: Thu, 6 Mar 2003 09:14:43 -0800
To: 'openafs-devel@openafs.org'
Subject: [OpenAFS-devel] Windows Broken Pipes


Ryan, I have seen something similar on my XP system.  The drive is not
labeled and is not accessible.  It has happened with "subst" drives
without AFS running as well as with AFS drives.  I was going to try
removing something about the drive definition from the registry and
rebooting.  I suspect it's XP or DOS, and I have been just living with
it.  Let's stay in touch about this.

James
"Integrity is the Base of Excellence"


-----Original Message-----
From: Lantzer, Ryan
Sent: Wednesday, March 05, 2003 10:25 AM
To: 'openafs-devel@openafs.org'
Subject: [OpenAFS-devel] Windows AFS client reports broken pipe when
trying to flush volumes that have become unavailable


At our site, we seem to have a large number of instances where one or
more (but not all) drives mapped to volumes on AFS appear to suddenly
become unavailable. The drives appear to be empty, and attempting to
flush the problematic volumes results in an error message indicating a
broken pipe. This problem does not seem to affect all drives mapped to
resources on AFS, since some drives mapped to different volumes continue
to work properly. Either rebooting or refreshing the volume ID/name map
appears to make things start working again. We are seeing the problem
under Windows XP Pro with OpenAFS 1.2.8 installed.

Has anyone else seen similar problems and/or know of a way to reproduce
this problem?

Ryan Lantzer