[OpenAFS-devel] Windows Broken Pipes

Lantzer, Ryan lantzer@umr.edu
Fri, 4 Apr 2003 15:47:01 -0600


While trying to look further into this problem, I discovered an
undocumented "vos offline" command that can be used to temporarily mark
an instance of a volume on a particular server/partition as offline (or
busy). When we used this command to mark a RW volume offline and ran
the commands "fs flushv" and "dir" repeatedly on a Windows client to
ensure that the volume was repeatedly being accessed, the client started
producing CM_ERROR_NOSUCHVOLUME error codes and displaying the
"Broken pipe" message. After the "vos offline" command returned the
volume to online status, the Windows client did not recover. The volume
remained inaccessible, and flushing the volume via Windows Explorer
still resulted in the "Broken pipe" message. The volume continued to be
unavailable until the command "fs checkvolumes" was run.

FYI, the syntax for setting an instance of a volume offline is as
follows:

vos offline <file server> <partition (e.g. "a")> <volume name> <seconds to remain offline>
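
For example, the reproduction sequence we used looked roughly like the
following (the server, volume, and drive letter below are placeholders,
not our site's actual values):

	vos offline fs1.example.edu a test.volume 120
	fs flushv S:\
	dir S:\

with the last two commands repeated while the volume was offline.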

I believe that cm_Analyze and cm_ConnByMServers in
src/WINNT/afsd/cm_conn.c are not properly handling the case where all of
the servers for a particular volume get marked as "offline" because they
each return one of the following errors for that volume: VOFFLINE,
VNOVOL, VMOVED, VSALVAGE, and VNOSERVICE. If this happens, the next call
to cm_ConnByMServers (by way of cm_Conn) will return
CM_ERROR_NOSUCHVOLUME. On any further attempt to access that volume,
the cached server list will still show the volume marked offline on all
of its servers (unless some of the servers happen to be marked as
down). Once this happens, cm_ConnByMServers and cm_Analyze will never
try to reset the "offline" status, and they will continue to report
CM_ERROR_NOSUCHVOLUME until something else causes the volume cache
entry to be reset, such as the command 'fs checkvolumes'. I believe
that I have observed this behavior for a RW volume that is stored on
only one server, after marking it offline with the 'vos offline'
command.

I have an idea for modifying these two functions to alter the behavior
for processing offline volumes to consider that the volume could
possibly become online later. Part of the idea is to create a new CM
error code CM_ERROR_ALLOFFLINE to signify that a particular volume is
marked offline on all of the servers for that volume (similar to the
code CM_ERROR_ALLBUSY).

First, the following changes would be made to this section of
cm_Analyze:

	/* special codes:  missing volumes */
	if (errorCode == VNOVOL || errorCode == VMOVED || errorCode == VOFFLINE
	    || errorCode == VSALVAGE || errorCode == VNOSERVICE) {
		long oldSum, newSum;
		int same;

		/* Back off to allow move to complete */
/*		thrd_Sleep(2000);*/

		/* Update the volume location and see if it changed */
/*		cm_GetServerList(fidp, userp, reqp, &serversp); */
/*		oldSum = cm_ChecksumServerList(serversp); */
/*		cm_ForceUpdateVolume(fidp, userp, reqp); */
		cm_GetServerList(fidp, userp, reqp, &serversp);
/*		newSum = cm_ChecksumServerList(serversp); */
/*		same = (oldSum == newSum); */

		/* mark servers as appropriate */
		for (tsrp = serversp; tsrp; tsrp = tsrp->next) {
			if (tsrp->server == serverp)
				tsrp->status = offline;
/*			else if (!same) */
/*				tsrp->status = not_busy; */
		}
		retry = 1;
	}

The first change removes the delay so that one offline replica of a RO
volume quickly fails over to the other replicas, in the hope that
another replica will be online. The comment by the thrd_Sleep suggests
that it might be helpful to wait for a move to complete, but it looks
like the VMOVED code is returned only after the move is already
complete, so there doesn't seem to be a reason to wait here. Next, the
calls to cm_ForceUpdateVolume and cm_ChecksumServerList are removed,
because cm_ForceUpdateVolume appears to reset all of the status flags
to "not_busy" and forget which servers were previously marked
"offline". I don't understand what the old code is supposed to do when
a RO volume with multiple replicas is marked offline. These changes
should allow the client to quickly fail over from one RO replica to
each of the others until they all get marked offline.


Then, the end of cm_ConnByMServers would be changed to the following:

	if (firstError == 0) {
		if (someBusy) firstError = CM_ERROR_ALLBUSY;
		else if (someOffline) firstError = CM_ERROR_ALLOFFLINE;
		else if (serversp) firstError = CM_ERROR_TIMEDOUT;
		else firstError = CM_ERROR_NOSUCHVOLUME;
	}

If all of the servers that are not marked as down are marked offline,
and at least one server is marked offline, it will return
CM_ERROR_ALLOFFLINE. If the server list is non-empty but every server
is down, it will return CM_ERROR_TIMEDOUT, indicating an unrecoverable
error without suggesting that the volume doesn't exist. If there are no
servers at all, it will return CM_ERROR_NOSUCHVOLUME (I expect that
this would happen if a volume were deleted).
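
To illustrate just this error-selection step in isolation, here is a
small compilable sketch. The types, names, and values are simplified
stand-ins of my own, not the actual cm_conn.c declarations, and the
connection attempts that would normally set the flags are omitted:

	#include <stdio.h>

	/* Simplified stand-ins for the real cm_conn.c types. */
	enum srv_status { srv_not_busy, srv_busy, srv_offline };

	struct srv {
		enum srv_status status;
		int down;               /* stand-in for the server "down" flag */
		struct srv *next;
	};

	#define CM_ERROR_ALLBUSY       1    /* placeholder values */
	#define CM_ERROR_ALLOFFLINE    2
	#define CM_ERROR_TIMEDOUT      3
	#define CM_ERROR_NOSUCHVOLUME  4

	/* Mirrors the proposed end of cm_ConnByMServers, assuming no
	 * connection succeeded and no other error was recorded. */
	static int classify(struct srv *serversp)
	{
		int someBusy = 0, someOffline = 0;
		struct srv *tsrp;

		for (tsrp = serversp; tsrp; tsrp = tsrp->next) {
			if (tsrp->down)
				continue;   /* down servers only count toward TIMEDOUT */
			if (tsrp->status == srv_busy)
				someBusy = 1;
			else if (tsrp->status == srv_offline)
				someOffline = 1;
		}

		if (someBusy) return CM_ERROR_ALLBUSY;
		else if (someOffline) return CM_ERROR_ALLOFFLINE;
		else if (serversp) return CM_ERROR_TIMEDOUT;
		else return CM_ERROR_NOSUCHVOLUME;
	}

	int main(void)
	{
		struct srv b = { srv_offline, 0, NULL };
		struct srv a = { srv_offline, 0, &b };

		/* two offline replicas -> expect 2 (CM_ERROR_ALLOFFLINE) */
		printf("%d\n", classify(&a));
		return 0;
	}

The key point is the ordering: busy wins over offline, and an empty
server list is the only case that maps to "no such volume".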

Lastly, I would add a case in cm_Analyze to handle the new error
CM_ERROR_ALLOFFLINE. It should be placed in the same section as the case
for handling the CM_ERROR_ALLBUSY code.

	/* if all servers are offline, mark them non-busy and start over */
	if (errorCode == CM_ERROR_ALLOFFLINE) {
		thrd_Sleep(5000);
		/* cm_ForceUpdateVolume marks all servers as not_busy */
		cm_ForceUpdateVolume(fidp, userp, reqp);
		retry = 1;
	}

First, it will wait awhile for the cause of the volume being offline to
go away. Then it will run cm_ForceUpdateVolume to reload the list of
servers for this volume, and to mark all of them as not_busy. Lastly, it
will indicate that the operation should be retried. I believe that the
cm_ConnByMServers function will halt the operation with the error
CM_ERROR_TIMEDOUT if the operation takes too long.

Of course, the new error code would be added to the end of the file
src/WINNT/afsd/cm.h.
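
Assuming cm.h defines its error codes as offsets from CM_ERROR_BASE,
as the existing CM_ERROR_* values are, the addition might look like the
line below. The offset shown is only a placeholder; the real value
would be whatever follows the last code currently defined there:

	#define CM_ERROR_ALLOFFLINE (CM_ERROR_BASE + 33)   /* placeholder offset */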

I plan on testing these changes out before posting diffs. But I would
like to know if someone has any opinions about this kind of change to
the code. In particular, if someone sees a problem with these changes,
please let me know. Also, if someone else can get a better idea of what
is currently wrong with the code and suggest a better solution, that
would be welcome too.

Ryan Lantzer



-----Original Message-----
From: Lantzer, Ryan
Sent: Friday, March 21, 2003 4:03 PM
To: 'openafs-devel@openafs.org'
Subject: [OpenAFS-devel] Windows Broken Pipes

There have been more instances of this problem at our site, and we were
able to capture a trace dump with some useful output. I believe that a
RO volume with 3 replicas was released just
before or just as the problem started occurring on this system. A couple
of seconds after the client received a cm_RevokeVolumeCallback() for the
volume in question, cm_GetCallback() was called and it looks like it was
trying to fetch the status of something within that volume. It looks
like the call to RXAFS_FetchStatus() failed two times with VOFFLINE
against two of the RO servers. But on the third try cm_ConnByMServers()
failed with CM_ERROR_NOSUCHVOLUME, making it look like it didn't even
try the third server. After receiving CM_ERROR_NOSUCHVOLUME, the
cm_GetCallback() function decides that the operation has failed and
exits. It looks like once cm_ConnByMServers() returns
CM_ERROR_NOSUCHVOLUME, there is no way for cm_Conn() to connect to that
volume again except after running 'fs checkv' to invalidate the volume
cache entry, and force it to be reloaded from the VLDB.

I found that once this problem had occurred, trying to flush the volume
in question using the Explorer interface resulted in the following error
message:

afs_shl_ext

Error flushing volume for S:\: Broken pipe

I'm still trying to find a way to reproduce this problem so that I can
perform additional tests to find out more about it.

In the meantime, does anyone know why CM_ERROR_NOSUCHVOLUME would be
returned for an offline volume? If there is a reason, shouldn't the
volume become available again after it comes back online?

Ryan Lantzer

-----Original Message-----
From: James Peterson  james@abrakus.com
Sent: Thu, 6 Mar 2003 09:14:43 -0800
To: 'openafs-devel@openafs.org'
Subject: [OpenAFS-devel] Windows Broken Pipes


Ryan, I have seen something similar on my XP system. The drive is not
labeled and is not accessible. It has happened with "subst" drives
without AFS running, as well as with AFS drives. I was going to try to
remove something about the drive definition from the registry and
reboot. I suspect it's XP/DOS related, and I have just been living with
it. Let's stay in touch about this.

James
"Integrity is the Base of Excellence"


-----Original Message-----
From: Lantzer, Ryan
Sent: Wednesday, March 05, 2003 10:25 AM
To: 'openafs-devel@openafs.org'
Subject: [OpenAFS-devel] Windows AFS client reports broken pipe when
trying to flush volumes that have become unavailable


At our site, we seem to have a large number of instances where one or
more (but not all) drives mapped to volumes on AFS appear to suddenly
become unavailable. The drives appear to be empty, and attempting to
flush the problematic volumes results in an error message indicating a
broken pipe. This problem does not seem to affect all drives mapped to
resources on AFS, since some drives mapped to different volumes continue
to work properly. Either rebooting or refreshing the volume ID/name map
appears to make things start working again. We are seeing the problem
under Windows XP Pro with OpenAFS 1.2.8 installed.

Has anyone else seen similar problems and/or know of a way to reproduce
this problem?

Ryan Lantzer