[OpenAFS-devel] CopyOnWrite failures continue still

Thu, 28 Mar 2002 12:59:54 -0500

hoffman@cs.pitt.edu writes:
> The second corrupted volume, cs.usr0.mbell, 536877058, exhibited the
> same behavior as all of the other CopyOnWrite failures.  A log of
> what I did this morning is in ftp://ftp.cs.pitt.edu/hoffman/openafs/mbell.log.
> 
> The affected fileserver is running RedHat 7.2, kernel 2.4.9-21 and
> OpenAFS 1.2.3, non-threaded fileserver with the ihandle.c patch.
> 
> What should I try next?
> 
> 	---Bob.

I'm assuming that you have source to all the bits, and are prepared
to tackle this as a software developer, and want only clues as to which
avenues to investigate first.  If this doesn't describe you, the
following may be of less value to you.

The fileserver process is most likely seeing "EIO" from the kernel.  If
it didn't come from the hardware, then it's got to be a software
thing.  Several approaches:

(1) grep for EIO in the kernel source, try to figure out
	why the fileserver might get this error (i.e., dig
	through the kernel filesys layer, syscall interface,
	etc...)  For kernel filesys code, this error *should*
	only indicate a hardware failure, but the linux developers
	(or POSIX standards, or yada yada...) don't necessarily
	share that belief, and you may well find a code path
	where that error really means "invalid parameter" (which
	"should" be EINVAL), authorization failure (EPERM or EACCES)
	etc.
(2) do a "strace" on the fileserver process, poke at the
	volume, see if you can get a record of
	the failing syscall & parameters.  Also try strace
	on anything else that touches the disk and exercises the bug.
(3) The AFS source at least used to come with standalone utilities
	that would poke at the filesystem directly, using
	the same interface the fileserver uses.  There might
	be a self-contained voldump utility, or one that reads an
	arbitrary inode, or some such.  Perhaps running those will generate
	interesting clues, or better yet, offer a smaller
	self-contained way to exercise the bug.
(4) try doing a "vos dump" for the affected volume, see what you get,
	both in terms of errors from volserver, & in terms of
	what's actually in the dump.  This won't destroy any data,
	so is definitely a simple diagnostic.
(5) try running the salvager on the affected volume(s), see
	what the salvager reports.  Be prepared to restore the volume
	from tape--on the other hand, if the salvager eats the
	volume, you were probably going to have to do that anyways.
	This will likely destroy data, so it may destroy evidence
	of the bug.  You'll want to collect read-only evidence
	first before trying this.
(6) If strace doesn't show EIO coming back from kernel-land,
	then another avenue to investigate is libc -- is
	there something in there that could be returning EIO
	to the fileserver (there shouldn't be, not for this,
	but you never know...)  Also check for anything in
	the AFS libraries proper that might just happen to
	return this.
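
For (2) and (6), a rough sketch of how one might sift a saved strace
log for syscalls that came back with EIO.  This assumes the trace was
captured with something like "strace -f -o fs.trace -p <fileserver pid>"
(fs.trace is just a made-up file name), and that failed calls appear in
the usual "name(args) = -1 EIO (Input/output error)" form.  The function
name find_errno is my own invention, not part of any tool.  Lines with
timestamps, or calls split into "unfinished"/"resumed" halves, are not
handled here.

```python
import re
import sys

# A failed syscall line as strace normally prints it, optionally
# prefixed with a pid ("1234 ..." in -o files, "[pid 1234] ..." on a tty):
#   1234 pread(9, 0x8100000, 8192, 0) = -1 EIO (Input/output error)
FAILED = re.compile(
    r"^(?:\[pid\s+(\d+)\]\s+|(\d+)\s+)?"   # optional pid prefix
    r"(\w+)\((.*)\)"                       # syscall name and raw arguments
    r"\s+=\s+-1\s+(E[A-Z0-9]+)\s+\(.*\)\s*$"  # -1, errno name, message
)


def find_errno(lines, wanted="EIO"):
    """Return (pid, syscall, args) for every call that failed with `wanted`.

    pid is None when the line carries no pid prefix.
    """
    hits = []
    for line in lines:
        m = FAILED.match(line.rstrip("\n"))
        if m is None:
            continue  # successful call, signal line, or unrecognized format
        if m.group(5) == wanted:
            pid = m.group(1) or m.group(2)
            hits.append((pid, m.group(3), m.group(4)))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_eio.py fs.trace   (script name is hypothetical)
    with open(sys.argv[1]) as trace:
        for pid, name, args in find_errno(trace):
            print(f"pid={pid} {name}({args}) failed with EIO")
```

If this turns up nothing while the fileserver still logs the failure,
that points toward (6): the error is being manufactured somewhere in
user space rather than returned by the kernel.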

				-Marcus Watts
				UM ITCS Umich Systems Group