[OpenAFS-devel] CopyOnWrite failures continue still
Marcus Watts
mdw@umich.edu
Thu, 28 Mar 2002 12:59:54 -0500
hoffman@cs.pitt.edu writes:
> The second corrupted volume, cs.usr0.mbell, 536877058, exhibited the
> same behavior as all of the other CopyOnWrite failures. A log of
> what I did this morning is in ftp://ftp.cs.pitt.edu/hoffman/openafs/mbell.log.
>
> The affected fileserver is running RedHat 7.2, kernel 2.4.9-21 and
> OpenAFS 1.2.3, non-threaded fileserver with the ihandle.c patch.
>
> What should I try next?
>
> ---Bob.
I'm assuming that you have source to all the bits, and are prepared
to tackle this as a software developer, and want only clues as to which
avenues to investigate first. If this doesn't describe you, the
following may be of less value to you.
The fileserver process is most likely seeing "EIO" from the kernel. If
it didn't come from the hardware, then it's got to be a software
thing. Several approaches:
(1) grep for EIO in the kernel source, try to figure out
why the fileserver might get this error (i.e., dig
through the kernel filesys layer, syscall interface,
etc.). For kernel filesys code, this error *should*
only indicate a hardware failure, but the Linux developers
(or POSIX standards, or yada yada...) don't necessarily
share that belief, and you may well find a code path
where that error really means "invalid parameter" (which
"should" be EINVAL), authorization failure (EPERM or EACCES)
etc.
(2) do a "strace" on the fileserver process, poke at the
volume, see if you can get a record of
the failing syscall & parameters. Also try strace
on anything else that touches the disk and exercises the bug.
(3) The AFS source at least used to come with standalone utilities
that would poke at the filesystem directly, using
the same interface the fileserver uses. There might
be a self-contained voldump utility, or read an arbitrary
inode, or some such. Perhaps running those will generate
interesting clues, or better yet, offer a smaller
self-contained way to exercise the bug.
(4) try doing a "vos dump" for the affected volume, see what you get,
both in terms of errors from volserver, & in terms of
what's actually in the dump. This won't destroy any data,
so is definitely a simple diagnostic.
(5) try running the salvager on the affected volume(s), see
what the salvager sees. Be prepared to restore the volume
from tape--on the other hand, if the salvager eats the
volume, you were probably going to have to do that anyways.
This will likely destroy data, so it may destroy evidence
of the bug. You'll want to collect read-only evidence
first before trying this.
(6) If strace doesn't show EIO coming back from kernel-land,
then another avenue to investigate is libc -- is
there something in there that could be returning EIO
to the fileserver (there shouldn't be, not for this,
but you never know...). Also check for anything in
the AFS libraries proper that might just happen to
return this.
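For (2) and (6), something like the following could help sift a saved strace log for syscalls that came back with EIO. This is only a rough sketch in Python: the function name, the regular expression, and the sample log lines are all made up for illustration, and it assumes the plain "pid syscall(args) = -1 ERRNO (message)" layout that strace writes with -f (no timestamps, and it skips "<unfinished ...>" / "resumed" lines).

```python
import re

# Matches a completed strace line that failed with a named errno, e.g.
#   1234 read(7, 0x8100000, 4096) = -1 EIO (Input/output error)
# The leading pid may be absent, bare ("1234 "), or bracketed ("[pid 1234] ").
# This pattern is an assumption about the log layout, not anything AFS-specific.
FAILED_CALL = re.compile(
    r"^(?:\[pid\s+(?P<pid1>\d+)\]\s+|(?P<pid2>\d+)\s+)?"
    r"(?P<syscall>\w+)\((?P<args>.*)\)\s+=\s+-1\s+(?P<errno>E[A-Z0-9]+)\b"
)


def find_failures(lines, errno_name="EIO"):
    """Return (pid, syscall, args) for each line that failed with errno_name.

    pid is None when the log line carries no pid prefix.
    """
    hits = []
    for line in lines:
        m = FAILED_CALL.match(line.strip())
        if m and m.group("errno") == errno_name:
            pid = m.group("pid1") or m.group("pid2")
            hits.append((pid, m.group("syscall"), m.group("args")))
    return hits


# Hypothetical sample lines, only to show the shape of the input.
sample_log = [
    '1234 open("/vicepa/somefile", O_RDONLY) = -1 ENOENT (No such file or directory)',
    "1234 read(7, 0x8100000, 4096) = -1 EIO (Input/output error)",
    '1234 write(9, "abc", 3) = 3',
]

for pid, syscall, args in find_failures(sample_log):
    print(pid, syscall, args)
```

To use it on a real log, pass the lines of the saved strace output to find_failures() in place of sample_log. If nothing turns up even though the fileserver is reporting the error, that points toward (6): the EIO may be coming from libc or the AFS libraries rather than from the kernel.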
-Marcus Watts
UM ITCS Umich Systems Group