[OpenAFS-devel] CopyOnWrite failures continue still

Fri, 29 Mar 2002 09:51:33 -0500 (EST)

Marcus,

Thanks for your detailed response.

> (1) grep for EIO in the kernel source

Did that -- only 2,440 instances.  :-)

I think this one was answered by Chaskiel Grundman yesterday.
The EIO errors that I was seeing were happening only when the corrupted
volume did NOT cause a CopyOnWrite error.  The CopyOnWrite errors have
never been accompanied by an EIO.

> (2) do a "strace" on the fileserver process
> (3) The AFS source at least used to come with standalone utilities
> 	that would poke at the filesystem directly

I will try these.
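
For (2), a minimal sketch of how the strace output could be checked for
EIO returns, assuming the trace was saved to a text file in the usual
strace line format (something like "strace -f -o <file> -p <pid>" against
the fileserver). The helper name find_eio_calls and the regular expression
are illustrative assumptions, not anything from the AFS tree:

```python
import re

# Matches a syscall line that failed with EIO, with or without a pid prefix, e.g.
#   1234  read(5, 0x80b4f20, 8192) = -1 EIO (Input/output error)
#   [pid  1235] write(6, "x", 1) = -1 EIO (Input/output error)
EIO_LINE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*\)\s+=\s+-1\s+EIO\b"
)


def find_eio_calls(lines):
    """Return (line_number, syscall_name) for each call that returned EIO."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = EIO_LINE.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits
```

Calling find_eio_calls(open(path)) on the saved trace lists the failing
calls. An empty result would suggest the EIO is generated in userland
rather than coming back from the kernel, which is the case item (6)
covers.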

> (4) try doing a "vos dump" for the affected volume, see what you get,

Here's what I got:

    [root@spot]# vos dump -id 536878347 -file /tmp/536878347.dump \
         -server spot -partition /vicepc -verbose
    Could not start transaction on the volume 536878347 to be dumped
    Volume needs to be salvaged
    Error in vos dump command.
    Volume needs to be salvaged
    [root@spot]# 

VolserLog reports:

    Fri Mar 29 09:35:21 2002 VAttachVolume: volume salvage flag is ON for
        /vicepc/V0536878347.vol; volume needs salvage

> (5) try running the salvager on the affected volume(s), see
> 	what the salvager sees.

This is what we always do; the log of yesterday's salvage is in the
mbell.log file in my public FTP area.

> (6) If strace doesn't show EIO coming back from kernel-land,
> 	then another avenue to investigate is libc -- is
> 	there something in there that could be returning EIO
> 	to the fileserver (there shouldn't be, not for this,
> 	but you never know...)  Also check for anything in
> 	the AFS libraries proper that might just happen to
> 	return this.

And this is precisely what Chaskiel found, in viced/physio.c.

	Thanks,
	---Bob.