[OpenAFS-devel] How to create inconsistency in the volserver and my mind.

Jeffrey Hutzelman jhutz@cmu.edu
Wed, 16 Mar 2005 20:30:08 -0500


On Tuesday, March 15, 2005 12:20:14 PM +0100 Harald Barth <haba@pdc.kth.se> 
wrote:

> command-that-fails | vos restore example a foo -overwrite full
>
> Now volume foo is gone on server but still exists in vldb. Why does
> a vos restore first delete and then create the volume? Would it
> not be better to first create a clone and then when we have a good
> clone remove the original?

I suppose it's possible you could construct something that does this using 
the convert-RO-to-RW functionality that is in very recent servers.  But I'd 
have to think about it for a lot longer to convince myself that this would 
actually be stable.


> Another smaller problem are the error messages that seem to be bogous:
>
> Tue Mar 15 11:05:19 2005 VCreateVolume: Header file
> /vicepb/V0537057012.vol already exists! Tue Mar 15 11:05:19 2005 1
> Volser: CreateVolume: Unable to create the volume; aborted, error code 104
> Tue Mar 15 11:05:19 2005 : Connection reset by peer
>
> This is not an error, I wanted an overwrite, so of course the volume
> exists. And what exactly is aborted? And what is 104?

Those aren't error messages; they're log messages.  They are normal.  The 
-overwrite switch doesn't mean the volume already exists; it tells vos what 
to do _if_ the volume already exists.  The way it tells that is by trying 
to create the volume and looking at the error code.

"aborted" is what happens when an RPC returns an error.
Error 104 is VVOLEXISTS, "volume already exists".
On Linux, it also happens to be ECONNRESET, "Connection reset by peer", 
which is why you're seeing that message.  It's unfortunate that volume 
package errors conflict with system errors, but it is a legacy we must live 
with, at least for now.


> Tue Mar 15 11:05:19 2005 1 Volser: Delete: volume 537057012 deleted
> Tue Mar 15 11:05:19 2005 1 Volser: CreateVolume: volume 537057012
> (dah.test.flopp) created Tue Mar 15 11:05:19 2005 1 Volser:
> RestoreVolume: Error reading header file for dump; aborted
>
> And this is the log from the broken -overwrite full which results in
> the vl-volser inconsistency.

Yeah, that makes sense.  The error is referring to the volserver's 
inability to read the dump header over the wire, which is not unsurprising 
since in your example, vos will never send one.


> But it is not less confusing when the overwrite actually succeeds:
>
># /opt/afsbackup/bin/adsmpipe -s /scratch -A -x -f
># dah.test.houting.a.backup.00000000.20050210 | vos restore houting a
># dah.test.flapp -overwrite full -verbose -local
> Volume exists; Will delete and perform full restore
> Restoring volume dah.test.flapp Id 537057015 on server houting.pdc.kth.se
> partition /vicepa ..Deleting the previous volume 537057015 ... done  done
> Updating the existing VLDB entry
> ------- Old entry -------
>
> dah.test.flapp
>     RWrite: 537057015
>     number of sites -> 1
>        server houting.pdc.kth.se partition /vicepa RW Site
> ------- New entry -------
>
> Failed to get info about server's -2098337598 address(es) from vlserver
> (err=0)
> dah.test.flapp
>     RWrite: 537057015
>     number of sites -> 1
>        server houting.pdc.kth.se partition /vicepa RW Site
> Restored volume dah.test.flapp on houting /vicepa
> bash-2.05b# /opt/afsbackup/bin/adsmpipe -s /scratch -A -x -f
> dah.test.houting.a.backup.20050210.20050305 | vos restore houting a
> dah.test.flapp -overwrite incremental -verbose -local  Restoring volume
> dah.test.flapp Id 537057015 on server houting.pdc.kth.se partition
> /vicepa .. done Restored volume dah.test.flapp on houting /vicepa
>
> -2098337598??? What I can see there is nothing wrong with the addrs of
> that server:

-2098337598 is 0x82EDE8C2 or 130.237.232.194, houting.pdc.kth.se
The call which prints this message knows it is always printing a server's 
IP address; I don't know why it prints it as an integer instead of as a 
dotted quad.

You'll note the message in question says (err=0).  This message actually 
shouldn't be printed at all in that case, but the conditional was 
inadvertently removed between src/volser/vsprocs.c verisons 1.15 and 1.16, 
in DELTA no-copy-libafs-builds-20021015.  What this delta has to do with 
changing the way errors are reported in vsprocs, I do not know.



> Then in other news I have a sudden volserver restart on another server
>
> cysteine# tail -1 BosLog
> Mon Mar 14 18:15:08 2005: fs:vol exited on signal 6
>
> and there is nothing in the Volserlog but the vos backup that
> according to the script that run the vos backup completed
> without error. No core either.
>
> cysteine# tail -1 VolserLog.old
> Mon Mar 14 18:15:08 2005 1 Volser: Clone: Recloning volume 537039787 to
> volume 537039789
>
> Suggestions and leads welcome,


What version and platform?  How long was it running before it exited?

-- Jeff