[OpenAFS-devel] How to create inconsistency in the volserver and my mind.
Harald Barth
haba@pdc.kth.se
Tue, 15 Mar 2005 12:20:14 +0100 (MET)
Hi everybody,
I think this behaviour needs improvement....
Start with an existing volume foo:
vos create example a foo 4711
Do a restore where the command feeding the pipe fails:
command-that-fails | vos restore example a foo -overwrite full
Now volume foo is gone on the server but still exists in the VLDB. Why
does a vos restore first delete and then create the volume? Would it
not be better to first create a clone of the existing volume and only
remove the original once the restore has produced a good volume?
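Until something like that exists in the volserver, I guess the safest
workaround on the admin side is to never restore over the live volume
at all, but into a temporary name, and only swap names once the whole
pipe has finished. Untested sketch, the foo.restoretmp name is just
something I made up:

command-that-fails | vos restore example a foo.restoretmp -verbose
# check ${PIPESTATUS[@]} in bash, plain $? only covers vos restore
vos remove example a foo          # old volume (and its VLDB site) go away
vos rename foo.restoretmp foo     # restored data takes over the old name
# note: the restored volume gets a new volume ID, and replicated
# volumes need more care (vos addsite / vos release again)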
Another, smaller problem is the error messages, which seem to be bogus:
Tue Mar 15 11:05:19 2005 VCreateVolume: Header file /vicepb/V0537057012.vol already exists!
Tue Mar 15 11:05:19 2005 1 Volser: CreateVolume: Unable to create the volume; aborted, error code 104
Tue Mar 15 11:05:19 2005 : Connection reset by peer
This is not an error: I wanted an overwrite, so of course the volume
exists. And what exactly is aborted? And what is error code 104?
Tue Mar 15 11:05:19 2005 1 Volser: Delete: volume 537057012 deleted
Tue Mar 15 11:05:19 2005 1 Volser: CreateVolume: volume 537057012 (dah.test.flopp) created
Tue Mar 15 11:05:19 2005 1 Volser: RestoreVolume: Error reading header file for dump; aborted
And this is the log from the broken -overwrite full, which results in
the vlserver/volserver inconsistency.
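If anyone else ends up in this state: I assume the stale VLDB entry can
be cleaned out with the usual tools, something along these lines (the
volume name is from the log above, server and partition are placeholders):

vos examine dah.test.flopp        # see what the VLDB still claims
vos delentry dah.test.flopp       # drop the stale entry by hand
# or resync VLDB and fileserver in bulk for that partition
vos syncvldb -server <fileserver> -partition b
vos syncserv -server <fileserver> -partition b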
But it is no less confusing when the overwrite actually succeeds:
# /opt/afsbackup/bin/adsmpipe -s /scratch -A -x -f dah.test.houting.a.backup.00000000.20050210 | vos restore houting a dah.test.flapp -overwrite full -verbose -local
Volume exists; Will delete and perform full restore
Restoring volume dah.test.flapp Id 537057015 on server houting.pdc.kth.se partition /vicepa ..Deleting the previous volume 537057015 ... done
done
Updating the existing VLDB entry
------- Old entry -------
dah.test.flapp
RWrite: 537057015
number of sites -> 1
server houting.pdc.kth.se partition /vicepa RW Site
------- New entry -------
Failed to get info about server's -2098337598 address(es) from vlserver (err=0)
dah.test.flapp
RWrite: 537057015
number of sites -> 1
server houting.pdc.kth.se partition /vicepa RW Site
Restored volume dah.test.flapp on houting /vicepa
bash-2.05b# /opt/afsbackup/bin/adsmpipe -s /scratch -A -x -f dah.test.houting.a.backup.20050210.20050305 | vos restore houting a dah.test.flapp -overwrite incremental -verbose -local
Restoring volume dah.test.flapp Id 537057015 on server houting.pdc.kth.se partition /vicepa .. done
Restored volume dah.test.flapp on houting /vicepa
-2098337598??? As far as I can see there is nothing wrong with the
addresses of that server:
# vos listaddr -printuuid
....
UUID: 00787cce-7ea0-1214-8a-26-c2e8ed82aa77
houting.pdc.kth.se
houting-le.pdc.kth.se
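The only explanation I can come up with is that the number is a server
address pushed through a signed %d somewhere in vos: taken as an
unsigned 32-bit value, -2098337598 is 0x82ede8c2, which read as a
network-order IP address is 130.237.232.194, and that at least looks
like it could be one of our addresses. A quick sanity check from the
shell (assuming 64-bit shell arithmetic):

n=$(( -2098337598 & 0xffffffff ))       # reinterpret as unsigned 32-bit
printf '%d.%d.%d.%d\n' $(( n>>24 & 255 )) $(( n>>16 & 255 )) \
                       $(( n>>8 & 255 )) $(( n & 255 ))
# prints 130.237.232.194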
Then, in other news, I have a sudden volserver restart on another server:
cysteine# tail -1 BosLog
Mon Mar 14 18:15:08 2005: fs:vol exited on signal 6
and there is nothing in the VolserLog except the vos backup, which,
according to the script that ran it, completed without error. No core
file either.
cysteine# tail -1 VolserLog.old
Mon Mar 14 18:15:08 2005 1 Volser: Clone: Recloning volume 537039787 to volume 537039789
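Signal 6 is SIGABRT, so my guess is that an assert() fired in the
volserver, and then I would have expected a core file somewhere. Things
I still want to check (plain Unix, nothing OpenAFS-specific; the path
assumes a Transarc-style install):

ulimit -c                               # in the environment bosserver is started from
ls -l /usr/afs/logs/core* 2>/dev/null   # the usual place for server cores
find / -xdev -name 'core*' -mtime -2    # in case it landed somewhere else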
Suggestions and leads welcome,
Harald.