[OpenAFS-devel] Unable to release R/O volume -- 1 Volser: ReadVnodes: IH_CREATE: File
exists - restore aborted
Ian Wienand
iwienand@redhat.com
Thu, 24 May 2018 20:40:02 +1000
Hello,
We lost the backing storage for /vicepa on our R/O server sometime
yesterday (it's cloud block storage out of our control, so it
disappeared in an unknown manner). Once things came back, we had
volumes left in a range of mostly locked states from the updates and
"vos release"s triggered by our update cron jobs.
Quite a few I could manually unlock and re-release, and things went
OK. Others have proven more of a problem.
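(For reference, clearing the cooperative ones was essentially the
following; "mirror.example" is just a placeholder volume name:)
# list VLDB entries still flagged as locked
vos listvldb -locked
# unlock the entry and re-run the release
vos unlock mirror.example
vos release -v mirror.example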
To cut things short, after a lot of debugging we ended up with stuck
transactions between the R/W and R/O servers and volumes that could
not be unlocked. Eventually we rebooted both servers to clear
everything out. In an attempt to just clear the R/O mirrors and start
again, I ran the following for each problem volume:
vos unlock $MIRROR
vos remove -server afs02.dfw.openstack.org -partition a -id $MIRROR.readonly
vos release -v $MIRROR
vos addsite -server afs02.dfw.openstack.org -partition a -id $MIRROR
My theory was that this would completely remove the R/O mirror volume
and start fresh. I then proceeded to do a "vos release" on each
volume in sequence (more details in [1]).
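(As an aside, I suspect the stuck transactions could have been listed
and ended without the reboots with something along these lines; I did
not actually try it, so treat it as a guess:)
# show outstanding volserver transactions on the R/O server
vos status -server afs02.dfw.openstack.org
# end a specific hung transaction by its id
vos endtrans -server afs02.dfw.openstack.org -transaction <transaction-id>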
However, this release on the new R/O volume has not worked. Here is
the output from the release of one of the volumes:
---
Thu May 24 09:49:54 UTC 2018
Kerberos initialization for service/afsadmin@OPENSTACK.ORG
mirror.ubuntu-ports
RWrite: 536871041 ROnly: 536871042
number of sites -> 3
server afs01.dfw.openstack.org partition /vicepa RW Site
server afs01.dfw.openstack.org partition /vicepa RO Site
server afs02.dfw.openstack.org partition /vicepa RO Site -- Not released
This is a complete release of volume 536871041
There are new RO sites; we will try to only release to new sites
Querying old RO sites for update times... done
RW vol has not changed; only releasing to new RO sites
Starting transaction on cloned volume 536871042... done
Creating new volume 536871042 on replication site afs02.dfw.openstack.org: done
This will be a full dump: read-only volume needs be created for new site
Starting ForwardMulti from 536871042 to 536871042 on afs02.dfw.openstack.org (entire volume).
Release failed: VOLSER: Problems encountered in doing the dump !
The volume 536871041 could not be released to the following 1 sites:
afs02.dfw.openstack.org /vicepa
VOLSER: release could not be completed
Error in vos release command.
VOLSER: release could not be completed
Thu May 24 09:51:49 UTC 2018
---
This triggers a salvage on the (I presume) only partially cloned
volume, which logs:
---
05/24/2018 09:51:49 dispatching child to salvage volume 536871041...
05/24/2018 09:51:49 namei_ListAFSSubDirs: warning: VG 536871042 does not have a link table; salvager will recreate it.
05/24/2018 09:51:49 fileserver requested salvage of clone 536871042; scheduling salvage of volume group 536871041...
05/24/2018 09:51:49 VReadVolumeDiskHeader: Couldn't open header for volume 536871041 (errno 2).
05/24/2018 09:51:49 2 nVolumesInInodeFile 64
05/24/2018 09:51:49 CHECKING CLONED VOLUME 536871042.
05/24/2018 09:51:49 mirror.ubuntu-ports.readonly (536871042) updated 05/24/2018 06:08
05/24/2018 09:51:49 totalInodes 32896
---
On the R/O server side (afs02) we have:
---
Thu May 24 09:49:55 2018 VReadVolumeDiskHeader: Couldn't open header for volume 536871042 (errno 2).
Thu May 24 09:49:55 2018 attach2: forcing vol 536871042 to error state (state 0 flags 0x0 ec 103)
Thu May 24 09:49:55 2018 1 Volser: CreateVolume: volume 536871042 (mirror.ubuntu-ports.readonly) created
Thu May 24 09:51:49 2018 1 Volser: ReadVnodes: IH_CREATE: File exists - restore aborted
Thu May 24 09:51:49 2018 Scheduling salvage for volume 536871042 on part /vicepa over FSSYNC
---
I do not see anything on the R/W server side (afs01).
I have fsck'd the /vicepa partition on the RO server (afs02) and it is
OK.
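One further check I can think of is whether any on-disk remnants of
the old clone are still hanging around on afs02; roughly the
following, where the V*.vol header naming in the partition root is
just my understanding of the namei layout:
# ask the volserver what it thinks lives on the partition
vos listvol -server afs02.dfw.openstack.org -partition a -fast
# look for a leftover volume header for the R/O clone
ls -l /vicepa/ | grep 536871042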
I cannot find much info on "IH_CREATE: File exists", which I assume
is the root of the problem here. I would welcome any suggestions!
Clearly my theory that a "vos remove" and "vos addsite" of the mirror
would clear out enough state to recover things hasn't held up.
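My next thought is to forcibly zap the broken clone on afs02, salvage
it, and then try the release again; something like the following,
though I am not at all sure it is the right hammer:
# remove the broken R/O clone from disk even if it won't attach
vos zap -server afs02.dfw.openstack.org -partition a -id 536871042 -force
# salvage that volume group on afs02 (possibly the parent id 536871041
# is wanted here instead -- I am not sure which the salvager expects)
bos salvage -server afs02.dfw.openstack.org -partition a -volume 536871042
# then try the release once more
vos release -v mirror.ubuntu-ports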
All servers are Xenial-based, running Xenial's current
1.6.7-1ubuntu1.1 openafs packages.
Thanks,
-i
[1] http://lists.openstack.org/pipermail/openstack-infra/2018-May/005949.html