[OpenAFS-devel] Unable to release R/O volume -- 1 Volser: ReadVnodes: IH_CREATE: File exists - restore aborted

Ian Wienand iwienand@redhat.com
Thu, 24 May 2018 20:40:02 +1000


Hello,

We lost the backing storage on our R/O server's /vicepa sometime
yesterday (it's cloud block storage out of our control, so it
disappeared in an unknown manner).  Once things came back, we had
volumes in a range of mostly locked states, left over from updates
and "vos release" operations triggered by update cron jobs.

Quite a few of these I could manually unlock and re-release, and
things went OK.  Others have proven more of a problem.

To cut things short, after a lot of debugging we ended up with stuck
transactions between the R/W and R/O servers and volumes that could
not be unlocked (transaction inspection is sketched just after the
list below).  Eventually we rebooted both servers to clear everything
out.  In an attempt to simply remove the R/O mirrors and start again,
I did the following for each problem volume:

 vos unlock $MIRROR
 vos remove -server afs02.dfw.openstack.org -partition a -id $MIRROR.readonly
 vos release -v $MIRROR
 vos addsite -server afs02.dfw.openstack.org -partition a -id $MIRROR
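
For reference, the stuck transactions mentioned above were visible
via "vos status"; my reading of the man pages is that they can also
be ended by hand, though we ended up rebooting instead:

 # list outstanding volserver transactions on a server
 vos status -server afs02.dfw.openstack.org
 # end a single stuck transaction, using the id from the status output
 vos endtrans -server afs02.dfw.openstack.org -transaction $TRANSID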

My theory was that this would completely remove the R/O mirror volume
and let us start fresh.  I then proceeded to do a "vos release" on
each volume in sequence (more details in [1]).
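
Roughly speaking, the re-release was just a shell loop along these
lines, with $MIRRORS standing in for the list of problem volumes:

 for v in $MIRRORS; do
     vos release -v $v
 done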

However, the release to the newly added R/O site has not worked.
Here is the output from the release of one of the volumes:

---
Thu May 24 09:49:54 UTC 2018
Kerberos initialization for service/afsadmin@OPENSTACK.ORG

mirror.ubuntu-ports
    RWrite: 536871041     ROnly: 536871042
    number of sites -> 3
       server afs01.dfw.openstack.org partition /vicepa RW Site
       server afs01.dfw.openstack.org partition /vicepa RO Site
       server afs02.dfw.openstack.org partition /vicepa RO Site  -- Not released
This is a complete release of volume 536871041
There are new RO sites; we will try to only release to new sites
Querying old RO sites for update times... done
RW vol has not changed; only releasing to new RO sites
Starting transaction on cloned volume 536871042... done
Creating new volume 536871042 on replication site afs02.dfw.openstack.org:  done
This will be a full dump: read-only volume needs be created for new site
Starting ForwardMulti from 536871042 to 536871042 on afs02.dfw.openstack.org (entire volume).
Release failed: VOLSER: Problems encountered in doing the dump !
The volume 536871041 could not be released to the following 1 sites:
                    afs02.dfw.openstack.org /vicepa
VOLSER: release could not be completed
Error in vos release command.
VOLSER: release could not be completed
Thu May 24 09:51:49 UTC 2018
---

This triggers a salvage of the volume, which I presume was only
partially cloned; the salvager logs:

---
05/24/2018 09:51:49 dispatching child to salvage volume 536871041...
05/24/2018 09:51:49 namei_ListAFSSubDirs: warning: VG 536871042 does not have a link table; salvager will recreate it.
05/24/2018 09:51:49 fileserver requested salvage of clone 536871042; scheduling salvage of volume group 536871041...
05/24/2018 09:51:49 VReadVolumeDiskHeader: Couldn't open header for volume 536871041 (errno 2).
05/24/2018 09:51:49 2 nVolumesInInodeFile 64 
05/24/2018 09:51:49 CHECKING CLONED VOLUME 536871042.
05/24/2018 09:51:49 mirror.ubuntu-ports.readonly (536871042) updated 05/24/2018 06:08
05/24/2018 09:51:49 totalInodes 32896
---
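
If it comes to salvaging this volume group by hand, my understanding
from the man pages is that it would be something like:

 bos salvage -server afs02.dfw.openstack.org -partition a -volume 536871042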

On the R/O server side (afs02) we have

---
Thu May 24 09:49:55 2018 VReadVolumeDiskHeader: Couldn't open header for volume 536871042 (errno 2).
Thu May 24 09:49:55 2018 attach2: forcing vol 536871042 to error state (state 0 flags 0x0 ec 103)
Thu May 24 09:49:55 2018 1 Volser: CreateVolume: volume 536871042 (mirror.ubuntu-ports.readonly) created
Thu May 24 09:51:49 2018 1 Volser: ReadVnodes: IH_CREATE: File exists - restore aborted
Thu May 24 09:51:49 2018 Scheduling salvage for volume 536871042 on part /vicepa over FSSYNC
---
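
My reading of the namei layout is that the R/O volume's header should
live at /vicepa/V0536871042.vol (zero-padded volume id) and its data
under /vicepa/AFSIDat/, so presumably leftovers from the removed
volume could be checked for with something like:

 # the volume header file for the R/O clone
 ls -l /vicepa/V0536871042.vol
 # the namei data tree; per-volume directories are hash-encoded
 ls /vicepa/AFSIDat/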

I do not see anything on the R/W server side (afs01).

I have fsck'd the /vicepa partition on the R/O server (afs02) and it
came back clean.

I cannot find much information on "IH_CREATE: File exists", which I
assume is the problem here.  I would welcome any suggestions!
Clearly my theory that a "vos remove" and "vos addsite" of the mirror
would clear out enough state to recover things was wrong?
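
For what it's worth, grepping the source suggests the message comes
from ReadVnodes() in src/volser/dumpstuff.c, where IH_CREATE (the
inode-handle creation path; namei_icreate() on a namei fileserver)
failing with EEXIST aborts the restore:

 # in a checkout of the openafs source tree
 grep -rn "restore aborted" src/volser/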

All servers are Xenial-based, running Xenial's current
1.6.7-1ubuntu1.1 openafs packages.

Thanks,

-i

[1] http://lists.openstack.org/pipermail/openstack-infra/2018-May/005949.html