[OpenAFS] Weirdness after 'vos move's - Epilogue

Garance A Drosehn drosih@rpi.edu
Sun, 05 Mar 2017 20:48:20 -0500


On 20 Feb 2017, at 16:07, Garance A Drosehn wrote:

> On 20 Feb 2017, at 0:25, Benjamin Kaduk wrote:
>
>> [...] if I was in this situation, I would be looking at
>> hardware diagnostics on this machine (memtest86, SMART
>> output,  bonnie++, etc.).  I do not believe that openafs
>> is expected to be particularly robust against failing
>> hardware...

> [...skipping lots...]
>
> In any case, it now seems almost certain that the crash on
> Feb 8th is the primary cause for all the problems we're seeing.

In case anyone is curious, I was successful at moving volumes off
the broken file server.  As I mentioned elsewhere, I was lucky in
that most of the busier volumes had been moved off this server
before the crash happened.  Many of the remaining ones were
solo-RO instances, where the RW volume is on a different file
server.  So for those I just destroyed the RO-instance and then
re-created a new RO-instance on a different file server.
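
For anyone doing the same thing, the sequence was roughly the
following (the server, partition, and volume names here are made-up
examples, not our real ones):

   # drop the orphaned RO instance on the broken server; this also
   # removes that site from the volume's VLDB entry
   vos remove -server broken-fs -partition /vicepa -id somevol.readonly

   # add a replication site on a healthy server, then push the RW
   # data out to the new RO
   vos addsite -server good-fs -partition /vicepb -id somevol
   vos release somevol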

With the others, I ran into a problem where a plain 'vos move'
did not work.  However, a 'vos move -live' did work.  And since
none of these volumes were being actively modified, I assumed
that a 'vos move -live' was not much of a risk.  But I also did
the moves late in the evening, just to reduce the risk a bit
more.
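
In case the details help anyone, the difference was just the one
flag; something like this (again, made-up names):

   # a plain move kept failing partway through on this server:
   vos move -id somevol -fromserver broken-fs -frompartition /vicepa \
            -toserver good-fs -topartition /vicepb

   # the same move with -live worked.  -live skips the temporary
   # clone and copies straight from the live volume, which is why
   # the volume needs to be quiet while the move runs:
   vos move -id somevol -fromserver broken-fs -frompartition /vicepa \
            -toserver good-fs -topartition /vicepb -live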

As of right now, a 'listvol' of the broken file server shows only
four volumes.  All four volumes are ones where a 'vos move' *to*
the broken file server had failed.  It's those failures which
were the first obvious signs that something was broken.

While 'listvol' of the broken file server shows those volumes,
a 'vos examine' of each one shows that the VLDB places them on
other file servers.  This makes sense, since those 'vos move's
failed before they finished.
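
The mismatch is easy to see by comparing the two views (made-up
names again):

   # the broken server's own view: the orphaned volume is still
   # sitting on its disk
   vos listvol -server broken-fs

   # the VLDB's view: the volume officially lives on some other
   # file server
   vos examine somevol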

I'm pretty confident that all I need to do now is enter the
commands to remove the file server completely from our cell.
Then I'll destroy these virtual disks, create some new virtual
disks, and use those to build a new file server from scratch.
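
My understanding is that the remaining cleanup looks something
like the following, though I'll double-check each step against
the docs before running it (addresses and IDs are placeholders):

   # discard an orphaned on-disk copy without touching the VLDB,
   # since the VLDB already points at the correct servers
   vos zap -server broken-fs -partition /vicepa -id <volume-id>

   # once no volumes reference the server, drop its address from
   # the VLDB
   vos changeaddr <broken-fs-address> -remove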

So all things considered, this could have gone much worse
than it did.

-- 
Garance Alistair Drosehn                =     drosih@rpi.edu
Senior Systems Programmer               or   gad@FreeBSD.org
Rensselaer Polytechnic Institute;             Troy, NY;  USA