[OpenAFS] Weirdness after 'vos move's - core files?

Garance A Drosehn drosih@rpi.edu
Mon, 20 Feb 2017 16:07:49 -0500


On 20 Feb 2017, at 0:25, Benjamin Kaduk wrote:

> On Sun, Feb 19, 2017, Garance A Drosehn wrote:
>>
>> Is there something I could do with those core files which would help
>> to figure out what the problem is with this file server?  I also have
>> plenty of log files, if those would provide some clues.
>
> Well, it's not entirely clear.  One could of course load them up in
> gdb and see what the backtrace looks like, but given the
> described behavior, if I were in this situation, I would be looking
> at hardware diagnostics on this machine (memtest86, SMART output,
> bonnie++, etc.).  I do not believe that openafs is expected to be
> particularly robust against failing hardware...

I skipped over some pretty significant info in my few messages to
this mailing list.  Back on Jan 6th, this file server (and several
other VMs, including two other file servers) was running on a
different hypervisor.  That hypervisor crashed Jan 6th due to the
motherboard failing (!).  Parts were replaced, and we were back up.

We intended to move all these VMs to a newer VMware cluster, but
that took a little while to set up.  Then on Feb 8th, a "minor
config change" to the hypervisor which had died on Jan 6th caused
that hypervisor to crash again.  Ugh!  But we brought everything
back up again.

We then moved each of the file servers, one at a time, and on
separate days.  We'd completely shut down a file server, move all
the data, and then bring it back up in the cluster.  The first two
of these moves went perfectly fine.  The move of the third file server
also went fine, done on Feb 14th.  I didn't notice any problems until
I started moving AFS volumes back to that file server, on the 16th.

So the *current* hardware and disk storage should be fine, but
there *were* two very abrupt crashes on the previous hardware, and
those could have corrupted data structure(s) on the file server.

The salvager process ran automatically when the file server came
back up on both Jan 6th and Feb 8th.  No interesting messages in
the SalvageLog on Jan 6th, but I now notice that the log for
Feb 8th includes the (wrapped) line:

  - namei_ListAFSSubDirs: warning: VG 536877708 does not \
            have a link table; salvager will recreate it.

That seems mighty suspicious given that the full-salvage done
on Jan 6th did not report that problem.

And there's also a SalvageLog from the 16th, which is when I did
the "bos salvage -all -salvagedirs".  That file has over 225,000
lines in it, and at the time I had only looked at the first few
hundred and the last few hundred.  Nothing exciting there.  But I
now see that in the middle of the log there are many lines such as:

  - Salvaging directory 263...
  - Checking the results of the directory salvage...
  - dir vnode 263: invalid entry deleted: ./toolbox/..etc.. \
            (vnode 23654, unique 12858)

I am guessing that is also "not good".  Sigh.
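
For a log this size, a small script is easier than paging through it
by hand.  What follows is only a rough sketch, not anything shipped
with OpenAFS: the function name summarize() is made up, and the two
patterns are copied from the log lines quoted above, so they are an
assumption about the message wording and may not match other releases.
The path to the SalvageLog is passed on the command line.

```python
import re
import sys
from collections import Counter

# Patterns taken from the SalvageLog lines quoted in this message.
# Assumption: each message sits on one line in the real log file.
DELETED = re.compile(r"dir vnode (\d+): invalid entry deleted")
NO_LINK_TABLE = re.compile(r"VG (\d+) does not\s+have a link table")


def summarize(lines):
    """Count 'invalid entry deleted' messages per directory vnode and
    collect the volume groups reported as missing a link table.

    Vnode numbers are only unique within a volume, so the per-vnode
    counts are a rough guide to where the damage is, nothing more.
    """
    deleted = Counter()
    missing_link_table = set()
    for line in lines:
        m = DELETED.search(line)
        if m:
            deleted[int(m.group(1))] += 1
        m = NO_LINK_TABLE.search(line)
        if m:
            missing_link_table.add(int(m.group(1)))
    return deleted, missing_link_table


if __name__ == "__main__" and len(sys.argv) > 1:
    # errors="replace" keeps odd bytes in file names from stopping the scan.
    with open(sys.argv[1], errors="replace") as fh:
        deleted, missing = summarize(fh)
    print("invalid entries deleted:", sum(deleted.values()))
    for vnode, count in deleted.most_common(10):
        print(f"  dir vnode {vnode}: {count}")
    print("VGs without a link table:", sorted(missing))
```

Running the same script over the SalvageLog from each server would
make it easy to compare them.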

While I have upgraded five of our servers to OpenAFS 1.6.20.1, I
have only done the "bos salvage -all -salvagedirs" step on three
of those five.  But of those three, this is the only server which
has any complaints about "invalid entry deleted".

In any case, it now seems almost certain that the crash on
Feb 8th is the primary cause of all the problems we're seeing.

-- 
Garance Alistair Drosehn                =     drosih@rpi.edu
Senior Systems Programmer               or   gad@FreeBSD.org
Rensselaer Polytechnic Institute;             Troy, NY;  USA