[OpenAFS] Weirdness after 'vos move's - core files?

Garance A Drosehn drosih@rpi.edu
Mon, 20 Feb 2017 13:56:39 -0500


On 20 Feb 2017, at 0:25, Benjamin Kaduk wrote:

> On Sun, Feb 19, 2017 at 11:49:40AM -0500, Garance A Drosehn wrote:
>>
>> Is there something I could do with those core files which would help
>> to  figure out what the problem is with this file server?  I also
>> have plenty of log files, if those would provide some clues.
>
> Well, it's not entirely clear.  One could of course load them up in
> gdb and see what the backtrace looks like, but given the described
> behavior, if I were in this situation, I would be looking at
> hardware diagnostics on this machine (memtest86, SMART output,
> bonnie++, etc.).  I do not believe that openafs is expected to be
> particularly robust against failing hardware...
>
> -Ben

This file server is a virtual machine on hardware and disk storage
that is shared with several other virtual machines.  I realize the
main culprit could certainly be hardware, but if it is, then I won't
be the only sysadmin who will be unhappy about that!  :)  That also
makes it harder to run true HW-level diagnostics.

It has been a while since I've used gdb to investigate a core file,
so right now I'm trying to resurrect those ancient corners of my
brain.
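
If anyone has a better recipe I'm happy to hear it, but my rough plan
is something like the following.  (The binary and core-file paths are
just my guesses for the usual Transarc-style layout on this server,
and the core-file name is made up.)

    gdb /usr/afs/bin/volserver /usr/afs/logs/core.volserver.1234
    (gdb) bt full
    (gdb) thread apply all bt
    (gdb) info registers
    (gdb) quit

'bt full' should give the backtrace of the thread that dumped core,
and 'thread apply all bt' dumps backtraces for every thread, which I
expect is more useful for a threaded volserver.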

FWIW, here's an anonymized version of the log entries for the last
vos-move which worked.  These are from the VolserLog on the
(just-upgraded) destination file server, which is the one I'm
claiming is broken:

  - 12:14:24 - VReadVolumeDiskHeader: Couldn't open header for volume 53...561 (errno 2).
  - 12:14:24 - admin_gad on gads_mac.rpi.edu is executing CreateVolume 'c._worked_.i41'
  - 12:14:24 - 1 Volser: CreateVolume: volume 53...561 (c._worked_.i41) created
  - 12:14:24 - <LocalAuth> on fs_src.rpi.edu is executing Restore 53...561
  - 12:14:26 - RestoreVolume Cleanup: Removed 0 inodes for volume 53...561
  - 12:14:26 - RestoreVolume Cleanup: Removed 0 inodes for volume 53...561
  - 12:14:26 - <LocalAuth> on fs_src.rpi.edu is executing Restore 53...561
  - 12:14:26 - RestoreVolume Cleanup: Removed 0 inodes for volume 53...561
  - 12:14:26 - RestoreVolume Cleanup: Removed 0 inodes for volume 53...561
  - 12:14:27 - VReadVolumeDiskHeader: Couldn't open header for volume 53...563 (errno 2).
  - 12:14:27 - admin_gad on gads_mac.rpi.edu is executing Clone Volume new name=c._worked_.i41.backup
  - 12:14:27 - 1 Volser: Clone: Cloning volume 53...561 to new volume 53...563

And here's the info for the first vos-move which failed:

  - 12:14:28 - VReadVolumeDiskHeader: Couldn't open header for volume 53...175 (errno 2).
  - 12:14:28 - admin_gad on gads_mac.rpi.edu is executing CreateVolume 'c._failed_.2006a'
  - 12:14:28 - 1 Volser: CreateVolume: volume 53...175 (c._failed_.2006a) created
  - 12:14:28 - <LocalAuth> on fs_src.rpi.edu is executing Restore 53...175
  - 12:14:28 - 1 Volser: ReadVnodes: IH_CREATE: Structure needs cleaning - restore aborted
  - 12:14:28 - SYNC_ask: negative response on circuit 'FSSYNC'
  - 12:14:28 - FSYNC_askfs: FSSYNC request denied for reason=101
  - 12:14:28 - SYNC_ask: negative response on circuit 'FSSYNC'
  - 12:14:28 - FSYNC_askfs: FSSYNC request denied for reason=101
  - 12:17:26 - 1 Volser: GetVolInfo: Could not attach volume 53...175 (/vicepa:V053...175.vol) error=101

Does that suggest hardware issues?  (I have no idea if it does...)
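
For what it's worth, "Structure needs cleaning" looks like the
strerror() text for EUCLEAN on Linux, which I believe the kernel
returns when the underlying filesystem (ext4 here, I assume) has
detected corruption on /vicepa.  So my next step is probably to poke
at the filesystem itself, roughly like this (the device name below is
just a placeholder for whatever actually backs /vicepa on this VM):

    # any corruption complaints already logged by the kernel?
    dmesg | grep -i -e 'EXT4-fs error' -e 'I/O error'

    # with the fileserver processes shut down and /vicepa unmounted:
    umount /vicepa
    fsck -fn /dev/sdX1    # read-only check first (placeholder device)
    fsck -f  /dev/sdX1    # then let it repair, if the dry run finds problems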

Looking at the log files, I can also see some 'vos release'
operations where an RO instance of the replicated volume is on this
broken server.  If I do a 'vos examine' of the volume, the RO
instance on this server shows as the "old release" while the
instances on the other file servers are the "new release".  I'm
going to guess that this is NotGood(tm).

-- 
Garance Alistair Drosehn                =     drosih@rpi.edu
Senior Systems Programmer               or   gad@FreeBSD.org
Rensselaer Polytechnic Institute;             Troy, NY;  USA