[OpenAFS] Weirdness after 'vos move's - core files?
Garance A Drosehn
drosih@rpi.edu
Mon, 20 Feb 2017 13:56:39 -0500
On 20 Feb 2017, at 0:25, Benjamin Kaduk wrote:
> On Sun, Feb 19, 2017 at 11:49:40AM -0500, Garance A Drosehn wrote:
>>
>> Is there something I could do with those core files which would help
>> to figure out what the problem is with this file server? I also
>> have plenty of log files, if those would provide some clues.
>
> Well, it's not entirely clear. One could of course load them up in
> gdb and see what the backtrace looks like, but given the described
> behavior, if I were in this situation, I would be looking
> at hardware diagnostics on this machine (memtest86, SMART output,
> bonnie++, etc.). I do not believe that openafs is expected to be
> particularly robust against failing hardware...
>
> -Ben
This file server is a virtual machine on hardware and disk-storage
that is being shared with several other virtual machines. I realize
the main culprit could certainly be hardware, but if it is, then I
won't be the only sysadmin who will be unhappy about that! :)
That also makes it harder to run true HW-level diagnostics.
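About the best I can do from inside the guest is to watch for I/O or
filesystem errors in the guest's own kernel log, along these lines
(the grep patterns are just guesses at what a sick disk might log;
ours could look different):

    # any I/O or filesystem complaints logged by the guest kernel?
    dmesg | grep -iE 'i/o error|ext4-fs error|xfs'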
It has been a while since I've tried to use gdb to investigate a core
file, so right now I'm trying to resurrect those ancient corners of
my brain.
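If I'm remembering the basics right, the first step is something
roughly like this (the binary and core paths are only placeholders
until I track down where ours actually live):

    # open the volserver binary together with one of the core files
    gdb /usr/afs/bin/volserver /path/to/core.NNNN

    # then, at the gdb prompt, a backtrace with local variables:
    (gdb) bt full

    # or backtraces for every thread in the process:
    (gdb) thread apply all bt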
FWIW, here's an anonymized version of the log-file entries for the
last vos-move that worked. These are from the Volser log file on the
(just-upgraded) destination file server, which is the one I'm
claiming is the broken one:
- 12:14:24 - VReadVolumeDiskHeader: Couldn't open header for volume
53...561 (errno 2).
- 12:14:24 - admin_gad on gads_mac.rpi.edu is executing CreateVolume
'c._worked_.i41'
- 12:14:24 - 1 Volser: CreateVolume: volume 53...561 (c._worked_.i41)
created
- 12:14:24 - <LocalAuth> on fs_src.rpi.edu is executing Restore
53...561
- 12:14:26 - RestoreVolume Cleanup: Removed 0 inodes for volume
53...561
- 12:14:26 - RestoreVolume Cleanup: Removed 0 inodes for volume
53...561
- 12:14:26 - <LocalAuth> on fs_src.rpi.edu is executing Restore
53...561
- 12:14:26 - RestoreVolume Cleanup: Removed 0 inodes for volume
53...561
- 12:14:26 - RestoreVolume Cleanup: Removed 0 inodes for volume
53...561
- 12:14:27 - VReadVolumeDiskHeader: Couldn't open header for volume
53...563 (errno 2).
- 12:14:27 - admin_gad on gads_mac.rpi.edu is executing Clone Volume
new name=c._worked_.i41.backup
- 12:14:27 - 1 Volser: Clone: Cloning volume 53...561 to new volume
53...563
And here's the info for the first vos-move that failed:
- 12:14:28 - VReadVolumeDiskHeader: Couldn't open header for volume
53...175 (errno 2).
- 12:14:28 - admin_gad on gads_mac.rpi.edu is executing CreateVolume
'c._failed_.2006a'
- 12:14:28 - 1 Volser: CreateVolume: volume 53...175
(c._failed_.2006a) created
- 12:14:28 - <LocalAuth> on fs_src.rpi.edu is executing Restore
53...175
- 12:14:28 - 1 Volser: ReadVnodes: IH_CREATE: Structure needs cleaning
- restore aborted
- 12:14:28 - SYNC_ask: negative response on circuit 'FSSYNC'
- 12:14:28 - FSYNC_askfs: FSSYNC request denied for reason=101
- 12:14:28 - SYNC_ask: negative response on circuit 'FSSYNC'
- 12:14:28 - FSYNC_askfs: FSSYNC request denied for reason=101
- 12:17:26 - 1 Volser: GetVolInfo: Could not attach volume 53...175
(/vicepa:V053...175.vol) error=101
Does that suggest hardware issues? (I have no idea if it does...)
Looking at the log files, I can also see some 'vos release's for
replicated volumes that have an RO instance on this broken server. If
I do a 'vos examine' of such a volume, the RO instance on this server
is the "old release", while the instances on the other file servers
are the "new release". I'm going to guess that this is NotGood(tm).
--
Garance Alistair Drosehn = drosih@rpi.edu
Senior Systems Programmer or gad@FreeBSD.org
Rensselaer Polytechnic Institute; Troy, NY; USA