[OpenAFS] Re: nightly failure since upgrading to 1.6.5

Andrew Deason adeason@sinenomine.net
Mon, 10 Feb 2014 14:23:15 -0600


On Mon, 10 Feb 2014 00:27:59 -0600
Tracy Di Marco White <gendalia@gmail.com> wrote:

> Every night at midnight, we run 'vos backupsys'. For three nights in a
> row, on one of the servers I've upgraded to 1.6.5 and dafs, I've been
> getting the following errors, and it mostly stops being a fileserver.
> Is this fixed in 1.6.6? Anyone else seeing it? This is on NetBSD
> 6.1.3.

I would guess you are the only one using NetBSD for a "real" fileserver,
at least for DAFS. The errors you've posted indicate there are some
problems with the mechanism that the fileserver and other processes
use to communicate with each other, so it may be advisable not to trust
DAFS on NetBSD with "real" data until it's known what's going on, as
errors like this could possibly lead to corrupted volumes.

Do you know if this seems to happen immediately, or if 'vos backupsys'
seems to correctly create some backup clones, and then eventually
triggers this error? I (or someone else) will probably need to reproduce
this to get a better idea of what's going on, but you can maybe save us
some time with some more info:

> VolserLog
> Sat Feb  8 00:02:42 2014 SYNC_ask:  length field in response inconsistent
> on circuit 'FSSYNC'
> Sat Feb  8 00:02:42 2014 SYNC_ask: protocol communications failure on
> circuit 'FSSYNC'; attempting reconnect to server

This message says what one of the problems is, but doesn't provide much
information. If it's convenient for you to apply a patch and rebuild,
the following patch would give us a little more information in this
situation (from gerrit 10829):

<http://git.openafs.org/?p=openafs.git;a=patch;h=9604a45e94ed23a2941d0a7e11bfd892a0bd0bf7>
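
Separately, to help answer the timing question above (does the error show
up immediately, or only after some of the run has gone through?), here is
a minimal Python sketch that finds the first SYNC_ask line in a VolserLog.
It assumes each log entry is on a single unwrapped line and starts with a
timestamp in the same layout as the quoted lines; the function name is
made up for illustration and is not part of OpenAFS.

```python
import re

# Timestamp prefix as it appears in the quoted VolserLog lines,
# e.g. "Sat Feb  8 00:02:42 2014". This layout is an assumption taken
# from the quoted excerpt only.
LINE_RE = re.compile(r"^(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}) (.*)$")


def first_sync_error(lines):
    """Return (index, timestamp, message) for the first SYNC_ask line.

    Returns None if no line with a recognizable timestamp carries a
    message starting with "SYNC_ask:".
    """
    for index, line in enumerate(lines):
        match = LINE_RE.match(line.rstrip("\n"))
        if match and match.group(2).startswith("SYNC_ask:"):
            return index, match.group(1), match.group(2)
    return None
```

Passing the lines of the log to first_sync_error() gives the position of
the first failure; everything before that index is what the volserver
logged before things went wrong, which should show whether the run got
anywhere before the error.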

On Mon, 10 Feb 2014 12:15:08 -0600
Tracy Di Marco White <gendalia@gmail.com> wrote:

> root      4129  0.0  0.2  46288   5124 ?     Sl    7:46AM  0:00.02
> /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo
> root      7155  0.0  1.2  85200  42424 ?     Il    8:06AM  1:27.36
> /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo

Do you have any idea why you have multiple davolserver processes running
at once? Does BosLog say anything about processes dying? Could you
provide a 'ps' listing of all AFS server processes on that machine?
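
If it's easier than eyeballing the listing, a rough Python sketch like
the following groups 'ps' output by program name and reports any program
that shows up more than once. It assumes the 11-column layout seen in the
quoted listing (user, pid, ..., command) with each process on one
unwrapped line; the function names are hypothetical.

```python
import os
from collections import defaultdict


def pids_by_program(ps_lines):
    """Map program basename -> list of PIDs, from 'ps'-style lines.

    Assumes 11 whitespace-separated columns with the PID in column 2
    and the full command line in column 11, as in the quoted listing.
    """
    groups = defaultdict(list)
    for line in ps_lines:
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header line and anything malformed
        program = fields[10].split()[0]
        groups[os.path.basename(program)].append(int(fields[1]))
    return dict(groups)


def duplicates(groups):
    """Keep only the programs that have more than one process."""
    return {name: pids for name, pids in groups.items() if len(pids) > 1}
```

For the two lines quoted above, this would report davolserver with PIDs
4129 and 7155. Whether more than one process is expected for any other
program is something to judge by hand.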

-- 
Andrew Deason
adeason@sinenomine.net