[OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug

Tue, 1 Mar 2011 22:23:34 -0600

On Tue, 1 Mar 2011 22:38:07 -0500 (EST)
Andy Cobaugh <phalenor@gmail.com> wrote:

> (and I think you meant dafssync-debug. I may not have mentioned that.)

fssync-debug should detect a DAFS fileserver and execute dafssync-debug
for you.

> Do you want the .vol file for this volume?

No, the problematic .vol is long gone. It looks like the volserver is
actually creating the BK, but is erroring out when giving it to the
fileserver. If you manually schedule a salvage (bos salvage) and then
restart the fileserver, you should be able to create it again.

The problem with the recovery is (probably) that the salvager doesn't
properly inform the fileserver when it destroys a volume, so the
erroneous volume state prevents you from doing anything with the volume
after it's destroyed. I need to test that behavior out tomorrow and see
what happens.

> My suspicion is that a previous 'vos backup' left it in this state.
> The volume group hasn't been touched other than for backups in many
> months.  I've never had a problem like this with that fileserver or
> volume until I upgraded from 1.4.11 to 1.6.0pre2.

Have you done successful 'vos backup's of that volume after the
1.6.0pre2 upgrade? Or did you upgrade and it broke?

> First time I see any FSYNC messages is this evening when I tried to
> fix things. I see this line repeated 39 times:

Hmm, well, I interpreted "turned debugging up" to mean "up all the way",
which probably isn't actually the case. The messages I'm looking for are
at level 125, and there are a lot of them (they log every FSSYNC request
and response).

> If I look in FileLog.old (I restarted at some point to up the debug 
> level), I see these lines:

You can change that with SIGHUP/SIGTSTP (unless you're doing that for a
permanent change).
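
To reach level 125 by signal you need more than one SIGTSTP. Here is a
rough sketch of the arithmetic. It assumes the usual stepping, as I
understand it: the level starts at 0, the first SIGTSTP sets it to 1, and
each further one multiplies it by 5 (SIGHUP resets it to 0). The function
name is made up for illustration.

```python
# Assumption: debug level starts at 0; the first SIGTSTP sets it to 1 and
# each further SIGTSTP multiplies it by 5 (1, 5, 25, 125).

def sigtstp_count(target_level: int) -> int:
    """Return how many SIGTSTP signals raise the level from 0 to target_level."""
    if target_level < 0:
        raise ValueError("debug level cannot be negative")
    level = 0
    count = 0
    while level < target_level:
        # Mirror the assumed stepping: 0 -> 1, then multiply by 5 each time.
        level = 1 if level == 0 else level * 5
        count += 1
    if level != target_level:
        # The target is not one of the levels reachable by signals.
        raise ValueError(f"level {target_level} is not reachable by SIGTSTP")
    return count


print(sigtstp_count(125))
```

So under that assumption, a fileserver at level 0 needs four SIGTSTPs to
start logging the FSSYNC requests and responses.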

> Tue Mar  1 16:11:34 2011 FSYNC_com:  read failed; dropping connection (cnt=94804)
> Tue Mar  1 16:11:34 2011 FSYNC_com:  read failed; dropping connection (cnt=94805)

There should be a SYNC_getCom right before these (though it probably
just says "error receiving command"). Just to be sure, there aren't any
processes dying/respawning in BosLog{,.old}, are there?
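
If the log is long, something like the following sketch can pull out each
dropped connection and say whether the line right before it mentions
SYNC_getCom. The regex is based only on the two lines quoted above; the
function name is made up, and I'm assuming the SYNC_getCom message lands
on the immediately preceding line.

```python
import re
from typing import Iterable, List, Tuple

# Pattern taken from the two FileLog lines quoted in this thread.
DROP_RE = re.compile(
    r"FSYNC_com:\s+read failed; dropping connection \(cnt=(\d+)\)"
)


def dropped_connections(lines: Iterable[str]) -> List[Tuple[int, bool]]:
    """Return (cnt, preceded_by_getcom) for every dropped-connection line.

    preceded_by_getcom is True when the previous line contains the string
    "SYNC_getCom" (assumed to be the immediately preceding line).
    """
    results = []
    previous = ""
    for line in lines:
        match = DROP_RE.search(line)
        if match:
            results.append((int(match.group(1)), "SYNC_getCom" in previous))
        previous = line
    return results


sample = [
    "Tue Mar  1 16:11:34 2011 FSYNC_com:  read failed; dropping connection (cnt=94804)",
    "Tue Mar  1 16:11:34 2011 FSYNC_com:  read failed; dropping connection (cnt=94805)",
]
for cnt, has_getcom in dropped_connections(sample):
    print(cnt, has_getcom)
```

For a real log you would pass it open("FileLog.old") instead of the
sample list.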

> I would also like to note that the vos backup occurring Sunday failed
> with a timeout, then succeeded Monday, then failed today.
> 
> Command output from vos backup on Sunday:
> 
> Failed to end the transaction on the rw volume 536871059 
> ____: server not responding promptly 
> Error in vos backup command.
> ____: server not responding promptly

That's RX_CALL_TIMEOUT, which I'm not used to seeing on volserver
RPCs... Do you know how long it took to error out with that? If it takes
a while, a core of the volserver/fileserver while it's hanging would be
ideal. It might just be the fileserver trying to salvage the volume a
bunch of times or something, though, and that takes too long.
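
If you don't have timings from the failed runs, a small wrapper like this
sketch would record how long the command takes before it errors out. It
is generic: the helper name is made up, and the command it runs here is
just a harmless stand-in, not anything from your setup.

```python
import subprocess
import sys
import time
from typing import List, Tuple


def time_command(argv: List[str]) -> Tuple[int, float]:
    """Run argv and return (exit status, elapsed wall-clock seconds)."""
    start = time.monotonic()
    result = subprocess.run(argv)
    elapsed = time.monotonic() - start
    return result.returncode, elapsed


# Stand-in command that exits immediately; substitute the real one.
status, seconds = time_command([sys.executable, "-c", "pass"])
print(f"exit status {status} after {seconds:.1f}s")
```

For the nightly job you would pass something like
["vos", "backup", "-id", "<volume>"] and log the result, which also tells
you roughly when to grab the core.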

-- 
Andrew Deason
adeason@sinenomine.net