[OpenAFS] Re: - 50% SOLVED - Re: - Locked volumes

Andrew Deason adeason@sinenomine.net
Fri, 9 Mar 2012 14:40:15 -0600


On Fri, 09 Mar 2012 15:46:03 +0100
ProbaNet <info@probanet.it> wrote:

> > If you want to quickly test this, you can run udebug against each of
> > the dbservers from each of the other dbservers. If beacons are
> > getting through but not database updates, though, maybe there could
> > be an issue where only packets over a certain size aren't getting
> > through or something, hmm...
> 
> Yes, I fear so. We did all the udebug tests (from all servers to all
> servers) and everything was OK.

Everything you've said so far makes it sound like _something_ is getting
dropped along the way, given how broad the problems/timeouts are. And I
take it the network layout you have for this is not exactly the simplest
thing; there are some firewalls and VPNs and maybe NATs getting
traversed, is that right? My guess would be that fragmented packets are
getting lost, or maybe just packets over a certain size.

So, let's look at a couple of things. First, some stats. Let's get the
output from 'rxdebug <server> 7003 -rxstat -noconn', for at least
afsmn1, and perhaps all of the other vlservers while we're at it.

Also, just to get data from the variety of ways this seems to manifest,
get the same data for the volserver 'vos release'-ish hangs. Run
'rxdebug <server> 7005 -rxstat -noconn' for afsrm1 and afsmn5 (the two
servers involved in the hanging 'vos release' example you mentioned
before).
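
To save some typing, the two rxdebug runs above could be wrapped in a
small loop. This is only a rough sketch: the function names and the
output file names are made up for illustration, and the rxdebug flags
are copied as-is from the commands above. By default it only prints the
commands it would run.

```shell
#!/bin/sh
# Build the rxdebug command line for one server and port.
# The flags are copied verbatim from the commands in this message.
rxdebug_cmd() {
    printf 'rxdebug %s %s -rxstat -noconn\n' "$1" "$2"
}

# For each server, print the command (DRY_RUN=1, the default) or run it
# and save the output. The file name pattern is a hypothetical choice.
collect_rxstats() {
    port=$1
    shift
    for server in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            rxdebug_cmd "$server" "$port"
        else
            rxdebug "$server" "$port" -rxstat -noconn \
                > "rxstat-$server-$port.txt" 2>&1
        fi
    done
}
```

For example, 'DRY_RUN=0 collect_rxstats 7003 afsmn1' would cover the
vlserver case, and 'DRY_RUN=0 collect_rxstats 7005 afsrm1 afsmn5' the
volserver one.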

In addition to all that, you can run a more 'real' test of connectivity
between the servers with rxperf. If you don't have rxperf, you can build
it from source by building the tree, and then going into src/rx (1.4.x)
or src/rx/test (1.6.x and beyond, I think) and running 'make rxperf'.

On one server, run 'rxperf server -p 12345'. On the other, run:

rxperf client -c send -b 1048576 -T 5 -p 12345 -s <otherserver>
rxperf client -c recv -b 1048576 -T 5 -p 12345 -s <otherserver>

and see if you can get it to hang. If it succeeds, it should print out
some stats and exit after transferring 5M (-T 5 rounds of the 1048576
bytes given by -b). If you want it to transfer more and run a bit
longer, just increase the -T parameter.
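
If you don't want to sit and watch for a hang, something like the
following might help. It is a sketch only, assuming GNU coreutils
timeout(1) is available (it exits with status 124 when it has to kill
the command). The function names and the 60 second default limit are
arbitrary choices of mine; the rxperf arguments are the ones from the
client commands above.

```shell
#!/bin/sh
# Translate an exit status from timeout(1) into a short verdict.
# GNU coreutils timeout exits with 124 when the time limit was hit.
describe_status() {
    case "$1" in
        0)   echo "completed" ;;
        124) echo "timed out (possible hang)" ;;
        *)   echo "failed with status $1" ;;
    esac
}

# Run one direction (send or recv) of the rxperf client test, giving up
# after a limit in seconds. The default of 60 is an arbitrary guess.
run_rxperf() {
    direction=$1
    server=$2
    limit=${3:-60}
    timeout "$limit" rxperf client -c "$direction" -b 1048576 -T 5 \
        -p 12345 -s "$server"
    status=$?
    echo "rxperf $direction to $server: $(describe_status "$status")"
    return "$status"
}
```

Usage would be along the lines of 'run_rxperf send <otherserver>'
followed by 'run_rxperf recv <otherserver>'. A "timed out" verdict only
means rxperf did not finish within the limit, so raise the limit if you
also raise -T.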

If you can get it to hang, capture a network dump for UDP port 12345 on
each server. If you cannot get it to hang, we can try looking at a
network capture for the 'real' communication of vlserver or volserver,
etc, when those hang, but it seems better to get it to happen with a
test scenario first, if possible.
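
For the capture itself, here is one possible way to do it with tcpdump.
This is a sketch under a few assumptions: the function names and pcap
file name are made up, and '-i any' is a Linux-ism, so pass a real
interface name elsewhere. One detail that matters if fragmentation is
the culprit: a plain 'udp port 12345' filter only matches the first
fragment of a fragmented datagram, since later fragments carry no UDP
header. The extra 'ip[6:2] & 0x1fff != 0' term below matches IPv4
fragments with a nonzero offset so they end up in the dump too.

```shell
#!/bin/sh
# Build a pcap filter for traffic to/from one peer: UDP on the test
# port, plus any later IPv4 fragments (nonzero fragment offset).
capture_filter() {
    peer=$1
    port=${2:-12345}
    printf 'host %s and (udp port %s or (ip[6:2] & 0x1fff != 0))\n' \
        "$peer" "$port"
}

# Start a full-packet capture to a file; stop it with Ctrl-C.
# The interface defaults to "any" (Linux only); the file name is a
# hypothetical choice.
start_capture() {
    peer=$1
    iface=${2:-any}
    tcpdump -i "$iface" -s 0 -w "rxperf-$peer.pcap" "$(capture_filter "$peer")"
}
```

Run 'start_capture <otherserver>' as root on each of the two servers
before starting the rxperf client, and stop both captures once the
transfer has hung or finished.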

> > You mention above there are at least _some_ messages in the log.
> > What are they?
> 
> All messages were of the kind 'remote server X voted "yes" on date Y',
> 'Received beacon=1 from server X', etc. Nothing strange (debug
> level 125).

Ah okay, at debug level 125, sure. I thought you meant you were seeing
log messages at log level 0, which I would be much more suspicious about
:)

-- 
Andrew Deason
adeason@sinenomine.net