[OpenAFS] Re: vldb / prdb synchronization problem

Thu, 22 Jul 2010 14:27:24 -0500

On Tue, 20 Jul 2010 11:10:11 +0200
ProbaNet Staff <info@probanet.it> wrote:

> we are the administrators of a small afs network composed by 5 servers
> (debian lenny), 3 in the same location and 2 more in 2 different
> offices connected via openvpn, the master is one of the 3 server in
> the main office (srvM1).

Which of these are fileservers, database servers, etc...?

> Unfortunately since 5 days the situation has changed and we don't
> understand why: srvX1 is no longer able to get vldb synchronized. In
> /var/log/openafs/VLLog we see (each minute):
> ---
> Tue Jul 20 10:35:09 2010 Ubik: Synchronize database with server 192.168.10.12
> Tue Jul 20 10:36:11 2010 Ubik: Synchronize database with server 192.168.10.12 failed (error = 1)
> ---

My guess is that this '1' is BULK_ERROR, which would mean srvX1 couldn't
receive the database data over Rx. Can you run rxdebug between srvX1 and
srvM1 to check that Rx traffic gets through in both directions?

On srvX1 run

rxdebug 192.168.10.12 7003 -version
rxdebug 10.250.10.1 7003 -version

and on srvM1 run

rxdebug 192.168.11.12 7003 -version
rxdebug 10.250.11.1 7003 -version

Each of those should output a version string and a build date; if
instead one of them hangs or outputs an error, something's probably
wrong with the network.
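
If you'd rather script the check, here is a rough sketch for srvX1. The
names check_rx and run_limited and the 10-second limit are only
illustrative. It uses a 'timeout' command when one is installed, so a
hanging rxdebug gets cut off; older systems may not have one, in which
case rxdebug is run directly.

```shell
#!/bin/sh
# Sketch only: probe the vlserver port (7003) on each address.
RXDEBUG=${RXDEBUG:-rxdebug}

run_limited() {
    # Use 'timeout' if available so a hang is cut off after 10 seconds;
    # otherwise just run the command as-is.
    if command -v timeout >/dev/null 2>&1; then
        timeout 10 "$@"
    else
        "$@"
    fi
}

check_rx() {
    # $1 = address, $2 = port
    if run_limited "$RXDEBUG" "$1" "$2" -version; then
        echo "OK   $1:$2"
    else
        echo "FAIL $1:$2"
    fi
}

for addr in 192.168.10.12 10.250.10.1; do
    check_rx "$addr" 7003
done
```

On srvM1, swap in 192.168.11.12 and 10.250.11.1.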

To get more information from the logs, you can turn up the debugging
level for the vlserver on srvX1 and srvM1 if you want (this will output
a lot more stuff to VLLog). The easiest way to do this is to run
'kill -TSTP <vlserver pid>' twice. To turn off debugging again, run
'kill -HUP <vlserver pid>'.
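
As a small sketch of the above (the function names and the KILL
variable are only illustrative, not anything shipped with OpenAFS):

```shell
#!/bin/sh
# Sketch only. Set KILL=echo for a dry run that just prints the signals.
KILL=${KILL:-kill}

vl_debug_on() {
    # $1 = vlserver pid. Send SIGTSTP twice, pausing in between so the
    # two signals are not merged into a single pending signal.
    "$KILL" -TSTP "$1" && sleep 1 && "$KILL" -TSTP "$1"
}

vl_debug_off() {
    # $1 = vlserver pid. SIGHUP turns the debugging back off.
    "$KILL" -HUP "$1"
}
```

Assuming 'pidof' is available and only one vlserver is running, you
could call it as: vl_debug_on "$(pidof vlserver)"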

If you do that, and the problem appears, I would expect you to see a
message like this on srvX1:

Rx-read length error=<number>

and a message on srvM1 mentioning some error that srvM1 had sending the
data.
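
To pick that message out of the extra debug output on srvX1, something
like this should do. It is a sketch that assumes the log lives at the
path you quoted; rx_read_errors is just an illustrative name.

```shell
#!/bin/sh
# Sketch only: list the Rx read errors in VLLog, with line numbers.
VLLOG=${VLLOG:-/var/log/openafs/VLLog}

rx_read_errors() {
    # $1 = optional log file, defaulting to $VLLOG
    grep -n 'Rx-read length error=' "${1:-$VLLOG}"
}
```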

-- 
Andrew Deason
adeason@sinenomine.net