[OpenAFS] vldb / prdb synchronization problem

ProbaNet Staff info@probanet.it
Tue, 20 Jul 2010 11:10:11 +0200


Hi all,
     we are the administrators of a small AFS network composed of 5
servers (Debian lenny): 3 in the same location and 2 more in 2 different
offices connected via OpenVPN. The master is one of the 3 servers in the main
office (srvM1).

A simple schema could be:
------ main office -----
srvM1 192.168.10.12 (vpn 10.250.10.1)
      /var/lib/openafs/local/NetInfo:
      192.168.10.12
      f 10.250.10.1
srvM2 192.168.10.13
srvM3 192.168.10.14

------ office X -------
srvX1 192.168.11.12 (vpn 10.250.11.1)
      /var/lib/openafs/local/NetInfo:
      192.168.11.12
      f 10.250.11.1

------ office Y --------
srvY1 192.168.12.12 (vpn 10.250.12.1)
      /var/lib/openafs/local/NetInfo:
      192.168.12.12
      f 10.250.12.1

Up to 1 week ago everything worked very well. Sometimes the connection
to the remote offices was interrupted, but as soon as it came back prdb and
vldb were synchronized correctly on all servers. An example of what appeared
in /var/log/openafs/VLLog (server srvX1):
---
Ubik: Synchronize database with server 192.168.10.12
Ubik: Synchronize database completed
---

Unfortunately, for the last 5 days the situation has been different and we
don't understand why: srvX1 is no longer able to synchronize its vldb. In
/var/log/openafs/VLLog we see (every minute):
---
Tue Jul 20 10:35:09 2010 Ubik: Synchronize database with server 192.168.10.12
Tue Jul 20 10:36:11 2010 Ubik: Synchronize database with server 192.168.10.12 failed (error = 1)
---
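
To see how often the failure repeats, the log can be scanned for the two
Ubik messages quoted above. This is only a minimal sketch: it assumes one
message per line, in exactly the form shown in these excerpts, and the
function name `failed_syncs` is made up for illustration.

```python
import re
from datetime import datetime

# Matches the "Synchronize database with server" message from the excerpts.
# The timestamp prefix is optional because the older excerpt has none.
SYNC_RE = re.compile(
    r"^(?:(?P<ts>\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}) )?"
    r"Ubik: Synchronize database with server (?P<server>\S+)"
    r"(?:\s+failed \(error = (?P<err>-?\d+)\))?\s*$"
)


def failed_syncs(lines):
    """Return (timestamp, server, error_code) for each failed sync attempt.

    The timestamp is None when the log line carries no timestamp prefix.
    """
    failures = []
    for line in lines:
        m = SYNC_RE.match(line.strip())
        if m is None or m.group("err") is None:
            continue  # not a sync message, or not a failure
        ts = m.group("ts")
        when = datetime.strptime(ts, "%a %b %d %H:%M:%S %Y") if ts else None
        failures.append((when, m.group("server"), int(m.group("err"))))
    return failures


if __name__ == "__main__":
    # Path as used on our Debian servers; adjust for other layouts.
    with open("/var/log/openafs/VLLog") as log:
        for when, server, err in failed_syncs(log):
            print(when, server, "error", err)
```

Counting the returned tuples per hour gives a rough picture of whether the
failures line up with the VPN interruptions.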

This morning the same started happening in PtLog too, but only on srvX1 (on
all other servers the synchronization is fine). We tried restarting ptserver
and vlserver on srvX1 (and then also on the master server srvM1) but the error
is still there. Looking in /var/lib/openafs/db/ we see a small file
'vldb.DB0.TMP' grow to about 35 KB, and then the error occurs (the good
vldb.DB0 is around 2.5 MB).
We then tried stopping vlserver and copying the vldb files to srvX1 via rsync,
in order to have a working and fast system (vos commands are very slow when
the problem shows up). That worked, but each time the VPN connection goes down
the problem returns (when the VPN is back, the synchronization fails).

Some more information:
# On all servers openafs-dbserver is version 1.4.11+dfsg-6~bpo50+1 (from the
lenny-backports repo)
# All servers have 2GB RAM
# udebug srvM1 vlserver shows:
  --- WITHOUT PROBLEM:
  Recovery state 1f
  Sync site's db version is 1279615662.172
  ...
  Server (192.168.11.12 10.250.11.1): (db 1279615662.172)
    last vote rcvd 10 secs ago (at Tue Jul 20 10:53:26 2010),
    last beacon sent 10 secs ago (at Tue Jul 20 10:53:26 2010), last vote was yes
    dbcurrent=1, up=1 beaconSince=1
  ...
  -----------
  --- WITH PROBLEM:
  Recovery state f
  Sync site's db version is 1279615662.172
  ...
  Server (192.168.11.12 10.250.11.1): (db 3768902.37)
    last vote rcvd 10 secs ago (at Tue Jul 20 10:53:26 2010),
    last beacon sent 10 secs ago (at Tue Jul 20 10:53:26 2010), last vote was yes
    dbcurrent=0, up=1 beaconSince=1
  ...
  -----------
# Even when the problem is present the connection seems good (no ping losses,
normal speed in file transfers, etc). Only the vos commands become randomly
slow ('vos examine volname' can take 1 second, or 50 seconds, or 20
seconds..).
# vldb_check and prdb_check show no errors
# we have around 30000 entries in vldb (2.5 MB DB) and 90000 in prdb (18 MB 
DB)
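
The comparison between the two udebug outputs above can also be done
mechanically. The sketch below only handles the lines shown in the excerpts
("Sync site's db version is ..." and "Server (...): (db ...)"); the names
`stale_servers` and `epoch_to_utc` are made up for illustration. It also
assumes that the first component of a Ubik db version is a Unix timestamp,
which is our understanding but should be checked against the Ubik
documentation.

```python
import re
from datetime import datetime, timezone

SYNC_VERSION_RE = re.compile(r"Sync site's db version is (\d+)\.(\d+)")
SERVER_RE = re.compile(r"Server \(([^)]*)\): \(db (\d+)\.(\d+)\)")


def stale_servers(udebug_text):
    """Map each server's address list to its (epoch, counter) db version
    when that version differs from the sync site's version."""
    m = SYNC_VERSION_RE.search(udebug_text)
    if m is None:
        raise ValueError("no sync-site db version found in udebug output")
    sync_version = (int(m.group(1)), int(m.group(2)))

    stale = {}
    for addrs, epoch, counter in SERVER_RE.findall(udebug_text):
        version = (int(epoch), int(counter))
        if version != sync_version:
            stale[addrs] = version
    return stale


def epoch_to_utc(epoch):
    """Decode the first version component, assumed to be Unix time."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


if __name__ == "__main__":
    excerpt = """\
Sync site's db version is 1279615662.172
Server (192.168.11.12 10.250.11.1): (db 3768902.37)
"""
    for addrs, (epoch, counter) in stale_servers(excerpt).items():
        print(addrs, "has db", f"{epoch}.{counter}", "from", epoch_to_utc(epoch))
```

Under that assumption, 1279615662 decodes to 2010-07-20 08:47:42 UTC (10:47
local time here), while the 3768902 reported for srvX1 in the failing case
decodes to a date in February 1970.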

Please, help us! :)

Stefano & Fabio