[OpenAFS] please help no sync site problem

Otto-Michael BRAUN omb@computer.org
Sun, 11 Jan 2004 19:30:04 +0100


Hi,

I am running an OpenAFS 1.2.10 test site with two db servers under RedHat 
8.0; the site has been up since November 2002.

Since this morning no administration is possible. The error message (Win 
client) is "error, no quorum elected (0x00001500)" when I try to create a 
volume or a replica, or to do a release.

I read the archives and checked what I found there: the time sync seems to 
be ok, all volumes are mounted, and the servers are in the hosts file and 
DNS (no changes made). The FileLog says:

Sun Jan 11 16:51:35 2004 File server starting
Sun Jan 11 16:51:35 2004 afs_krb_get_lrealm failed, using ombbln.de.
Sun Jan 11 16:52:34 2004 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Sun Jan 11 16:52:34 2004 Set thread id 14 for FSYNC_sync
Sun Jan 11 16:52:34 2004 Partition /vicepe: attached 1 volumes; 0 volumes not attached
Sun Jan 11 16:52:52 2004 Partition /vicepa: attached 401 volumes; 0 volumes not attached
Sun Jan 11 16:53:00 2004 Partition /vicepb: attached 133 volumes; 0 volumes not attached
Sun Jan 11 16:53:04 2004 Partition /vicepc: attached 49 volumes; 0 volumes not attached
Sun Jan 11 16:53:07 2004 Partition /vicepd: attached 33 volumes; 0 volumes not attached
Sun Jan 11 16:53:07 2004 Set thread id 15 for 'FiveMinuteCheckLWP'
Sun Jan 11 16:53:07 2004 Set thread id 16 for 'HostCheckLWP'
Sun Jan 11 16:53:07 2004 Getting FileServer name...
Sun Jan 11 16:53:07 2004 FileServer host name is 'afs1'
Sun Jan 11 16:53:07 2004 Getting FileServer address...
Sun Jan 11 16:53:07 2004 FileServer afs1 has address 192.168.9.7 (0x709a8c0 or 0xc0a80907 in host byte order)
Sun Jan 11 16:53:07 2004 File Server started Sun Jan 11 16:53:07 2004
Sun Jan 11 16:58:07 2004 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)

The last message repeats continuously.
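
As a quick cross-check (a sketch only; the variable names are illustrative), 
the code=5376 in the FileLog is the same number as the 0x00001500 shown by 
the Windows client, so both messages appear to report the same error:

```python
# Code from the FileLog line "VL_RegisterAddrs rpc failed ... (code=5376, err=0)"
filelog_code = 5376
# Code from the Windows client message "no quorum elected (0x00001500)"
client_code = 0x00001500

# Compare the decimal and hexadecimal spellings of the two codes
same_error = filelog_code == client_code
print(f"{filelog_code} == 0x{filelog_code:08x}: {same_error}")
```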

I ran udebug against both servers and got:

[root@afs1 root]# udebug afs1 7003 -long
Host's addresses are: 192.168.9.7
Host's 192.168.9.7 time is Sun Jan 11 19:01:35 2004
Local time is Sun Jan 11 19:01:36 2004 (time differential 1 secs)
Last yes vote for 192.168.9.7 was 0 secs ago (not sync site);
Last vote started 0 secs ago (at Sun Jan 11 19:01:36 2004)
Local db version is 1073185535.95
I am not sync site
Lowest host 192.168.9.7 was set 0 secs ago
Sync host 0.0.0.0 was set 1073844095 secs ago
Sync site's db version is 1073185535.95
0 locked pages, 0 of them for write

Server (192.168.9.8): (db 0.0)
     last vote rcvd 1 secs ago (at Sun Jan 11 19:01:35 2004),
     last beacon sent 0 secs ago (at Sun Jan 11 19:01:36 2004), last vote was yes
     dbcurrent=0, up=1 beaconSince=1


[root@afs1 root]# udebug afs2 7003 -long
Host's addresses are: 192.168.9.8
Host's 192.168.9.8 time is Sun Jan 11 19:02:42 2004
Local time is Sun Jan 11 19:02:45 2004 (time differential 3 secs)
Last yes vote for 192.168.9.7 was 8 secs ago (not sync site);
Last vote started 7 secs ago (at Sun Jan 11 19:02:38 2004)
Local db version is 1073185535.95
I am not sync site
Lowest host 192.168.9.7 was set 8 secs ago
Sync host 0.0.0.0 was set 1073844162 secs ago
Sync site's db version is 1073185535.95
0 locked pages, 0 of them for write

Server (192.168.9.7): (db 0.0)
     last vote rcvd 6263 secs ago (at Sun Jan 11 17:18:22 2004),
     last beacon sent 6263 secs ago (at Sun Jan 11 17:18:22 2004), last vote was no
     dbcurrent=0, up=1 beaconSince=1

The problem seems to be that neither of the two servers is the sync site; 
instead, address 0.0.0.0 (which is really the lowest possible IP address) 
is being held to be the sync site.
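
One possible reading of the "Sync host 0.0.0.0 was set ... secs ago" lines, 
offered as a sketch rather than a confirmed description of udebug: the two 
counts, taken as seconds since the Unix epoch, land exactly on the host 
times each server reports. That would be consistent with a stored timestamp 
of zero, i.e. a sync host that was never set, rather than a real host at 
0.0.0.0. The +0100 offset is taken from the mail header; the helper name 
as_local is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the servers run at UTC+1, as in the mail header (+0100)
LOCAL_TZ = timezone(timedelta(hours=1))

def as_local(seconds_since_epoch):
    """Format a Unix timestamp as local wall-clock time."""
    return datetime.fromtimestamp(seconds_since_epoch, LOCAL_TZ).strftime(
        "%Y-%m-%d %H:%M:%S"
    )

# "Sync host 0.0.0.0 was set 1073844095 secs ago" (udebug afs1)
print(as_local(1073844095))  # 2004-01-11 19:01:35, the host time on afs1
# "Sync host 0.0.0.0 was set 1073844162 secs ago" (udebug afs2)
print(as_local(1073844162))  # 2004-01-11 19:02:42, the host time on afs2
```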

Any help appreciated!

Michael Braun