[OpenAFS] replica server not "failing over" ?

James Schmidt james@JamesSchmidt.Com
Wed, 25 Feb 2004 17:33:00 -0600 (CST)


Hi All-

I've racked my brains over this issue and I keep hitting a brick wall.  I've read every bit of documentation I can find and I still fail to see where I'm going wrong.

I have two OpenAFS servers, afs1 and afs2.  afs1 is the primary.  I've created RO volume replicas on afs2, and 'vos listvldb' shows the correct info; however, if I take afs1 offline, all of the clients time out (including afs2, which is also a client).

Here is the configuration info for both servers (sorry for such a long message but I wanted to dump all of the info I had).

Server Hardware/OS Information:
-------------------------------
Linux Fedora Core 1, Kernel 2.4.22-1.2115.nptl, using openafs-1.2.11-fc1.0.1.i386 RPMs, on generic Pentium III test boxes.

CellServDB (/usr/vice/etc/CellServDB, which is symlinked to /usr/afs/etc on both machines).  This is also the CellServDB that is on all of the clients.
----------------------------------------------------------------------------
>mydomain.com	#Cell name
192.168.2.20	#afs1.mydomain.com
192.168.2.21	#afs2.mydomain.com
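
For anyone who wants to compare this file across the servers and clients programmatically, here is a minimal Python sketch. It assumes the usual CellServDB layout shown above (a `>cellname` line followed by `address #hostname` lines); the function name `parse_cellservdb` and the `SAMPLE` constant are hypothetical and only for illustration.

```python
def parse_cellservdb(text):
    """Return {cell_name: [(ip_address, hostname_comment), ...]}."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Everything after '#' is treated as a comment (the host name on
        # server lines, free text on the cell line).
        entry, _, comment = line.partition("#")
        entry = entry.strip()
        comment = comment.strip()
        if entry.startswith(">"):
            # A '>' line starts a new cell.
            current = entry[1:].strip()
            cells[current] = []
        elif current is not None and entry:
            cells[current].append((entry, comment))
    return cells


# The CellServDB contents quoted in this message.
SAMPLE = """>mydomain.com\t#Cell name
192.168.2.20\t#afs1.mydomain.com
192.168.2.21\t#afs2.mydomain.com
"""

if __name__ == "__main__":
    for cell, servers in parse_cellservdb(SAMPLE).items():
        print(cell, servers)
```

Running the parser on the file from each machine and comparing the resulting dictionaries would show whether any client has a different database server list.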

ThisCell (also symlinked on both machines):
--------------------------
mydomain.com

'fs listcells' output:
Cell mydomain.com on hosts afs1.mydomain.com afs2.mydomain.com.

"vos listvldb" output from AFS1:
--------------------------------
[admin@afs1 admin]$ vos listvldb
VLDB entries for all servers

root.afs
    RWrite: 536870912     ROnly: 536870913
    number of sites -> 3
       server afs1.mydomain.com partition /vicepa RW Site
       server afs1.mydomain.com partition /vicepa RO Site
       server afs2.mydomain.com partition /vicepa RO Site

root.cell
    RWrite: 536870915     ROnly: 536870916
    number of sites -> 3
       server afs1.mydomain.com partition /vicepa RW Site
       server afs1.mydomain.com partition /vicepa RO Site
       server afs2.mydomain.com partition /vicepa RO Site

www
    RWrite: 536870918     ROnly: 536870919
    number of sites -> 3
       server afs1.mydomain.com partition /vicepa RW Site
       server afs1.mydomain.com partition /vicepa RO Site
       server afs2.mydomain.com partition /vicepa RO Site

Total entries: 3
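
To double-check the listing above, the following Python sketch parses `vos listvldb` output and reports any volume that has no RO site on a server other than the one being taken down. It assumes the output layout shown in this message; the names `parse_listvldb` and `volumes_without_ro_elsewhere` are hypothetical helpers, and the sample is a trimmed excerpt of the output above. It only inspects what the VLDB records, not whether clients actually fail over.

```python
import re

# Matches site lines such as:
#        server afs1.mydomain.com partition /vicepa RO Site
SITE_RE = re.compile(r"^\s+server (\S+) partition (\S+) (RW|RO) Site")


def parse_listvldb(text):
    """Return {volume_name: [(server, partition, site_type), ...]}."""
    volumes = {}
    current = None
    for line in text.splitlines():
        match = SITE_RE.match(line)
        if match:
            if current is not None:
                volumes[current].append(match.groups())
            continue
        stripped = line.strip()
        # Assumption: a volume entry starts with an unindented line that
        # holds only the volume name (no spaces), e.g. "root.afs".
        if stripped and not line[0].isspace() and " " not in stripped:
            current = stripped
            volumes[current] = []
    return volumes


def volumes_without_ro_elsewhere(volumes, down_server):
    """Names of volumes with no RO site on any server except down_server."""
    at_risk = []
    for name, sites in volumes.items():
        others = [srv for srv, _, kind in sites
                  if kind == "RO" and srv != down_server]
        if not others:
            at_risk.append(name)
    return sorted(at_risk)


# Trimmed excerpt of the 'vos listvldb' output quoted in this message.
SAMPLE = """\
VLDB entries for all servers

root.afs
    RWrite: 536870912     ROnly: 536870913
    number of sites -> 3
       server afs1.mydomain.com partition /vicepa RW Site
       server afs1.mydomain.com partition /vicepa RO Site
       server afs2.mydomain.com partition /vicepa RO Site

www
    RWrite: 536870918     ROnly: 536870919
    number of sites -> 3
       server afs1.mydomain.com partition /vicepa RW Site
       server afs1.mydomain.com partition /vicepa RO Site
       server afs2.mydomain.com partition /vicepa RO Site
"""

if __name__ == "__main__":
    vols = parse_listvldb(SAMPLE)
    print(volumes_without_ro_elsewhere(vols, "afs1.mydomain.com"))
```

An empty list for afs1.mydomain.com means every volume in the excerpt has an RO site registered on another server, which matches what I read from the listing by eye.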

'bos status afs1.mydomain.com -long' output for AFS1
----------------------------------------------------
Instance kaserver, (type is simple) currently running normally.
    Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
    Command 1 is '/usr/afs/bin/kaserver'

Instance buserver, (type is simple) currently running normally.
    Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
    Command 1 is '/usr/afs/bin/buserver'

Instance ptserver, (type is simple) currently running normally.
    Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
    Command 1 is '/usr/afs/bin/ptserver'

Instance vlserver, (type is simple) currently running normally.
    Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
    Command 1 is '/usr/afs/bin/vlserver'

Instance fs, (type is fs) currently running normally.
    Auxiliary status is: file server running.
    Process last started at Wed Feb 25 15:59:03 2004 (2 proc starts)
    Command 1 is '/usr/afs/bin/fileserver'
    Command 2 is '/usr/afs/bin/volserver'
    Command 3 is '/usr/afs/bin/salvager'

Instance upserver, (type is simple) currently running normally.
    Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
    Command 1 is '/usr/afs/bin/upserver -crypt /usr/afs/etc -clear /usr/afs/bin'


'bos status afs2.mydomain.com' output for AFS2:
-----------------------------------------------
Instance kaserver, (type is simple) currently running normally.
    Process last started at Wed Feb 25 14:11:08 2004 (1 proc starts)
    Command 1 is '/usr/afs/bin/kaserver'

Instance buserver, (type is simple) currently running normally.
    Process last started at Wed Feb 25 14:11:08 2004 (1 proc starts)
    Command 1 is '/usr/afs/bin/buserver'

Instance ptserver, (type is simple) currently running normally.
    Process last started at Wed Feb 25 14:11:08 2004 (1 proc starts)
    Command 1 is '/usr/afs/bin/ptserver'

Instance vlserver, (type is simple) currently running normally.
    Process last started at Wed Feb 25 14:11:08 2004 (1 proc starts)
    Command 1 is '/usr/afs/bin/vlserver'

Instance fs, (type is fs) currently running normally.
    Auxiliary status is: file server running.
    Process last started at Wed Feb 25 14:21:25 2004 (2 proc starts)
    Command 1 is '/usr/afs/bin/fileserver'
    Command 2 is '/usr/afs/bin/volserver'
    Command 3 is '/usr/afs/bin/salvager'

Instance upserver, (type is simple) currently running normally.
    Process last started at Wed Feb 25 14:24:55 2004 (1 proc starts)
    Command 1 is '/usr/afs/bin/upserver -crypt /usr/afs/etc -clear /usr/afs/bin'

'bos listhosts' output from AFS1 and AFS2 is identical:
--------------------------------------------------------
Cell name is mydomain.com
    Host 1 is afs1.mydomain.com
    Host 2 is afs2.mydomain.com

Both 'vos syncvldb' and 'vos syncserv' complete with no errors.
-----------------------------------------------------------------
[admin@afs2 /]$ vos syncvldb afs1.mydomain.com -cell mydomain.com
VLDB synchronized with state of server afs1.mydomain.com
[admin@afs2 /]$ vos syncserv afs1.mydomain.com -cell mydomain.com
Server afs1.mydomain.com synchronized with VLDB

'vos listvol -server afs1.mydomain.com'
---------------------------------------
Total number of volumes on server afs1.mydomain.com partition /vicepa: 6
root.afs                          536870912 RW          4 K On-line
root.afs.readonly                 536870913 RO          4 K On-line
root.cell                         536870915 RW          3 K On-line
root.cell.readonly                536870916 RO          3 K On-line
www                               536870918 RW          5 K On-line
www.readonly                      536870919 RO          5 K On-line

Total volumes onLine 6 ; Total volumes offLine 0 ; Total busy 0
Total number of volumes on server afs1.mydomain.com partition /vicepb: 0
Total volumes onLine 0 ; Total volumes offLine 0 ; Total busy 0


'vos listvol -server afs2.mydomain.com'
---------------------------------------
Total number of volumes on server afs2.mydomain.com partition /vicepa: 3
root.afs.readonly                 536870913 RO          4 K On-line
root.cell.readonly                536870916 RO          3 K On-line
www.readonly                      536870919 RO          5 K On-line

Total volumes onLine 3 ; Total volumes offLine 0 ; Total busy 0
Total number of volumes on server afs2.mydomain.com partition /vicepb: 0
Total volumes onLine 0 ; Total volumes offLine 0 ; Total busy 0
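
The two `vos listvol` listings can also be compared mechanically. The Python sketch below assumes the six-column volume lines shown above (name, ID, type, size, unit, status); `parse_listvol` and `missing_replicas` are hypothetical helper names, and the samples are excerpts of the listings in this message.

```python
def parse_listvol(text):
    """Return {volume_name: {"id", "type", "size_k", "status"}}."""
    vols = {}
    for line in text.splitlines():
        parts = line.split()
        # Volume lines look like:
        #   root.afs.readonly  536870913 RO  4 K On-line
        # Header and "Total ..." summary lines do not fit this shape.
        if (len(parts) == 6 and parts[1].isdigit()
                and parts[2] in ("RW", "RO", "BK") and parts[3].isdigit()):
            name, vol_id, kind, size, _unit, status = parts
            vols[name] = {"id": int(vol_id), "type": kind,
                          "size_k": int(size), "status": status}
    return vols


def missing_replicas(primary, replica):
    """RO volumes on primary that replica lacks, holds under a
    different ID, or holds but not On-line."""
    problems = []
    for name, info in primary.items():
        if info["type"] != "RO":
            continue
        other = replica.get(name)
        if (other is None or other["id"] != info["id"]
                or other["status"] != "On-line"):
            problems.append(name)
    return sorted(problems)


# Excerpts of the 'vos listvol' output quoted in this message.
AFS1 = """\
root.afs                          536870912 RW          4 K On-line
root.afs.readonly                 536870913 RO          4 K On-line
www                               536870918 RW          5 K On-line
www.readonly                      536870919 RO          5 K On-line
"""

AFS2 = """\
root.afs.readonly                 536870913 RO          4 K On-line
www.readonly                      536870919 RO          5 K On-line
"""

if __name__ == "__main__":
    print(missing_replicas(parse_listvol(AFS1), parse_listvol(AFS2)))
```

An empty result means every RO volume listed on afs1 is also present and On-line on afs2 with the same volume ID, at least as far as these listings show.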


Everything seems fine; however, if I take AFS1 down and then try to do anything inside /afs on a client machine (ls, cd <dir>, etc.), the client times out:

[root@afs1 /]# /etc/init.d/afs stop
Stopping AFS services.....
Stopping AFS bosserver
free(): invalid pointer 0xbf3fc010!
free(): invalid pointer 0xbf3cb010!
[root@afs1 /]#

On the client:
[root@www2 /]# cd /afs
[root@www2 afs]# ls -al
drwxrwxrwx    2 root     root         2048 Feb 25 14:55 .mydomain.com
drwxrwxrwx    2 root     root         2048 Feb 25 14:55 mydomain.com
[root@www2 afs]# cd mydomain.com/       <--- this should be the replicated RO volume, correct?
[root@www2 mydomain.com]# ls -la
ls: .: Connection timed out
[root@www2 mydomain.com]#

Since the secondary AFS server, AFS2, holds a copy of the RO volume, I should still be able to cd into this directory and read files, correct?

I am not sure where to look next.


Thanks in advance,
James Schmidt