[OpenAFS] replica server not "failing over" ?
James Schmidt
james@JamesSchmidt.Com
Wed, 25 Feb 2004 17:33:00 -0600 (CST)
Hi All-
I've racked my brains over this issue and I keep hitting a brick wall. I've read every bit of documentation I can find and I still fail to see where I'm going wrong.
I've got two OpenAFS servers, afs1 and afs2. Afs1 is the primary. I've created RO volume replicas on AFS2, and 'vos listvldb' shows the correct info. However, if I take afs1 offline, all of the clients time out (including AFS2, which is also a client).
Here is the configuration info for both servers (sorry for such a long message but I wanted to dump all of the info I had).
Server Hardware/OS Information:
-------------------------------
Linux Fedora Core 1, Kernel 2.4.22-1.2115.nptl, using openafs-1.2.11-fc1.0.1.i386 RPMs, on generic Pentium III test boxes.
CellServDB (/usr/vice/etc/CellServDB, which is symlinked to /usr/afs/etc on both machines). This is also the CellServDB on all of the clients.
----------------------------------------------------------------------------
>mydomain.com #Cell name
192.168.2.20 #afs1.mydomain.com
192.168.2.21 #afs2.mydomain.com
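To double-check which database servers a client would learn from this file, here is a minimal Python sketch that parses the three lines above. It assumes only the simple layout shown (a ">cell #comment" line followed by "ip #hostname" lines); parse_cellservdb is a hypothetical helper name, not part of any OpenAFS tool.

```python
# Sketch: parse CellServDB-style text into {cell: [(ip, hostname), ...]}.
# Assumption: the file uses only the layout pasted above.
# parse_cellservdb is a hypothetical helper, not an OpenAFS API.

CELLSERVDB = """\
>mydomain.com #Cell name
192.168.2.20 #afs1.mydomain.com
192.168.2.21 #afs2.mydomain.com
"""

def parse_cellservdb(text):
    cells = {}
    current = None
    for raw in text.splitlines():
        # Split each line into the entry and its trailing "#comment".
        entry, _, comment = raw.strip().partition("#")
        entry = entry.strip()
        comment = comment.strip()
        if not entry:
            continue  # blank or comment-only line
        if entry.startswith(">"):
            # A ">" line starts a new cell.
            current = entry[1:].strip()
            cells[current] = []
        elif current is not None:
            # Any other line is a server address; the comment holds its name.
            cells[current].append((entry, comment))
    return cells

cells = parse_cellservdb(CELLSERVDB)
for cell, servers in cells.items():
    print(cell)
    for ip, host in servers:
        print("  ", ip, host)
```

Run against the copy on each client, this should list both afs1 and afs2 for the cell if the file matches the one above.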
ThisCell (also symlinked on both machines):
--------------------------
mydomain.com
'fs listcells' output:
Cell mydomain.com on hosts afs1.mydomain.com afs2.mydomain.com.
"vos listvldb" output from AFS1:
--------------------------------
[admin@afs1 admin]$ vos listvldb
VLDB entries for all servers
root.afs
RWrite: 536870912 ROnly: 536870913
number of sites -> 3
server afs1.mydomain.com partition /vicepa RW Site
server afs1.mydomain.com partition /vicepa RO Site
server afs2.mydomain.com partition /vicepa RO Site
root.cell
RWrite: 536870915 ROnly: 536870916
number of sites -> 3
server afs1.mydomain.com partition /vicepa RW Site
server afs1.mydomain.com partition /vicepa RO Site
server afs2.mydomain.com partition /vicepa RO Site
www
RWrite: 536870918 ROnly: 536870919
number of sites -> 3
server afs1.mydomain.com partition /vicepa RW Site
server afs1.mydomain.com partition /vicepa RO Site
server afs2.mydomain.com partition /vicepa RO Site
Total entries: 3
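As a sanity check on the listing above, here is a small Python sketch that groups the sites by volume and reports which volumes have an RO site on a server other than the one holding the RW copy. It works only on the pasted text; the helper names sites_by_volume and offsite_ro_servers are hypothetical, and treating any space-free line as a volume name is an assumption that holds for this output.

```python
import re

# Text copied from the 'vos listvldb' output above.
LISTVLDB = """\
VLDB entries for all servers
root.afs
RWrite: 536870912 ROnly: 536870913
number of sites -> 3
server afs1.mydomain.com partition /vicepa RW Site
server afs1.mydomain.com partition /vicepa RO Site
server afs2.mydomain.com partition /vicepa RO Site
root.cell
RWrite: 536870915 ROnly: 536870916
number of sites -> 3
server afs1.mydomain.com partition /vicepa RW Site
server afs1.mydomain.com partition /vicepa RO Site
server afs2.mydomain.com partition /vicepa RO Site
www
RWrite: 536870918 ROnly: 536870919
number of sites -> 3
server afs1.mydomain.com partition /vicepa RW Site
server afs1.mydomain.com partition /vicepa RO Site
server afs2.mydomain.com partition /vicepa RO Site
Total entries: 3
"""

SITE_RE = re.compile(r"^server (\S+) partition (\S+) (RW|RO) Site")

def sites_by_volume(text):
    """Return {volume: [(server, partition, kind), ...]}."""
    volumes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = SITE_RE.match(line)
        if match:
            if current is not None:
                volumes[current].append(match.groups())
        elif line and " " not in line:
            # Assumption: a line with no spaces is a volume name.
            current = line
            volumes[current] = []
    return volumes

def offsite_ro_servers(sites):
    """Servers holding an RO site but not the RW site."""
    rw = {server for server, _part, kind in sites if kind == "RW"}
    ro = {server for server, _part, kind in sites if kind == "RO"}
    return sorted(ro - rw)

volumes = sites_by_volume(LISTVLDB)
offsite = {name: offsite_ro_servers(sites) for name, sites in volumes.items()}
for name, servers in offsite.items():
    print(name, "-> RO replicas away from the RW server:", servers)
```

For the listing above, every volume reports afs2.mydomain.com, which matches what I expected the VLDB to say.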
'bos status afs1.mydomain.com -long' output for AFS1
----------------------------------------------------
Instance kaserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/kaserver'
Instance buserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/buserver'
Instance ptserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/ptserver'
Instance vlserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/vlserver'
Instance fs, (type is fs) currently running normally.
Auxiliary status is: file server running.
Process last started at Wed Feb 25 15:59:03 2004 (2 proc starts)
Command 1 is '/usr/afs/bin/fileserver'
Command 2 is '/usr/afs/bin/volserver'
Command 3 is '/usr/afs/bin/salvager'
Instance upserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/upserver -crypt /usr/afs/etc -clear /usr/afs/bin'
'bos status afs2.mydomain.com' output for AFS2:
-----------------------------------------------
Instance kaserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 14:11:08 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/kaserver'
Instance buserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 14:11:08 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/buserver'
Instance ptserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 14:11:08 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/ptserver'
Instance vlserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 14:11:08 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/vlserver'
Instance fs, (type is fs) currently running normally.
Auxiliary status is: file server running.
Process last started at Wed Feb 25 14:21:25 2004 (2 proc starts)
Command 1 is '/usr/afs/bin/fileserver'
Command 2 is '/usr/afs/bin/volserver'
Command 3 is '/usr/afs/bin/salvager'
Instance upserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 14:24:55 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/upserver -crypt /usr/afs/etc -clear /usr/afs/bin'
'bos listhosts' output from AFS1 and AFS2 is identical:
--------------------------------------------------------
Cell name is mydomain.com
Host 1 is afs1.mydomain.com
Host 2 is afs2.mydomain.com
both 'vos syncvldb' and 'vos syncserv' complete with no errors.
-----------------------------------------------------------------
[admin@afs2 /]$ vos syncvldb afs1.mydomain.com -cell mydomain.com
VLDB synchronized with state of server afs1.mydomain.com
[admin@afs2 /]$ vos syncserv afs1.mydomain.com -cell mydomain.com
Server afs1.mydomain.com synchronized with VLDB
'vos listvol -server afs1.mydomain.com'
---------------------------------------
Total number of volumes on server afs1.mydomain.com partition /vicepa: 6
root.afs 536870912 RW 4 K On-line
root.afs.readonly 536870913 RO 4 K On-line
root.cell 536870915 RW 3 K On-line
root.cell.readonly 536870916 RO 3 K On-line
www 536870918 RW 5 K On-line
www.readonly 536870919 RO 5 K On-line
Total volumes onLine 6 ; Total volumes offLine 0 ; Total busy 0
Total number of volumes on server afs1.mydomain.com partition /vicepb: 0
Total volumes onLine 0 ; Total volumes offLine 0 ; Total busy 0
'vos listvol -server afs2.mydomain.com'
---------------------------------------
Total number of volumes on server afs2.mydomain.com partition /vicepa: 3
root.afs.readonly 536870913 RO 4 K On-line
root.cell.readonly 536870916 RO 3 K On-line
www.readonly 536870919 RO 5 K On-line
Total volumes onLine 3 ; Total volumes offLine 0 ; Total busy 0
Total number of volumes on server afs2.mydomain.com partition /vicepb: 0
Total volumes onLine 0 ; Total volumes offLine 0 ; Total busy 0
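To confirm that every RO volume on afs1 also exists and is on-line on afs2, here is a minimal Python sketch that compares the two /vicepa listings above. It parses only the pasted text; online_ro_ids is a hypothetical helper name, and the regular expression assumes the "name id type size K status" layout shown in this output.

```python
import re

# /vicepa lines copied from the two 'vos listvol' outputs above.
AFS1_LISTVOL = """\
Total number of volumes on server afs1.mydomain.com partition /vicepa: 6
root.afs 536870912 RW 4 K On-line
root.afs.readonly 536870913 RO 4 K On-line
root.cell 536870915 RW 3 K On-line
root.cell.readonly 536870916 RO 3 K On-line
www 536870918 RW 5 K On-line
www.readonly 536870919 RO 5 K On-line
Total volumes onLine 6 ; Total volumes offLine 0 ; Total busy 0
"""

AFS2_LISTVOL = """\
Total number of volumes on server afs2.mydomain.com partition /vicepa: 3
root.afs.readonly 536870913 RO 4 K On-line
root.cell.readonly 536870916 RO 3 K On-line
www.readonly 536870919 RO 5 K On-line
Total volumes onLine 3 ; Total volumes offLine 0 ; Total busy 0
"""

# name, numeric ID, type, size in K, status
VOL_RE = re.compile(r"^(\S+)\s+(\d+)\s+(RW|RO|BK)\s+(\d+)\s+K\s+(\S+)$")

def online_ro_ids(text):
    """Return the set of volume IDs listed as RO and On-line."""
    ids = set()
    for raw in text.splitlines():
        match = VOL_RE.match(raw.strip())
        if match and match.group(3) == "RO" and match.group(5) == "On-line":
            ids.add(int(match.group(2)))
    return ids

afs1_ro = online_ro_ids(AFS1_LISTVOL)
afs2_ro = online_ro_ids(AFS2_LISTVOL)
missing_on_afs2 = afs1_ro - afs2_ro

print("RO volumes on afs1:", sorted(afs1_ro))
print("RO volumes on afs2:", sorted(afs2_ro))
print("On afs1 but missing on afs2:", sorted(missing_on_afs2))
```

With the listings above, the set of missing volumes comes out empty, so the replicas themselves appear to be in place on afs2.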
Everything seems fine. However, if I take AFS1 down and then try to do anything inside of /afs on a client machine (ls, cd <dir>, etc.), the client times out:
[root@afs1 /]# /etc/init.d/afs stop
Stopping AFS services.....
Stopping AFS bosserver
free(): invalid pointer 0xbf3fc010!
free(): invalid pointer 0xbf3cb010!
[root@afs1 /]#
On the client:
[root@www2 /]# cd /afs
[root@www2 afs]# ls -al
drwxrwxrwx 2 root root 2048 Feb 25 14:55 .mydomain.com
drwxrwxrwx 2 root root 2048 Feb 25 14:55 mydomain.com
[root@www2 afs]# cd mydomain.com/ <--- this should be the replicated RO volume, correct?
[root@www2 mydomain.com]# ls -la
ls: .: Connection timed out
[root@www2 mydomain.com]#
As I understand it, since the secondary AFS server, AFS2, has a copy of the RO volume, I should still be able to cd into this directory and read files, correct?
I am not sure where to look next.
Thanks in advance,
James Schmidt