[OpenAFS] replica server not "failing over" ?
Sven Oehme
oehmes@de.ibm.com
Thu, 26 Feb 2004 08:32:54 +0100
Hi James,
please cd to /afs and run fs whereis mydomain.com/ to check whether it
reports both afs1 and afs2.
If it does not, run fs checkvolumes and fs flushmount mydomain.com/, then
check whether fs whereis now reports both sites.
As long as fs whereis does not report both sites, failover will not work.
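
The check above can be scripted. The following is only a minimal sketch: it assumes that fs whereis prints a line of the form "File <path> is on host(s) <names>", and the function names hosts_from_whereis and reports_all_sites are made up for illustration. It parses text only, so the command output has to be captured separately.

```python
import re


def hosts_from_whereis(output: str) -> list:
    """Return the host names listed in `fs whereis` output.

    Assumes the line looks like 'File <path> is on host(s) <names>'.
    """
    match = re.search(r"is on hosts?\s+(.*)", output)
    return match.group(1).split() if match else []


def reports_all_sites(output: str, expected: list) -> bool:
    """True if every expected file server appears in the output."""
    found = {host.lower() for host in hosts_from_whereis(output)}
    return all(name.lower() in found for name in expected)


# Hypothetical sample lines; real output may differ in detail.
both = "File mydomain.com/ is on hosts afs1.mydomain.com afs2.mydomain.com"
one = "File mydomain.com/ is on host afs1.mydomain.com"
servers = ["afs1.mydomain.com", "afs2.mydomain.com"]

print(reports_all_sites(both, servers))
print(reports_all_sites(one, servers))
```

If the second case applies on a client, that is the situation where fs checkvolumes and fs flushmount are worth running before checking again.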
Sven
-------------------------------------------------------------------------------------------------------------------------
Dept. A141, TG/SSG EMEA AIS Strategy and Architecture
Development Leader Stonehenge
IBM intranet ---> http://w3.ais.mainz.de.ibm.com/stonehenge/
internet ---> http://www-5.ibm.com/services/de/its/filestore.html
Phone (+49)-6131-84-3151
Fax (+49)-6131-84-6708
Mobil (+49)-171-970-6664
E-Mail : oehmes@de.ibm.com
James Schmidt <james@JamesSchmidt.Com>
Sent by: openafs-info-admin@openafs.org
26.02.2004 00:33
To: openafs-info@openafs.org
cc:
Subject: [OpenAFS] replica server not "failing over" ?
Hi All-
I've racked my brains over this issue and I keep hitting a brick wall.
I've read every bit of documentation I can find and I fail to see where
I'm going wrong.
I've got two OpenAFS servers, afs1 and afs2. Afs1 is the primary. I've
created RO volume replicas on afs2, and 'vos listvldb' shows the correct
info. However, if I take afs1 offline, all of the clients time out
(including afs2, which is also a client).
Here is the configuration info for both servers (sorry for such a long
message but I wanted to dump all of the info I had).
Server Hardware/OS Information:
-------------------------------
Linux Fedora Core 1, Kernel 2.4.22-1.2115.nptl, using
openafs-1.2.11-fc1.0.1.i386 RPMs, on generic Pentium III test boxes.
CellServDB (/usr/vice/etc/CellServDB which is symlinked to /usr/afs/etc on
both machines). This is also the CellServDB which is on all of the
clients.
----------------------------------------------------------------------------
>mydomain.com #Cell name
192.168.2.20 #afs1.mydomain.com
192.168.2.21 #afs2.mydomain.com
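
To compare the CellServDB on the servers and clients mechanically, something like the following could be used. It is a rough sketch that assumes the layout shown above (a ">cell #comment" line followed by "address #hostname" lines); parse_cellservdb is an illustrative name, not part of OpenAFS.

```python
def parse_cellservdb(text: str) -> dict:
    """Map each cell name to a list of (address, comment) pairs."""
    cells = {}
    current = None
    for raw in text.splitlines():
        entry, _, comment = raw.strip().partition("#")
        entry, comment = entry.strip(), comment.strip()
        if entry.startswith(">"):
            # Cell header, e.g. ">mydomain.com #Cell name"
            name = entry[1:].split()
            if name:
                current = name[0]
                cells[current] = []
        elif entry and current is not None:
            # Database server line, e.g. "192.168.2.20 #afs1.mydomain.com"
            cells[current].append((entry.split()[0], comment))
    return cells


sample = """>mydomain.com #Cell name
192.168.2.20 #afs1.mydomain.com
192.168.2.21 #afs2.mydomain.com
"""

for cell, servers in parse_cellservdb(sample).items():
    print(cell, servers)
```

Running it over the file from each machine and comparing the resulting dictionaries would show whether any client has a different server list.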
ThisCell (also symlinked on both machines):
--------------------------
mydomain.com
'fs listcells' output:
Cell mydomain.com on hosts afs1.mydomain.com afs2.mydomain.com.
"vos listvldb" output from AFS1:
--------------------------------
[admin@afs1 admin]$ vos listvldb
VLDB entries for all servers
root.afs
RWrite: 536870912 ROnly: 536870913
number of sites -> 3
server afs1.mydomain.com partition /vicepa RW Site
server afs1.mydomain.com partition /vicepa RO Site
server afs2.mydomain.com partition /vicepa RO Site
root.cell
RWrite: 536870915 ROnly: 536870916
number of sites -> 3
server afs1.mydomain.com partition /vicepa RW Site
server afs1.mydomain.com partition /vicepa RO Site
server afs2.mydomain.com partition /vicepa RO Site
www
RWrite: 536870918 ROnly: 536870919
number of sites -> 3
server afs1.mydomain.com partition /vicepa RW Site
server afs1.mydomain.com partition /vicepa RO Site
server afs2.mydomain.com partition /vicepa RO Site
Total entries: 3
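
The listing above can also be checked by a script rather than by eye. This is a sketch only: ro_sites_by_volume is a hypothetical helper, and it assumes the 'vos listvldb' layout shown here, where a line with a single word is a volume name and site lines read "server <host> partition <part> <type> Site".

```python
def ro_sites_by_volume(listvldb_output: str) -> dict:
    """Map each volume name to the servers holding an RO site for it."""
    sites = {}
    current = None
    for raw in listvldb_output.splitlines():
        words = raw.split()
        if not words:
            continue
        if words[0] == "server" and current is not None:
            # e.g. "server afs2.mydomain.com partition /vicepa RO Site"
            if len(words) >= 6 and words[4] == "RO":
                sites[current].append(words[1])
        elif len(words) == 1:
            # A lone token is treated as a volume name, e.g. "root.afs"
            current = words[0]
            sites[current] = []
    return sites


sample = """VLDB entries for all servers
root.afs
RWrite: 536870912 ROnly: 536870913
number of sites -> 3
server afs1.mydomain.com partition /vicepa RW Site
server afs1.mydomain.com partition /vicepa RO Site
server afs2.mydomain.com partition /vicepa RO Site
Total entries: 1
"""

for volume, servers in ro_sites_by_volume(sample).items():
    # A volume with fewer than two RO servers cannot survive one server going down.
    print(volume, servers, "replicated" if len(set(servers)) > 1 else "single site")
```

This only confirms what the VLDB records; it says nothing about what a client's cache manager currently believes.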
'bos status afs1.mydomain.com -long' output for AFS1
----------------------------------------------------
Instance kaserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/kaserver'
Instance buserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/buserver'
Instance ptserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/ptserver'
Instance vlserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/vlserver'
Instance fs, (type is fs) currently running normally.
Auxiliary status is: file server running.
Process last started at Wed Feb 25 15:59:03 2004 (2 proc starts)
Command 1 is '/usr/afs/bin/fileserver'
Command 2 is '/usr/afs/bin/volserver'
Command 3 is '/usr/afs/bin/salvager'
Instance upserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 15:59:03 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/upserver -crypt /usr/afs/etc -clear /usr/afs/bin'
'bos status afs2.mydomain.com' output for AFS2:
-----------------------------------------------
Instance kaserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 14:11:08 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/kaserver'
Instance buserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 14:11:08 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/buserver'
Instance ptserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 14:11:08 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/ptserver'
Instance vlserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 14:11:08 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/vlserver'
Instance fs, (type is fs) currently running normally.
Auxiliary status is: file server running.
Process last started at Wed Feb 25 14:21:25 2004 (2 proc starts)
Command 1 is '/usr/afs/bin/fileserver'
Command 2 is '/usr/afs/bin/volserver'
Command 3 is '/usr/afs/bin/salvager'
Instance upserver, (type is simple) currently running normally.
Process last started at Wed Feb 25 14:24:55 2004 (1 proc starts)
Command 1 is '/usr/afs/bin/upserver -crypt /usr/afs/etc -clear /usr/afs/bin'
'bos listhosts' output from AFS1 and AFS2 are identical:
--------------------------------------------------------
Cell name is mydomain.com
Host 1 is afs1.mydomain.com
Host 2 is afs2.mydomain.com
both 'vos syncvldb' and 'vos syncserv' complete with no errors.
-----------------------------------------------------------------
[admin@afs2 /]$ vos syncvldb afs1.mydomain.com -cell mydomain.com
VLDB synchronized with state of server afs1.mydomain.com
[admin@afs2 /]$ vos syncserv afs1.mydomain.com -cell mydomain.com
Server afs1.mydomain.com synchronized with VLDB
'vos listvol -server afs1.mydomain.com'
---------------------------------------
Total number of volumes on server afs1.mydomain.com partition /vicepa: 6
root.afs 536870912 RW 4 K On-line
root.afs.readonly 536870913 RO 4 K On-line
root.cell 536870915 RW 3 K On-line
root.cell.readonly 536870916 RO 3 K On-line
www 536870918 RW 5 K On-line
www.readonly 536870919 RO 5 K On-line
Total volumes onLine 6 ; Total volumes offLine 0 ; Total busy 0
Total number of volumes on server afs1.mydomain.com partition /vicepb: 0
Total volumes onLine 0 ; Total volumes offLine 0 ; Total busy 0
'vos listvol -server afs2.mydomain.com'
---------------------------------------
Total number of volumes on server afs2.mydomain.com partition /vicepa: 3
root.afs.readonly 536870913 RO 4 K On-line
root.cell.readonly 536870916 RO 3 K On-line
www.readonly 536870919 RO 5 K On-line
Total volumes onLine 3 ; Total volumes offLine 0 ; Total busy 0
Total number of volumes on server afs2.mydomain.com partition /vicepb: 0
Total volumes onLine 0 ; Total volumes offLine 0 ; Total busy 0
Everything seems fine. However, if I take AFS1 down and then try to do
anything inside /afs on a client machine (ls, cd <dir>, etc.), the client
times out:
[root@afs1 /]# /etc/init.d/afs stop
Stopping AFS services.....
Stopping AFS bosserver
free(): invalid pointer 0xbf3fc010!
free(): invalid pointer 0xbf3cb010!
[root@afs1 /]#
On The Client:
[root@www2 /]# cd /afs
[root@www2 afs]# ls -al
drwxrwxrwx 2 root root 2048 Feb 25 14:55 .mydomain.com
drwxrwxrwx 2 root root 2048 Feb 25 14:55 mydomain.com
[root@www2 afs]# cd mydomain.com/   <--- this should be the replicated RO volume, correct?
[root@www2 mydomain.com]# ls -la
ls: .: Connection timed out
[root@www2 mydomain.com]#
Since the secondary AFS server, AFS2, holds a copy of the RO volume, I
should still be able to cd into this directory and read files, correct?
I am not sure where to look next.
Thanks in advance,
James Schmidt
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info