[OpenAFS] replica server not "failing over" ?

Kim (Dexter) Kimball dhk@ccre.com
Fri, 5 Mar 2004 18:23:35 -0700


As an instructor, this is one of my favorite topics ... and the AFS Administrators
Guide, as quoted below, is correct but not particularly illuminating.

The Cache Manager follows a set of "volume traversal rules" as it goes down
the volume tree, which generally are:

1.  If the mount point is -rw (%volname), go to the RW volume (ignore
replicas).
2.  If currently in a RW volume, go to the RW of the next volume (ignore
replicas, if any).
3.  If currently in a RO and the next volume in the chain is replicated
according to the VLDB, go to a RO -- assuming the mount point is not -rw.


Rules 1 and 2 are what make the dot-path convention work.

The dot-path is a -rw mount point, created with "fs mkm <dir> <vol> -rw".
"fs lsm <dot-path-dir>" will show a % in front of the volume name.  The %
indicates a RW mount point.
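
For reference, the regular and dot-path mount points for a cell would typically
be created with something like this (a sketch using the lab.ccre.com names from
the examples below; since /afs normally shows you the RO of root.afs, the
commands have to be run against a writable copy of root.afs, followed by a
release of root.afs):

    fs mkmount -dir lab.ccre.com  -vol root.cell          # '#' -- normal traversal rules
    fs mkmount -dir .lab.ccre.com -vol root.cell -rw      # '%' -- RW mount point
    vos release root.afs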

EXAMPLE  (note # and % -- # says "follow the traversal rules" and % says
"ignore replicas")

[kim@angel kim]$ fs lsm /afs/lab.ccre.com
'/afs/lab.ccre.com' is a mount point for volume '#root.cell'
[kim@angel kim]$ fs lsm /afs/.lab.ccre.com
'/afs/.lab.ccre.com' is a mount point for volume '%root.cell'
[kim@angel kim]$

The %root.cell puts you in the RW chain of volumes ...

[kim@angel kim]$ cd /afs/lab.ccre.com                   # Regular path, go to RO
[kim@angel lab.ccre.com]$ fs lq
Volume Name                   Quota      Used %Used   Partition
root.cell.readonly             5000        45    1%          0%


[kim@angel lab.ccre.com]$ cd /afs/.lab.ccre.com         # Dot path, go to RW
[kim@angel .lab.ccre.com]$ fs lq
Volume Name                   Quota      Used %Used   Partition
root.cell                      5000        45    1%          4%
[kim@angel .lab.ccre.com]$


What the explanation in the guide fails to mention is that the rules apply
all the way down the volume "tree".

That is, if you have

    /afs/cell/x/y/z/replicated-volume

anything that puts the CM (Cache Manager) in a RW volume at any node in
/afs/cell/x/y/z will cause the replicas of replicated-volume to be ignored.
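
Put positively: for the replicas of replicated-volume to ever be used, every
volume mounted along /afs/cell/x/y/z has to be replicated and released as well.
Roughly, as a sketch (server1 and partition b are just placeholders, and
vol.x/vol.y/vol.z stand for whatever volumes are mounted at x, y, and z):

    for v in root.afs root.cell vol.x vol.y vol.z; do
        vos addsite server1 b $v    # skip sites that already exist
        vos release $v
    done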

Misusing "fs mkm <dir> <volname> -rw" is one way to stumble into a RW
volume -- causing all replicas further down the chain to be ignored.
Failure to replicate any volume in the /afs/cell/x/y/z chain will put you in
a RW, causing all replicas further down the chain to be ignored.
Using volinfo regularly is one way to find out that ROs are being ignored --
symptomatic of a misplaced RW/% mountpoint or an unreplicated volume higher
up the chain.
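
One quick way to spot where that happens is to walk the path one component at a
time and look at the mount point type and the volume the CM actually handed back
at each level.  A rough sketch (walk_afs_path is just a throwaway shell helper,
not an AFS command):

    walk_afs_path() {
        p=""
        for part in $(echo "$1" | tr '/' ' '); do
            p="$p/$part"
            echo "== $p"
            fs lsmount "$p"      # prints '#vol' or '%vol', or says it isn't a mount point
            fs listquota "$p"    # the volume name tells you RW vs .readonly
        done
    }
    # e.g.   walk_afs_path /afs/lab.ccre.com/x/y/z

Assuming you started down the regular (non-dot) path, the first level at which
fs listquota stops reporting a .readonly is where the traversal dropped into the
RW chain.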

To state the rules in a different way:
   1. From a RW, I'm going to a RW; I don't care what the VLDB says.
   2. From a RO, if the VLDB says the next volume is replicated, I'm going
to a RO, and if none is available I quit.

   3. Mount points that name a .readonly or .backup volume take me to the
RO or BU volume; I don't care what rules 1 and 2 say.


IMPORTANCE OF ACCURATE VLDB

Note that the CM depends on the VLDB to determine if a volume is replicated.
The CM does not ask the fileservers/volservers or anyone else about
replication.  In other words, if the VLDB is incomplete or if it has extra
information about a volume, the CM will follow the rules -- but you know
what's on disk and you expect the CM to do something different.

Example:

[kim@angel .lab.ccre.com]$ vos listvl root.cell
# root.cell is reported to be replicated in the VLDB

root.cell
    RWrite: 536870915     ROnly: 536870916
    number of sites -> 4
       server magic.lab.ccre.com partition /vicepe RW Site
       server angel.lab.ccre.com partition /vicepc RO Site
       server magic.lab.ccre.com partition /vicepe RO Site
       server satchmo.lab.ccre.com partition /vicepa RO Site


	AS FAR AS THE CM IS CONCERNED, THE ABOVE VLDB ENTRY INDICATES THAT
ROOT.CELL IS REPLICATED.
	I'LL REMOVE THE RO SITE INFORMATION WITH VOS REMSITE.

[kim@angel .lab.ccre.com]$ vos remsite angel c root.cell
Deleting the replication site for volume 536870915 ...Removed replication
site angel /vicepc for volume root.cell
[kim@angel .lab.ccre.com]$ vos remsite magic e root.cell
Deleting the replication site for volume 536870915 ...Removed replication
site magic /vicepe for volume root.cell
[kim@angel .lab.ccre.com]$ vos remsite satchmo a root.cell
Deleting the replication site for volume 536870915 ...Removed replication
site satchmo /vicepa for volume root.cell
[kim@angel .lab.ccre.com]$ vos listvl root.cell

root.cell
    RWrite: 536870915
    number of sites -> 1
       server magic.lab.ccre.com partition /vicepe RW Site

VOS REMSITE REMOVES THE VLDB INFO, BUT LEAVES THE VOLUMES ON THE VICEP --

[kim@angel .lab.ccre.com]$ vos listvol angel -part c
Total number of volumes on server angel partition /vicepc: 4
MyDocs                            536871172 RW     247630 K On-line
root.afs.readonly                 536870913 RO         16 K On-line
root.cell.readonly                536870916 RO         45 K On-line
***** I'm still here!
sw.macromedia                     536871050 RW     547797 K On-line

Total volumes onLine 4 ; Total volumes offLine 0 ; Total busy 0

TRUST ME ON THE OTHER LOCATIONS -- THE VOLUMES ARE THERE

FS CHECKV CAUSES THE CM TO FORGET WHAT IT'S CACHED ABOUT VOLUME LOCATION

[kim@angel .lab.ccre.com]$ fs checkv
All volumeID/name mappings checked.
[kim@angel .lab.ccre.com]$

HERE'S WHAT I MEAN ABOUT THE VLDB ENTRY BEING IMPORTANT

[kim@angel .lab.ccre.com]$ cd /afs/
[kim@angel afs]$ fs lsm lab.ccre.com
'lab.ccre.com' is a mount point for volume '#root.cell'
[kim@angel afs]$
[kim@angel afs]$ cd lab.ccre.com
[kim@angel lab.ccre.com]$
[kim@angel lab.ccre.com]$ fs lq
Volume Name                   Quota      Used %Used   Partition
root.cell                      5000        45    1%          4%
[kim@angel lab.ccre.com]$

AS FAR AS THE CM CAN TELL FROM THE VLDB ENTRY, ROOT.CELL ISN'T REPLICATED
.... AS THE SYS ADMIN I KNOW BETTER ... BUT THE VLDB IS "LYING" AND THE CM
BASES ITS DECISIONS ON THE VLDB ENTRY FOR A GIVEN VOLUME -- and ignores the
replicas that do exist on the vicep's of 3 fileservers.


IF I CORRECT THE VLDB ENTRY ....
[kim@angel lab.ccre.com]$ vos addsite angel c root.cell
Added replication site angel /vicepc for volume root.cell
[kim@angel lab.ccre.com]$ vos addsite magic e root.cell
Added replication site magic /vicepe for volume root.cell
[kim@angel lab.ccre.com]$ vos addsite satchmo a root.cell
Added replication site satchmo /vicepa for volume root.cell
[kim@angel lab.ccre.com]$ vos listvl root.cell

root.cell
    RWrite: 536870915
    number of sites -> 4
       server magic.lab.ccre.com partition /vicepe RW Site
       server angel.lab.ccre.com partition /vicepc RO Site  -- Not released
       server magic.lab.ccre.com partition /vicepe RO Site  -- Not released
       server satchmo.lab.ccre.com partition /vicepa RO Site  -- Not released

[kim@angel lab.ccre.com]$ # CM, forget the volume location cache and start over
[kim@angel lab.ccre.com]$ fs checkv
All volumeID/name mappings checked.

# The CM does honor the "Not released" tags in the VLDB entry above,
otherwise we'd end up in a RO here ...

[kim@angel lab.ccre.com]$ cd /afs/lab.ccre.com
[kim@angel lab.ccre.com]$ fs lq
Volume Name                   Quota      Used %Used   Partition
root.cell                      5000        45    1%          4%
[kim@angel lab.ccre.com]$

We get rid of the tags with "vos rel"

[kim@angel lab.ccre.com]$ vos release root.cell
Released volume root.cell successfully

Tell the CM to forget what it knows about root.cell (and any other volume)

[kim@angel lab.ccre.com]$ fs checkv
All volumeID/name mappings checked.

And with the VLDB back to its original state we get what we expect ...

[kim@angel lab.ccre.com]$ cd /afs/lab.ccre.com
[kim@angel lab.ccre.com]$ fs lq
Volume Name                   Quota      Used %Used   Partition
root.cell.readonly             5000        45    1%          0%
[kim@angel lab.ccre.com]$


When teaching AFS Administration in Mexico several years ago we got to the
Volume Traversal Rules section of the class.

I got as far as "from a RW go to a RW" when one student's eyes widened and he
abruptly left the room.

The student had dutifully created replicas of a given volume at various
fileserver locations throughout Mexico.

He had been trying to figure out why the replicas weren't being accessed.
(He periodically got phone calls when the RW was unavailable.)

The replicated volume was mounted in his home directory ... volume
"user.someone" ... which wasn't replicated.

Remember RW --> RW, replicas be damned?

Exactly.

We did the right thing and made sure no one found out about the faux pas ...
neither of us cherished the idea of changing a path other folks had been
using in scripts/other executables  ... so we replicated his user.someone
volume, created a user.someone2 volume for his own use ...


Yet another way to consider volume traversal:

If the volume name in the mount point has a .readonly extension, go to a RO
and ignore the RW.  (.readonly mount points aren't common.)
If the volume name in the mount point has a .backup extension, go to the BU
volume.  Ignore RWs and ROs.  If the BU doesn't exist, fail.
If the mount point is a RW/% mount point, ignore replicas and go to the RW.
If the RW doesn't exist or is unavailable, fail.

If the mount point is a regular/# mount point AND you're in a RO AND there
are no .backup/.readonly extensions AND the next volume is replicated
according to the VLDB, ignore the RW and go to a RO.  If all ROs are
unavailable, fail -- don't fail over to the RW, as it may have changed
since the RO was last released.
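
For completeness, the "explicit" flavors above are just mount points whose
target volume name says what they want.  Roughly (the directory and volume
names here are only placeholders):

    fs mkmount -dir snap -vol user.someone.backup      # always the BU volume
    fs mkmount -dir old  -vol project.readonly         # always a RO, ignore the RW
    fs mkmount -dir work -vol project -rw              # '%' mount point: always the RW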

When trying to figure out why a replica isn't being accessed, it's good to
think like the Cache Manager.  And be rigid.  Just because you know the
volumes exist on a vicep somewhere doesn't mean that the Cache Manager will
read your mind.  The VLDB rules.
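
One handy cross-check: vos examine reports both the volume header from the
fileserver and the VLDB entry, so a disagreement between what's on a vicep and
what the VLDB claims tends to show up in one place:

    vos examine root.cell          # volume header plus the full VLDB entry
    vos listvldb root.cell         # the VLDB's view only
    vos listvol angel -part c      # what's physically on the partition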

Kim


=================================
Kim (Dexter) Kimball
dhk@ccre.com


On Thu, 2004-02-26 at 02:12, Tino Schwarze wrote:
> On Wed, Feb 25, 2004 at 05:33:00PM -0600, James Schmidt wrote:
>
> > I've got my two openafs servers, afs1 and afs2.  Afs1 is the primary.
> > I've created RO volume replicas on AFS2, and 'vos listvldb' shows the
> > correct info, however if I offline afs1, all of the clients time out
> > (including AFS2, which is also a client).
>
> > On The Client:
> > [root@www2 /]# cd /afs
> > [root@www2 afs]# ls -al
> > drwxrwxrwx    2 root     root         2048 Feb 25 14:55 .mydomain.com
> > drwxrwxrwx    2 root     root         2048 Feb 25 14:55 mydomain.com
> > [root@www2 afs]# cd mydomain.com/   <--- this should be the replicated RO volume, correct?
>
> What does "fs lsmount mydomain.com" say?
>
> > I know that since the secondary AFS server, AFS2, should have a copy
> > of the RO volume, I should still be able to CD into this directory and
> > read files, correct?

I had this same problem recently and wondered what was going on.
Digging through the AFS Administrators Guide I found this statement:

"If you are replicating any volumes, you must replicate the root.afs and
root.cell volumes, preferably at two or three sites each (even if your
cell only has two or three file server machines). The Cache Manager
needs to pass through the directories corresponding to the root.afs and
root.cell volumes as it interprets any pathname. The unavailability of
these volumes makes all other volumes unavailable too, even if the file
server machines storing the other volumes are still functioning."

Following these instructions, I did a vos addsite for root.afs and root.cell
on my second server, then a vos release of each, and an fs checkv.
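That is, something like the following (afs2 is the secondary server mentioned
in the thread; the partition name is just a placeholder):

    vos addsite afs2 a root.afs
    vos addsite afs2 a root.cell
    vos release root.afs
    vos release root.cell
    fs checkvolumes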

Now when I cd /afs and do fs whereis mydomain.com both servers show up.
John

_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info