[OpenAFS] RWrite and ROnly root.afs/root.cell on same server causing system crash

Daniel Clark dclark@pobox.com
Wed, 31 May 2006 20:21:42 -0400


I'm setting up a new cell, and I think I've run into something that is
either a bug or a piece of documentation that needs to be clarified
(at http://www.openafs.org/pages/doc/QuickStartUnix/auqbg005.htm#HDRWQ80
steps 5-8).

The behavior is that after creating read-only replicas of root.cell
and root.afs on the same server as the read-write replicas, /afs
becomes inaccessible. At this point you can still use vos remove on
the read-only copies and get back to normal. However, if you reboot,
invoking the rc.afs script causes a system dump / crash / reboot and
also corrupts some server files (for example, local/BosConfig
gets zero-byted). I couldn't figure out how to recover from that (the
utilities all complained about something regarding symbols in /unix
when I tried to use them; I can reproduce that bit of the bug if it's
important, but I assumed it was just because the kernel extensions
were not loaded), so I did a clean reinstall.
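
For anyone trying to reproduce this, a quick way to spot the zero-byted
server files after the crash might look like the sketch below. The
function name zero_byte_files is made up for illustration, and
/usr/afs/local is simply where BosConfig lives in this install.

```shell
# Hypothetical helper: list every empty regular file under a directory.
# "-size 0" matches files that occupy zero blocks, i.e. zero bytes long.
zero_byte_files() {
    find "$1" -type f -size 0
}
```

Something like `zero_byte_files /usr/afs/local` run before and after the
reboot should show whether BosConfig (or anything else) was truncated.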

This is on AIX 5.3 ML4 plus most recent patches as of yesterday, and
OpenAFS 1.4.1 rs_aix53 binaries as distributed from openafs.org.
Hardware is a pSeries 570 DLPAR (virtual machine), 2GB RAM, 1 POWER5
processor, 64 bits.

Below are more details on the problem: a pseudo-workflow of everything
leading up to the problem, and then a demo of the problem. I also have
core/dump files I could provide.

# Make sure /vicepa exists
# Make sure /usr/vice/cache is of type jfs
DIST=/root/afs/afs-1.4.1/rs_aix53
chown -R 0.0 $DIST
umask 022
mkdir /usr/afs
mkdir /usr/vice
mkdir /usr/vice/etc
cd $DIST/root.client/usr/vice/etc
cp -rp dkload /usr/vice/etc
cp -p dkload/rc.afs /etc/rc.afs
vi /etc/rc.afs
chmod 755 /etc/rc.afs
/etc/rc.afs
cd $DIST/root.server/usr/afs
cp -rp * /usr/afs
/usr/afs/bin/bosserver -noauth &
cd /usr/afs/bin
MACH=afsdb1.dclark.us
CELL=dclark.us
./bos setcellname $MACH $CELL -noauth
./bos listhosts $MACH -noauth
./bos create $MACH kaserver simple /usr/afs/bin/kaserver -cell $CELL -noauth
./bos create $MACH buserver simple /usr/afs/bin/buserver -cell $CELL -noauth
./bos create $MACH ptserver simple /usr/afs/bin/ptserver -cell $CELL -noauth
./bos create $MACH vlserver simple /usr/afs/bin/vlserver -cell $CELL -noauth
printf "create afs ; create admin ; setfields admin -flags admin ; quit\n"
./kas -cell $CELL -noauth
./bos adduser $MACH admin -cell $CELL -noauth
./bos addkey $MACH -kvno 0 -cell $CELL -noauth
./bos listkeys $MACH -cell $CELL -noauth
./pts createuser -name admin -cell $CELL -noauth
./pts adduser admin system:administrators -cell $CELL -noauth
./pts membership admin -cell $CELL -noauth
./bos restart $MACH -all -cell $CELL -noauth
./bos create $MACH fs fs /usr/afs/bin/fileserver /usr/afs/bin/volserver \
/usr/afs/bin/salvager -cell $CELL -noauth
./bos status $MACH fs -long -noauth
./vos create $MACH vicepa root.afs -cell $CELL -noauth
./bos create $MACH upserver simple \
"/usr/afs/bin/upserver -crypt /usr/afs/etc -clear /usr/afs/bin" -cell $CELL -noauth

##########################

cd $DIST/root.client/usr/vice/etc
cp -p * /usr/vice/etc
cp -rp C /usr/vice/etc
cd /usr/vice/etc
vi CellServDB
echo "/afs:/usr/vice/cache:50000" > /usr/vice/etc/cacheinfo
mkdir /afs
printf "afs     4     none     none # Needs to be in /etc/vfs"
grep afs /etc/vfs
vi /etc/rc.afs
/usr/afs/bin/bos shutdown $MACH -wait -noauth
ps auxw | grep bosserver
cd /
shutdown -r now

##########################

/etc/rc.afs
/usr/afs/bin/klog admin
/usr/afs/bin/tokens
/usr/afs/bin/bos status $MACH
cd /
/usr/afs/bin/fs checkvolumes
/usr/afs/bin/fs setacl /afs system:anyuser rl
/usr/afs/bin/vos create $MACH vicepa root.cell
/usr/afs/bin/fs mkmount /afs/$CELL root.cell
/usr/afs/bin/fs setacl /afs/$CELL system:anyuser rl
cd /usr/afs/bin
./fs mkmount /afs/.${CELL} root.cell -rw
./vos addsite $MACH vicepa root.afs
./vos addsite $MACH vicepa root.cell
./fs examine /afs
./fs examine /afs/$CELL

###################
# - Fine at this point - #
###################
./vos release root.afs
./vos release root.cell
./fs checkvolumes

bash-3.00# ./fs checkvolumes
All volumeID/name mappings checked.

######################
# - Broken at this point - #
#####################

./fs examine /afs
./fs examine /afs/$CELL

bash-3.00# ./fs examine /afs
fs: File '/afs' doesn't exist
bash-3.00# ./fs examine /afs
fs: File '/afs' doesn't exist

bash-3.00# cd /afs
bash: cd: /afs: A file or directory in the path name does not exist.
bash-3.00# ls -l / | grep afs
ls: 0653-341 The file /afs does not exist.

bash-3.00# vos listvldb
VLDB entries for all servers

root.afs
    RWrite: 536870912     ROnly: 536870913
    number of sites -> 2
       server tiv570test.dclark.us partition /vicepa RW Site
       server tiv570test.dclark.us partition /vicepa RO Site

root.cell
    RWrite: 536870915     ROnly: 536870916
    number of sites -> 2
       server tiv570test.dclark.us partition /vicepa RW Site
       server tiv570test.dclark.us partition /vicepa RO Site

Total entries: 2
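
To pick out which volumes are in the configuration that triggers this
(RW and RO site on the same server and partition), the listing above can
be filtered with something like the following. This is only a sketch: the
function name same_site_volumes is invented here, and it assumes the
`vos listvldb` layout shown above (an unindented volume name followed by
indented "server ... partition ... RW|RO Site" lines).

```shell
# Hypothetical helper: read `vos listvldb` output on stdin and print the
# name of each volume that has an RW site and an RO site on the same
# server and partition.
same_site_volumes() {
    awk '
        # An unindented one-word line is taken to be a volume name.
        /^[^ \t]/ && NF == 1 { vol = $1; next }
        # Site lines look like: server <host> partition <part> RW|RO Site
        $1 == "server" && $3 == "partition" && vol != "" {
            key = vol SUBSEP $2 SUBSEP $4
            if ($5 == "RW") rw[key] = 1
            if ($5 == "RO") ro[key] = 1
            if ((key in rw) && (key in ro) && !(vol in seen)) {
                seen[vol] = 1
                print vol
            }
        }
    '
}
```

Usage would be along the lines of `vos listvldb | same_site_volumes`.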

bash-3.00# vos remove $MACH vicepa -id 536870913 -cell $CELL
Volume 536870913 on partition /vicepa server tiv570test.dclark.us deleted
bash-3.00# vos remove $MACH vicepa -id 536870916 -cell $CELL
Volume 536870916 on partition /vicepa server tiv570test.dclark.us deleted

bash-3.00# fs examine /afs
File /afs (536870912.1.1) contained in volume 536870912
Volume status for vid = 536870912 named root.afs
Current disk quota is 5000
Current blocks used are 4
The partition has 426704764 blocks available out of 426770432
bash-3.00# fs examine /afs/$CELL
File /afs/notesdev.ibm.com (536870915.1.1) contained in volume 536870915
Volume status for vid = 536870915 named root.cell
Current disk quota is 5000
Current blocks used are 2
The partition has 426704764 blocks available out of 426770432

Things work again now...

Here is the output of errpt -a:

---------------------------------------------------------------------------
LABEL:          CORE_DUMP
IDENTIFIER:     A63BEB70

Date/Time:       Wed May 31 18:09:54 EDT 2006
Sequence Number: 145
Machine Id:      00CBDEEA4C00
Node Id:         tiv570test
Class:           S
Type:            PERM
Resource Name:   SYSPROC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
          11
USER'S PROCESS ID:
                311342
FILE SYSTEM SERIAL NUMBER
           2
INODE NUMBER
       65696
PROCESSOR ID
           0
CORE FILE NAME
/usr/afs/bin/core
PROGRAM NAME
bos
STACK EXECUTION DISABLED
           0
ADDITIONAL INFORMATION
??
??
??
Unable to generate symptom string.
---------------------------------------------------------------------------

---------------------------------------------------------------------------
LABEL:          DUMP_STATS
IDENTIFIER:     67145A39

Date/Time:       Wed May 31 17:46:49 EDT 2006
Sequence Number: 142
Machine Id:      00CBDEEA4C00
Node Id:         tiv570test
Class:           S
Type:            UNKN
Resource Name:   SYSDUMP

Description
SYSTEM DUMP

Probable Causes
UNEXPECTED SYSTEM HALT

User Causes
SYSTEM DUMP REQUESTED BY USER

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
UNEXPECTED SYSTEM HALT

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
DUMP DEVICE
/dev/lg_dumplv
DUMP SIZE
              32018432
TIME
Wed May 31 17:45:15 2006
DUMP TYPE (1 = PRIMARY, 2 = SECONDARY)
           0
DUMP STATUS
           0
ERROR CODE
           0
DUMP INTEGRITY
Compressed dump - Run dmpfmt with -c flag on dump after uncompressing.
FILE NAME

PROCESSOR ID
           0
---------------------------------------------------------------------------
LABEL:          MINIDUMP_LOG
IDENTIFIER:     F48137AC

Date/Time:       Wed May 31 17:46:33 EDT 2006
Sequence Number: 141
Machine Id:      00CBDEEA4C00
Node Id:         tiv570test
Class:           O
Type:            UNKN
Resource Name:   minidump

Description
COMPRESSED MINIMAL DUMP

Probable Causes
System dumped. Minimal Dump collected in Non-Volatile Memory.

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
Minidump Data:
4D32 039B 082E 0010 0035 003B 0030 0058 0000 0000 01E8 9000 0000 0000 13EF C1BA
0000 0003 4000 0000 447E 0E95 1737 A43F 0165 6E64 0074 6100 0041 0A31 8C72 3771
CE00 04C6 C400 6563 DDBA 0108 0040 0180 0944 FC38 5803 A0A1 C387 101B 3600 E000
C68C 1630 7060 84D1 8206 0B18 005E 9029 63E7 059B 335F C8D4 6903 878D 1D65 6345
A4C0 8801 2000 002C 44FC 3858 03A0 218E 8610 234A 6CD8 6980 9726 53B2 4CB1 D2A4
0988 8920 232A 0809 8040 4893 130D 0165 6300 4004 0863 66CE 9736 6FEA B8A1 B322
4690 2100 2246 2490 030B 1600 8686 3C69 0205 4080 1104 2402 A088 0504 3C89 2FE8
C901 2492 800E 6AA8 444A 0C20 B3A6 4814 F13C 02D8 E08C 151F 9B40 4306 9D06 69CB
301D E072 D284 7000 0C90 9639 B2F0 53BA 1318 1BA0 0A44 BE90 C782 9856 AE27 814A
F4D6 CAD7 3B91 64CD D60C 8041 2488 804F 25BE 0541 C32D 0052 98EC 7641 22F2 9FDF
7F54 31B0 0B81 4D64 001F 84C2 463C BC00 4DDF BF54 3740 33C1 52A2 4B54 E82A 4B84
244B 6681 012C 4492 4041 4484 5D00 1000 C411 40D0 2042 850C 63DC 8021 9324 073B
050F 265C B8E2 06C4 9632 32A0 C3ED 7A77 6C1C 3579 6082 36E7 35EF 1846 68CB 0412
0294 1DE7 0C65 14F9 2D12 480A CD12 6FE0 0B50 A000 00C5 232B 5E14 2901 D295 9144
8DF2 5051 1941 6A00 03C4 4A44 8000 BC7E A0ED BD47 407C C3FC 504F 6520 6C20 9D48
FCF9 3751 5036 1922 5380 13BA 075F 51C3 0821 4B65 286C 805E 449D 0CE0 4513 150A
48E0 109E D4D7 0104 0DC8 14E2 8825 5E68 9411 24A8 C854 4D14 0E58 9431 2698 2052
0720 2490 DF7F 86CD 6000 9101 1CD2 0B24 8625 C101 0E91 D521 1491 8B29 9652 1C0B
8AF4 6144 0BD8 4413 9511 5949 0F96 5186 24A6 1D32 0510 402A C018 7688 2F00 44F6
0838 71EA B401 2474 F023 661A 0B3E 2091 0060 06BA 9897 321D 0043 3D67 8A95 A84D
4346 1417 00FE 41D0 594D 0888 2682 2042 9C56 A948 1498 C224 A7A4 7001 C010 8D8A
1540 0918 64E9 A508 2888 2044 6493 0102 2B65 3299 F0C0 7D13 E500 0B3A A396 BA15
0BF0 44D6 131F 810D 56D8 4C88 A1C7 9863 12FD F58F 9874 BC05 14A9 C512 26AC 4F76
8180 1CB2 8919 E643 638F F935 AB83 1251 6BE7 B07A 1191 EDB6 110D B08B 20E0 05D0
4332 78D8 04C2 B836 8544 AA5E 7CC9 15C0 1F31 C915 10BB 0101 1046 47FF 49CB 6C44
A961 4088 A907 7C34 A54C 0FFC 35CF 3FFD B439 9393 04FF 1740 6521 2452 012F 1311
280B 2582 1A16 402E 1ACB 44EE 97DD D549 D305 D848 3B56 5967 DDAC 56CA 1215 A00F
1C35 D180 8C61 48A8 818B 610F E0A1 AA44 54D1 0CC7 C401 54D0 8BCC A829 1041 9A23
FB83 6400 C530 A272 311D 0340 039D 33AD 0C8C 7F90 C800 1E07 5F64 2902 088F DA24
8400 7784 9106 1D70 C8F1 C618 2BD0 105D 4D04 7042 4EDD 77E7 BD77 DFBE 01CE 091C
868F 51C6 4272 E4D1 D0D2 13C4 420F 79E6 6D09 800C 3665 4525 3C1F C8E8 CB1E 7070
785F 0C32 A590 031F 6873 2E53 9722 5195 5344 A044 A0C7 7A5A 86AE E316 BEF8 710B
8736 D726 3850 8BF2 2C11 E832 1AE3 8134 A647 64DE 4C47 9453 AF4C C8EF 6E4C 0857
4564 5FF0 4456 1F9F 3122 6C51 1F04 4B53 B9AF CE39 4794
---------------------------------------------------------------------------
LABEL:          DSI_PROC
IDENTIFIER:     9D035E4D

Date/Time:       Wed May 31 17:45:15 EDT 2006
Sequence Number: 140
Machine Id:      00CBDEEA4C00
Node Id:         tiv570test
Class:           S
Type:            PERM
Resource Name:   SYSVMM

Description
DATA STORAGE INTERRUPT, PROCESSOR

Probable Causes
SOFTWARE PROGRAM

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
DATA STORAGE INTERRUPT STATUS REGISTER
0000 0000 0000 0000
SEGMENT REGISTER, SEGREG
0A00 0000 0000 0000
DATA STORAGE INTERRUPT ADDRESS REGISTER
0000 0400 0000 0000
EXVAL
0000 0004 0000 0000
---------------------------------------------------------------------------
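
Since errpt -a prints newest entries first, the order of events is easier
to follow when the entries are reduced to one line each and sorted by
sequence number. A rough sketch, assuming the "LABEL:", "Date/Time:" and
"Sequence Number:" field layout shown above (errpt_summary is a made-up
name):

```shell
# Hypothetical helper: read `errpt -a` output on stdin and print
# "<sequence> <label> <date/time>" per entry, oldest entry first.
errpt_summary() {
    awk '
        $1 == "LABEL:"     { label = $2 }
        $1 == "Date/Time:" { sub(/^Date\/Time:[ \t]*/, ""); when = $0 }
        # Sequence Number follows Date/Time, so the entry is complete here.
        $1 == "Sequence" && $2 == "Number:" { print $3, label, when }
    ' | sort -n
}
```

Run as `errpt -a | errpt_summary`. For the entries above, this puts the
DSI_PROC entry (sequence 140, 17:45:15) first and the bos CORE_DUMP
(sequence 145, 18:09:54) last.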