[OpenAFS] VLDB problem - Duplicate entries

Todd DeSantis atd@us.ibm.com
Wed, 17 Sep 2008 12:15:58 -0400



Hi William :

If your VLDB is still in this state, you can create a
dummy cell on a client machine and then use the

      vos syncvldb <fileserver in real cell> -cell <dummy cell>

to rebuild the VLDB for the main cell.

Once you hit all of the fileservers, you can then place
this VLDB from the dummy cell onto the DB server machines
for your real cell.

The VLDB database does not contain any cell-name-specific
information, so creating it in a dummy cell is not
a problem.

As long as you are not making changes to the VLDB while
you are rebuilding the VLDB, this will work.

=====
=====

This is what I did on an AFS client machine only.

- Created /usr/afs/[etc,db,local] directories
- Created /usr/afs/etc/CellServDB and ThisCell files

gyro# more CellServDB
>dummycell.com       #Cell name
158.43.11.175        #gyro.dummycell.com

gyro# more ThisCell
dummycell.com

- Added the cell to the client CellServDB file and ran
  fs newcell

  fs newcell dummycell.com gyro.dummycell.com
  ** use your client name for the DB server entry

- Placed the new cell in NoAuth mode
  touch /usr/afs/local/NoAuth

- Started the vlserver on this machine
  /usr/afs/bin/vlserver &

- Got tokens in the main cell that I wanted to duplicate the
  VLDB for

- Ran the "vos syncvldb" command.

  /usr/afs/bin/vos syncvldb pork -cell dummycell.com -noauth -verb

My test cell is dummycell.com and pork is a fileserver in the main
cell.  This command created the new entries in the VLDB for
the dummycell.com cell.
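
The file-creation part of the steps above could be scripted roughly as
follows. This is only a sketch: the variables AFS_ROOT, CELL, HOST and ADDR
and the function name are illustrative, with defaults taken from the example
in this message. The remaining steps (fs newcell, starting the vlserver) are
left as comments because they depend on the local client setup.

```shell
#!/bin/sh
# Sketch: create the directories and config files for a throwaway cell.
# AFS_ROOT, CELL, HOST and ADDR are illustrative names (assumptions);
# substitute your own client's host name and address.
AFS_ROOT=${AFS_ROOT:-/usr/afs}
CELL=${CELL:-dummycell.com}
HOST=${HOST:-gyro.dummycell.com}
ADDR=${ADDR:-158.43.11.175}

setup_dummy_cell() {
    # Server-side directories the vlserver expects
    mkdir -p "$AFS_ROOT/etc" "$AFS_ROOT/db" "$AFS_ROOT/local" || return 1
    # Server CellServDB: cell line, then one DB server line
    printf '>%s       #Cell name\n%s        #%s\n' \
        "$CELL" "$ADDR" "$HOST" > "$AFS_ROOT/etc/CellServDB" || return 1
    printf '%s\n' "$CELL" > "$AFS_ROOT/etc/ThisCell" || return 1
    # NoAuth mode for the dummy cell
    touch "$AFS_ROOT/local/NoAuth"
}

# Afterwards, as described in the steps above:
#   fs newcell dummycell.com gyro.dummycell.com
#   /usr/afs/bin/vlserver &
```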

Running this against all of your fileservers will probably take some time
because of the number of volumes in your cell.  But as long as no VLDB
updates are going on while you're building this new VLDB, it will be
current with your site when it is finished.
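
To hit every fileserver, the same command can be wrapped in a loop. The
sketch below assumes you supply the list of fileserver names yourself. The
VOS variable and the sync_all function are illustrative names, and the
second server name in the usage line is a placeholder.

```shell
#!/bin/sh
# Sketch: run "vos syncvldb" against each fileserver in turn, writing the
# results into the dummy cell's VLDB. Stops at the first failure.
# VOS is an illustrative override so the loop can be dry-run with "echo vos".
VOS=${VOS:-/usr/afs/bin/vos}

sync_all() {
    cell=$1
    shift
    for fs in "$@"; do
        # Same flags as the single-server command shown in this message
        $VOS syncvldb "$fs" -cell "$cell" -noauth -verb || {
            echo "syncvldb failed for $fs" >&2
            return 1
        }
    done
}

# Usage (server names are placeholders for your own fileservers):
#   sync_all dummycell.com pork otherfs
```

Setting VOS="echo vos" before calling sync_all prints the commands without
running them, which is a cheap way to check the server list first.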

 - Then you can get ready to place this VLDB onto your main
   Database server machines.

   You should stop the vlservers on your main DB servers.
   Save a copy of your current, corrupted VLDB
   Copy the new VLDB into the /usr/afs/db dir
   Restart the vlservers

   And the new VLDB should be OK.

   If you get everything in place before you stop the
   vlservers, you should be able to stop the vlservers, copy
   the new VLDB and restart the vlservers before anything
   times out, so there should be no downtime.
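
One possible way to script the swap on a single DB server is sketched below.
It assumes the vlserver runs under the bosserver and is stopped and started
with "bos shutdown" and "bos startup" using -localauth (so it must run as
root on the server), and that the database files in /usr/afs/db match
vldb.*. Neither detail is spelled out above, so treat both as assumptions.
The function name, the DB_DIR variable and the directory arguments are
illustrative.

```shell
#!/bin/sh
# Sketch: stop the vlserver, save the old VLDB files, copy in the rebuilt
# ones, restart. Each step runs only if the previous one succeeded, so a
# failed copy leaves the vlserver stopped for manual inspection.
# DB_DIR is an illustrative override of the database directory.
DB_DIR=${DB_DIR:-/usr/afs/db}

swap_vldb() {
    server=$1     # DB server machine to operate on
    new_dir=$2    # staging directory holding the rebuilt vldb.* files
    save_dir=$3   # existing directory that receives a copy of the old files

    bos shutdown "$server" vlserver -wait -localauth &&
    cp -p "$DB_DIR"/vldb.* "$save_dir"/ &&
    cp -p "$new_dir"/vldb.* "$DB_DIR"/ &&
    bos startup "$server" vlserver -localauth
}

# Usage (names are placeholders):
#   swap_vldb dbserver1 /var/tmp/newvldb /var/tmp/oldvldb
```

Staging the new files on each DB server beforehand keeps the window between
shutdown and startup short.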

Thanks

Todd DeSantis



William Setzer <William_Setzer@ncsu.edu>
Sent by: openafs-info-admin@openafs.org
09/12/2008 04:50 PM
Please respond to William_Setzer@ncsu.edu

To: openafs-info@openafs.org
cc:
Subject: [OpenAFS] VLDB problem - Duplicate entries




[ If you see this twice, I apologize.  I sent it to an old address
  without noticing, so I hope it got eaten. ]

We've been investigating why our "vos backupsys" processes have been
hanging, and have discovered something disturbing.  Upon dumping out
our VLDB via "vos listvldb > foo" it appears our VLDB has been
corrupted.  We're seeing two entries for a significant percentage
(1/4) of our volumes:

    adm.db
        RWrite: 536899559     Backup: 536899561
        number of sites -> 1
           server A.ncsu.edu partition /vicepa RW Site

    adm.db
        RWrite: 536899559     Backup: 536899561
        number of sites -> 1
           server A.ncsu.edu partition /vicepa RW Site


Right now, it's never more than two instances per volume, and
sometimes they point to the same server, sometimes they point to
different servers.
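
A saved dump like the one above can be scanned for repeated volume names
with a short awk filter. This is a rough sketch that assumes each entry's
name is the non-blank line directly above a line beginning with RWrite:,
ROnly: or Backup:, as in the excerpt. The function name is illustrative.

```shell
#!/bin/sh
# Sketch: list volume names that occur more than once in a saved
# "vos listvldb" dump, each prefixed by its number of occurrences.
find_dup_entries() {
    awk '
        # An ID line marks the previous non-blank line as a volume name
        /^[ \t]*(RWrite|ROnly|Backup):/ { if (prev != "") count[prev]++ }
        # Remember the first field of every non-blank line
        NF { prev = $1 }
        END {
            for (name in count)
                if (count[name] > 1)
                    print count[name], name
        }
    ' "$1" | sort -k2
}

# Usage:
#   vos listvldb > foo
#   find_dup_entries foo
```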

Our first thought is to do a "vos syncvldb"/"vos syncserv", but we
don't know if this will fix the problem, particularly in the case of
duplicate entries pointing to the same place.  Our second thought is
to do it after zeroing out the VLDB, but the downtime we'd suffer
isn't very appealing. :)   Our third thought is that we might have a
more serious corruption, since we had a problem with our VLDB several
months ago (which we thought we had fixed).

Right now, everything appears to be working "normally", excepting the
"vos backupsys" being very cranky about a large number of non-existent
volumes, but clearly something needs to be done and we're pretty much
out of our depth.

Our current OpenAFS version is 1.2.13, but our upgrade path to 1.4.7
was in progress when interrupted by this problem.  (We were starting
with file servers, so the databases are still at 1.2.13.)

So what do you think would be the safest and/or best course of action
to take?  Thanks in advance for your advice.


William Setzer
Systems & Hosted Services
Office of Information Technology
NC State University
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info