[OpenAFS] DB quorum problems on 1.4/1.6 mixed cell
D Brashear
shadow@gmail.com
Tue, 11 Feb 2014 11:48:08 -0500
The 1.4/1.6 issue is surely a red herring. You hit the nail on the head when
you mentioned negative transaction IDs. There was a bugfix early in the 1.6
series which handled that. Until you get to the point of updating your
dbservers, you probably want to just restart all of them so the transaction
counter starts counting up to rollover again.
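
As a rough illustration of why the ID shows up negative, here is a minimal Python sketch. It assumes the counter half of the transaction ID is held in a signed 32-bit integer (an assumption on my part for illustration, not something taken from the OpenAFS source), and the helper names are hypothetical. It only parses the "write trans" line as printed by udebug in the quoted report below and shows the unsigned value the counter would have had before wrapping.

```python
import re

INT32_MOD = 1 << 32  # number of distinct values in a 32-bit integer

def as_unsigned32(value: int) -> int:
    """Reinterpret a (possibly negative) 32-bit value as unsigned."""
    return value % INT32_MOD

def as_signed32(value: int) -> int:
    """Reinterpret a 32-bit value as signed two's complement."""
    value %= INT32_MOD
    return value - INT32_MOD if value >= (1 << 31) else value

# Matches the udebug line "... write trans <epoch>.<counter>"
TRANS_RE = re.compile(r"write trans (\d+)\.(-?\d+)")

def parse_write_trans(line: str):
    """Return (epoch, counter) from a udebug 'write trans' line, or None."""
    match = TRANS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

line = "I am currently managing write trans 1392106724.-1892301558"
epoch, counter = parse_write_trans(line)
print(counter < 0)             # True: the counter has wrapped past 2**31 - 1
print(as_unsigned32(counter))  # 2402665738
```

Under that assumption, a negative counter simply means more than 2**31 - 1 transactions have been counted since the dbserver started, which is why a restart resets the clock on the problem.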
On Tue, Feb 11, 2014 at 4:50 AM, Arne Wiebalck <Arne.Wiebalck@cern.ch> wrote:
> Hi,
>
> We've recently added some 1.6.6 servers into our cell which is mainly on
> 1.4.15 (i.e. most of the file servers and the DB servers). We now
> encounter quorum problems with our VLDB servers.
>
> The primary symptom is that releases fail with "u: no quorum elected".
>
> VLLog on the sync site shows at that moment:
> -->
> Tue Feb 11 09:51:47 2014 ubik:server 137.138.246.51 is back up: will be contacted through 137.138.246.51
> Tue Feb 11 09:51:47 2014 ubik:server 137.138.246.50 is back up: will be contacted through 137.138.246.50
> <--
> where 137.138.246.50 and 137.138.246.51 are the non-sync sites.
>
> We can trigger this problem relatively easily by moving volumes between
> 1.4 and 1.6 based servers (1.4/1.4 and 1.6/1.6 transfers seem OK): the
> move gets stuck and udebug shows
>
> -->
> Host's addresses are: 137.138.128.148
> Host's 137.138.128.148 time is Tue Feb 11 09:38:02 2014
> Local time is Tue Feb 11 09:38:02 2014 (time differential 0 secs)
> Last yes vote for 137.138.128.148 was 5 secs ago (sync site);
> Last vote started 5 secs ago (at Tue Feb 11 09:37:57 2014)
> Local db version is 1392106724.154
> I am sync site until 54 secs from now (at Tue Feb 11 09:38:56 2014) (3 servers)
> Recovery state f
> I am currently managing write trans 1392106724.-1892301558
> Sync site's db version is 1392106724.154
> 0 locked pages, 0 of them for write
> There are write locks held
> Last time a new db version was labelled was:
> 1158 secs ago (at Tue Feb 11 09:18:44 2014)
>
> Server (137.138.246.51): (db 1392106724.153)
> last vote rcvd 20 secs ago (at Tue Feb 11 09:37:42 2014),
> last beacon sent 20 secs ago (at Tue Feb 11 09:37:42 2014), last vote was yes
> dbcurrent=0, up=0 beaconSince=0
>
> Server (137.138.246.50): (db 1392106724.154)
> last vote rcvd 6 secs ago (at Tue Feb 11 09:37:56 2014),
> last beacon sent 5 secs ago (at Tue Feb 11 09:37:57 2014), last vote was yes
> dbcurrent=1, up=1 beaconSince=1
> <--
>
> Note that the sync site has gone to Recovery state f and that the times at
> which the last vote was received from the other two servers differ
> noticeably, a gap which gets larger with time.
> Is the negative trans ID ok?
>
> At some point the sync site loses its sync site state:
>
> -->
> Host's addresses are: 137.138.128.148
> Host's 137.138.128.148 time is Tue Feb 11 09:41:01 2014
> Local time is Tue Feb 11 09:41:02 2014 (time differential 1 secs)
> Last yes vote for 137.138.128.148 was 4 secs ago (sync site);
> Last vote started 4 secs ago (at Tue Feb 11 09:40:58 2014)
> Local db version is 1392106724.206
> I am not sync site
> Lowest host 137.138.128.148 was set 4 secs ago
> Sync host 137.138.128.148 was set 4 secs ago
> I am currently managing write trans 1392106724.-1892283756
> Sync site's db version is 1392106724.206
> 0 locked pages, 0 of them for write
> There are write locks held
> Last time a new db version was labelled was:
> 1337 secs ago (at Tue Feb 11 09:18:45 2014)
> <--
>
> so there is no sync site any longer and the vos command gets a no quorum
> error.
>
> As this also happens when we do not move volumes around (like at 3am), but
> other operations such as the backup touch the volumes, I would suspect
> that VLDB operations in general can trigger this.
>
> Is this a known issue?
>
> I had understood that it should be OK to run 1.4 and 1.6 file servers in
> parallel and that the DB servers could be updated after the file servers,
> but maybe that is not correct?
>
> Thanks!
> Arne
>
>
> --
> Arne Wiebalck
> CERN IT
>
>
--
D