[OpenAFS] DB quorum problems on 1.4/1.6 mixed cell

Michael Garrison mcgarr@umich.edu
Thu, 13 Feb 2014 15:32:45 -0500


At UMich we ran into the same issue a while ago, and I wound up
backporting the patch to 1.4. Since then, we haven't had the negative
transaction ID issue crop up again.
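
In case it helps anyone checking their own dbservers, here is a rough
Python sketch for spotting the symptom in udebug output. It assumes the
counter part of the write transaction ID is a signed 32-bit value, which
the negative number in the udebug output quoted below suggests. The
function names are made up for illustration and are not part of OpenAFS;
the regular expression is based only on the "write trans" line shown in
this thread.

```python
import re

# Matches the udebug line quoted in this thread, e.g.
#   I am currently managing write trans 1392106724.-1892301558
WRITE_TRANS = re.compile(r"write trans (\d+)\.(-?\d+)")

def as_unsigned32(counter):
    """Reinterpret a signed 32-bit value as unsigned (assumed counter width)."""
    return counter & 0xFFFFFFFF

def negative_write_trans(udebug_text):
    """Return (epoch, counter) pairs whose counter part is negative."""
    hits = []
    for m in WRITE_TRANS.finditer(udebug_text):
        epoch, counter = int(m.group(1)), int(m.group(2))
        if counter < 0:
            hits.append((epoch, counter))
    return hits

sample = "I am currently managing write trans 1392106724.-1892301558"
print(negative_write_trans(sample))
print(as_unsigned32(-1892301558))
```

Save the output of udebug against each dbserver to a file and pass its
contents to negative_write_trans(); a non-empty result means the counter
has gone negative and the servers are due for a restart or the patch.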

--
Mike Garrison

On Thu, Feb 13, 2014 at 10:29 AM, Arne Wiebalck <Arne.Wiebalck@cern.ch> wrote:
> Just to confirm: I restarted all VL servers in our cell, the transaction IDs
> were reset and so far I wasn't able to reproduce the problem. So it seems
> that it was indeed the negative transaction ID problem in 1.4 VL servers
> mentioned earlier in this thread.
>
> Cheers,
>  Arne
>
>
> On Feb 11, 2014, at 6:25 PM, Arne Wiebalck <Arne.Wiebalck@cern.ch> wrote:
>
> Thanks Andrew and Derrick!
>
> We've seen the "major synchronisation error" as well when trying to provoke
> that problem. This, and the fact that we had the very same quorum issue
> about one year ago, when restarting the VLDB servers made the problem go
> away for some time, seem to indicate it's indeed the issue you mention.
> That was when we first added 1.6 servers to our cell, btw.
> Apparently, we're pretty lucky ;)
>
> I'll restart our VLDB servers ...
>
> Thanks!
>  Arne
>
>
> On Feb 11, 2014, at 5:48 PM, D Brashear <shadow@gmail.com> wrote:
>
> The 1.4/1.6 issue is surely a red herring. You hit the nail when you
> mentioned negative transaction IDs. There was a bugfix early in the 1.6
> series which handled that; you probably want to just restart all your
> dbservers so you can start counting up to rollover again, until you get to
> the point of updating them.
>
>
> On Tue, Feb 11, 2014 at 4:50 AM, Arne Wiebalck <Arne.Wiebalck@cern.ch> wrote:
>>
>> Hi,
>>
>> We've recently added some 1.6.6 servers to our cell, which is mainly on
>> 1.4.15 (i.e. most of the file servers and the DB servers). We now encounter
>> quorum problems with our VLDB servers.
>>
>> The primary symptom is that releases fail with "u: no quorum elected".
>>
>> VLLog on the sync site shows at that moment:
>> -->
>> Tue Feb 11 09:51:47 2014 ubik:server 137.138.246.51 is back up: will be contacted through 137.138.246.51
>> Tue Feb 11 09:51:47 2014 ubik:server 137.138.246.50 is back up: will be contacted through 137.138.246.50
>> <--
>> where 137.138.246.50 and 137.138.246.51 are the non-sync sites.
>>
>> We can relatively easily trigger this problem by moving volumes between
>> 1.4 and 1.6 based servers (1.4/1.4 and 1.6/1.6 transfers seem OK): the move
>> gets stuck and udebug shows
>>
>> -->
>> Host's addresses are: 137.138.128.148
>> Host's 137.138.128.148 time is Tue Feb 11 09:38:02 2014
>> Local time is Tue Feb 11 09:38:02 2014 (time differential 0 secs)
>> Last yes vote for 137.138.128.148 was 5 secs ago (sync site);
>> Last vote started 5 secs ago (at Tue Feb 11 09:37:57 2014)
>> Local db version is 1392106724.154
>> I am sync site until 54 secs from now (at Tue Feb 11 09:38:56 2014) (3 servers)
>> Recovery state f
>> I am currently managing write trans 1392106724.-1892301558
>> Sync site's db version is 1392106724.154
>> 0 locked pages, 0 of them for write
>> There are write locks held
>> Last time a new db version was labelled was:
>>          1158 secs ago (at Tue Feb 11 09:18:44 2014)
>>
>> Server (137.138.246.51): (db 1392106724.153)
>>     last vote rcvd 20 secs ago (at Tue Feb 11 09:37:42 2014),
>>     last beacon sent 20 secs ago (at Tue Feb 11 09:37:42 2014), last vote was yes
>>     dbcurrent=0, up=0 beaconSince=0
>>
>> Server (137.138.246.50): (db 1392106724.154)
>>     last vote rcvd 6 secs ago (at Tue Feb 11 09:37:56 2014),
>>     last beacon sent 5 secs ago (at Tue Feb 11 09:37:57 2014), last vote was yes
>>     dbcurrent=1, up=1 beaconSince=1
>> <--
>>
>> Note that the sync site has gone to Recovery state f and that the time at
>> which the last vote was received on the other two servers has quite a gap,
>> which gets larger with time. Is the negative trans ID OK?
>>
>> At some point the sync site loses its sync site state:
>>
>> -->
>> Host's addresses are: 137.138.128.148
>> Host's 137.138.128.148 time is Tue Feb 11 09:41:01 2014
>> Local time is Tue Feb 11 09:41:02 2014 (time differential 1 secs)
>> Last yes vote for 137.138.128.148 was 4 secs ago (sync site);
>> Last vote started 4 secs ago (at Tue Feb 11 09:40:58 2014)
>> Local db version is 1392106724.206
>> I am not sync site
>> Lowest host 137.138.128.148 was set 4 secs ago
>> Sync host 137.138.128.148 was set 4 secs ago
>> I am currently managing write trans 1392106724.-1892283756
>> Sync site's db version is 1392106724.206
>> 0 locked pages, 0 of them for write
>> There are write locks held
>> Last time a new db version was labelled was:
>>          1337 secs ago (at Tue Feb 11 09:18:45 2014)
>> <--
>>
>> so there is no sync site any longer and the vos command gets a no quorum
>> error.
>>
>> As this also happens when we do not move volumes around (like at 3am), but
>> other operations such as the backup touch the volumes, I would suspect that
>> VLDB operations in general can trigger this.
>>
>> Is this a known issue?
>>
>> I had understood that it should be OK to run 1.4 and 1.6 file servers in
>> parallel and that the DB servers could be updated after the file servers,
>> but maybe that is not correct?
>>
>> Thanks!
>>  Arne
>>
>>
>> --
>> Arne Wiebalck
>> CERN IT
>>
>
>
>
> --
> D
>
>
>