[OpenAFS] Odd ubik (?) synchronization problem

Russ Allbery rra@stanford.edu
Mon, 26 Apr 2004 15:26:08 -0700


Derrick J Brashear <shadow@dementia.org> writes:
> On Mon, 26 Apr 2004, Russ Allbery wrote:

>> The Ubik negotiation is all happening over UDP, though, right?  Maybe
>> we're getting some sort of packet loss.

> New enough tcpdump will show some state on it.

Here's the tcpdump data from the slave around the point at which things
went south.  Note the abort on the disk-releaselocks function with error
code 5381:

afsdb2:/usr/afs/bin# /usr/afsws/bin/translate_et 5381
5381 (u).5 = major synchronization error

15:18:37.772839 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-commit tid 1083017822.-1991030020 (40) (DF)
15:18:37.813234 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call vote-beacon syncsite yes votestart 2004/04/26 15:18:37 dbversion 1083017822.195 tid [|ubik] (56) (DF)
15:18:37.816614 171.64.7.234.7003 > 171.64.7.222.7003:  rx data vldb ubik reply disk-commit (28) (DF)
15:18:37.817026 171.64.7.234.7003 > 171.64.7.222.7003:  rx abort vldb ubik reply vote-beacon vote yes until 2004/04/26 15:18:37 (32) (DF)
15:18:37.817879 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-releaselocks tid 1083017822.-1991030020 (40) (DF)
15:18:37.820501 171.64.7.234.7003 > 171.64.7.222.7003:  rx abort vldb ubik reply disk-releaselocks errcode 5381 (32) (DF)
15:18:45.003439 171.64.7.222.7004 > 171.64.7.234.7004:  rx data kauth call op#133768262 (68) (DF)
15:18:45.003832 171.64.7.234.7004 > 171.64.7.222.7004:  rx abort kauth reply op#133768262 errcode 1083017924 (32) (DF)

Here's the complete trace.

15:18:37.488158 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-begin tid 1083017822.-1991030034 (40) (DF)
15:18:37.488660 171.64.7.234.7003 > 171.64.7.222.7003:  rx data vldb ubik reply disk-begin (28) (DF)
15:18:37.489782 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-lock tid 1083017822.-1991030034 file 0 pos 1 length 1 (56) (DF)
15:18:37.490070 171.64.7.234.7003 > 171.64.7.222.7003:  rx data vldb ubik reply disk-lock (28) (DF)
15:18:37.492468 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-writev (208) (DF)
15:18:37.492963 171.64.7.234.7003 > 171.64.7.222.7003:  rx data vldb ubik reply disk-writev (28) (DF)
15:18:37.599207 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-commit tid 1083017822.-1991030034 (40) (DF)
15:18:37.613299 171.64.7.222.7002 > 171.64.7.234.7002:  rx data pt ubik call vote-beacon syncsite yes votestart 2004/04/26 15:18:37 dbversion 1073787304.9439 tid [|ubik] (56) (DF)
15:18:37.617765 171.64.7.234.7002 > 171.64.7.222.7002:  rx abort pt ubik reply vote-beacon vote yes until 2004/04/26 15:18:36 (32) (DF)
15:18:37.648873 171.64.7.234.7003 > 171.64.7.222.7003:  rx data vldb ubik reply disk-commit (28) (DF)
15:18:37.650227 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-releaselocks tid 1083017822.-1991030034 (40) (DF)
15:18:37.653243 171.64.7.234.7003 > 171.64.7.222.7003:  rx data vldb ubik reply disk-releaselocks (28) (DF)
15:18:37.663363 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-begin tid 1083017822.-1991030020 (40) (DF)
15:18:37.663773 171.64.7.234.7003 > 171.64.7.222.7003:  rx data vldb ubik reply disk-begin (28) (DF)
15:18:37.664895 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-lock tid 1083017822.-1991030020 file 0 pos 1 length 1 (56) (DF)
15:18:37.665326 171.64.7.234.7003 > 171.64.7.222.7003:  rx data vldb ubik reply disk-lock (28) (DF)
15:18:37.667798 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-writev (100) (DF)
15:18:37.668175 171.64.7.234.7003 > 171.64.7.222.7003:  rx data vldb ubik reply disk-writev (28) (DF)
15:18:37.772839 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-commit tid 1083017822.-1991030020 (40) (DF)
15:18:37.813234 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call vote-beacon syncsite yes votestart 2004/04/26 15:18:37 dbversion 1083017822.195 tid [|ubik] (56) (DF)
15:18:37.816614 171.64.7.234.7003 > 171.64.7.222.7003:  rx data vldb ubik reply disk-commit (28) (DF)
15:18:37.817026 171.64.7.234.7003 > 171.64.7.222.7003:  rx abort vldb ubik reply vote-beacon vote yes until 2004/04/26 15:18:37 (32) (DF)
15:18:37.817879 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-releaselocks tid 1083017822.-1991030020 (40) (DF)
15:18:37.820501 171.64.7.234.7003 > 171.64.7.222.7003:  rx abort vldb ubik reply disk-releaselocks errcode 5381 (32) (DF)
15:18:45.003439 171.64.7.222.7004 > 171.64.7.234.7004:  rx data kauth call op#133768262 (68) (DF)
15:18:45.003832 171.64.7.234.7004 > 171.64.7.222.7004:  rx abort kauth reply op#133768262 errcode 1083017924 (32) (DF)
15:18:52.623390 171.64.7.222.7002 > 171.64.7.234.7002:  rx data pt ubik call vote-beacon syncsite yes votestart 2004/04/26 15:18:52 dbversion 1073787304.9439 tid [|ubik] (56) (DF)
15:18:52.623780 171.64.7.234.7002 > 171.64.7.222.7002:  rx abort pt ubik reply vote-beacon vote yes until 2004/04/26 15:18:51 (32) (DF)
15:18:59.525227 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-probe tid 0.0 (32) (DF)
15:18:59.525445 171.64.7.234.7003 > 171.64.7.222.7003:  rx challenge (44) (DF)
15:18:59.525842 171.64.7.222.7003 > 171.64.7.234.7003:  rx response (116) (DF)
15:18:59.526300 171.64.7.234.7003 > 171.64.7.222.7003:  rx data vldb ubik reply disk-probe (28) (DF)
15:18:59.933279 171.64.7.222.7003 > 171.64.7.234.7003:  rx ack (65) (DF)
15:19:00.013460 171.64.7.222.7004 > 171.64.7.234.7004:  rx data kauth call op#-621365002 (68) (DF)
15:19:00.013867 171.64.7.234.7004 > 171.64.7.222.7004:  rx abort kauth reply op#-621365002 errcode 1083017939 (32) (DF)
15:19:07.633608 171.64.7.222.7002 > 171.64.7.234.7002:  rx data pt ubik call vote-beacon syncsite yes votestart 2004/04/26 15:19:07 dbversion 1073787304.9439 tid [|ubik] (56) (DF)
15:19:07.634005 171.64.7.234.7002 > 171.64.7.222.7002:  rx abort pt ubik reply vote-beacon vote yes until 2004/04/26 15:19:05 (32) (DF)
15:19:07.833488 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call vote-beacon syncsite no votestart 2004/04/26 15:19:07 dbversion 1083017822.196 tid [|ubik] (56) (DF)
15:19:07.833715 171.64.7.234.7003 > 171.64.7.222.7003:  rx challenge (44) (DF)
15:19:07.834371 171.64.7.222.7003 > 171.64.7.234.7003:  rx response (116) (DF)
15:19:07.834856 171.64.7.234.7003 > 171.64.7.222.7003:  rx abort vldb ubik reply vote-beacon vote yes until 2004/04/26 15:19:07 (32) (DF)
15:19:11.636130 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-getversion (32) (DF)
15:19:11.636529 171.64.7.234.7003 > 171.64.7.222.7003:  rx data vldb ubik reply disk-getversion dbversion 1083017822.195 (36) (DF)
15:19:12.043377 171.64.7.222.7003 > 171.64.7.234.7003:  rx ack (65) (DF)
15:19:14.522040 171.64.7.222.7003 > 171.64.7.234.7003:  rx data vldb ubik call disk-sendfile file 0 length 8704000 dbversion 1083017822.196 (1440) (DF)
15:19:14.522542 171.64.7.234.7003 > 171.64.7.222.7003:  rx ack (66) (DF)
15:19:14.523346 171.64.7.222.7003 > 171.64.7.234.7003:  rx data (1440) (DF)
15:19:14.523423 171.64.7.222.7003 > 171.64.7.234.7003:  rx data (1440) (DF)
15:19:14.550777 171.64.7.234.7003 > 171.64.7.222.7003:  rx ack (66) (DF)
15:19:14.551246 171.64.7.234.7003 > 171.64.7.222.7003:  rx ack (66) (DF)

at which point the master sent a new copy of the VLDB database to the
slave.

Any idea what I should be looking for in particular at this point?

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>