[OpenAFS] nightly failure since upgrading to 1.6.5

Tracy Di Marco White gendalia@gmail.com
Mon, 10 Feb 2014 12:17:52 -0600


--001a11c16ccc9b25a704f2115bf4
Content-Type: text/plain; charset=ISO-8859-1

Sorry, no need to guess, it was in my monitoring client.
21378  4248 root     4:17PM Sl    43  0.0 0:00.12  0.2   5184  46288
/usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo



On Mon, Feb 10, 2014 at 12:15 PM, Tracy Di Marco White
<gendalia@gmail.com>wrote:

> Somehow, I still have two of them in my scroll back.
> root      4129  0.0  0.2  46288   5124 ?     Sl    7:46AM  0:00.02
> /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo
> root      7155  0.0  1.2  85200  42424 ?     Il    8:06AM  1:27.36
> /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo
>
> I'd assume that means you can guess the third.
>
>
> On Mon, Feb 10, 2014 at 7:00 AM, Peter Grandi <pg@afs.list.sabi.co.uk>wrote:
>
>> > Every night at midnight, we run 'vos backupsys'. For three
>> > nights in a row, on one of the servers I've upgraded to 1.6.5
>> > and dafs, I've been getting the following errors, and it
>> > mostly stops being a fileserver.
>>
>> [ ... ]
>> > Sun Feb  9 00:00:03 2014 SYNC_getCom:  error receiving command
>> > Sun Feb  9 00:00:03 2014 FSYNC_com:  read failed; dropping connection
>> (cnt=493489)
>> > Sun Feb  9 00:00:03 2014 _VLockFd: conflicting lock held on fd 225,
>> offset 538046785 by pid 4129 (locktype=1)
>> > Sun Feb  9 00:00:03 2014 VAttachVolume: another program has vol
>> 538046785 locked
>> > Sun Feb  9 00:00:03 2014 VPreattachVolumeByVp_r: volume 538046785 not
>> in quiescent state (state 2 flags 0x18)
>> [ ... ]
>> > Sun Feb  9 00:00:03 2014 1 Volser: Clone: Recloning volume 538046785 to
>> volume 538046787
>> > Sun Feb  9 00:00:03 2014 SYNC_ask:  length field in response
>> inconsistent on circuit 'FSSYNC'
>> > Sun Feb  9 00:00:03 2014 SYNC_ask: protocol communications failure on
>> circuit 'FSSYNC'; attempting reconnect to server
>> [ ... ]
>>
>> That " _VLockFd: conflicting lock held" and "VAttachVolume:
>> another program has vol NNNN locked" looks vaguely familiar, and
>> in a case that I have seen it was because a DB server was
>> offline, and 'vos' took a very very long time to switch to an
>> online one. But this was with 1.4 and supposedly 1.6 should have
>> a shorter timeout.
>>
>> In another case that vaguely resembles this there was a race
>> between creating a clone and registering it in the VLDB:
>>
>>   http://rt.central.org/rt/Ticket/Display.html?id=131797
>>
>> It would be interesting to know what processes 21378, 4129, 7155
>> were doing and why they held a lock on the RW original.
>> _______________________________________________
>> OpenAFS-info mailing list
>> OpenAFS-info@openafs.org
>> https://lists.openafs.org/mailman/listinfo/openafs-info
>>
>
>

--001a11c16ccc9b25a704f2115bf4
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">Sorry, no need to guess, it was in my monitoring client.<d=
iv><div>21378 =A04248 root =A0 =A0 4:17PM Sl =A0 =A043 =A00.0 0:00.12 =A00.=
2 =A0 5184 =A046288 /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojum=
bo=A0</div></div><div>
<br></div></div><div class=3D"gmail_extra"><br><br><div class=3D"gmail_quot=
e">On Mon, Feb 10, 2014 at 12:15 PM, Tracy Di Marco White <span dir=3D"ltr"=
>&lt;<a href=3D"mailto:gendalia@gmail.com" target=3D"_blank">gendalia@gmail=
.com</a>&gt;</span> wrote:<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex"><div dir=3D"ltr">Somehow, I still have two o=
f them in my scroll back.<div><div>root =A0 =A0 =A04129 =A00.0 =A00.<a href=
=3D"tel:2%20%C2%A046288%20%C2%A0%205124" value=3D"+12462885124" target=3D"_=
blank">2 =A046288 =A0 5124</a> ? =A0 =A0 Sl =A0 =A07:46AM =A00:00.02 /usr/p=
kg/libexec/openafs/davolserver -sleep 5/60 -nojumbo=A0</div>
</div><div><div>
root =A0 =A0 =A07155 =A00.0 =A01.2 =A085200 =A042424 ? =A0 =A0 Il =A0 =A08:=
06AM =A01:27.36 /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo=
=A0</div></div><div><br></div><div>I&#39;d assume that means you can guess =
the third.</div></div><div class=3D"HOEnZb">
<div class=3D"h5"><div class=3D"gmail_extra">
<br><br><div class=3D"gmail_quote">On Mon, Feb 10, 2014 at 7:00 AM, Peter G=
randi <span dir=3D"ltr">&lt;<a href=3D"mailto:pg@afs.list.sabi.co.uk" targe=
t=3D"_blank">pg@afs.list.sabi.co.uk</a>&gt;</span> wrote:<br><blockquote cl=
ass=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;p=
adding-left:1ex">

<div>&gt; Every night at midnight, we run &#39;vos backupsys&#39;. For thre=
e<br>
&gt; nights in a row, on one of the servers I&#39;ve upgraded to 1.6.5<br>
&gt; and dafs, I&#39;ve been getting the following errors, and it<br>
&gt; mostly stops being a fileserver.<br>
<br>
</div>[ ... ]<br>
<div>&gt; Sun Feb =A09 00:00:03 2014 SYNC_getCom: =A0error receiving comman=
d<br>
&gt; Sun Feb =A09 00:00:03 2014 FSYNC_com: =A0read failed; dropping connect=
ion (cnt=3D493489)<br>
&gt; Sun Feb =A09 00:00:03 2014 _VLockFd: conflicting lock held on fd 225, =
offset 538046785 by pid 4129 (locktype=3D1)<br>
&gt; Sun Feb =A09 00:00:03 2014 VAttachVolume: another program has vol 5380=
46785 locked<br>
&gt; Sun Feb =A09 00:00:03 2014 VPreattachVolumeByVp_r: volume 538046785 no=
t in quiescent state (state 2 flags 0x18)<br>
</div>[ ... ]<br>
<div>&gt; Sun Feb =A09 00:00:03 2014 1 Volser: Clone: Recloning volume 5380=
46785 to volume 538046787<br>
&gt; Sun Feb =A09 00:00:03 2014 SYNC_ask: =A0length field in response incon=
sistent on circuit &#39;FSSYNC&#39;<br>
&gt; Sun Feb =A09 00:00:03 2014 SYNC_ask: protocol communications failure o=
n circuit &#39;FSSYNC&#39;; attempting reconnect to server<br>
</div>[ ... ]<br>
<br>
That &quot; _VLockFd: conflicting lock held&quot; and &quot;VAttachVolume:<=
br>
another program has vol NNNN locked&quot; looks vaguely familiar, and<br>
in a case that I have seen it was because a DB server was<br>
offline, and &#39;vos&#39; took a very very long time to switch to an<br>
online one. But this was with 1.4 and supposedly 1.6 should have<br>
a shorter timeout.<br>
<br>
In another case that vaguely resembles this there was a race<br>
between creating a clone and registering it in the VLDB:<br>
<br>
=A0 <a href=3D"http://rt.central.org/rt/Ticket/Display.html?id=3D131797" ta=
rget=3D"_blank">http://rt.central.org/rt/Ticket/Display.html?id=3D131797</a=
><br>
<br>
It would be interesting to know what processes 21378, 4129, 7155<br>
were doing and why they held a lock on the RW original.<br>
_______________________________________________<br>
OpenAFS-info mailing list<br>
<a href=3D"mailto:OpenAFS-info@openafs.org" target=3D"_blank">OpenAFS-info@=
openafs.org</a><br>
<a href=3D"https://lists.openafs.org/mailman/listinfo/openafs-info" target=
=3D"_blank">https://lists.openafs.org/mailman/listinfo/openafs-info</a><br>
</blockquote></div><br></div>
</div></div></blockquote></div><br></div>

--001a11c16ccc9b25a704f2115bf4--