[OpenAFS] nightly failure since upgrading to 1.6.5

Tracy Di Marco White gendalia@gmail.com
Mon, 10 Feb 2014 12:15:08 -0600


--001a11c16cccdb479904f2115126
Content-Type: text/plain; charset=ISO-8859-1

Somehow, I still have two of them in my scroll back.
root      4129  0.0  0.2  46288   5124 ?     Sl    7:46AM  0:00.02
/usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo
root      7155  0.0  1.2  85200  42424 ?     Il    8:06AM  1:27.36
/usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo

I'd assume that means you can guess the third.


On Mon, Feb 10, 2014 at 7:00 AM, Peter Grandi <pg@afs.list.sabi.co.uk>wrote:

> > Every night at midnight, we run 'vos backupsys'. For three
> > nights in a row, on one of the servers I've upgraded to 1.6.5
> > and dafs, I've been getting the following errors, and it
> > mostly stops being a fileserver.
>
> [ ... ]
> > Sun Feb  9 00:00:03 2014 SYNC_getCom:  error receiving command
> > Sun Feb  9 00:00:03 2014 FSYNC_com:  read failed; dropping connection
> (cnt=493489)
> > Sun Feb  9 00:00:03 2014 _VLockFd: conflicting lock held on fd 225,
> offset 538046785 by pid 4129 (locktype=1)
> > Sun Feb  9 00:00:03 2014 VAttachVolume: another program has vol
> 538046785 locked
> > Sun Feb  9 00:00:03 2014 VPreattachVolumeByVp_r: volume 538046785 not in
> quiescent state (state 2 flags 0x18)
> [ ... ]
> > Sun Feb  9 00:00:03 2014 1 Volser: Clone: Recloning volume 538046785 to
> volume 538046787
> > Sun Feb  9 00:00:03 2014 SYNC_ask:  length field in response
> inconsistent on circuit 'FSSYNC'
> > Sun Feb  9 00:00:03 2014 SYNC_ask: protocol communications failure on
> circuit 'FSSYNC'; attempting reconnect to server
> [ ... ]
>
> That " _VLockFd: conflicting lock held" and "VAttachVolume:
> another program has vol NNNN locked" looks vaguely familiar, and
> in a case that I have seen it was because a DB server was
> offline, and 'vos' took a very very long time to switch to an
> online one. But this was with 1.4 and supposedly 1.6 should have
> a shorter timeout.
>
> In another case that vaguely resembles this there was a race
> between creating a clone and registering it in the VLDB:
>
>   http://rt.central.org/rt/Ticket/Display.html?id=131797
>
> It would be interesting to know what processes 21378, 4129, 7155
> were doing and why they held a lock on the RW original.
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
>

--001a11c16cccdb479904f2115126
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">Somehow, I still have two of them in my scroll back.<div><=
div>root =A0 =A0 =A04129 =A00.0 =A00.2 =A046288 =A0 5124 ? =A0 =A0 Sl =A0 =
=A07:46AM =A00:00.02 /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -noju=
mbo=A0</div></div><div><div>
root =A0 =A0 =A07155 =A00.0 =A01.2 =A085200 =A042424 ? =A0 =A0 Il =A0 =A08:=
06AM =A01:27.36 /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo=
=A0</div></div><div><br></div><div>I&#39;d assume that means you can guess =
the third.</div></div><div class=3D"gmail_extra">
<br><br><div class=3D"gmail_quote">On Mon, Feb 10, 2014 at 7:00 AM, Peter G=
randi <span dir=3D"ltr">&lt;<a href=3D"mailto:pg@afs.list.sabi.co.uk" targe=
t=3D"_blank">pg@afs.list.sabi.co.uk</a>&gt;</span> wrote:<br><blockquote cl=
ass=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;p=
adding-left:1ex">
<div class=3D"">&gt; Every night at midnight, we run &#39;vos backupsys&#39=
;. For three<br>
&gt; nights in a row, on one of the servers I&#39;ve upgraded to 1.6.5<br>
&gt; and dafs, I&#39;ve been getting the following errors, and it<br>
&gt; mostly stops being a fileserver.<br>
<br>
</div>[ ... ]<br>
<div class=3D"">&gt; Sun Feb =A09 00:00:03 2014 SYNC_getCom: =A0error recei=
ving command<br>
&gt; Sun Feb =A09 00:00:03 2014 FSYNC_com: =A0read failed; dropping connect=
ion (cnt=3D493489)<br>
&gt; Sun Feb =A09 00:00:03 2014 _VLockFd: conflicting lock held on fd 225, =
offset 538046785 by pid 4129 (locktype=3D1)<br>
&gt; Sun Feb =A09 00:00:03 2014 VAttachVolume: another program has vol 5380=
46785 locked<br>
&gt; Sun Feb =A09 00:00:03 2014 VPreattachVolumeByVp_r: volume 538046785 no=
t in quiescent state (state 2 flags 0x18)<br>
</div>[ ... ]<br>
<div class=3D"">&gt; Sun Feb =A09 00:00:03 2014 1 Volser: Clone: Recloning =
volume 538046785 to volume 538046787<br>
&gt; Sun Feb =A09 00:00:03 2014 SYNC_ask: =A0length field in response incon=
sistent on circuit &#39;FSSYNC&#39;<br>
&gt; Sun Feb =A09 00:00:03 2014 SYNC_ask: protocol communications failure o=
n circuit &#39;FSSYNC&#39;; attempting reconnect to server<br>
</div>[ ... ]<br>
<br>
That &quot; _VLockFd: conflicting lock held&quot; and &quot;VAttachVolume:<=
br>
another program has vol NNNN locked&quot; looks vaguely familiar, and<br>
in a case that I have seen it was because a DB server was<br>
offline, and &#39;vos&#39; took a very very long time to switch to an<br>
online one. But this was with 1.4 and supposedly 1.6 should have<br>
a shorter timeout.<br>
<br>
In another case that vaguely resembles this there was a race<br>
between creating a clone and registering it in the VLDB:<br>
<br>
=A0 <a href=3D"http://rt.central.org/rt/Ticket/Display.html?id=3D131797" ta=
rget=3D"_blank">http://rt.central.org/rt/Ticket/Display.html?id=3D131797</a=
><br>
<br>
It would be interesting to know what processes 21378, 4129, 7155<br>
were doing and why they held a lock on the RW original.<br>
_______________________________________________<br>
OpenAFS-info mailing list<br>
<a href=3D"mailto:OpenAFS-info@openafs.org">OpenAFS-info@openafs.org</a><br=
>
<a href=3D"https://lists.openafs.org/mailman/listinfo/openafs-info" target=
=3D"_blank">https://lists.openafs.org/mailman/listinfo/openafs-info</a><br>
</blockquote></div><br></div>

--001a11c16cccdb479904f2115126--