[OpenAFS] nightly failure since upgrading to 1.6.5
Tracy Di Marco White
gendalia@gmail.com
Mon, 10 Feb 2014 12:15:08 -0600
--001a11c16cccdb479904f2115126
Content-Type: text/plain; charset=ISO-8859-1
Somehow, I still have two of them in my scroll back.
root 4129 0.0 0.2 46288 5124 ? Sl 7:46AM 0:00.02
/usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo
root 7155 0.0 1.2 85200 42424 ? Il 8:06AM 1:27.36
/usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo
I'd assume that means you can guess the third.
On Mon, Feb 10, 2014 at 7:00 AM, Peter Grandi <pg@afs.list.sabi.co.uk>wrote:
> > Every night at midnight, we run 'vos backupsys'. For three
> > nights in a row, on one of the servers I've upgraded to 1.6.5
> > and dafs, I've been getting the following errors, and it
> > mostly stops being a fileserver.
>
> [ ... ]
> > Sun Feb 9 00:00:03 2014 SYNC_getCom: error receiving command
> > Sun Feb 9 00:00:03 2014 FSYNC_com: read failed; dropping connection
> (cnt=493489)
> > Sun Feb 9 00:00:03 2014 _VLockFd: conflicting lock held on fd 225,
> offset 538046785 by pid 4129 (locktype=1)
> > Sun Feb 9 00:00:03 2014 VAttachVolume: another program has vol
> 538046785 locked
> > Sun Feb 9 00:00:03 2014 VPreattachVolumeByVp_r: volume 538046785 not in
> quiescent state (state 2 flags 0x18)
> [ ... ]
> > Sun Feb 9 00:00:03 2014 1 Volser: Clone: Recloning volume 538046785 to
> volume 538046787
> > Sun Feb 9 00:00:03 2014 SYNC_ask: length field in response
> inconsistent on circuit 'FSSYNC'
> > Sun Feb 9 00:00:03 2014 SYNC_ask: protocol communications failure on
> circuit 'FSSYNC'; attempting reconnect to server
> [ ... ]
>
> That " _VLockFd: conflicting lock held" and "VAttachVolume:
> another program has vol NNNN locked" looks vaguely familiar, and
> in a case that I have seen it was because a DB server was
> offline, and 'vos' took a very very long time to switch to an
> online one. But this was with 1.4 and supposedly 1.6 should have
> a shorter timeout.
>
> In another case that vaguely resembles this there was a race
> between creating a clone and registering it in the VLDB:
>
> http://rt.central.org/rt/Ticket/Display.html?id=131797
>
> It would be interesting to know what processes 21378, 4129, 7155
> were doing and why they held a lock on the RW original.
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
>
--001a11c16cccdb479904f2115126
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
<div dir=3D"ltr">Somehow, I still have two of them in my scroll back.<div><=
div>root =A0 =A0 =A04129 =A00.0 =A00.2 =A046288 =A0 5124 ? =A0 =A0 Sl =A0 =
=A07:46AM =A00:00.02 /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -noju=
mbo=A0</div></div><div><div>
root =A0 =A0 =A07155 =A00.0 =A01.2 =A085200 =A042424 ? =A0 =A0 Il =A0 =A08:=
06AM =A01:27.36 /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo=
=A0</div></div><div><br></div><div>I'd assume that means you can guess =
the third.</div></div><div class=3D"gmail_extra">
<br><br><div class=3D"gmail_quote">On Mon, Feb 10, 2014 at 7:00 AM, Peter G=
randi <span dir=3D"ltr"><<a href=3D"mailto:pg@afs.list.sabi.co.uk" targe=
t=3D"_blank">pg@afs.list.sabi.co.uk</a>></span> wrote:<br><blockquote cl=
ass=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;p=
adding-left:1ex">
<div class=3D"">> Every night at midnight, we run 'vos backupsys'=
;. For three<br>
> nights in a row, on one of the servers I've upgraded to 1.6.5<br>
> and dafs, I've been getting the following errors, and it<br>
> mostly stops being a fileserver.<br>
<br>
</div>[ ... ]<br>
<div class=3D"">> Sun Feb =A09 00:00:03 2014 SYNC_getCom: =A0error recei=
ving command<br>
> Sun Feb =A09 00:00:03 2014 FSYNC_com: =A0read failed; dropping connect=
ion (cnt=3D493489)<br>
> Sun Feb =A09 00:00:03 2014 _VLockFd: conflicting lock held on fd 225, =
offset 538046785 by pid 4129 (locktype=3D1)<br>
> Sun Feb =A09 00:00:03 2014 VAttachVolume: another program has vol 5380=
46785 locked<br>
> Sun Feb =A09 00:00:03 2014 VPreattachVolumeByVp_r: volume 538046785 no=
t in quiescent state (state 2 flags 0x18)<br>
</div>[ ... ]<br>
<div class=3D"">> Sun Feb =A09 00:00:03 2014 1 Volser: Clone: Recloning =
volume 538046785 to volume 538046787<br>
> Sun Feb =A09 00:00:03 2014 SYNC_ask: =A0length field in response incon=
sistent on circuit 'FSSYNC'<br>
> Sun Feb =A09 00:00:03 2014 SYNC_ask: protocol communications failure o=
n circuit 'FSSYNC'; attempting reconnect to server<br>
</div>[ ... ]<br>
<br>
That " _VLockFd: conflicting lock held" and "VAttachVolume:<=
br>
another program has vol NNNN locked" looks vaguely familiar, and<br>
in a case that I have seen it was because a DB server was<br>
offline, and 'vos' took a very very long time to switch to an<br>
online one. But this was with 1.4 and supposedly 1.6 should have<br>
a shorter timeout.<br>
<br>
In another case that vaguely resembles this there was a race<br>
between creating a clone and registering it in the VLDB:<br>
<br>
=A0 <a href=3D"http://rt.central.org/rt/Ticket/Display.html?id=3D131797" ta=
rget=3D"_blank">http://rt.central.org/rt/Ticket/Display.html?id=3D131797</a=
><br>
<br>
It would be interesting to know what processes 21378, 4129, 7155<br>
were doing and why they held a lock on the RW original.<br>
_______________________________________________<br>
OpenAFS-info mailing list<br>
<a href=3D"mailto:OpenAFS-info@openafs.org">OpenAFS-info@openafs.org</a><br=
>
<a href=3D"https://lists.openafs.org/mailman/listinfo/openafs-info" target=
=3D"_blank">https://lists.openafs.org/mailman/listinfo/openafs-info</a><br>
</blockquote></div><br></div>
--001a11c16cccdb479904f2115126--