[OpenAFS] nightly failure since upgrading to 1.6.5
Tracy Di Marco White
gendalia@gmail.com
Mon, 10 Feb 2014 12:17:52 -0600
--001a11c16ccc9b25a704f2115bf4
Content-Type: text/plain; charset=ISO-8859-1
Sorry, no need to guess, it was in my monitoring client.
21378 4248 root 4:17PM Sl 43 0.0 0:00.12 0.2 5184 46288
/usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo
On Mon, Feb 10, 2014 at 12:15 PM, Tracy Di Marco White
<gendalia@gmail.com>wrote:
> Somehow, I still have two of them in my scroll back.
> root 4129 0.0 0.2 46288 5124 ? Sl 7:46AM 0:00.02
> /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo
> root 7155 0.0 1.2 85200 42424 ? Il 8:06AM 1:27.36
> /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo
>
> I'd assume that means you can guess the third.
>
>
> On Mon, Feb 10, 2014 at 7:00 AM, Peter Grandi <pg@afs.list.sabi.co.uk>wrote:
>
>> > Every night at midnight, we run 'vos backupsys'. For three
>> > nights in a row, on one of the servers I've upgraded to 1.6.5
>> > and dafs, I've been getting the following errors, and it
>> > mostly stops being a fileserver.
>>
>> [ ... ]
>> > Sun Feb 9 00:00:03 2014 SYNC_getCom: error receiving command
>> > Sun Feb 9 00:00:03 2014 FSYNC_com: read failed; dropping connection
>> (cnt=493489)
>> > Sun Feb 9 00:00:03 2014 _VLockFd: conflicting lock held on fd 225,
>> offset 538046785 by pid 4129 (locktype=1)
>> > Sun Feb 9 00:00:03 2014 VAttachVolume: another program has vol
>> 538046785 locked
>> > Sun Feb 9 00:00:03 2014 VPreattachVolumeByVp_r: volume 538046785 not
>> in quiescent state (state 2 flags 0x18)
>> [ ... ]
>> > Sun Feb 9 00:00:03 2014 1 Volser: Clone: Recloning volume 538046785 to
>> volume 538046787
>> > Sun Feb 9 00:00:03 2014 SYNC_ask: length field in response
>> inconsistent on circuit 'FSSYNC'
>> > Sun Feb 9 00:00:03 2014 SYNC_ask: protocol communications failure on
>> circuit 'FSSYNC'; attempting reconnect to server
>> [ ... ]
>>
>> That " _VLockFd: conflicting lock held" and "VAttachVolume:
>> another program has vol NNNN locked" looks vaguely familiar, and
>> in a case that I have seen it was because a DB server was
>> offline, and 'vos' took a very very long time to switch to an
>> online one. But this was with 1.4 and supposedly 1.6 should have
>> a shorter timeout.
>>
>> In another case that vaguely resembles this there was a race
>> between creating a clone and registering it in the VLDB:
>>
>> http://rt.central.org/rt/Ticket/Display.html?id=131797
>>
>> It would be interesting to know what processes 21378, 4129, 7155
>> were doing and why they held a lock on the RW original.
>> _______________________________________________
>> OpenAFS-info mailing list
>> OpenAFS-info@openafs.org
>> https://lists.openafs.org/mailman/listinfo/openafs-info
>>
>
>
--001a11c16ccc9b25a704f2115bf4
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
<div dir=3D"ltr">Sorry, no need to guess, it was in my monitoring client.<d=
iv><div>21378 =A04248 root =A0 =A0 4:17PM Sl =A0 =A043 =A00.0 0:00.12 =A00.=
2 =A0 5184 =A046288 /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojum=
bo=A0</div></div><div>
<br></div></div><div class=3D"gmail_extra"><br><br><div class=3D"gmail_quot=
e">On Mon, Feb 10, 2014 at 12:15 PM, Tracy Di Marco White <span dir=3D"ltr"=
><<a href=3D"mailto:gendalia@gmail.com" target=3D"_blank">gendalia@gmail=
.com</a>></span> wrote:<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex"><div dir=3D"ltr">Somehow, I still have two o=
f them in my scroll back.<div><div>root =A0 =A0 =A04129 =A00.0 =A00.<a href=
=3D"tel:2%20%C2%A046288%20%C2%A0%205124" value=3D"+12462885124" target=3D"_=
blank">2 =A046288 =A0 5124</a> ? =A0 =A0 Sl =A0 =A07:46AM =A00:00.02 /usr/p=
kg/libexec/openafs/davolserver -sleep 5/60 -nojumbo=A0</div>
</div><div><div>
root =A0 =A0 =A07155 =A00.0 =A01.2 =A085200 =A042424 ? =A0 =A0 Il =A0 =A08:=
06AM =A01:27.36 /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo=
=A0</div></div><div><br></div><div>I'd assume that means you can guess =
the third.</div></div><div class=3D"HOEnZb">
<div class=3D"h5"><div class=3D"gmail_extra">
<br><br><div class=3D"gmail_quote">On Mon, Feb 10, 2014 at 7:00 AM, Peter G=
randi <span dir=3D"ltr"><<a href=3D"mailto:pg@afs.list.sabi.co.uk" targe=
t=3D"_blank">pg@afs.list.sabi.co.uk</a>></span> wrote:<br><blockquote cl=
ass=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;p=
adding-left:1ex">
<div>> Every night at midnight, we run 'vos backupsys'. For thre=
e<br>
> nights in a row, on one of the servers I've upgraded to 1.6.5<br>
> and dafs, I've been getting the following errors, and it<br>
> mostly stops being a fileserver.<br>
<br>
</div>[ ... ]<br>
<div>> Sun Feb =A09 00:00:03 2014 SYNC_getCom: =A0error receiving comman=
d<br>
> Sun Feb =A09 00:00:03 2014 FSYNC_com: =A0read failed; dropping connect=
ion (cnt=3D493489)<br>
> Sun Feb =A09 00:00:03 2014 _VLockFd: conflicting lock held on fd 225, =
offset 538046785 by pid 4129 (locktype=3D1)<br>
> Sun Feb =A09 00:00:03 2014 VAttachVolume: another program has vol 5380=
46785 locked<br>
> Sun Feb =A09 00:00:03 2014 VPreattachVolumeByVp_r: volume 538046785 no=
t in quiescent state (state 2 flags 0x18)<br>
</div>[ ... ]<br>
<div>> Sun Feb =A09 00:00:03 2014 1 Volser: Clone: Recloning volume 5380=
46785 to volume 538046787<br>
> Sun Feb =A09 00:00:03 2014 SYNC_ask: =A0length field in response incon=
sistent on circuit 'FSSYNC'<br>
> Sun Feb =A09 00:00:03 2014 SYNC_ask: protocol communications failure o=
n circuit 'FSSYNC'; attempting reconnect to server<br>
</div>[ ... ]<br>
<br>
That " _VLockFd: conflicting lock held" and "VAttachVolume:<=
br>
another program has vol NNNN locked" looks vaguely familiar, and<br>
in a case that I have seen it was because a DB server was<br>
offline, and 'vos' took a very very long time to switch to an<br>
online one. But this was with 1.4 and supposedly 1.6 should have<br>
a shorter timeout.<br>
<br>
In another case that vaguely resembles this there was a race<br>
between creating a clone and registering it in the VLDB:<br>
<br>
=A0 <a href=3D"http://rt.central.org/rt/Ticket/Display.html?id=3D131797" ta=
rget=3D"_blank">http://rt.central.org/rt/Ticket/Display.html?id=3D131797</a=
><br>
<br>
It would be interesting to know what processes 21378, 4129, 7155<br>
were doing and why they held a lock on the RW original.<br>
_______________________________________________<br>
OpenAFS-info mailing list<br>
<a href=3D"mailto:OpenAFS-info@openafs.org" target=3D"_blank">OpenAFS-info@=
openafs.org</a><br>
<a href=3D"https://lists.openafs.org/mailman/listinfo/openafs-info" target=
=3D"_blank">https://lists.openafs.org/mailman/listinfo/openafs-info</a><br>
</blockquote></div><br></div>
</div></div></blockquote></div><br></div>
--001a11c16ccc9b25a704f2115bf4--