[OpenAFS] Re: nightly failure since upgrading to 1.6.5

Tracy Di Marco White gendalia@gmail.com
Mon, 10 Feb 2014 15:09:25 -0600



On Mon, Feb 10, 2014 at 2:23 PM, Andrew Deason <adeason@sinenomine.net> wrote:

> On Mon, 10 Feb 2014 00:27:59 -0600
> Tracy Di Marco White <gendalia@gmail.com> wrote:
>
> > Every night at midnight, we run 'vos backupsys'. For three nights in a
> > row, on one of the servers I've upgraded to 1.6.5 and dafs, I've been
> > getting the following errors, and it mostly stops being a fileserver.
> > Is this fixed in 1.6.6? Anyone else seeing it? This is on NetBSD
> > 6.1.3.
>
> I would guess you are the only one using NetBSD for a "real" fileserver,
> at least for DAFS. The errors you've posted indicate there are some
> problems with the mechanism by which the fileserver and other processes
> use to communicate with each other, so it may be advisable to not trust
> DAFS on NetBSD with "real" data until it's known what's going on, as
> errors like this could possibly lead to corrupted volumes.
>

That's possible, certainly, depending on your definition of 'real'. I know
other people are using DAFS on NetBSD for fileservers. Personally,
I've only been doing it for a year or two.


> Do you know if this seems to happen immediately, or if 'vos backupsys'
> seems to correctly create some backup clones, and then eventually
> triggers this error? I (or someone else) will probably need to reproduce
> this to get a better idea of what's going on, but you can maybe save us
> some time with some more info:


It happens on one server, of four, and it's most of the way through creating
backup volumes on this particular server. It is consistently happening on
one, and only one, server.
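
For anyone who wants to pin down how far into the run the failure hits, here is a minimal sketch that pulls the timestamps of the SYNC_ask/FSSYNC complaints out of VolserLog so they can be compared against the midnight start of 'vos backupsys'. It assumes each log entry sits on a single line in the actual file (the excerpts in this thread are wrapped by the mailer) and that every entry starts with a timestamp shaped like the ones quoted here. The function name fssync_failures is made up for illustration and is not part of OpenAFS.

```python
import re
from datetime import datetime

# Timestamp prefix as seen in the quoted VolserLog excerpts,
# e.g. "Sat Feb  8 00:02:42 2014 <message>" (assumed one entry per line).
STAMP = re.compile(r"^(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}) (.*)$")


def fssync_failures(lines):
    """Return (timestamp, message) for each SYNC_ask line mentioning FSSYNC."""
    hits = []
    for line in lines:
        m = STAMP.match(line.rstrip("\n"))
        if not m:
            continue  # skip lines without the expected timestamp prefix
        stamp, msg = m.groups()
        if "SYNC_ask" in msg and "FSSYNC" in msg:
            when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
            hits.append((when, msg))
    return hits


# The two entries quoted in this thread, unwrapped onto single lines.
sample = [
    "Sat Feb  8 00:02:42 2014 SYNC_ask:  length field in response "
    "inconsistent on circuit 'FSSYNC'",
    "Sat Feb  8 00:02:42 2014 SYNC_ask: protocol communications failure "
    "on circuit 'FSSYNC'; attempting reconnect to server",
]
hits = fssync_failures(sample)
for when, msg in hits:
    print(when.isoformat(), msg)
```

To run it against a real log, pass an open file object instead of the sample list; the location of VolserLog depends on how the server was installed.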


> > VolserLog
> > Sat Feb  8 00:02:42 2014 SYNC_ask:  length field in response inconsistent
> > on circuit 'FSSYNC'
> > Sat Feb  8 00:02:42 2014 SYNC_ask: protocol communications failure on
> > circuit 'FSSYNC'; attempting reconnect to server
>
> This message says what one of the problems is, but isn't providing a lot
> of information. If it's convenient for you to apply a patch and rebuild,
> the following patch would give us a little more information in this
> situation (from gerrit 10829):
>
> <http://git.openafs.org/?p=openafs.git;a=patch;h=9604a45e94ed23a2941d0a7e11bfd892a0bd0bf7>
>


Sure, since I'm restarting just after midnight every night anyway.

> On Mon, 10 Feb 2014 12:15:08 -0600
> Tracy Di Marco White <gendalia@gmail.com> wrote:
>
> > root      4129  0.0  0.2 46288 5124 ?     Sl    7:46AM  0:00.02
> > /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo
> > root      7155  0.0  1.2  85200  42424 ?     Il    8:06AM  1:27.36
> > /usr/pkg/libexec/openafs/davolserver -sleep 5/60 -nojumbo
>
> Do you have any idea why you have multiple davolserver processes running
> at once? Does BosLog maybe say anything about processes dying or
> anything? Could you provide a 'ps' listing of all afs server processes
> on that machine?
>

They aren't running at once. Those are three different days, three different
restarts. Restarting AFS is the only way I know of to make the fileserver
work again.

-Tracy
