EXTERNAL: [OpenAFS] Preliminary findings on today's brokenness

Chaskiel Grundman cgrundman@gmail.com
Thu, 14 Jan 2021 10:37:56 -0500


--000000000000d75e8405b8de0a46
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

I guess I should elaborate a little.
The "RX Epoch" is a value chosen by each copy of the RX network stack and
is used, in part, to disambiguate different instances of RX running on the
same port.
In OpenAFS, the RX stack lives inside the RX-using process, not in the
networking bits of the kernel, so each independent program has a completely
independent RX stack.

Each time you run vos, a new RX stack is spun up with a new epoch.
The cache manager (afsd) uses an epoch chosen when it was started (i.e.,
during boot).
The fileserver, ptserver, and vlserver each have their own RX stack, with an
epoch chosen when they were last restarted.
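
To make the arithmetic concrete, here is a rough Python sketch. It assumes
(my assumption, not something confirmed in this thread) that the epoch
simply tracks the wall-clock time at which a given RX stack was started, so
a process start time can stand in for it. The threshold is the 0x60000000
value from the quoted message below; the function names are made up for
illustration and are not part of OpenAFS.

```python
from datetime import datetime, timezone

# Threshold reported in the quoted message below: Unix time 0x60000000.
SUSPECT_THRESHOLD = 0x60000000  # 1610612736 in decimal


def threshold_as_utc(threshold: int = SUSPECT_THRESHOLD) -> str:
    """Render the threshold as an ISO 8601 UTC timestamp."""
    return datetime.fromtimestamp(threshold, tz=timezone.utc).isoformat()


def started_after_threshold(start_time: int,
                            threshold: int = SUSPECT_THRESHOLD) -> bool:
    """Return True if a stack started at start_time (Unix seconds) would,
    under the assumption above, have picked an epoch past the threshold."""
    return start_time > threshold


if __name__ == "__main__":
    print(SUSPECT_THRESHOLD, threshold_as_utc())
    # A server last restarted on 2021-01-01 00:00:00 UTC (1609459200).
    print(started_after_threshold(1609459200))
    # A vos command run one hour after the threshold.
    print(started_after_threshold(SUSPECT_THRESHOLD + 3600))
```

Under that assumption, a long-running fileserver or cache manager falls on
the "before" side, while every fresh vos invocation falls on the "after"
side.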

On Thu, Jan 14, 2021 at 10:26 AM Ben Carter <bhc@pitt.edu> wrote:

>
> So we are running 1.6 code and we definitely have a problem.  However
> for us, a sync site is being elected, but doing a vos examine from a
> client seems to hang.  Actual access to files in AFS seems to be working
> fine but we've not restarted any file server processes.
>
> Ben
>
> On 1/14/21 10:21 AM, Chaskiel Grundman wrote:
> > None of these things is confirmed yet, but based on some analysis and
> > testing Carnegie Mellon has done today:
> >
> > - The problem is in RX (the transport layer), not any of the applications
> > - It likely affects 1.8.0 and newer, but not 1.6
> > - It seems to be triggered by the RX epoch being after the Unix time
> > 0x60000000, aka 1610612736, aka Thu Jan 14 08:25:36 UTC 2021
> >
> >
> > So any cache manager and server that has been running since before that
> > time will continue to work until they are restarted. Sites may wish to
> > try and avoid having critical systems reboot or restart until a fix or
> > workaround for this issue is identified.
> >
> > If anyone has a system running something 1.8.0 or newer where the command
> >
> > vos status afs-01.andrew.cmu.edu -noauth
> >
> > succeeds, I'd appreciate knowing about it, as it will change this
> > analysis.
>
>
> --
> Ben Carter
> System Engineer/Operations
> University of Pittsburgh Information Technology
> Office: 412-624-6470
> bhc@pitt.edu
>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
>

--000000000000d75e8405b8de0a46--