[OpenAFS-devel] linux45: smoke test failed

Joe Gorse jhgorse@gmail.com
Mon, 18 Jul 2016 14:18:34 -0400


--94eb2c147d143ba6060537ed004f
Content-Type: text/plain; charset=UTF-8

*Summary:*

getcwd() is failing, sometimes. When it fails, d_unlinked(pwd.dentry)
<http://lxr.free-electrons.com/source/fs/dcache.c#L3250> is likely not
returning success for unknown reasons; hence the ENOENT return from
getcwd().

*Status:*

I ran an strace, filtering on open(), for a 'git log' command which seemed
to reliably repeat it after the 3 minute script from Stephan.

strace on git log w/ and w/out cwd error

with cwd error: http://hastebin.com/dezowowoto.coffee
w/o cwd error: http://hastebin.com/nayojoqomo.coffee


The only substantive differences are line 45 (an extra mprotect() in the
without cwd error case) and then the meat at line ~91-95 at the close().

96 getcwd("/vagrant/systap", 4096) = 16 | getcwd(0x7da600, 4096) = -1
ENOENT (No such file or directory)


The left-hand side above was done on a non-AFS fs, so the path is used as
the identifier. The right hand side is the failed AFS case.

When that same getcwd() call succeeds with AFS, the arguments are exactly
the same, but the return value is 0x16 and git reports that the directory
is "Not a git repository".

*Next steps:*
1. Get SystemTap to recognize the OpenAFS kernel module and userland debug
symbols.
2. Start probing the AFS lookups for this directory and the incorrect
unhashing of it. Particularly in osi_TryEvictVCache

Any hints on SystemTap with custom built OpenAFS are solicited and welcome.

Cheers,
Joe


On Sat, Jul 16, 2016 at 11:39 PM, Benjamin Kaduk <kaduk@mit.edu> wrote:

> On Fri, 15 Jul 2016, Joe Gorse wrote:
>
> > Stephan,
> >
> > Picking this back up, I am having difficulties repeating it consistently.
> > Debian 8.4, kernel 4.4.15, OpenAFS master
> > f14d263a73f0be75e4de92f62e836fb2e55680dd. I see the gerrit for reverting
> on
> > master is not in yet, so that's not it. Tried increasing the frequency of
> > afs_ShakeLooseVCaches.
> >
> > A smaller git repo (e.g. openafs-robotest) never seems to trip the CWD
> bug.
>
> Presumably because it does not create enough vcaches to go over a limit
> somewhere.
>
> > Test method:
> > [vagrant@openafs-debian-dev:/afs/.robotest/test] $ mkdir g; cd g; git
> clone
> > git://gerrit.openafs.org/openafs.git;sleep 180;git log
> > Cloning into 'openafs'...
> > remote: Counting objects: 192945, done.
> > remote: Compressing objects: 100% (46882/46882), done.
> > remote: Total 192945 (delta 159381), reused 177218 (delta 145040)
> > Receiving objects: 100% (192945/192945), 71.80 MiB | 7.31 MiB/s, done.
> > Resolving deltas: 100% (159381/159381), done.
> > Checking connectivity... done.
> > Checking out files: 100% (5563/5563), done.
> > fatal: Unable to read current working directory: No such file or
> directory
> > [vagrant@openafs-debian-dev:/afs/.robotest/test/g] 3m31s 128 $ date
> > Fri Jul 15 21:39:16 UTC 2016
> > [vagrant@openafs-debian-dev:/afs/.robotest/test/g] $
> >
> > I don't see anything obvious in the log files, though I am not sure what
> I
> > would be looking for.
>
> I do not expect us to be logging anything useful in the current code; you
> will probably have to add some logging.
>
> -Ben
>



-- 
Joe Gorse

C: 440-552-0730
LI: Joe Gorse <http://www.linkedin.com/pub/joe-gorse/7/12/397>

--94eb2c147d143ba6060537ed004f
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div><b>Summary:</b></div><div><br></div><div>getcwd() is =
failing, sometimes. When it fails, <a href=3D"http://lxr.free-electrons.com=
/source/fs/dcache.c#L3250" target=3D"_blank">d_unlinked(pwd.dentry)</a> is =
likely not returning success for unknown reasons; hence the ENOENT return f=
rom getcwd().</div>







<div><b><br></b></div><div><b>Status:</b></div><div><br></div>I ran an stra=
ce, filtering on open(), for a &#39;git log&#39; command which seemed to re=
liably repeat it after the 3 minute script from Stephan.<div><p style=3D"ma=
rgin:16px 0px 0px;padding:0px;color:rgb(0,0,0);font-family:Arial,FreeSans,H=
elvetica,sans-serif;font-size:13px;line-height:17.0001px;background-color:r=
gb(240,240,240)">strace on git log w/ and w/out cwd error</p><blockquote st=
yle=3D"border-left-width:2px;border-left-style:solid;border-left-color:rgb(=
50,108,166);margin:16px 0px 0px;padding:0px 0px 0px 16px;color:rgb(51,51,51=
);font-family:Arial,FreeSans,Helvetica,sans-serif;font-size:13px;line-heigh=
t:17.0001px;background-color:rgb(240,240,240)"><p style=3D"margin:0px;paddi=
ng:0px">with cwd error:=C2=A0<a href=3D"http://hastebin.com/dezowowoto.coff=
ee" style=3D"color:rgb(50,108,166);text-decoration:none" target=3D"_blank">=
http://hastebin.com/dezowowoto.coffee</a><br>w/o cwd error:=C2=A0<a href=3D=
"http://hastebin.com/nayojoqomo.coffee" style=3D"color:rgb(50,108,166);text=
-decoration:none" target=3D"_blank">http://hastebin.com/nayojoqomo.coffee</=
a></p></blockquote><div><br></div>The only substantive differences are line=
 45 (an extra mprotect() in the without cwd error case) and then the meat a=
t line ~91-95 at the close().<blockquote style=3D"border-left-width:2px;bor=
der-left-style:solid;border-left-color:rgb(50,108,166);margin:16px 0px 0px;=
padding:0px 0px 0px 16px;color:rgb(51,51,51);font-family:Arial,FreeSans,Hel=
vetica,sans-serif;font-size:13px;line-height:17.0001px;background-color:rgb=
(240,240,240)"><p style=3D"margin:0px;padding:0px">96	getcwd(&quot;/vagrant=
/systap&quot;, 4096) =3D 16 |	getcwd(0x7da600, 4096) =3D -1 ENOENT (No such=
 file or directory)</p></blockquote></div><div><br></div><div>The left-hand=
 side above was done on a non-AFS fs, so the path is used as the identifier=
. The right hand side is the failed AFS case.</div><div><br></div><div>When=
 that same getcwd() call succeeds with AFS, the arguments are exactly the s=
ame, but the return value is 0x16 and git reports that the directory is &qu=
ot;Not a git repository&quot;.</div><div><br></div><div><b>Next steps:</b><=
/div><div>1. Get SystemTap to recognize the OpenAFS kernel module and userl=
and debug symbols.</div><div>2. Start probing the AFS lookups for this dire=
ctory and the incorrect unhashing of it. Particularly in=C2=A0<span style=
=3D"color:rgb(0,0,0);font-family:Arial;font-size:14.6667px;white-space:pre-=
wrap;background-color:transparent">osi_TryEvictVCache</span></div><div><br>=
</div><div>Any hints on SystemTap with custom built OpenAFS are solicited a=
nd welcome.</div><div><br></div><div>Cheers,</div><div>Joe</div><div><br></=
div></div><div class=3D"gmail_extra"><br><div class=3D"gmail_quote">On Sat,=
 Jul 16, 2016 at 11:39 PM, Benjamin Kaduk <span dir=3D"ltr">&lt;<a href=3D"=
mailto:kaduk@mit.edu" target=3D"_blank">kaduk@mit.edu</a>&gt;</span> wrote:=
<br><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-lef=
t:1px #ccc solid;padding-left:1ex"><span class=3D"">On Fri, 15 Jul 2016, Jo=
e Gorse wrote:<br>
<br>
&gt; Stephan,<br>
&gt;<br>
&gt; Picking this back up, I am having difficulties repeating it consistent=
ly.<br>
&gt; Debian 8.4, kernel 4.4.15, OpenAFS master<br>
&gt; f14d263a73f0be75e4de92f62e836fb2e55680dd. I see the gerrit for reverti=
ng on<br>
&gt; master is not in yet, so that&#39;s not it. Tried increasing the frequ=
ency of<br>
&gt; afs_ShakeLooseVCaches.<br>
&gt;<br>
&gt; A smaller git repo (e.g. openafs-robotest) never seems to trip the CWD=
 bug.<br>
<br>
</span>Presumably because it does not create enough vcaches to go over a li=
mit<br>
somewhere.<br>
<span class=3D""><br>
&gt; Test method:<br>
&gt; [vagrant@openafs-debian-dev:/afs/.robotest/test] $ mkdir g; cd g; git =
clone<br>
&gt; git://<a href=3D"http://gerrit.openafs.org/openafs.git;sleep" rel=3D"n=
oreferrer" target=3D"_blank">gerrit.openafs.org/openafs.git;sleep</a> 180;g=
it log<br>
&gt; Cloning into &#39;openafs&#39;...<br>
&gt; remote: Counting objects: 192945, done.<br>
&gt; remote: Compressing objects: 100% (46882/46882), done.<br>
&gt; remote: Total 192945 (delta 159381), reused 177218 (delta 145040)<br>
&gt; Receiving objects: 100% (192945/192945), 71.80 MiB | 7.31 MiB/s, done.=
<br>
&gt; Resolving deltas: 100% (159381/159381), done.<br>
&gt; Checking connectivity... done.<br>
&gt; Checking out files: 100% (5563/5563), done.<br>
&gt; fatal: Unable to read current working directory: No such file or direc=
tory<br>
&gt; [vagrant@openafs-debian-dev:/afs/.robotest/test/g] 3m31s 128 $ date<br=
>
&gt; Fri Jul 15 21:39:16 UTC 2016<br>
&gt; [vagrant@openafs-debian-dev:/afs/.robotest/test/g] $<br>
&gt;<br>
&gt; I don&#39;t see anything obvious in the log files, though I am not sur=
e what I<br>
&gt; would be looking for.<br>
<br>
</span>I do not expect us to be logging anything useful in the current code=
; you<br>
will probably have to add some logging.<br>
<br>
-Ben<br>
</blockquote></div><br><br clear=3D"all"><div><br></div>-- <br><div class=
=3D"gmail_signature" data-smartmail=3D"gmail_signature"><div dir=3D"ltr"><d=
iv>Joe Gorse<div><font face=3D"Calibri" size=3D"1"><br></font></div><div><f=
ont face=3D"Calibri" size=3D"1"><div style=3D"font-family:arial;font-size:s=
mall"><span style=3D"font-family:arial;background-color:rgb(255,255,255)"><=
font size=3D"1">C:=C2=A0</font></span><span style=3D"background-color:rgb(2=
55,255,255);font-size:x-small"><a value=3D"+14402344531"><font color=3D"#00=
00cc" face=3D"Helvetica">440-552-0730</font></a></span></div><div style=3D"=
font-family:arial;font-size:small"><span style=3D"font-family:arial;backgro=
und-color:rgb(255,255,255)"><font size=3D"1">LI: <a href=3D"http://www.link=
edin.com/pub/joe-gorse/7/12/397" target=3D"_blank">Joe Gorse</a></font></sp=
an></div></font></div></div></div></div>
</div>

--94eb2c147d143ba6060537ed004f--