[OpenAFS] Solaris 10 deadlock issue

Aaron Knister aaronk@umbc.edu
Tue, 14 Jun 2011 19:11:33 -0400


The box in question is an x86 VM running in VMware ESXi on a host with dual
Opteron CPUs. I have also reproduced it on a pre-Nehalem and a Nehalem Intel
system. All are running Solaris 10 u8 with the latest patches.

On Tue, Jun 14, 2011 at 6:32 PM, Patricia O'Reilly <oreilly@qualcomm.com> wrote:

> Is this an x86 Solaris 10 box running on Nehalem?
>
> Aaron Knister wrote:
> > Good afternoon!
> >
> > I'm writing to report a deadlock issue I'm seeing on Solaris 10.
> >
> > What I've observed is that when a file larger than the configured size
> > of the cache is copied out of AFS, the cache manager deadlocks and all
> > access to /afs on the affected system hangs until the system is
> > rebooted. The issue occurs with a memory cache as well as a disk cache.
> >
> > The issue can be mitigated if the cache size is raised to the value of
> > roughly half of the physical memory in the given system. The issue
> > appeared somewhere between Solaris 10 "u8" and "u9."
> >
> > I've reproduced the problem using OpenAFS 1.4.14.1, 1.5.78 and 1.6.0pre6
> > and a Solaris 10 "u8" system with all of the latest patches applied.
> >
> > I've put together a tar file containing:
> >
> > - An fstrace dump starting a few seconds before I initiated the copy
> > - A stack trace of the hung cp command
> > - The output of cmdebug -long -server localhost run after AFS hangs
> >
> > The individual files as well as a tar file of them can be found here:
> > http://userpages.umbc.edu/~aaronk/afs/solaris10-deadlock-issue.
> >
> > Any help would be greatly appreciated.
> >
> > Best,
> > Aaron
> >
> > --
> > Aaron Knister
> > Systems Administrator
> > Division of Information Technology
> > University of Maryland, Baltimore County
> > aaronk@umbc.edu
>
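
To make the trigger condition and the workaround from the quoted report
concrete, here is a minimal sketch. It assumes the cache size is expressed
in 1 KB blocks (the unit afsd uses for its -blocks option); the function
names and the example sizes are hypothetical and are not taken from the
report or from OpenAFS itself.

```python
KB = 1024  # assumed cache block size in bytes (1 KB blocks)


def file_exceeds_cache(file_size_bytes: int, cache_blocks: int) -> bool:
    """Return True when the file is larger than the configured cache.

    This is the condition the report describes as triggering the hang.
    """
    return file_size_bytes > cache_blocks * KB


def mitigating_cache_blocks(physical_memory_bytes: int) -> int:
    """Return a cache size, in 1 KB blocks, of half the physical memory.

    The report says "roughly half", so treat the result as a starting
    point rather than an exact threshold.
    """
    return physical_memory_bytes // 2 // KB


if __name__ == "__main__":
    # Hypothetical example: a 100000-block cache and a 200 MiB file.
    print(file_exceeds_cache(200 * 1024 * 1024, 100000))
    # Hypothetical example: a machine with 4 GiB of physical memory.
    print(mitigating_cache_blocks(4 * 1024**3))
```

The first call checks whether a given copy would hit the reported
condition; the second gives a candidate cache size for the workaround.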



-- 
Aaron Knister
Systems Administrator
Division of Information Technology
University of Maryland, Baltimore County
aaronk@umbc.edu
