[OpenAFS] Solaris 10 deadlock issue
Aaron Knister
aaronk@umbc.edu
Tue, 14 Jun 2011 19:11:33 -0400
The box in question is an x86 VM running in VMware ESXi on a host with dual
Opteron CPUs. I have also reproduced it on a pre-Nehalem and a Nehalem Intel
system. All are running the latest patches to Solaris 10 u8.
On Tue, Jun 14, 2011 at 6:32 PM, Patricia O'Reilly <oreilly@qualcomm.com> wrote:
> Is this an x86 Solaris 10 box running on Nehalem?
>
> Aaron Knister wrote:
> > Good afternoon!
> >
> > I'm writing to report a deadlock issue I'm seeing on Solaris 10.
> >
> > What I've observed is that when a file larger than the configured size
> > of the cache is copied out of AFS the cache manager deadlocks and all
> > access to /afs on the affected system hangs until the system is
> > rebooted. The issue occurs with a memory cache as well as a disk cache.
> >
> > The issue can be mitigated if the cache size is raised to the value of
> > roughly half of the physical memory in the given system. The issue
> > appeared somewhere between Solaris 10 "u8" and "u9."
> >
> > I've reproduced the problem using OpenAFS 1.4.14.1, 1.5.78 and 1.6.0pre6
> > and a Solaris 10 "u8" system with all of the latest patches applied.
> >
> > I've put together a tar file containing:
> >
> > - An fstrace dump starting a few seconds before I initiated the copy
> > - A stack trace of the hung cp command
> > - The output of cmdebug -long -server localhost run after AFS hangs
> >
> > The individual files as well as a tar file of them can be found here:
> > http://userpages.umbc.edu/~aaronk/afs/solaris10-deadlock-issue.
> >
> > Any help would be greatly appreciated.
> >
> > Best,
> > Aaron
> >
> > --
> > Aaron Knister
> > Systems Administrator
> > Division of Information Technology
> > University of Maryland, Baltimore County
> > aaronk@umbc.edu <mailto:aaronk@umbc.edu>
>
--
Aaron Knister
Systems Administrator
Division of Information Technology
University of Maryland, Baltimore County
aaronk@umbc.edu