[OpenAFS] Solaris 10 deadlock issue

Aaron Knister aaronk@umbc.edu
Tue, 14 Jun 2011 17:56:44 -0400


Good afternoon!

I'm writing to report a deadlock issue I'm seeing on Solaris 10.

What I've observed is that when a file larger than the configured cache size
is copied out of AFS, the cache manager deadlocks and all access to /afs on
the affected system hangs until the system is rebooted. The issue occurs with
a memory cache as well as a disk cache.
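
As a rough illustration of the triggering condition, here is a minimal Python
sketch that checks whether a file is larger than the configured cache. It
assumes the cache size is taken from a cacheinfo-style line of the form
mountpoint:cachedir:size, with the size in 1 KB blocks; the function names and
the sample path and size are hypothetical and are not taken from the affected
system.

```python
def parse_cacheinfo(line):
    """Parse a cacheinfo-style line: 'mountpoint:cachedir:size_in_1K_blocks'."""
    mount, cache_dir, blocks = line.strip().split(":")
    return mount, cache_dir, int(blocks)


def exceeds_cache(file_size_bytes, cache_blocks):
    """Return True if a file of this size is larger than the cache.

    cache_blocks is assumed to be in 1 KB (1024-byte) units.
    """
    return file_size_bytes > cache_blocks * 1024


# Hypothetical values for illustration only.
mount, cache_dir, blocks = parse_cacheinfo("/afs:/usr/vice/cache:50000")
print(exceeds_cache(100 * 1024 * 1024, blocks))
```

In practice the file size would come from os.path.getsize() on the file being
copied. If the cache size is overridden on the afsd command line, the value in
cacheinfo would not reflect the running configuration.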

The issue can be mitigated by raising the cache size to roughly half of the
physical memory in the system. The issue appeared somewhere between
Solaris 10 "u8" and "u9."

I've reproduced the problem with OpenAFS 1.4.14.1, 1.5.78, and 1.6.0pre6 on
a Solaris 10 "u8" system with all of the latest patches applied.

I've put together a tar file containing:

- An fstrace dump starting a few seconds before I initiated the copy
- A stack trace of the hung cp command
- The output of cmdebug -long -server localhost run after AFS hangs

The individual files as well as a tar file of them can be found here:
http://userpages.umbc.edu/~aaronk/afs/solaris10-deadlock-issue.

Any help would be greatly appreciated.

Best,
Aaron

-- 
Aaron Knister
Systems Administrator
Division of Information Technology
University of Maryland, Baltimore County
aaronk@umbc.edu
