[OpenAFS] Solaris 10 deadlock issue

Patricia O'Reilly oreilly@qualcomm.com
Tue, 14 Jun 2011 15:32:51 -0700


Is this an x86 Solaris 10 box running on Nehalem?

Aaron Knister wrote:
> Good afternoon!
> 
> I'm writing to report a deadlock issue I'm seeing on Solaris 10.
> 
> What I've observed is that when a file larger than the configured cache
> size is copied out of AFS, the cache manager deadlocks and all
> access to /afs on the affected system hangs until the system is
> rebooted. The issue occurs with a memory cache as well as a disk cache.
> 
> The issue can be mitigated by raising the cache size to roughly half
> of the physical memory in the given system. The issue appeared
> somewhere between Solaris 10 "u8" and "u9."
> 
> I've reproduced the problem using OpenAFS 1.4.14.1, 1.5.78, and 1.6.0pre6
> on a Solaris 10 "u8" system with all of the latest patches applied.
> 
> I've put together a tar file containing:
> 
> - An fstrace dump starting a few seconds before I initiated the copy
> - A stack trace of the hung cp command
> - The output of "cmdebug -long -server localhost", run after AFS hangs
> 
> The individual files as well as a tar file of them can be found here:
> http://userpages.umbc.edu/~aaronk/afs/solaris10-deadlock-issue.
> 
> Any help would be greatly appreciated.
> 
> Best,
> Aaron
> 
> -- 
> Aaron Knister
> Systems Administrator
> Division of Information Technology
> University of Maryland, Baltimore County
> aaronk@umbc.edu