[OpenAFS] Help with wedged Solaris box

Kevin Hildebrand kevin@umd.edu
Thu, 29 Nov 2012 09:08:18 -0500 (EST)


Hi, we've got a Solaris 10 box running OpenAFS-1.6.1 that periodically 
becomes unresponsive and has to be hard-restarted.

From what I can see in the crash dumps, the hung threads all appear to 
be in the middle of file access (stat/lookup) in AFS.

For example:
> 2a1043be951::stack
mutex_vector_enter+0x428(190d458, 2, 707dcf50, fffb14c0ca47a08a, 2a100297c81, 0)
afs_root+0x3c(6003f54ce40, 2a1043bf548, 1, 0, 6002542e840, 0)
fsop_root+0x10(6003f54ce40, 2a1043bf548, 6002153acc8, 2420, 0, 7afc1490)
traverse+0x7c(2a1043bf678, 2a1043bf548, 0, 0, 6002542e840, 6002153acc8)
lookuppnvp+0x3d0(2a1043bf940, 0, 6002542e840, 2a1043bf678, 2a1043bf680, 60021529a40)
lookuppnat+0x120(60021529a40, 0, 1, 0, 2a1043bfad8, 0)
lookupnameat+0x5c(0, 0, 1, 0, 2a1043bfad8, 0)
cstatat_getvp+0x198(ffd19400, 100ae7708, 1, 1, 2a1043bfad8, 0)
cstatat+0x40(ffffffffffd19553, 100ae7708, 1000, 100405a50, 0, 10)
syscall_trap+0xac(100ae7708, 100405a50, 100b52fe8, 16, 100b7bffc, 4)
>

In this particular crash dump, hundreds of threads are stuck at this same 
point, blocked in mutex_vector_enter() called from afs_root().
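When hundreds of threads share one hang point, it can help to group their stacks and count them. Below is a minimal sketch that assumes the per-thread stacks have been saved as text in the `::stack` format shown above (one `func+0xoff(args)` frame per line, no module prefixes). The function names `frame_signature`, `group_stacks` and `threads_in` are hypothetical helpers, not part of mdb or OpenAFS.

```python
import re
from collections import Counter

# Matches one frame in the ::stack format shown above, e.g.
# "afs_root+0x3c(6003f54ce40, ...)" -> "afs_root".
# Assumption: frames carry no "module:" prefix.
FRAME_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)(?:\+0x[0-9a-fA-F]+)?\(")


def frame_signature(stack_text):
    """Reduce one thread's stack to a tuple of function names, innermost first."""
    names = []
    for line in stack_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            names.append(m.group(1))
    return tuple(names)


def group_stacks(stacks):
    """Count threads sharing an identical call chain, ignoring offsets and arguments."""
    return Counter(frame_signature(s) for s in stacks)


def threads_in(counts, func):
    """Total number of threads whose call chain includes `func` anywhere."""
    return sum(n for sig, n in counts.items() if func in sig)


if __name__ == "__main__":
    # The first three frames of the stack quoted above, used twice for illustration.
    sample = (
        "mutex_vector_enter+0x428(190d458, 2, 707dcf50, fffb14c0ca47a08a, 2a100297c81, 0)\n"
        "afs_root+0x3c(6003f54ce40, 2a1043bf548, 1, 0, 6002542e840, 0)\n"
        "fsop_root+0x10(6003f54ce40, 2a1043bf548, 6002153acc8, 2420, 0, 7afc1490)\n"
    )
    counts = group_stacks([sample, sample])
    for sig, n in counts.most_common():
        print(n, " <- ".join(sig))  # "A <- B" reads as "A called from B"
    print("threads in afs_root:", threads_in(counts, "afs_root"))
```

Dropping offsets and arguments means threads that reached the same code path through different vnodes or mutexes still land in one bucket; keep the first argument of mutex_vector_enter if you need to tell apart threads waiting on different locks.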

I'd appreciate any suggestions on how to debug this further.

Thanks,
Kevin

--
Kevin Hildebrand
University of Maryland, College Park
Division of IT