[OpenAFS] deadlock in OpenAFS 1.4.11 (Solaris 5.10)

John Tang Boyland boyland@cs.uwm.edu
Fri, 09 Apr 2010 11:08:47 -0500


We get an occasional deadlock happening on Solaris 5.10 using
OpenAFS 1.4.11.  After the problem starts, any attempt to use AFS
on the machine freezes:  For example:

% truss -f touch /afs/not-here
15694:  execve("/usr/bin/touch", 0x08047E20, 0x08047E2C)  argc = 2
15694:  resolvepath("/usr/lib/ld.so.1", "/lib/ld.so.1", 1023) = 12
15694:  resolvepath("/usr/bin/touch", "/usr/bin/touch", 1023) = 14
15694:  sysconfig(_CONFIG_PAGESIZE)                     = 4096
15694:  xstat(2, "/usr/bin/touch", 0x08047BF8)          = 0
15694:  open("/var/ld/ld.config", O_RDONLY)             = 3
15694:  fxstat(2, 3, 0x08047B38)                        = 0
15694:  mmap(0x00000000, 128104, PROT_READ, MAP_SHARED, 3, 0) = 0xFEFA1000
15694:  close(3)                                        = 0
15694:  mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFEF90000
15694:  xstat(2, "/lib/libc.so.1", 0x08047440)          = 0
15694:  resolvepath("/lib/libc.so.1", "/lib/libc.so.1", 1023) = 14
15694:  open("/lib/libc.so.1", O_RDONLY)                = 3
15694:  mmap(0x00010000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xFEF80000
15694:  mmap(0x00010000, 880640, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFEEA0000
15694:  mmap(0xFEEA0000, 775469, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFEEA0000
15694:  mmap(0xFEF6E000, 26855, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 778240) = 0xFEF6E000
15694:  mmap(0xFEF75000, 5016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANON, -1, 0) = 0xFEF75000
15694:  munmap(0xFEF5E000, 65536)                       = 0
15694:  memcntl(0xFEEA0000, 123376, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
15694:  close(3)                                        = 0
15694:  munmap(0xFEF80000, 32768)                       = 0
15694:  mmap(0x00010000, 24576, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFEF80000
15694:  getcontext(0x080479B0)
15694:  getrlimit(RLIMIT_STACK, 0x080479A8)             = 0
15694:  getpid()                                        = 15694 [15692]
15694:  lwp_private(0, 1, 0xFEF82000)                   = 0x000001C3
15694:  setustack(0xFEF82060)
15694:  sysi86(SI86FPSTART, 0xFEF75A58, 0x0000133F, 0x00001F80) = 0x00000001
15694:  brk(0x08062758)                                 = 0
15694:  brk(0x08064758)                                 = 0
15694:  xstat(2, "/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", 0x08046D08) = 015694:  resolvepath("/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", "/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", 1023) = 44
15694:  open("/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", O_RDONLY) = 3
15694:  mmap(0x00010000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xFEE90000
15694:  mmap(0x00010000, 2297856, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFEC00000
15694:  mmap(0xFEC00000, 2225278, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFEC00000
15694:  mmap(0xFEE2F000, 4234, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 2224128) = 0xFEE2F000
15694:  munmap(0xFEE20000, 61440)                       = 0
15694:  memcntl(0xFEC00000, 7188, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
15694:  close(3)                                        = 0
15694:  xstat(2, "/usr/lib/locale/en_US.UTF-8/methods_en_US.UTF-8.so.3", 0x08046C60) = 0
15694:  resolvepath("/usr/lib/locale/en_US.UTF-8/methods_en_US.UTF-8.so.3", "/usr/lib/locale/common/methods_unicode.so.3", 1023) = 43
15694:  open("/usr/lib/locale/en_US.UTF-8/methods_en_US.UTF-8.so.3", O_RDONLY) = 3
15694:  mmap(0xFEE90000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFEE90000
15694:  mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFEE80000
15694:  mmap(0x00010000, 122880, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFEE60000
15694:  mmap(0xFEE60000, 55437, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFEE60000
15694:  mmap(0xFEE7D000, 2524, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 53248) = 0xFEE7D000
15694:  munmap(0xFEE6E000, 61440)                       = 0
15694:  memcntl(0xFEE60000, 2532, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
15694:  close(3)                                        = 0
15694:  xstat(2, "/usr/lib/locale/en_US.UTF-8/libc.so.1", 0x08046C60) Err#2 ENOENT
15694:  munmap(0xFEE90000, 32768)                       = 0
15694:  sysconfig(_CONFIG_PAGESIZE)                     = 4096

FREEZE

On a machine which has not had the problem (yet...) the output continues
...
24608:  sysconfig(_CONFIG_PAGESIZE)                     = 4096
24608:  stat64("/afs/not-here", 0x08047C90)             Err#2 ENOENT
24608:  creat64("/afs/not-here", 0666)                  Err#30 EROFS
24608:  open("/usr/lib/locale/en_US.UTF-8/LC_MESSAGES/SUNW_OST_OSCMD.mo", O_RDONLY) Err#2 ENOENT
24608:  fstat64(2, 0x08046EA0)                          = 0
24608:  write(2, " t o u c h", 5)                       = 5
24608:  write(2, " :  ", 2)                             = 2
24608:  write(2, " / a f s / n o t - h e r".., 13)      = 13
24608:  write(2, "   c a n n o t   c r e a".., 15)      = 15
24608:  _exit(1)

In other words, the stat64 call accesses AFS and (on the machine
with the problem), the thread gets stuck in the AFS tarbaby.

I suspected it was due to logging, so I changed the configuration to
mount a dedicated partition for /usr/vice/cache, and rebooted.  The '
machine was fine for a month or two, but problem has re-occurred.

The machine is used frequently (it's our main computer server for
undergraduate classes) but "fortunately" AFS is not very popular here
so most courses don't use it (partly because of nasty things like
this happening now and then) and so the machine is still being used
for non-AFS courses.  Hence I hadn't tried to install a newer version
of OpenAFS.  If this is a known bug with OpenAFS, I will indeed 
ask them to take the machine offline long enough to fix this.
(Political capital and all; I hope people understand.)

I haven't tried to reproduce this bug (and wouldn't want to
on the computer server!): it only seems to happen on these
main compute servers -- never on my little research machines... :-(

Any help would be appreciated.

John