[OpenAFS] deadlock in OpenAFS 1.4.11 (Solaris 5.10)

Douglas E. Engert deengert@anl.gov
Fri, 09 Apr 2010 11:33:48 -0500


What happens if you have exported LANG=C into your environment.
Based on you trace the freeze comes before AFS does anything. Or
is just the truss has not written all its output.

John Tang Boyland wrote:
> We get an occasional deadlock happening on Solaris 5.10 using
> OpenAFS 1.4.11.  After the problem starts, any attempt to use AFS
> on the machine freezes:  For example:
> 
> % truss -f touch /afs/not-here
> 15694:  execve("/usr/bin/touch", 0x08047E20, 0x08047E2C)  argc = 2
> 15694:  resolvepath("/usr/lib/ld.so.1", "/lib/ld.so.1", 1023) = 12
> 15694:  resolvepath("/usr/bin/touch", "/usr/bin/touch", 1023) = 14
> 15694:  sysconfig(_CONFIG_PAGESIZE)                     = 4096
> 15694:  xstat(2, "/usr/bin/touch", 0x08047BF8)          = 0
> 15694:  open("/var/ld/ld.config", O_RDONLY)             = 3
> 15694:  fxstat(2, 3, 0x08047B38)                        = 0
> 15694:  mmap(0x00000000, 128104, PROT_READ, MAP_SHARED, 3, 0) = 0xFEFA1000
> 15694:  close(3)                                        = 0
> 15694:  mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFEF90000
> 15694:  xstat(2, "/lib/libc.so.1", 0x08047440)          = 0
> 15694:  resolvepath("/lib/libc.so.1", "/lib/libc.so.1", 1023) = 14
> 15694:  open("/lib/libc.so.1", O_RDONLY)                = 3
> 15694:  mmap(0x00010000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xFEF80000
> 15694:  mmap(0x00010000, 880640, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFEEA0000
> 15694:  mmap(0xFEEA0000, 775469, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFEEA0000
> 15694:  mmap(0xFEF6E000, 26855, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 778240) = 0xFEF6E000
> 15694:  mmap(0xFEF75000, 5016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANON, -1, 0) = 0xFEF75000
> 15694:  munmap(0xFEF5E000, 65536)                       = 0
> 15694:  memcntl(0xFEEA0000, 123376, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
> 15694:  close(3)                                        = 0
> 15694:  munmap(0xFEF80000, 32768)                       = 0
> 15694:  mmap(0x00010000, 24576, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFEF80000
> 15694:  getcontext(0x080479B0)
> 15694:  getrlimit(RLIMIT_STACK, 0x080479A8)             = 0
> 15694:  getpid()                                        = 15694 [15692]
> 15694:  lwp_private(0, 1, 0xFEF82000)                   = 0x000001C3
> 15694:  setustack(0xFEF82060)
> 15694:  sysi86(SI86FPSTART, 0xFEF75A58, 0x0000133F, 0x00001F80) = 0x00000001
> 15694:  brk(0x08062758)                                 = 0
> 15694:  brk(0x08064758)                                 = 0
> 15694:  xstat(2, "/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", 0x08046D08) = 015694:  resolvepath("/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", "/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", 1023) = 44
> 15694:  open("/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", O_RDONLY) = 3
> 15694:  mmap(0x00010000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xFEE90000
> 15694:  mmap(0x00010000, 2297856, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFEC00000
> 15694:  mmap(0xFEC00000, 2225278, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFEC00000
> 15694:  mmap(0xFEE2F000, 4234, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 2224128) = 0xFEE2F000
> 15694:  munmap(0xFEE20000, 61440)                       = 0
> 15694:  memcntl(0xFEC00000, 7188, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
> 15694:  close(3)                                        = 0
> 15694:  xstat(2, "/usr/lib/locale/en_US.UTF-8/methods_en_US.UTF-8.so.3", 0x08046C60) = 0
> 15694:  resolvepath("/usr/lib/locale/en_US.UTF-8/methods_en_US.UTF-8.so.3", "/usr/lib/locale/common/methods_unicode.so.3", 1023) = 43
> 15694:  open("/usr/lib/locale/en_US.UTF-8/methods_en_US.UTF-8.so.3", O_RDONLY) = 3
> 15694:  mmap(0xFEE90000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFEE90000
> 15694:  mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFEE80000
> 15694:  mmap(0x00010000, 122880, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFEE60000
> 15694:  mmap(0xFEE60000, 55437, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFEE60000
> 15694:  mmap(0xFEE7D000, 2524, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 53248) = 0xFEE7D000
> 15694:  munmap(0xFEE6E000, 61440)                       = 0
> 15694:  memcntl(0xFEE60000, 2532, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
> 15694:  close(3)                                        = 0
> 15694:  xstat(2, "/usr/lib/locale/en_US.UTF-8/libc.so.1", 0x08046C60) Err#2 ENOENT
> 15694:  munmap(0xFEE90000, 32768)                       = 0
> 15694:  sysconfig(_CONFIG_PAGESIZE)                     = 4096
> 
> FREEZE
> 
> On a machine which has not had the problem (yet...) the output continues
> ...
> 24608:  sysconfig(_CONFIG_PAGESIZE)                     = 4096
> 24608:  stat64("/afs/not-here", 0x08047C90)             Err#2 ENOENT
> 24608:  creat64("/afs/not-here", 0666)                  Err#30 EROFS
> 24608:  open("/usr/lib/locale/en_US.UTF-8/LC_MESSAGES/SUNW_OST_OSCMD.mo", O_RDONLY) Err#2 ENOENT
> 24608:  fstat64(2, 0x08046EA0)                          = 0
> 24608:  write(2, " t o u c h", 5)                       = 5
> 24608:  write(2, " :  ", 2)                             = 2
> 24608:  write(2, " / a f s / n o t - h e r".., 13)      = 13
> 24608:  write(2, "   c a n n o t   c r e a".., 15)      = 15
> 24608:  _exit(1)
> 
> In other words, the stat64 call accesses AFS and (on the machine
> with the problem), the thread gets stuck in the AFS tarbaby.
> 
> I suspected it was due to logging, so I changed the configuration to
> mount a dedicated partition for /usr/vice/cache, and rebooted.  The '
> machine was fine for a month or two, but problem has re-occurred.
> 
> The machine is used frequently (it's our main computer server for
> undergraduate classes) but "fortunately" AFS is not very popular here
> so most courses don't use it (partly because of nasty things like
> this happening now and then) and so the machine is still being used
> for non-AFS courses.  Hence I hadn't tried to install a newer version
> of OpenAFS.  If this is a known bug with OpenAFS, I will indeed 
> ask them to take the machine offline long enough to fix this.
> (Political capital and all; I hope people understand.)
> 
> I haven't tried to reproduce this bug (and wouldn't want to
> on the computer server!): it only seems to happen on these
> main compute servers -- never on my little research machines... :-(
> 
> Any help would be appreciated.
> 
> John
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
> 
> 

-- 

  Douglas E. Engert  <DEEngert@anl.gov>
  Argonne National Laboratory
  9700 South Cass Avenue
  Argonne, Illinois  60439
  (630) 252-5444