[OpenAFS] deadlock in OpenAFS 1.4.11 (Solaris 5.10)

Derrick Brashear shadow@gmail.com
Fri, 9 Apr 2010 12:26:26 -0400


cmdebug or it didn't happen.

On Fri, Apr 9, 2010 at 12:08 PM, John Tang Boyland
<boyland@pabst.cs.uwm.edu> wrote:
> We get an occasional deadlock happening on Solaris 5.10 using
> OpenAFS 1.4.11. =A0After the problem starts, any attempt to use AFS
> on the machine freezes: =A0For example:
>
> % truss -f touch /afs/not-here
> 15694: =A0execve("/usr/bin/touch", 0x08047E20, 0x08047E2C) =A0argc =3D 2
> 15694: =A0resolvepath("/usr/lib/ld.so.1", "/lib/ld.so.1", 1023) =3D 12
> 15694: =A0resolvepath("/usr/bin/touch", "/usr/bin/touch", 1023) =3D 14
> 15694: =A0sysconfig(_CONFIG_PAGESIZE) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =3D 4096
> 15694: =A0xstat(2, "/usr/bin/touch", 0x08047BF8) =A0 =A0 =A0 =A0 =A0=3D 0
> 15694: =A0open("/var/ld/ld.config", O_RDONLY) =A0 =A0 =A0 =A0 =A0 =A0 =3D=
 3
> 15694: =A0fxstat(2, 3, 0x08047B38) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0=3D 0
> 15694: =A0mmap(0x00000000, 128104, PROT_READ, MAP_SHARED, 3, 0) =3D 0xFEF=
A1000
> 15694: =A0close(3) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0 =A0 =A0 =A0 =A0=3D 0
> 15694: =A0mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIV=
ATE|MAP_ANON, -1, 0) =3D 0xFEF90000
> 15694: =A0xstat(2, "/lib/libc.so.1", 0x08047440) =A0 =A0 =A0 =A0 =A0=3D 0
> 15694: =A0resolvepath("/lib/libc.so.1", "/lib/libc.so.1", 1023) =3D 14
> 15694: =A0open("/lib/libc.so.1", O_RDONLY) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0=3D 3
> 15694: =A0mmap(0x00010000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_AL=
IGN, 3, 0) =3D 0xFEF80000
> 15694: =A0mmap(0x00010000, 880640, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|M=
AP_ANON|MAP_ALIGN, -1, 0) =3D 0xFEEA0000
> 15694: =A0mmap(0xFEEA0000, 775469, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_F=
IXED|MAP_TEXT, 3, 0) =3D 0xFEEA0000
> 15694: =A0mmap(0xFEF6E000, 26855, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_F=
IXED|MAP_INITDATA, 3, 778240) =3D 0xFEF6E000
> 15694: =A0mmap(0xFEF75000, 5016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FI=
XED|MAP_ANON, -1, 0) =3D 0xFEF75000
> 15694: =A0munmap(0xFEF5E000, 65536) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =3D 0
> 15694: =A0memcntl(0xFEEA0000, 123376, MC_ADVISE, MADV_WILLNEED, 0, 0) =3D=
 0
> 15694: =A0close(3) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0 =A0 =A0 =A0 =A0=3D 0
> 15694: =A0munmap(0xFEF80000, 32768) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =3D 0
> 15694: =A0mmap(0x00010000, 24576, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRI=
VATE|MAP_ANON|MAP_ALIGN, -1, 0) =3D 0xFEF80000
> 15694: =A0getcontext(0x080479B0)
> 15694: =A0getrlimit(RLIMIT_STACK, 0x080479A8) =A0 =A0 =A0 =A0 =A0 =A0 =3D=
 0
> 15694: =A0getpid() =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0 =A0 =A0 =A0 =A0=3D 15694 [15692]
> 15694: =A0lwp_private(0, 1, 0xFEF82000) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =3D 0x000001C3
> 15694: =A0setustack(0xFEF82060)
> 15694: =A0sysi86(SI86FPSTART, 0xFEF75A58, 0x0000133F, 0x00001F80) =3D 0x0=
0000001
> 15694: =A0brk(0x08062758) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0 =3D 0
> 15694: =A0brk(0x08064758) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0 =3D 0
> 15694: =A0xstat(2, "/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", 0x0804=
6D08) =3D 015694: =A0resolvepath("/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.s=
o.3", "/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", 1023) =3D 44
> 15694: =A0open("/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", O_RDONLY) =
=3D 3
> 15694: =A0mmap(0x00010000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_AL=
IGN, 3, 0) =3D 0xFEE90000
> 15694: =A0mmap(0x00010000, 2297856, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|=
MAP_ANON|MAP_ALIGN, -1, 0) =3D 0xFEC00000
> 15694: =A0mmap(0xFEC00000, 2225278, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_=
FIXED|MAP_TEXT, 3, 0) =3D 0xFEC00000
> 15694: =A0mmap(0xFEE2F000, 4234, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIV=
ATE|MAP_FIXED|MAP_INITDATA, 3, 2224128) =3D 0xFEE2F000
> 15694: =A0munmap(0xFEE20000, 61440) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =3D 0
> 15694: =A0memcntl(0xFEC00000, 7188, MC_ADVISE, MADV_WILLNEED, 0, 0) =3D 0
> 15694: =A0close(3) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0 =A0 =A0 =A0 =A0=3D 0
> 15694: =A0xstat(2, "/usr/lib/locale/en_US.UTF-8/methods_en_US.UTF-8.so.3"=
, 0x08046C60) =3D 0
> 15694: =A0resolvepath("/usr/lib/locale/en_US.UTF-8/methods_en_US.UTF-8.so=
.3", "/usr/lib/locale/common/methods_unicode.so.3", 1023) =3D 43
> 15694: =A0open("/usr/lib/locale/en_US.UTF-8/methods_en_US.UTF-8.so.3", O_=
RDONLY) =3D 3
> 15694: =A0mmap(0xFEE90000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FI=
XED, 3, 0) =3D 0xFEE90000
> 15694: =A0mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIV=
ATE|MAP_ANON, -1, 0) =3D 0xFEE80000
> 15694: =A0mmap(0x00010000, 122880, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|M=
AP_ANON|MAP_ALIGN, -1, 0) =3D 0xFEE60000
> 15694: =A0mmap(0xFEE60000, 55437, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FI=
XED|MAP_TEXT, 3, 0) =3D 0xFEE60000
> 15694: =A0mmap(0xFEE7D000, 2524, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIV=
ATE|MAP_FIXED|MAP_INITDATA, 3, 53248) =3D 0xFEE7D000
> 15694: =A0munmap(0xFEE6E000, 61440) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =3D 0
> 15694: =A0memcntl(0xFEE60000, 2532, MC_ADVISE, MADV_WILLNEED, 0, 0) =3D 0
> 15694: =A0close(3) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0 =A0 =A0 =A0 =A0=3D 0
> 15694: =A0xstat(2, "/usr/lib/locale/en_US.UTF-8/libc.so.1", 0x08046C60) E=
rr#2 ENOENT
> 15694: =A0munmap(0xFEE90000, 32768) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =3D 0
> 15694: =A0sysconfig(_CONFIG_PAGESIZE) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =3D 4096
>
> FREEZE
>
> On a machine which has not had the problem (yet...) the output continues
> ...
> 24608: =A0sysconfig(_CONFIG_PAGESIZE) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =3D 4096
> 24608: =A0stat64("/afs/not-here", 0x08047C90) =A0 =A0 =A0 =A0 =A0 =A0 Err=
#2 ENOENT
> 24608: =A0creat64("/afs/not-here", 0666) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0Err#30 EROFS
> 24608: =A0open("/usr/lib/locale/en_US.UTF-8/LC_MESSAGES/SUNW_OST_OSCMD.mo=
", O_RDONLY) Err#2 ENOENT
> 24608: =A0fstat64(2, 0x08046EA0) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0=3D 0
> 24608: =A0write(2, " t o u c h", 5) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =3D 5
> 24608: =A0write(2, " : =A0", 2) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0 =A0 =3D 2
> 24608: =A0write(2, " / a f s / n o t - h e r".., 13) =A0 =A0 =A0=3D 13
> 24608: =A0write(2, " =A0 c a n n o t =A0 c r e a".., 15) =A0 =A0 =A0=3D 1=
5
> 24608: =A0_exit(1)
>
> In other words, the stat64 call accesses AFS and (on the machine
> with the problem), the thread gets stuck in the AFS tarbaby.
>
> I suspected it was due to logging, so I changed the configuration to
> mount a dedicated partition for /usr/vice/cache, and rebooted. =A0The '
> machine was fine for a month or two, but problem has re-occurred.
>
> The machine is used frequently (it's our main computer server for
> undergraduate classes) but "fortunately" AFS is not very popular here
> so most courses don't use it (partly because of nasty things like
> this happening now and then) and so the machine is still being used
> for non-AFS courses. =A0Hence I hadn't tried to install a newer version
> of OpenAFS. =A0If this is a known bug with OpenAFS, I will indeed
> ask them to take the machine offline long enough to fix this.
> (Political capital and all; I hope people understand.)
>
> I haven't tried to reproduce this bug (and wouldn't want to
> on the computer server!): it only seems to happen on these
> main compute servers -- never on my little research machines... :-(
>
> Any help would be appreciated.
>
> John
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info
>



--=20
Derrick