[OpenAFS-devel] JAFS and UNLOCK_GLOBAL_MUTEX

Peter Somogyi psomogyi@gamax.hu
Tue, 20 Sep 2005 18:09:09 +0200


Hi,

I'm invoking some OpenAFS API methods via JAFS from multiple Java threads.
The Java VM and JAFS use the same /lib/tls/libpthread.so.0 library.
At random times the process crashes with an assertion failure in libadmin, at various places where the UNLOCK_GLOBAL_MUTEX macro is called.
Assertion sites seen so far:
: Assertion failed! file ../auth/cellconfig.c, line 968.
: Assertion failed! file afs_clientAdmin.c, line 1968.
: Assertion failed! file afs_utilAdmin.c, line 460

I've put a printf trace into the else block of pthread_recursive_mutex_unlock in src/util/pthread_glock.c, and in every case I get:
pthread_mutex_unlock else: mut->locked: 1, mut->times_inside: 2, mut->owner: 1133333424, pthread_self: 1133067184
(so the owner and the calling thread differ, and the function sets rc=-1, which triggers the assert)
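
To illustrate the kind of check that fails here, below is a minimal sketch of a hand-rolled recursive mutex whose unlock refuses callers that are not the recorded owner. It is not the OpenAFS source: the field names (locked, times_inside, owner) come from my trace output above, while the sketch_ names, the helper thread and the exact control flow are assumptions made for illustration only.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for pthread_recursive_mutex_t. Only the field
 * names are taken from the trace; everything else is assumed. */
typedef struct {
    pthread_mutex_t mut;
    pthread_t owner;
    int locked;
    int times_inside;
} sketch_rmutex_t;

static int sketch_rmutex_init(sketch_rmutex_t *m)
{
    m->locked = 0;
    m->times_inside = 0;
    return pthread_mutex_init(&m->mut, NULL);
}

static int sketch_rmutex_lock(sketch_rmutex_t *m)
{
    int rc = 0;

    /* Bookkeeping fields are read here without holding m->mut. */
    if (m->locked && pthread_equal(m->owner, pthread_self())) {
        m->times_inside++;          /* nested acquisition by the owner */
        return rc;
    }
    rc = pthread_mutex_lock(&m->mut);
    if (rc == 0) {
        m->times_inside = 1;
        m->owner = pthread_self();
        m->locked = 1;
    }
    return rc;
}

static int sketch_rmutex_unlock(sketch_rmutex_t *m)
{
    int rc = 0;

    if (m->locked && pthread_equal(m->owner, pthread_self())) {
        if (--m->times_inside == 0) {
            m->locked = 0;
            rc = pthread_mutex_unlock(&m->mut);
        }
    } else {
        rc = -1;                    /* caller is not the recorded owner */
    }
    return rc;
}

/* Helper: call unlock from a second thread and report its return code. */
struct sketch_probe {
    sketch_rmutex_t *m;
    int rc;
};

static void *sketch_probe_main(void *arg)
{
    struct sketch_probe *p = arg;
    p->rc = sketch_rmutex_unlock(p->m);
    return NULL;
}

static int sketch_unlock_from_other_thread(sketch_rmutex_t *m)
{
    struct sketch_probe p = { m, 0 };
    pthread_t t;

    if (pthread_create(&t, NULL, sketch_probe_main, &p) != 0)
        return -2;                  /* could not start the helper thread */
    pthread_join(t, NULL);
    return p.rc;
}
```

If one thread takes the lock twice and sketch_unlock_from_other_thread() is then called, the helper thread gets -1 while times_inside is still 2, which is the same state my trace shows. The sketch only demonstrates the failing check; it does not explain how a thread ends up in the unlock path without being the owner.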

The error occurs rarely, but when it does, it shows up within 10 minutes to 2 hours of running.

I've written a multithreaded C test against the OpenAFS API, but the error has not shown up there (so far...).
(Note: I've read that the Java VM suspends threads very often; perhaps that's why I'm getting it _only_ from my Java test.)

I have 2 suspicions:

(1) pthread_glock.c/.h is buggy: the members of struct pthread_recursive_mutex_t are not declared "volatile". I'm still testing this case...
(In theory it looks like a real error to me, but I first have to test extensively whether making them "volatile" solves my problem...)

(2) the Java VM's thread handling is not 100% compatible(?) with OpenAFS's use of pthreads (which would be strange, because they _seem_ to work together well most of the time).
Also, I've read that if the Java VM and the C part use the same pthread implementation, they should work together.

I was able to reproduce the error on 2 platforms:
- SLES9, 2.6.5-7.193-smp, i686, Classic VM (build 1.4.2, J2RE 1.4.2 IBM build cxia321420-20040626 (JIT enabled: jitc)), libpthread: NPTL, 2 cpu, glibc-2.3.3-98.47
- SuSE8, 2.6.11.5, i686, Java HotSpot(TM) Client VM (build 1.4.2-b28, mixed mode), libpthread: LinuxThreads, 1 cpu, glibc-2.3.2-88
(openafs-1.3.87)

If anybody can confirm or reject either of my suspicions above, or has experienced the same problem, please tell me.
Thank you in advance.

Note 1: I've opened a ticket about this in RT: #21526
Note 2: using pthread's native recursive mutex implementation _seems_ to solve this problem (after 1-2 tests), and it is probably a faster and better-tested implementation. Why not use it?
Note 3: an _example_ of a _partial_ stack trace, just to help illustrate the problem:
...
3HPNATIVESTACK         Native Stack of "Thread-2" PID 7065
NULL                   -------------------------
3HPSTACKLINE            FFFFE410
3HPSTACKLINE            abort at 40073CE9 in libc.so.6
3HPSTACKLINE            ?? at 433E121E in libjafsadm.so
3HPSTACKLINE            util_AdminServerAddressGetFromName at 433B1F62 in libjafsadm.so
3HPSTACKLINE            bos_ServerOpen at 433A3D70 in libjafsadm.so
3HPSTACKLINE            Java_org_openafs_jafs_Server_getBosServerHandle at 4338340E in libjafsadm.so
3HPSTACKLINE            431FA9C8
3HPSTACKLINE            mmipExecuteJava at 402F3D03 in libjvm.so
3HPSTACKLINE            438D2B58
...
(I have more javacore files containing something like the excerpt above)
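
Regarding Note 2, here is a minimal sketch of what I mean by the native recursive mutex: a plain pthread_mutex_t initialised with the PTHREAD_MUTEX_RECURSIVE type. The names native_glock and native_glock_init are hypothetical and are not taken from the OpenAFS tree; how this would be wired into the global-mutex macros is left open.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* may be needed for PTHREAD_MUTEX_RECURSIVE */
#endif
#include <assert.h>
#include <pthread.h>

/* Hypothetical global lock; the name is an assumption for this sketch. */
static pthread_mutex_t native_glock;

/* Initialise native_glock as a recursive mutex.
 * Returns 0 on success or a pthread error number. */
static int native_glock_init(void)
{
    pthread_mutexattr_t attr;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0)
        rc = pthread_mutex_init(&native_glock, &attr);

    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

After native_glock_init(), the owning thread can nest pthread_mutex_lock(&native_glock) calls and must issue a matching number of pthread_mutex_unlock() calls. With this mutex type an unlock by a thread that does not hold the lock returns an error code, so the owner and nesting bookkeeping is done inside the pthread library instead of in separate unprotected fields.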

-- 
Peter Somogyi
Software Developer, Gamax Ltd.
1114 Budapest, Bartok B. u 15/d
Tel.: +36-1-381-0544
e-mail: psomogyi@gamax.hu