[OpenAFS] Re: Linux server/client hangs and crashes

Systems Administration sysadmin@contrailservices.com
Wed, 11 Aug 2004 10:39:36 -0600

>> I have tried using the lwp version of fileserver, wrapped the 
>> pthreads version with LD_ASSUME_KERNEL=2.4.1, and done tcpdump, 
>> cmdebug, fstrace, etc. etc. ad nauseam.
> You actually copied the binary out of src/lwp/fileserver, and didn't 
> just use what was as you said "the only fileserver"?

Yes - I compiled the entire source tree and then grabbed both 
copies of the fileserver binary, src/viced/fileserver and 
src/tviced/fileserver. I placed both in /usr/afs/bin/ as 
fileserver.lwp and fileserver.pthreads, and added a shell script wrapper 
that sets LD_ASSUME_KERNEL and calls the pthreads variant.
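
A wrapper of the kind described above might look roughly like the sketch below. It is not the script actually used on the server: the function name launch_fileserver and the FILESERVER_BIN override are illustrative assumptions, and only the binary path and the LD_ASSUME_KERNEL=2.4.1 value come from this thread.

```shell
#!/bin/sh
# Hypothetical wrapper sketch; not the script actually used on the server.
# FILESERVER_BIN and launch_fileserver are illustrative names.

# Path to the pthreads build, as installed in the message above.
FILESERVER_BIN="${FILESERVER_BIN:-/usr/afs/bin/fileserver.pthreads}"

launch_fileserver() {
    # The assignment prefix puts LD_ASSUME_KERNEL into the environment of
    # this one command, and all wrapper arguments are passed through as-is.
    LD_ASSUME_KERNEL=2.4.1 "$FILESERVER_BIN" "$@"
}
```

A real wrapper would end by calling launch_fileserver "$@", most likely with exec in front of the binary so that no extra shell process lingers between the supervising process and the fileserver.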

I repeated the same test case on both flavors and saw the same symptoms.  
Mass write/copy/delete operations hang; some single-file 
write/read/delete operations proceed but sporadically hang as well.  
The client shows an active cache entry with cmdebug, the server reports 
itself to be up, and its processes appear normal.  Other machines can 
access the cell in question and read/write to the same volume.

>> When I tried to use fstrace on the fileserver - bam kernel panics 
>> right and left - I'm trying to setup a serial console to capture 
>> these now.
> fstrace traces the client. the problem is in the fileserver. you're 
> barking up the wrong tree.

I know - I was attempting to run the test from the fileserver machine 
back to itself as a client.  One of the problems with the Mac OS X 
install of OpenAFS is that it is missing certain tools, and fstrace is 
one of them.  Since the symptoms appeared on Linux as well, I chose to 
debug from a Linux box to a Linux box.  That's when my to-date 
rock-solid fileserver (running for over a year without a crash) went 
belly up as I tried to run the AFS tests: three identical oops panics, 
and the fun of watching my RAID-5 rebuild all day.

At this point I have no idea how to track down the source of the 
problem.  Could it help to downgrade GCC and glibc to versions with a 
known-good pthreads library?

Are there any known issues with 2.4.x kernels, or with the grsecurity patch?