[OpenAFS-devel] Linux deadlocks (possibly fixed in IBM-AFS)

Broughton, Travis V tvb@intel.com
Wed, 3 Jul 2002 07:12:49 -0700


We've been running into some bugs in 1.2.5 that are causing deadlocks and
hangs on the Linux client.  Unlike most AFS deadlocks I've seen, the system
load average goes to zero rather than steadily increasing.  We believe this 
behavior to have been fixed in the most recent IBM-AFS release, namely by 
the following deltas:

	srikanth-IY31752-afs3.6-race-condition-in-afs-buffer-cache 1.2

	Fix race condition in function afs_newslot(). This function is used
	to recycle buffers based on the buffer reference count and the
	buffer age. This function used to check the buffer reference count 
	without locking it. The result was that buffers that were in use 
	would also be recycled.

and

	srikanth-12885-afs3.6-race.condition.in.linux.event.handling 1.5

	Fix another race condition in the event handling code. This race is
	because the operation of dropping GLOCK and going to sleep is not
	atomic. This gives an opportunity for another thread to grab GLOCK
	and call wake_up before the first thread actually goes to sleep. The
	result is a lost wake up.

	The new code does not rely on sleep_on(). Instead it manually adds
	the thread to the wait queues and changes the process state to
	"sleeping" before it drops GLOCK. It then drops GLOCK and invokes
	the scheduler. Doing it this way allows us to control the process
	state more precisely and avoids this race condition.
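
For anyone not familiar with the lost wake-up pattern described in the second
delta, here is a rough user-space analogy using POSIX threads. It is only a
sketch and is not taken from the IBM or OpenAFS sources: the names glock,
event_cv, event_fired, waiter_woke and run_wakeup_demo() are all made up for
illustration, and pthread_cond_wait() stands in for the kernel's "queue
yourself, mark yourself sleeping, then drop the lock and schedule" sequence.
The point it tries to show is that releasing the lock and going to sleep must
happen as one step, and that the waiter re-checks a flag set under the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins: "glock" plays the role of GLOCK and "event_fired"
 * is the condition being waited for. None of these names come from AFS. */
static pthread_mutex_t glock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t event_cv = PTHREAD_COND_INITIALIZER;
static bool event_fired = false;
static bool waiter_woke = false;

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&glock);
    /* pthread_cond_wait() releases glock and blocks as one atomic step, so
     * a signal cannot slip in between "unlock" and "sleep". The loop
     * re-checks the flag to cope with spurious wake-ups and with a signal
     * that was sent before this thread got here. */
    while (!event_fired)
        pthread_cond_wait(&event_cv, &glock);
    waiter_woke = true;
    pthread_mutex_unlock(&glock);
    return NULL;
}

static void *waker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&glock);
    event_fired = true;              /* record the event under the lock */
    pthread_cond_signal(&event_cv);  /* then wake the waiter, if any */
    pthread_mutex_unlock(&glock);
    return NULL;
}

/* Runs one waiter/waker pair; returns true if the waiter was woken. */
static bool run_wakeup_demo(void)
{
    pthread_t w, k;
    int rc;

    event_fired = false;
    waiter_woke = false;
    rc = pthread_create(&w, NULL, waiter, NULL);
    assert(rc == 0);
    rc = pthread_create(&k, NULL, waker, NULL);
    assert(rc == 0);
    pthread_join(k, NULL);
    pthread_join(w, NULL);
    return waiter_woke;
}
```

Calling run_wakeup_demo() from a small main() (built with -pthread) should
return true regardless of which thread runs first. If the waiter instead
unlocked glock and then slept in two separate steps without re-checking the
flag, the waker could run in the gap and the waiter would sleep forever,
which matches the load-average-zero hang described above.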

Has anyone else run into these issues in OpenAFS?  Have fixes analogous to
the above been incorporated into OpenAFS?  I can provide kdumps and other
debug info if that would help to narrow down the source of the problem.

Thanks,
-tvb