[OpenAFS-devel] Linux deadlocks (possibly fixed in IBM-AFS)

Derrick J Brashear shadow@dementia.org
Thu, 4 Jul 2002 01:43:05 -0400 (EDT)


On Wed, 3 Jul 2002, Broughton, Travis V wrote:

> 
> We've been running into some bugs in 1.2.5 that are causing deadlocks and
> hangs on the Linux client.  Unlike most AFS deadlocks I've seen, the system
> load average goes to zero rather than steadily increasing.  We believe this 
> behavior to have been fixed in the most recent IBM-AFS release, namely by 
> the following deltas:
> 
> 	srikanth-IY31752-afs3.6-race-condition-in-afs-buffer-cache 1.2
> 
> 	Fix race condition in function afs_newslot(). This function is used
> 	to recycle buffers based on the buffer reference count and the
> 	buffer age. This function used to check the buffer reference count 
> 	without locking it. The result was that buffers that were in use 
> 	would also be recycled.
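
For illustration, here is a minimal user-space sketch of the pattern that
delta describes: the reference count is checked, and the buffer claimed,
while a lock is held, so a buffer that is in use cannot be picked for
recycling. The struct, the lock, and the names newslot() and release_slot()
are hypothetical stand-ins built on pthreads, not the actual afs_newslot()
code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NBUFFERS 4

/* Hypothetical buffer header; not the real AFS structure. */
struct buf {
    int refcount;       /* > 0 means the buffer is in use */
    unsigned long age;  /* smaller value = less recently used */
};

static struct buf buffers[NBUFFERS];
static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;

/* Pick the oldest unreferenced buffer and claim it, or return NULL. */
struct buf *newslot(void)
{
    struct buf *victim = NULL;

    pthread_mutex_lock(&buf_lock);
    for (size_t i = 0; i < NBUFFERS; i++) {
        struct buf *b = &buffers[i];
        if (b->refcount != 0)
            continue;           /* in use: must not be recycled */
        if (victim == NULL || b->age < victim->age)
            victim = b;
    }
    if (victim != NULL)
        victim->refcount = 1;   /* claim it before dropping the lock */
    pthread_mutex_unlock(&buf_lock);
    return victim;
}

/* Drop a reference taken by newslot(). */
void release_slot(struct buf *b)
{
    pthread_mutex_lock(&buf_lock);
    assert(b->refcount > 0);    /* releasing an unheld buffer is a bug */
    b->refcount--;
    pthread_mutex_unlock(&buf_lock);
}
```

Checking refcount outside buf_lock would reopen the window in which another
thread takes a reference between the check and the reuse.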

I believe this was actually a fileserver fix that came from me.

> and
> 
> 	srikanth-12885-afs3.6-race.condition.in.linux.event.handling 1.5
> 
> 	Fix another race condition in the event handling code. This race is
> 	because the operation of dropping GLOCK and going to sleep is not
> 	atomic. This gives an opportunity for another thread to grab GLOCK
> 	and call wake_up before the first thread actually goes to sleep. The
> 	result is a lost wake up.
> 
> 	The new code does not rely on sleep_on(). Instead it manually adds
> 	the thread to the wait queues and changes the process state to
> 	"sleeping" before it drops GLOCK. It then drops GLOCK and invokes
> 	the scheduler. Doing it this way allows us to control the process
> 	state more precisely and avoids this race condition.
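
The same lost wake-up can be shown outside the kernel. The sketch below is a
user-space analogue, assuming pthreads: pthread_cond_wait() releases the
mutex and blocks as one atomic step, which gives the same guarantee as
queueing the thread and marking it sleeping before GLOCK is dropped. The
names glock, event_cv, event_posted, wait_for_event() and post_event() are
made up for the example and are not from the AFS source.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for GLOCK; hypothetical, for illustration only. */
static pthread_mutex_t glock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t event_cv = PTHREAD_COND_INITIALIZER;
static bool event_posted = false;

/* Sleep until the event has been posted, then consume it. */
void wait_for_event(void)
{
    pthread_mutex_lock(&glock);
    /*
     * The predicate is tested with the lock held, and pthread_cond_wait()
     * unlocks and blocks atomically, so a post_event() call cannot slip in
     * between "decide to sleep" and "actually asleep".
     */
    while (!event_posted)
        pthread_cond_wait(&event_cv, &glock);
    assert(event_posted);       /* we only leave the loop once posted */
    event_posted = false;
    pthread_mutex_unlock(&glock);
}

/* Post the event and wake one waiter, if any. */
void post_event(void)
{
    pthread_mutex_lock(&glock);
    event_posted = true;
    pthread_cond_signal(&event_cv);
    pthread_mutex_unlock(&glock);
}
```

Because the flag is recorded under the lock, a post that arrives before the
waiter sleeps is still seen: wait_for_event() finds event_posted already set
and returns without blocking.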

And this is probably from the latest go-around of patches, which will be in
OpenAFS 1.2.6.

> Has anyone else run into these issues in OpenAFS?  Have fixes analogous to
> the above been incorporated into OpenAFS?  I can provide kdumps and other 
> debug info if that would help to narrow down the source of the problem.