[OpenAFS-devel] Linux deadlocks (possibly fixed in IBM-AFS)
Broughton, Travis V
tvb@intel.com
Wed, 3 Jul 2002 07:12:49 -0700
We've been running into some bugs in 1.2.5 that are causing deadlocks and
hangs on the Linux client. Unlike most AFS deadlocks I've seen, the system
load average goes to zero rather than steadily increasing. We believe this
behavior to have been fixed in the most recent IBM-AFS release, namely by
the following deltas:
srikanth-IY31752-afs3.6-race-condition-in-afs-buffer-cache 1.2
Fix race condition in function afs_newslot(). This function is used
to recycle buffers based on the buffer reference count and the
buffer age. This function used to check the buffer reference count
without locking it. The result was that buffers that were in use
would also be recycled.
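To illustrate the class of bug this delta describes, here is a minimal userspace sketch. It is not the IBM or OpenAFS code: the names pool, pool_lock, buf_hold, buf_release and newslot are hypothetical, and a pthread mutex stands in for whatever lock protects the real buffer cache. The point is only that the reference-count check and the claim of the buffer happen under the same lock.

```c
#include <assert.h>
#include <pthread.h>

#define NBUF 4

/* Hypothetical buffer header; not the real AFS structure. */
struct buf {
    int refcount;       /* number of active users */
    unsigned long age;  /* last-use tick; smaller means older */
};

struct buf pool[NBUF];
pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
unsigned long clock_tick;

/* Take a reference on buffer i and mark it as recently used. */
void buf_hold(int i)
{
    pthread_mutex_lock(&pool_lock);
    pool[i].refcount++;
    pool[i].age = ++clock_tick;
    pthread_mutex_unlock(&pool_lock);
}

/* Drop a reference on buffer i. */
void buf_release(int i)
{
    pthread_mutex_lock(&pool_lock);
    assert(pool[i].refcount > 0);
    pool[i].refcount--;
    pthread_mutex_unlock(&pool_lock);
}

/*
 * Pick the oldest buffer that nobody is using and hand it out with one
 * reference held. Returns its index, or -1 if every buffer is in use.
 * The refcount test and the claim both happen under pool_lock, so a
 * buffer cannot gain a user between the check and the recycle.
 */
int newslot(void)
{
    int victim = -1;

    pthread_mutex_lock(&pool_lock);
    for (int i = 0; i < NBUF; i++) {
        if (pool[i].refcount != 0)
            continue;   /* in use: never recycle */
        if (victim < 0 || pool[i].age < pool[victim].age)
            victim = i;
    }
    if (victim >= 0) {
        pool[victim].refcount = 1;
        pool[victim].age = ++clock_tick;
    }
    pthread_mutex_unlock(&pool_lock);
    return victim;
}
```

If the refcount were read before taking pool_lock, another thread could call buf_hold() between the check and the claim, which matches the symptom described above of in-use buffers being recycled.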
and
srikanth-12885-afs3.6-race.condition.in.linux.event.handling 1.5
Fix another race condition in the event handling code. This race occurs
because the operation of dropping GLOCK and going to sleep is not
atomic. This gives an opportunity for another thread to grab GLOCK
and call wake_up before the first thread actually goes to sleep. The
result is a lost wake up.
The new code does not rely on sleep_on(). Instead it manually adds
the thread to the wait queues and changes the process state to
"sleeping" before it drops GLOCK. It then drops GLOCK and invokes
the scheduler. Doing it this way allows us to control the process
state more precisely and avoids this race condition.
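The same lost-wakeup pattern can be shown with a userspace analogy. This is only a sketch under stated assumptions, not the kernel fix itself: struct event and the event_* functions are hypothetical names, the mutex plays the role of GLOCK, and pthread_cond_wait() provides the "release the lock and go to sleep" step atomically. The signaled flag is checked under the lock, so a wake-up that arrives before the waiter sleeps is not lost.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for GLOCK plus one event; not OpenAFS code. */
struct event {
    pthread_mutex_t lock;      /* plays the role of GLOCK */
    pthread_cond_t  cond;      /* plays the role of the wait queue */
    bool            signaled;  /* records a wake-up that has been posted */
};

void event_init(struct event *ev)
{
    assert(ev != NULL);
    pthread_mutex_init(&ev->lock, NULL);
    pthread_cond_init(&ev->cond, NULL);
    ev->signaled = false;
}

/*
 * Block until the event has been signaled, then consume the signal.
 * pthread_cond_wait() releases the mutex and puts the caller to sleep
 * as one atomic step, so a waker cannot slip in between the two.
 */
void event_wait(struct event *ev)
{
    assert(ev != NULL);
    pthread_mutex_lock(&ev->lock);
    while (!ev->signaled)
        pthread_cond_wait(&ev->cond, &ev->lock);
    ev->signaled = false;
    pthread_mutex_unlock(&ev->lock);
}

/* Post the event and wake one waiter, if any. */
void event_signal(struct event *ev)
{
    assert(ev != NULL);
    pthread_mutex_lock(&ev->lock);
    ev->signaled = true;
    pthread_cond_signal(&ev->cond);
    pthread_mutex_unlock(&ev->lock);
}
```

Calling event_signal() before event_wait() on the same event makes event_wait() return immediately instead of sleeping forever, which is the ordering that a plain "drop lock, then sleep" sequence gets wrong.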
Has anyone else run into these issues in OpenAFS? Have fixes analogous to
the above been incorporated into OpenAFS? I can provide kdumps and other
debug info if that would help to narrow down the source of the problem.
Thanks,
-tvb