[OpenAFS-port-freebsd] Client deadlock?

Benjamin Kaduk kaduk@MIT.EDU
Thu, 31 Mar 2011 13:06:29 -0400 (EDT)


Was going to send this yesterday, but fell asleep before I got everything 
written up...

On Wed, 30 Mar 2011, Garrett Wollman wrote:

> Running Ben's package of 1.6.0pre4, I have found bonnie++ to be a
> sure-fire way of deadlocking (?) the client.  The following client
> processes are running:
>
>    0  1121     1   0  44  0  5832  1768 afsslp Ds    ??    0:00.01 /usr/local/sbin/afsd -stat 2800 -daemons 6 -volumes 128 -dynroot -fakestat-all -afsdb -memcache
>    0  1117     1   0  46  0  5832  1556 sbwait IL     0    5:39.20 /usr/local/sbin/afsd -stat 2800 -daemons 6 -volumes 128 -dynroot -fakestat-all -afsdb -memcache
>    0  1118     1   0  76  0  5832  1556 afs_rx DL     0    0:00.00 /usr/local/sbin/afsd -stat 2800 -daemons 6 -volumes 128 -dynroot -fakestat-all -afsdb -memcache
>    0  1119     1   0  44  0  5832  1556 afswai DL     0    0:00.11 /usr/local/sbin/afsd -stat 2800 -daemons 6 -volumes 128 -dynroot -fakestat-all -afsdb -memcache
>    0  1122     1   0  44  0  5832  1556 afscon DL     0    0:00.73 /usr/local/sbin/afsd -stat 2800 -daemons 6 -volumes 128 -dynroot -fakestat-all -afsdb -memcache
>    0  1123     1   0  76  0  5832  1556 afscon DL     0    0:00.04 /usr/local/sbin/afsd -stat 2800 -daemons 6 -volumes 128 -dynroot -fakestat-all -afsdb -memcache
>    0  1124     1   0  48  0  5832  1556 afsslp DL     0    7:07.72 /usr/local/sbin/afsd -stat 2800 -daemons 6 -volumes 128 -dynroot -fakestat-all -afsdb -memcache
>    0  1125     1   0  67  0  5832  1556 afsslp DL     0    7:00.29 /usr/local/sbin/afsd -stat 2800 -daemons 6 -volumes 128 -dynroot -fakestat-all -afsdb -memcache
>    0  1126     1   0  46  0  5832  1556 afsslp DL     0    6:06.30 /usr/local/sbin/afsd -stat 2800 -daemons 6 -volumes 128 -dynroot -fakestat-all -afsdb -memcache
>    0  1127     1   0  49  0  5832  1556 afsslp DL     0    7:09.72 /usr/local/sbin/afsd -stat 2800 -daemons 6 -volumes 128 -dynroot -fakestat-all -afsdb -memcache
>    0  1128     1   0  45  0  5832  1556 afsslp DL     0    6:28.13 /usr/local/sbin/afsd -stat 2800 -daemons 6 -volumes 128 -dynroot -fakestat-all -afsdb -memcache
>    0  1129     1   0  46  0  5832  1556 afsslp DL     0    6:12.35 /usr/local/sbin/afsd -stat 2800 -daemons 6 -volumes 128 -dynroot -fakestat-all -afsdb -memcache
>    0  1130     1   0  76  0  5832  1556 afswai DL     0    0:02.75 /usr/local/sbin/afsd -stat 2800 -daemons 6 -volumes 128 -dynroot -fakestat-all -afsdb -memcache
>
> ...but none of them ever seem to get scheduled.  (The bonnie++ process
> is also stuck in afsslp.)  I don't have a debugging kernel on this
> machine (it's actually only mine for a day to do some performance
> testing) so I can't easily get a backtrace.  rxdebug on the server
> reports no active connections.  The client *is* working enough to
> respond to rxdebug, and reports:
>

I reproduced this on my machine, and it does not look like an actual
deadlock -- rather, the cache is full and nothing is causing its contents
to be sent out on the wire.
In particular, bonnie++ is waiting on afs_WaitForCacheDrain in 
ObtainDCacheForWriting, while the CacheTruncateDaemon is in its 100ms wait 
to free up the glock for other threads.  There was some talk on Jabber 
this morning, and Derrick wants me to check that the sleep is actually 
bounded at 100ms and not infinite due to a bug.
Information about the sleepqueue is annoying to extract from a core dump,
so I'll probably be throwing printfs into libafs.ko to see what's going on
in the CacheTruncateDaemon.

-Ben Kaduk