[OpenAFS-devel] Strange hangs in openafs 1.4.1 linux 2.6.17.7

Tue, 12 Sep 2006 06:30:46 -0400

On 9/12/06, Jerry Lundström <jerry.lundstrom@it.su.se> wrote:
> Jerry Lundström wrote:
> > Jeffrey Altman wrote:
> >> The 1.4.1 server has a bug that results in significant delays in the
> >> response to clients if there were outstanding callback breaks that could
> >> not previously be delivered to the client and the client's IP address or
> >> the port number has changed.  This is the bug to which I was referring
> >> which was fixed prior to 1.4.2-beta-1.
> >
> > This does not explain the afs_cv_wait hangs where I clearly see the
> > response from the client in the tcpdump running on the client. Neither
> > the IP address nor the port changed during the 0.1 sec between the
> > fetch-status request and the response.
>
> Sorry, the meaning got all messed up.
>
> When I run tcpdump on the client I see the fetch-status request being
> sent to the server and I see the server's response to the client, but the
> process that sent the request has hung in afs_cv_wait, so the server
> sends a couple more responses and after a few seconds a ping (or at least
> I think it's a ping), but the process is still stuck in afs_cv_wait and I
> can't strace the process or attach gdb to it.
>
> This has happened both with memcache and filecache on the ramdisk but
> never with the filecache running on a real drive.
>

The I/O timing characteristics of memcache/ramdisk are very different
from those of a physical disk.  It's quite possible that these timing
differences are allowing you to hit a race condition in the cache
manager.  Is there any way you can provide us with a kernel stack
backtrace for the process(es) that get stuck in afs_cv_wait?
Alternatively, could you provide us with an fstrace dump leading up to
the deadlock?
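
To illustrate the general class of timing-dependent hang meant here, the
following is a minimal, generic sketch of a "lost wakeup" on a condition
variable.  It is not OpenAFS code and not a diagnosis of this report; the
function run_waiter and all names in it are hypothetical, and the events
are only there to force the unlucky interleaving that faster I/O might
otherwise expose by chance.

```python
import threading


def run_waiter(check_under_lock: bool, timeout: float = 0.5) -> bool:
    """Return True if the waiter was woken, False if the wakeup was lost.

    Hypothetical demo: 'ready' stands in for "the reply has arrived".
    """
    cond = threading.Condition()
    state = {"ready": False}
    checked = threading.Event()    # waiter has tested the predicate
    signalled = threading.Event()  # notifier has already fired
    outcome = []

    def waiter() -> None:
        if check_under_lock:
            # Correct pattern: the predicate test and the wait happen
            # under the same lock, so the notifier cannot slip in between.
            with cond:
                checked.set()
                outcome.append(cond.wait_for(lambda: state["ready"], timeout))
            return
        # Buggy pattern: the predicate is tested without holding the lock.
        if state["ready"]:
            outcome.append(True)
            return
        checked.set()
        signalled.wait()  # force the notifier to run inside the window
        with cond:
            # The notification has already been delivered to nobody, so
            # this wait can only end by timing out.
            outcome.append(cond.wait(timeout))

    def notifier() -> None:
        checked.wait()
        with cond:
            state["ready"] = True
            cond.notify_all()
        signalled.set()

    threads = [threading.Thread(target=waiter),
               threading.Thread(target=notifier)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome[0]


if __name__ == "__main__":
    print("check under lock, woken:", run_waiter(True))
    print("check without lock, woken:", run_waiter(False))
```

In the buggy variant the reply has arrived, yet the waiter sleeps until
its timeout; with no timeout it would sleep forever.  A backtrace or trace
log is what would show whether anything of this sort is actually involved.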

Thanks,

-- 
Tom Keiser
tkeiser@gmail.com