[OpenAFS-devel] appears that cvs trunk has a deadlock problem of some sort for linux24

Neulinger, Nathan nneul@umr.edu
Thu, 4 Apr 2002 11:41:59 -0600


Yes, it's hanging in a dlock. My tracing indicates:

about to TFDC
about to DLOCK in tfdc

and that's where it hangs.

Based on the comments above TryFlush, the code of TF doesn't make sense.
It says it maintains the vcache lock exclusively and that it is called
with that lock held. Well, the first thing it does is try to lock it
again. Additionally, shortly after it exits, the caller will also try to
DUNLOCK(). Looks to me like the code needs to just DUNLOCK()..DLOCK() in
the loop in TF, and get rid of the outer DLOCK/DUNLOCK. I'm trying
something along those lines right now.

-- Nathan

------------------------------------------------------------
Nathan Neulinger                       EMail:  nneul@umr.edu
University of Missouri - Rolla         Phone: (573) 341-4841
Computing Services                       Fax: (573) 341-4216


> -----Original Message-----
> From: Neulinger, Nathan
> Sent: Thursday, April 04, 2002 11:30 AM
> To: openafs-devel@openafs.org
> Subject: RE: [OpenAFS-devel] appears that cvs trunk has a deadlock problem of some sort for linux24
>
>
> Yep. Definitely hanging in the TryFlush... I temporarily commented out
> the call to it and the problem went away.
>
> I'm willing to bet one of those DLOCK()'s is spinning.
>
> Something else that looks odd to me:
>
> In TryFlush:
>
>        if (!DCOUNT(dentry) && !dentry->d_inode) {
>             DGET(dentry);
>             AFS_GUNLOCK();
>             DUNLOCK();
>
> but in newVCache:
>=20
>                         if (DCOUNT(dentry)) {
>                             afs_TryFlushDcacheChildren(dentry);
>                         }
>
>                         if (!DCOUNT(dentry)) {
>                             AFS_GUNLOCK();
>                             DGET(dentry);
>                             DUNLOCK();
>                             d_drop(dentry);
>                             dput(dentry);
>                             AFS_GLOCK();
>                             goto restart;
>
> Notice how in NVC it does the dget after the gunlock? Why isn't TryFlush
> doing it the same way?
>
> I'd bet though that it's one of the DLOCK() calls, I'm trying to trace
> it out now...
>
> -- Nathan
>
> > -----Original Message-----
> > From: Neulinger, Nathan
> > Sent: Thursday, April 04, 2002 10:34 AM
> > To: openafs-devel@openafs.org
> > Subject: RE: [OpenAFS-devel] appears that cvs trunk has a deadlock problem of some sort for linux24
> >
> >
> > At first glance, it looks like the TryFlush... routine is called inside a
> > DLOCK(), and itself does a DLOCK(). Not sure if that is kosher or not.
> >
> > -- Nathan
> >
> > > -----Original Message-----
> > > From: Neulinger, Nathan
> > > Sent: Thursday, April 04, 2002 10:15 AM
> > > To: openafs-devel@openafs.org
> > > Subject: [OpenAFS-devel] appears that cvs trunk has a deadlock problem of some sort for linux24
> > >
> > >
> > > It's been introduced since 2002/03/26. The only changes I see in the trunk
> > > since then are the fake stat code and the flush-dcache stuff.
> > >
> > > Basically, running my "crash afsd" script (yes, it's useful enough that
> > > I keep a script around for some testing) runs through for a while, and
> > > then the machine completely locks up. No panic msg, nothing. The only
> > > thing that responds is A-SysRQ-SUB.
> > >
> > > The script that I test with just does:
> > >
> > > find /umr/s/openafs/ -follow -type f -print | xargs -P 8 -n 30 wc
> > >
> > > Have not tried pulling out any of the dcache or fakestat changes to see
> > > if reverting them helps. Will try and get more info soon unless someone
> > > else spots the problem first. (I assumed it was something with the
> > > prototypes branch, but verified against a build from the trunk.)
> > >
> > > -- Nathan
> > >
> > > _______________________________________________
> > > OpenAFS-devel mailing list
> > > OpenAFS-devel@openafs.org
> > > https://lists.openafs.org/mailman/listinfo/openafs-devel