[OpenAFS] Re: deadlock in OpenAFS 1.4.11 (Solaris 5.10)
Mon, 12 Apr 2010 10:14:40 -0400
you might as well reboot it. i suspect (and wondered before) if the
real issue was not deadlock but that the machine simply went into a
loop, and with a cache that small it's likely it did. not the best
behavior, of course, but not the most urgent thing to pursue at the
moment.
On Mon, Apr 12, 2010 at 10:07 AM, John Tang Boyland
> Andrew Deason <email@example.com> writes
> ] John Tang Boyland <firstname.lastname@example.org> wrote:
> ] > [mdb]
> ] Thanks for those. I'm not sure myself what's going on, but perhaps some
> ] discussion will help...
> ] You appear to be running out of cache files, though, by the way. If you
> ] increase the size of your cache (or maybe even just the number of
> ] files), it may make this less likely to occur.
> OK.  I'll do that the next time we reboot.  The cacheinfo is
> rather small (25000K).
> (In fact, I guess that's why other people haven't noticed the problem.
> Running with a 25MB disk cache is pretty ridiculous.)
> ] > BTW:
> ] > process 17679 is the one writing the LONG file that seemed to
> ] > initiate the deadlock.  I notice it is inside "FetchWholeEnchilada"
> ] It appears to have unlinked the file while it was open; does that sound
> ] correct?
> Possibly: process 17679 is listed as "make test".
> I'm guessing the user was noticing
> things were going slow and control-C'ed the make process, and "make"
> decided to delete the output file.
> But I don't know for sure.
> ] >   fffffe8003244cb0 FetchWholeEnchilada+0xf4()
> ] >   fffffe8003244d80 afs_remove+0x7eb()
> ] Can someone explain this, by the way? If I'm reading this correctly, we
> ] fetch/cache the entire file contents of a file if it's unlinked from
> ] under a process... Why?
> ] >   fffffe8002fda5d0 swtch+0x110()
> ] >   fffffe8002fda5f0 cv_wait+0x68()
> ] >   fffffe8002fda640 afs_osi_Sleep+0x99()
> ] >   fffffe8002fda6c0 Afs_Lock_Obtain+0x1cb()
> ] >   fffffe8002fda780 afs_putpage+0x14a()
> ] >   fffffe8002fda7f0 osi_VM_GetDownD+0xe8()
> ] >   fffffe8002fda9c0 afs_GetDownD+0x7ed()
> ] >   fffffe8002fdab90 afs_GetDCache+0x713()
> ] So, all of these are waiting to free up a dcache entry. I'm not in this
> ] code very much, but here's a guess... someone tell me if this makes any
> ] sense.
> ] What looks like may be possible is that some process locks vcache V1,
> ] and tries to get a dcache entry for it; it tries to create a new dcache
> ] entry and tries to free up a dcache entry (D1) because we're out. D1 has
> ] mapped pages (or whatever IFAnyPages means), and we need to invalidate
> ] the pages, so we need to lock D1's vcache. If D1's vcache is the same as
> ] vcache V1, we have deadlock. This makes sense to me to see while
> ] FetchWholeEnchilada is running, since fetching the later chunks may be
> ] trying to free up the earlier chunks fetched in the same file...
> ] If that is plausible, I think potential solutions include dropping the
> ] V1 lock before GetDownD (I assume this isn't possible, or a lot of
> ] things assume this doesn't happen and is a lot of work to make right,
> ] etc)... or, passing the avc into GetDownD, and have GetDownD skip
> ] dcaches that need page invalidation that have the same vcache as the one
> ] passed in. That way we sleep and retry (although still while holding the
> ] V1 lock...)
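
For illustration, here is a minimal sketch of the second idea above
(having the eviction scan skip dcaches that need page invalidation and
belong to the vcache the caller already holds). This is not the OpenAFS
code: toy_vcache, toy_dcache, toy_pick_victim and the has_mapped_pages
flag are hypothetical stand-ins, and the real afs_GetDownD does a lot
more than this. It only shows the shape of the skip test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real OpenAFS structures. */
struct toy_vcache { int id; };

struct toy_dcache {
    struct toy_vcache *owner;  /* vcache this chunk belongs to */
    int has_mapped_pages;      /* stand-in for the IFAnyPages condition */
    int in_use;                /* nonzero if the chunk cannot be reclaimed */
};

/*
 * Return the index of the first reclaimable entry, or -1 if there is none.
 * Entries that need page invalidation and belong to `held` are skipped,
 * because invalidating their pages would mean locking a vcache that the
 * caller has already locked.
 */
static int toy_pick_victim(const struct toy_dcache *entries, size_t n,
                           const struct toy_vcache *held)
{
    size_t i;

    assert(entries != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (entries[i].in_use)
            continue;
        if (entries[i].has_mapped_pages && entries[i].owner == held)
            continue;  /* would block on the lock the caller holds */
        return (int)i;
    }
    return -1;  /* nothing safe to evict; caller would sleep and retry */
}
```

With a cache as small as the one here, the -1 case is probably common
while a whole file is being fetched, so the caller would still sleep
and retry with the V1 lock held, as noted above.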
> ] --
> ] Andrew Deason
> ] email@example.com
> BTW: Is there any more useful information I could get from the machine
> or can we reboot it?  Please reply by email to firstname.lastname@example.org.