[OpenAFS] Re: deadlock in OpenAFS 1.4.11 (Solaris 5.10)

Andrew Deason adeason@sinenomine.net
Sun, 11 Apr 2010 23:14:18 -0500


On Sat, 10 Apr 2010 12:54:30 -0500
John Tang Boyland <boyland@pabst.cs.uwm.edu> wrote:

> [mdb]

Thanks for those. I'm not sure myself what's going on, but perhaps some
discussion will help...

By the way, you appear to be running out of cache files. If you increase
the size of your cache (or maybe even just the number of files), it may
make this less likely to occur.

> BTW:
> process 17679 is the one writing the LONG file that seemed to 
> initiate the deadlock.  I notice it is inside "FetchWholeEnchilada".

It appears to have unlinked the file while it was open; does that sound
correct?

>   fffffe8003244cb0 FetchWholeEnchilada+0xf4()
>   fffffe8003244d80 afs_remove+0x7eb()

Can someone explain this, by the way? If I'm reading this correctly, we
fetch and cache the entire contents of a file if it's unlinked out from
under a process... Why?

>   fffffe8002fda5d0 swtch+0x110()
>   fffffe8002fda5f0 cv_wait+0x68()
>   fffffe8002fda640 afs_osi_Sleep+0x99()
>   fffffe8002fda6c0 Afs_Lock_Obtain+0x1cb()
>   fffffe8002fda780 afs_putpage+0x14a()
>   fffffe8002fda7f0 osi_VM_GetDownD+0xe8()
>   fffffe8002fda9c0 afs_GetDownD+0x7ed()
>   fffffe8002fdab90 afs_GetDCache+0x713()

So, all of these are waiting to free up a dcache entry. I'm not in this
code very much, but here's a guess... someone tell me if this makes any
sense.

What looks possible is that some process locks vcache V1 and then tries
to get a dcache entry for it. Creating a new dcache entry means freeing
up an existing one (D1), because we're out. D1 has mapped pages (or
whatever IFAnyPages means), so we need to invalidate those pages, and
for that we need to lock D1's vcache. If D1's vcache is the same as
vcache V1, we have deadlock. It makes sense to me that we'd see this
while FetchWholeEnchilada is running, since fetching the later chunks
may be trying to free up the earlier chunks fetched from the same file...
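
To make the guess above concrete, here is a toy model of the condition
I'm describing. It is only a sketch: the struct and function names are
made up for illustration and are not the real OpenAFS vcache/dcache
structures, and a plain flag stands in for the real lock. It just checks
whether evicting a victim dcache would need a vcache lock that the
caller already holds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vcache; the flag models "lock is held". */
struct toy_vcache {
    bool locked;
};

/* Hypothetical stand-in for a dcache entry (one cached chunk). */
struct toy_dcache {
    struct toy_vcache *owner;  /* vcache this chunk belongs to */
    bool has_pages;            /* mapped pages must be invalidated first */
};

/*
 * Returns true if reclaiming 'victim' would require taking a lock that
 * the caller already holds via 'held', i.e. the caller would end up
 * sleeping on its own lock.
 */
static bool toy_evict_would_self_deadlock(const struct toy_vcache *held,
                                          const struct toy_dcache *victim)
{
    if (!victim->has_pages)
        return false;  /* no invalidation needed, so no vcache lock needed */
    return victim->owner == held && held->locked;
}
```

With V1 locked and D1 belonging to V1 with pages mapped, this returns
true; a victim belonging to some other vcache, or one with no pages,
returns false.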

If that is plausible, I think potential solutions include dropping the
V1 lock before GetDownD (I assume this isn't possible, or that a lot of
code assumes it doesn't happen and it would be a lot of work to make
right, etc)... or passing the avc into GetDownD and having GetDownD skip
dcaches that need page invalidation and have the same vcache as the one
passed in. That way we sleep and retry (although still while holding the
V1 lock...)
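
Roughly what I mean by the second option, as a sketch only. This is not
the real afs_GetDownD; the types, field names, and function name below
are hypothetical, and all locking and the actual page invalidation are
left out. The only point is the skip test on the caller's vcache.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the OpenAFS structures. */
struct sketch_vcache {
    int id;
};

struct sketch_dcache {
    struct sketch_vcache *owner;  /* vcache this chunk belongs to */
    bool needs_invalidate;        /* has pages that must be invalidated */
    bool in_use;
};

/*
 * Try to free up to 'want' entries from 'tab'. Entries that need page
 * invalidation and belong to 'avc' (the vcache the caller has locked)
 * are skipped rather than waited on. Returns the number freed; a caller
 * that gets back fewer than it wanted would sleep and retry.
 */
static int sketch_get_down_d(struct sketch_dcache *tab, size_t n,
                             const struct sketch_vcache *avc, int want)
{
    int freed = 0;

    for (size_t i = 0; i < n && freed < want; i++) {
        struct sketch_dcache *d = &tab[i];

        if (!d->in_use)
            continue;
        if (d->needs_invalidate && d->owner == avc)
            continue;  /* would need avc's lock, which the caller holds */

        /* Real code would invalidate pages and flush the chunk here. */
        d->in_use = false;
        d->needs_invalidate = false;
        d->owner = NULL;
        freed++;
    }
    return freed;
}
```

Passing NULL for avc would give the unfiltered behavior, since no entry
has a NULL owner while in use.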

-- 
Andrew Deason
adeason@sinenomine.net