[OpenAFS] Re: deadlock in OpenAFS 1.4.11 (Solaris 5.10)

John Tang Boyland boyland@cs.uwm.edu
Mon, 12 Apr 2010 09:07:58 -0500


Andrew Deason <adeason@sinenomine.net> writes:
] John Tang Boyland <boyland@pabst.cs.uwm.edu> wrote:
] 
] > [mdb]
] 
] Thanks for those. I'm not sure myself what's going on, but perhaps some
] discussion will help...
] 
] You appear to be running out of cache files, though, by the way. If you
] increase the size of your cache (or maybe even just the number of
] files), it may make this less likely to occur.

OK.  I'll do that the next time we reboot.  The cache size set in
cacheinfo is rather small (25000K).

(In fact, I guess that's why other people haven't noticed the problem.
Running with a 25MB disk cache is pretty ridiculous.)
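
For reference, the change would look something like the line below.
This assumes the usual cacheinfo layout (mount point, cache directory,
size in 1K blocks) with the default /afs and /usr/vice/cache paths,
which I have not checked against this machine; 250000 is only an
example value, not a recommendation.

```
/afs:/usr/vice/cache:250000
```

The number of cache files can, I believe, also be raised separately
with the -files option to afsd.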

] > BTW:
] > process 17679 is the one writing the LONG file that seemed to 
] > initiate the deadlock.  I notice it is inside "FetchWholeEnchilada".
] 
] It appears to have unlinked the file while it was open; does that sound
] correct?

Possibly: process 17679 is listed as "make test".  I'm guessing the
user noticed things were going slowly and control-C'ed the make
process, and "make" then decided to delete the output file.
But I don't know for sure.

] >   fffffe8003244cb0 FetchWholeEnchilada+0xf4()
] >   fffffe8003244d80 afs_remove+0x7eb()
] 
] Can someone explain this, by the way? If I'm reading this correctly, we
] fetch/cache the entire file contents of a file if it's unlinked from
] under a process... Why?
] 
] >   fffffe8002fda5d0 swtch+0x110()
] >   fffffe8002fda5f0 cv_wait+0x68()
] >   fffffe8002fda640 afs_osi_Sleep+0x99()
] >   fffffe8002fda6c0 Afs_Lock_Obtain+0x1cb()
] >   fffffe8002fda780 afs_putpage+0x14a()
] >   fffffe8002fda7f0 osi_VM_GetDownD+0xe8()
] >   fffffe8002fda9c0 afs_GetDownD+0x7ed()
] >   fffffe8002fdab90 afs_GetDCache+0x713()
] 
] So, all of these are waiting to free up a dcache entry. I'm not in this
] code very much, but here's a guess... someone tell me if this makes any
] sense.
] 
] What looks like may be possible is that some process locks vcache V1,
] and tries to get a dcache entry for it; it tries to create a new dcache
] entry and tries to free up a dcache entry (D1) because we're out. D1 has
] mapped pages (or whatever IFAnyPages means), and we need to invalidate
] the pages, so we need to lock D1's vcache. If D1's vcache is the same as
] vcache V1, we have deadlock. This makes sense to me to see while
] FetchWholeEnchilada is running, since fetching the later chunks may be
] trying to free up the earlier chunks fetched in the same file...
] 
] If that is plausible, I think potential solutions include dropping the
] V1 lock before GetDownD (I assume this isn't possible, or a lot of
] things assume this doesn't happen and is a lot of work to make right,
] etc)... or, passing the avc into GetDownD, and have GetDownD skip
] dcaches that need page invalidation that have the same vcache as the one
] passed in. That way we sleep and retry (although still while holding the
] V1 lock...)
] 
] -- 
] Andrew Deason
] adeason@sinenomine.net
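
To check that I follow the scenario and the second suggestion (passing
the avc into GetDownD), here is a toy model in Python.  It is only a
sketch, not OpenAFS code: VCache, DCache, get_down_d and demo are
made-up names, a plain non-recursive lock stands in for the vcache
lock, and a non-blocking acquire stands in for Afs_Lock_Obtain so the
example reports the deadlock instead of hanging.

```python
import threading


class VCache:
    """Toy stand-in for a vcache: a name plus a non-recursive lock."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()


class DCache:
    """Toy stand-in for a dcache entry (one cached chunk of a file)."""

    def __init__(self, vcache, chunk, has_pages):
        self.vcache = vcache
        self.chunk = chunk
        self.has_pages = has_pages  # pages must be invalidated before reuse


def get_down_d(entries, avc=None):
    """Try to free one entry; return (evicted_entry, would_deadlock).

    If avc is given, entries that need page invalidation and belong to
    avc are skipped, because the caller already holds avc.lock.
    """
    for d in entries:
        if d.has_pages:
            if avc is not None and d.vcache is avc:
                continue  # would need a lock the caller already holds
            # The real code would sleep here; the toy just reports it.
            if not d.vcache.lock.acquire(blocking=False):
                return None, True
            d.vcache.lock.release()
        entries.remove(d)  # safe: we return immediately after removing
        return d, False
    return None, False


def demo(pass_avc):
    """Hold V1's lock, as the fetching process does, then ask for a slot."""
    v1, v2 = VCache("V1"), VCache("V2")
    entries = [DCache(v1, 0, True), DCache(v2, 0, False)]
    with v1.lock:
        return get_down_d(entries, avc=v1 if pass_avc else None)
```

Without the avc, demo(False) reports that it would block on V1's own
lock; with it, demo(True) skips the V1 chunk and frees the V2 one.  If
every candidate belongs to V1, the toy returns (None, False), which
corresponds to the "sleep and retry while still holding V1" case.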

BTW: Is there any more useful information I could get from the machine,
or can we reboot it?  Please reply by email to boyland@cs.uwm.edu.