[OpenAFS] Re: deadlock in OpenAFS 1.4.11 (Solaris 5.10)

Mon, 12 Apr 2010 10:14:40 -0400

you might as well reboot it. i suspect (and wondered before) that the
real issue was not a deadlock but that the machine simply went into a
loop, and with a cache that small it's likely it did. not the best
behavior, of course, but not the most urgent thing to pursue at the
moment.

On Mon, Apr 12, 2010 at 10:07 AM, John Tang Boyland
<boyland@pabst.cs.uwm.edu> wrote:
> Andrew Deason <adeason@sinenomine.net> writes
> ] John Tang Boyland <boyland@pabst.cs.uwm.edu> wrote:
> ]
> ] > [mdb]
> ]
> ] Thanks for those. I'm not sure myself what's going on, but perhaps some
> ] discussion will help...
> ]
> ] You appear to be running out of cache files, though, by the way. If you
> ] increase the size of your cache (or maybe even just the number of
> ] files), it may make this less likely to occur.
>
> OK.  I'll do that the next time we reboot.  The cacheinfo is
> rather small (25000K).
>
> (In fact, I guess that's why other people haven't noticed the problem.
> Running with a 25MB disk cache is pretty ridiculous.)
>
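For reference, a small sketch of reading that size back out of a cacheinfo
line. It assumes the usual three-field layout (AFS mount point, cache
directory, size in 1K blocks). The paths in the sample line are assumed
defaults, not taken from this machine, and parse_cacheinfo is a name made
up for illustration.

```python
def parse_cacheinfo(line):
    """Split a cacheinfo line into (mount point, cache dir, size in 1K blocks)."""
    mount, cache_dir, blocks = line.strip().split(":")
    return mount, cache_dir, int(blocks)

# Sample line: the paths are assumed defaults; only the 25000 comes from the thread.
mount, cache_dir, blocks = parse_cacheinfo("/afs:/usr/vice/cache:25000")
size_mib = blocks * 1024 / (1024 * 1024)
print(f"{cache_dir}: {blocks} 1K blocks, about {size_mib:.1f} MiB")
```

Raising the third field (or the number of cache files, as suggested above)
is the change being discussed; it takes effect when the client restarts.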
> ] > BTW:
> ] > process 17679 is the one writing the LONG file that seemed to
> ] > initiate the deadlock.  I notice it is inside "FetchWholeEnchilada".
> ]
> ] It appears to have unlinked the file while it was open; does that sound
> ] correct?
>
> Possibly: process 17679 is listed as "make test".
> I'm guessing the user was noticing
> things were going slow and control-C'ed the make process, and "make"
> decided to delete the output file.
> But I don't know for sure.
>
> ] >   fffffe8003244cb0 FetchWholeEnchilada+0xf4()
> ] >   fffffe8003244d80 afs_remove+0x7eb()
> ]
> ] Can someone explain this, by the way? If I'm reading this correctly, we
> ] fetch/cache the entire file contents of a file if it's unlinked from
> ] under a process... Why?
> ]
> ] >   fffffe8002fda5d0 swtch+0x110()
> ] >   fffffe8002fda5f0 cv_wait+0x68()
> ] >   fffffe8002fda640 afs_osi_Sleep+0x99()
> ] >   fffffe8002fda6c0 Afs_Lock_Obtain+0x1cb()
> ] >   fffffe8002fda780 afs_putpage+0x14a()
> ] >   fffffe8002fda7f0 osi_VM_GetDownD+0xe8()
> ] >   fffffe8002fda9c0 afs_GetDownD+0x7ed()
> ] >   fffffe8002fdab90 afs_GetDCache+0x713()
> ]
> ] So, all of these are waiting to free up a dcache entry. I'm not in this
> ] code very much, but here's a guess... someone tell me if this makes any
> ] sense.
> ]
> ] What looks like may be possible is that some process locks vcache V1,
> ] and tries to get a dcache entry for it; it tries to create a new dcache
> ] entry and tries to free up a dcache entry (D1) because we're out. D1 has
> ] mapped pages (or whatever IFAnyPages means), and we need to invalidate
> ] the pages, so we need to lock D1's vcache. If D1's vcache is the same as
> ] vcache V1, we have deadlock. This makes sense to me to see while
> ] FetchWholeEnchilada is running, since fetching the later chunks may be
> ] trying to free up the earlier chunks fetched in the same file...
> ]
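A toy model of the scenario described above, in case it helps the
discussion. This is only a sketch under assumptions: the VCache and DCache
classes and try_evict are invented for illustration and are not the OpenAFS
structures, and a plain non-re-entrant lock stands in for the real vcache
lock. A timeout is used so the example terminates where the real thread
would sleep forever.

```python
import threading

class VCache:
    """Toy stand-in for a vcache; its lock is not re-entrant."""
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()

class DCache:
    """Toy stand-in for a cached chunk that belongs to one vcache."""
    def __init__(self, vcache, has_pages):
        self.vcache = vcache
        self.has_pages = has_pages

def try_evict(victim, timeout=0.1):
    """Evict one chunk; invalidating its pages needs the owner's vcache lock.

    Returns False in the case where a blocking acquire would never return.
    """
    if victim.has_pages:
        if not victim.vcache.lock.acquire(timeout=timeout):
            return False
        victim.vcache.lock.release()
    return True

v1 = VCache("V1")
v2 = VCache("V2")
d_same = DCache(v1, has_pages=True)    # chunk of the file being fetched
d_other = DCache(v2, has_pages=True)   # chunk of some other file

with v1.lock:  # the fetching thread already holds V1
    stuck = not try_evict(d_same)   # needs V1 again: cannot make progress
    fine = try_evict(d_other)       # needs V2: no conflict
print(stuck, fine)
```

In this model only the chunk whose vcache is already held by the caller
gets stuck, which matches the guess about later chunks of one file trying
to push out earlier chunks of the same file.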
> ] If that is plausible, I think potential solutions include dropping the
> ] V1 lock before GetDownD (I assume this isn't possible, or a lot of
> ] things assume this doesn't happen and is a lot of work to make right,
> ] etc)... or, passing the avc into GetDownD, and have GetDownD skip
> ] dcaches that need page invalidation that have the same vcache as the one
> ] passed in. That way we sleep and retry (although still while holding the
> ] V1 lock...)
> ]
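The second idea (pass the caller's vcache in and skip conflicting chunks)
might look roughly like the sketch below. Everything here is hypothetical:
Chunk and pick_victims are illustrative names, not the GetDownD interface,
and the real victim selection is considerably more involved.

```python
from collections import namedtuple

# Hypothetical, simplified view of a cached chunk.
Chunk = namedtuple("Chunk", "index vcache has_pages")

def pick_victims(chunks, caller_vcache, needed):
    """Pick up to `needed` chunks to free, skipping any chunk that would
    need page invalidation on the vcache the caller already holds locked."""
    victims = []
    for chunk in chunks:
        if len(victims) == needed:
            break
        if chunk.has_pages and chunk.vcache == caller_vcache:
            continue  # freeing this one would need the caller's own lock
        victims.append(chunk)
    return victims

chunks = [
    Chunk(0, "V1", True),   # same file, pages mapped: skipped
    Chunk(1, "V1", False),  # same file, no pages: safe to free
    Chunk(2, "V2", True),   # other file: safe to free
]
victims = pick_victims(chunks, "V1", 2)
print([c.index for c in victims])
```

If every candidate is skipped the function returns an empty list, which
corresponds to the "sleep and retry" case, still with the V1 lock held.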
> ] --
> ] Andrew Deason
> ] adeason@sinenomine.net
>
> BTW: Is there any more useful information I could get from the machine
> or can we reboot it?  Please reply by email to boyland@cs.uwm.edu.

-- 
Derrick