[OpenAFS] Re: advice on troubleshooting blocked cache manager on MacOS?

Derrick Brashear shadow@gmail.com
Thu, 4 Feb 2010 15:08:10 -0500


Revisiting this, since you said it's still happening:

On Wed, Jan 27, 2010 at 12:26 PM, Derrick Brashear <shadow@gmail.com> wrote:
> On Wed, Jan 27, 2010 at 12:10 PM, Adam Megacz <adam@megacz.com> wrote:
>>
>> Derrick Brashear <shadow@gmail.com> writes:
>>>> I might be able to try that, but it will take a few days.
>>>
>>> if true, you should see output in cmdebug now
>>
>> Okay, I just caught it red-handed. Can anybody help with reading the
>> tea leaves here?
>>
>>  megacz@quine:~$ cmdebug localhost
>>  Lock afs_xvcache status: (none_waiting, write_locked(pid:11013 at:335))
>
>       writelocked = (0 == NBObtainWriteLock(&afs_xvcache, 335));
>
> in afs_vop_reclaim
>
> xvreclaim not held, which means we're presumably in afs_FlushVCache.
>
>>  Lock afs_xserver status: (none_waiting, 1 read_locks(pid:0))
>
> somewhere has afs_xserver read locked. for obvious reasons we can't
> track these. no one's blocked on it.
>
>>  Lock afs_xvcb status: (writer_waiting, write_locked(pid:0 at:273), 1 waiters)
>
>        ObtainWriteLock(&afs_xvcb, 273);
>
> is in afs_FlushVCBs (called with lockit true). assuming you're not
> running disconnected and actively trying to disconnect, this is the
> system daemon which does this (afs_Daemon). that also explains
> "pid:0". We don't know who's waiting, but only this, QueueVCB and
> RemoveVCB actually *get* afs_xvcb.
>
> So, let's be clever. FlushVCache? Calls QueueVCB. So we can assume
> it's blocking.
>
> So then the question is why FlushVCBs is blocking you. well, you said
> you had multihomed fileservers.
>
> RXAFS_GiveUpCallBacks is called here. you didn't perchance grab
> rxdebug output for the client at this point? (no is fine, this is
> probably the answer)
>
> so, presumably (and now from memory, i'm not looking at the code) you
> block for like a minute while it times out a fileserver, then it fails
> over to another address, afs_Analyze returns shouldretry=1, you look,
> afs_ConnByHost probably gets the other address, and the loop proceeds
> and wins.
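
For reading the lock lines quoted above without squinting, here is a rough
Python sketch. It assumes only the line shape shown in this thread
("Lock <name> status: (...)"); the function names are made up for
illustration and are not part of cmdebug or OpenAFS. It pulls out the write
holder's pid, the "at:" number (the value passed to the lock call, e.g. 335
and 273 above), and the waiter count, then lists the locks someone is
actually blocked on.

```python
import re

# Line shape assumed from the cmdebug output quoted in this thread, e.g.
#   Lock afs_xvcb status: (writer_waiting, write_locked(pid:0 at:273), 1 waiters)
LOCK_RE = re.compile(r"Lock\s+(?P<name>\w+)\s+status:\s+\((?P<state>.*)\)\s*$")
HOLDER_RE = re.compile(r"write_locked\(pid:(?P<pid>\d+)\s+at:(?P<at>\d+)\)")
WAITERS_RE = re.compile(r"(?P<n>\d+)\s+waiters")


def parse_lock_line(line):
    """Return a dict describing one 'Lock ... status:' line, or None."""
    m = LOCK_RE.search(line)
    if m is None:
        return None
    state = m.group("state")
    holder = HOLDER_RE.search(state)
    waiters = WAITERS_RE.search(state)
    return {
        "name": m.group("name"),
        # pid and source-location tag of the write holder, if any
        "write_pid": int(holder.group("pid")) if holder else None,
        "write_at": int(holder.group("at")) if holder else None,
        # no "N waiters" field means nobody is queued on the lock
        "waiters": int(waiters.group("n")) if waiters else 0,
    }


def contended_locks(text):
    """Return parsed entries for every lock that has at least one waiter."""
    parsed = (parse_lock_line(line) for line in text.splitlines())
    return [p for p in parsed if p and p["waiters"] > 0]


SAMPLE = """\
Lock afs_xvcache status: (none_waiting, write_locked(pid:11013 at:335))
Lock afs_xserver status: (none_waiting, 1 read_locks(pid:0))
Lock afs_xvcb status: (writer_waiting, write_locked(pid:0 at:273), 1 waiters)
"""

if __name__ == "__main__":
    for lock in contended_locks(SAMPLE):
        print(lock["name"], "held at", lock["write_at"], "waiters:", lock["waiters"])
```

On the sample, only afs_xvcb comes back as contended, which matches the
reading above: afs_xvcache is held but nobody is queued on it.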

Ok, so, the next time it hangs, can you gather

  rxdebug (hungclient) 7001

and perhaps a couple of minutes of

  tcpdump -s 1500 -n -w /tmp/packets host (hungclient) and port 7001

while it's still hung?
(specify an ethernet interface with -i if your upstream isn't on
tcpdump's default interface)
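
If it's easier to have something ready to fire when the hang shows up, the
two commands above could be wrapped roughly like this. It's a sketch, not a
tested tool: the helper names, the /tmp/rxdebug.out path, and the 120-second
default are assumptions for illustration, and tcpdump will normally need
root.

```python
import subprocess

RX_PORT = "7001"  # port used in the rxdebug and tcpdump commands above


def rxdebug_cmd(client):
    """Build the rxdebug command line for the hung client."""
    return ["rxdebug", client, RX_PORT]


def tcpdump_cmd(client, outfile="/tmp/packets", interface=None):
    """Build the tcpdump command line; add -i only if an interface is given."""
    cmd = ["tcpdump", "-s", "1500", "-n", "-w", outfile]
    if interface:
        cmd += ["-i", interface]
    cmd += ["host", client, "and", "port", RX_PORT]
    return cmd


def capture(client, seconds=120, interface=None, rx_out="/tmp/rxdebug.out"):
    """Save rxdebug output, then record packets for roughly `seconds`."""
    rx = subprocess.run(rxdebug_cmd(client), capture_output=True, text=True)
    with open(rx_out, "w") as fh:  # rx_out is a hypothetical location
        fh.write(rx.stdout)
    dump = subprocess.Popen(tcpdump_cmd(client, interface=interface))
    try:
        dump.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        # stop the capture once the window is over
        dump.terminate()
        dump.wait()


if __name__ == "__main__":
    # "hungclient" stands in for the real hostname, as in the commands above
    print(" ".join(rxdebug_cmd("hungclient")))
    print(" ".join(tcpdump_cmd("hungclient")))
```

Running the file as-is only prints the two command lines; call
capture("yourhost") yourself once the client is actually hung.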