[OpenAFS] Re: Tuning the -daemons.

Andrew Deason adeason@sinenomine.net
Tue, 19 Apr 2011 09:56:45 -0500

On Tue, 19 Apr 2011 09:28:17 +0200
Jan Johansson <janj@it.su.se> wrote:

> Simon Wilkinson <sxw@inf.ed.ac.uk> wrote:
> > Reviewing your original post, it has occurred to me that your
> > problem could be a symptom of an issue a number of sites are seeing
> > with callback breaks. Essentially, it is possible for the thread in
> > the client that handles incoming network traffic to hang whilst handling
> > a callback break. If this happens, it appears to the fileserver like
> > the client is no longer handling data, and you will see the errors
> > that you have been seeing.

Assuming I am correctly thinking of the same issue... to clarify:
technically it's the thread handling incoming RPCs, not the thread
handling all Rx traffic. But the result is xvcache staying write-locked
for a long time, so the whole cache manager (CM) pretty much grinds to a halt.

> So let me see if I understand this problem correctly.
> 1. The IMAP server wants to update the status of a mail.
> 2. The AFS server starts breaking CallBacks.
> 3. The IMAP server is waiting for the AFS server to say OK on
> the mail update.
> 4. Some other AFS server wants to break CallBacks on the IMAP
> server and since the IMAP server is busy waiting for the first
> AFS server it can't respond.
> 5. Things go bananas.

It's not quite this simple. What I have seen when this issue occurs is
that the client is attempting to give back callbacks to a fileserver,
and the fileserver is trying to call InitCallBackState3 on the client at the
same time (or anything else that locks the host structure). On all
releases of 1.4, these two operations require the same lock in the
client, so the requests hang until they time out.

There may be other ways for it to occur, but that's the general scenario
I recall.
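
To make the shape of that hang concrete, here is a rough sketch in Python.
It is a simplified model of my reading of the scenario, not OpenAFS code:
client_lock and host_lock are hypothetical stand-ins (not real OpenAFS lock
names), and the "RPCs" are just threads. Each side holds its own lock while
waiting for the lock the other side holds, so neither request can make
progress until the timeout expires.

```python
import threading


def simulate(timeout=0.2):
    """Model two peers that each hold one lock while waiting for the other's."""
    # Hypothetical stand-ins; neither is a real OpenAFS lock name.
    client_lock = threading.Lock()  # lock both client-side operations need
    host_lock = threading.Lock()    # fileserver's lock on the host structure
    barrier = threading.Barrier(2)
    results = {}

    def call(name, held, needed):
        with held:
            barrier.wait()  # both sides now hold their own lock
            got = needed.acquire(timeout=timeout)
            if got:
                needed.release()
            results[name] = "ok" if got else "timed out"
            barrier.wait()  # keep holding until the peer has given up too

    threads = [
        # Client gives back callbacks: holds client_lock, needs host_lock.
        threading.Thread(target=call,
                         args=("give back callbacks", client_lock, host_lock)),
        # Fileserver calls the client: holds host_lock, needs client_lock.
        threading.Thread(target=call,
                         args=("InitCallBackState3", host_lock, client_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(simulate())
```

Both entries in the returned dict come back as "timed out", which is the
behaviour described above: nothing is corrupted, but both requests sit there
until the timeout fires.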

> > We believe that this behaviour is fixed in 1.6.0pre4. If you still
> > have your test environment, it would be very interesting to know
> > whether you still see these problems.
> Is this 1.6.0pre4 for server or client?

Client. It's fixed on the 1.4 branch, too, but not in any released

Andrew Deason