[OpenAFS] Re: ProbeUuid for host failed

Andrew Deason adeason@sinenomine.net
Tue, 3 Apr 2012 22:42:22 -0500


On Tue, 3 Apr 2012 19:04:03 -0700
Ken Elkabany <Ken@Elkabany.com> wrote:

> 1.6.0pre1 which was packaged with Ubuntu 11.10. Should we make it a
> priority to upgrade?

Yes, there are many known problems with that prerelease.

> > Yes; that should not be possible unless the client is within a
> > certain narrow range of versions. The client could be tied up trying
> > to clear up that queue of GUCB messages, which is why everything
> > would appear to freeze for a short time, and you get that ProbeUuid
> > failure.
>
> What are GUCB messages? Why would they pile up, and in which
> circumstances?

I wasn't really explaining because the technical details aren't really
important as far as avoiding the issue goes. The solution is: "upgrade"
(or "downgrade"; generally "don't use that version"). I don't have any
workarounds nor a lot of knowledge about what triggers it; the code
making this situation possible is not in any actual release and so I
didn't expect to see anyone run into issues with it.

But since you _asked_....

I was referring to a GiveUpCallBacks message, which the client sends to
the fileserver for a file if it's not caching it anymore, to let the
fileserver know that the client does not need to be contacted if the
file changes. Normally these are queued and a bunch are sent at once; if
we queue too many, we send what we have at the time of the queueing. In
older clients this was done while holding certain heavy-duty locks, which
in certain situations could cause a kind of deadlock between the
fileserver and clients. One way around that, which temporarily existed in
the code, was to allocate new structures dynamically when we ran out of
the pre-set amount, and to flush the queue periodically like we always
did. Since the number of such dynamically allocated structures is
unbounded, there were concerns that this approach could cause
situations... well, situations like the one you're describing. So before
1.6.0 this was changed to allow certain locks to be dropped when we flush
the queue because it became full, similar to the original behavior.
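
To make the queueing behavior a bit more concrete, here is a toy model
of it in Python. This is only a sketch of the idea and not the actual
client code (which is C and involves the locking described above); the
class name, the capacity of 3, the FID strings, and the send callback
standing in for the bulk GiveUpCallBacks RPC are all made up for
illustration.

```python
from typing import Callable, List


class GiveUpQueue:
    """Toy model of a bounded queue of pending GiveUpCallBacks entries."""

    def __init__(self, capacity: int, send: Callable[[List[str]], None]) -> None:
        self.capacity = capacity      # the "pre-set amount" of queue slots
        self.send = send              # hypothetical stand-in for the bulk RPC
        self.pending: List[str] = []

    def queue(self, fid: str) -> None:
        """Queue one entry; if the queue is now full, send what we have."""
        self.pending.append(fid)
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        """Send all queued entries in one batch (also called periodically)."""
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        self.send(batch)


sent: List[List[str]] = []
q = GiveUpQueue(capacity=3, send=sent.append)
for fid in ["1.1.1", "1.2.1", "1.3.1", "1.4.1"]:
    q.queue(fid)    # the third call fills the queue and triggers a flush
q.flush()           # the periodic flush picks up the leftover entry
```

The problematic variant would correspond to dropping the capacity check
in queue() and relying on the periodic flush alone, so the pending list
can grow without bound between flushes.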

> I traced the ProbeUuid failure to the OpenAFS fileservers using the
> incorrect IP for certain clients. The clients each have one interface,
> but are accessible via 2 IP addresses (one external/internet/WAN, one
> internal/local). The fileservers would use their external IP address,
> which the firewall would block.

I'm not sure I understand; which IP are the fileservers supposed to use?
The internal one or the external one? You can configure a client to only
advertise certain addresses with NetInfo and NetRestrict:

<http://docs.openafs.org/Reference/5/NetInfo.html>
<http://docs.openafs.org/Reference/5/NetRestrict.html>
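
As a rough illustration, and assuming for the sake of the example that
the internal address is the one the fileservers should use: both files
take one dotted-decimal address per line and live in the client
configuration directory (typically /usr/vice/etc, though packaged
installs may put it elsewhere). The addresses below are placeholders,
not your real ones. A client NetInfo that advertises only the internal
address could look like:

```text
10.0.0.5
```

and a NetRestrict that keeps the external address from being advertised:

```text
198.51.100.7
```

As far as I know the client reads these at startup, so expect to restart
it for a change to take effect; check the pages above for the details.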

-- 
Andrew Deason
adeason@sinenomine.net