[OpenAFS] BreakDelayedCallbacks FAILED still an issue

John W. Sopko Jr. sopko@cs.unc.edu
Tue, 18 Apr 2006 13:58:56 -0400


Thanks for the info. The client is not using a NAT, so
the problem is probably the old Windows client. We have 500+
clients and we sometimes miss some, especially users who
configure their own systems and use AFS from their ISP
on their home systems, wireless, labs, etc. That is why it
would be nice to have this fixed on the server side.
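
To picture the server-side approach described in the reply quoted
below (tracking each client by UUID and refreshing its address and
port whenever the NAT hands out a new mapping), here is a minimal
sketch. It is not OpenAFS code: HostTable, HostEntry, note_contact
and callback_target are hypothetical names made up for illustration,
and the addresses are documentation examples.

```python
from dataclasses import dataclass


@dataclass
class HostEntry:
    """One known client (hypothetical structure, not the OpenAFS one)."""
    uuid: str
    addr: str
    port: int


class HostTable:
    """Illustrative host table keyed by client UUID."""

    def __init__(self):
        self._by_uuid = {}

    def note_contact(self, uuid, addr, port):
        """Record the source address and port of an incoming RPC."""
        entry = self._by_uuid.get(uuid)
        if entry is None:
            entry = HostEntry(uuid, addr, port)
            self._by_uuid[uuid] = entry
        else:
            # Same client, possibly a new NAT mapping: refresh the endpoint
            # so later callback breaks go to the port the client last used.
            entry.addr, entry.port = addr, port
        return entry

    def callback_target(self, uuid):
        """Return the (addr, port) a callback break would be sent to."""
        entry = self._by_uuid[uuid]
        return (entry.addr, entry.port)


if __name__ == "__main__":
    table = HostTable()
    table.note_contact("uuid-a", "192.0.2.10", 7001)
    table.note_contact("uuid-a", "192.0.2.10", 40123)  # NAT picked a new port
    print(table.callback_target("uuid-a"))
```

The point of keying on the UUID is that a port change does not create
a second host entry with stale callbacks; a client without a UUID
could not be matched this way.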

Jeffrey Altman wrote:
> The combination of problems that you have experienced should
> be solved in 1.4.1.  One of the issues that you were seeing is
> that the client was contacting the server on a port other than
> 7001 and the server was attempting to break callbacks on port
> 7001.  Since the NAT doesn't have a port mapping from 7001 to
> the client, the callbacks could not be broken.  Every time the
> client would contact the server, the server would believe that
> it had callbacks for the client that must be broken and would
> block the incoming RPC until the callbacks could be broken.
> 
> The 1.3.77 client also has a serious bug that would cause it
> to generate rapid-fire requests using a new RX connection for
> each RPC.  If you still have 1.3.77 clients deployed, try your
> best to upgrade them.
> 
> The 1.4.1 file server (to be announced real soon now) goes to
> great lengths to track clients by both address and port number
> and to deal with clients behind NATs so that each time the NAT
> allocates a new port number to the client the relevant host
> entry will be updated to track it.  This should provide a very
> good NAT experience for end users that have AFS clients that
> support UUIDs.  All of the OpenAFS clients for UNIX/Linux support
> UUIDs and Windows clients 1.3.80 and later do.
> 
> Jeffrey Altman
> 
> 
> John W. Sopko Jr. wrote:
>> We have 3 OpenAFS 1.4.0 file servers running on Red Hat Enterprise
>> Linux 3 with the latest patches. This morning when I
>> came in the servers were very slow and not responding to
>> client requests; they were basically hung. This in turn
>> pretty much takes down all our web servers' file services
>> for home dirs etc.
>>
>> I tracked this down to a "bad" AFS Windows client. The client
>> was running an old 1.3.77 version of the client or may have
>> a misconfigured firewall. I halted the "bad" client
>> and this fixed our server problems. I turned up
>> debugging on the file server (kill -TSTP) and got the messages
>> below, which I used to track this down. I searched the afs-info
>> archives and this problem was discussed in 2002 and was
>> supposed to get fixed. Is this fixed in a version newer than
>> 1.4.0? That is, not allowing clients to bring down the server
>> with bad callbacks. Thanks for your input.
>>
>> Tue Apr 18 10:20:19 2006 CB: RCallBackConnectBack failed for
>> 152.2.128.182:7001
>> Tue Apr 18 10:22:27 2006 [12] CB: Call back connect back failed (in
>> break delayed) for 152.2.128.182:7001
>> Tue Apr 18 10:22:27 2006 [12] BreakDelayedCallbacks FAILED for host
>> 152.2.128.182 which IS UP.  Possible network or routing failure.
>> Tue Apr 18 10:22:27 2006 [12] MultiProbe failed to find new address for
>> host 152.2.128.182:7001
>> Tue Apr 18 10:24:34 2006 [7] CB: WhoAreYou failed for
>> 152.2.128.182:7001, error -03
>> Tue Apr 18 10:26:42 2006 [7] CB: Call back connect back failed (in break
>> delayed) for 152.2.128.182:7001
>> Tue Apr 18 10:26:42 2006 [7] BreakDelayedCallbacks FAILED for host
>> 152.2.128.182 which IS UP.  Possible network or routing failure.
>>
>> Here is the old post about this:
>>
>> --------------------------------------------
>> From fbo2@gmx.net  Tue Aug 27 12:13:13 2002
>> Date: Tue, 27 Aug 2002 18:12:59 +0200
>> From: FBO <fbo2@gmx.net>
>> To: OpenAFS-info@openafs.org
>>
>> Hello,
>>
>> We (Solaris 8, Transarc 3.6 2.32 servers, 3.6 2.26 db servers) had an
>> issue where a client with a certain firewall (ZoneAlarm and/or BlackICE)
>> configuration (allowing AFS traffic out but no AFS traffic in, or
>> more precisely, not allowing any _uninitiated_ inbound AFS traffic,
>> e.g. a fileserver callback) caused the fileserver (a couple, actually) to
>> slow to a crawl (reads/writes taking 10 minutes or more to complete) and
>> become virtually unusable.  We ended up blocking this firewalled
>> client machine to get the fileservers back to normal.  During the "outage"
>> FileLog would repeat the following message sequence every minute:
>>
>> Wed Jul 10 16:22:55 2002 BreakDelayedCallbacks FAILED for host 894f2528
>> which IS UP.  Possible network or routing failure.
>> Wed Jul 10 16:22:55 2002 MultiProbe failed to find new address for
>> host 894f2528.7001
>> Wed Jul 10 16:23:51 2002 CB: Call back connect back failed (in break
>> delayed) for 894f2528.7001
>>
>> We have not been able to duplicate the problem but we've experienced it
>> 2 to 3 times within about 3 months.
>>
>> Below is the explanation I got from Transarc. They've informed us that a
>> fix is en route.  Has anybody ever experienced this in OpenAFS (or
>> anywhere)?
>>
>>
>>
>>
>>
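
For anyone else chasing this, the FileLog lines quoted above are easy
to scan for the offending host. Below is a rough sketch, assuming only
the "BreakDelayedCallbacks FAILED for host ..." wording shown in the
two log excerpts; the function name failing_hosts is made up, and the
sample text is copied from the messages above.

```python
import re
from collections import Counter

# Captures the token after "FAILED for host". In the excerpts above this is
# a dotted IPv4 address (1.4.0) or a hex address (Transarc 3.6).
FAILED_RE = re.compile(r"BreakDelayedCallbacks FAILED for host\s+(\S+)")


def failing_hosts(log_text):
    """Count BreakDelayedCallbacks failures per host in FileLog text."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return Counter(FAILED_RE.findall(flat))


sample = """\
Tue Apr 18 10:22:27 2006 [12] BreakDelayedCallbacks FAILED for host
152.2.128.182 which IS UP.  Possible network or routing failure.
Tue Apr 18 10:26:42 2006 [7] BreakDelayedCallbacks FAILED for host
152.2.128.182 which IS UP.  Possible network or routing failure.
"""

if __name__ == "__main__":
    for host, count in failing_hosts(sample).most_common():
        print(host, count)
```

A host that shows up repeatedly is a candidate for the kind of "bad"
client described above; the count alone does not say whether the cause
is an old client, a firewall, or a NAT.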

-- 
John W. Sopko Jr.               University of North Carolina
email: sopko AT cs.unc.edu      Computer Science Dept., CB 3175
Phone: 919-962-1844             Sitterson Hall; Room 044
Fax:   919-962-1799             Chapel Hill, NC 27599-3175