[OpenAFS] BreakDelayedCallbacks FAILED still an issue

John W. Sopko Jr. sopko@cs.unc.edu
Tue, 18 Apr 2006 12:43:25 -0400


We have 3 OpenAFS 1.4.0 file servers running on Red Hat Enterprise
Linux 3 with the latest patches. This morning when I came in, the
servers were very slow and not responding to client requests; they
were basically hung. That in turn pretty much takes down all our
web servers, file services for home dirs, etc.

I tracked this down to a "bad" AFS Windows client. The client was
running an old 1.3.77 version of the client software, or it may have
a misconfigured firewall. I halted the "bad" client and this fixed
our server problems. I turned up debugging on the file server
(kill -TSTP) and got the messages below, which I used to track this
down. I searched the openafs-info archives: this problem was
discussed in 2002 and was supposed to get fixed. Is it fixed in a
version newer than 1.4.0? That is, are clients no longer able to
bring down the server with bad callbacks? Thanks for your input.

Tue Apr 18 10:20:19 2006 CB: RCallBackConnectBack failed for 152.2.128.182:7001
Tue Apr 18 10:22:27 2006 [12] CB: Call back connect back failed (in break delayed) for 152.2.128.182:7001
Tue Apr 18 10:22:27 2006 [12] BreakDelayedCallbacks FAILED for host 152.2.128.182 which IS UP.  Possible network or routing failure.
Tue Apr 18 10:22:27 2006 [12] MultiProbe failed to find new address for host 152.2.128.182:7001
Tue Apr 18 10:24:34 2006 [7] CB: WhoAreYou failed for 152.2.128.182:7001, error -03
Tue Apr 18 10:26:42 2006 [7] CB: Call back connect back failed (in break delayed) for 152.2.128.182:7001
Tue Apr 18 10:26:42 2006 [7] BreakDelayedCallbacks FAILED for host 152.2.128.182 which IS UP.  Possible network or routing failure.
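
For anyone who wants to find the offending client the same way, here is a
rough Python sketch that tallies the callback-failure messages above per
client address. It assumes each message sits on a single line in the
FileLog (the wrapping above comes from the mail client), and the names
FAILURE_RE, count_callback_failures and SAMPLE are made up for this
illustration; they are not part of OpenAFS.

```python
import re
from collections import Counter

# Matches the four callback-failure messages quoted above and captures
# the dotted-quad client address that follows "for" or "for host".
FAILURE_RE = re.compile(
    r"(?:RCallBackConnectBack failed|Call back connect back failed|"
    r"BreakDelayedCallbacks FAILED|WhoAreYou failed).*?"
    r"for (?:host )?(\d{1,3}(?:\.\d{1,3}){3})"
)


def count_callback_failures(lines):
    """Return a Counter mapping client IP -> number of failure messages."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input: the log lines from this message, one message per line.
SAMPLE = [
    "Tue Apr 18 10:20:19 2006 CB: RCallBackConnectBack failed for 152.2.128.182:7001",
    "Tue Apr 18 10:22:27 2006 [12] CB: Call back connect back failed (in break delayed) for 152.2.128.182:7001",
    "Tue Apr 18 10:22:27 2006 [12] BreakDelayedCallbacks FAILED for host 152.2.128.182 which IS UP.  Possible network or routing failure.",
    "Tue Apr 18 10:22:27 2006 [12] MultiProbe failed to find new address for host 152.2.128.182:7001",
    "Tue Apr 18 10:24:34 2006 [7] CB: WhoAreYou failed for 152.2.128.182:7001, error -03",
    "Tue Apr 18 10:26:42 2006 [7] CB: Call back connect back failed (in break delayed) for 152.2.128.182:7001",
    "Tue Apr 18 10:26:42 2006 [7] BreakDelayedCallbacks FAILED for host 152.2.128.182 which IS UP.  Possible network or routing failure.",
]

if __name__ == "__main__":
    # List the noisiest clients first.
    for host, n in count_callback_failures(SAMPLE).most_common():
        print(host, n)
```

The MultiProbe line is deliberately not counted, since it is a follow-up
to the BreakDelayedCallbacks failure rather than a separate failure. To
run it against a real log, pass an open file instead of SAMPLE.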

Here is the old post about this:

--------------------------------------------
 From fbo2@gmx.net  Tue Aug 27 12:13:13 2002
Date: Tue, 27 Aug 2002 18:12:59 +0200
From: FBO <fbo2@gmx.net>
To: OpenAFS-info@openafs.org
 
X-BeenThere: openafs-info@openafs.org
X-Mailman-Version: 2.0.4
Precedence: bulk
List-Help: <mailto:openafs-info-request@openafs.org?subject=help>
List-Post: <mailto:openafs-info@openafs.org>
List-Subscribe: <https://lists.openafs.org/mailman/listinfo/openafs-info>,
         <mailto:openafs-info-request@openafs.org?subject=subscribe>
List-Id: OpenAFS Info/Discussion <openafs-info.openafs.org>
List-Unsubscribe: <https://lists.openafs.org/mailman/listinfo/openafs-info>,
         <mailto:openafs-info-request@openafs.org?subject=unsubscribe>
List-Archive: <https://lists.openafs.org/pipermail/openafs-info/>

Hello,

We (Solaris 8, Transarc 3.6 2.32 servers, 3.6 2.26 db servers) had an
issue where a client with a certain firewall (Zone Alarm and or Black
Ice) configuration (allowing AFS traffic out but no AFS traffic in, or
more precisely, it didn't allow any _uninitiated_ inbound AFS traffic
e.g. a fileserver callback) caused the fileserver (a couple actually) to
come to a crawl (reads/writes taking 10minutes or more to complete) and
become virtually unusable.  Had to end up blocking this firewall'ed
client machine to get fileservers back to normal.  During "outage"
FileLog would repeat following message sequence every minute:

Wed Jul 10 16:22:55 2002 BreakDelayedCallbacks FAILED for host 894f2528
which IS UP.  Possible network or routing failure.
Wed Jul 10 16:22:55 2002 MultiProbe failed to find new address for
host894f2528.7001
Wed Jul 10 16:23:51 2002 CB: Call back connect back failed (in break
delayed) for 894f2528.7001
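
The older log above identifies the client by a hex host ID rather than a
dotted address. A small sketch of turning that into dotted-quad form,
assuming the eight hex digits are the IPv4 address bytes in network order
(that is an assumption about how that server printed the value, not
something stated in the post); hex_host_to_dotted is a name made up for
this example:

```python
import socket


def hex_host_to_dotted(hex_host):
    """Convert an 8-digit hex host ID such as '894f2528' to dotted-quad form.

    Assumes the hex digits are the four address bytes in network order.
    """
    return socket.inet_ntoa(bytes.fromhex(hex_host))


if __name__ == "__main__":
    print(hex_host_to_dotted("894f2528"))
```

If the digits were printed in the opposite byte order, the four octets
would simply come out reversed.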

We have not been able to duplicate the problem but we've experienced it
2 to 3 times within about 3 months.

Below is the explanation I got from Transarc. They've informed us that a
fix is en route.  Has anybody ever experienced this in openafs (or
anywhere)?





-- 
John W. Sopko Jr.               University of North Carolina
email: sopko AT cs.unc.edu      Computer Science Dept., CB 3175
Phone: 919-962-1844             Sitterson Hall; Room 044
Fax:   919-962-1799             Chapel Hill, NC 27599-3175