[OpenAFS] Re: Chronic blocked connections on fileserver

Will Maier wcmaier@hep.wisc.edu
Fri, 28 Sep 2007 10:48:45 -0500

On Mon, Sep 24, 2007 at 09:35:37AM -0400, Jim Rees wrote:
> Will Maier wrote:
> > I should add that we're running OpenAFS 1.4.1 on Scientific
> > Linux 4.4 on 2.6.9-42.0.3.ELsmp (x86_64).
> If I'm not mistaken it's a known bug, or rather set of bugs. You
> want to update to at least OpenAFS 1.4.2, probably later. You may
> also have some misbehaving clients, or clients behind misbehaving
> firewalls.

Thanks for the advice. On Tuesday morning, I upgraded OpenAFS to
1.4.4 (using the Scientific Linux RPMs). The
upgrade window happened to fall during another bad spike in blocked
connections, which disappeared immediately after the bos restart I
ran on the server. For the next 48 hours, we had only one short
period with ~16 blocked connections, a significant improvement over
our previous condition (extended periods of ~220 blocked
connections, several times a day).

After 48 or so hours, the extended periods of badness returned.
These periods are still isolated to this one host and otherwise
resemble what we saw last week.

The FileLog shows lots and lots of SAFS_FetchStatus calls during the
bad periods, but I'm not sure if that's causal or symptomatic. When
the bad periods resolve themselves (as they always do), data
actually starts being transferred.

On a whim, I thought I'd take a look at the hosts hitting us during
a representative bad period. Since we had one this morning around
09:30 CDT, I did the following (the FileLog was on max-verbose):

    $ sed -e '/09:3.*SAFS_FetchStatus,  Fid/!d; s/^.*Fid = \([0-9]\+\)\..*, Host \(.*\):.*,.*$/\2/' FileLog |\
         sort | uniq -c | sort -rn | head | cat -n
         1     2514
         2     1326
         3     1188
         4      815
         5      758
         6      670
         7      615
         8      538
         9      470
        10      381

Lines 2..10 aren't surprising; they're all on our subnet and should
be accessing lots of data. Line 1 is odd, though: why should a host
at MIT be responsible for ~50 FetchStatus calls per second? Could
this sort of behavior over a WAN link (they're MIT, we're Wisconsin)
cause blocked connections to pile up?

I don't expect that this is responsible for our problems, but it
sure struck me as interesting nonetheless. If it's not, do any of
our post-upgrade observations suggest another course of action?

Thanks for the help!

o--------------------------{ Will Maier }--------------------------o
| jabber:...wcmaier@xmpp.lfod.us | email:..will.maier@hep.wisc.edu |
| office:...........608.263.9692 | cell:..............608.438.6162 |
*--------------------[ UW High Energy Physics ]--------------------*