[OpenAFS] Request for testing: NATs and 1.6.6pre*

Andrew Deason adeason@sinenomine.net
Thu, 19 Dec 2013 18:28:58 -0600


Hi all,

1.6.6pre1 and 1.6.6pre2 contain an extra feature in the OpenAFS
fileserver that could possibly help with communicating with clients
behind NATs (Network Address Translation). It's not completely certain
how much this feature helps, though, so it will be removed from the
1.6.6 release unless we get some more information about it.

If you are running a fileserver that you believe may have some trouble
talking to clients behind NATs, testing this feature would be very
helpful. This is most relevant for any site that may have fileservers
that are talking to NAT'ed clients, where the clients are old enough to
not have the client-side NAT improvements (pre-1.6); this is most common
at sites that have users accessing AFS from home that don't know much
about AFS. You can test this new feature by just running a fileserver
with 1.6.6pre* and seeing if anything improves; there is no additional
configuration to do.

But how do you know if this is a problem for you at all? Usually the
most user-visible symptom is that access to AFS hangs while a client is
trying to write to AFS, but a lot of different things can cause that.

To know if that is being caused _specifically_ because of problems
reaching clients behind NATs, you can check the fileserver's FileLog. In
there, if you see a lot of log messages talking about errors trying to
contact specific IPs and port numbers, you may be suffering from this.

In particular, it's somewhat likely to be related to NATs if you see a
lot of such error messages logged referring to non-7001 ports. And it's
especially likely if you see a lot of connection errors for non-7001
ports that are obviously incrementing over time. (For example, you see
an error for port 8005, then 8006, then 8007, etc, all from the same
IP.)
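
As a rough aid for spotting that pattern, here is a minimal Python
sketch. It assumes only that the relevant FileLog lines contain an
"IP:port" token somewhere in them; the exact wording of the log messages
is not relied upon, and the sample lines in the usage note are made up.
The function names and the min_run threshold are hypothetical, chosen
just for illustration.

```python
import re
from collections import defaultdict

# Assumption: an address appears in the log line as dotted-quad IPv4 + ":port".
ADDR_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b")


def ports_by_ip(lines):
    """Collect the non-7001 ports seen for each IP, in order of appearance."""
    seen = defaultdict(list)
    for line in lines:
        for ip, port in ADDR_RE.findall(line):
            port = int(port)
            if port != 7001:
                seen[ip].append(port)
    return dict(seen)


def longest_incrementing_run(ports, max_step=1):
    """Length of the longest run where each port exceeds the previous
    one by between 1 and max_step."""
    best = run = 1 if ports else 0
    for prev, cur in zip(ports, ports[1:]):
        run = run + 1 if 0 < cur - prev <= max_step else 1
        best = max(best, run)
    return best


def suspicious_ips(lines, min_run=3):
    """IPs whose non-7001 ports climb steadily for at least min_run entries."""
    return sorted(
        ip
        for ip, ports in ports_by_ip(lines).items()
        if longest_incrementing_run(ports) >= min_run
    )
```

Feeding it something like open("/usr/afs/logs/FileLog") (adjust the path
for your installation) and printing suspicious_ips(...) gives a list of
IPs worth a closer look. It is only a heuristic; a NAT may not hand out
strictly consecutive ports, so raising max_step may be needed.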

It can also help to know if the IPs you see logged in FileLog are behind
NATs in the first place. If you have no way of knowing that, you can
sort-of detect what hosts may be behind NATs by sending the fileserver
the SIGXCPU signal, and looking at the resulting
/usr/afs/local/hosts.dump file. If you see an entry for a host with a
public IP like "ip:203.0.113.40", and later on in that entry you see a
list of IPs that include private IPs, like "[ 203.0.113.40:7001
192.168.1.5:7001]", that host may be behind a NAT.

"Detecting" a client behind a NAT in this way is far from perfect, but
it's just another thing to check. Common private IP ranges are of
course 192.168/16, 172.16/12, and 10/8. A client can obviously be behind
a NAT without an IP in any of those ranges, but those are commonly used
by consumer-grade home routers and stuff like that.
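
The check described above can be sketched in Python along these lines.
This is only an illustration: it assumes the bracketed address list
looks like the example quoted earlier, it does not parse a full
hosts.dump entry, and the function names are hypothetical. The private
ranges are the RFC 1918 ones (10/8, 172.16/12, 192.168/16), listed
explicitly because Python's own is_private attribute also covers some
other reserved ranges.

```python
import ipaddress
import re

# RFC 1918 private ranges, spelled out explicitly.
PRIVATE_NETS = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

# Assumption: the address list is a bracketed, space-separated run of ip:port.
LIST_RE = re.compile(r"\[([^\]]*)\]")


def is_private(ip):
    """True if ip falls inside one of the RFC 1918 ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PRIVATE_NETS)


def maybe_behind_nat(primary_ip, entry_text):
    """True if the host's primary IP is public but its bracketed
    address list includes at least one private address."""
    if is_private(primary_ip):
        return False
    match = LIST_RE.search(entry_text)
    if match is None:
        return False
    for token in match.group(1).split():
        ip = token.rsplit(":", 1)[0]  # drop the ":port" suffix
        try:
            if is_private(ip):
                return True
        except ValueError:
            continue  # not an IP address; ignore the token
    return False
```

For the example entry above, maybe_behind_nat("203.0.113.40",
"[ 203.0.113.40:7001 192.168.1.5:7001]") returns True, while an entry
listing only the public address returns False.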


Anyway, if you ever look into why an OpenAFS fileserver appears to be
slow/hanging, and the above information suggests that client NATs are an
issue, it would be very helpful if you tried looking into some possible
fixes. If you cannot deploy 1.6.6pre* on a server experiencing this
issue, we can also provide patches specifically for this issue based on
a previous stable version, if that's more feasible. There are also
additional possible patches in this area that are not in 1.6.6pre*, if
you want to try other approaches.

Or even if you can't actually deploy any testing code, I'd still like to
hear from you if you think you are experiencing issues in this area.
More information is always appreciated. Remember that if we don't hear
anything, this feature will be pulled out of the 1.6.6 release.


For developers: obviously I'm skipping over the details of what any of
this actually does. The 'extra feature' is gerrit 9420, which will be
reverted via gerrit 10135. See also: gerrits 10144-10147.

-- 
Andrew Deason
adeason@sinenomine.net