[OpenAFS] Performance issues
Wed, 17 Aug 2011 23:03:44 +0200
Some six months after launching my first AFS project, I've learned
some things but would also like to make some improvements.
My site has three locations, each with one physical server machine
running Debian squeeze, as well as two Internet connections. To avoid
Kerberos identity problems among the three servers, I've set up one
virtual host per server machine, each with a single interface and a
public IP address. At first these virtual machines hosted our
integrated Kerberos and OpenLDAP servers as well as our OpenAFS
database and file servers.
In this configuration, all AFS traffic, even local traffic, was forced
to traverse a firewall and a NAT. In tests it all seemed okay, but in
practice performance could be quite poor, often getting worse after
several hours of use. At first we thought the cause was the UDP
connections timing out, but the problems persisted even after we had
compensated and no more AFS packets were being dropped.
Last weekend I decided to make a major change to our AFS server
architecture. I created a new AFS file server on the base OS of each
of the three physical servers, moved all of the AFS volumes there and
removed the file server processes from the virtual hosts. With NetInfo
and NetRestrict, I set each of the new file servers to register only
one public IP address in the VLDB. Now local AFS file server traffic
is no longer forced to traverse the firewall/NAT. This was a great
improvement, so that people with local user volumes are at last
experiencing performance that can be described as normal.
However, if anything this has only made the contrast between local and
remote performance even more acute. After all, nothing has changed for
the remote users: not only do their connections have less bandwidth (6
Mbps) and higher latency (25-30 ms), but their traffic must also
traverse two firewalls and a local NAT.
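
For a rough sense of scale, the remote-link figures above can be put
through a bandwidth-delay calculation. This is only a sketch under
stated assumptions: it treats the 25-30 ms figure as round-trip time,
and `window_bytes` is a hypothetical parameter standing in for whatever
limit the transport places on unacknowledged data, which is not stated
here and depends on the OpenAFS version.

```python
# Rough link arithmetic for the remote sites, using the figures quoted above.
LINK_MBPS = 6.0          # remote link bandwidth from the post
RTT_MS = (25.0, 30.0)    # latency range from the post, assumed to be round-trip time


def bdp_bytes(mbps, rtt_ms):
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return mbps * 1_000_000 / 8 * (rtt_ms / 1000.0)


def window_limited_mbps(window_bytes, rtt_ms):
    """Throughput ceiling (Mbps) if at most window_bytes may be unacknowledged.

    window_bytes is a hypothetical input, not a known OpenAFS setting.
    """
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1_000_000


for rtt in RTT_MS:
    print(f"RTT {rtt:.0f} ms: BDP = {bdp_bytes(LINK_MBPS, rtt):.0f} bytes")
```

With these numbers, roughly 18750 to 22500 bytes must be in flight to
keep the 6 Mbps link full. If the effective window were smaller than
that, latency rather than raw bandwidth would cap a single transfer,
and removing the NAT alone would not change that ceiling.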
What is now the most likely cause of this difference in performance?
If it's mostly due to the NAT and the fact that the firewalls must be
traversed and the UDP connections kept track of, then I could set up
some tunnels to avoid most of that. But if it's just the higher
latency and limited bandwidth, then I suppose there's not much to do