[OpenAFS] Performance issues

Jaap Winius jwinius@umrk.nl
Wed, 17 Aug 2011 23:03:44 +0200

Hi folks,

Some six months after launching my first AFS project, I've learned  
some things but would also like to make some improvements.

My site has three locations, each with one physical server machine  
running Debian squeeze, as well as two Internet connections. To avoid  
Kerberos identity problems among the three servers, I've set up one  
virtual host per server machine, each with a single interface and a  
public IP address. At first these virtual machines hosted both our  
integrated Kerberos and OpenLDAP servers, as well as our OpenAFS  
database and file servers.

In this configuration, all AFS traffic, even local traffic, was forced  
to traverse a firewall and a NAT. In tests it all seemed okay, but in  
practice performance could be quite poor, often getting worse after  
several hours of use. At first we thought the cause was UDP
connections timing out on the firewall, but the problems persisted even
after we had compensated for that and no more AFS packets were being dropped.

Last weekend I decided to make a major change to our AFS server  
architecture. I created a new AFS file server on the base OS of each  
of the three physical servers, moved all of the AFS volumes there and  
removed the file server processes from the virtual hosts. With NetInfo  
and NetRestrict, I set each of the new file servers to register only  
one public IP address in the VLDB. Now local AFS file server traffic  
is no longer forced to traverse the firewall/NAT. This was a great
improvement: people with local user volumes are at last
experiencing performance that can be described as normal.
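
For reference, a minimal sketch of what the two files might contain on
one of the new file servers. The addresses are placeholders from the
documentation and private ranges, not our real ones, and the directory
(/var/lib/openafs/local on Debian) is an assumption; check where your
packaging puts the file server's local configuration.

NetInfo (the one address to register in the VLDB):

```text
192.0.2.10
```

NetRestrict (an address that should not be registered):

```text
10.0.0.5
```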

However, if anything this has only made the contrast between local and  
remote performance even more acute. After all, nothing has changed for  
the remote users: not only do their connections have less bandwidth (6  
Mbps) and higher latency (25-30 ms), but their traffic must also  
traverse two firewalls and a local NAT.
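
To put rough numbers on the remote link, here is a back-of-the-envelope
sketch in Python. It assumes the 25-30 ms figure is a round-trip time,
and the window size (32 packets) and payload size (1400 bytes) are
hypothetical illustration values, not measured or confirmed Rx settings.
It only estimates latency and bandwidth effects; it says nothing about
what the NAT or firewalls add.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return bandwidth_bps * rtt_ms / 1000 / 8


def window_limit_bps(window_packets, packet_bytes, rtt_ms):
    """Throughput ceiling when at most one full window is sent per round trip."""
    return window_packets * packet_bytes * 8 * 1000 / rtt_ms


def transfer_time_s(size_bytes, bandwidth_bps, rtt_ms, round_trips=1):
    """Rough transfer time: round-trip waits plus time to push the bytes."""
    latency_s = round_trips * rtt_ms / 1000
    serialization_s = size_bytes * 8 / bandwidth_bps
    return latency_s + serialization_s


if __name__ == "__main__":
    link_bps = 6_000_000   # 6 Mbps remote link (from the figures above)
    rtt_ms = 30            # upper end of 25-30 ms, assumed to be round trip

    print("BDP (bytes):", bdp_bytes(link_bps, rtt_ms))
    # Hypothetical window and payload sizes, for illustration only.
    print("Window ceiling (bps):", window_limit_bps(32, 1400, rtt_ms))
    # A small file needing several round trips vs. one large file.
    print("10 KB, 5 round trips (s):", transfer_time_s(10_000, link_bps, rtt_ms, 5))
    print("10 MB, 1 round trip (s):", transfer_time_s(10_000_000, link_bps, rtt_ms, 1))
```

If the window ceiling comes out above the link speed, bulk transfers
would be limited by the 6 Mbps itself, while anything involving many
small requests would be dominated by the number of round trips.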

What is now most likely the main reason for this difference in performance?

If it's mostly due to the NAT and the fact that the firewalls must be
traversed and must keep track of the UDP connections, then I could set
up some tunnels to avoid most of that. But if it's just the higher
latency and limited bandwidth, then I suppose there's not much to be
done about it.