[OpenAFS] performance and udp buffers

Dan Van Der Ster daniel.vanderster@cern.ch
Tue, 9 Oct 2012 09:24:22 +0000


Dear AFS Gurus,
At CERN we have been suffering from occasional problems where the time to access any volume/partition on a busy AFS server suddenly increases to ~infinity (or at least 20 seconds). During these incidents we consistently notice:
 - at least one user is hammering the fileserver (from 10s or 100s of batch jobs)
 - the time to write 64kB to any volume on the affected server goes from the usual ~10ms to >10 or 20 seconds
 - network throughput is "flat" for the duration of the incident, but well below the historical peak throughput -- sometimes at ~50MBps or up to ~150MBps (the server has a 10Gbps network card)
 - CPU usage is also flat at ~120% (corresponding to 1 processor + a bit)
 - iostat shows little or no disk activity
 - there is no shortage of threads (more than 100 idle threads).

We are indeed able to reproduce the issue in a synthetic stress test environment (using both v1.4.14+CERN patches and 1.6.1a vanilla code). With the 1.6.1a vanilla fileserver, we can hit this access time wall by creating >=30 clients which simultaneously cp a 10GB file from AFS into /dev/null.

Recently, while trying to reproduce the issue with rxperf, we found that it is basically due to overrunning the UDP socket buffer, i.e. huge numbers of dropped UDP packets (we see >10% packet errors in /proc/net/snmp). By increasing the buffer size we can effectively mitigate this problem.
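
For reference, this is roughly how we watch the drop counters during a test. It is just a sketch that parses the two "Udp:" lines in /proc/net/snmp (field names, then values); the exact counters present (e.g. RcvbufErrors) depend on the kernel version:

#!/usr/bin/env python
# Rough check for UDP drops: /proc/net/snmp has two "Udp:" lines,
# the first with field names and the second with the values.
def udp_counters(path='/proc/net/snmp'):
    with open(path) as f:
        rows = [line.split() for line in f if line.startswith('Udp:')]
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, [int(v) for v in values]))

c = udp_counters()
received = c.get('InDatagrams', 0) + c.get('InErrors', 0)
if received:
    pct = 100.0 * c.get('InErrors', 0) / received
    print('UDP InErrors: %d (%.1f%% of datagrams)' % (c.get('InErrors', 0), pct))
# RcvbufErrors (where available) counts drops due to a full socket buffer
print('UDP RcvbufErrors: %s' % c.get('RcvbufErrors', 'n/a'))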

We currently run fileservers with udpsize=2MB, and at that size we hit the 30-client limit in our test environment. With a buffer size of 8MB (kernel max increased via sysctl, plus the fileserver option), we don't see any dropped UDP packets during our client-reading stress test, but still get some dropped packets if all clients write to the server. With a 16MB buffer we don't see any dropped packets at all, reading or writing.
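
For concreteness, the tuning boils down to something like the following (16MB case shown; the exact values and where we set them are just our choices). The kernel ceiling has to be at least as large as the value passed to the fileserver, otherwise the setsockopt call is silently capped; we raise wmem_max as well, though rmem_max is the one that matters for the receive drops:

# /etc/sysctl.conf: raise the kernel ceiling on socket buffer sizes
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# fileserver command line (in BosConfig): matching UDP socket buffer
-udpsize 16777216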

In practice, with this very large UDP buffer we can decrease the access time from ~infinity to less than 1 second on a very heavily loaded server (e.g. 250 clients writing 1GB each).

We plan to roll these very large UDP buffer sizes into production, but wanted to check here first if we have missed something. Does anyone foresee problems with using a 16MB UDP buffer?

By the way, we have also compared the access latency of 1.4.14 and 1.6.1a in our rxperf tests. In general we find that 1.6.1a provides a 2-3x speedup (e.g. a hammered 1.6.1a server has a 64kB write latency of ~300ms vs ~1s for 1.4.14). So this confirms a significant performance improvement in 1.6.

Best Regards,
Dan van der Ster, CERN IT-DSS
on behalf of the CERN AFS Team