[OpenAFS] Questions regarding `afsd` caching arguments (`-dcache` and `-files`)

Ciprian Dorin Craciun ciprian.craciun@gmail.com
Sat, 9 Mar 2019 01:49:18 +0200

On Fri, Mar 8, 2019 at 11:39 PM Ciprian Dorin Craciun
<ciprian.craciun@gmail.com> wrote:
> On Fri, Mar 8, 2019 at 11:11 PM Jeffrey Altman <jaltman@auristor.com> wrote:
> > The performance issues could be anywhere and everywhere between the
> > application being used for testing and the disk backing the vice partition.

OK, so first of all I want to thank Jeffrey for the support via IRC,
as we've solved the issue.

Basically it boils down to:

* lower the number of `fileserver` worker threads to a value matched
to the available CPUs / cores;  (in my case `-p 4` or `-p 8`;)

* properly configure jumbo frames on the network cards: `ip link set
dev eth0 mtu 9000`;  (this configuration has to be made in the
distribution's persistent network configuration, else it is lost
after a restart;)
* (after changing the MTU, restart both the server and the clients;)

* disable encryption with `fs setcrypt -crypt off`;  (from what I
understood the `fcrypt` cipher is not very strong anyway, and given
that I'll use it mostly on the LAN it's not an issue;  moreover over
the WAN I don't need to saturate a gigabit link;)
* (after changing, re-authenticate, i.e. `unlog && klog`;)
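The steps above can be sketched as a shell session (the interface name
`eth0` is an assumption, and the exact way the `fileserver` options are
made persistent depends on the installation):

```shell
# 1. Fewer fileserver worker threads: run the fileserver with e.g. `-p 4`
#    (adjust its startup configuration on the server and restart it).

# 2. Jumbo frames, on both server and client; also persist this in the
#    distribution's network configuration, else it is lost on reboot:
ip link set dev eth0 mtu 9000

# 3. Disable wire encryption on the client:
fs setcrypt -crypt off

# 4. Re-authenticate so the change takes effect for new connections:
unlog && klog
```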

In order to check the correct configuration one has to:

* `cmdebug -servers <client> -addrs` (run against the client) to see
if the MTU is correctly picked up;  (else restart the cache manager;)

* `rxdebug <server> 7000 -peer -long` (run against the fileserver) to
see if the `ifMTU / natMTU / maxMTU` for the client connection have
proper values;  (in my case they were `8524 / 7108 / 7108`;)

* use `top -H` and check whether the kernel thread `afs_rxlistener`
(on the client) and the `fileserver` threads (on the server) are
maxed out (i.e. > ~90%);  if so, that is the bottleneck (after
encryption is disabled and jumbo frames are enabled);
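The verification steps above, as concrete commands (the hostnames
`client1.example.com` and `fs1.example.com` are hypothetical; 7000 is
the standard fileserver port):

```shell
# Ask the client's cache manager which interface MTUs it picked up:
cmdebug client1.example.com -addrs

# On the fileserver side, inspect the per-peer MTU values for the
# client connection; ifMTU / natMTU / maxMTU should reflect the jumbo
# frames (in my case: 8524 / 7108 / 7108):
rxdebug fs1.example.com 7000 -peer -long | grep -i mtu

# Per-thread CPU usage; if `afs_rxlistener` (client) or one of the
# `fileserver` threads (server) sits above ~90%, that is the bottleneck:
top -H -b -n 1 | head -n 40
```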

A note about the benchmark:  in order to saturate the link I've tested
only with large files (i.e. ~20 MiB each), otherwise I would end up
"thrashing" the disk, and that would become the bottleneck.

BTW, I've taken the liberty of copy-pasting the log from the IRC
channel (I've kept only the relevant lines, and also grouped and
reordered some of them), because it is very insightful regarding
OpenAFS performance.

So once more, thanks Jeffrey for the help,

23:43 < auristor> first question, when you are writing to the
fileserver, does "top -H" show a fileserver thread at or near 100%
23:45 < auristor> -H will break them out by process thread instead of
providing one value for the fileserver as a whole

23:46 < auristor> I ask because one thread is the RX listener thread
and that thread is the data pump.  If that thread reaches 100% then
you are out of capacity to receive and transmit packets

00:00 < auristor> Since you have a single client and 8 processor
threads on the fileserver, I would recommend lowering the -p
configuration value to reduce lock contention.

23:55 < auristor> there are two major bottlenecks in the OpenAFS.
First, the rx listener thread which does all of the work associated
with packet allocation, population, transmission, retransmission, and
freeing on the sender and packet allocation, population, application
queuing, acknowledging, and freeing on the receiver.

23:56 < auristor> In OpenAFS this process is not as efficient as it
could be and its architecture limits it to using a single processor
thread which means that its ability to scale correlates to the
processor clock speed

23:58 < auristor> Second, there are many global locks in play.  On the
fileserver, there is one global lock for each fileserver subsystem
required to process an RPC.  For directories there are 8 global locks
that must be acquired and 7 for non-directories.
23:59 < auristor> These global locks in the fileserver result in
serialization of calls received in parallel.

00:00 < ciprian_craciun> (Even if they are for different directories / files?)
00:00 < ciprian_craciun> (I.e. is there some sort of actual "global
lock" that basically serializes all requests from all clients?)

00:01 < auristor> The global locks I mentioned do serialize the
startup and shutdown of calls even when the calls touch different
objects.

00:02 < auristor> Note that an afs family fileserver is really an
object store.  Unlike an nfs or cifs fileserver, an afs fileserver
does not perform path evaluation;  path evaluation to object id is
performed by the cache managers.
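This split can be observed from the client: the cache manager resolves
a path to a file identifier and only then asks the fileserver for that
object. For example (the path is hypothetical):

```shell
# `fs examine` prints the FID (volume.vnode.uniquifier) that the cache
# manager resolved for the path; the fileserver only ever sees the FID:
fs examine /afs/example.com/test/file-1
```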

00:04 < auristor> The Linux cache manager also has a single global
lock that protects all other locks and data structures.  This lock is
dropped frequently to permit parallel processing but it does severely
limit the amount of parallel execution

00:09 < ciprian_craciun> Trying now with `-p 4` seems to yield ~35
MiB/s of `cat` throughput.

00:11 < auristor> that would imply that the fileserver is not
releasing worker threads from the call channel fast enough to permit
the thread to be available for the next incoming call from the client.

00:12 < auristor> are your tests using authentication?
00:14 < auristor> So the fcrypt encrypt and decrypt is probably the culprit
00:15 < auristor> fcrypt is weaker than des and very inefficient.
00:16 < auristor> The delays the encryption introduces in the sender
can lead to network stalls
00:24 < auristor> "aklog -force" is equivalent tot aht

00:24 < ciprian_craciun> And yes, now it seems I reach ~100 MiB/s.
00:25 < ciprian_craciun> Now RX listener is < 50%.

00:25 < auristor> and what is the afs_rxlistener thread utilization
and the fileserver threads usage?
00:26 < auristor> I suspect you have now moved the bottleneck from the
client's rx listener thread to the fileserver

00:27 < ciprian_craciun> The fileserver threads are < ~15%

00:29 < auristor> no jumbo grams
00:32 < auristor> cmdebug client -addr
00:45 < auristor> establish a call from the client to the fileserver
and what does "rxdebug <fileserver> 7000 -peer -long" report for the
peer?
00:49 < auristor> remove the -rxmaxmtu from the fileserver config