[OpenAFS-devel] volserver tuning
Todd_DeSantis@transarc.com
Wed, 30 Oct 2002 09:02:56 -0500 (EST)
Hi Nathan - Hi Russ (and others):
>> Recently I started more closely monitoring our volservers for
>> responsiveness, especially during mass volume move and dump
>> operations.
>> I've noticed that during periods where volume actions are taking
>> place, the volserver periodically hangs and doesn't seem to
>> respond. Sometimes this occurs with only a few moves taking place.
> Yup, we've been having the same problem; it causes a ton of problems
> for volume releases not infrequently.
I believe that both OpenAFS and IBM AFS have looked into this over the
past few months. This bottleneck lies in the communication between
the volserver and fileserver via the fssync calls. For certain
calls/transactions, the volserver must contact the fileserver to do
some actions. These actions are mainly
- have the fileserver break callbacks to clients that have been
using this volume.
We have noticed that with the increase in PCs and laptops that
travel between offices and home that the BreakCallback calls
can fail/timeout because they are no longer on the network.
This has caused the link between the fileserver and volserver
to linger and cause problems with this call and with other
transactions on this volserver.
Derrick made changes to the OpenAFS fssync code to allow the
fileserver to return control back to the volserver while the
callbacks are being broken. I believe Rainer Toebbicke of CERN
also made some changes in this area. This allows the
initial volserver transaction to continue.
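To make the flow concrete, here is a toy sketch (Python, not the real fssync C code; every name here is illustrative, not the actual API) of the change described above: the fileserver hands the callback breaks to a background thread and returns control right away, instead of blocking the volserver transaction on clients that may have left the network.

```python
import threading
import time

def break_callbacks(clients, per_client_timeout=0.01):
    # Stands in for the BreakCallback RPCs; a client that has left
    # the network makes its RPC block until the timeout expires.
    for client in clients:
        time.sleep(per_client_timeout)

def handle_fssync_request(clients):
    # New behaviour: the breaks run in the background, so the
    # volserver transaction that triggered them continues at once.
    # (Old behaviour was simply break_callbacks(clients) inline.)
    worker = threading.Thread(target=break_callbacks, args=(clients,))
    worker.start()
    return worker  # caller regains control immediately

worker = handle_fssync_request(["laptop-1", "laptop-2"])
worker.join()  # only for the demo; the real server would not wait here
```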
However, I think that the fileserver only has 1 thread dedicated
to listening to requests from the volserver, so while the
fileserver is still handling the BreakCallbacks request, other
requests from the volserver are being blocked. It is possible
that the CERN code addresses this and allows more fileserver
threads to listen for volserver requests.
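A toy model of that bottleneck (a Python stand-in; the request names and timings are made up): with a single worker, a slow BreakCallbacks request delays every request queued behind it, while extra workers let the short requests finish independently.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def serve(requests, workers):
    # Each request is (name, seconds of work); returns completion
    # times measured from when the batch started.
    start = time.monotonic()
    def handle(request):
        name, cost = request
        time.sleep(cost)
        return name, round(time.monotonic() - start, 2)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(handle, requests))

requests = [("BreakCallbacks", 0.2), ("attach", 0.01), ("detach", 0.01)]
single = serve(requests, workers=1)  # attach/detach wait behind the slow call
multi = serve(requests, workers=3)   # attach/detach finish almost at once
```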
The Transarc AFS code has also addressed some of these areas
of contention. We allow the fileserver to return control to the
volserver while it is breaking callbacks and we have also
increased the number of threads that are available for the
volserver requests. We have several sites running with these
versions now and I have not heard of any problems.
As a warning, I have suggested that sites be wary of scripts
that will try to release a series of volumes one after the other.
Since the "vos release" no longer has to wait for the fileserver
to break callbacks, the vos command finishes sooner, and this could
cause the next releases to hit the bottleneck at the fssync
interface and fail.
Running multiple "vos move" jobs at the same time against the same
volserver/fileserver can also run into this problem.
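If you drive releases from a script, one way to avoid piling transactions onto the fssync interface is to pace them. This is a hypothetical sketch, not a supported tool: the vos invocation and the pause length are assumptions to tune for your cell, and the `runner` parameter exists only so the pacing logic can be exercised without a real cell.

```python
import subprocess
import time

def paced_releases(volumes, pause=30, runner=None):
    # Release volumes one at a time, pausing between transactions so
    # the fileserver can finish breaking callbacks for each release
    # before the next one arrives at the fssync interface.
    if runner is None:
        runner = lambda vol: subprocess.run(
            ["vos", "release", vol, "-verbose"], check=True)
    for vol in volumes:
        runner(vol)
        time.sleep(pause)
```

The 30-second pause is an arbitrary starting point; watch the FileLog and adjust it for your environment.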
You can check the FileLog for messages complaining about breaking
volume callbacks to see if this is the problem you are running up
against.
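A quick way to do that check from a script (the log path and the phrases matched are assumptions; adjust them to whatever your FileLog actually reports):

```python
from pathlib import Path

def callback_complaints(log_path="/usr/afs/logs/FileLog"):
    # Return FileLog lines that mention breaking callbacks; these
    # suggest the fssync bottleneck described above.
    text = Path(log_path).read_text(errors="replace")
    return [line for line in text.splitlines()
            if "break" in line.lower() and "callback" in line.lower()]
```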
In my early days of supporting AFS, I always tried to tell customers
to watch the number of simultaneous vos transactions that they send to
the fileserver. These transactions are expensive at the IO level, and
running more of them at the same time hurts not only volserver
performance but also fileserver performance. Since those days, we
have changed the way the fileserver places volumes on the vice
partitions, so finding the next free inode to use is much faster and
no longer as big a performance bottleneck.
But it is still recommended that we do not try to overload the
volserver even though it does have the ability to run with 16
threads. From past experiences with customers, once 3 or 4 vos
transactions were active, performance did start to suffer.
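If you do need to run several transactions, a worker pool caps how many are in flight at once. A sketch under that 3-to-4 rule of thumb (`run_vos` is a placeholder, not a real API; a production version would shell out to the vos binary):

```python
from concurrent.futures import ThreadPoolExecutor

def run_capped(operations, max_concurrent=3, run_vos=None):
    # operations is a list of vos argument lists, e.g.
    # [["move", "user.alice", ...], ["release", "proj.www"]].
    # At most max_concurrent of them run at the same time.
    if run_vos is None:
        run_vos = lambda args: " ".join(["vos"] + args)  # placeholder
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_vos, operations))
```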
I'm probably mentioning things that you are already aware of, but I
did want to throw this out there.
Thanks
Todd DeSantis
AFS Support