[OpenAFS-devel] volserver tuning

Todd_DeSantis@transarc.com
Wed, 30 Oct 2002 09:02:56 -0500 (EST)


Hi Nathan - Hi Russ (and others):

>> Recently I started more closely monitoring our volservers for
>> responsiveness, especially during mass volume move and dump
>> operations.

>> I've noticed that during periods where volume actions are taking
>> place, the volserver periodically hangs and doesn't seem to
>> respond. Sometimes this occurs with only a few moves taking place.

> Yup, we've been having the same problem; it causes a ton of problems
> for volume releases not infrequently.

I believe that both OpenAFS and IBM AFS have looked into this over the
past few months.  This bottleneck lies in the communication between
the volserver and fileserver via the fssync calls.  For certain
calls/transactions, the volserver must contact the fileserver to do
some actions.  These actions are mainly
  - have the fileserver break callbacks to clients that have been
    using this volume.

    We have noticed that, with the increase in PCs and laptops that
    travel between offices and home, the BreakCallback calls can
    fail/timeout because those clients are no longer on the network.
    This has caused the link between the fileserver and volserver
    to linger, causing problems both for this call and for other
    transactions on this volserver.

    Derrick made changes to the OpenAFS fssync code to allow the
    fileserver to return control back to the volserver while the
    callbacks are being broken.  I believe Rainer Toebbicke of CERN
    made some changes in this area as well.  This can allow that
    initial volserver transaction to continue.

    However, I think that the fileserver only has 1 thread dedicated 
    to listening to requests from the volserver, so while the
    fileserver is still handling the BreakCallbacks request, other
    requests from the volserver are being blocked.  It is possible
    that the CERN code addresses this and allows more fileserver
    threads to listen for volserver requests.

    The Transarc AFS code has also addressed some of these areas
    of contention.  We allow the fileserver to return control to the
    volserver while it is breaking callbacks and we have also
    increased the number of threads that are available for the
    volserver requests.  We have several sites running with these
    versions now and I have not heard of any problems.

    As a warning, I have suggested that sites be wary of scripts 
    that will try to release a series of volumes one after the other.
    Since the "vos release" no longer has to wait for the fileserver
    to BreakCallbacks, the vos command finishes sooner and this could
    cause the next releases to hit the bottleneck at the fssync
    interface and fail.

    So having multiple "vos move" jobs running at the same time on this
    volserver/fileserver can also run into this problem.
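    To illustrate the kind of pacing this warning suggests, here is a
    minimal sketch (not anything from this thread) that releases volumes
    one at a time with a pause in between.  The volume names and the
    DELAY value are hypothetical, and with DRY_RUN=1 (the default) it
    only prints the commands rather than running them:

    ```shell
    #!/bin/sh
    # Sketch only: release volumes serially, pausing between releases so
    # a fast-returning "vos release" does not pile the next transaction
    # onto a still-busy fssync interface.  Volume names and DELAY are
    # hypothetical; DRY_RUN=1 (the default) just prints the commands.

    release_serially() {
        for vol in "$@"; do
            if [ "${DRY_RUN:-1}" = 1 ]; then
                echo "vos release $vol"
            else
                vos release "$vol" || echo "release of $vol failed" >&2
                sleep "${DELAY:-30}"  # let the fileserver finish breaking callbacks
            fi
        done
    }

    release_serially user.alice user.bob proj.tools
    ```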

    You can check the FileLog to see if you are seeing messages
    complaining about breaking volume callbacks to see if this is
    possibly the problem you are running up against.
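    For example, something along these lines (the log location is the
    conventional one, and since the exact message text differs between
    AFS releases, the pattern here matches loosely):

    ```shell
    # Illustration: look for callback-break complaints in the
    # fileserver's FileLog.  /usr/afs/logs/FileLog is the conventional
    # location; adjust for your site.  Message wording varies by
    # release, so grep for "callback" case-insensitively.
    log=/usr/afs/logs/FileLog
    if [ -r "$log" ]; then
        grep -i 'callback' "$log" | tail -20
    else
        echo "no readable FileLog at $log"
    fi
    ```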

In my early days of supporting AFS, I always tried to tell customers
to watch the number of simultaneous vos transactions that they send to
the fileserver.  These transactions are expensive at the IO level,
and running more of them at the same time can hurt not only
volserver performance but also fileserver performance.  Since those
days, we have changed the way the fileserver places volumes on the
vice partitions, so finding the next free inode to use is much faster
and no longer as big a performance bottleneck.

But it is still recommended that we not try to overload the
volserver, even though it does have the ability to run with 16
threads.  From past experiences with customers, once 3 or 4 vos
transactions were active, performance did start to suffer.
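One way to keep below that sort of ceiling is to cap the number of
concurrent moves with xargs -P (supported by GNU and BSD xargs).  The
server and partition names below are made-up placeholders, and the
"echo" is left in front of vos so the sketch only prints the commands:

```shell
# Sketch: run "vos move" jobs with at most 3 in flight via xargs -P.
# Each line of moves.txt holds the five positional arguments of one
# move: <volume> <fromserver> <frompartition> <toserver> <topartition>.
# Server/partition names are hypothetical; drop the "echo" to actually
# run the moves once the printed list looks right.
cat > moves.txt <<'EOF'
user.alice fs1 /vicepa fs2 /vicepb
user.bob fs1 /vicepa fs2 /vicepb
EOF
xargs -P 3 -n 5 echo vos move < moves.txt
```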

I'm probably mentioning things that you are already aware of, but I
did want to throw this out there.

Thanks

Todd DeSantis
AFS Support