[OpenAFS] Re: DB servers "quorum" and OpenAFS tools

Peter Grandi pg@afs.list.sabi.co.UK
Thu, 23 Jan 2014 21:55:15 +0000

>>> For example in an ideal world putting more or less DB servers
>>> in the client 'CellServDB' should not matter, as long as one
>>> that belongs to the cell is up; again if the logic were for
>>> all types of client: "scan quickly the list of potential DB
>>> servers, find one that is up and belongs to the cell and
>>> reckons is part of the quorum, and if necessary get from it
>>> the address of the sync site".

> The problem is that you want the client to scan "quickly" to find a
> server that is up, but because networks are not perfectly
> reliable and drop packets all the time, it cannot know that a
> server is not up until that server has failed to respond to
> multiple retransmissions of the request.

That has nothing to do with how quickly the probes are sent...

> Those retransmissions cannot be sent "quickly"; in fact, they
> _must_ be sent with exponentially-increasing backoff times.

That has nothing to do with how "quickly" they can be sent...  The
duration of the intervals between the probes is a different matter
from what the ratio between successive intervals should be.
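
To illustrate the distinction, here is a minimal sketch (not OpenAFS
code; the function name and the 3s and 50ms starting values are
made up for illustration) of two retransmission schedules that
both back off exponentially with the same ratio, but differ in the
duration of the first interval:

```python
def backoff_schedule(first_interval_s, ratio, attempts):
    """Return the wait before each retransmission, in seconds."""
    return [first_interval_s * ratio ** n for n in range(attempts)]

# Hypothetical values: same doubling ratio, different base interval.
slow = backoff_schedule(first_interval_s=3.0, ratio=2.0, attempts=5)
fast = backoff_schedule(first_interval_s=0.05, ratio=2.0, attempts=5)

# Both schedules grow by the same factor between attempts...
slow_ratios = [b / a for a, b in zip(slow, slow[1:])]
fast_ratios = [b / a for a, b in zip(fast, fast[1:])]

# ...but the total time spent before giving up is very different.
print("slow:", slow, "total", sum(slow), "s")
print("fast:", fast, "total", sum(fast), "s")
```

Both schedules keep the exponential property that protects a
congested network; only the absolute scale changes.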

> Otherwise, when your network becomes congested, the
> retransmission of dropped packets will act as a runaway positive
> feedback loop, making the congestion worse and saturating the
> network.

I am sorry I have not been clear about the topic: I did not mean
to discuss flow control in back-to-back streaming connections; my
concern was the frequency of *probing* servers for
accessibility.
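
To make "probing" concrete, here is a minimal sketch of checking
every DB server listed for a cell at once and taking the first one
that answers, instead of waiting out one server's retransmission
schedule before trying the next. The 'probe' callable, the server
names and the timeouts are hypothetical stand-ins, not an OpenAFS
API:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout

def first_live_server(servers, probe, timeout_s):
    """Probe all candidates concurrently; return the first that answers, else None."""
    if not servers:
        return None
    pool = ThreadPoolExecutor(max_workers=len(servers))
    try:
        futures = {pool.submit(probe, s): s for s in servers}
        try:
            for done in as_completed(futures, timeout=timeout_s):
                # A probe that raised or returned False counts as "no answer".
                if done.exception() is None and done.result():
                    return futures[done]
        except FuturesTimeout:
            pass  # nobody answered within the overall deadline
        return None
    finally:
        # Do not block on probes that are still waiting for a reply.
        pool.shutdown(wait=False)

def fake_probe(server):
    # Stand-in for a real liveness check: only "db2" answers in time.
    time.sleep(0.05 if server == "db2" else 0.5)
    return server == "db2"

print(first_live_server(["db1", "db2", "db3"], fake_probe, timeout_s=0.3))
```

The total traffic is one small request per listed server, which is
a very different thing from a bulk data stream.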

Discovering the availability of DB servers is not the same thing
as streaming data from/to a fileserver, both in nature and as to
amount of traffic involved. In TCP congestion control for example
one could be talking about streams of 100,000x 8192B packets per
second. DB database discovery

But even if I had meant to discuss back-to-back streaming packet
congestion control, the absolute numbers are still vastly
different. In the case of *probing* for the liveness of a *single*
DB server I have observed the 'vos' command send packets with
these intervals:

  «The wait times after the 8 attempts are: 3.6s, 6.8s, 13.2s,
  21.4s, 4.6s, 25.4s, 26.2s, 3.8s.»

with randomish variations around that. That's around 5 packets per
minute with intervals between 3,600ms and 26,200ms. Again, to a
single DB server, not, say, round-robin to all DB servers in
With TCP congestion control the minimum backoff (the 'RTO'
parameter) is 200ms (two hundred milliseconds). With another
rather different distributed filesystem, Lustre, I observed some
issues with that very long backoff time, with high throughput
(600-800MB/s) back-to-back packet streams, and there is a
significant amount of research showing that on fast low-latency
links a 200ms RTO seems way

For example in a paper that is already 5 years old:


    «Under severe packet loss, TCP can experience a timeout that
    lasts a minimum of 200ms, determined by the TCP minimum
    retransmission timeout (RTOmin).

    While the default values operating systems use today may
    suffice for the wide-area, datacenters and SANs have round
    trip times that are orders of magnitude below the RTOmin
    defaults (Table 1).

    Scenario    RTT    OS    TCP RTOmin

    Table 1: Typical round-trip-times and minimum
    TCP retransmission bounds.»


    How low must the RTO be to retain high throughput under TCP
    incast collapse conditions, and to how many servers does this
    solution scale? We explore this question using real-world
    measurements and ns-2 simulations [26], finding that to be
    maximally effective, the timers must operate on a granularity
    close to the RTT of the network -- hundreds of microseconds or


    «Figure 3: Experiments on a real cluster validate the
    simulation result that reducing the RTOmin to microseconds
    improves goodput.»

    «Aggressively lowering both the RTO and RTOmin shows practical
    benefits for datacenters. In this section, we investigate if
    reducing the RTOmin value to microseconds and using finer
    granularity timers is safe for wide area transfers.

    We find that the impact of spurious timeouts on long, bulk
    data flows is very low – within the margins of error –
    allowing RTO to go into the microseconds without impairing
    wide-area performance.»
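
As a rough illustration of why the RTOmin floor dominates on fast
links, here is a sketch of the standard smoothed-RTT estimator
(the RFC 6298 form, with the clock granularity term left out for
simplicity). The RTT samples are invented datacenter-like values,
and the 1 microsecond floor is just an assumption standing in for
"fine-grained timers":

```python
def rto_estimates(rtt_samples_s, rto_min_s, k=4, alpha=1 / 8, beta=1 / 4):
    """RTO after each RTT sample: max(rto_min, SRTT + k * RTTVAR)."""
    srtt = rttvar = None
    estimates = []
    for rtt in rtt_samples_s:
        if srtt is None:
            # First measurement initialises both estimators.
            srtt, rttvar = rtt, rtt / 2
        else:
            # RTTVAR is updated using the previous SRTT, then SRTT is updated.
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt)
            srtt = (1 - alpha) * srtt + alpha * rtt
        estimates.append(max(rto_min_s, srtt + k * rttvar))
    return estimates

# Hypothetical RTT samples of around 100 microseconds.
samples = [100e-6, 120e-6, 90e-6, 110e-6]

coarse = rto_estimates(samples, rto_min_s=0.2)   # 200ms floor
fine = rto_estimates(samples, rto_min_s=1e-6)    # assumed microsecond floor

for c, f in zip(coarse, fine):
    print(f"RTO with 200ms floor: {c * 1e3:.1f}ms   without: {f * 1e6:.0f}us")
```

With the 200ms floor the computed RTO never leaves 200ms, however
short the measured round trips are; without it the RTO tracks the
RTT and stays in the hundreds of microseconds for these samples.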