[OpenAFS] DB servers "quorum" and OpenAFS tools

Neil Davies semanticphilosopher@gmail.com
Fri, 24 Jan 2014 15:46:16 +0000


To solve this you can't just use the round trip in its raw form; you need
to understand it in terms of how the "delay and loss" accrued.

It's a bit too long (and potentially off-topic) for this list, but briefly
the way we perform this sort of analysis (in my day job) is to view it as
quality attenuation (shorthand ∆Q). This can be split into a set of bases
that permit evaluation of both the structural component of the delay/loss
(geography, path length and their serialisation rates) and the variable
component (which is basically due to contention).
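
As a rough illustration of that split (a minimal sketch, not the actual
∆Q tooling): assume you have one-way or round-trip delay samples tagged
with packet size. The smallest delay seen for each size is taken as the
contention-free floor, a straight line is fitted through those floors,
and whatever is left over in each sample is treated as the variable part.
The function name and the names fixed, per_byte and variable are
hypothetical, chosen only for this example.

```python
from collections import defaultdict


def split_delay(samples):
    """Split delay samples into structural and variable parts.

    samples: list of (packet_size_bytes, delay_seconds) tuples.
    Returns (fixed, per_byte, variable) where
      fixed    ~ size-independent delay (geography, path length),
      per_byte ~ delay added per byte (serialisation),
      variable ~ per-sample residual, attributed to contention.
    This is an illustrative assumption about how to do the split.
    """
    # The smallest delay seen per size approximates the uncontended delay.
    floor = defaultdict(lambda: float("inf"))
    for size, delay in samples:
        floor[size] = min(floor[size], delay)

    sizes = sorted(floor)
    if len(sizes) < 2:
        raise ValueError("need at least two distinct packet sizes")

    # Ordinary least-squares line through the (size, floor) points.
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(floor[s] for s in sizes) / n
    sxx = sum((s - mean_x) ** 2 for s in sizes)
    sxy = sum((s - mean_x) * (floor[s] - mean_y) for s in sizes)
    per_byte = sxy / sxx
    fixed = mean_y - per_byte * mean_x

    # Residual above the fitted structural delay, one value per sample.
    variable = [delay - (fixed + per_byte * size) for size, delay in samples]
    return fixed, per_byte, variable
```

With only a handful of samples the floors will still contain some
contention, so the structural estimate is an upper bound at best.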

This, in turn, can be used to estimate the setting of a timeout: if the
congestion has ONLY introduced delay, any additional sending will just
increase it.

If the congestion has introduced loss (or loss has occurred for another
reason) then it is worth resending. Using the ∆Q approach it is possible
to assess, for a protocol, both the false-positive hit (resending a packet
when it was just delayed) and the performance hit (waiting "too long" to
retransmit given a packet has been lost).
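
To make that trade-off concrete, here is a minimal sketch, assuming you
have a set of observed delays for replies that did arrive and some
estimate of the loss probability. For a candidate timeout it reports how
often you would resend a packet that was merely delayed, and how much
waiting a genuine loss costs on average. The function name and both
parameters are hypothetical; this is a simplification for illustration,
not the ∆Q calculus itself.

```python
def timeout_tradeoff(delays, loss_prob, timeout):
    """Estimate the two costs of choosing a given timeout.

    delays:    observed delays (seconds) of replies that did arrive.
    loss_prob: assumed probability that a request or its reply is lost.
    timeout:   candidate retransmission timeout (seconds).
    Returns (false_positive, expected_wait_on_loss).
    """
    if not delays:
        raise ValueError("need at least one delay sample")

    # Fraction of delivered replies that would arrive after the timeout.
    late = sum(1 for d in delays if d > timeout) / len(delays)

    # Spurious resend: nothing was lost, but the reply was too slow.
    false_positive = (1.0 - loss_prob) * late

    # On a real loss the sender idles for the whole timeout before
    # resending, so the average cost per request is loss_prob * timeout.
    expected_wait_on_loss = loss_prob * timeout
    return false_positive, expected_wait_on_loss
```

Sweeping the timeout over a range of values and plotting the two numbers
against each other shows where shortening the timeout stops buying
recovery time and starts adding load.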

If you combine this approach with the collective knowledge of other =
to/from that service (which can happen, at least in principle and to some
extent, in Rx, but not really in TCP) then it suggests that you could
create a better solution: one that recovers from loss reasonably quickly
without creating too much additional load.


If interested, there is a brief introduction to these ideas in =

On 23 Jan 2014, at 21:55, Peter Grandi <pg@afs.list.sabi.co.UK> wrote:

>>>> For example in an ideal world putting more or less DB servers
>>>> in the client 'CellServDB' should not matter, as long as one
>>>> that belongs to the cell is up; again if the logic were for
>>>> all types of client: "scan quickly the list of potential DB
>>>> servers, find one that is up and belongs to the cell and
>>>> reckons is part of the quorum, and if necessary get from it
>>>> the address of the sync site".
>> The problem is that you want the client to scan "quickly" to find a
>> server that is up, but because networks are not perfectly
>> reliable and drop packets all the time, it cannot know that a
>> server is not up until that server has failed to respond to
>> multiple retransmissions of the request.
> That has nothing to do with how quickly the probes are sent...
>> Those retransmissions cannot be sent "quickly"; in fact, they
>> _must_ be sent with exponentially-increasing backoff times.
> That has nothing to do with how "quickly" they can be sent...  The
> duration of the intervals between the probes is a different matter
> from what should be the ratio of intervals.
>> Otherwise, when your network becomes congested, the
>> retransmission of dropped packets will act as a runaway positive
>> feedback loop, making the congestion worse and saturating the
>> network.
> I am sorry I have not been clear about the topic: I was not
> meaning to discuss flow control in back-to-back streaming
> connections; my concern was about the frequency of *probing*
> servers for accessibility.
> Discovering the availability of DB servers is not the same thing
> as streaming data from/to a fileserver, both in nature and as to
> amount of traffic involved. In TCP congestion control for example
> one could be talking about streams of 100,000x 8192B packets per
> second. DB database discovery
> But even if I had meant to discuss back-to-back streaming packet
> congestion control, the absolute numbers are still vastly
> different. In the case of *probing* for the liveness of a *single*
> DB server I have observed the 'vos' command send packets with
> these intervals:
>  «The wait times after the 8 attempts are: 3.6s, 6.8s, 13.2s,
>  21.4s, 4.6s, 25.4s, 26.2s, 3.8s.»
> with randomish variations around that. That's around 5 packets per
> minute with intervals between 3,600ms and 26,200ms. Again, to a
> single DB server, not say roundrobin to all DB servers in
> 'CellServDB'.
> With TCP congestion control the backoff (the 'RTO' parameter) is
> 200ms (two hundred milliseconds). With another rather different
> distributed filesystem, Lustre, I observed some issues with that
> very long backoff time, with high throughput (600-800MB/s)
> back-to-back packet streams, and there is a significant amount of
> research showing that on fast low latency links a 200ms RTO seems
> way excessive.
> For example in a paper that is already 5 years old:
>  http://www.cs.cmu.edu/~dga/papers/incast-sigcomm2009.pdf
>    «Under severe packet loss, TCP can experience a timeout that
>    lasts a minimum of 200ms, determined by the TCP minimum
>    retransmission timeout (RTOmin).
>    While the default values operating systems use today may
>    suffice for the wide-area, datacenters and SANs have round
>    trip times that are orders of magnitude below the RTOmin
>    defaults (Table 1).
>    Scenario    RTT      OS       TCP RTOmin
>    WAN         100ms    Linux    200ms
>    Datacenter  <1ms     BSD      200ms
>    SAN         <0.1ms   Solaris  400ms
>    Table 1: Typical round-trip-times and minimum
>    TCP retransmission bounds.»
>    «How low must the RTO be to retain high throughput under TCP
>    incast collapse conditions, and to how many servers does this
>    solution scale? We explore this question using real-world
>    measurements and ns-2 simulations [26], finding that to be
>    maximally effective, the timers must operate on a granularity
>    close to the RTT of the network -- hundreds of microseconds or
>    less.»
>    «Figure 3: Experiments on a real cluster validate the
>    simulation result that reducing the RTOmin to microseconds
>    improves goodput.»
>    «Aggressively lowering both the RTO and RTOmin shows practical
>    benefits for datacenters. In this section, we investigate if
>    reducing the RTOmin value to microseconds and using finer
>    granularity timers is safe for wide area transfers.
>    We find that the impact of spurious timeouts on long, bulk
>    data flows is very low – within the margins of error –
>    allowing RTO to go into the microseconds without impairing
>    wide-area performance.»
