Cluster: decrease ping/pong traffic by trusting other nodes reports.

Cluster of bigger sizes tend to have a lot of traffic in the cluster bus just for failure detection: a node will try to get a ping reply from another node no longer than when the half the node timeout would elapsed, in order to avoid a false positive. However this means that if we have N nodes and the node timeout is set to, for instance M seconds, we'll have to ping N nodes every M/2 seconds. This N*M/2 pings will receive the same number of pongs, so a total of N*M packets per node. However given that we have a total of N nodes doing this, the total number of messages will be N*N*M. In a 100 nodes cluster with a timeout of 60 seconds, this translates to a total of 100*100*30 packets per second, summing all the packets exchanged by all the nodes. This is, as you can guess, a lot... So this patch changes the implementation in a very simple way in order to trust the reports of other nodes: if a node A reports a node B as alive at least up to a given time, we update our view accordingly. The problem with this approach is that it could result into a subset of nodes being able to reach a given node X, and preventing others from detecting that is actually not reachable from the majority of nodes. So the above algorithm is refined by trusting other nodes only if we do not have currently a ping pending for the node X, and if there are no failure reports for that node. Since each node, anyway, pings 10 other nodes every second (one node every 100 milliseconds), anyway eventually even trusting the other nodes reports, we will detect if a given node is down from our POV. Now to understand the number of packets that the cluster would exchange for failure detection with the patch, we can start considering the random PINGs that the cluster sent anyway as base line: Each node sends 10 packets per second, so the total traffic if no additioal packets would be sent, including PONG packets, would be: Total messages per second = N*10*2 However by trusting other nodes gossip sections will not AWALYS prevent pinging nodes for the "half timeout reached" rule all the times. The math involved in computing the actual rate as N and M change is quite complex and depends also on another parameter, which is the number of entries in the gossip section of PING and PONG packets. However it is possible to compare what happens in cluster of different sizes experimentally. After applying this patch a very important reduction in the number of packets exchanged is trivial to observe, without apparent impacts on the failure detection performances. Actual numbers with different cluster sizes should be published in the Reids Cluster documentation in the future. Related to #3929.
2025-01-22 16:18:28 -05:00 · 2017-04-14 10:14:17 +02:00 · 2017-04-14 10:14:17 +02:00 · 8f7bf2841a
commit 8f7bf2841a
parent c5d6f577f0
1 changed files with 13 additions and 0 deletions
--- a/src/cluster.c
+++ b/src/cluster.c
@ -1354,6 +1354,19 @@ void clusterProcessGossipSection(clusterMsg *hdr, clusterLink *link) {
                }
            }
            /* If from our POV the node is up (no failure flags are set),
             * we have no pending ping for the node, nor we have failure
             * reports for this node, update the last pong time with the
             * one we see from the other nodes. */
            if (!(flags & (CLUSTER_NODE_FAIL|CLUSTER_NODE_PFAIL)) &&
                node->ping_sent == 0 &&
                clusterNodeFailureReportsCount(node) == 0)
            {
                uint32_t pongtime = ntohl(g->pong_received);
                if (pongtime > node->pong_received)
                    node->pong_received = pongtime;
            }
            /* If we already know this node, but it is not reachable, and
             * we see a different address in the gossip section of a node that
             * can talk with this other node, update the address, disconnect