When a slave requests masters vote for a manual failover, the
REQUEST_AUTH message is flagged in a special way in order to force the
masters to give the authorization even if the master is not marked as
failing.
The check was placed in a way that conflicted with the continue
statements used by the node hearth beat code later that needs to skip
the current node sometimes. Moved at the start of the function so that's
always executed.
This feature allows slaves to migrate to orphaned masters (masters
without working slaves), as long as a set of conditions are met,
including the fact that the migrating slave needs to be in a
master-slaves ring with at least another slave working.
When we schedule a failover, broadcast a PONG to the slaves.
The other slaves that plan to get elected will do the same too, this way
it is likely that every slave will have a good picture of its own rank.
Note that this is N*N messages where N is the number of slaves for the
failing master, however usually even large clusters have many master
nodes but a limited number of replicas per node, so this is harmless.
Note that when we compute the initial delay, there are probably still
more up to date information to receive from slaves with new offsets, so
the delay is recomputed when new data is available.
Return the number of slaves for the same master having a better
replication offset of the current slave, that is, the slave "rank" used
to pick a delay before the request for election.
Accessing to the 'myself' node, the node representing the currently
running instance, is handy without the need to type
server.cluster->myself every time.
The two fields are used in order to remember the latest known
replication offset and the time we received it from other slave nodes.
This will be used by slaves in order to start the election procedure
with a delay that is proportional to the rank of the slave among the
other slaves for this master, when sorted for replication offset.
Usually this allows the slave with the most updated offset to win the
election and replace the failing master in the cluster.
One of the simple heuristics used by Redis Cluster in order to avoid
losing data in the typical failure modes created by the asynchronous
replication with the slaves (a master is unable, when accepting a
write, to immediately tell if it should be really accepted or refused
because of a configuration change), is to wait some time before to
rejoin the cluster after being partitioned away from the majority of
instances.
A similar condition happens when a master is restarted. It does not know
if it was already failed over, nor if all the clients have already an
updated configuration about the cluster map, so it is possible that
clients will try to write to stale masters that were restarted.
In a similar way this commit changes masters behavior so they wait
2000 milliseconds before accepting writes after a reboot. There is
nothing special about 2 seconds if not to be a value supposedly larger
a few orders of magnitude compared to the cluster bus communication
latencies.
The code was doing checks for slaves that should be done only when the
instance is currently a master. Switching a slave from a master to
another one should just work.
CLUSTER FORGET is not useful if we can't remove a node from all the
nodes of our cluster because of the Gossip protocol that keeps adding
a given node to nodes where we already tried to remove it.
So now CLUSTER FORGET implements a nodes blacklist that is set and
checked by the Gossip section processing function. This way before a
node is re-added at least 60 seconds must elapse since the FORGET
execution.
This means that redis-trib has some time to remove a node from a whole
cluster. It is possible that in the future it will be uesful to raise
the 60 sec figure to something bigger.
The rejoin delay usually is the node timeout. However if the node
timeout is too small, we set it to 500 milliseconds, that is a value
chosen to be greater than most setups RTT / instances latency figures
so that likely communication with other nodes happen before rejoining.
Usually we update the cluster state (to understand if we should accept
queries or reply with an error) only when there is a change in the state
of the nodes. However for the "delayed rejoin" feature to work, that is,
for a master to wait some time before accepting queries again after it
rejoins the majority, we need to periodically update the last time when
the node was partitioned away from the majority.
With this commit if the cluster is down we update the state ten times
per second.
Even without the user messing manually with the file, it is still
possible to have blank lines (just a single "\n" per line) because of
how the nodes.conf update/write process works.
The way the file was generated was unsafe and leaded to nodes.conf file
corruption (zero length file) on server stop/crash during the creation
of the file.
The previous file update method was as simple as open with O_TRUNC
followed by the write call. While the write call was a single one with
the full payload, ensuring no half-written files for POSIX semantics,
stopping the server just after the open call resulted into a zero-length
file (all the nodes information lost!).
A client can enter a special cluster read-only mode using the READONLY
command: if the client read from a slave instance after this command,
for slots that are actually served by the instance's master, the queries
will be processed without redirection, allowing clients to read from
slaves (but without any kind fo read-after-write guarantee).
The READWRITE command can be used in order to exit the readonly state.
When the configured node timeout is very small, the data validity time
(maximum data age for a slave to try a failover) is too little (ten
times the configured node timeout) when the replication link with the
master is mostly idle. In this case we'll receive some data from the
master only every server.repl_ping_slave_period to refresh the last
interaction with the master.
This commit adds to the max data validity time the slave ping period to
avoid this problem of slaves sensing too old data without a good reason.
However this max data validity time is likely a setting that should be
configurable by the Redis Cluster user in a way completely independent
from the node timeout.
This commit makes it simple to start an handshake with a specific node
address, and uses this in order to detect a node IP change and start a
new handshake in order to fix the IP if possible.
As specified in the Redis Cluster specification, when a node can reach
the majority again after a period in which it was partitioend away with
the minorty of masters, wait some time before accepting queries, to
provide a reasonable amount of time for other nodes to upgrade its
configuration.
This lowers the probabilities of both a client and a master with not
updated configuration to rejoin the cluster at the same time, with a
stale master accepting writes.
The value was otherwise undefined, so next time the node was promoted
again from slave to master, adding a slave to the list of slaves
would likely crash the server or result into undefined behavior.
Now there is a function that handles the update of the local slot
configuration every time we have some new info about a node and its set
of served slots and configEpoch.
Moreoever the UPDATE packets are now processed when received (it was a
work in progress in the previous commit).
The commit also introduces detection of nodes publishing not updated
configuration. More work in progress to send an UPDATE packet to inform
of the config change.
All the internal state of cluster involving time is now using mstime_t
and mstime() in order to use milliseconds resolution.
Also the clusterCron() function is called with a 10 hz frequency instead
of 1 hz.
The cluster node_timeout must be also configured in milliseconds by the
user in redis.conf.
When a slave requests our vote, the configEpoch he claims for its master
and the set of served slots must be greater or equal to the configEpoch
of the nodes serving these slots in the current configuraiton of the
master granting its vote.
In other terms, masters don't vote for slaves having a stale
configuration for the slots they want to serve.
The new API is able to remember operations to perform before returning
to the event loop, such as checking if there is the failover quorum for
a slave, save and fsync the configuraiton file, and so forth.
Because this operations are performed before returning on the event
loop we are sure that messages that are sent in the same event loop run
will be delivered *after* the configuration is already saved, that is a
requirement sometimes. For instance we want to publish a new epoch only
when it is already stored in nodes.conf in order to avoid returning back
in the logical clock when a node is restarted.
This new API provides a big performance advantage compared to saving and
possibly fsyncing the configuration file multiple times in the same
event loop run, especially in the case of big clusters with tens or
hundreds of nodes.
The new algorithm does not check replies time as checking for the
currentEpoch in the reply ensures that the reply is about the current
election process.
The old algorithm used a PROMOTED flag and explicitly checks about
slave->master convertions. Wit the new cluster meta-data propagation
algorithm we just look at the configEpoch to check if we need to
reconfigure slots, then:
1) If a node is a master but it reaches zero served slots becuase of
reconfiguration.
2) If a node is a slave but the master reaches zero served slots because
of a reconfiguration.
We switch as a replica of the new slots owner.
We need to:
1) Increment the configEpoch.
2) Save it to disk and fsync the file.
3) Broadcast the PONG with the new configuration.
If other nodes will receive the updated configuration we need to be sure
to restart with this new config in the event of a crash.
First change: now there is no need to be a master in order to detect a
failure, however the majority of masters signaling PFAIL or FAIL is needed.
This change is important because it allows slaves rejoining the cluster
after a partition to sense the FAIL condition so that eventually all the
nodes agree on failures.
The time is sent in requests, and copied back in reply packets.
This way the receiver can compare the time field in a reply with its
local clock and check the age of the request associated with this reply.
This is an easy way to discard delayed replies. Note that only a clock
is used here, that is the one of the node sending the packet. The
receiver only copies the field back into the reply, so no
synchronization is needed between clocks of different hosts.
Handshake nodes should turn into normal nodes or be freed in a
reasonable amount of time, otherwise they'll keep accumulating if the
address they are associated with is not reachable for some reason.