The Redis test suite uses a server-clients model in order to parallelize
the execution of different tests. However, on recent versions of OS X,
not setting the channel to a binary encoding caused issues, even if
AFAIK no binary data is actually sent over this channel. The channels
are now deliberately set to a binary encoding, and this solves the
issue. The exact symptom was the test not terminating, giving the
impression of running forever, since the test clients and server were
unable to exchange the messages needed to continue.
In high RPS environments, the default listen backlog is not sufficient, so
giving users the power to configure it is the right approach, especially
since it requires only minor modifications to the code.
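A minimal sketch of the idea, with the value wired straight into
listen(2) (the function and default here are illustrative; note that on
Linux the kernel silently caps the backlog at net.core.somaxconn):

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    /* Create a listening socket with a user-tunable backlog instead
     * of a compiled-in constant. */
    int listen_with_backlog(int port, int backlog) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sa;

        if (fd == -1) return -1;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_port = htons(port);
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        if (bind(fd, (struct sockaddr*)&sa, sizeof(sa)) == -1 ||
            listen(fd, backlog) == -1) return -1;
        return fd;
    }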
The check was placed in a way that conflicted with the continue
statements used by the node heartbeat code later in the function, which
sometimes needs to skip the current node. The check was moved to the
start of the function so that it is always executed.
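A hypothetical sketch of the pitfall (names are made up): a check that
lives below the continue statements never runs for the nodes they skip,
so it must sit at the top.

    typedef struct clusterNode clusterNode;
    extern clusterNode *myself;
    int nodeInHandshake(clusterNode *n);
    void checkNodeTimeouts(clusterNode *n);
    void sendHeartbeat(clusterNode *n);

    void heartbeatLoop(clusterNode **nodes, int numnodes) {
        for (int j = 0; j < numnodes; j++) {
            clusterNode *node = nodes[j];
            /* Moved to the top: executed for every node. */
            checkNodeTimeouts(node);
            /* These skips would have bypassed the check had it been
             * placed after them. */
            if (nodeInHandshake(node)) continue;
            if (node == myself) continue;
            sendHeartbeat(node);
        }
    }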
This feature allows slaves to migrate to orphaned masters (masters
without working slaves), as long as a set of conditions are met,
including the fact that the migrating slave needs to be in a
master-slaves ring with at least one other working slave.
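A sketch of the kind of guard this implies (helper names and the exact
condition set are assumptions of this sketch):

    typedef struct clusterNode {
        struct clusterNode *slaveof;
        struct clusterNode **slaves;
        int numslaves;
    } clusterNode;
    int nodeIsWorking(clusterNode *n);

    /* Count the reachable, serving slaves of a master. */
    int workingSlaves(clusterNode *master) {
        int ok = 0;
        for (int j = 0; j < master->numslaves; j++)
            if (nodeIsWorking(master->slaves[j])) ok++;
        return ok;
    }

    /* Migrate only toward a truly orphaned master, and only if our
     * own ring keeps at least one other working slave behind. */
    int canMigrateTo(clusterNode *myself, clusterNode *orphan) {
        return workingSlaves(orphan) == 0 &&
               workingSlaves(myself->slaveof) >= 2;
    }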
When we schedule a failover, broadcast a PONG to the slaves.
The other slaves that plan to get elected will do the same, so it is
likely that every slave will have a good picture of its own rank.
Note that this is N*N messages, where N is the number of slaves of the
failing master; however, usually even large clusters have many master
nodes but a limited number of replicas per node, so this is harmless.
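In code the broadcast could look like this (message type and helpers
are stand-ins for the cluster bus internals):

    typedef struct clusterLink clusterLink;
    typedef struct clusterNode {
        clusterLink *link;
        struct clusterNode **slaves;
        int numslaves;
    } clusterNode;
    #define CLUSTERMSG_TYPE_PONG 1   /* illustrative value */
    void clusterSendPing(clusterLink *link, int type);

    /* PONG every sibling so each slave refreshes its view of the
     * others' replication offsets before ranking itself. */
    void broadcastPongToSlaves(clusterNode *master) {
        for (int j = 0; j < master->numslaves; j++)
            clusterSendPing(master->slaves[j]->link,
                            CLUSTERMSG_TYPE_PONG);
    }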
Note that when we compute the initial delay, there is probably still
more up-to-date information to receive from slaves with newer offsets,
so the delay is recomputed when new data is available.
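The delay could be computed along these lines; the constants are
illustrative assumptions, not necessarily the shipped values:

    #include <stdlib.h>
    typedef long long mstime_t;

    /* Fixed delay so offsets have time to spread, jitter so elections
     * desynchronize, plus a step per rank position so better-updated
     * slaves try first. Recomputed when fresher offsets arrive. */
    mstime_t failoverStartDelay(int rank) {
        return 500 + random() % 500 + (mstime_t)rank * 1000;
    }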
Return the number of slaves for the same master having a better
replication offset than the current slave; that is, the slave "rank"
used to pick a delay before requesting the election.
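A sketch of how such a rank function might look (the node structure is
stubbed down to what the loop needs):

    typedef struct clusterNode {
        struct clusterNode *slaveof;
        struct clusterNode **slaves;
        int numslaves;
        long long repl_offset;      /* last known offset of this node */
    } clusterNode;

    /* Rank 0 means we hold the most updated offset among siblings. */
    int slaveRank(clusterNode *myself, long long myoffset) {
        clusterNode *master = myself->slaveof;
        int rank = 0;
        if (master == NULL) return 0;
        for (int j = 0; j < master->numslaves; j++)
            if (master->slaves[j] != myself &&
                master->slaves[j]->repl_offset > myoffset) rank++;
        return rank;
    }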
Accessing the 'myself' node, the node representing the currently
running instance, is handy without the need to type
server.cluster->myself every time.
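A simple macro is enough for this (assuming the usual 'server' global):

    /* Shorthand for the node representing this very instance. */
    #define myself (server.cluster->myself)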
The two fields are used in order to remember the latest known
replication offset of other slave nodes, and the time we received it.
This will be used by slaves in order to start the election procedure
with a delay that is proportional to the rank of the slave among the
other slaves for this master, when sorted by replication offset.
Usually this allows the slave with the most updated offset to win the
election and replace the failing master in the cluster.
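The two fields could look like this inside the node structure (field
names are this sketch's assumption):

    typedef long long mstime_t;
    typedef struct clusterNode {
        /* ... other node fields ... */
        long long repl_offset;     /* latest known replication offset */
        mstime_t repl_offset_time; /* when that offset was received */
    } clusterNode;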
Incremental flushing in rio.c is only used to avoid huge kernel buffers
being synced to slow disks creating big latency spikes, so this fix has
no durability implications. However, it is certainly more correct to
make sure that the FILE buffers are flushed to the kernel before
calling fsync on the file descriptor.
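The gist of the fix in isolation: fsync(2) forces kernel buffers to
disk but knows nothing about data still sitting in the userspace FILE
buffer, so fflush(3) must come first.

    #include <stdio.h>
    #include <unistd.h>

    /* Push the stdio buffer into the kernel, then the kernel
     * buffers onto the disk. */
    int flush_and_sync(FILE *fp) {
        if (fflush(fp) == EOF) return -1;
        return fsync(fileno(fp));
    }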
Thanks to Li Shao Kai for reporting this issue in the Redis mailing
list.
One of the simple heuristics used by Redis Cluster in order to avoid
losing data in the typical failure modes created by the asynchronous
replication with the slaves (a master accepting a write is unable to
immediately tell if it should really be accepted or refused because of
a configuration change) is to wait some time before rejoining the
cluster after being partitioned away from the majority of instances.
A similar condition happens when a master is restarted. It does not
know whether it was already failed over, nor whether all the clients
already have an updated configuration of the cluster map, so it is
possible that clients will try to write to stale masters that were
restarted.
In a similar way, this commit changes the behavior of masters so that
they wait 2000 milliseconds before accepting writes after a reboot.
There is nothing special about 2 seconds, other than being a value
supposedly a few orders of magnitude larger than the cluster bus
communication latencies.
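The guard boils down to something like this (names are illustrative):

    typedef long long mstime_t;
    mstime_t mstime(void);              /* wall clock in milliseconds */
    extern mstime_t server_start_time;  /* recorded once at startup */

    #define CLUSTER_WRITABLE_DELAY 2000 /* ms, the value used here */

    /* Refuse writes until the delay since reboot has elapsed. */
    int masterCanAcceptWrites(void) {
        return mstime() - server_start_time > CLUSTER_WRITABLE_DELAY;
    }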
The code was performing checks for slaves that should be done only when
the instance is currently a master. Switching a slave from one master
to another should just work.
When an instance is potentially set to replicate from another master,
it is conceptually disconnected forever, since we have no old copy of
this master's dataset in memory.
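A rough sketch of the resulting shape (helper names are stand-ins; only
the guard and the unconditional discard matter):

    void setNewMasterAddress(const char *ip, int port);
    void discardCachedMaster(void);   /* partial resync is impossible */
    void disconnectAllSlaves(void);
    extern char *server_masterhost;   /* NULL while we are a master */

    void replicationSetMasterSketch(const char *ip, int port) {
        int was_master = (server_masterhost == NULL);
        setNewMasterAddress(ip, port);
        /* Conceptually disconnected forever: no old copy of this
         * master's dataset exists in memory. */
        discardCachedMaster();
        /* Master-only checks must not run when a slave is merely
         * being switched from one master to another. */
        if (was_master) disconnectAllSlaves();
    }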