When the configured node timeout is very small, the data validity time
(maximum data age for a slave to try a failover) is too little (ten
times the configured node timeout) when the replication link with the
master is mostly idle. In this case we'll receive some data from the
master only every server.repl_ping_slave_period to refresh the last
interaction with the master.
This commit adds to the max data validity time the slave ping period to
avoid this problem of slaves sensing too old data without a good reason.
However this max data validity time is likely a setting that should be
configurable by the Redis Cluster user in a way completely independent
from the node timeout.
This commit makes it simple to start an handshake with a specific node
address, and uses this in order to detect a node IP change and start a
new handshake in order to fix the IP if possible.
As specified in the Redis Cluster specification, when a node can reach
the majority again after a period in which it was partitioend away with
the minorty of masters, wait some time before accepting queries, to
provide a reasonable amount of time for other nodes to upgrade its
configuration.
This lowers the probabilities of both a client and a master with not
updated configuration to rejoin the cluster at the same time, with a
stale master accepting writes.
The value was otherwise undefined, so next time the node was promoted
again from slave to master, adding a slave to the list of slaves
would likely crash the server or result into undefined behavior.
Now there is a function that handles the update of the local slot
configuration every time we have some new info about a node and its set
of served slots and configEpoch.
Moreoever the UPDATE packets are now processed when received (it was a
work in progress in the previous commit).
The commit also introduces detection of nodes publishing not updated
configuration. More work in progress to send an UPDATE packet to inform
of the config change.
All the internal state of cluster involving time is now using mstime_t
and mstime() in order to use milliseconds resolution.
Also the clusterCron() function is called with a 10 hz frequency instead
of 1 hz.
The cluster node_timeout must be also configured in milliseconds by the
user in redis.conf.
When a slave requests our vote, the configEpoch he claims for its master
and the set of served slots must be greater or equal to the configEpoch
of the nodes serving these slots in the current configuraiton of the
master granting its vote.
In other terms, masters don't vote for slaves having a stale
configuration for the slots they want to serve.
The new API is able to remember operations to perform before returning
to the event loop, such as checking if there is the failover quorum for
a slave, save and fsync the configuraiton file, and so forth.
Because this operations are performed before returning on the event
loop we are sure that messages that are sent in the same event loop run
will be delivered *after* the configuration is already saved, that is a
requirement sometimes. For instance we want to publish a new epoch only
when it is already stored in nodes.conf in order to avoid returning back
in the logical clock when a node is restarted.
This new API provides a big performance advantage compared to saving and
possibly fsyncing the configuration file multiple times in the same
event loop run, especially in the case of big clusters with tens or
hundreds of nodes.
The new algorithm does not check replies time as checking for the
currentEpoch in the reply ensures that the reply is about the current
election process.
The old algorithm used a PROMOTED flag and explicitly checks about
slave->master convertions. Wit the new cluster meta-data propagation
algorithm we just look at the configEpoch to check if we need to
reconfigure slots, then:
1) If a node is a master but it reaches zero served slots becuase of
reconfiguration.
2) If a node is a slave but the master reaches zero served slots because
of a reconfiguration.
We switch as a replica of the new slots owner.
We need to:
1) Increment the configEpoch.
2) Save it to disk and fsync the file.
3) Broadcast the PONG with the new configuration.
If other nodes will receive the updated configuration we need to be sure
to restart with this new config in the event of a crash.
First change: now there is no need to be a master in order to detect a
failure, however the majority of masters signaling PFAIL or FAIL is needed.
This change is important because it allows slaves rejoining the cluster
after a partition to sense the FAIL condition so that eventually all the
nodes agree on failures.
The time is sent in requests, and copied back in reply packets.
This way the receiver can compare the time field in a reply with its
local clock and check the age of the request associated with this reply.
This is an easy way to discard delayed replies. Note that only a clock
is used here, that is the one of the node sending the packet. The
receiver only copies the field back into the reply, so no
synchronization is needed between clocks of different hosts.
Handshake nodes should turn into normal nodes or be freed in a
reasonable amount of time, otherwise they'll keep accumulating if the
address they are associated with is not reachable for some reason.
This feature was implemented in the initial days of the Redis Cluster
implementaiton but is not a good idea at all.
1) It depends on clocks to be synchronized, that is already very bad.
2) Moreover it adds a bug where the pong time is updated via gossip so
no new PING is ever sent by the current node, with the effect of no PONG
received, no update of tables, no clearing of PFAIL flag.
In general to trust other nodes about the reachability of other nodes is
a broken distributed programming model.
Actaully the string is modified in-place and a reallocation is never
needed, so there is no need to return the new sds string pointer as
return value of the function, that is now just "void".
Previously two string encodings were used for string objects:
1) REDIS_ENCODING_RAW: a string object with obj->ptr pointing to an sds
stirng.
2) REDIS_ENCODING_INT: a string object where the obj->ptr void pointer
is casted to a long.
This commit introduces a experimental new encoding called
REDIS_ENCODING_EMBSTR that implements an object represented by an sds
string that is not modifiable but allocated in the same memory chunk as
the robj structure itself.
The chunk looks like the following:
+--------------+-----------+------------+--------+----+
| robj data... | robj->ptr | sds header | string | \0 |
+--------------+-----+-----+------------+--------+----+
| ^
+-----------------------+
The robj->ptr points to the contiguous sds string data, so the object
can be manipulated with the same functions used to manipulate plan
string objects, however we need just on malloc and one free in order to
allocate or release this kind of objects. Moreover it has better cache
locality.
This new allocation strategy should benefit both the memory usage and
the performances. A performance gain between 60 and 70% was observed
during micro-benchmarks, however there is more work to do to evaluate
the performance impact and the memory usage behavior.
Any places which I feel might want to be updated to work differently
with IPv6 have been marked with a comment starting "IPV6:".
Currently the only comments address places where an IP address is
combined with a port using the standard : separated form. These may want
to be changed when printing IPv6 addresses to wrap the address in []
such as
[2001:db8::c0:ffee]:6379
instead of
2001:db8::c0:ffee:6379
as the latter format is a technically valid IPv6 address and it is hard
to distinguish the IPv6 address component from the port unless you know
the port is supposed to be there.
In two places buffers have been created with a size of 128 bytes which
could be reduced to INET6_ADDRSTRLEN to still hold a full IP address.
These places have been marked as they are presently big enough to handle
the needs of storing a printable IPv6 address.
Changes the sockaddr_in to a sockaddr_storage. Attempts to convert the
IP address into an AF_INET or AF_INET6 before returning an "Invalid IP
address" error. Handles converting the sockaddr from either AF_INET or
AF_INET6 back into a string for storage in the clusterNode ip field.
Change the sockaddr_in to sockaddr_storage which is capable of storing
both AF_INET and AF_INET6 sockets. Uses the sockaddr_storage ss_family
to correctly return the printable IP address and port.
Function makes the assumption that the buffer is of at least
REDIS_CLUSTER_IPLEN bytes in size.
Add REDIS_CLUSTER_IPLEN macro to define the size of the clusterNode ip
character array. Additionally use this macro in inet_ntop(3) calls where
the size of the array was being defined manually.
The REDIS_CLUSTER_IPLEN is defined as INET_ADDRSTRLEN which defines the
correct size of a buffer to store an IPv4 address in. The
INET_ADDRSTRLEN macro itself is defined in the <netinet/in.h> header
file and should be portable across the majority of systems.
Using sizeof with an array will only return expected results if the
array is created in the scope of the function where sizeof is used. This
commit changes the inet_ntop calls so that they use the fixed buffer
value as defined in redis.h which is 16.
When the PONG delay is half the cluster node timeout, the link gets
disconnected (and later automatically reconnected) in order to ensure
that it's not just a dead connection issue.
However this operation is only performed if the link is old enough, in
order to avoid to disconnect the same link again and again (and among
the other problems, never receive the PONG because of that).
Note: when the link is reconnected, the 'ping_sent' field is not updated
even if a new ping is sent using the new connection, so we can still
reliably detect a node ping timeout.
We used to copy this value into the server.cluster structure, however this
was not necessary.
The reason why we don't directly use server.cluster->node_timeout is
that things that can be configured via redis.conf need to be directly
available in the server structure as server.cluster is allocated later
only if needed in order to reduce the memory footprint of non-cluster
instances.
In commit d728ec6 it was introduced the concept of sending a ping to
every node not receiving a ping since node_timeout/2 seconds.
However the code was located in a place that was not executed because of
a previous conditional causing the loop to re-iterate.
This caused false positives in nodes availability detection.
The current code is still not perfect as a node may be detected to be in
PFAIL state even if it does not reply for just node_timeout/2 seconds
that is not correct. There is a plan to improve this code ASAP.
When a master turns into a slave after a failover event, make sure to
clear the assigned slots before setting up the replication, as a slave
should never claim slots in an explicit way, but just take over the
master slots when replacing its master.
A slave node set this flag for itself when, after receiving authorization
from the majority of nodes, it turns itself into a master.
At the same time now this flag is tested by nodes receiving a PING
message before reconfiguring after a failover event. This makes the
system more robust: even if currently there is no way to manually turn
a slave into a master it is possible that we'll have such a feature in
the future, or that simply because of misconfiguration a node joins the
cluster as master while others believe it's a slave. This alone is now
no longer enough to trigger reconfiguration as other nodes will check
for the PROMOTED flag.
The PROMOTED flag is cleared every time the node is turned back into a
replica of some other node.
Sender flags were not propagated for the sender, but only for nodes in
the gossip section. This is odd and in the next commits we'll need to
get updated flags for the sender node, so this commit adds a new field
in the cluster messages header.
The message header is the same size as we reused some free space that
was marked as 'unused' because of alignment concerns.
So when the failing master node is back in touch with the cluster,
instead of remaining unused it is converted into a replica of the
new master, ready to perform the fail over if the new master node
will fail at some point.
Note that as a side effect clients with stale configuration are now
not an issue as well, as the node converted into a slave will not
accept queries but will redirect clients accordingly.
The code handling a master that turns into a slave or the contrary was
improved in order to avoid repeating the same operations. Also
the readability and conceptual simplicity was improved.
Redis Cluster can cope with a minority of nodes not informed about the
failure of a master in time for some reason (netsplit or node not
functioning properly, blocked, ...) however to wait a few seconds before
to start the failover will make most "normal" failovers simpler as the
FAIL message will propagate before the slave election happens.
If we have a master in FAIL state that's reachable again, and apparently
no one is going to serve its slots, clear the FAIL flag and let the
cluster continue with its operations again.
This is the unix time at which we set the FAIL flag for the node.
It is only valid if FAIL is set.
The idea is to use it in order to make the cluster more robust, for
instance in order to revert a FAIL state if it is long-standing but
still slots are assigned to this node, that is, no one is going to fix
these slots apparently.
Usually we try to send just 1 ping every second, however when we detect
we are going to have unreliable failure detection because we can't ping
some node in time, send an additional ping.
This should only happen with very large clusters or when the the node
timeout is set to a very low value.
As stated in the comment this is usually due to a resharding in progress
so the client should be still redirected to the old node that will
handle the redirection elsewhere.
Before a relatively slow popcount() operation was needed every time we
needed to get the number of slots served by a given cluster node.
Now we just need to check an integer that is taken in sync with the
bitmap.
This is not very important as anyway when the function counting the
number of reports is called the cleanup is performed. However with this
change if only part of the nodes that reported the failure will report
the node is back ok, we'll cleanup the older entries ASAP. In complex
split net split scenarios, and when we are dealing with clusters having
nodes in the order of ~ 1000, this can save some CPU.
Not sure why I set a limit to 1 million keys, there is no reason for
this artificial limit, and anyway this is s a stupid limit because it is
already high enough to create latency issues. So let's the users shoot
on their feet because maybe they just actually know what they are doing.
A §Redis Cluster node used to mark a node as failing when itself
detected a failure for that node, and a single acknowledge was received
about the possible failure state.
The new API will be used in order to possible to require that N other
nodes have a PFAIL or FAIL state for a given node for a node to set it
as failing.
Now that we cache connections, a retry attempt makes sure that the
operation don't fail just because there is an existing connection error
on the socket, like the other end closing the connection.
Unfortunately this condition is not detectable using
getsockopt(SO_ERROR), so the only option left is to retry.
We don't retry on timeouts.
By caching TCP connections used by MIGRATE to chat with other Redis
instances a 5x performance improvement was measured with
redis-benchmark against small keys.
This can dramatically speedup cluster resharding and other processes
where an high load of MIGRATE commands are used.