Commit Graph

305 Commits

Author SHA1 Message Date
antirez
1a0cea33a0 Cluster: initialize senderConfigEpoch and senderCurrentEpoch for warnings suppression. 2013-11-05 12:01:07 +01:00
antirez
0c9f60a628 Cluster: there is a lower limit for the handshake timeout. 2013-10-11 10:34:32 +02:00
antirez
1447d28c0f Cluster: data_age conversion to milliseconds fixed. 2013-10-09 16:36:06 +02:00
antirez
573c2fea91 Cluster: clusterCron() freq is now 10h. Still ping 1 node every sec.
After the change in clusterCron() frequency of call, we still want to
ping just one random node every second.
2013-10-09 16:29:17 +02:00
antirez
ba42428633 Cluster: time switched from seconds to milliseconds.
All the internal state of cluster involving time is now using mstime_t
and mstime() in order to use milliseconds resolution.

Also the clusterCron() function is called with a 10 hz frequency instead
of 1 hz.

The cluster node_timeout must be also configured in milliseconds by the
user in redis.conf.
2013-10-09 16:19:26 +02:00
antirez
929b6a4480 Cluster: cluster stuff moved from redis.h to cluster.h. 2013-10-09 15:38:05 +02:00
antirez
ae2763f564 Cluster: masters don't vote for a slave with stale config.
When a slave requests our vote, the configEpoch he claims for its master
and the set of served slots must be greater or equal to the configEpoch
of the nodes serving these slots in the current configuraiton of the
master granting its vote.

In other terms, masters don't vote for slaves having a stale
configuration for the slots they want to serve.
2013-10-08 12:45:35 +02:00
antirez
f7d6ad4366 Cluster: fix slave data age computation when master is still connected. 2013-10-07 16:07:13 +02:00
antirez
2c3301b9f5 Cluster: log message improved when FAIL is cleared from a slave node. 2013-10-07 15:44:58 +02:00
antirez
72f38cd70f Cluster: slave nodes advertise master slots bitmap and configEpoch. 2013-10-07 11:31:12 +02:00
antirez
7afc0dd59a Cluster: new clusterDoBeforeSleep() API.
The new API is able to remember operations to perform before returning
to the event loop, such as checking if there is the failover quorum for
a slave, save and fsync the configuraiton file, and so forth.

Because this operations are performed before returning on the event
loop we are sure that messages that are sent in the same event loop run
will be delivered *after* the configuration is already saved, that is a
requirement sometimes. For instance we want to publish a new epoch only
when it is already stored in nodes.conf in order to avoid returning back
in the logical clock when a node is restarted.

This new API provides a big performance advantage compared to saving and
possibly fsyncing the configuration file multiple times in the same
event loop run, especially in the case of big clusters with tens or
hundreds of nodes.
2013-10-03 09:58:06 +02:00
antirez
211dcbe339 Cluster: update cluster config when slave changes master. 2013-10-02 12:27:12 +02:00
antirez
6c4d904baf Cluster: bus messages stats in CLUSTER info. 2013-10-02 10:10:08 +02:00
antirez
abe81781ae Cluster: FAIL messages from unknown senders are handled better.
Previously the event was not logged but instead the node reported an
unknown packet type received.
2013-10-02 09:42:45 +02:00
antirez
7970ebd80a Cluster: senderCurrentEpoch == node currentEpoch was too strict.
We can accept a vote as long as its epoch is >= the epoch at which we
started the voting process. There is no need for it to be exactly the
same.
2013-10-01 17:21:28 +02:00
antirez
f1bfd8233b Cluster: fix typo in clusterProcessPacket() comment. 2013-10-01 15:40:20 +02:00
antirez
1dedf9aa36 Cluster: time field removed from cluster messages header.
The new algorithm does not check replies time as checking for the
currentEpoch in the reply ensures that the reply is about the current
election process.
2013-09-30 16:19:44 +02:00
antirez
2d0844ee37 Cluster: log message shortened. 2013-09-30 11:51:58 +02:00
antirez
4dc247eb31 Cluster: detect cluster reconfiguration when master slots drop to 0.
The old algorithm used a PROMOTED flag and explicitly checks about
slave->master convertions. Wit the new cluster meta-data propagation
algorithm we just look at the configEpoch to check if we need to
reconfigure slots, then:

1) If a node is a master but it reaches zero served slots becuase of
reconfiguration.
2) If a node is a slave but the master reaches zero served slots because
of a reconfiguration.

We switch as a replica of the new slots owner.
2013-09-30 11:45:26 +02:00
antirez
62b1591439 Cluster: re-order failover operations to make it safer.
We need to:

1) Increment the configEpoch.
2) Save it to disk and fsync the file.
3) Broadcast the PONG with the new configuration.

If other nodes will receive the updated configuration we need to be sure
to restart with this new config in the event of a crash.
2013-09-30 10:16:48 +02:00
antirez
b187517719 Cluster: when upading the configEpoch for a node, save config on disk ASAP. 2013-09-30 10:16:25 +02:00
antirez
03ca903983 Cluster: fsync data when saving the cluster config. 2013-09-30 10:13:07 +02:00
antirez
026e63392e Cluster: update the node configEpoch when newer is detected. 2013-09-27 09:55:41 +02:00
antirez
7c4b8f29e7 Cluster: react faster when a slave wins an election. 2013-09-26 16:54:43 +02:00
antirez
42fa46e49a Cluster: removed an old source of delay to start the slave failover. 2013-09-26 13:28:19 +02:00
antirez
a445aa30a0 Cluster: master node now uses new protocol to vote. 2013-09-26 13:00:41 +02:00
antirez
fb9b76fe14 Cluster: slave node now uses the new protocol to get elected. 2013-09-26 11:13:17 +02:00
antirez
32b5410af9 Cluster: add currentEpoch to CLUSTER INFO. 2013-09-25 12:38:36 +02:00
antirez
6ec795d2cf Cluster: update our currentEpoch when a greater one is seen. 2013-09-25 12:36:29 +02:00
antirez
d426ada891 Cluster: broadcast currentEpoch and configEpoch in packets header. 2013-09-25 11:53:35 +02:00
antirez
12483b0061 Cluster: configEpoch added in cluster nodes description. 2013-09-25 11:47:13 +02:00
antirez
3c9bb8751a Cluster: PFAIL -> FAIL transition allowed for slaves.
First change: now there is no need to be a master in order to detect a
failure, however the majority of masters signaling PFAIL or FAIL is needed.

This change is important because it allows slaves rejoining the cluster
after a partition to sense the FAIL condition so that eventually all the
nodes agree on failures.
2013-09-20 11:26:44 +02:00
antirez
925ea9f858 Cluster: added time field in cluster bus messages.
The time is sent in requests, and copied back in reply packets.
This way the receiver can compare the time field in a reply with its
local clock and check the age of the request associated with this reply.

This is an easy way to discard delayed replies. Note that only a clock
is used here, that is the one of the node sending the packet. The
receiver only copies the field back into the reply, so no
synchronization is needed between clocks of different hosts.
2013-09-20 09:22:21 +02:00
antirez
d0e327413b Cluster: don't add an handshake node for the same ip:port pair multiple times. 2013-09-04 15:52:16 +02:00
antirez
72587e6cc5 Cluster: free HANDSHAKE nodes after node_timeout.
Handshake nodes should turn into normal nodes or be freed in a
reasonable amount of time, otherwise they'll keep accumulating if the
address they are associated with is not reachable for some reason.
2013-09-04 12:41:21 +02:00
antirez
8eff339ca4 Cluster: CLUSTER SAVECONFIG command added. 2013-09-04 10:33:00 +02:00
antirez
528201ad6c Cluster: don't save HANDSHAKE nodes in nodes.conf. 2013-09-04 10:25:26 +02:00
antirez
e5d5da6f7c Cluster: always use safe iteartors to iterate server.cluster->nodes. 2013-09-04 10:07:50 +02:00
antirez
354a5de270 Cluster: clusterReadHandler() reworked to be more correct and simpler to follow. 2013-09-03 11:43:52 +02:00
antirez
1036b4b21b Cluster: use non-blocking I/O for the cluster bus. 2013-09-03 11:43:52 +02:00
antirez
f6efb6cdec Cluster: fixed a bug in clusterSendPublish() due to inverted statements.
The code used to copy the header *after* the 'hdr' pointer was already
switched to the new buffer. Of course we need to do the reverse.
2013-09-03 11:43:43 +02:00
antirez
303dde3757 Don't update node pong time via gossip.
This feature was implemented in the initial days of the Redis Cluster
implementaiton but is not a good idea at all.

1) It depends on clocks to be synchronized, that is already very bad.
2) Moreover it adds a bug where the pong time is updated via gossip so
no new PING is ever sent by the current node, with the effect of no PONG
received, no update of tables, no clearing of PFAIL flag.

In general to trust other nodes about the reachability of other nodes is
a broken distributed programming model.
2013-08-26 16:16:25 +02:00
antirez
6ae37b0e1d Cluster: set event handler in cluster bus listening socket.
The commit using listenToPort() introduced this bug by no longer
creating the event handler to handle incoming messages from the cluster
bus.
2013-08-22 14:53:53 +02:00
antirez
81a6a9639a Use listenToPort() in cluster.c as well. 2013-08-22 14:05:07 +02:00
antirez
042776aff7 Cluster: fix CLUSTER MEET ip address validation.
This was broken by the IPv6 support patches.
2013-08-22 11:54:28 +02:00
antirez
9cf30132cc Cluster: process MEET packets as PING packets.
Somewhat a previous commit broken this so CLUSTER MEET was no longer
working.
2013-08-22 11:53:28 +02:00
antirez
b804afcf01 Use a safe dict.c iterator in clusterCron(). 2013-08-21 15:51:15 +02:00
antirez
6ea8e0949c sdsrange() does not need to return a value.
Actaully the string is modified in-place and a reallocation is never
needed, so there is no need to return the new sds string pointer as
return value of the function, that is now just "void".
2013-07-24 11:21:39 +02:00
antirez
894eba07c8 Introduction of a new string encoding: EMBSTR
Previously two string encodings were used for string objects:

1) REDIS_ENCODING_RAW: a string object with obj->ptr pointing to an sds
stirng.

2) REDIS_ENCODING_INT: a string object where the obj->ptr void pointer
is casted to a long.

This commit introduces a experimental new encoding called
REDIS_ENCODING_EMBSTR that implements an object represented by an sds
string that is not modifiable but allocated in the same memory chunk as
the robj structure itself.

The chunk looks like the following:

+--------------+-----------+------------+--------+----+
| robj data... | robj->ptr | sds header | string | \0 |
+--------------+-----+-----+------------+--------+----+
                     |                       ^
                     +-----------------------+

The robj->ptr points to the contiguous sds string data, so the object
can be manipulated with the same functions used to manipulate plan
string objects, however we need just on malloc and one free in order to
allocate or release this kind of objects. Moreover it has better cache
locality.

This new allocation strategy should benefit both the memory usage and
the performances. A performance gain between 60 and 70% was observed
during micro-benchmarks, however there is more work to do to evaluate
the performance impact and the memory usage behavior.
2013-07-22 10:31:38 +02:00
antirez
631d656a94 All IP string repr buffers are now REDIS_IP_STR_LEN bytes. 2013-07-09 11:32:52 +02:00
Geoff Garside
ca78446c55 Mark places that might want changing for IPv6.
Any places which I feel might want to be updated to work differently
with IPv6 have been marked with a comment starting "IPV6:".

Currently the only comments address places where an IP address is
combined with a port using the standard : separated form. These may want
to be changed when printing IPv6 addresses to wrap the address in []
such as

	[2001:db8::c0:ffee]:6379

instead of

	2001:db8::c0:ffee:6379

as the latter format is a technically valid IPv6 address and it is hard
to distinguish the IPv6 address component from the port unless you know
the port is supposed to be there.
2013-07-08 15:58:14 +02:00
Geoff Garside
f7d9a92d4e Mark ip string buffers which could be reduced.
In two places buffers have been created with a size of 128 bytes which
could be reduced to INET6_ADDRSTRLEN to still hold a full IP address.
These places have been marked as they are presently big enough to handle
the needs of storing a printable IPv6 address.
2013-07-08 15:57:23 +02:00
Geoff Garside
e6bf4c2676 Update clusterCommand to handle AF_INET6 addresses
Changes the sockaddr_in to a sockaddr_storage. Attempts to convert the
IP address into an AF_INET or AF_INET6 before returning an "Invalid IP
address" error. Handles converting the sockaddr from either AF_INET or
AF_INET6 back into a string for storage in the clusterNode ip field.
2013-07-08 15:57:23 +02:00
Geoff Garside
5be83eecac Update node2IpString to handle AF_INET6 addresses.
Change the sockaddr_in to sockaddr_storage which is capable of storing
both AF_INET and AF_INET6 sockets. Uses the sockaddr_storage ss_family
to correctly return the printable IP address and port.

Function makes the assumption that the buffer is of at least
REDIS_CLUSTER_IPLEN bytes in size.
2013-07-08 15:57:23 +02:00
Geoff Garside
b39e827d22 Add missing includes for getpeername.
getpeername(2) requires <sys/socket.h> which on some systems also
requires <sys/types.h>. Include both to avoid compilation warnings.
2013-07-08 15:55:39 +02:00
Geoff Garside
9cfa02fe73 Add macro to define clusterNode.ip buffer size.
Add REDIS_CLUSTER_IPLEN macro to define the size of the clusterNode ip
character array. Additionally use this macro in inet_ntop(3) calls where
the size of the array was being defined manually.

The REDIS_CLUSTER_IPLEN is defined as INET_ADDRSTRLEN which defines the
correct size of a buffer to store an IPv4 address in. The
INET_ADDRSTRLEN macro itself is defined in the <netinet/in.h> header
file and should be portable across the majority of systems.
2013-07-08 15:55:39 +02:00
Geoff Garside
6e894f02cf Fix cluster.c inet_ntop use of sizeof(n->ip).
Using sizeof with an array will only return expected results if the
array is created in the scope of the function where sizeof is used. This
commit changes the inet_ntop calls so that they use the fixed buffer
value as defined in redis.h which is 16.
2013-07-08 15:51:37 +02:00
Geoff Garside
693b640510 Use inet_pton(3) in clusterCommand.
Replace inet_aton(3) call with the more future proof inet_pton(3)
function which is capable of handling additional address families.
2013-07-08 15:51:37 +02:00
Geoff Garside
a6ea707cec Use inet_ntop(3) in nodeIp2String & clusterCommand
Replace inet_ntoa(3) calls with the more future proof inet_ntop(3)
function which is capable of handling additional address families.
2013-07-08 15:51:37 +02:00
Geoff Garside
f5494a427e Update anetTcpAccept & anetPeerToString calls.
Add the additional ip buffer length argument to function calls of
anetTcpAccept and anetPeerToString in network.c and cluster.c
2013-07-08 15:51:37 +02:00
antirez
98eecb70eb Binding multiple IPs done properly with multiple sockets. 2013-07-05 11:47:20 +02:00
antirez
2160effc78 Revert "Cluster: use new anet.c listening socket creation API."
This reverts commit 016ac38a21.
2013-07-05 11:08:44 +02:00
antirez
016ac38a21 Cluster: use new anet.c listening socket creation API. 2013-07-04 18:49:49 +02:00
antirez
dfc98dccf4 Cluster: detect nodes address change. 2013-06-12 10:50:07 -07:00
antirez
d427373f01 clusterProcessPacket() comments improved for correctness. 2013-06-11 21:34:34 +02:00
antirez
5c9f6d4f55 Cluster: link reconnection on delayed PONG reply.
When the PONG delay is half the cluster node timeout, the link gets
disconnected (and later automatically reconnected) in order to ensure
that it's not just a dead connection issue.

However this operation is only performed if the link is old enough, in
order to avoid to disconnect the same link again and again (and among
the other problems, never receive the PONG because of that).

Note: when the link is reconnected, the 'ping_sent' field is not updated
even if a new ping is sent using the new connection, so we can still
reliably detect a node ping timeout.
2013-05-03 15:43:03 +02:00
antirez
1315b9f246 Cluster: restore PING sent time on reconnections. 2013-05-03 15:42:59 +02:00
antirez
ae71731019 Cluster: PING/PONG handling redesigned. 2013-05-03 15:42:38 +02:00
antirez
a120560f70 Cluster: process config from PING packets as we do for PONG.
Also clusterBroadcastPing() was renamed into clusterBroadcastPong()
that's what the function is actually doing.
2013-05-03 15:41:34 +02:00
antirez
8a51c067ad Cluster: createClusterLink() comment fixed for grammar. 2013-05-03 15:41:29 +02:00
xiaost7
ecdbaf4695 Cluster: fix clusterNode.name print format on debug message.
It was %40s instead of %.40s, and since the string is not null
terminated it caused random garbage to be displayed, and possibly a
crash.
2013-04-19 09:53:43 +02:00
antirez
b84570dece Cluster: reconfigure additonal slaves on failover. 2013-04-09 12:13:26 +02:00
antirez
68cf249f81 Cluster: use server.cluster_node_timeout directly.
We used to copy this value into the server.cluster structure, however this
was not necessary.

The reason why we don't directly use server.cluster->node_timeout is
that things that can be configured via redis.conf need to be directly
available in the server structure as server.cluster is allocated later
only if needed in order to reduce the memory footprint of non-cluster
instances.
2013-04-09 11:24:18 +02:00
antirez
ef4f25ff6e Cluster: configdigest field no longer used. Removed. 2013-04-09 11:07:25 +02:00
antirez
f09b2508f4 Cluster: properly send ping to nodes not pinged foro too much time.
In commit d728ec6 it was introduced the concept of sending a ping to
every node not receiving a ping since node_timeout/2 seconds.
However the code was located in a place that was not executed because of
a previous conditional causing the loop to re-iterate.

This caused false positives in nodes availability detection.

The current code is still not perfect as a node may be detected to be in
PFAIL state even if it does not reply for just node_timeout/2 seconds
that is not correct. There is a plan to improve this code ASAP.
2013-04-08 19:40:20 +02:00
antirez
05fa4f4034 Cluster: node timeout is now configurable. 2013-04-04 12:29:10 +02:00
antirez
00bab23c41 Cluster: turn hardcoded node timeout multiplicators into defines.
Most Redis Cluster time limits are expressed in terms of the configured
node timeout. Turn them into defines.
2013-04-04 12:04:11 +02:00
antirez
c39e34d007 Cluster: when slave changes master, remove it from the old master. 2013-03-25 15:01:25 +01:00
antirez
34c1871e9f Cluster: set node role on successful handshake. 2013-03-25 13:03:01 +01:00
antirez
a8b09faf3d Cluster: comment no longer in sync with code removed. 2013-03-21 10:47:10 +01:00
antirez
8c1bc8e865 Cluster: clear the PROMOTED slave directly into clusterSetMaster().
This way we make sure every time a master is turned into a replica
the flag will be cleared.
2013-03-20 11:51:44 +01:00
antirez
e006407fd0 Cluster: master node must clear its hash slots when turning into a slave.
When a master turns into a slave after a failover event, make sure to
clear the assigned slots before setting up the replication, as a slave
should never claim slots in an explicit way, but just take over the
master slots when replacing its master.
2013-03-20 11:32:35 +01:00
antirez
506f9a42b0 Cluster: new flag PROMOTED introduced.
A slave node set this flag for itself when, after receiving authorization
from the majority of nodes, it turns itself into a master.

At the same time now this flag is tested by nodes receiving a PING
message before reconfiguring after a failover event. This makes the
system more robust: even if currently there is no way to manually turn
a slave into a master it is possible that we'll have such a feature in
the future, or that simply because of misconfiguration a node joins the
cluster as master while others believe it's a slave. This alone is now
no longer enough to trigger reconfiguration as other nodes will check
for the PROMOTED flag.

The PROMOTED flag is cleared every time the node is turned back into a
replica of some other node.
2013-03-20 10:48:42 +01:00
antirez
026b9483db Cluster: add sender flags in cluster bus messages header.
Sender flags were not propagated for the sender, but only for nodes in
the gossip section. This is odd and in the next commits we'll need to
get updated flags for the sender node, so this commit adds a new field
in the cluster messages header.

The message header is the same size as we reused some free space that
was marked as 'unused' because of alignment concerns.
2013-03-20 10:32:00 +01:00
antirez
d15b027d91 Cluster: turn old master into a replica of node that failed over.
So when the failing master node is back in touch with the cluster,
instead of remaining unused it is converted into a replica of the
new master, ready to perform the fail over if the new master node
will fail at some point.

Note that as a side effect clients with stale configuration are now
not an issue as well, as the node converted into a slave will not
accept queries but will redirect clients accordingly.
2013-03-20 00:30:47 +01:00
antirez
4d62623015 Cluster: node replication role change handle improved.
The code handling a master that turns into a slave or the contrary was
improved in order to avoid repeating the same operations. Also
the readability and conceptual simplicity was improved.
2013-03-19 16:01:30 +01:00
antirez
88221f88c0 Cluster: new command CLUSTER FLUSHSLOTS.
It's just a simpler way to CLUSTER DELSLOTS with all the slots as
arguments, in order to obtain a node without assigned slots for
reconfiguration.
2013-03-19 09:58:05 +01:00
antirez
e28e61e839 Cluster: when failing over claim master slots. 2013-03-15 16:53:41 +01:00
antirez
dd091661d4 Cluster: log when a slave asks for failover authorization. 2013-03-15 16:44:08 +01:00
antirez
1375b0611b Cluster: slaves start failover with a small delay.
Redis Cluster can cope with a minority of nodes not informed about the
failure of a master in time for some reason (netsplit or node not
functioning properly, blocked, ...) however to wait a few seconds before
to start the failover will make most "normal" failovers simpler as the
FAIL message will propagate before the slave election happens.
2013-03-15 16:39:49 +01:00
antirez
d512a09c20 Cluster: a bit more serious node role change handling. 2013-03-15 16:35:16 +01:00
antirez
004fbef847 Cluster: remove node from master slaves when it turns into a master.
Also, a few nearby comments improved.
2013-03-15 16:16:19 +01:00
antirez
44c92f5aeb Cluster: slave failover implemented. 2013-03-15 16:11:34 +01:00
antirez
1d8f302e0d Cluster: election -> promotion in two comments. 2013-03-15 15:44:49 +01:00
antirez
bf82195467 Cluster: added function to broadcast pings.
See the function top-comment for info why this is useful sometimes.
2013-03-15 15:43:58 +01:00
antirez
892e98548a Cluster: don't broadcast messages to HANDSHAKE nodes.
Also don't check for NOADDR as we check that node->link is not NULL
that's enough.
2013-03-15 15:36:36 +01:00
antirez
76a3954f4a Cluster: fix clusterHandleSlaveFailover() conditional: quorum is enough. 2013-03-15 13:20:34 +01:00
antirez
90e99a2082 Cluster: two lame bugs fixed in FAILOVER AUTH messages generation. 2013-03-14 21:27:12 +01:00
antirez
aeacaa57e6 Cluster: code to process messages moved in the right if-else chain. 2013-03-14 21:21:58 +01:00
antirez
35f05c66b6 Cluster: handle FAILOVER_AUTH_ACK messages.
That's trivial as we just need to increment the count of masters that
received with an ACK.
2013-03-14 16:43:13 +01:00