Commit Graph

417 Commits

Author SHA1 Message Date
antirez
67bb2c46b2 Add casting to match printf format.
adjustOpenFilesLimit() and clusterUpdateSlotsWithConfig() that were
assuming uint64_t is the same as unsigned long long, which is true
probably for all the systems out there that we target, but still GCC
emitted a warning since technically they are two different types.
2014-04-07 08:58:06 +02:00
antirez
8f52173b2c Cluster: last_vote_epoch -> lastVoteEpoch.
Use cammel case for epochs that are persisted on disk.
2014-03-27 15:01:24 +01:00
antirez
7fb14b73ba Cluster: save/restore vars that must persist after recovery.
This fixes issue #1479.
2014-03-27 14:56:29 +01:00
antirez
6dd2dbbd36 Cluster: handshake "already known" error logged to VERBOSE.
This is not really an error but something that always happens for
example when creating a new cluster, or if the sysadmin rejoins manually
a node that is already known.

Since useless logs don't help, moved to VERBOSE level.
2014-03-26 16:35:38 +01:00
antirez
3cf6f1f54f Cluster: clusterHandleConfigEpochCollision() fixed.
New config epochs must always be obtained incrementing the currentEpoch,
that is itself guaranteed to be >= the max configEpoch currently known
to the node.
2014-03-26 12:31:28 +01:00
antirez
80d4c52cdf Cluster: better logging for clusterUpdateSlotsConfigWith(). 2014-03-26 12:09:38 +01:00
antirez
eb746ec408 Cluster: CLUSTER SETSLOT implementation comment updated.
Update the comment since the implementation details changed.
2014-03-25 17:50:46 +01:00
antirez
6c527a89a0 Cluster: configEpoch collisions resolution.
The slave election in Redis Cluster guarantees that slaves promoted to
masters always end with unique config epochs, however failures during
manual reshardings, software bugs and operational errors may in theory
cause two nodes to have the same configEpoch.

This commit introduces a mechanism to eventually always end with different
configEpochs if a collision ever happens.

As a (wanted) side effect, this also ensures that after a new cluster
is created, all nodes will end with a different configEpoch automatically.
2014-03-25 17:19:58 +01:00
antirez
c1041c570f Cluster: stay within 80 cols. 2014-03-25 16:07:14 +01:00
antirez
82b53c650c struct dictEntry -> dictEntry. 2014-03-20 16:20:37 +01:00
antirez
e26f4486b0 Cluster: update node configEpoch on UPDATE messages.
The UPDATE message contains the configEpoch of the node configuration
advertised in the packet. Update it if needed.
2014-03-11 11:53:09 +01:00
antirez
a2ff90919f Cluster: set slot error if we receive an update for a busy slot.
By manually modifying nodes configurations in random ways, it is possible
to create the following scenario:

A is serving keys for slot 10
B is manually configured to serve keys for slot 10

A receives an update from B (or another node) where it is informed that
the slot 10 is now claimed by B with a greater configuration epoch,
however A still has keys from slot 10.

With this commit A will put the slot in error setting it in IMPORTING
state, so that redis-trib can detect the issue.
2014-03-11 11:49:47 +01:00
antirez
1ed0ad77f0 Cluster: clarified a comment in clusterUpdateSlotsConfigWith(). 2014-03-11 11:32:40 +01:00
antirez
8287945ff8 Cluster: flush importing/migrating state when master is turned into slave. 2014-03-11 11:22:06 +01:00
antirez
2e8e0ad44e Cluster: clusterCloseAllSlots() added. 2014-03-11 11:16:18 +01:00
antirez
787b297046 Cluster: getKeysFromCommand() API cleaned up.
This API originated from the "diskstore" experiment, not for Redis
Cluster itself, so there were legacy/useless things trying to
differentiate between keys that are going to be overwritten and keys
that need to be fetched from disk (preloaded).

All useless with Cluster, so removed with the result of code
simplification.
2014-03-10 13:18:41 +01:00
antirez
c1a7d3e61f Cluster: abort on port too high error.
It also fixes multi-line comment style to be consistent with the rest of
the code base.

Related to #1555.
2014-03-10 10:41:27 +01:00
Salvatore Sanfilippo
442b06db54 Merge pull request #1555 from mattsta/cluster-port-error-out
Cluster port error out
2014-03-10 10:37:50 +01:00
antirez
ed8c55237b Cluster: be explicit about passing NULL as bind addr for connect.
The code was already correct but it was using that bindaddr[0] is set to
NULL as a side effect of current implementation if no bind address is
configured. This is not guarnteed to hold true in the future.
2014-03-10 10:33:53 +01:00
antirez
3e8a92ef8d Cluster: log error when anetTcpNonBlockBindConnect() fails. 2014-03-10 10:32:28 +01:00
Salvatore Sanfilippo
3b0edb80ec Merge pull request #1567 from mattsta/fix-cluster-join
Bind source address for cluster communication
2014-03-10 10:28:32 +01:00
antirez
0f1f25784f Cluster: better timeout and retry time for failover.
When node-timeout is too small, in the order of a few milliseconds,
there is no way the voting process can terminate during that time, so we
set a lower limit for the failover timeout of two seconds.

The retry time is set to two times the failover timeout time, so it is
at least 4 seconds.
2014-03-10 09:57:52 +01:00
antirez
6984692060 Cluster: fix conditional generating TRYAGAIN error. 2014-03-07 16:18:00 +01:00
antirez
36676c2318 Redis Cluster: support for multi-key operations. 2014-03-07 13:19:09 +01:00
Matt Stancliff
385c25f70f Remove redundant IP length definition
REDIS_CLUSTER_IPLEN had the same value as
REDIS_IP_STR_LEN.  They were both #define'd
to the same INET6_ADDRSTRLEN.
2014-03-06 17:55:43 +01:00
Matt Stancliff
d2040ab9b1 Remove some redundant code
Function nodeIp2String in cluster.c is exactly
anetPeerToString with a pre-extracted fd.
2014-03-06 17:55:39 +01:00
Matt Stancliff
59cf0b1902 Fix return value check for anetTcpAccept
anetTcpAccept returns ANET_ERR, not AE_ERR.

This isn't a physical error since both ANET_ERR
and AE_ERR are -1, but better to be consistent.
2014-03-06 17:55:31 +01:00
Matt Stancliff
e5b1e7be64 Bind source address for cluster communication
The first address specified as a bind parameter
(server.bindaddr[0]) gets used as the source IP
for cluster communication.

If no bind address is specified by the user, the
behavior is unchanged.

This patch allows multiple Redis Cluster instances
to communicate when running on the same interface
of the same host.
2014-03-04 17:36:45 -05:00
antirez
8dea2029a4 Fix configEpoch assignment when a cluster slot gets "closed".
This is still code to rework in order to use agreement to obtain a new
configEpoch when a slot is migrated, however this commit handles the
special case that happens when the nodes are just started and everybody
has a configEpoch of 0. In this special condition to have the maximum
configEpoch is not enough as the special epoch 0 is not unique (all the
others are).

This does not fixes the intrinsic race condition of a failover happening
while we are resharding, that will be addressed later.
2014-03-03 11:12:11 +01:00
Matt Stancliff
ce68caea37 Cluster: error out quicker if port is unusable
The default cluster control port is 10,000 ports higher than
the base Redis port.  If Redis is started on a too-high port,
Cluster can't start and everything will exit later anyway.
2014-02-19 17:30:07 -05:00
antirez
db6d628c3e Cluster: clusterDelNode(): remove node from master's slaves. 2014-02-11 10:34:25 +01:00
antirez
5e0e03be41 Cluster: UPDATE messages are the norm and verbose.
Logging them at WARNING level was of little utility and of sure disturb.
2014-02-11 10:18:24 +01:00
antirez
4a64286c36 Cluster: configEpoch assignment in SETNODE improved.
Avoid to trash a configEpoch for every slot migrated if this node has
already the max configEpoch across the cluster.

Still work to do in this area but this avoids both ending with a very
high configEpoch without any reason and to flood the system with fsyncs.
2014-02-11 10:09:17 +01:00
antirez
72f7abf6a2 Cluster: clusterSetStartupEpoch() made more generally useful.
The actual goal of the function was to get the max configEpoch found in
the cluster, so make it general by removing the assignment of the max
epoch to currentEpoch that is useful only at startup.
2014-02-11 10:00:14 +01:00
antirez
44f7afe28a Cluster: always increment the configEpoch in SETNODE after import.
Removed a stale conditional preventing the configEpoch from incrementing
after the import in certain conditions. Since the master got a new slot
it should always claim a new configuration.
2014-02-11 09:50:37 +01:00
antirez
a1349728ea Cluster: on resharding upgrade version of receiving node.
The node receiving the hash slot needs to have a version that wins over
the other versions in order to force the ownership of the slot.

However the current code is far from perfect since a failover can happen
during the manual resharding. The fix is a work in progress but the
bottom line is that the new version must either be voted as usually,
set by redis-trib manually after it makes sure can't be used by other
nodes, or reserved configEpochs could be used for manual operations (for
example odd versions could be never used by slaves and are always used
by CLUSTER SETSLOT NODE).
2014-02-11 00:36:05 +01:00
antirez
6dc26795aa Cluster: fsync at every SETSLOT command puts too pressure on disks.
During slots migration redis-trib can send a number of SETSLOT commands.
Fsyncing every time is a bit too much in production as verified
empirically.

To make sure configs are fsynced on all nodes after a resharding
redis-trib may send something like CLUSTER CONFSYNC.

In this case fsyncs were not providing too much value since anyway
processes can crash in the middle of the resharding of an hash slot, and
redis-trib should be able to recover from this condition anyway.
2014-02-10 23:54:08 +01:00
antirez
218358bbbd Cluster: conditions to clear "migrating" on slot for SETSLOT ... NODE changed.
If the slot is manually assigned to another node, clear the migrating
status regardless of the fact it was previously assigned to us or not,
as long as we no longer have keys for this slot.

This avoid a race during slots migration that may leave the slot in
migrating status in the source node, since it received an update message
from the destination node that is already claiming the slot.

This way we are sure that redis-trib at the end of the slot migration is
always able to close the slot correctly.
2014-02-10 23:51:47 +01:00
antirez
bf670e0745 Cluster: don't update slave's master if we don't know it.
There is no way we can update the slave's node->slaveof pointer if we
don't know the master (no node with such an ID in our tables).
2014-02-10 18:33:34 +01:00
antirez
a3755ae9ee Cluster: ignore slot config changes if we are importing it. 2014-02-10 18:04:43 +01:00
antirez
6fc53e16ad Cluster: update configEpoch after manually messing with slots. 2014-02-10 18:01:58 +01:00
antirez
1a73c992a3 Cluster: fixed inverted arguments in logging function call. 2014-02-10 17:21:10 +01:00
antirez
32563b4a5f Cluster: clear the FAIL status for masters without slots.
Masters without slots don't participate to the cluster but just do
redirections, no need to take them in FAIL state if they are back
reachable.
2014-02-10 17:18:27 +01:00
antirez
5b2082ead3 Cluster: replica migration should only work for masters serving slots. 2014-02-10 17:08:37 +01:00
antirez
f885fa8bac Cluster: clusterReadHandler() fixed to work with new message header. 2014-02-10 16:27:37 +01:00
antirez
7bf7b7350c Cluster: signature changed to "RCmb" (Redis Cluster message bus).
Sounds better after all.
2014-02-10 15:55:21 +01:00
antirez
dced9c0619 Cluster: discard bus messages with version != 0. 2014-02-10 15:54:22 +01:00
antirez
007e1c7cb2 Cluster: added signature + version in bus packets. 2014-02-10 15:53:09 +01:00
antirez
142281dc79 Cluster: keys slot computation now supports hash tags.
Currently this is marginally useful, only to make sure two keys are in
the same hash slot when the cluster is stable (no rehashing in
progress).

In the future it is possible that support will be added to run
mutli-keys operations with keys in the same hash slot.
2014-02-07 17:39:01 +01:00
antirez
04fe000bf8 Cluster: fixed MF condition in clusterHandleSlaveFailover().
For manual failover we need a manual failover in progress, and that
mf_can_start is true (master offset received and matched).
2014-02-05 16:01:56 +01:00
antirez
c6f02fd67a Cluster: CLUSTER FAILOVER replies with OK and logs the event. 2014-02-05 15:52:38 +01:00
antirez
c72449af30 Cluster: check that a MF is in progress in manualFailoverCheckTimeout().
Otherwise it is always detected as a manual failover timed out.
2014-02-05 15:45:24 +01:00
antirez
b7402bcad5 Cluster: force AUTH ACK on manual failover.
When a slave requests masters vote for a manual failover, the
REQUEST_AUTH message is flagged in a special way in order to force the
masters to give the authorization even if the master is not marked as
failing.
2014-02-05 13:10:03 +01:00
antirez
4cf0cd5719 Cluster: manual failover initial implementation. 2014-02-05 13:01:24 +01:00
antirez
a7d30681c9 Cluster: configurable replicas migration barrier.
It is possible to configure the min number of additional working slaves
a master should be left with, for a slave to migrate to an orphaned
master.
2014-01-31 11:26:36 +01:00
antirez
6c9359add1 Cluster: perform orphaned masters check before continue statements.
The check was placed in a way that conflicted with the continue
statements used by the node hearth beat code later that needs to skip
the current node sometimes. Moved at the start of the function so that's
always executed.
2014-01-30 18:23:31 +01:00
antirez
c2507b0ff6 Cluster: replica migration implementation.
This feature allows slaves to migrate to orphaned masters (masters
without working slaves), as long as a set of conditions are met,
including the fact that the migrating slave needs to be in a
master-slaves ring with at least another slave working.
2014-01-30 18:05:11 +01:00
antirez
5b4020fb42 Cluster: swap two code blocks to have a more obvious flow. 2014-01-30 16:34:23 +01:00
antirez
4beaaff8ea Cluster: remove not needed return statement breaking failover. 2014-01-29 17:28:46 +01:00
antirez
3582054982 Cluster: broadcast pong to other slaves in the same ring.
When we schedule a failover, broadcast a PONG to the slaves.
The other slaves that plan to get elected will do the same too, this way
it is likely that every slave will have a good picture of its own rank.

Note that this is N*N messages where N is the number of slaves for the
failing master, however usually even large clusters have many master
nodes but a limited number of replicas per node, so this is harmless.
2014-01-29 17:19:55 +01:00
antirez
e2b59621a8 Cluster: log offset when announcing the failover election delay. 2014-01-29 17:16:10 +01:00
antirez
940531e9b7 Cluster: added progressive election delay according to slave rank.
Note that when we compute the initial delay, there are probably still
more up to date information to receive from slaves with new offsets, so
the delay is recomputed when new data is available.
2014-01-29 16:53:45 +01:00
antirez
6f54032080 Cluster: function clusterGetSlaveRank() added.
Return the number of slaves for the same master having a better
replication offset of the current slave, that is, the slave "rank" used
to pick a delay before the request for election.
2014-01-29 16:39:04 +01:00
antirez
40cd38f0c4 Cluster: update node replication offset from bus packets headers. 2014-01-29 16:01:00 +01:00
antirez
9d4ded7ec6 Cluster: refactoring: new macros to check node flags. 2014-01-29 12:17:16 +01:00
antirez
099bd336db Cluster: use myself instead of server->cluster.myself. 2014-01-29 11:38:14 +01:00
antirez
e36bd8b43e Cluster: added a global myself pointer in cluster.c.
Accessing to the 'myself' node, the node representing the currently
running instance, is handy without the need to type
server.cluster->myself every time.
2014-01-29 11:22:22 +01:00
antirez
f1e09d8c41 Cluster: clusterBroadcastPong() improved with target selection.
Now we can broadcast a pong to all the instances or just the local
slaves (that is useful for replication offset propagation).
2014-01-29 11:08:52 +01:00
antirez
befcf6259e Cluster: broadcast master/slave replication offset in bus header. 2014-01-28 16:51:50 +01:00
antirez
0b1b25c51c Cluster: introduced repl_offset fields in clusterNode.
The two fields are used in order to remember the latest known
replication offset and the time we received it from other slave nodes.

This will be used by slaves in order to start the election procedure
with a delay that is proportional to the rank of the slave among the
other slaves for this master, when sorted for replication offset.

Usually this allows the slave with the most updated offset to win the
election and replace the failing master in the cluster.
2014-01-28 16:28:07 +01:00
antirez
0f9422d575 Cluster: update slaves lists in clusterSetMaster(). 2014-01-22 18:46:53 +01:00
antirez
5383ab0bc6 Cluster: CLUSTER SLAVES subcommand added. 2014-01-22 18:38:42 +01:00
antirez
603e480fd5 Cluster: clusterGenNodesDescription() refactored into two functions. 2014-01-22 18:36:12 +01:00
antirez
80e80668f4 Cluster: master nodes wait before rejoining the cluster after reboot.
One of the simple heuristics used by Redis Cluster in order to avoid
losing data in the typical failure modes created by the asynchronous
replication with the slaves (a master is unable, when accepting a
write, to immediately tell if it should be really accepted or refused
because of a configuration change), is to wait some time before to
rejoin the cluster after being partitioned away from the majority of
instances.

A similar condition happens when a master is restarted. It does not know
if it was already failed over, nor if all the clients have already an
updated configuration about the cluster map, so it is possible that
clients will try to write to stale masters that were restarted.

In a similar way this commit changes masters behavior so they wait
2000 milliseconds before accepting writes after a reboot. There is
nothing special about 2 seconds if not to be a value supposedly larger
a few orders of magnitude compared to the cluster bus communication
latencies.
2014-01-20 11:52:52 +01:00
antirez
e6970e204f Cluster: debug printf statemets removed.
These were committed for error after being inserted in order to fix an
issue.
2014-01-20 11:19:04 +01:00
antirez
ac3850cabd Cluster: allow CLUSTER REPLICATE to switch master.
The code was doing checks for slaves that should be done only when the
instance is currently a master. Switching a slave from a master to
another one should just work.
2014-01-17 18:22:35 +01:00
antirez
3d455393a6 Cluster: don't let a node forget its own master.
redis-trib should make sure to reconfigure slaves of a node to remove
from the cluster to replicate with other nodes before sending CLUSTER
FORGET.
2014-01-16 17:49:35 +01:00
antirez
0c373207fa Cluster: don't forget yourself with CLUSTER FORGET. 2014-01-16 09:46:23 +01:00
antirez
3e948970fe Cluster: use the node blacklist in CLUSTER FORGET.
CLUSTER FORGET is not useful if we can't remove a node from all the
nodes of our cluster because of the Gossip protocol that keeps adding
a given node to nodes where we already tried to remove it.

So now CLUSTER FORGET implements a nodes blacklist that is set and
checked by the Gossip section processing function. This way before a
node is re-added at least 60 seconds must elapse since the FORGET
execution.

This means that redis-trib has some time to remove a node from a whole
cluster. It is possible that in the future it will be uesful to raise
the 60 sec figure to something bigger.
2014-01-15 16:50:45 +01:00
antirez
ccf268fa17 Cluster: fix clusterBlacklistAddNode() by setting right expire time.
The hash table value should be set to now + 60 seconds otherwise it
expires immediately.
2014-01-15 16:49:31 +01:00
antirez
4e1861155f Cluster: clusterBlacklistAddNode() key lookup fixed.
We can't lookup by node->name that's not an SDS string but a plain C
array in the node structure.
2014-01-15 16:45:07 +01:00
antirez
b51be7b34f Cluster: clusterBlacklistExists() requires blacklist cleanup before lookup. 2014-01-15 16:06:54 +01:00
antirez
a81340abaf Cluster: set a minimum rejoin delay if node_timeout is too small.
The rejoin delay usually is the node timeout. However if the node
timeout is too small, we set it to 500 milliseconds, that is a value
chosen to be greater than most setups RTT / instances latency figures
so that likely communication with other nodes happen before rejoining.
2014-01-15 12:34:33 +01:00
antirez
a687cbc19c Cluster: periodically call clusterUpdateState() when cluster is down.
Usually we update the cluster state (to understand if we should accept
queries or reply with an error) only when there is a change in the state
of the nodes. However for the "delayed rejoin" feature to work, that is,
for a master to wait some time before accepting queries again after it
rejoins the majority, we need to periodically update the last time when
the node was partitioned away from the majority.

With this commit if the cluster is down we update the state ten times
per second.
2014-01-15 12:26:12 +01:00
antirez
25ddefdea3 Cluster: range checking in getSlotOrReply() fixed.
See issue #1426 on Github.
2014-01-15 11:33:46 +01:00
antirez
fb659cd334 Cluster: ignore empty lines in nodes.conf.
Even without the user messing manually with the file, it is still
possible to have blank lines (just a single "\n" per line) because of
how the nodes.conf update/write process works.
2014-01-15 11:23:41 +01:00
antirez
6c63df3031 Cluster: atomic update of nodes.conf file.
The way the file was generated was unsafe and leaded to nodes.conf file
corruption (zero length file) on server stop/crash during the creation
of the file.

The previous file update method was as simple as open with O_TRUNC
followed by the write call. While the write call was a single one with
the full payload, ensuring no half-written files for POSIX semantics,
stopping the server just after the open call resulted into a zero-length
file (all the nodes information lost!).
2014-01-15 10:31:20 +01:00
antirez
28273394cb Cluster: support to read from slave nodes.
A client can enter a special cluster read-only mode using the READONLY
command: if the client read from a slave instance after this command,
for slots that are actually served by the instance's master, the queries
will be processed without redirection, allowing clients to read from
slaves (but without any kind fo read-after-write guarantee).

The READWRITE command can be used in order to exit the readonly state.
2014-01-14 16:33:16 +01:00
antirez
58c8a071a5 Fix RESTORE ttl handling in 32 bit archs.
long was used instead of long long in order to handle a 64 bit
resolution millisecond timestamp.

This fixes issue #1483.
2014-01-09 11:09:23 +01:00
antirez
f510549044 Cluster: clusterProcessPacket() was not 80 cols friendly.
The function actually needs to be split into sub-functions at some
point in the future.
2013-12-25 17:57:36 +01:00
antirez
66ec1412fe Redis Cluster: add repl_ping_slave_period to slave data validity time.
When the configured node timeout is very small, the data validity time
(maximum data age for a slave to try a failover) is too little (ten
times the configured node timeout) when the replication link with the
master is mostly idle. In this case we'll receive some data from the
master only every server.repl_ping_slave_period to refresh the last
interaction with the master.

This commit adds to the max data validity time the slave ping period to
avoid this problem of slaves sensing too old data without a good reason.
However this max data validity time is likely a setting that should be
configurable by the Redis Cluster user in a way completely independent
from the node timeout.
2013-12-22 10:05:16 +01:00
antirez
658aff9d29 Redis Cluster: move node failure reports logging from VERBOSE to NOTICE level. 2013-12-21 00:04:53 +01:00
antirez
5a404c87c1 Redis Cluster: remove no longer relevant comment. 2013-12-20 14:40:11 +01:00
antirez
fda4cba912 Redis Cluster: reconfigure replication when master changes address. 2013-12-20 12:47:22 +01:00
antirez
d7374032c0 Redis Cluster: handshake code refactoring + Gossip IP switch detection.
This commit makes it simple to start an handshake with a specific node
address, and uses this in order to detect a node IP change and start a
new handshake in order to fix the IP if possible.
2013-12-20 12:38:03 +01:00
antirez
a2c938c834 Redis Cluster: delay state change when in the majority again.
As specified in the Redis Cluster specification, when a node can reach
the majority again after a period in which it was partitioend away with
the minorty of masters, wait some time before accepting queries, to
provide a reasonable amount of time for other nodes to upgrade its
configuration.

This lowers the probabilities of both a client and a master with not
updated configuration to rejoin the cluster at the same time, with a
stale master accepting writes.
2013-12-20 09:56:18 +01:00
antirez
7a666ac419 Cluster: set n->slaves to NULL in clusterNodeResetSlaves().
The value was otherwise undefined, so next time the node was promoted
again from slave to master, adding a slave to the list of slaves
would likely crash the server or result into undefined behavior.
2013-12-17 14:50:24 +01:00
antirez
fda91dbde3 Cluster: check link is valid before sending UPDATE. 2013-12-17 12:28:37 +01:00
antirez
f57bb36ce7 Cluster: initialize todo_before_sleep flags to 0. 2013-12-17 12:22:02 +01:00
antirez
c70c0c6db7 Cluster: use proper type mstime_t for ping delay var. 2013-12-17 10:27:36 +01:00