Commit Graph

3601 Commits

Author SHA1 Message Date
antirez
ba42428633 Cluster: time switched from seconds to milliseconds.
All the internal state of cluster involving time is now using mstime_t
and mstime() in order to use milliseconds resolution.

Also the clusterCron() function is called with a 10 hz frequency instead
of 1 hz.

The cluster node_timeout must be also configured in milliseconds by the
user in redis.conf.
2013-10-09 16:19:26 +02:00
antirez
929b6a4480 Cluster: cluster stuff moved from redis.h to cluster.h. 2013-10-09 15:38:05 +02:00
antirez
6fa9b1a420 Merge branch 'bettercluster' into unstable 2013-10-08 13:04:33 +02:00
antirez
ae2763f564 Cluster: masters don't vote for a slave with stale config.
When a slave requests our vote, the configEpoch he claims for its master
and the set of served slots must be greater or equal to the configEpoch
of the nodes serving these slots in the current configuraiton of the
master granting its vote.

In other terms, masters don't vote for slaves having a stale
configuration for the slots they want to serve.
2013-10-08 12:45:35 +02:00
antirez
f7d6ad4366 Cluster: fix slave data age computation when master is still connected. 2013-10-07 16:07:13 +02:00
antirez
2c3301b9f5 Cluster: log message improved when FAIL is cleared from a slave node. 2013-10-07 15:44:58 +02:00
antirez
72f38cd70f Cluster: slave nodes advertise master slots bitmap and configEpoch. 2013-10-07 11:31:12 +02:00
antirez
0150c70b2b Replication: install the write handler when reusing a cached master.
Sometimes when we resurrect a cached master after a successful partial
resynchronization attempt, there is pending data in the output buffers
of the client structure representing the master (likely REPLCONF ACK
commands).

If we don't reinstall the write handler, it will never be installed
again by addReply*() family functions as they'll assume that if there is
already data pending, the write handler is already installed.

This bug caused some slaves after a successful partial sync to never
send REPLCONF ACK, and continuously being detected as timing out by the
master, with a disconnection / reconnection loop.
2013-10-04 16:14:54 +02:00
antirez
1461422ce6 Replication: install the write handler when reusing a cached master.
Sometimes when we resurrect a cached master after a successful partial
resynchronization attempt, there is pending data in the output buffers
of the client structure representing the master (likely REPLCONF ACK
commands).

If we don't reinstall the write handler, it will never be installed
again by addReply*() family functions as they'll assume that if there is
already data pending, the write handler is already installed.

This bug caused some slaves after a successful partial sync to never
send REPLCONF ACK, and continuously being detected as timing out by the
master, with a disconnection / reconnection loop.
2013-10-04 16:12:25 +02:00
antirez
6d8c2a4848 Replication: fix master timeout.
Since we started sending REPLCONF ACK from slaves to masters, the
lastinteraction field of the client structure is always refreshed as
soon as there is room in the socket output buffer, so masters in timeout
are detected with too much delay (the socket buffer takes a lot of time
to be filled by small REPLCONF ACK <number> entries).

This commit only counts data received as interactions with a master,
solving the issue.
2013-10-04 13:01:45 +02:00
antirez
b41570f719 Replication: fix master timeout.
Since we started sending REPLCONF ACK from slaves to masters, the
lastinteraction field of the client structure is always refreshed as
soon as there is room in the socket output buffer, so masters in timeout
are detected with too much delay (the socket buffer takes a lot of time
to be filled by small REPLCONF ACK <number> entries).

This commit only counts data received as interactions with a master,
solving the issue.
2013-10-04 12:59:24 +02:00
antirez
d62ae1ec05 PSYNC: safer handling of PSYNC requests.
There was a bug that over-esteemed the amount of backlog available,
however this could only happen when a slave was asking for an offset
that was in the "future" compared to the master replication backlog.

Now this case is handled well and logged as an incident in the master
log file.
2013-10-04 12:27:30 +02:00
antirez
4f9a69399b Add REWRITE to CONFIG subcommands help message. 2013-10-04 12:27:26 +02:00
antirez
37e06bd952 PSYNC: safer handling of PSYNC requests.
There was a bug that over-esteemed the amount of backlog available,
however this could only happen when a slave was asking for an offset
that was in the "future" compared to the master replication backlog.

Now this case is handled well and logged as an incident in the master
log file.
2013-10-04 12:25:09 +02:00
antirez
7afc0dd59a Cluster: new clusterDoBeforeSleep() API.
The new API is able to remember operations to perform before returning
to the event loop, such as checking if there is the failover quorum for
a slave, save and fsync the configuraiton file, and so forth.

Because this operations are performed before returning on the event
loop we are sure that messages that are sent in the same event loop run
will be delivered *after* the configuration is already saved, that is a
requirement sometimes. For instance we want to publish a new epoch only
when it is already stored in nodes.conf in order to avoid returning back
in the logical clock when a node is restarted.

This new API provides a big performance advantage compared to saving and
possibly fsyncing the configuration file multiple times in the same
event loop run, especially in the case of big clusters with tens or
hundreds of nodes.
2013-10-03 09:58:06 +02:00
antirez
211dcbe339 Cluster: update cluster config when slave changes master. 2013-10-02 12:27:12 +02:00
antirez
6c4d904baf Cluster: bus messages stats in CLUSTER info. 2013-10-02 10:10:08 +02:00
antirez
abe81781ae Cluster: FAIL messages from unknown senders are handled better.
Previously the event was not logged but instead the node reported an
unknown packet type received.
2013-10-02 09:42:45 +02:00
antirez
7970ebd80a Cluster: senderCurrentEpoch == node currentEpoch was too strict.
We can accept a vote as long as its epoch is >= the epoch at which we
started the voting process. There is no need for it to be exactly the
same.
2013-10-01 17:21:28 +02:00
antirez
f1bfd8233b Cluster: fix typo in clusterProcessPacket() comment. 2013-10-01 15:40:20 +02:00
antirez
1dedf9aa36 Cluster: time field removed from cluster messages header.
The new algorithm does not check replies time as checking for the
currentEpoch in the reply ensures that the reply is about the current
election process.
2013-09-30 16:19:44 +02:00
antirez
2b93a19537 Add REWRITE to CONFIG subcommands help message. 2013-09-30 11:53:18 +02:00
antirez
2d0844ee37 Cluster: log message shortened. 2013-09-30 11:51:58 +02:00
antirez
707ff0f714 Make clear that runids are not cluster node IDs. 2013-09-30 11:48:09 +02:00
antirez
4dc247eb31 Cluster: detect cluster reconfiguration when master slots drop to 0.
The old algorithm used a PROMOTED flag and explicitly checks about
slave->master convertions. Wit the new cluster meta-data propagation
algorithm we just look at the configEpoch to check if we need to
reconfigure slots, then:

1) If a node is a master but it reaches zero served slots becuase of
reconfiguration.
2) If a node is a slave but the master reaches zero served slots because
of a reconfiguration.

We switch as a replica of the new slots owner.
2013-09-30 11:45:26 +02:00
antirez
62b1591439 Cluster: re-order failover operations to make it safer.
We need to:

1) Increment the configEpoch.
2) Save it to disk and fsync the file.
3) Broadcast the PONG with the new configuration.

If other nodes will receive the updated configuration we need to be sure
to restart with this new config in the event of a crash.
2013-09-30 10:16:48 +02:00
antirez
b187517719 Cluster: when upading the configEpoch for a node, save config on disk ASAP. 2013-09-30 10:16:25 +02:00
antirez
03ca903983 Cluster: fsync data when saving the cluster config. 2013-09-30 10:13:07 +02:00
antirez
026e63392e Cluster: update the node configEpoch when newer is detected. 2013-09-27 09:55:41 +02:00
antirez
7c4b8f29e7 Cluster: react faster when a slave wins an election. 2013-09-26 16:54:43 +02:00
antirez
42fa46e49a Cluster: removed an old source of delay to start the slave failover. 2013-09-26 13:28:19 +02:00
antirez
a445aa30a0 Cluster: master node now uses new protocol to vote. 2013-09-26 13:00:41 +02:00
antirez
fb9b76fe14 Cluster: slave node now uses the new protocol to get elected. 2013-09-26 11:13:17 +02:00
Michel Martens
347ab78e90 Document the redis-cli --csv option. 2013-09-26 10:12:46 +02:00
antirez
656c3ffe4a Cluster: fix redis-trib node config fingerprinting for new nodes format. 2013-09-25 12:58:06 +02:00
antirez
341ed1d1a8 Cluster: fix redis-trib for added configEpoch field in CLUSTER NODES. 2013-09-25 12:44:56 +02:00
antirez
32b5410af9 Cluster: add currentEpoch to CLUSTER INFO. 2013-09-25 12:38:36 +02:00
antirez
6ec795d2cf Cluster: update our currentEpoch when a greater one is seen. 2013-09-25 12:36:29 +02:00
antirez
d426ada891 Cluster: broadcast currentEpoch and configEpoch in packets header. 2013-09-25 11:53:35 +02:00
antirez
12483b0061 Cluster: configEpoch added in cluster nodes description. 2013-09-25 11:47:13 +02:00
antirez
da257afe57 htonu64() and ntohu64 added to endianconv.h. 2013-09-25 09:26:36 +02:00
antirez
3c9bb8751a Cluster: PFAIL -> FAIL transition allowed for slaves.
First change: now there is no need to be a master in order to detect a
failure, however the majority of masters signaling PFAIL or FAIL is needed.

This change is important because it allows slaves rejoining the cluster
after a partition to sense the FAIL condition so that eventually all the
nodes agree on failures.
2013-09-20 11:26:44 +02:00
antirez
925ea9f858 Cluster: added time field in cluster bus messages.
The time is sent in requests, and copied back in reply packets.
This way the receiver can compare the time field in a reply with its
local clock and check the age of the request associated with this reply.

This is an easy way to discard delayed replies. Note that only a clock
is used here, that is the one of the node sending the packet. The
receiver only copies the field back into the reply, so no
synchronization is needed between clocks of different hosts.
2013-09-20 09:22:21 +02:00
antirez
7bec743e66 Allow AUTH / PING when disconnected from slave and serve-stale-data is no. 2013-09-17 09:46:06 +02:00
antirez
d0e327413b Cluster: don't add an handshake node for the same ip:port pair multiple times. 2013-09-04 15:52:16 +02:00
antirez
72587e6cc5 Cluster: free HANDSHAKE nodes after node_timeout.
Handshake nodes should turn into normal nodes or be freed in a
reasonable amount of time, otherwise they'll keep accumulating if the
address they are associated with is not reachable for some reason.
2013-09-04 12:41:21 +02:00
antirez
2debce325b redis-cli: fix big keys search when the key no longer exist.
The code freed a reply object that was never created, resulting in a
segfault every time randomkey returned a key that was deleted before we
queried it for size.
2013-09-04 10:35:53 +02:00
antirez
8eff339ca4 Cluster: CLUSTER SAVECONFIG command added. 2013-09-04 10:33:00 +02:00
antirez
528201ad6c Cluster: don't save HANDSHAKE nodes in nodes.conf. 2013-09-04 10:25:26 +02:00
antirez
e5d5da6f7c Cluster: always use safe iteartors to iterate server.cluster->nodes. 2013-09-04 10:07:50 +02:00