Commit Graph

2261 Commits

Author SHA1 Message Date
antirez
abd6308d27 Set server.repl_down_since to 0 when changing master.
When an instance is potentially set to replicate with another master, it
is conceptually disconnected forever, since we have no old copy of the
dataset for this master in memory.
2014-01-17 18:20:31 +01:00
antirez
36c24bcca0 Cluster: redis-trib shows number of replicas of masters. 2014-01-17 17:56:45 +01:00
antirez
27ed9da383 Cluster: redis-trib help output format modified. 2014-01-17 12:32:49 +01:00
antirez
a68c9ba97e Cluster: redis-trib shows what a slave replicates + fixes.
Also the :replicates info field in the node object is now correctly
populated. This also fixes the :replicas field computation.
2014-01-17 12:06:18 +01:00
antirez
b451176734 Cluster: redis-trib addnode is now able to add replicas. 2014-01-17 11:48:42 +01:00
antirez
30d9c1dc32 Cluster: fix redis-trib help subcommand. 2014-01-17 10:29:40 +01:00
antirez
17d0c3e85a Cluster: redis-trib delnode implementation. 2014-01-16 18:22:03 +01:00
antirez
3d455393a6 Cluster: don't let a node forget its own master.
redis-trib should make sure to reconfigure slaves of a node to remove
from the cluster to replicate with other nodes before sending CLUSTER
FORGET.
2014-01-16 17:49:35 +01:00
antirez
9531c84807 Cluster: redis-trib help output improved.
Show options if any. Clarify that for some command any node address is
ok.
2014-01-16 16:23:33 +01:00
antirez
0c373207fa Cluster: don't forget yourself with CLUSTER FORGET. 2014-01-16 09:46:23 +01:00
antirez
3e948970fe Cluster: use the node blacklist in CLUSTER FORGET.
CLUSTER FORGET is not useful if we can't remove a node from all the
nodes of our cluster because of the Gossip protocol that keeps adding
a given node to nodes where we already tried to remove it.

So now CLUSTER FORGET implements a nodes blacklist that is set and
checked by the Gossip section processing function. This way before a
node is re-added at least 60 seconds must elapse since the FORGET
execution.

This means that redis-trib has some time to remove a node from a whole
cluster. It is possible that in the future it will be uesful to raise
the 60 sec figure to something bigger.
2014-01-15 16:50:45 +01:00
antirez
ccf268fa17 Cluster: fix clusterBlacklistAddNode() by setting right expire time.
The hash table value should be set to now + 60 seconds otherwise it
expires immediately.
2014-01-15 16:49:31 +01:00
antirez
4e1861155f Cluster: clusterBlacklistAddNode() key lookup fixed.
We can't lookup by node->name that's not an SDS string but a plain C
array in the node structure.
2014-01-15 16:45:07 +01:00
antirez
b51be7b34f Cluster: clusterBlacklistExists() requires blacklist cleanup before lookup. 2014-01-15 16:06:54 +01:00
antirez
a81340abaf Cluster: set a minimum rejoin delay if node_timeout is too small.
The rejoin delay usually is the node timeout. However if the node
timeout is too small, we set it to 500 milliseconds, that is a value
chosen to be greater than most setups RTT / instances latency figures
so that likely communication with other nodes happen before rejoining.
2014-01-15 12:34:33 +01:00
antirez
a687cbc19c Cluster: periodically call clusterUpdateState() when cluster is down.
Usually we update the cluster state (to understand if we should accept
queries or reply with an error) only when there is a change in the state
of the nodes. However for the "delayed rejoin" feature to work, that is,
for a master to wait some time before accepting queries again after it
rejoins the majority, we need to periodically update the last time when
the node was partitioned away from the majority.

With this commit if the cluster is down we update the state ten times
per second.
2014-01-15 12:26:12 +01:00
antirez
25ddefdea3 Cluster: range checking in getSlotOrReply() fixed.
See issue #1426 on Github.
2014-01-15 11:33:46 +01:00
antirez
fb659cd334 Cluster: ignore empty lines in nodes.conf.
Even without the user messing manually with the file, it is still
possible to have blank lines (just a single "\n" per line) because of
how the nodes.conf update/write process works.
2014-01-15 11:23:41 +01:00
antirez
6c63df3031 Cluster: atomic update of nodes.conf file.
The way the file was generated was unsafe and leaded to nodes.conf file
corruption (zero length file) on server stop/crash during the creation
of the file.

The previous file update method was as simple as open with O_TRUNC
followed by the write call. While the write call was a single one with
the full payload, ensuring no half-written files for POSIX semantics,
stopping the server just after the open call resulted into a zero-length
file (all the nodes information lost!).
2014-01-15 10:31:20 +01:00
antirez
28273394cb Cluster: support to read from slave nodes.
A client can enter a special cluster read-only mode using the READONLY
command: if the client read from a slave instance after this command,
for slots that are actually served by the instance's master, the queries
will be processed without redirection, allowing clients to read from
slaves (but without any kind fo read-after-write guarantee).

The READWRITE command can be used in order to exit the readonly state.
2014-01-14 16:33:16 +01:00
antirez
aacbba2607 Fix typo in aofRewriteBufferAppend() comment. 2014-01-14 15:37:49 +01:00
antirez
5189485625 Set REDIS_AOF_REWRITE_MIN_SIZE to 64mb.
64mb is the default value in redis.conf. For some reason instead the
hard-coded default was 1mb that is too small.
2014-01-14 11:27:28 +01:00
antirez
d5763dceaf SENTINEL SET master quorum implemented. 2014-01-14 09:23:26 +01:00
antirez
fe86f890b0 SENTINEL SET: error on bad option name + flush config on error. 2014-01-13 11:55:57 +01:00
antirez
f822516e43 SENTINEL SET implemented.
The new command allows to change master-specific configurations
at runtime. All the settable parameters can be retrivied via the
SENTINEL MASTER command, so there is no equivalent "GET" command.
2014-01-13 11:53:29 +01:00
antirez
3cdcaff069 Sentinel: fix wrong arity error message. 2014-01-13 11:05:13 +01:00
antirez
964f6b17e9 Sentinel: SENTINEL REMOVE command added.
The command totally removes a monitored master.
2014-01-10 15:39:36 +01:00
antirez
cf2835519e Sentinel: releaseSentinelRedisInstance() top comment fixed.
The claim about unlinking the instance from the connected hash tables
was the opposite of the reality. Also the current actual behavior is
safer in most cases, so it is better to manually unlink when needed.
2014-01-10 15:33:42 +01:00
antirez
9d0f46c6f5 Sentinel: flush config on disk when new master is added. 2014-01-10 15:22:06 +01:00
antirez
d4f296bc1d anetResolveIP() prototype added to anet.h. 2014-01-10 15:18:41 +01:00
antirez
39f9f449b0 Sentinel: SENTINEL MONITOR command implemented.
It allows to add new masters to monitor at runtime.
2014-01-10 15:18:24 +01:00
antirez
774f0bd45e anetResolveIP() added to anet.c.
The new function is used when we want to normalize an IP address without
performing a DNS lookup if the string to resolve is not a valid IP.

This is useful every time only IPs are valid inputs or when we want to
skip DNS resolution that is slow during runtime operations if we are
required to block.
2014-01-10 15:02:39 +01:00
antirez
c42e4bd0b6 Sentinel: added SENTINEL MASTER <name> command.
With SENTINEL MASTERS it was already possible to list all the configured
masters, but not a specific one.
2014-01-10 14:41:52 +01:00
antirez
2bb9cd464e Add all the configurable fields to addReplySentinelRedisInstance().
Note: the auth password with the master is voluntarily not exposed.
2014-01-10 14:31:41 +01:00
antirez
5a7d04ee7b Trip comment to 80 cols in SentinelCommand(). 2014-01-10 14:13:04 +01:00
antirez
58c8a071a5 Fix RESTORE ttl handling in 32 bit archs.
long was used instead of long long in order to handle a 64 bit
resolution millisecond timestamp.

This fixes issue #1483.
2014-01-09 11:09:23 +01:00
antirez
e1ab2991c3 Fix keyspace events flags-to-string conversion.
Fixes issue #1491 on Github.
2014-01-08 17:18:34 +01:00
antirez
90a81b4ebb Don't send REPLCONF ACK to old masters.
Masters not understanding REPLCONF ACK will reply with errors to our
requests causing a number of possible issues.

This commit detects a global replication offest set to -1 at the end of
the replication, and marks the client representing the master with the
REDIS_PRE_PSYNC flag.

Note that this flag was called REDIS_PRE_PSYNC_SLAVE but now it is just
REDIS_PRE_PSYNC as it is used for both slaves and masters starting with
this commit.

This commit fixes issue #1488.
2014-01-08 14:28:16 +01:00
antirez
3f92e05637 Clarify a comment in slaveTryPartialResynchronization(). 2014-01-08 14:28:13 +01:00
antirez
fdf50e1e3d Log disconnection with slave only when ip:port is available. 2013-12-25 18:41:53 +01:00
antirez
2041882286 anetPeerToString / SockName: port can be NULL on errors too. 2013-12-25 18:41:49 +01:00
antirez
a2a900356e anetTcpGenericConnect() bug introduced in 9d19977 fixed.
Durign a refactoring I mispelled _port for port.
This is one of the reasons I never used _varname myself.
2013-12-25 18:41:45 +01:00
antirez
cb23d510f4 Remove useless goto from anetTcpGenericConnect(). 2013-12-25 18:41:41 +01:00
antirez
491f681088 anetTcpGenericConnect() code improved + 1 bug fix.
Now the socket is closed if anetNonBlock() fails, and in general the
code structure makes it harder to introduce this kind of bugs in the
future.

Reference: pull request #1059.
2013-12-25 18:15:28 +01:00
antirez
f510549044 Cluster: clusterProcessPacket() was not 80 cols friendly.
The function actually needs to be split into sub-functions at some
point in the future.
2013-12-25 17:57:36 +01:00
antirez
e789384255 Fix CONFIG REWRITE handling of unknown options.
There were two problems with the implementation.

1) "save" was not correctly processed when no save point was configured,
   as reported in issue #1416.
2) The way the code checked if an option existed in the "processed"
   dictionary was wrong, as we add the element with as a key associated
   with a NULL value, so dictFetchValue() can't be used to check for
   existance, but dictFind() must be used, that returns NULL only if the
   entry does not exist at all.
2013-12-23 12:50:27 +01:00
antirez
7e9433cee1 Configuring port to 0 disables IP socket as specified.
This was no longer the case with 2.8 becuase of a bug introduced with
the IPv6 support. Now it is fixed.

This fixes issue #1287 and #1477.
2013-12-23 11:31:35 +01:00
antirez
94e8c9e77e Make new masters inherit replication offsets.
Currently replication offsets could be used into a limited way in order
to understand, out of a set of slaves, what is the one with the most
updated data. For example this comparison is possible of N slaves
were replicating all with the same master.

However the replication offset was not transferred from master to slaves
(that are later promoted as masters) in any way, so for instance if
there were three instances A, B, C, with A master and B and C
replication from A, the following could happen:

C disconnects from A.
B is turned into master.
A is switched to master of B.
B receives some write.

In this context there was no way to compare the offset of A and C,
because B would use its own local master replication offset as
replication offset to initialize the replication with A.

With this commit what happens is that when B is turned into master it
inherits the replication offset from A, making A and C comparable.
In the above case assuming no inconsistencies are created during the
disconnection and failover process, A will show to have a replication
offset greater than C.

Note that this does not mean offsets are always comparable to understand
what is, in a set of instances, since in more complex examples the
replica with the higher replication offset could be partitioned away
when picking the instance to elect as new master. However this in
general improves the ability of a system to try to pick a good replica
to promote to master.
2013-12-22 11:43:25 +01:00
antirez
ba5eb44d14 Slave disconnection is an event worth logging. 2013-12-22 10:15:35 +01:00
antirez
66ec1412fe Redis Cluster: add repl_ping_slave_period to slave data validity time.
When the configured node timeout is very small, the data validity time
(maximum data age for a slave to try a failover) is too little (ten
times the configured node timeout) when the replication link with the
master is mostly idle. In this case we'll receive some data from the
master only every server.repl_ping_slave_period to refresh the last
interaction with the master.

This commit adds to the max data validity time the slave ping period to
avoid this problem of slaves sensing too old data without a good reason.
However this max data validity time is likely a setting that should be
configurable by the Redis Cluster user in a way completely independent
from the node timeout.
2013-12-22 10:05:16 +01:00
antirez
b2dedd9da8 Log when a slave lose the connection with its master. 2013-12-21 00:23:37 +01:00
antirez
658aff9d29 Redis Cluster: move node failure reports logging from VERBOSE to NOTICE level. 2013-12-21 00:04:53 +01:00
antirez
5a404c87c1 Redis Cluster: remove no longer relevant comment. 2013-12-20 14:40:11 +01:00
antirez
fda4cba912 Redis Cluster: reconfigure replication when master changes address. 2013-12-20 12:47:22 +01:00
antirez
d7374032c0 Redis Cluster: handshake code refactoring + Gossip IP switch detection.
This commit makes it simple to start an handshake with a specific node
address, and uses this in order to detect a node IP change and start a
new handshake in order to fix the IP if possible.
2013-12-20 12:38:03 +01:00
antirez
a2c938c834 Redis Cluster: delay state change when in the majority again.
As specified in the Redis Cluster specification, when a node can reach
the majority again after a period in which it was partitioend away with
the minorty of masters, wait some time before accepting queries, to
provide a reasonable amount of time for other nodes to upgrade its
configuration.

This lowers the probabilities of both a client and a master with not
updated configuration to rejoin the cluster at the same time, with a
stale master accepting writes.
2013-12-20 09:56:18 +01:00
antirez
b3632319a4 CONFIG REWRITE: no special handling or include and rename-command.
CONFIG REWRITE is now wiser and does not touch what it does not
understand inside redis.conf.
2013-12-19 15:57:11 +01:00
Yubao Liu
7da423f79f CONFIG REWRITE: don't throw some options on config rewrite
Those options will be thrown without this patch:
  include, rename-command, min-slaves-to-write, min-slaves-max-lag,
appendfilename.
2013-12-19 15:56:48 +01:00
antirez
3b9cf3ed3a CONFIG REWRITE: old development comments removed. 2013-12-19 15:30:06 +01:00
antirez
b221e13dac CONFIG REWRITE: don't wipe unknown options.
With this commit options not explicitly rewritten by CONFIG REWRITE are
not touched at all. These include new options that may not have support
for REWRITE, and other special cases like rename-command and include.
2013-12-19 15:25:45 +01:00
antirez
7a666ac419 Cluster: set n->slaves to NULL in clusterNodeResetSlaves().
The value was otherwise undefined, so next time the node was promoted
again from slave to master, adding a slave to the list of slaves
would likely crash the server or result into undefined behavior.
2013-12-17 14:50:24 +01:00
antirez
fda91dbde3 Cluster: check link is valid before sending UPDATE. 2013-12-17 12:28:37 +01:00
antirez
f57bb36ce7 Cluster: initialize todo_before_sleep flags to 0. 2013-12-17 12:22:02 +01:00
antirez
c70c0c6db7 Cluster: use proper type mstime_t for ping delay var. 2013-12-17 10:27:36 +01:00
antirez
7c1cbdceb2 Cluster: use an hardcoded 60 sec timeout in redis-trib connections.
Later this should be configurable from the command line but at least now
we use something more appropriate for our use case compared to the
redis-rb default timeout.
2013-12-17 10:00:33 +01:00
antirez
47815d38e0 Fixed clearNodeFailureIfNeeded() time type to mstime_t.
This prevented 32bit cluster instances from clearing the FAIL flag when
needed.
2013-12-17 09:45:52 +01:00
antirez
e88e6a6334 Cluster: use long long for timestamps in clusterGenNodesDescription().
Ping sent and pong received fields need to be casted to long long to be
printed correctly into 32 bit systems.
2013-12-17 09:38:11 +01:00
antirez
2dfc5e35a9 Makefile.dep updated. 2013-12-13 13:10:05 +01:00
antirez
c00453da1d SDIFF iterator misuse fixed in diff algorithm #1.
The bug could be easily triggered by:

    SADD foo a b c 1 2 3 4 5 6
    SDIFF foo foo

When the key was the same in two sets, an unsafe iterator was used to
check existence of elements in the same set we were iterating.
Usually this would just result into a wrong output, however with the
dict.c API misuse protection we have in place, the result was actually
an assertion failed that was triggered by the CI test, while creating
random datasets for the "MASTER and SLAVE consistency" test.
2013-12-13 11:34:21 +01:00
antirez
5320148883 Sentinel: dead code removed. 2013-12-13 11:01:13 +01:00
antirez
452dea30f6 Makefile: remove odd syntax not compatible with some make versions.
See issue #1448.
2013-12-12 15:19:39 +01:00
Salvatore Sanfilippo
a99c751d6c Merge pull request #1460 from codeeply/simplify2
comment mistake fixed
2013-12-12 02:23:44 -08:00
codeeply
0f06f8df07 comment mistake fixed 2013-12-12 16:33:29 +08:00
antirez
a5ec247f13 Replication: publish the slave_repl_offset when disconnected from master.
When a slave was disconnected from its master the replication offset was
reported as -1. Now it is reported as the replication offset of the
previous master, so that failover can be performed using this value in
order to try to select a slave with more processed data from a set of
slaves of the old master.
2013-12-11 15:23:15 +01:00
Salvatore Sanfilippo
0a89d9a0b1 Merge pull request #1451 from yossigo/unbalanced-quotes-fix
Return proper error on requests with an unbalanced number of quotes.
2013-12-11 03:06:18 -08:00
Yossi Gottlieb
88a5cede88 Fix wrong repldboff type which causes dropped replication in rare cases. 2013-12-11 11:38:02 +01:00
antirez
11120689c4 Slaves heartbeats during sync improved.
The previous fix for false positive timeout detected by master was not
complete. There is another blocking stage while loading data for the
first synchronization with the master, that is, flushing away the
current data from the DB memory.

This commit uses the newly introduced dict.c callback in order to make
some incremental work (to send "\n" heartbeats to the master) while
flushing the old data from memory.

It is hard to write a regression test for this issue unfortunately. More
support for debugging in the Redis core would be needed in terms of
functionalities to simulate a slow DB loading / deletion.
2013-12-10 18:47:31 +01:00
antirez
2eb781b35b dict.c: added optional callback to dictEmpty().
Redis hash table implementation has many non-blocking features like
incremental rehashing, however while deleting a large hash table there
was no way to have a callback called to do some incremental work.

This commit adds this support, as an optiona callback argument to
dictEmpty() that is currently called at a fixed interval (one time every
65k deletions).
2013-12-10 18:46:24 +01:00
antirez
2c4ab8a534 Log empty DB + Loading data into two separated messages. 2013-12-10 18:43:25 +01:00
antirez
7c531eb5ad Don't send more than 1 newline/sec while loading RDB. 2013-12-10 18:43:19 +01:00
antirez
27db38d069 Slaves heartbeat while loading RDB files.
Starting with Redis 2.8 masters are able to detect timed out slaves,
while before 2.8 only slaves were able to detect a timed out master.

Now that timeout detection is bi-directional the following problem
happens as described "in the field" by issue #1449:

1) Master and slave setup with big dataset.
2) Slave performs the first synchronization, or a full sync
   after a failed partial resync.
3) Master sends the RDB payload to the slave.
4) Slave loads this payload.
5) Master detects the slave as timed out since does not receive back the
   REPLCONF ACK acknowledges.

Here the problem is that the master has no way to know how much the
slave will take to load the RDB file in memory. The obvious solution is
to use a greater replication timeout setting, but this is a shame since
for the 0.1% of operation time we are forced to use a timeout that is
not what is suited for 99.9% of operation time.

This commit tries to fix this problem with a solution that is a bit of
an hack, but that modifies little of the replication internals, in order
to be back ported to 2.8 safely.

During the RDB loading time, we send the master newlines to avoid
being sensed as timed out. This is the same that the master already does
while saving the RDB file to still signal its presence to the slave.

The single newline is used because:

1) It can't desync the protocol, as it is only transmitted all or
nothing.
2) It can be safely sent while we don't have a client structure for the
master or in similar situations just with write(2).
2013-12-09 20:26:00 +01:00
antirez
eaf1bfb88b Handle inline requested terminated with just \n. 2013-12-09 13:28:39 +01:00
Yossi Gottlieb
6e70c01148 Return proper error on requests with an unbalanced number of quotes. 2013-12-08 12:58:12 +02:00
antirez
c590549e40 Sentinel: fix reported role info sampling.
The way the role change was recoded was not sane and too much
convoluted, causing the role information to be not always updated.

This commit fixes issue #1445.
2013-12-06 12:46:56 +01:00
antirez
2b414a4b5f Sentinel: fix reported role fields when master is reset.
When there is a master address switch, the reported role must be set to
master so that we have a chance to re-sample the INFO output to check if
the new address is reporting the right role.

Otherwise if the role was wrong, it will be sensed as wrong even after
the address switch, and for enough time according to the role change
time, for Sentinel consider the master SDOWN.

This fixes isue #1446, that describes the effects of this bug in
practice.
2013-12-06 11:37:46 +01:00
antirez
11e81a1e9a Fixed grammar: before H the article is a, not an. 2013-12-05 16:35:32 +01:00
antirez
58713c6b13 Fix clients timeout handling.
During the refactoring of blocking operations, commit
82b672f633, a bug was introduced where
a milliseconds time is compared to a seconds time, so all the clients
always appear to timeout if timeout is set to non-zero value.

Thanks to Jonathan Leibiusky for finding the bug and helping verifying
the cause and fix.
2013-12-05 14:55:07 +01:00
antirez
c5618e7fdd WAIT command: synchronous replication for Redis. 2013-12-04 16:20:03 +01:00
antirez
c2f305545a blocked.c API commented. 2013-12-03 18:03:15 +01:00
antirez
82b672f633 BLPOP blocking code refactored to be generic & reusable. 2013-12-03 17:43:53 +01:00
antirez
2e027c48e5 Removed old comments and dead code from freeClient(). 2013-12-03 13:54:06 +01:00
antirez
e4025ea926 Grammar fix in freeClient(). 2013-12-03 13:40:41 +01:00
antirez
f80cf7363a Sentinel: don't write HZ when flushing config.
See issue #1419.
2013-12-02 15:56:10 +01:00
antirez
dffebbc904 Sentinel: better time desynchronization.
Sentinels are now desynchronized in a better way changing the time
handler frequency between 10 and 20 HZ. This way on average a
desynchronization of 25 milliesconds is produced that should be larger
enough compared to network latency, avoiding most split-brain condition
during the vote.

Now that the clocks are desynchronized, to have larger random delays when
performing operations can be easily achieved in the following way.
Take as example the function that starts the failover, that is
called with a frequency between 10 and 20 HZ and will start the
failover every time there are the conditions. By just adding as an
additional condition something like rand()%4 == 0, we can amplify the
desynchronization between Sentinel instances easily.

See issue #1419.
2013-12-02 12:29:42 +01:00
antirez
6fa42b7507 Cluster: nodes re-addition blacklist API. 2013-12-02 11:12:23 +01:00
antirez
8f18345ef0 Cluster: basic data structures for nodes black list. 2013-11-29 17:37:06 +01:00
antirez
3db825fde4 Cluster: some code about clusterHandleSlaveFailover() marginally improved.
80 cols friendly, some minor change to the code to make it simpler.
2013-11-29 16:17:05 +01:00
antirez
55f90b11c9 Stop writes on MISCONF only if instance is a master.
From the point of view of the slave not accepting writes from the master
can only create a bigger consistency issue.
2013-11-28 16:29:26 +01:00
antirez
60817bb262 Reply to PING with error when there is a MISCONF state. 2013-11-28 16:17:10 +01:00
antirez
0addf8aff1 Sentinel: log vote received from other Sentinels. 2013-11-28 15:23:46 +01:00