Commit Graph

561 Commits

Author SHA1 Message Date
shenlongxing
35ca670060 Fix cluster-announce-ip memory leak 2018-07-31 16:01:44 +08:00
antirez
0bdeb861f4 Example the magic +1 in migrateCommand().
Related to #5154.
2018-07-24 17:31:43 +02:00
antirez
53d46fa712 Make changes of PR #5154 hopefully simpler. 2018-07-24 17:27:43 +02:00
WuYunlong
4017a11144 Do not migrate already expired keys. 2018-07-21 10:00:32 +08:00
Jack Drogon
93238575f7 Fix typo 2018-07-03 18:19:46 +02:00
antirez
2edcafb35d addReplySubSyntaxError() renamed to addReplySubcommandSyntaxError(). 2018-07-02 18:49:34 +02:00
Salvatore Sanfilippo
bc6a004588
Merge pull request #4998 from itamarhaber/module_command_help
Module command help
2018-07-02 18:46:56 +02:00
Guy Benoish
dfcc20f4fd Fix compiler warning in restoreCommand 2018-06-24 16:53:01 +07:00
Guy Benoish
b5197f1fc9 Enhance RESTORE with RDBv9 new features
RESTORE now supports:
1. Setting LRU/LFU
2. Absolute-time TTL

Other related changes:
1. RDB loading will not override LRU bits when RDB file
   does not contain the LRU opcode.
2. RDB loading will not set LRU/LFU bits if the server's
   maxmemory-policy does not match.
2018-06-20 15:11:08 +07:00
antirez
e94b2053c6 Modify clusterRedirectClient() to handle ZPOP and XREAD. 2018-06-19 15:53:32 +02:00
Itamar Haber
fefde6e3e4 Capitalizes subcommands & orders lexicographically 2018-06-09 21:03:52 +03:00
Itamar Haber
c199280edb Globally applies addReplySubSyntaxError 2018-06-07 18:39:36 +03:00
antirez
a7dbe37d53 Typo: entires -> entries in several places. 2018-06-07 14:36:56 +02:00
shenlongxing
c85ae56edc Fix write() errno error 2018-06-06 13:06:42 +02:00
antirez
a97df1a6e1 Modules Cluster API: make node IDs pointers constant. 2018-03-30 13:16:07 +02:00
antirez
0701cad3de Modules Cluster API: message bus implementation. 2018-03-29 15:13:31 +02:00
Salvatore Sanfilippo
44f2cfa631
Merge pull request #4722 from charsyam/feature/refactoring-call-aeDeleteFileEvent-twice-in-freeClusterLink
Refactoring to call aeDeleteFileEvent twice as once
2018-03-22 16:23:40 +01:00
antirez
432bf4770e Cluster: ability to prevent slaves from failing over their masters.
This commit, in some parts derived from PR #3041 which is no longer
possible to merge (because the user deleted the original branch),
implements the ability of slaves to have a special configuration
preventing that they try to start a failover when the master is failing.

There are multiple reasons for wanting this, and the feautre was
requested in issue #3021 time ago.

The differences between this patch and the original PR are the
following:

1. The flag is saved/loaded on the nodes configuration.
2. The 'myself' node is now flag-aware, the flag is updated as needed
   when the configuration is changed via CONFIG SET.
3. The flag name uses NOFAILOVER instead of NO_FAILOVER to be consistent
   with existing NOADDR.
4. The redis.conf documentation was rewritten.

Thanks to @deep011 for the original patch.
2018-03-14 14:01:38 +01:00
charsyam
da7f5700cf refactoring-call-aeDeleteFileEvent-twice-in-freeClusterLink 2018-03-01 22:30:39 +09:00
antirez
533d0e0375 Cluster: improve crash-recovery safety after failover auth vote.
Add AE_BARRIER to the writable event loop so that slaves requesting
votes can't be served before we re-enter the event loop in the next
iteration, so clusterBeforeSleep() will fsync to disk in time.
Also add the call to explicitly fsync, given that we modified the last
vote epoch variable.
2018-02-27 13:06:42 +01:00
antirez
727dd43614 Fix migrateCommand() access of not initialized byte. 2018-01-18 12:41:05 +01:00
antirez
3ce1c28d47 Rewrite MIGRATE AUTH option.
See PR #2507. This is a reimplementation of the fix that contained
different problems.
2018-01-09 18:48:26 +01:00
antirez
de276b6a43 Cluster: allow read-only EVAL/EVALSHA in slaves.
Fix #3665.
2017-12-13 13:36:01 +01:00
antirez
522760fac7 Change indentation and other minor details of PR #4489.
The main change introduced by this commit is pretending that help
arrays are more text than code, thus indenting them at level 0. This
improves readability, and is an old practice when defining arrays of
C strings describing text.

Additionally a few useless return statements are removed, and the HELP
subcommand capitalized when printed to the user.
2017-12-06 12:05:14 +01:00
Itamar Haber
8b51121998 Merge remote-tracking branch 'upstream/unstable' into help_subcommands 2017-12-05 18:14:59 +02:00
Itamar Haber
bd5af03dbd Adds help to CLUSTER command 2017-12-03 19:05:10 +02:00
Salvatore Sanfilippo
3508b9c440
Merge pull request #4170 from TehWebby/patch-2
Fix typo
2017-11-28 18:40:43 +01:00
antirez
ffcf7d5ab1 Fix buffer overflows occurring reading redis.conf.
There was not enough sanity checking in the code loading the slots of
Redis Cluster from the nodes.conf file, this resulted into the
attacker's ability to write data at random addresses in the process
memory, by manipulating the index of the array. The bug seems
exploitable using the following techique: the config file may be altered so
that one of the nodes gets, as node ID (which is the first field inside the
structure) some data that is actually executable: then by writing this
address in selected places, this node ID part can be executed after a
jump. So it is mostly just a matter of effort in order to exploit the
bug. In practice however the issue is not very critical because the
bug requires an unprivileged user to be able to modify the Redis cluster
nodes configuration, and at the same time this should result in some
gain. However Redis normally is unprivileged as well. Yet much better to
have this fixed indeed.

Fix #4278.
2017-10-31 09:41:22 +01:00
Shaun Webb
2e6f285009 Fix typo 2017-07-27 09:37:37 +09:00
Salvatore Sanfilippo
d9565379da Merge pull request #4128 from leonchen83/unstable
fix mismatch argument and return wrong value of clusterDelNodeSlots
2017-07-24 14:18:28 +02:00
antirez
a3778f3b0f Make representClusterNodeFlags() more robust.
This function failed when an internal-only flag was set as an only flag
in a node: the string was trimmed expecting a final comma before
exiting the function, causing a crash. See issue #4142.
Moreover generation of flags representation only needed at DEBUG log
level was always performed: a waste of CPU time. This is fixed as well
by this commit.
2017-07-20 15:17:35 +02:00
Leon Chen
9e7a8c0207 fix return wrong value of clusterDelNodeSlots 2017-07-20 17:24:38 +08:00
Leon Chen
2cdf4cc656 fix mismatch argument 2017-07-18 02:28:24 -05:00
antirez
e1b8b4b6da CLUSTER GETKEYSINSLOT: avoid overallocating.
Close #3911.
2017-07-11 15:49:09 +02:00
Suraj Narkhede
f85f36f50d Fix following issues in blocking commands:
1. brpop last key index, thus checking all keys for slots.
2. Memory leak in clusterRedirectBlockedClientIfNeeded.
3. Remove while loop in clusterRedirectBlockedClientIfNeeded.
2017-06-23 00:30:21 -07:00
Suraj Narkhede
d303bca587 Fix brpop command table entry and redirect blocked clients. 2017-06-22 23:52:00 -07:00
Antonio Mallia
2d1d57eb47 Removed duplicate 'sys/socket.h' include 2017-06-04 15:26:53 +01:00
antirez
271733f4f8 Cluster: discard pong times in the future.
However we allow for 500 milliseconds of tolerance, in order to
avoid often discarding semantically valid info (the node is up)
because of natural few milliseconds desync among servers even when
NTP is used.

Note that anyway we should ping the node from time to time regardless and
discover if it's actually down from our point of view, since no update
is accepted while we have an active ping on the node.

Related to #3929.
2017-04-15 10:12:08 +02:00
antirez
02777bb252 Cluster: always add PFAIL nodes at end of gossip section.
To rely on the fact that nodes in PFAIL state will be shared around by
randomly adding them in the gossip section is a weak assumption,
especially after changes related to sending less ping/pong packets.

We want to always include gossip entries for all the nodes that are in
PFAIL state, so that the PFAIL -> FAIL state promotion can happen much
faster and reliably.

Related to #3929.
2017-04-14 13:39:49 +02:00
antirez
8c829d9e43 Cluster: fix gossip section ping/pong times encoding.
The gossip section times are 32 bit, so cannot store the milliseconds
time but just the seconds approximation, which is good enough for our
uses. At the same time however, when comparing the gossip section times
of other nodes with our node's view, we need to convert back to
milliseconds.

Related to #3929. Without this change the patch to reduce the traffic in
the bus message does not work.
2017-04-14 11:01:22 +02:00
antirez
6878a3fedd Cluster: add clean-logs command to create-cluster script. 2017-04-14 10:52:00 +02:00
antirez
8f7bf2841a Cluster: decrease ping/pong traffic by trusting other nodes reports.
Cluster of bigger sizes tend to have a lot of traffic in the cluster bus
just for failure detection: a node will try to get a ping reply from
another node no longer than when the half the node timeout would elapsed,
in order to avoid a false positive.

However this means that if we have N nodes and the node timeout is set
to, for instance M seconds, we'll have to ping N nodes every M/2
seconds. This N*M/2 pings will receive the same number of pongs, so
a total of N*M packets per node. However given that we have a total of N
nodes doing this, the total number of messages will be N*N*M.

In a 100 nodes cluster with a timeout of 60 seconds, this translates
to a total of 100*100*30 packets per second, summing all the packets
exchanged by all the nodes.

This is, as you can guess, a lot... So this patch changes the
implementation in a very simple way in order to trust the reports of
other nodes: if a node A reports a node B as alive at least up to
a given time, we update our view accordingly.

The problem with this approach is that it could result into a subset of
nodes being able to reach a given node X, and preventing others from
detecting that is actually not reachable from the majority of nodes.
So the above algorithm is refined by trusting other nodes only if we do
not have currently a ping pending for the node X, and if there are no
failure reports for that node.

Since each node, anyway, pings 10 other nodes every second (one node
every 100 milliseconds), anyway eventually even trusting the other nodes
reports, we will detect if a given node is down from our POV.

Now to understand the number of packets that the cluster would exchange
for failure detection with the patch, we can start considering the
random PINGs that the cluster sent anyway as base line:
Each node sends 10 packets per second, so the total traffic if no
additioal packets would be sent, including PONG packets, would be:

    Total messages per second = N*10*2

However by trusting other nodes gossip sections will not AWALYS prevent
pinging nodes for the "half timeout reached" rule all the times. The
math involved in computing the actual rate as N and M change is quite
complex and depends also on another parameter, which is the number of
entries in the gossip section of PING and PONG packets. However it is
possible to compare what happens in cluster of different sizes
experimentally. After applying this patch a very important reduction in
the number of packets exchanged is trivial to observe, without apparent
impacts on the failure detection performances.

Actual numbers with different cluster sizes should be published in the
Reids Cluster documentation in the future.

Related to #3929.
2017-04-14 10:43:53 +02:00
antirez
c5d6f577f0 Cluster: collect more specific bus messages stats.
First step in order to change Cluster in order to use less messages.
Related to issue #3929.
2017-04-13 19:22:35 +02:00
antirez
1409c545da Cluster: hash slots tracking using a radix tree. 2017-03-27 16:37:22 +02:00
antirez
f917e0da4c Fix MIGRATE closing of cached socket on error.
After investigating issue #3796, it was discovered that MIGRATE
could call migrateCloseSocket() after the original MIGRATE c->argv
was already rewritten as a DEL operation. As a result the host/port
passed to migrateCloseSocket() could be anything, often a NULL pointer
that gets deferenced crashing the server.

Now the socket is closed at an earlier time when there is a socket
error in a later stage where no retry will be performed, before we
rewrite the argument vector. Moreover a check was added so that later,
in the socket_err label, there is no further attempt at closing the
socket if the argument was rewritten.

This fix should resolve the bug reported in #3796.
2017-02-09 09:58:38 +01:00
Salvatore Sanfilippo
6cf1a325d6 Merge pull request #3643 from andyli028/unstable
Modify MIN->MAX
2016-12-19 08:19:10 +01:00
antirez
b53e73e159 MIGRATE: Remove upfront ttl initialization.
After the fix for #3673 the ttl var is always initialized inside the
loop itself, so the early initialization is not needed.

Variables declaration also moved to a more local scope.
2016-12-14 12:43:55 +01:00
Salvatore Sanfilippo
c9f0456d81 Merge pull request #3673 from badboy/reset-ttl-on-migrating
Reset the ttl for additional keys
2016-12-14 12:41:00 +01:00
antirez
04542cff92 Replication: fix the infamous key leakage of writable slaves + EXPIRE.
BACKGROUND AND USE CASEj

Redis slaves are normally write only, however the supprot a "writable"
mode which is very handy when scaling reads on slaves, that actually
need write operations in order to access data. For instance imagine
having slaves replicating certain Sets keys from the master. When
accessing the data on the slave, we want to peform intersections between
such Sets values. However we don't want to intersect each time: to cache
the intersection for some time often is a good idea.

To do so, it is possible to setup a slave as a writable slave, and
perform the intersection on the slave side, perhaps setting a TTL on the
resulting key so that it will expire after some time.

THE BUG

Problem: in order to have a consistent replication, expiring of keys in
Redis replication is up to the master, that synthesize DEL operations to
send in the replication stream. However slaves logically expire keys
by hiding them from read attempts from clients so that if the master did
not promptly sent a DEL, the client still see logically expired keys
as non existing.

Because slaves don't actively expire keys by actually evicting them but
just masking from the POV of read operations, if a key is created in a
writable slave, and an expire is set, the key will be leaked forever:

1. No DEL will be received from the master, which does not know about
such a key at all.

2. No eviction will be performed by the slave, since it needs to disable
eviction because it's up to masters, otherwise consistency of data is
lost.

THE FIX

In order to fix the problem, the slave should be able to tag keys that
were created in the slave side and have an expire set in some way.

My solution involved using an unique additional dictionary created by
the writable slave only if needed. The dictionary is obviously keyed by
the key name that we need to track: all the keys that are set with an
expire directly by a client writing to the slave are tracked.

The value in the dictionary is a bitmap of all the DBs where such a key
name need to be tracked, so that we can use a single dictionary to track
keys in all the DBs used by the slave (actually this limits the solution
to the first 64 DBs, but the default with Redis is to use 16 DBs).

This solution allows to pay both a small complexity and CPU penalty,
which is zero when the feature is not used, actually. The slave-side
eviction is encapsulated in code which is not coupled with the rest of
the Redis core, if not for the hook to track the keys.

TODO

I'm doing the first smoke tests to see if the feature works as expected:
so far so good. Unit tests should be added before merging into the
4.0 branch.
2016-12-13 10:59:54 +01:00
Jan-Erik Rediger
2a32f0371e Reset the ttl for additional keys
Before, if a previous key had a TTL set but the current one didn't, the
TTL was reused and thus resulted in wrong expirations set.

This behaviour was experienced, when `MigrateDefaultPipeline` in
redis-trib was set to >1

Fixes #3655
2016-12-08 14:27:21 +01:00