Commit Graph

459 Commits

Author SHA1 Message Date
antirez
94030fa4d7 Two cluster.c comments improved. 2015-03-21 12:12:23 +01:00
antirez
2950824ab6 Cluster: TAKEOVER option for manual failover. 2015-03-21 11:54:32 +01:00
antirez
a7010ae208 Cluster: non-conditional steps of slave failover refactored into a function. 2015-03-20 17:56:21 +01:00
antirez
230d141420 Cluster: separate unknown master check from the rest.
In no case we should try to attempt to failover if myself->slaveof is
NULL.
2015-03-20 16:56:59 +01:00
antirez
4f2555aa17 Cluster: refactoring around configEpoch handling.
This commit moves the process of generating a new config epoch without
consensus out of the clusterCommand() implementation, in order to make
it reusable for other reasons (current target is to have a CLUSTER
FAILOVER option forcing the failover when no master majority is
reachable).

Moreover the commit moves other functions which are similarly related to
config epochs in a new logical section of the cluster.c file, just for
clarity.
2015-03-20 16:42:52 +01:00
antirez
25c0f5ac63 Cluster: better cluster state transiction handling.
Before we relied on the global cluster state to make sure all the hash
slots are linked to some node, when getNodeByQuery() is called. So
finding the hash slot unbound was checked with an assertion. However
this is fragile. The cluster state is often updated in the
clusterBeforeSleep() function, and not ASAP on state change, so it may
happen to process clients with a cluster state that is 'ok' but yet
certain hash slots set to NULL.

With this commit the condition is also checked in getNodeByQuery() and
reported with a identical error code of -CLUSTERDOWN but slightly
different error message so that we have more debugging clue in the
future.

Root cause of issue #2288.
2015-03-20 09:59:28 +01:00
antirez
438a1a84e8 Cluster: more robust slave check in CLUSTER REPLICATE.
There are rare conditions where node->slaveof may be NULL even if the
node is a slave. To check by flag is much more robust.
2015-03-18 12:10:14 +01:00
antirez
93b1320fac Cluster: fix CLUSTER NODES optimization error in 'j' increment. 2015-03-13 13:16:35 +01:00
antirez
e1b6c9dd18 Cluster: CLUSTER NODES speedup. 2015-03-13 11:26:04 +01:00
Michel Martens
6201eb0c55 Add command CLUSTER MYID 2015-03-10 16:43:19 +00:00
antirez
c77081a45a Migrate: replace conditional with pre-computed value. 2015-02-27 22:33:54 +01:00
antirez
832b0c7cce Improvements to PR #2425
1. Remove useless "cs" initialization.
2. Add a "select" var to capture a condition checked multiple times.
3. Avoid duplication of the same if (!copy) conditional.
4. Don't increment dirty if copy is given (no deletion is performed),
   otherwise we propagate MIGRATE when not needed.
2015-02-26 10:27:56 +01:00
Tommy Wang
7fda935ad3 Add last_dbid to migrateCachedSocket to avoid redundant SELECT
Avoid redundant SELECT calls when continuously migrating keys to
the same dbid within a target Redis instance.
2015-02-26 10:18:43 +01:00
Salvatore Sanfilippo
d83c810265 Merge pull request #2301 from mattsta/fix/lengths
Improve type correctness
2015-02-24 17:22:53 +01:00
antirez
233729fe7f Cluster: some bias towwards FAIL/PFAIL nodes in gossip sections.
This improves PFAIL -> FAIL switch. Too late at this point in the RC
releases to add proper PFAIL/FAIL separate dictionary to do this in a
less randomized way. Tested in practice with experiments that this
helps. PFAIL -> FAIL average with 20 nodes and node-timeout set to 5
seconds takes 2.5 seconds without this commit, 1 second with this
commit.
2015-01-30 11:55:36 +01:00
antirez
69b4f00d28 More correct wanted / maxiterations values in clusterSendPing(). 2015-01-30 11:23:27 +01:00
antirez
e5a22064cc Cluster: magical 10% of nodes explained in comments. 2015-01-29 15:43:35 +01:00
antirez
1efacfe53d CLUSTER count-failure-reports command added. 2015-01-29 15:02:10 +01:00
antirez
3fd43062c8 Cluster: use a number of gossip sections proportional to cluster size.
Otherwise it is impossible to receive the majority of failure reports in
the node_timeout*2 window in larger clusters.

Still with a 200 nodes cluster, 20 gossip sections are a very reasonable
amount of bytes to send.

A side effect of this change is also fater cluster nodes joins for large
clusters, because the cluster layout makes less time to propagate.
2015-01-29 14:20:59 +01:00
antirez
9802ec3c83 Cluster: initialized not used fileds in gossip section.
Otherwise we risk sending not initialized data to other nodes, that may
contain anything. This was actually not possible only because the
initialization of the buffer where the cluster packets header is created
was larger than the 3 gossip sections we use, so the memory was already
all filled with zeroes by the memset().
2015-01-24 07:52:24 +01:00
Matt Stancliff
051a43e03a Fix cluster migrate memory leak
Fixes valgrind error:
48 bytes in 1 blocks are definitely lost in loss record 196 of 373
   at 0x4910D3: je_malloc (jemalloc.c:944)
   by 0x42807D: zmalloc (zmalloc.c:125)
   by 0x41FA0D: dictGetIterator (dict.c:543)
   by 0x41FA48: dictGetSafeIterator (dict.c:555)
   by 0x459B73: clusterHandleSlaveMigration (cluster.c:2776)
   by 0x45BF27: clusterCron (cluster.c:3123)
   by 0x423344: serverCron (redis.c:1239)
   by 0x41D6CD: aeProcessEvents (ae.c:311)
   by 0x41D8EA: aeMain (ae.c:455)
   by 0x41A84B: main (redis.c:3832)
2015-01-21 18:47:16 +01:00
Matt Stancliff
29049507ec Fix potential invalid read past end of array
If array has N elements, we can't read +1 if we are already at N.

Also, we need to move elements by their storage size in the array,
not just by individual bytes.
2015-01-21 18:01:03 +01:00
Matt Stancliff
30152554ea Fix cluster reset memory leak
[maybe] Fixes valgrind errors:
32 bytes in 4 blocks are definitely lost in loss record 107 of 228
   at 0x80EA447: je_malloc (jemalloc.c:944)
   by 0x806E59C: zrealloc (zmalloc.c:125)
   by 0x80A9AFC: clusterSetMaster (cluster.c:801)
   by 0x80AEDC9: clusterCommand (cluster.c:3994)
   by 0x80682A5: call (redis.c:2049)
   by 0x8068A20: processCommand (redis.c:2309)
   by 0x8076497: processInputBuffer (networking.c:1143)
   by 0x8073BAF: readQueryFromClient (networking.c:1208)
   by 0x8060E98: aeProcessEvents (ae.c:412)
   by 0x806123B: aeMain (ae.c:455)
   by 0x806C3DB: main (redis.c:3832)

64 bytes in 8 blocks are definitely lost in loss record 143 of 228
   at 0x80EA447: je_malloc (jemalloc.c:944)
   by 0x806E59C: zrealloc (zmalloc.c:125)
   by 0x80AAB40: clusterProcessPacket (cluster.c:801)
   by 0x80A847F: clusterReadHandler (cluster.c:1975)
   by 0x30000FF: ???

80 bytes in 10 blocks are definitely lost in loss record 148 of 228
   at 0x80EA447: je_malloc (jemalloc.c:944)
   by 0x806E59C: zrealloc (zmalloc.c:125)
   by 0x80AAB40: clusterProcessPacket (cluster.c:801)
   by 0x80A847F: clusterReadHandler (cluster.c:1975)
   by 0x2FFFFFF: ???
2015-01-21 17:51:57 +01:00
Matt Stancliff
72b8574cca Fix sending uninitialized bytes
Fixes valgrind error:
Syscall param write(buf) points to uninitialised byte(s)
   at 0x514C35D: ??? (syscall-template.S:81)
   by 0x456B81: clusterWriteHandler (cluster.c:1907)
   by 0x41D596: aeProcessEvents (ae.c:416)
   by 0x41D8EA: aeMain (ae.c:455)
   by 0x41A84B: main (redis.c:3832)
 Address 0x5f268e2 is 2,274 bytes inside a block of size 8,192 alloc'd
   at 0x4932D1: je_realloc (jemalloc.c:1297)
   by 0x428185: zrealloc (zmalloc.c:162)
   by 0x4269E0: sdsMakeRoomFor.part.0 (sds.c:142)
   by 0x426CD7: sdscatlen (sds.c:251)
   by 0x4579E7: clusterSendMessage (cluster.c:1995)
   by 0x45805A: clusterSendPing (cluster.c:2140)
   by 0x45BB03: clusterCron (cluster.c:2944)
   by 0x423344: serverCron (redis.c:1239)
   by 0x41D6CD: aeProcessEvents (ae.c:311)
   by 0x41D8EA: aeMain (ae.c:455)
   by 0x41A84B: main (redis.c:3832)
 Uninitialised value was created by a stack allocation
   at 0x457810: nodeUpdateAddressIfNeeded (cluster.c:1236)
2015-01-21 17:50:17 +01:00
antirez
2601e3e461 Cluster: node deletion cleanup / centralization. 2015-01-21 16:03:43 +01:00
antirez
59ad6ac5fe Cluster: set the slaves->slaveof filed to NULL when master is freed.
Related to issue #2289.
2015-01-21 15:55:53 +01:00
Matt Stancliff
53c082ec39 Improve networking type correctness
read() and write() return ssize_t (signed long), not int.

For other offsets, we can use the unsigned size_t type instead
of a signed offset (since our replication offsets and buffer
positions are never negative).
2015-01-19 14:10:12 -05:00
antirez
cf76af6b9f Cluster: fetch my IP even if msg is not MEET for the first time.
In order to avoid that misconfigured cluster nodes at some time may
force an IP update on other nodes, it is required that nodes update
their own address only on MEET messages. However it does not make sense
to do this the first time a node is contacted and yet does not have an
IP, we just risk that myself->ip remains not assigned if there are
messages lost or cluster creation procedures that don't make sure
everybody is targeted by at least one incoming MEET message.

Also fix the logging of the IP switch avoiding the :-1 tail.
2015-01-13 10:50:34 +01:00
antirez
5b0f4a83ac Cluster: clusterMsgDataGossip structure, explict padding + minor stuff.
Also explicitly set version to 0, add a protocol version define, improve
comments in the gossip structure.

Note that the structure layout is the same after the change, we are just
making the padding explicit with an additional not used 16 bits field.
So this commit is still able to talk with the previous versions of
cluster nodes.
2015-01-13 10:40:09 +01:00
antirez
237ab727b9 Suppress valgrind error about write sending uninitialized data.
Valgrind checks that the buffers we transfer via syscalls are all
composed of bytes actually initialized. This is useful, it makes we able
to avoid leaking informations in non initialized parts fo messages
transferred to other hosts. This commit fixes one of such issues.
2015-01-13 09:31:37 +01:00
antirez
6274a6789d Cluster: initialize mf_end.
Can't be initialized by resetManualFailover() since it's actual state
the function uses, so we need to initialize it at startup time. Not
really a bug in practical terms, but showed up into valgrind and is not
technically correct anyway.
2015-01-12 15:55:00 +01:00
Matt Stancliff
ad41a7c404 Add addReplyBulkSds() function
Refactor a common pattern into one function so we don't
end up with copy/paste programming.
2014-12-23 09:31:02 -05:00
Matt Stancliff
a772747ffc Cluster: Notify user on accept error
If we woke up to accept a connection, but we can't
accept it, inform the user of the error going on
with their networking.

(The previous message was the same for success or error!)
2014-12-17 10:49:32 -05:00
antirez
1aef29e079 Fix comment in clusterHandleSlaveFailover(). 2014-12-16 15:03:12 +01:00
antirez
90c7d8cfa1 Make sure buffer is enough in clusterSendPing(). 2014-12-15 10:18:22 +01:00
antirez
ce269ad3c5 AnetFormatIP(): renamed, commented, now sticks to IP:port format.
A few code style changes + consistent format: not nice for humans but
better for parsers.
2014-12-11 18:20:30 +01:00
Matt Stancliff
491881e13b Cleanup all IP formatting code
Instead of manually checking for strchr(n,':') everywhere,
we can use our new centralized IP formatting functions.
2014-12-11 10:12:18 -05:00
antirez
06e76bc3e2 Better read-only behavior for expired keys in slaves.
Slaves key expire is orchestrated by the master. Sometimes the master
will send the synthesized DEL to expire keys on the slave with a non
trivial delay (when the key is not accessed, only the incremental expiry
algorithm will expire it in background).

During that time, a key is logically expired, but slaves still return
the key if you GET (or whatever) it. This is a bad behavior.

However we can't simply trust the slave view of the key, since we need
the master to be able to send write commands to update the slave data
set, and DELs should only happen when the key is expired in the master
in order to ensure consistency.

However 99.99% of the issues with this behavior is when a client which
is not a master sends a read only command. In this case we are safe and
can consider the key as non existing.

This commit does a few changes in order to make this sane:

1. lookupKeyRead() is modified in order to return NULL if the above
conditions are met.
2. Calls to lookupKeyRead() in commands actually writing to the data set
are repliaced with calls to lookupKeyWrite().

There are redundand checks, so for example, if in "2" something was
overlooked, we should be still safe, since anyway, when the master
writes the behavior is to don't care about what expireIfneeded()
returns.

This commit is related to  #1768, #1770, #2131.
2014-12-10 16:10:21 +01:00
antirez
669aa2a210 Cluster PUBLISH message: fix totlen count.
bulk_data field size was not removed from the count. It is not possible
to declare it simply as 'char bulk_data[]' since the structure is nested
into another structure.
2014-11-28 10:21:47 +01:00
Salvatore Sanfilippo
5a526c22cc Merge pull request #2096 from mattsta/cluster-ipv6
Enable Cluster IPv6 Support
2014-10-31 10:38:22 +01:00
Matt Stancliff
0014966c1e Networking: add more outbound IP binding fixes
Same as the original bind fixes (we just missed these the
first time around).

This helps Redis not automatically send
connections from the first IP on an interface if we are bound
to a specific IP address (e.g. with multiple IP aliases on one
interface, you want to send from _your_ IP, not from the first IP
on the interface).
2014-10-29 15:09:09 -04:00
Matt Stancliff
daca1edb6e Parse cluster state file in IPv6 compatible way
We need to pick the port based on the _last_ colon, not the first one.
2014-10-29 15:08:35 -04:00
antirez
5f6950caa8 Cluster: process gossip section only for known nodes.
With the exception of nodes sending MEET packets: we have to trust them
since they can send us MEET packets only when the cluster is initially
created or because sysadmin manual action.
2014-10-08 16:58:12 +02:00
antirez
36e34a656a Cluster: fix logic to detect we are among a minority.
In the cluster evaluation function we are supposed to set the cluster
state as "fail" if we are among a minority, however the code was not
detecting to be into a minority partition if exactly half the masters
were reachable, which is a minority.
2014-10-08 16:27:07 +02:00
antirez
edb3987a06 Cluster: more chatty slaves when failover is stalled. 2014-10-07 09:51:55 +02:00
Matt Stancliff
12d0195b30 Clean up text throughout project
- Remove trailing newlines from redis.conf
  - Fix comment misspelling
  - Clarifies zipEncodeLength usage and a C API mention (#1243, #1242)
  - Fix cluster typos (inspired by @papanikge #1507)
  - Fix rewite -> rewrite in a few places (inspired by #682)

Closes #1243, #1242, #1507
2014-09-29 06:49:07 -04:00
antirez
2374496799 Cluster: claim ping_sent time even if we can't connect.
This fixes a potential bug that was never observed in practice since
what happens is that the asynchronous connect returns ok (to fail later,
calling the handler) every time, so a ping is queued, and sent_ping
happens to always be populated.

Howver technically connect(2) with a non blocking socket may return an
error synchronously, so before this fix the code was not correct.
2014-09-17 16:39:41 +02:00
antirez
c89afc8e5d Cluster: new option to work with partial slots coverage. 2014-09-17 11:10:09 +02:00
Matt Stancliff
60c448b584 Cluster: Fix segfault if cluster config corrupt
This commit adds a size check after initial config
line parsing to make sure we have *at least* 8 arguments
per line.

Also, instead of asserting for cluster->myself, we just test
and error out normally (since the error does a hard exit anyway).

Closes #1597
2014-08-25 10:11:38 +02:00
Matt Stancliff
879e18b7ec Fix memory leak in cluster config parsing
The continue stop us from triggering the
free after the long line for loop, so add it
earlier.
2014-08-18 11:27:19 +02:00