Otherwise it is impossible to receive the majority of failure reports in
the node_timeout*2 window in larger clusters.
Still with a 200 nodes cluster, 20 gossip sections are a very reasonable
amount of bytes to send.
A side effect of this change is also fater cluster nodes joins for large
clusters, because the cluster layout makes less time to propagate.
Previouly if we loaded a corrupt RDB, Redis printed an error report
with a big "REPORT ON GITHUB" message at the bottom. But, we know
RDB load failures are corrupt data, not corrupt code.
Now when RDB failure is detected (duplicate keys or unknown data
types in the file), we run check-rdb against the RDB then exit. The
automatic check-rdb hopefully gives the user instant feedback
about what is wrong instead of providing a mysterious stack
trace.
redis-check-rdb (previously redis-check-dump) had every RDB define
copy/pasted from rdb.h and some defines copied from redis.h. Since
the initial copy, some constants had changed in Redis headers and
check-dump was using incorrect values.
Since check-rdb is now a mode of Redis, the old check-dump code
is cleaned up to:
- replace all printf with redisLog (and remove \n from all strings)
- remove all copy/pasted defines to use defines from rdb.h and redis.h
- replace all malloc/free with zmalloc/zfree
- remove unnecessary include headers
redis-check-dump is now named redis-check-rdb and it runs
as a mode of redis-server instead of an independent binary.
You can now use 'redis-server redis.conf --check-rdb' to check
the RDB defined in redis.conf. Using argument --check-rdb
checks the RDB and exits. We could potentially also allow
the server to continue starting if the RDB check succeeds.
This change also enables us to use RDB checking programatically
from inside Redis for certain failure conditions.
Otherwise we risk sending not initialized data to other nodes, that may
contain anything. This was actually not possible only because the
initialization of the buffer where the cluster packets header is created
was larger than the 3 gossip sections we use, so the memory was already
all filled with zeroes by the memset().
On Darwin /dev/urandom depletes terribly fast. This is not an issue
normally, but with Redis Cluster we generate a lot of unique IDs, for
example during nodes handshakes. Our IDs need just to be unique without
other strong crypto requirements, so this commit turns the function into
something that gets a 20 bytes seed from /dev/urandom, and produces the
rest of the output just using SHA1 in counter mode.
Fixes valgrind error:
48 bytes in 1 blocks are definitely lost in loss record 196 of 373
at 0x4910D3: je_malloc (jemalloc.c:944)
by 0x42807D: zmalloc (zmalloc.c:125)
by 0x41FA0D: dictGetIterator (dict.c:543)
by 0x41FA48: dictGetSafeIterator (dict.c:555)
by 0x459B73: clusterHandleSlaveMigration (cluster.c:2776)
by 0x45BF27: clusterCron (cluster.c:3123)
by 0x423344: serverCron (redis.c:1239)
by 0x41D6CD: aeProcessEvents (ae.c:311)
by 0x41D8EA: aeMain (ae.c:455)
by 0x41A84B: main (redis.c:3832)
If array has N elements, we can't read +1 if we are already at N.
Also, we need to move elements by their storage size in the array,
not just by individual bytes.
[maybe] Fixes valgrind errors:
32 bytes in 4 blocks are definitely lost in loss record 107 of 228
at 0x80EA447: je_malloc (jemalloc.c:944)
by 0x806E59C: zrealloc (zmalloc.c:125)
by 0x80A9AFC: clusterSetMaster (cluster.c:801)
by 0x80AEDC9: clusterCommand (cluster.c:3994)
by 0x80682A5: call (redis.c:2049)
by 0x8068A20: processCommand (redis.c:2309)
by 0x8076497: processInputBuffer (networking.c:1143)
by 0x8073BAF: readQueryFromClient (networking.c:1208)
by 0x8060E98: aeProcessEvents (ae.c:412)
by 0x806123B: aeMain (ae.c:455)
by 0x806C3DB: main (redis.c:3832)
64 bytes in 8 blocks are definitely lost in loss record 143 of 228
at 0x80EA447: je_malloc (jemalloc.c:944)
by 0x806E59C: zrealloc (zmalloc.c:125)
by 0x80AAB40: clusterProcessPacket (cluster.c:801)
by 0x80A847F: clusterReadHandler (cluster.c:1975)
by 0x30000FF: ???
80 bytes in 10 blocks are definitely lost in loss record 148 of 228
at 0x80EA447: je_malloc (jemalloc.c:944)
by 0x806E59C: zrealloc (zmalloc.c:125)
by 0x80AAB40: clusterProcessPacket (cluster.c:801)
by 0x80A847F: clusterReadHandler (cluster.c:1975)
by 0x2FFFFFF: ???
Fixes valgrind error:
Syscall param write(buf) points to uninitialised byte(s)
at 0x514C35D: ??? (syscall-template.S:81)
by 0x456B81: clusterWriteHandler (cluster.c:1907)
by 0x41D596: aeProcessEvents (ae.c:416)
by 0x41D8EA: aeMain (ae.c:455)
by 0x41A84B: main (redis.c:3832)
Address 0x5f268e2 is 2,274 bytes inside a block of size 8,192 alloc'd
at 0x4932D1: je_realloc (jemalloc.c:1297)
by 0x428185: zrealloc (zmalloc.c:162)
by 0x4269E0: sdsMakeRoomFor.part.0 (sds.c:142)
by 0x426CD7: sdscatlen (sds.c:251)
by 0x4579E7: clusterSendMessage (cluster.c:1995)
by 0x45805A: clusterSendPing (cluster.c:2140)
by 0x45BB03: clusterCron (cluster.c:2944)
by 0x423344: serverCron (redis.c:1239)
by 0x41D6CD: aeProcessEvents (ae.c:311)
by 0x41D8EA: aeMain (ae.c:455)
by 0x41A84B: main (redis.c:3832)
Uninitialised value was created by a stack allocation
at 0x457810: nodeUpdateAddressIfNeeded (cluster.c:1236)
The cleanup code expects that if 'di' is not NULL, it is a valid
iterator that should be freed.
The result of this bug was a crash of the AOF rewriting process if an
error occurred after the DBs data are written and the iterator is no
longer valid.
Rationale is that when re-entering, it is likely due to Lua debugging
hooks. Returning an error will be ignored in most cases, going totally
unnoticed. With the log at least we leave a trace.
Related to issue #2302.
In order to avoid that misconfigured cluster nodes at some time may
force an IP update on other nodes, it is required that nodes update
their own address only on MEET messages. However it does not make sense
to do this the first time a node is contacted and yet does not have an
IP, we just risk that myself->ip remains not assigned if there are
messages lost or cluster creation procedures that don't make sure
everybody is targeted by at least one incoming MEET message.
Also fix the logging of the IP switch avoiding the :-1 tail.
Also explicitly set version to 0, add a protocol version define, improve
comments in the gossip structure.
Note that the structure layout is the same after the change, we are just
making the padding explicit with an additional not used 16 bits field.
So this commit is still able to talk with the previous versions of
cluster nodes.
Valgrind checks that the buffers we transfer via syscalls are all
composed of bytes actually initialized. This is useful, it makes we able
to avoid leaking informations in non initialized parts fo messages
transferred to other hosts. This commit fixes one of such issues.
Can't be initialized by resetManualFailover() since it's actual state
the function uses, so we need to initialize it at startup time. Not
really a bug in practical terms, but showed up into valgrind and is not
technically correct anyway.