When we fail to setup the write handler it does not make sense to take
the client around, it is missing writes: whatever is a client or a slave
anyway the connection should terminated ASAP.
Moreover what the function does exactly with its return value, and in
which case the write handler is installed on the socket, was not clear,
so the functions comment are improved to make the goals of the function
more obvious.
Also related to #2485.
master was closing the connection if the RDB transfer took long time.
and also sent PINGs to the slave before it got the initial ACK, in which case the slave wouldn't be able to find the EOF marker.
Segfault introduced during a refactoring / warning suppression a few
commits away. This particular call assumed that it is safe to pass NULL
to the object pointer argument when we are sure the set has a given
encoding. This can't be assumed and is now guaranteed to segfault
because of the new API of setTypeNext().
This change fixes several warnings compiling at -O3 level with GCC
4.8.2, and at the same time, in case of misuse of the API, we have the
pointer initialize to NULL or the integer initialized to the value
-123456789 which is easy to spot by naked eye.
No semantical changes since to make dict.c truly able to scale over the
32 bit table size limit, the hash function shoulds and other internals
related to hash function output should be 64 bit ready.
rehashidx is always positive in the two code paths, since the only
negative value it could have is -1 when there is no rehashing in
progress, and the condition is explicitly checked.
Bug as old as Redis and blocking operations. It's hard to trigger since
only happens on instance role switch, but the results are quite bad
since an inconsistency between master and slave is created.
How to trigger the bug is a good description of the bug itself.
1. Client does "BLPOP mylist 0" in master.
2. Master is turned into slave, that replicates from New-Master.
3. Client does "LPUSH mylist foo" in New-Master.
4. New-Master propagates write to slave.
5. Slave receives the LPUSH, the blocked client get served.
Now Master "mylist" key has "foo", Slave "mylist" key is empty.
Highlights:
* At step "2" above, the client remains attached, basically escaping any
check performed during command dispatch: read only slave, in that case.
* At step "5" the slave (that was the master), serves the blocked client
consuming a list element, which is not consumed on the master side.
This scenario is technically likely to happen during failovers, however
since Redis Sentinel already disconnects clients using the CLIENT
command when changing the role of the instance, the bug is avoided in
Sentinel deployments.
Closes#2473.
There was a bug in Redis Cluster caused by clients blocked in a blocking
list pop operation, for keys no longer handled by the instance, or
in a condition where the cluster became down after the client blocked.
A typical situation is:
1) BLPOP <somekey> 0
2) <somekey> hash slot is resharded to another master.
The client will block forever int this case.
A symmentrical non-cluster-specific bug happens when an instance is
turned from master to slave. In that case it is more serious since this
will desynchronize data between slaves and masters. This other bug was
discovered as a side effect of thinking about the bug explained and
fixed in this commit, but will be fixed in a separated commit.
1. No need to set btype in processUnblockedClients(), since clients
flagged REDIS_UNBLOCKED should have it already cleared.
2. When putting clients in the unblocked clients list, clientsArePaused()
should flag them with REDIS_UNBLOCKED. Not strictly needed with the
current code but is more coherent.
When the list of unblocked clients were processed, btype was set to
blocking type none, but the client remained flagged with REDIS_BLOCKED.
When timeout is reached (or when the client disconnects), unblocking it
will trigger an assertion.
There is no need to process pending requests from blocked clients, so
now clientsArePaused() just avoid touching blocked clients.
Close#2467.
This commit moves the process of generating a new config epoch without
consensus out of the clusterCommand() implementation, in order to make
it reusable for other reasons (current target is to have a CLUSTER
FAILOVER option forcing the failover when no master majority is
reachable).
Moreover the commit moves other functions which are similarly related to
config epochs in a new logical section of the cluster.c file, just for
clarity.
Before we relied on the global cluster state to make sure all the hash
slots are linked to some node, when getNodeByQuery() is called. So
finding the hash slot unbound was checked with an assertion. However
this is fragile. The cluster state is often updated in the
clusterBeforeSleep() function, and not ASAP on state change, so it may
happen to process clients with a cluster state that is 'ok' but yet
certain hash slots set to NULL.
With this commit the condition is also checked in getNodeByQuery() and
reported with a identical error code of -CLUSTERDOWN but slightly
different error message so that we have more debugging clue in the
future.
Root cause of issue #2288.
Not perfect since The Solution IMHO is to have a DSL with a table of
configuration functions with type, limits, and aux functions to handle
the odd ones. However this hacky macro solution is already better and
forces to put limits in the range of numerical fields.
More field types to be refactored in the next commits hopefully.
Should be much faster, and regardless, the code is more obvious now
compared to generating a string just to get the return value of the
ll2stirng() function.
1. HVSTRLEN -> HSTRLEN. It's unlikely one needs the length of the key,
not clear how the API would work (by value does not make sense) and
there will be better names anyway.
2. Default is to return 0 when field is missing.
3. Default is to return 0 when key is missing.
4. The implementation was slower than needed, and produced unnecessary COW.
Related issue #2415.
1. Remove useless "cs" initialization.
2. Add a "select" var to capture a condition checked multiple times.
3. Avoid duplication of the same if (!copy) conditional.
4. Don't increment dirty if copy is given (no deletion is performed),
otherwise we propagate MIGRATE when not needed.
Less grays: more readable palette since usually we have a non linear
distribution of percentages and very near gray tones are hard to take
apart. Final part of the palette is gradient from yellow to red. The red
part is hardly reached because of usual distribution of latencies, but
shows up mainly when latencies are very high because of the logarithmic
scale, this is coherent to what people expect: red = bad.
The old version of SPOP with "count" argument used an API call of dict.c
which was actually designed for a different goal, and was not capable of
good distribution. We follow a different three-cases approach optimized
for different ratiion between sets and requested number of elements.
The implementation is simpler and allowed the removal of a large amount
of code.
Severan problems are addressed but still a few missing.
Since replication of this command was more complex than others since it
needs to replicate multiple SREM commands, an old API able to do this
was reused (it was taken inside the implementation since it was pretty
obvious soon or later that would be useful). The API was improved a bit
so that now a command may opt-out for the standard command replication
when the server.dirty counter is incremented, in order to "manually"
replicate what it wants.
--stat mode already used to reconnect automatically if the server is no
longer available. This is useful since this is an interactive mode used
for debugging, however the same applies to --latency and --latency-dist
modes, so now both use the reconnecting command execution as well.
The reconnection code was modified to use basic VT100 escape sequences
in order to play better with different kinds of output on the screen
when the reconnection happens, and to hide the reconnection attempt
output when finally the reconnection happens.
So far not able to find a color palette within the 256 colors which is
not confusing. However I believe it is a possible task, so will try
better later.
This also makes it backward compatible in the usage, but for the command
name. However the old command name was less obvious so it is worth to
break it probably.
With the new setup the program main can perform argument parsing and
everything else useful for an RDB check regardless of the Redis server
itself.
When trying to debug sentinel connections or max connections errors it
would be very useful to have the ability to see the list of connected
clients to a running sentinel. At the same time it would be very helpful
to be able to name each sentinel connection or kill offending clients.
This commits adds the already defined CLIENT commands back to Redis
Sentinel.
This improves PFAIL -> FAIL switch. Too late at this point in the RC
releases to add proper PFAIL/FAIL separate dictionary to do this in a
less randomized way. Tested in practice with experiments that this
helps. PFAIL -> FAIL average with 20 nodes and node-timeout set to 5
seconds takes 2.5 seconds without this commit, 1 second with this
commit.
Otherwise it is impossible to receive the majority of failure reports in
the node_timeout*2 window in larger clusters.
Still with a 200 nodes cluster, 20 gossip sections are a very reasonable
amount of bytes to send.
A side effect of this change is also fater cluster nodes joins for large
clusters, because the cluster layout makes less time to propagate.
Previouly if we loaded a corrupt RDB, Redis printed an error report
with a big "REPORT ON GITHUB" message at the bottom. But, we know
RDB load failures are corrupt data, not corrupt code.
Now when RDB failure is detected (duplicate keys or unknown data
types in the file), we run check-rdb against the RDB then exit. The
automatic check-rdb hopefully gives the user instant feedback
about what is wrong instead of providing a mysterious stack
trace.
redis-check-rdb (previously redis-check-dump) had every RDB define
copy/pasted from rdb.h and some defines copied from redis.h. Since
the initial copy, some constants had changed in Redis headers and
check-dump was using incorrect values.
Since check-rdb is now a mode of Redis, the old check-dump code
is cleaned up to:
- replace all printf with redisLog (and remove \n from all strings)
- remove all copy/pasted defines to use defines from rdb.h and redis.h
- replace all malloc/free with zmalloc/zfree
- remove unnecessary include headers
redis-check-dump is now named redis-check-rdb and it runs
as a mode of redis-server instead of an independent binary.
You can now use 'redis-server redis.conf --check-rdb' to check
the RDB defined in redis.conf. Using argument --check-rdb
checks the RDB and exits. We could potentially also allow
the server to continue starting if the RDB check succeeds.
This change also enables us to use RDB checking programatically
from inside Redis for certain failure conditions.
Otherwise we risk sending not initialized data to other nodes, that may
contain anything. This was actually not possible only because the
initialization of the buffer where the cluster packets header is created
was larger than the 3 gossip sections we use, so the memory was already
all filled with zeroes by the memset().
On Darwin /dev/urandom depletes terribly fast. This is not an issue
normally, but with Redis Cluster we generate a lot of unique IDs, for
example during nodes handshakes. Our IDs need just to be unique without
other strong crypto requirements, so this commit turns the function into
something that gets a 20 bytes seed from /dev/urandom, and produces the
rest of the output just using SHA1 in counter mode.
Fixes valgrind error:
48 bytes in 1 blocks are definitely lost in loss record 196 of 373
at 0x4910D3: je_malloc (jemalloc.c:944)
by 0x42807D: zmalloc (zmalloc.c:125)
by 0x41FA0D: dictGetIterator (dict.c:543)
by 0x41FA48: dictGetSafeIterator (dict.c:555)
by 0x459B73: clusterHandleSlaveMigration (cluster.c:2776)
by 0x45BF27: clusterCron (cluster.c:3123)
by 0x423344: serverCron (redis.c:1239)
by 0x41D6CD: aeProcessEvents (ae.c:311)
by 0x41D8EA: aeMain (ae.c:455)
by 0x41A84B: main (redis.c:3832)
If array has N elements, we can't read +1 if we are already at N.
Also, we need to move elements by their storage size in the array,
not just by individual bytes.
[maybe] Fixes valgrind errors:
32 bytes in 4 blocks are definitely lost in loss record 107 of 228
at 0x80EA447: je_malloc (jemalloc.c:944)
by 0x806E59C: zrealloc (zmalloc.c:125)
by 0x80A9AFC: clusterSetMaster (cluster.c:801)
by 0x80AEDC9: clusterCommand (cluster.c:3994)
by 0x80682A5: call (redis.c:2049)
by 0x8068A20: processCommand (redis.c:2309)
by 0x8076497: processInputBuffer (networking.c:1143)
by 0x8073BAF: readQueryFromClient (networking.c:1208)
by 0x8060E98: aeProcessEvents (ae.c:412)
by 0x806123B: aeMain (ae.c:455)
by 0x806C3DB: main (redis.c:3832)
64 bytes in 8 blocks are definitely lost in loss record 143 of 228
at 0x80EA447: je_malloc (jemalloc.c:944)
by 0x806E59C: zrealloc (zmalloc.c:125)
by 0x80AAB40: clusterProcessPacket (cluster.c:801)
by 0x80A847F: clusterReadHandler (cluster.c:1975)
by 0x30000FF: ???
80 bytes in 10 blocks are definitely lost in loss record 148 of 228
at 0x80EA447: je_malloc (jemalloc.c:944)
by 0x806E59C: zrealloc (zmalloc.c:125)
by 0x80AAB40: clusterProcessPacket (cluster.c:801)
by 0x80A847F: clusterReadHandler (cluster.c:1975)
by 0x2FFFFFF: ???
Fixes valgrind error:
Syscall param write(buf) points to uninitialised byte(s)
at 0x514C35D: ??? (syscall-template.S:81)
by 0x456B81: clusterWriteHandler (cluster.c:1907)
by 0x41D596: aeProcessEvents (ae.c:416)
by 0x41D8EA: aeMain (ae.c:455)
by 0x41A84B: main (redis.c:3832)
Address 0x5f268e2 is 2,274 bytes inside a block of size 8,192 alloc'd
at 0x4932D1: je_realloc (jemalloc.c:1297)
by 0x428185: zrealloc (zmalloc.c:162)
by 0x4269E0: sdsMakeRoomFor.part.0 (sds.c:142)
by 0x426CD7: sdscatlen (sds.c:251)
by 0x4579E7: clusterSendMessage (cluster.c:1995)
by 0x45805A: clusterSendPing (cluster.c:2140)
by 0x45BB03: clusterCron (cluster.c:2944)
by 0x423344: serverCron (redis.c:1239)
by 0x41D6CD: aeProcessEvents (ae.c:311)
by 0x41D8EA: aeMain (ae.c:455)
by 0x41A84B: main (redis.c:3832)
Uninitialised value was created by a stack allocation
at 0x457810: nodeUpdateAddressIfNeeded (cluster.c:1236)
The cleanup code expects that if 'di' is not NULL, it is a valid
iterator that should be freed.
The result of this bug was a crash of the AOF rewriting process if an
error occurred after the DBs data are written and the iterator is no
longer valid.
Rationale is that when re-entering, it is likely due to Lua debugging
hooks. Returning an error will be ignored in most cases, going totally
unnoticed. With the log at least we leave a trace.
Related to issue #2302.
read() and write() return ssize_t (signed long), not int.
For other offsets, we can use the unsigned size_t type instead
of a signed offset (since our replication offsets and buffer
positions are never negative).
It's possible large objects could be larger than 'int', so let's
upgrade all size counters to ssize_t.
This also fixes rdbSaveObject serialized bytes calculation.
Since entire serializations of data structures can be large,
so we don't want to limit their calculated size to a 32 bit signed max.
This commit increases object size calculation and
cascades the change back up to serializedlength printing.
Before:
127.0.0.1:6379> debug object hihihi
... encoding:quicklist serializedlength:-2147483559 ...
After:
127.0.0.1:6379> debug object hihihi
... encoding:quicklist serializedlength:2147483737 ...
In order to avoid that misconfigured cluster nodes at some time may
force an IP update on other nodes, it is required that nodes update
their own address only on MEET messages. However it does not make sense
to do this the first time a node is contacted and yet does not have an
IP, we just risk that myself->ip remains not assigned if there are
messages lost or cluster creation procedures that don't make sure
everybody is targeted by at least one incoming MEET message.
Also fix the logging of the IP switch avoiding the :-1 tail.
Also explicitly set version to 0, add a protocol version define, improve
comments in the gossip structure.
Note that the structure layout is the same after the change, we are just
making the padding explicit with an additional not used 16 bits field.
So this commit is still able to talk with the previous versions of
cluster nodes.
Valgrind checks that the buffers we transfer via syscalls are all
composed of bytes actually initialized. This is useful, it makes we able
to avoid leaking informations in non initialized parts fo messages
transferred to other hosts. This commit fixes one of such issues.