Commit Graph

6196 Commits

Author SHA1 Message Date
antirez
ece658713b Modules TSC: Improve inter-thread synchronization.
More work to do with server.unixtime and similar. Need to write Helgrind
suppression file in order to suppress the valse positives.
2017-05-09 11:57:09 +02:00
antirez
2a51bac44e Simplify atomicvar.h usage by having the mutex name implicit. 2017-05-04 17:01:00 +02:00
antirez
52bc74f221 Lazyfree: fix lazyfreeGetPendingObjectsCount() race reading counter. 2017-05-04 10:35:40 +02:00
antirez
7d9326b1f3 Modules TSC: HELLO.KEYS reply format fixed. 2017-05-03 23:43:49 +02:00
antirez
9b01b64430 Modules TSC: put the client in the pending write list. 2017-05-03 14:54:48 +02:00
antirez
e67fb915eb adlist: fix final list count in listJoin(). 2017-05-03 14:54:14 +02:00
antirez
79226cb9fa adlist: fix listJoin() to handle empty lists. 2017-05-03 14:15:25 +02:00
antirez
6798736909 Modules: remove unused var in example module. 2017-05-03 14:10:21 +02:00
antirez
1ed2ff5570 Modules TSC: HELLO.KEYS example draft finished. 2017-05-03 14:08:12 +02:00
antirez
7127f15ebe Module: fix RedisModule_Call() "l" specifier to create a raw string. 2017-05-03 14:07:10 +02:00
antirez
3fcf959e60 Modules TSC: Release the GIL for all the time we are blocked.
Instead of giving the module background operations just a small time to
run in the beforeSleep() function, we can have the lock released for all
the time we are blocked in the multiplexing syscall.
2017-05-03 11:26:21 +02:00
antirez
ba4a5a3255 Modules TSC: Export symbols of the new API. 2017-05-02 15:19:28 +02:00
antirez
275905b328 Modules TSC: Handling of RM_Reply* functions. 2017-05-02 15:05:39 +02:00
antirez
9c500b89fb Modules TSC: Basic TS context creeation and handling. 2017-05-02 12:53:10 +02:00
antirez
59b06b14c9 Modules TSC: GIL and cooperative multi tasking setup. 2017-04-28 18:41:10 +02:00
antirez
c180bc7d98 Regression test for PSYNC2 issue #3899 added.
Experimentally verified that it can trigger the issue reverting the fix.
At least on my system... Being the bug time/backlog dependant, it is
very hard to tell if this test will be able to trigger the problem
consistently, however even if it triggers the problem once in a while,
we'll see it in the CI environment at http://ci.redis.io.
2017-04-28 10:37:07 +02:00
antirez
469d6e2b37 PSYNC2: fix master cleanup when caching it.
The master client cleanup was incomplete: resetClient() was missing and
the output buffer of the client was not reset, so pending commands
related to the previous connection could be still sent.

The first problem caused the client argument vector to be, at times,
half populated, so that when the correct replication stream arrived the
protcol got mixed to the arugments creating invalid commands that nobody
called.

Thanks to @yangsiran for also investigating this problem, after
already providing important design / implementation hints for the
original PSYNC2 issues (see referenced Github issue).

Note that this commit adds a new function to the list library of Redis
in order to be able to reset a list without destroying it.

Related to issue #3899.
2017-04-27 17:08:37 +02:00
antirez
c861e1e1ee Defrag: test currently disabled, too many false positives.
Related to #3786.
2017-04-22 15:59:57 +02:00
antirez
a17390853d Defrag: fix test false positive.
Apparently 1.4 is too low compared to what you get in certain setups
(including mine). I raised it to 1.55 that hopefully is still enough to
test that the fragmentation went down from 1.7 but without incurring in
issues, however the test setup may be still fragile so certain times this
may lead to false positives again, it's hard to test for these things
in a determinsitic way.

Related to #3786.
2017-04-22 13:21:41 +02:00
oranagra
0fb5c4ebd8 add test for active defrag 2017-04-22 13:17:09 +02:00
antirez
e3b8492e83 Revert "Jemalloc updated to 4.4.0."
This reverts commit 36c1acc222d29e6e2dc9fc25362e4faa471111bd.
2017-04-22 13:17:07 +02:00
antirez
238cebdd5e Check event loop creation return value. Fix #3951.
Normally we never check for OOM conditions inside Redis since the
allocator will always return a pointer or abort the program on OOM
conditons. However we cannot have control on epool_create(), that may
fail for kernel OOM (according to the manual page) even if all the
parameters are correct, so the function aeCreateEventLoop() may indeed
return NULL and this condition must be checked.
2017-04-21 16:27:38 +02:00
Salvatore Sanfilippo
3773c06d28 Merge pull request #3950 from kensou97/unstable
update block->free after some diff data are written to the child process
2017-04-20 07:55:51 +02:00
antirez
7d9dd80db3 Fix getKeysUsingCommandTable() in cluster mode.
Close #3940.
2017-04-19 16:17:08 +02:00
antirez
189a12afb4 PSYNC2: discard pending transactions from cached master.
During the review of the fix for #3899, @yangsiran identified an
implementation bug: given that the offset is now relative to the applied
part of the replication log, when we cache a master, the successive
PSYNC2 request will be made in order to *include* the transaction that
was not completely processed. This means that we need to discard any
pending transaction from our replication buffer: it will be re-executed.
2017-04-19 14:02:52 +02:00
antirez
22be435efe Fix PSYNC2 incomplete command bug as described in #3899.
This bug was discovered by @kevinmcgehee and constituted a major hidden
bug in the PSYNC2 implementation, caused by the propagation from the
master of incomplete commands to slaves.

The bug had several results:

1. Borrowing from Kevin text in the issue: "Given that slaves blindly
copy over their master's input into their own replication backlog over
successive read syscalls, it's possible that with large commands or
small TCP buffers, partial commands are present in this buffer. If the
master were to fail before successfully propagating the entire command
to a slave, the slaves will never execute the partial command (since the
client is invalidated) but will copy it to replication backlog which may
relay those invalid bytes to its slaves on PSYNC2, corrupting the
backlog and possibly other valid commands that follow the failover.
Simple command boundaries aren't sufficient to capture this, either,
because in the case of a MULTI/EXEC block, if the master successfully
propagates a subset of the commands but not the EXEC, then the
transaction in the backlog becomes corrupt and could corrupt other
slaves that consume this data."

2. As identified by @yangsiran later, there is another effect of the
bug. For the same mechanism of the first problem, a slave having another
slave, could receive a full resynchronization request with an already
half-applied command in the backlog. Once the RDB is ready, it will be
sent to the slave, and the replication will continue sending to the
sub-slave the other half of the command, which is not valid.

The fix, designed by @yangsiran and @antirez, and implemented by
@antirez, uses a secondary buffer in order to feed the sub-masters and
update the replication backlog and offsets, only when a given part of
the query buffer is actually *applied* to the state of the instance,
that is, when the command gets processed and the command is not pending
in the Redis transaction buffer because of CLIENT_MULTI state.

Given that now the backlog and offsets representation are in agreement
with the actual processed commands, both issue 1 and 2 should no longer
be possible.

Thanks to @kevinmcgehee, @yangsiran and @oranagra for their work in
identifying and designing a fix for this problem.
2017-04-19 10:25:45 +02:00
Salvatore Sanfilippo
27fe8e9fb2 Merge pull request #3945 from badboy/dicthash-bench-compile
Reorder to make dict-benchmark compile on Linux
2017-04-18 16:31:18 +02:00
antirez
02d02a3754 Fix #3848 by closing the descriptor on error. 2017-04-18 16:24:06 +02:00
antirez
8b7b4d6734 Merge branch 'unstable' of github.com:/antirez/redis into unstable 2017-04-18 16:15:24 +02:00
antirez
da2f9cd186 Fix descriptor leak. Close #3848. 2017-04-18 16:15:16 +02:00
Salvatore Sanfilippo
332a05dc33 Merge pull request #3856 from viennadd/issue-3847
fix #3847: add close socket before return ANET_ERR.
2017-04-18 16:13:23 +02:00
张文康
5f88bd320e update block->free after some diff data are written to the child process 2017-04-18 20:10:08 +08:00
antirez
c33493277a Clarify why we save ziplist elements in revserse order.
Also get rid of variables that are now kinda redundant, since the
dictionary iterator was removed.

This is related to PR #3949.
2017-04-18 11:01:47 +02:00
Salvatore Sanfilippo
0a942f1751 Merge pull request #3949 from spinlock/unstable-rdb-encoding
rdb: saving skiplist in reversed order to accelerate the deserialisation process
2017-04-18 10:56:57 +02:00
Jan-Erik Rediger
c4ad4765b0 Reorder to make dict-benchmark compile on Linux
Fixes #3944
2017-04-17 13:37:59 +02:00
spinlock
23ec36909e rdb: saving skiplist in reversed order to accelerate the deserialisation process 2017-04-17 13:22:34 +08:00
antirez
271733f4f8 Cluster: discard pong times in the future.
However we allow for 500 milliseconds of tolerance, in order to
avoid often discarding semantically valid info (the node is up)
because of natural few milliseconds desync among servers even when
NTP is used.

Note that anyway we should ping the node from time to time regardless and
discover if it's actually down from our point of view, since no update
is accepted while we have an active ping on the node.

Related to #3929.
2017-04-15 10:12:08 +02:00
antirez
3f068b92b9 Test: fix, hopefully, false PSYNC failure like in issue #2715.
And many other related Github issues... all reporting the same problem.
There was probably just not enough backlog in certain unlucky runs.
I'll ask people that can reporduce if they see now this as fixed as
well.
2017-04-14 17:53:11 +02:00
antirez
02777bb252 Cluster: always add PFAIL nodes at end of gossip section.
To rely on the fact that nodes in PFAIL state will be shared around by
randomly adding them in the gossip section is a weak assumption,
especially after changes related to sending less ping/pong packets.

We want to always include gossip entries for all the nodes that are in
PFAIL state, so that the PFAIL -> FAIL state promotion can happen much
faster and reliably.

Related to #3929.
2017-04-14 13:39:49 +02:00
antirez
8c829d9e43 Cluster: fix gossip section ping/pong times encoding.
The gossip section times are 32 bit, so cannot store the milliseconds
time but just the seconds approximation, which is good enough for our
uses. At the same time however, when comparing the gossip section times
of other nodes with our node's view, we need to convert back to
milliseconds.

Related to #3929. Without this change the patch to reduce the traffic in
the bus message does not work.
2017-04-14 11:01:22 +02:00
antirez
6878a3fedd Cluster: add clean-logs command to create-cluster script. 2017-04-14 10:52:00 +02:00
antirez
8f7bf2841a Cluster: decrease ping/pong traffic by trusting other nodes reports.
Cluster of bigger sizes tend to have a lot of traffic in the cluster bus
just for failure detection: a node will try to get a ping reply from
another node no longer than when the half the node timeout would elapsed,
in order to avoid a false positive.

However this means that if we have N nodes and the node timeout is set
to, for instance M seconds, we'll have to ping N nodes every M/2
seconds. This N*M/2 pings will receive the same number of pongs, so
a total of N*M packets per node. However given that we have a total of N
nodes doing this, the total number of messages will be N*N*M.

In a 100 nodes cluster with a timeout of 60 seconds, this translates
to a total of 100*100*30 packets per second, summing all the packets
exchanged by all the nodes.

This is, as you can guess, a lot... So this patch changes the
implementation in a very simple way in order to trust the reports of
other nodes: if a node A reports a node B as alive at least up to
a given time, we update our view accordingly.

The problem with this approach is that it could result into a subset of
nodes being able to reach a given node X, and preventing others from
detecting that is actually not reachable from the majority of nodes.
So the above algorithm is refined by trusting other nodes only if we do
not have currently a ping pending for the node X, and if there are no
failure reports for that node.

Since each node, anyway, pings 10 other nodes every second (one node
every 100 milliseconds), anyway eventually even trusting the other nodes
reports, we will detect if a given node is down from our POV.

Now to understand the number of packets that the cluster would exchange
for failure detection with the patch, we can start considering the
random PINGs that the cluster sent anyway as base line:
Each node sends 10 packets per second, so the total traffic if no
additioal packets would be sent, including PONG packets, would be:

    Total messages per second = N*10*2

However by trusting other nodes gossip sections will not AWALYS prevent
pinging nodes for the "half timeout reached" rule all the times. The
math involved in computing the actual rate as N and M change is quite
complex and depends also on another parameter, which is the number of
entries in the gossip section of PING and PONG packets. However it is
possible to compare what happens in cluster of different sizes
experimentally. After applying this patch a very important reduction in
the number of packets exchanged is trivial to observe, without apparent
impacts on the failure detection performances.

Actual numbers with different cluster sizes should be published in the
Reids Cluster documentation in the future.

Related to #3929.
2017-04-14 10:43:53 +02:00
antirez
c5d6f577f0 Cluster: collect more specific bus messages stats.
First step in order to change Cluster in order to use less messages.
Related to issue #3929.
2017-04-13 19:22:35 +02:00
antirez
104584b95e Fix typo in feedReplicationBacklog() top comment. 2017-04-12 12:28:05 +02:00
antirez
1210af3804 Add a top comment in crucial functions inside networking.c. 2017-04-12 10:12:27 +02:00
antirez
4a850be4dc Set lua-time-limit default value at safe place.
Otherwise, as it was, it will overwrite whatever the user set.

Close #3703.
2017-04-11 16:56:00 +02:00
antirez
f47607af02 Fix preprocessor if/else chain broken in order to fix #3927. 2017-04-11 16:54:27 +02:00
antirez
74720ea993 Merge branch 'unstable' of github.com:/antirez/redis into unstable 2017-04-11 16:45:49 +02:00
antirez
aa5b4be02e Fix zmalloc_get_memory_size() ifdefs to actually use the else branch.
Close #3927.
2017-04-11 16:45:11 +02:00
Salvatore Sanfilippo
69ce5c5d10 Merge pull request #3924 from lorneli/unstable
Expire: Update comment of activeExpireCycle function
2017-04-11 16:31:55 +02:00