1. Master replication offset was cleared after switching configuration
to some other slave, since it was assumed you can't PSYNC after a
switch. This is no longer the case, and when we successfully PSYNC we need
to leave our offset untouched.
2. Secondary replication ID was not reset to "000..." pattern at
startup.
3. Master in error state replying -LOADING or other transient errors
forced the slave to discard the cached master and full resync. This is
now fixed.
4. Better logging of what's happening on failed PSYNCs.
This means that stopping a slave and restarting it will still make it
able to PSYNC with the master. Moreover the master itself will retain
its ID/offset, in case it gets turned into a slave, or in case a slave
tries to PSYNC with it using an exactly up-to-date offset (otherwise there
is no backlog).
This change was possible thanks to PSYNC v2 that makes saving the current
replication state much simpler.
The gist of the changes is that partial resynchronizations between
slaves and masters (without the need for a full resync with RDB transfer
and so forth) now work in a number of cases where they were impossible
in the past. For instance:
1. When a slave is promoted to master, the slaves of the old master can
partially resynchronize with the new master.
2. Chained slaves (slaves of slaves) can be moved to replicate to other
slaves or the master itself, without requiring a full resync.
3. The master itself, after being turned into a slave, is able to
partially resynchronize with the new master, when it joins replication
again.
In order to obtain this, the following main changes were made:
* Slaves also take a replication backlog, not just masters.
* Same stream replication for all the slaves and sub slaves. The
replication stream is identical from the top level master to its slaves
and is also the same from the slaves to their sub-slaves and so forth.
This means that if a slave is later promoted to master, it has the
same replication backlog, and can partially resynchronize with its
slaves (that were previously slaves of the old master).
* A given replication history is no longer identified by the `runid` of
a Redis node. There is instead a `replication ID` which changes every
time the instance has a new history no longer coherent with the past
one. So, for example, slaves publish the same replication history of
their master, however when they are turned into masters, they publish
a new replication ID, but still remember the old ID, so that they are
able to partially resynchronize with slaves of the old master (up to a
given offset).
* The replication protocol was slightly modified so that a new extended
+CONTINUE reply from the master is able to inform the slave of a
replication ID change.
* REPLCONF CAPA is used in order to notify masters that a slave is able
to understand the new +CONTINUE reply.
* The RDB file was extended with an auxiliary field that is able to
select a given DB after loading in the slave, so that the slave can
continue receiving the replication stream from the point it was
disconnected without requiring the master to insert "SELECT" statements.
This is useful in order to guarantee the "same stream" property, because
the slave must be able to accumulate an identical backlog.
* Slave pings to sub-slaves are now sent in a special form when the
top-level master is disconnected, in order not to interfere with the
replication stream. We just use out of band "\n" bytes as in other parts
of the Redis protocol.
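The replication ID rules above can be sketched in C as follows. This is
only a minimal illustration and not the actual implementation: the
`replStateSketch` structure, its field names and `canPartialResync()` are
hypothetical, and the backlog is reduced to a start offset and a length.

```c
#include <assert.h>
#include <string.h>

#define REPLID_LEN 40

/* Hypothetical, simplified replication state of an instance. */
typedef struct replStateSketch {
    char replid[REPLID_LEN+1];    /* Current replication ID. */
    char replid2[REPLID_LEN+1];   /* ID inherited from the old master. */
    long long replid2_valid_upto; /* replid2 is valid up to this offset. */
    long long backlog_start;      /* First offset still in the backlog. */
    long long backlog_len;        /* Bytes currently held in the backlog. */
} replStateSketch;

/* Return 1 if a slave asking to continue from (replid, offset) can be
 * served with a partial resynchronization, 0 if a full resync is needed. */
int canPartialResync(const replStateSketch *s, const char *replid,
                     long long offset)
{
    assert(s != NULL && replid != NULL);
    /* The slave must share our current history, or the history we had
     * before being promoted, but only up to the offset where we switched. */
    int same_history = strcmp(replid, s->replid) == 0;
    int old_history = strcmp(replid, s->replid2) == 0 &&
                      offset <= s->replid2_valid_upto;
    if (!same_history && !old_history) return 0;
    /* The requested offset must still be covered by the backlog. */
    if (offset < s->backlog_start ||
        offset > s->backlog_start + s->backlog_len) return 0;
    return 1;
}
```

A promoted slave would set `replid2` to the ID of its old master and
`replid2_valid_upto` to its offset at promotion time, so that slaves of the
old master are accepted only for the part of the history that is shared.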
An old design document is available here:
https://gist.github.com/antirez/ae068f95c0d084891305
However the implementation is not identical to the description because,
during the work to implement it, different changes were needed in order
to make things work well.
Redis fails to compile on MacOS 10.8.5 with Clang 4, version 421.0.57
(based on LLVM 3.1svn).
When compiling zmalloc.c, we get these warnings:
CC zmalloc.o
zmalloc.c:109:5: warning: implicit declaration of function '__atomic_add_fetch' is invalid in C99 [-Wimplicit-function-declaration]
update_zmalloc_stat_alloc(zmalloc_size(ptr));
^
zmalloc.c:75:9: note: expanded from macro 'update_zmalloc_stat_alloc'
atomicIncr(used_memory,__n,used_memory_mutex); \
^
./atomicvar.h:57:37: note: expanded from macro 'atomicIncr'
#define atomicIncr(var,count,mutex) __atomic_add_fetch(&var,(count),__ATOMIC_RELAXED)
^
zmalloc.c:145:5: warning: implicit declaration of function '__atomic_sub_fetch' is invalid in C99 [-Wimplicit-function-declaration]
update_zmalloc_stat_free(oldsize);
^
zmalloc.c:85:9: note: expanded from macro 'update_zmalloc_stat_free'
atomicDecr(used_memory,__n,used_memory_mutex); \
^
./atomicvar.h:58:37: note: expanded from macro 'atomicDecr'
#define atomicDecr(var,count,mutex) __atomic_sub_fetch(&var,(count),__ATOMIC_RELAXED)
^
zmalloc.c:205:9: warning: implicit declaration of function '__atomic_load_n' is invalid in C99 [-Wimplicit-function-declaration]
atomicGet(used_memory,um,used_memory_mutex);
^
./atomicvar.h:60:14: note: expanded from macro 'atomicGet'
dstvar = __atomic_load_n(&var,__ATOMIC_RELAXED); \
^
3 warnings generated.
Also on lazyfree.c:
CC lazyfree.o
lazyfree.c:68:13: warning: implicit declaration of function '__atomic_add_fetch' is invalid in C99 [-Wimplicit-function-declaration]
atomicIncr(lazyfree_objects,1,lazyfree_objects_mutex);
^
./atomicvar.h:57:37: note: expanded from macro 'atomicIncr'
#define atomicIncr(var,count,mutex) __atomic_add_fetch(&var,(count),__ATOMIC_RELAXED)
^
lazyfree.c:111:5: warning: implicit declaration of function '__atomic_sub_fetch' is invalid in C99 [-Wimplicit-function-declaration]
atomicDecr(lazyfree_objects,1,lazyfree_objects_mutex);
^
./atomicvar.h:58:37: note: expanded from macro 'atomicDecr'
#define atomicDecr(var,count,mutex) __atomic_sub_fetch(&var,(count),__ATOMIC_RELAXED)
^
2 warnings generated.
Then in the linking stage:
LINK redis-server
Undefined symbols for architecture x86_64:
"___atomic_add_fetch", referenced from:
_zmalloc in zmalloc.o
_zcalloc in zmalloc.o
_zrealloc in zmalloc.o
_dbAsyncDelete in lazyfree.o
_emptyDbAsync in lazyfree.o
_slotToKeyFlushAsync in lazyfree.o
"___atomic_load_n", referenced from:
_zmalloc_used_memory in zmalloc.o
_zmalloc_get_fragmentation_ratio in zmalloc.o
"___atomic_sub_fetch", referenced from:
_zrealloc in zmalloc.o
_zfree in zmalloc.o
_lazyfreeFreeObjectFromBioThread in lazyfree.o
_lazyfreeFreeDatabaseFromBioThread in lazyfree.o
_lazyfreeFreeSlotsMapFromBioThread in lazyfree.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[1]: *** [redis-server] Error 1
make: *** [all] Error 2
With this patch, the compilation is successful, with no warnings.
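A possible way to avoid this class of failure is to use the `__atomic`
builtins only when the compiler really provides them, and to fall back to a
mutex otherwise. The following is just a sketch under assumptions, not
necessarily what the patch does: the `counterIncr` / `counterDecr` /
`counterGet` names are hypothetical, and the Apple Clang build number used
in the check is assumed from the compiler version quoted above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* ASSUMPTION: Apple Clang builds up to 421.0.57 define __ATOMIC_RELAXED
 * but do not implement the __atomic builtins, so they are excluded here. */
#if defined(__ATOMIC_RELAXED) && \
    (!defined(__clang__) || !defined(__APPLE__) || \
     __apple_build_version__ > 4210057)
/* Builtins available: the mutex argument is accepted but not used. */
#define counterIncr(var,count,mutex) do { \
    (void)&(mutex); \
    __atomic_add_fetch(&(var),(count),__ATOMIC_RELAXED); \
} while(0)
#define counterDecr(var,count,mutex) do { \
    (void)&(mutex); \
    __atomic_sub_fetch(&(var),(count),__ATOMIC_RELAXED); \
} while(0)
#define counterGet(var,dstvar,mutex) do { \
    (void)&(mutex); \
    (dstvar) = __atomic_load_n(&(var),__ATOMIC_RELAXED); \
} while(0)
#else
/* Fallback: protect every access with the mutex. */
#define counterIncr(var,count,mutex) do { \
    pthread_mutex_lock(&(mutex)); \
    (var) += (count); \
    pthread_mutex_unlock(&(mutex)); \
} while(0)
#define counterDecr(var,count,mutex) do { \
    pthread_mutex_lock(&(mutex)); \
    (var) -= (count); \
    pthread_mutex_unlock(&(mutex)); \
} while(0)
#define counterGet(var,dstvar,mutex) do { \
    pthread_mutex_lock(&(mutex)); \
    (dstvar) = (var); \
    pthread_mutex_unlock(&(mutex)); \
} while(0)
#endif

/* Usage example: both branches must give the same result. */
size_t counterExample(void) {
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    size_t used_memory = 0, snapshot = 0;
    counterIncr(used_memory,10,mutex);
    counterDecr(used_memory,3,mutex);
    counterGet(used_memory,snapshot,mutex);
    assert(snapshot == 7);
    return snapshot;
}
```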
Running `make test` we get an almost clean bill of health. Tests pass with
one exception:
[err]: Check for memory leaks (pid 52793) in tests/unit/dump.tcl
[err]: Check for memory leaks (pid 53103) in tests/unit/auth.tcl
[err]: Check for memory leaks (pid 53117) in tests/unit/auth.tcl
[err]: Check for memory leaks (pid 53131) in tests/unit/protocol.tcl
[err]: Check for memory leaks (pid 53145) in tests/unit/protocol.tcl
[ok]: Check for memory leaks (pid 53160)
[err]: Check for memory leaks (pid 53175) in tests/unit/scan.tcl
[ok]: Check for memory leaks (pid 53189)
[err]: Check for memory leaks (pid 53221) in tests/unit/type/incr.tcl
.
.
.
Full debug log (289MB, uncompressed) available at
https://dl.dropboxusercontent.com/u/75548/logs/redis-debug-log-macos-10.8.5.log.xz
Most if not all of the memory leak tests fail. Not sure if this is
related. They are the only ones that fail. I believe they are not related,
and that the memory leak detector is just not working properly on 10.8.5.
Signed-off-by: Pedro Melo <melo@simplicidade.org>
This new command swaps two Redis databases, so that immediately all the
clients connected to a given DB will see the data of the other DB, and
the other way around. Example:
SWAPDB 0 1
This will swap DB 0 with DB 1. All the clients connected with DB 0 will
immediately see the new data, exactly like all the clients connected
with DB 1 will see the data that was formerly of DB 0.
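The effect described above can be sketched as follows, assuming that each
client holds a pointer to a DB slot and that the swap exchanges the content
of the two slots while leaving their IDs alone. The `redisDbSketch` type
and the `swapDatabasesSketch()` function are hypothetical and much simpler
than the real structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal DB slot: clients point to a slot, not to the data. */
typedef struct redisDbSketch {
    void *keys;     /* The keyspace of this DB. */
    void *expires;  /* Timeouts of the keys with an expire set. */
    int id;         /* DB ID: it stays attached to the slot. */
} redisDbSketch;

/* Swap the datasets of DB id1 and id2. Returns 0 on success, -1 if one of
 * the IDs is out of range. */
int swapDatabasesSketch(redisDbSketch *dbs, int dbnum, int id1, int id2) {
    assert(dbs != NULL);
    if (id1 < 0 || id1 >= dbnum || id2 < 0 || id2 >= dbnum) return -1;
    if (id1 == id2) return 0;
    redisDbSketch tmp = dbs[id1];
    /* Exchange only the data: clients selecting id1 keep pointing to the
     * same slot, and now see what was formerly in id2. */
    dbs[id1].keys = dbs[id2].keys;
    dbs[id1].expires = dbs[id2].expires;
    dbs[id2].keys = tmp.keys;
    dbs[id2].expires = tmp.expires;
    return 0;
}
```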
MOTIVATION AND HISTORY
---
The command was recently requested by Pedro Melo, but it had been suggested
multiple times in the past, and I always refused it.
The reason why it was asked: Imagine you have clients operating in DB 0.
At the same time, you create a new version of the dataset in DB 1.
When the new version of the dataset is available, you immediately want
to swap the two views, so that the clients will transparently use the
new version of the data. At the same time you'll likely destroy the
DB 1 dataset (that contains the old data) and start to build a new
version, to repeat the process.
This is an interesting pattern, but the reason why I always opposed
implementing it was that FLUSHDB was a blocking command in Redis before
the Redis 4.0 improvements. Now we have FLUSHDB ASYNC, which releases the
old data in O(1) from the point of view of the client, and reclaims memory
incrementally in a different thread.
At this point, the pattern can really be supported without latency
spikes, so I'm providing this implementation for the users to comment.
If a very compelling argument is made against this new command,
it may be removed.
BEHAVIOR WITH BLOCKING OPERATIONS
---
If a client is blocked on a list in a given DB, after the swap it will
still be blocked in the same DB ID, since this is the most logical thing
to do: if I was blocked waiting for a push to list "foo", even after the
swap I still want an LPUSH to reach the key "foo" in the same DB in order
to unblock.
However an interesting thing happens when a client is, for instance,
blocked waiting for new elements in list "foo" of DB 0. Then the DB
0 and 1 are swapped with SWAPDB. However the DB 1 happened to have
a list called "foo" containing elements. When this happens, this
implementation can correctly unblock the client.
It is possible that there are subtle corner cases that are not covered
in the implementation, but since the command is self-contained from the
POV of the implementation and the Redis core, it cannot cause anything
bad if not used.
Tests and documentation are yet to be provided.
It was noted by @dvirsky that it is not possible to use string functions
when writing the AOF file. This is sometimes critical, since the rewritten
command may need to be built inside the AOF callback. Without access to
the context, and given the limited types that the AOF production
functions accept, this can be an issue.
Moreover there are other needs that we can't anticipate regarding the
ability to use Redis Modules APIs using the context in order to build
representations to emit AOF / RDB.
Because of this a new API was added that allows the user to get a
temporary context from the IO context. If obtained, the context is
automatically released when the RDB / AOF callback returns.
Calling the function multiple times always returns the same context,
since it is invalid to have more than a single context.
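The lifetime rules of the temporary context can be sketched as below. This
is only a model of the behavior described above: the `sketchIO` and
`sketchCtx` types and both functions are hypothetical stand-ins, not the
module API itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the module context and the IO context. */
typedef struct sketchCtx { int flags; } sketchCtx;
typedef struct sketchIO { sketchCtx *ctx; } sketchIO;

/* Return the temporary context of this IO context, creating it the first
 * time it is requested. Later calls return the very same context. */
sketchCtx *sketchGetContextFromIO(sketchIO *io) {
    assert(io != NULL);
    if (io->ctx) return io->ctx;
    io->ctx = calloc(1, sizeof(*io->ctx));
    return io->ctx;
}

/* Called by the core when the RDB / AOF callback returns: the context is
 * released only if the callback actually asked for it. */
void sketchIOCallbackDone(sketchIO *io) {
    assert(io != NULL);
    if (io->ctx) {
        free(io->ctx);
        io->ctx = NULL;
    }
}
```

With this shape the callback never has to free the context itself, and
requesting it more than once cannot create a second one.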
This commit fixes a vulnerability reported by Cory Duplantis
of Cisco Talos, see TALOS-2016-0206 for reference.
CONFIG SET client-output-buffer-limit accepts "master" as client class,
even though this class is actually only used to implement CLIENT KILL. The
"master" class has ID 3. What happens is that the global structure:
server.client_obuf_limits[class]
is accessed with class = 3. However it is a 3-element array, so writing
the 4th element means writing up to 24 bytes of memory *after* the end
of the array, since the structure is defined as:
typedef struct clientBufferLimitsConfig {
unsigned long long hard_limit_bytes;
unsigned long long soft_limit_bytes;
time_t soft_limit_seconds;
} clientBufferLimitsConfig;
EVALUATION OF IMPACT:
Checking what's past the boundaries of the array in the global
'server' structure, we find AOF state fields:
clientBufferLimitsConfig client_obuf_limits[CLIENT_TYPE_OBUF_COUNT];
/* AOF persistence */
int aof_state; /* AOF_(ON|OFF|WAIT_REWRITE) */
int aof_fsync; /* Kind of fsync() policy */
char *aof_filename; /* Name of the AOF file */
int aof_no_fsync_on_rewrite; /* Don't fsync if a rewrite is in prog. */
int aof_rewrite_perc; /* Rewrite AOF if % growth is > M and... */
off_t aof_rewrite_min_size; /* the AOF file is at least N bytes. */
off_t aof_rewrite_base_size; /* AOF size on latest startup or rewrite. */
off_t aof_current_size; /* AOF current size. */
Writing to most of these fields should be harmless and only cause problems in
Redis persistence that should not escalate to security problems.
However unfortunately writing to "aof_filename" could be potentially a
security issue depending on the access pattern.
Searching for "aof.filename" accesses in the source code returns many different
usages of the field, including using it as input for open(), logging to the
Redis log file or syslog, and calling the rename() syscall.
It looks possible that attacks could lead at least to information
disclosure of the state and data inside Redis. However note that the
attacker must already have access to the server. But, worse than that,
it looks possible that being able to change the AOF filename can be used
to mount more powerful attacks: like overwriting random files with AOF
data (easily a potential security issue as demonstrated here:
http://antirez.com/news/96), or even more subtle attacks where the
AOF filename is changed to a path where a malicious AOF file is loaded
in order to exploit other potential issues when the AOF parser is fed
with untrusted input (no such issue is currently known).
The fix checks the places where the 'master' class is specified in
order to access configuration data structures, and returns an error in
these cases.
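The kind of check involved can be sketched as follows. This is not the
actual fix: `obufLimitClassFromName()` is a hypothetical helper, the class
names other than "master" and their IDs are assumed, and only
CLIENT_TYPE_OBUF_COUNT and the ID 3 of the "master" class come from the
description above.

```c
#include <assert.h>
#include <string.h>

/* ASSUMPTION: IDs of the classes other than "master" are illustrative. */
#define CLIENT_TYPE_NORMAL 0
#define CLIENT_TYPE_SLAVE 1
#define CLIENT_TYPE_PUBSUB 2
#define CLIENT_TYPE_MASTER 3
#define CLIENT_TYPE_OBUF_COUNT 3 /* Classes having output buffer limits. */

/* Map a class name to its ID, or -1 if the name is unknown. */
static int clientTypeByNameSketch(const char *name) {
    if (!strcmp(name,"normal")) return CLIENT_TYPE_NORMAL;
    if (!strcmp(name,"slave")) return CLIENT_TYPE_SLAVE;
    if (!strcmp(name,"pubsub")) return CLIENT_TYPE_PUBSUB;
    if (!strcmp(name,"master")) return CLIENT_TYPE_MASTER;
    return -1;
}

/* Return an index that is safe to use with the output buffer limits
 * array, or -1 if the class has no slot there (unknown or "master"). */
int obufLimitClassFromName(const char *name) {
    assert(name != NULL);
    int type = clientTypeByNameSketch(name);
    if (type < 0 || type >= CLIENT_TYPE_OBUF_COUNT) return -1;
    return type;
}
```

The caller replies with an error when -1 is returned, so the array is
never indexed with the ID of the "master" class.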
WHO IS AT RISK?
The "master" client class was introduced in Redis on Jul 28 2015.
Every Redis release before this date is not vulnerable,
while all the releases after this date are. Notably:
Redis 3.0.x is NOT vulnerable.
Redis 3.2.x IS vulnerable.
Redis unstable is vulnerable.
In order for the instance to be at risk, at least one of the following
conditions must be true:
1. The attacker can access Redis remotely and is able to send
the CONFIG SET command (often banned in managed Redis instances).
2. The attacker is able to control the "redis.conf" file and
can wait or trigger a server restart.
The problem was fixed on 26th September 2016 in all the affected releases.
Recently we moved the "return ASAP" condition for the Delete() function
from checking .size to checking .used, which is smarter. However, while
testing the first table alone is always enough to ensure the dict is
totally empty when we test the .size field, testing .used requires testing
both T0 and T1, since a rehashing could be in progress.
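A minimal sketch of the corrected condition, using hypothetical and
simplified `tableSketch` / `dictSketch` types that keep only the fields
needed for the check:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified hash table and dict: T0 is ht[0], T1 is ht[1]. */
typedef struct tableSketch {
    unsigned long size; /* Number of buckets allocated. */
    unsigned long used; /* Number of elements stored. */
} tableSketch;

typedef struct dictSketch {
    tableSketch ht[2];
    long rehashidx;     /* -1 if no rehashing is in progress. */
} dictSketch;

/* Return 1 if the dict holds no elements. During a rehashing elements may
 * live in either table, so both .used fields must be tested. */
int dictIsEmptySketch(const dictSketch *d) {
    assert(d != NULL);
    return d->ht[0].used == 0 && d->ht[1].used == 0;
}
```

During a rehashing T0 may already be drained while T1 still holds
elements, which is why testing T0 alone is not enough.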