- Almost all Cluster Manager related code moved to the same section.
- Many macros converted to functions.
- Added various comments.
- Minor code restyling.
This is useful in the reply and timeout callbacks, if the module wants to
do some cleanup of the blocked client handle that may be stored around
in the module-private data structures.
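A minimal usage sketch (assuming the handle is exposed to the callbacks via a RedisModule_GetBlockedClientHandle() call, and with my_registry_remove() as a hypothetical module-private helper):
```
#include "redismodule.h"

/* Hypothetical module-private cleanup: forget a stored handle. */
static void my_registry_remove(RedisModuleBlockedClient *bc) {
    (void)bc; /* ... remove the handle from module-private structures ... */
}

/* Reply callback of a blocking command: the blocked client is being
 * served, so drop the handle we stored when the client was blocked. */
int MyBlock_Reply(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
    REDISMODULE_NOT_USED(argv);
    REDISMODULE_NOT_USED(argc);
    RedisModuleBlockedClient *bc = RedisModule_GetBlockedClientHandle(ctx);
    my_registry_remove(bc);
    return RedisModule_ReplyWithSimpleString(ctx, "OK");
}
```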
In some modules it may be useful to have an idea of whether the server is
near OOM. In any case, an explicit call to get the fill ratio will be
added in the future.
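A sketch of how a module might use this, assuming the information is exposed through RedisModule_GetContextFlags() as a REDISMODULE_CTX_FLAGS_OOM bit (treat this as an illustration, not the definitive API usage):
```
#include "redismodule.h"

/* A write-style module command that refuses to proceed when the server
 * reports it is near the maxmemory limit. */
int MyAdd_Command(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
    REDISMODULE_NOT_USED(argv);
    REDISMODULE_NOT_USED(argc);
    if (RedisModule_GetContextFlags(ctx) & REDISMODULE_CTX_FLAGS_OOM)
        return RedisModule_ReplyWithError(ctx,
            "OOM command not allowed when used memory > 'maxmemory'");
    /* ... perform the write here ... */
    return RedisModule_ReplyWithSimpleString(ctx, "OK");
}
```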
This way it is possible to use conditional compilation to be compatible
with a larger number of Redis versions; however note that this breaks
binary compatibility, so the module must be compiled with the
corresponding redismodule.h file, depending on the version of Redis
targeted.
Note that this was an experimental API that can only be enabled with
REDISMODULE_EXPERIMENTAL_API, so it is subject to change until it is
promoted to a stable API. Sorry for the breakage, it is trivial to
resolve btw. This change will not be back ported to Redis 4.0.
While this feature is not used by Redis, ae.c implements the ability for
a timer to call a finalizer callback when a timer event is deleted.
This feature was buggy since the start, and because it was never used
we never noticed a problem. However Anthony LaTorre was using the same
library in order to implement a different system: he found a bug that he
describes as follows, and which he fixed with the patch in this commit,
sent to me by private email:
--- Anthony email ---
I've found one bug in the current implementation of the timed events.
It's possible to lose track of a timed event if an event is added in
the finalizerProc of another event.
For example, suppose you start off with three timed events 1, 2, and
3. Then the linked list looks like:
3 -> 2 -> 1
Then, you run processTimeEvents and events 1 and 3 finish, so now the
list looks like:
-1 -> -1 -> 2
Now, on the next iteration of processTimeEvents it starts by deleting
the first event, and suppose this finalizerProc creates a new event,
so that the list looks like this:
4 -> -1 -> 2
On the next iteration of the while loop, when it gets to the second
event, the variable prev is still set to NULL, so that the head of the
event loop after the next event will be set to 2, i.e. after deleting
the next event the event loop will look like:
2
and the event with id 4 will be lost.
I've attached an example program to illustrate the issue. If you run
it you will see that it prints:
```
foo id = 0
spam!
```
But if you uncomment line 29 and run it again it won't print "spam!".
--- End of email ---
Test.c source code is as follows:
#include "ae.h"
#include <stdio.h>
aeEventLoop *el;
int foo(struct aeEventLoop *el, long long id, void *data)
{
printf("foo id = %lld\n", id);
return AE_NOMORE;
}
int spam(struct aeEventLoop *el, long long id, void *data)
{
printf("spam!\n");
return AE_NOMORE;
}
void bar(struct aeEventLoop *el, void *data)
{
aeCreateTimeEvent(el, 0, spam, NULL, NULL);
}
int main(int argc, char **argv)
{
el = aeCreateEventLoop(100);
//aeCreateTimeEvent(el, 0, foo, NULL, NULL);
aeCreateTimeEvent(el, 0, foo, NULL, bar);
aeMain(el);
return 0;
}
Anthony fixed the problem by using a linked list for the list of timers, and
sent me back this patch after he tested the code in production for some time.
The code looks sane to me, so committing it to Redis.
There are many advantages in doing this: RDB is faster to persist,
more compact, and much faster to load back. The main issues here are that
the code is less tested because this was not the old default (so we are
enabling it for the new 5.0 release), and that the AOF is no longer a
trivially parsable format from now on. However the non-preamble mode
will keep being supported in the future as well, even if new data types
are added.
This should be more than enough. Even if, in the case of partial IDs that
are not found, we send all the IDs to the slave/AOF, this is definitely
a corner case with no bad effects other than some wasted space.
This is a big win for caching use cases, since on reloading Redis will
still have some idea about what is worth evicting and what is not.
However this only solves part of the problem, because the information is
only partially propagated to slaves (on write operations). Reads will
not affect the slaves' LFU and LRU counters, so after a failover the eviction
decisions are kinda random until keys start to collect some aging/frequency info.
However, since new slaves are initially populated via RDB file transfer,
this means that if we spin up a new slave from a master, and perform an
immediate manual failover (for instance in order to upgrade the master),
the slave will have eviction information to use for some time.
The LFU/LRU info is persisted only if the maxmemory policy is set to one
of the relevant types, even if no actual "maxmemory" memory limit is
set.
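A rough sketch of the save-side conditional described above (function and opcode names follow rdb.c conventions, but this is an illustration, not the literal patch):
```
/* Before saving the key/value pair, emit the idle time or the LFU
 * counter depending on the configured class of maxmemory policy. */
if (server.maxmemory_policy & MAXMEMORY_FLAG_LRU) {
    if (rdbSaveType(rdb,RDB_OPCODE_IDLE) == -1) return -1;
    if (rdbSaveLen(rdb,estimateObjectIdleTime(val)/1000) == -1) return -1;
}
if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
    uint8_t freq = LFUDecrAndReturn(val);
    if (rdbSaveType(rdb,RDB_OPCODE_FREQ) == -1) return -1;
    if (rdbWriteRaw(rdb,&freq,1) == -1) return -1;
}
```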
XINFO is mainly an observability command that will be used more by
humans than computers, and even when used by computers it will be a very
low traffic command. For this reason the format was changed in order to
have field names. They'll consume some bandwidth and CPU cycles, but in
this context this is much better than having to understand what the
numbers in the output array are.
Now you can use:
addReplyError("-MYERRORCODE some message");
If the error code is omitted, the behavior is like in the past:
the generic -ERR will be used.
This commit, in some parts derived from PR #3041 which is no longer
possible to merge (because the user deleted the original branch),
implements the ability of slaves to have a special configuration
preventing that they try to start a failover when the master is failing.
There are multiple reasons for wanting this, and the feature was
requested in issue #3021 some time ago.
The differences between this patch and the original PR are the
following:
1. The flag is saved/loaded on the nodes configuration.
2. The 'myself' node is now flag-aware, the flag is updated as needed
when the configuration is changed via CONFIG SET.
3. The flag name uses NOFAILOVER instead of NO_FAILOVER to be consistent
with existing NOADDR.
4. The redis.conf documentation was rewritten.
Thanks to @deep011 for the original patch.
other fixes / improvements:
- Lua script memory isn't taken from zmalloc (it's taken from libc malloc),
so it can cause a high fragmentation ratio to be displayed (which is false)
- there was a problem with "fragmentation" info being calculated from
RSS and used_memory sampled at different times (now sampling them together)
other details:
- adding a few more allocator info fields to INFO and MEMORY commands
- improve defrag test to measure defrag latency of big keys
- increasing the accuracy of the defrag test (by looking at real frag info);
this way we can use an even lower threshold and still avoid false positives
- keep the old (total) "fragmentation" field unchanged, but add new ones for specific things
- add these to the MEMORY DOCTOR command
- deduct Lua memory from the RSS in case of a non-jemalloc allocator (one for which we don't have "allocator active/used")
- reduce sampling rate of the RSS and allocator info
- big keys are not defragged in one go from within the dict scan;
instead they are scanned in parts after the main dict hash bucket is done
- add latency monitor sample for defrag
- change default active-defrag-cycle-min to induce lower latency
- make active defrag start a new scan right away if needed, so it's easier
(for the test suite) to detect when it's done
- make active defrag quit the current cycle after each db / big key
- defrag some non-key long term global allocations
- some refactoring for smaller functions and more reusable code
- during dict rehashing, one scan iteration of the dict can end up scanning
one bucket in the smaller dict and many, many buckets in the larger dict,
so waiting for 16 scan iterations before checking the time may be much too long
AE_BARRIER was implemented like:
- Fire the readable event.
- Do not fire the writable event if the readable fired.
However this may lead to the writable event never being called if the
readable event is always fired. There is an alternative: we can just
invert the sequence of the calls in case AE_BARRIER is set. This commit
does that.
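A rough sketch of the new dispatch order in aeProcessEvents() (variable names follow ae.c; treat this as an illustration of the idea, not the literal patch):
```
/* When AE_BARRIER is set, simply invert the usual read-then-write order,
 * so the write handler runs before the read handler in this iteration. */
int invert = fe->mask & AE_BARRIER;

/* Fire the readable event first, unless the barrier asks us to invert. */
if (!invert && fe->mask & mask & AE_READABLE) {
    fe->rfileProc(eventLoop,fd,fe->clientData,mask);
    fired++;
}

/* Fire the writable event. */
if (fe->mask & mask & AE_WRITABLE) {
    if (!fired || fe->wfileProc != fe->rfileProc) {
        fe->wfileProc(eventLoop,fd,fe->clientData,mask);
        fired++;
    }
}

/* With the barrier, the readable event fires last, after the write. */
if (invert && fe->mask & mask & AE_READABLE) {
    if (!fired || fe->wfileProc != fe->rfileProc) {
        fe->rfileProc(eventLoop,fd,fe->clientData,mask);
        fired++;
    }
}
```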
In case the write handler is already installed, it could happen that we
serve the reply of a query in the same event loop cycle we received it,
preventing beforeSleep() from guaranteeing that we do the AOF fsync
before sending the reply to the client.
The AE_BARRIER mechanism, introduced in a previous commit, prevents this
problem. This commit makes actual use of this new feature to fix the
bug.
Add AE_BARRIER to the writable event loop so that slaves requesting
votes can't be served before we re-enter the event loop in the next
iteration, so clusterBeforeSleep() will fsync to disk in time.
Also add the call to explicitly fsync, given that we modified the last
vote epoch variable.
AOF fsync=always, and certain Redis Cluster bus operations, require to
fsync data on disk before replying with an acknowledge.
In such case, in order to implement Group Commits, we want to be sure
that queries that are read in a given cycle of the event loop, are never
served to clients in the same event loop iteration. This way, by using
the event loop "before sleep" callback, we can fsync the information
just one time before returning into the event loop for the next cycle.
This is much more efficient compared to calling fsync() multiple times.
Unfortunately, because of a bug, this was not always guaranteed: whether
it happened or not depended only on the actual way the events were
installed. Normally this problem is hard to trigger when AOF is enabled
with fsync=always, because we try to flush the output buffers to the
socket directly in the beforeSleep() function of Redis. However if the
output buffers are full, we actually install a write event, and in such
a case, this bug could happen.
This change to ae.c modifies the event loop implementation to make this
concept explicit. Write events that are registered with:
AE_WRITABLE|AE_BARRIER
are guaranteed to never fire after the readable event was fired for the
same file descriptor. In this way we are sure that data is persisted to
disk before the client performing the operation receives an
acknowledgement.
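For instance, this is roughly how the writable handler may be installed with the barrier when flushing client output buffers (a sketch; the exact code in networking.c may differ):
```
/* Ask for the write handler to be called before the read handler only
 * when the reply must hit the disk first (AOF with fsync=always). */
int ae_flags = AE_WRITABLE;
if (server.aof_state == AOF_ON && server.aof_fsync == AOF_FSYNC_ALWAYS)
    ae_flags |= AE_BARRIER;
if (aeCreateFileEvent(server.el, c->fd, ae_flags, sendReplyToClient, c) == AE_ERR)
    freeClientAsync(c);
```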
However note that this semantics does not provide all the guarantees
that one may believe are automatically provided. Take the example of the
blocking list operations in Redis.
With AOF and fsync=always we could have:
Client A doing: BLPOP myqueue 0
Client B doing: RPUSH myqueue a b c
In this scenario, Client A will get the "a" element immediately after
Client B's RPUSH is executed, even before the operation is persisted.
However when Client B gets the acknowledge, it can be sure that
"b,c" are already safe on disk inside the list.
What to note here is that Client A receiving the element is not a
guarantee that the operation succeeded from the point of view of
Client B.
This is due to the fact that the barrier exists within the same socket,
and not between different sockets. However in the case above, the
element "a" was not going to be persisted regardless, so it is a pretty
synthetic argument.
This commit adds two new fields in the INFO output, stats section:
expired_stale_perc:0.34
expired_time_cap_reached_count:58
The first field is an estimate of the percentage of keys that are still in
memory but are already logically expired. The reason why those keys are
not yet reclaimed is that the active expire cycle can't spend more
time on the process of reclaiming the keys, and at the same time nobody
is accessing such keys. However, as the active expire cycle runs, even if
it will eventually have to return to the caller because of the time limit,
or because there are less than 25% of logically expired keys in each
given database, it collects the stats in order to populate this INFO
field.
Note that expired_stale_perc is a running average, where the current
sample accounts for 5% and the history for 95%, so you'll see it
changing smoothly over time.
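As a self-contained illustration of the running average (variable and function names here are hypothetical, not the ones used in expire.c):
```
#include <stdio.h>

/* Exponential moving average: the current sample weighs 5%, the
 * accumulated history 95%, so the value changes smoothly over time. */
static double expired_stale_perc = 0;

static void update_stale_perc(double current_sample) {
    expired_stale_perc = current_sample * 0.05 + expired_stale_perc * 0.95;
}

int main(void) {
    for (int i = 0; i < 10; i++) {
        update_stale_perc(0.34);
        printf("expired_stale_perc = %.4f\n", expired_stale_perc);
    }
    return 0;
}
```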
The other field, expired_time_cap_reached_count, counts the number
of times the expire cycle had to stop because of the time limit, even if
it was still finding a sizeable number of keys to expire.
This allows people handling operations to understand whether the Redis
server, during mass-expiration events, is able to collect keys fast
enough. It is normal for this field to increment during mass
expires, but normally it should increment very rarely. When instead it
constantly increments, it means that the current workload is using
a very significant percentage of CPU time to expire keys.
This feature was created thanks to the hints of Rashmi Ramesh and
Bart Robinson from Twitter. In private email exchanges, they noted how
it was important to improve the observability of this parameter in the
Redis server. Actually in big deployments, the amount of keys that are
yet to expire in each server, even if they are logically expired, may
account for a very big amount of wasted memory.
It is possible to do BGREWRITEAOF even if appendonly=no. This is by design.
stopAppendonly() didn't turn off aof_rewrite_scheduled (it can be turned on
again by BGREWRITEAOF even while appendonly is off anyway).
After configuring `appendonly yes` it will see that the state is AOF_OFF,
there's no RDB fork, so it will do rewriteAppendOnlyFileBackground() which
will fail since the aof_child_pid is set (was scheduled and started by cron).
Solution:
stopAppendonly() will turn off the schedule flag (regardless of who asked for it).
startAppendonly() will terminate any existing fork and start a new one (so it is the most recent).
When feeding the master with high rate traffic, the slave's feed is much slower.
This causes the replication buffer to grow (indefinitely), which leads to slave disconnection.
The problem is that writeToClient() decides to stop writing after NET_MAX_WRITES_PER_EVENT
writes (in order to be fair to clients).
We should ignore this limit when the client is a slave.
It's better if clients wait longer; the alternative is that the slave has no chance to stay in
sync in this situation.
See #3462 and related PRs.
We use a simple algorithm to calculate the level of affinity violation,
and then an optimizer that performs random swaps until things improve.
After a slave is promoted (assuming it has no slaves
and it booted over an hour ago), it will lose its replication
backlog at the next replication cron, rather than waiting for slaves
to connect to it.
So on a simple master/slave failover, if the new slave doesn't connect
immediately, it may be too late and PSYNC2 will fail.
This fixes a crash with Redis Cluster when OBJECT is mis-used, because
getKeysUsingCommandTable() will call serverPanic() detecting we are
accessing an invalid argument in the case "OBJECT foo" is called.
This bug was introduced when OBJECT HELP was introduced, because the key
argument is set fixed at index 2 in the command table, however now
OBJECT may be called with an insufficient number of arguments to extract
the key.
The "Right Thing" would be to have a specific function to extract keys
from the OBJECT command, however this is kinda of an overkill, so I
preferred to make getKeysUsingCommandTable() more robust and just return
no keys when it's not possible to honor the command table, because new
commands are often added and also there are a number with an HELP
subcommand violating the normal form, and crashing for this trivial
reason or having many command-specific key extraction functions is not
great.
- protocol parsing (processMultibulkBuffer) was limited to 32bit positions in the buffer
  (readQueryFromClient potential overflow)
- rioWriteBulkCount used int, although rioWriteBulkString gave it size_t
- several places in sds.c used int for string length or index
- bugfix in RM_SaveAuxField (return value was 1 or -1 and not the length)
- RM_SaveStringBuffer was limited to 32bit length
The commit splits the add functions into set() and add() variants, so
that it's possible to set registers in an independent way, just having
the index and count.
Related to #3819, otherwise a fix is not possible.
The main change introduced by this commit is pretending that help
arrays are more text than code, thus indenting them at level 0. This
improves readability, and is an old practice when defining arrays of
C strings describing text.
Additionally a few useless return statements are removed, and the HELP
subcommand capitalized when printed to the user.
We have this operation in two places: when caching the master and
when linking a new client after the client creation. By having an API
for this we avoid the risk of errors when modifying one of the two
places and forgetting the other. The function is also a good place to
document why we cache the linked list node.
Related to #4497 and #4210.
The function in its initial form, and after the fixes for the PSYNC2
bugs, required code duplication in multiple spots. This commit modifies
it in order to always compute the script name independently, and to
return the SDS of the SHA of the body: this way it can be used in all
the places, including for SCRIPT LOAD, without duplicating the code to
create the Lua function name. Note that this requires re-computing the
body SHA1 in the case of EVAL seeing a script for the first time, but
this should not change scripting performance in any way, because defining
a new script is a rare event happening only the first time a script is
seen, and the SHA1 computation is anyway not a slow process for the
typical Redis script, especially compared to the actual Lua byte compiling
of the body.
Note that the function used to assert() if a duplicated script was
loaded; however, in two out of three cases we now want the function
to handle duplicated scripts just fine: this happens in SCRIPT LOAD and
in RDB AUX "lua" loading. Moreover the assert was not defending against
some obvious failure mode, so now the function always checks for
already defined functions at the start.
Unfortunately, as outlined by @soloestoy in #4505, "lua" AUX RDB field
loading in case of duplicated script was still broken. This commit fixes
this problem and also a memory leak introduced by the past commit.
Note that now we have a regression test able to duplicate the issue, so
this commit was actually tested against the regression. The original PR
also had a valid fix, but I prefer to hide the details of scripting.c
outside scripting.c, and later "SCRIPT LOAD" should also be able to use
the function luaCreateFunction() instead of redoing the work.
With PSYNC2, forcing a full SYNC in tests is hard. With this new DEBUG
subcommand we just need to call it and then CLIENT KILL TYPE master in
the slave.
In the case of slaves loading the RDB from master, or in other similar
cases, the script is already defined, and the function registering the
script should not fail in the assert() call.
It's a bit of black magic without actually tracking it inside rax.c,
however Redis usage of the radix tree for the stream data structure is
quite consistent, so a few magic constants apparently are producing
results that make sense.
Note that streams produced by XADD in previous broken versions having
elements with 4096 bytes or more will be permanently broken and must be
created again from scratch.
Fix #4428, fix #4349.
After checking with the community via Twitter (here:
https://twitter.com/antirez/status/915130876861788161) the verdict was to
use ":". However I later realized, after users lamented the fact that
it's hard to copy IDs just with double click, that this was the reason
why I moved to "." in the first instance. Fortunately "-", that was the
other option with most votes, also gets selected with double click on
most terminal applications on Linux and MacOS.
So my reasoning was:
1) We can't retain "." because it's actually confusing to newcomers, it
looks like a floating number, people may be tricked into thinking they
can order IDs numerically as floats.
2) Moving to a double-click-to-select format is much better. People will
work with such IDs for a long time when coding / debugging. Why make a
choice now that will impact this for the next years?
The only other viable option was "-", and that's what I did. Thanks.
The core of this change is the implementation of stream trimming, and
the resulting MAXLEN option of XADD as a trivial result of having
trimming functionalities. MAXLEN already works but in order to be more
efficient listpack GC should be implemented, currently marked as a TODO
item inside the comments.
Listpack max size is a tradeoff between space and time. A 2k max entry
puts the memory usage approximately at a similar order of magnitude (5
million entries went from 96 to 120 MB), but the range query speed
doubled (because there are half the entries to scan in the average case).
Lower values could be considered, or maybe this parameter should be
made tunable.
We used to have the master ID stored at the start of the listpack,
however using the key directly makes more sense in order to create a
space efficient representation: anyway the key at the radix tree is very
unlikely to change because of how the stream is implemented. Moreover on
nodes merging, to rewrite the merged listpacks is anyway the most
sensible operation, and we can use the iterator and the append-to-stream
function in order to avoid re-implementing the code needed for merging.
This commit also adds two items at the start of the listpack: the
number of valid items inside the listpack, and the number of items
marked as deleted. This means that there is no need to scan a listpack
in order to understand if it's a good candidate for garbage collection,
when the ratio between valid and deleted items triggers the GC.
The approach used is to set a fixed header at the start of every
listpack blob (that contains many entries). The header contains a
"master" ID and fields, that are initially just obtained from the first
entry inserted in the listpack, so that the first enty is always well
compressed. Later every new entry is checked against these fields, and
if it matches, the SAMEFIELD flag is set in the entry so that we know to
just use the master entry flags. The IDs are always delta-encoded
against the first entry. This approach avoids cascading effects in which
entries are encoded depending on the previous entries, in order to avoid
complexity and rewritings of the data when data is removed in the middle
(which is a planned feature).
blockForKeys() was not freeing the allocation holding the ID when the
key was already found busy. Fortunately the unit test checked explicitly
for blocking multiple times for the same key (copying a regression in
the blocking lists tests), so the bug was detected by the Redis test leak
checker.
XADD was suboptimal in the first incarnation of the command, not being
able to accept an ID (very useful for replication), nor options for
having capped streams.
The keyspace notification for streams was not implemented.
A client may lose a lot of time between invocations of blocking XREAD,
for example because it is processing the messages or for any other
cause. When it returns, it may provide a message ID low enough that
the server would have to send an unreasonable number of messages in a
single call. For this reason we set a COUNT when the client is blocked
with XREAD calls, even if no COUNT is given. This is arbitrarily set to
1000 because it's enough to avoid slowing down the reception of many
messages, but low enough to avoid blocking.
With lists we need to signal only on key creation, but streams can
provide data to clients listening at every new item added.
To make this slightly more efficient we now track different classes of
blocked clients to avoid signaling keys when there is nobody listening.
A typical case is when the stream is used as a time series DB and
accessed only by range with XRANGE.
After a few attempts it looked saner to just add the last item ID
at the end of the serialized listpacks, instead of scanning the last
listpack loaded from head to tail just to fetch it. It's basically a disk
space vs. CPU-and-simplicity tradeoff.
Related to #4483. As suggested by @soloestoy, we can retrieve the SHA1
from the body. Given that in the new implementation using AUX fields we
ended up copying things around a lot to create new objects and strings,
we take this concept to the extreme and trade CPU for space inside the
RDB file.
This is currently needed in order to fix #4483, but this can be
useful in other contexts, so maybe later we may want to remove the
conditionals and always save/load scripts.
Note that we are using the "lua" AUX field here, in order to guarantee
backward compatibility of the RDB file. The unknown AUX fields must be
discarded by past versions of Redis.
Doing the following ended with a broken server.executable:
1. Start Redis with src/redis-server
2. Send CONFIG SET DIR /tmp/
3. Send DEBUG RESTART
At this point we called execve with an argv[0] that is no longer related
to the new path. So after the restart the absolute path of the
executable is recomputed in the wrong way. With this fix we pass the
absolute path already computed as argv[0].
This adds a new `addReplyHelp` helper that's used by commands
when returning a help text. The following commands have been
touched: DEBUG, OBJECT, COMMAND, PUBSUB, SCRIPT and SLOWLOG.
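A minimal usage sketch (assuming addReplyHelp() takes the client and a NULL terminated array of C strings; the command name is hypothetical). Note how the help array body is indented at level 0, as discussed in the earlier commit about help arrays being more text than code:
```
void mycmdCommand(client *c) {
    if (c->argc == 2 && !strcasecmp(c->argv[1]->ptr,"help")) {
        const char *help[] = {
"DO <key> -- Do something with <key>.",
"HELP -- Print this help.",
NULL
        };
        addReplyHelp(c, help);
        return;
    }
    /* ... dispatch the actual subcommands here ... */
}
```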
WIP
Fix the command table entry for OBJECT for the HELP option.
After #4472 the command may have just 2 arguments.
Improve OBJECT HELP descriptions.
See #4472.
WIP 2
WIP 3
See #4192, the original PR removed lines of code that are actually
needed, so thanks to @chunqiulfq for reporting the problem, but merging
solution from @jeesyn after checking, together with @artix75, that the
logic covers all the cases.
First, use the access time to replace the decrement time of the LFU.
The function LFUDecrAndReturn should only try to get the decremented
counter, without updating the LFU fields: we will update them in an
explicit way. And we will halve the counter as many times as
server.lfu_decay_time periods have elapsed.
Every time a key is accessed, we should update the LFU fields, including
the access time, and increment the counter after calling
LFUDecrAndReturn. If a key is overwritten, the LFU should also be updated.
Then we can use the `OBJECT freq` command to get a key's frequency,
and LFUDecrAndReturn should be called in the `OBJECT freq` command
in case the key has not been accessed for a long time,
because we update the access time only when the key is read or
overwritten.
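A minimal sketch of the decay step described above (assuming the usual 8-bit counter and 16-bit access time packed in the object lru field, and a helper such as LFUTimeElapsed() returning the minutes elapsed since that time; an illustration, not the exact patch):
```
/* Return the decremented counter without touching the stored LFU fields:
 * the counter is halved once for every server.lfu_decay_time period
 * elapsed since the recorded access time. Callers update the fields
 * explicitly when the key is accessed. */
unsigned long LFUDecrAndReturnSketch(robj *o) {
    unsigned long ldt = o->lru >> 8;        /* last access time, in minutes */
    unsigned long counter = o->lru & 255;   /* logarithmic frequency counter */
    unsigned long periods = server.lfu_decay_time ?
                            LFUTimeElapsed(ldt) / server.lfu_decay_time : 0;
    while (periods-- && counter) counter /= 2;
    return counter;
}
```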
getLongLongFromObject calls string2ll, which has this line:
/* Return if not all bytes were used. */
so if you pass an sds of 3 characters "1\01" it will fail.
But getLongDoubleFromObject calls strtold, and considers it ok if eptr[0]=='\0',
i.e. if the end of the string found by strtold is the null terminator:
127.0.0.1:6379> set a 1
OK
127.0.0.1:6379> setrange a 2 2
(integer) 3
127.0.0.1:6379> get a
"1\x002"
127.0.0.1:6379> incrbyfloat a 2
"3"
127.0.0.1:6379> get a
"3"
For example:
1. A module command called within a MULTI section.
2. A Lua script with replicate_commands() called within a MULTI section.
3. A module command called from a Lua script in the above context.
Normally in modern Redis you can't create zero-len lists, however it's
possible to load them from old RDB files generated, for instance, using
Redis 2.8 (see issue #4409). The "Right Thing" would be not loading such
lists at all, but this would require hooking into rdb.c in random places
in a not-so-great way, for a problem that is, at this point, minor at best.
Here in this commit instead I just fix the fact that zero length lists,
materialized as quicklists with the first node set to NULL, were
iterated in the wrong way while they are saved, leading to a crash.
The other parts of the list implementation are apparently able to deal
with empty lists correctly, even if they are no longer a thing.
Since SDS v2, we no longer have a single header, so the function that
rewrites the SDS in terms of the minimum space required, instead of just
using realloc() and letting the underlying allocator decide what to do,
was doing an allocation + copy every time the minimum possible header
needed to represent the string was different than the current one.
This could often be a bit wasteful, because if we go, for instance, from
the 32 bit fields header to the 16 bit fields header, the overhead of
the header is normally very small. With this commit we call realloc
instead, unless the change in header size is very significant in relation
to the string length.
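Conceptually, the new logic looks like the following sketch (names follow sds.c conventions; the exact condition used to decide when the header saving is significant is an assumption here):
```
/* Inside sdsRemoveFreeSpace()-like logic: 'type' is the minimum header
 * type able to represent the string, 'oldtype' the current one. */
if (oldtype == type || type > SDS_TYPE_8) {
    /* Same header, or the possible saving is only a few bytes: realloc
     * in place and let the allocator decide whether to move the data. */
    newsh = s_realloc(sh, oldhdrlen + len + 1);
    if (newsh == NULL) return NULL;
    s = (char*)newsh + oldhdrlen;
} else {
    /* The header shrink is significant compared to the string length:
     * allocate the smaller layout and copy the string across. */
    newsh = s_malloc(hdrlen + len + 1);
    if (newsh == NULL) return NULL;
    memcpy((char*)newsh + hdrlen, s, len + 1);
    s_free(sh);
    s = (char*)newsh + hdrlen;
    s[-1] = type;
    sdssetlen(s, len);
}
sdssetalloc(s, len);
return s;
```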
When we free the backlog, we should use a new replication ID and clear
the ID2. Since without a backlog we cannot increment master_repl_offset
even when we execute write commands, this may lead to inconsistency when
we try to connect to a master that was previously our slave (in that
case, our replid equals that master's replid2). Since that master has
our history, it can match our replid against its replid2 and
second_replid_offset, making the partial sync work, but the data is
inconsistent.
There was not enough sanity checking in the code loading the slots of
Redis Cluster from the nodes.conf file; this resulted in the ability of
an attacker to write data at random addresses in the process memory,
by manipulating the index of the array. The bug seems exploitable using
the following technique: the config file may be altered so that one of
the nodes gets, as node ID (which is the first field inside the
structure), some data that is actually executable; then by writing this
address in selected places, this node ID part can be executed after a
jump. So it is mostly just a matter of effort in order to exploit the
bug. In practice however the issue is not very critical, because the
bug requires an unprivileged user to be able to modify the Redis Cluster
nodes configuration, and at the same time this should result in some
gain. However Redis normally is unprivileged as well. Yet it is much
better to have this fixed indeed.
Fix #4278.
Certain checks were useless; at the same time certain malformed inputs
were accepted without problems (empty strings parsed as zero).
Cases where strtod() returns ERANGE, but we still want to parse the input,
were ok in getDoubleFromObject() but not in the long variant.
As a side effect of these fixes, this commit fixes #4391.
This commit is a reinforcement of commit c1c99e9.
1. Replication information can be stored when the RDB file is
generated by a master, using server.slaveseldb when server.repl_backlog
is not NULL, or setting repl_stream_db to -1. That's safe, because
a NULL server.repl_backlog will trigger full synchronization,
and then the master will send a SELECT command to the replication stream.
2. Only do rdbSave* when rsiptr is not NULL;
if we do rdbSave* without rdbSaveInfo, the slave will miss repl-stream-db.
3. Save the replication information also in the case of the
SAVE command, the FLUSHALL command and DEBUG reload.
This commit attempts to fix a number of bugs reported in #4316.
They are related to the way replication info like the replication ID,
offsets, and currently selected DB in the master client are stored
and loaded by Redis. In order to avoid inconsistencies, the changes in
this commit try to enforce that:
1. Replication information is only stored when the RDB file is
generated by a slave that has a valid 'master' client, so that we can
always extract the currently selected DB.
2. When replication information is persisted in the RDB file, all the
info for a successful PSYNC, or nothing, is persisted.
3. The RDB replication information is only loaded if the instance is
configured as a slave, otherwise a master can start with IDs that relate
to a different history of the data set, and still retain such IDs in the
future while receiving unrelated writes.
A slave may be started with an RDB file able to provide enough
replication state to perform a successful partial SYNC with its master.
However in such a case, as outlined in issue #4268, the slave backlog
will not be started, since it was only initialized on full sync attempts.
This creates different problems with successive PSYNC attempts that will
always result in full synchronizations.
Thanks to @fdingiit for discovering the issue.
When the SHUTDOWN command is received, it is possible that some of the
recent commands were not yet flushed from the AOF buffer, and the server
experiences data loss at shutdown.
Lua scripting does not support calling blocking commands; however, all
the native Redis commands that may block are flagged as "s" (the
no-scripting flag), so this is not possible at all. With modules there
is no such mechanism in order to flag a command as non callable by the
Lua scripting engine; moreover we cannot trust module authors to comply
all the time: it is likely that modules will be released having blocking
commands that are not flagged correctly, even if we provide a way to
signal this fact.
This commit attempts to address the problem in a short term way, by
detecting that a module is trying to block in the context of the Lua
scripting engine client, and preventing it from doing this. The module
will actually believe it is blocking as usual, but what happens is that
the Lua script receives an error immediately, and the background call is
ignored by the Redis engine (except for the cleanup callbacks, once it
unblocks).
Long term, the more likely solution is to introduce a new call called
RedisModule_GetClientFlags(), so that a command can detect if the caller
is a Lua script, and return an error, or avoid blocking at all.
The blocking API being experimental right now, more work is needed in
this regard in order to reach a level where blocking module commands and
all the other Redis subsystems interact peacefully.
Now the effect is like the following:
127.0.0.1:6379> eval "redis.call('hello.block',1,5000)" 0
(error) ERR Error running script (call to
f_b5ba35ff97bc1ef23debc4d6e9fd802da187ed53): @user_script:1: ERR
Blocking module command called from Lua script
This commit fixes issue #4127 in the short term.
This function failed when an internal-only flag was set as the only flag
in a node: the string was trimmed expecting a final comma before
exiting the function, causing a crash. See issue #4142.
Moreover, the generation of the flags representation, only needed at the
DEBUG log level, was always performed: a waste of CPU time. This is fixed
as well by this commit.
The function cache was not working at all, and the function returned
wrong values if there were two or more modules exporting native data
types.
See issue #4131 for more details.
In Redis 4.0 replication, with the introduction of PSYNC2, masters and
slaves replicate commands to cascading slaves and to the replication
backlog itself in a different way compared to the past.
Masters actually replicate the effects of client commands.
Slaves just propagate what they receive from masters.
This mechanism can cause problems when the configuration of an instance
is changed from master to slave inside a transaction. For instance
we could send to a master instance the following sequence:
MULTI
SLAVEOF 127.0.0.1 0
EXEC
SLAVEOF NO ONE
Before the fixes in this commit, the MULTI command used to be propagated
into the replication backlog, however after the SLAVEOF command the
instance is a slave, so the EXEC implementation failed to also propagate
the EXEC command. When the slaves of the above instance reconnected,
they were incrementally synchronized just sending a "MULTI". This put
the master client (in the slaves) into MULTI state, breaking the
replication.
Notably even Redis Sentinel uses the above approach in order to guarantee
that configuration changes are always performed together with rewrites
of the configuration and with clients disconnection. Sentinel does:
MULTI
SLAVEOF ...
CONFIG REWRITE
CLIENT KILL TYPE normal
EXEC
So this was a really problematic issue. However even with the fix in
this commit, that will add the final EXEC to the replication stream in
case the instance was switched from master to slave during the
transaction, the result would be to increment the slave replication
offset, so a successive reconnection with the new master, will not
permit a successful partial resynchronization: no way the new master can
provide us with the backlog needed, we incremented our offset to a value
that the new master cannot have.
However the EXEC implementation waits to emit the MULTI, so that if the
commands inside the transaction actually do not need to be replicated,
no commands propagation happens at all. From multi.c:
    if (!must_propagate && !(c->cmd->flags & (CMD_READONLY|CMD_ADMIN))) {
        execCommandPropagateMulti(c);
        must_propagate = 1;
    }
The above code is already modified by this commit you are reading.
Now also ADMIN commands do not trigger the emission of MULTI. It is actually
not clear why we do not just check for CMD_WRITE... Probably I wrote it this
way in order to make the code more reliable: better to over-emit MULTI
than not emitting it in time.
So this commit should indeed fix issue #3836 (verified), however it looks
like some reconsideration of this code path is needed in the long term.
BONUS POINT: The reverse bug.
Even in a read only slave "B", in a replication setup like:
A -> B -> C
There are commands without the READONLY nor the ADMIN flag, that are also
not flagged as WRITE commands. An example is just the PING command.
So if we send B the following sequence:
MULTI
PING
SLAVEOF NO ONE
EXEC
The result will be the reverse bug, where only EXEC is emitted, but not the
previous MULTI. However this apparently does not create problems in practice,
but it is yet another acknowledgement of the fact that some work is needed
here in order to make this code path less surprising.
Note that there are many different approaches we could follow. For instance
MULTI/EXEC blocks containing administrative commands may be allowed ONLY
if all the commands are administrative ones, otherwise they could be
denied. When allowed, the commands could simply never be replicated at all.
In general we do not want the before/after sleep() callbacks to be called
when we re-enter the event loop, since those calls are only designed to
perform operations once per main iteration of the event loop, and
re-entering is often just a way to incrementally serve clients with
error messages or other auxiliary operations. However, if we call the
callbacks, we are then forced to think of the before/after sleep callbacks
as re-entrant, which is much harder without any good need.
However here there was also a clear bug: beforeSleep() was actually
never called when re-entering the event loop, but the new afterSleep()
callback was. This is broken, and in this instance re-entering
afterSleep() caused a modules GIL deadlock.
Redis clients need to have an instantaneous idea of the amount of memory
they are consuming (if the number is not exact should at least be
proportional to the actual memory usage). We do that adding and
subtracting the SDS length when pushing / popping from the client->reply
list. However it is quite simple to introduce bugs in such a setup, by not
keeping the objects in the list and the count in sync. For such reason,
Redis has an assertion to track counts near 2^64: those are always the
result of the counter wrapping around because we subtract more than we
add. This commit adds the symmetrical assertion: when the list is empty
since we sent everything, the reply_bytes count should be zero. Thanks
to the new assertion it should be simple to also detect the other
problem, where the count slowly increases because of over-counting.
The assertion adds a conditional in the code that sends the buffer to
the socket but should not create any measurable performance slowdown,
listLength() just accesses a structure field, and this code path is
totally dominated by write(2).
Related to #4100.
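A sketch of what the symmetric check amounts to (names follow networking.c, but treat this as an illustration):
```
/* Once the reply list has been emptied by writeToClient(), the byte
 * counter must be back to zero, otherwise the accounting drifted. */
if (listLength(c->reply) == 0)
    serverAssert(c->reply_bytes == 0);
```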
This commit closes issue #3698, at least for now, since the root cause
was not fixed: the bounding box function, for huge radiuses, does not
return a correct bounding box, there are points still within the radius
that are left outside.
So when using GEORADIUS queries with radiuses in the order of 5000 km or
more, it was possible to see, at the edge of the area, certain points
not correctly reported.
Because the bounding box for now was used just as an optimization, and
such huge radiuses are not common, for now the optimization is just
switched off when the radius is near such magnitude.
Three test cases found by the Continuous Integration test were added, so
that we can easily trigger the bug again, both for regression testing
and in order to properly fix it at some point in the future.
This feature was proposed by @rosmo in PR #2643 and later redesigned
in order to fit better with the other options for non-interactive modes
of redis-cli. The idea is basically to allow collecting latency
information in scripts, cron jobs or whatever, just running for a
limited time and then producing a single output.
Issue #4084 shows how for a design error, GEORADIUS is a write command
because of the STORE option. Because of this it does not work
on readonly slaves, gets redirected to masters in Redis Cluster even
when the connection is in READONLY mode and so forth.
Breaking backward compatibility at this stage, with Redis 4.0 in an
advanced RC state, is problematic for the user base. The API could be
fixed in the unstable branch soon, if we decide to do so in order to
be more consistent, and Redis 5.0 released with this incompatibility in
the future. This is still unclear.
However, the ability to scale GEO queries in slaves easily is too
important so this commit adds two read-only variants to the GEORADIUS
and GEORADIUSBYMEMBER command: GEORADIUS_RO and GEORADIUSBYMEMBER_RO.
The commands are exactly like the original commands, but they do not
accept the STORE and STOREDIST options.
This is the first step towards getting rid of HMSET which is a command
that does not make much sense once HSET is variadic, and has a saner
return value.
The original RDB serialization format was not parsable without the
module loaded, because the structure was managed only by the module
itself. Moreover RDB is a streaming protocol, in the sense that it is
both produced in an append-only fashion, and is also sometimes directly
sent to the socket (in the case of diskless replication).
The fact that modules values cannot be parsed without the relevant
module loaded is a problem in many ways: RDB checking tools must have
loaded modules even for doing things not involving the value at all,
like splitting an RDB into N RDBs by key or alike, or just checking the
RDB for sanity.
In theory module values could be just a blob of data with a prefixed
length in order for us to be able to skip it. However prefixing the values
with a length would mean one of the following:
1. To be able to write some data at a previous offset. This breaks
streaming.
2. To buffer values before outputting them. This breaks performance.
3. To have some chunked RDB output format. This breaks simplicity.
Moreover, the above solution still makes module values a totally opaque
matter, with the following problems:
1. The RDB check tool can just skip the value without being able to at
least check the general structure. For datasets composed mostly of
module values this means just checking the outer level of the RDB, not
actually doing any check on most of the data itself.
2. It is not possible to do any recovering or processing of data for which a
module no longer exists in the future, or is unknown.
So this commit implements a different solution. The modules RDB
serialization API is composed of well defined calls to store integers,
floats, doubles or strings. After this commit, the parts generated by
the module API have a one-byte prefix for each of the above emitted
parts, and there is a final EOF byte as well. So even if we don't know
exactly how to interpret a module value, we can always parse it at a
high level, check the overall structure, understand the types used to
store the information, and easily skip the whole value.
The change is backward compatible: older RDB files can be still loaded
since the new encoding has a new RDB type: MODULE_2 (of value 7).
The commit also implements the ability to check RDB files for sanity
taking advantage of the new feature.
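To make the idea concrete, here is a small self-contained sketch of an opcode-prefixed, EOF-terminated encoding of a module value (opcode names and values are illustrative only; the real rdb.c encoding also uses its own variable-length integer and string encodings):
```
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Illustrative opcodes: every part of a module value is prefixed by one
 * byte describing its type, and the whole value ends with an EOF byte. */
enum { OP_EOF = 0, OP_SINT = 1, OP_STRING = 5 };

/* Append one signed integer part: opcode byte + raw 8-byte value. */
static void save_sint(FILE *out, int64_t v) {
    uint8_t op = OP_SINT;
    fwrite(&op, 1, 1, out);
    fwrite(&v, sizeof(v), 1, out);
}

/* Append one string part: opcode byte + 4-byte length + bytes. */
static void save_string(FILE *out, const char *s) {
    uint8_t op = OP_STRING;
    uint32_t len = (uint32_t)strlen(s);
    fwrite(&op, 1, 1, out);
    fwrite(&len, sizeof(len), 1, out);
    fwrite(s, 1, len, out);
}

int main(void) {
    FILE *out = fopen("modvalue.bin", "wb");
    if (!out) return 1;
    save_sint(out, 42);            /* e.g. what RedisModule_SaveSigned() emits */
    save_string(out, "hello");     /* e.g. what RedisModule_SaveStringBuffer() emits */
    uint8_t eof = OP_EOF;
    fwrite(&eof, 1, 1, out);       /* terminator: a checker can stop here */
    fclose(out);
    return 0;
}
```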
It looks safer to return C_OK from freeMemoryIfNeeded() when clients are
paused because returning C_ERR may prevent success of writes. It is
possible that there is no difference in practice since clients cannot
execute writes while clients are paused, but it looks more correct this
way, at least conceptually.
Related to PR #4028.
1. brpop last key index, thus checking all keys for slots.
2. Memory leak in clusterRedirectBlockedClientIfNeeded.
3. Remove while loop in clusterRedirectBlockedClientIfNeeded.
This avoids Helgrind complaining, but we are actually not using
atomicGet() to get the unixtime value for now: it is used in too many
places, and given that time_t is word-sized it should be safe in all the
archs we support as it is.
On the other hand, Helgrind, when Redis is compiled with "make helgrind"
in order to force the __sync macros, will detect the write in
updateCachedTime() as a read (because atomic functions are used) and
will not complain about races.
This commit also includes minor refactoring of mutex initializations and
a "helgrind" target in the Makefile.
The __sync builtin can be correctly detected by Helgrind so to force it
is useful for testing. The API in the INFO output can be useful for
debugging after problems are reported.
Instead of giving the module background operations just a small time to
run in the beforeSleep() function, we can have the lock released for all
the time we are blocked in the multiplexing syscall.