The new ACL key based permissions in #9974 require the key-specs (#8324) to have more
explicit flags rather than just READ and WRITE. See discussion in #10040
This PR defines two groups of flags:
One about how redis internally handles the key (mutually-exclusive).
The other is about the logical operation done from the user's point of view (3 mutually exclusive
write flags, and one read flag, all optional).
In both groups, if we can't explicitly flag something as explicit read-only, delete-only, or
insert-only, we flag it as `RW` or `UPDATE`.
here's the definition from the code:
```
/* Key-spec flags *
* -------------- */
/* The following refer what the command actually does with the value or metadata
* of the key, and not necessarily the user data or how it affects it.
* Each key-spec may must have exaclty one of these. Any operation that's not
* distinctly deletion, overwrite or read-only would be marked as RW. */
#define CMD_KEY_RO (1ULL<<0) /* Read-Only - Reads the value of the key, but
* doesn't necessarily returns it. */
#define CMD_KEY_RW (1ULL<<1) /* Read-Write - Modifies the data stored in the
* value of the key or its metadata. */
#define CMD_KEY_OW (1ULL<<2) /* Overwrite - Overwrites the data stored in
* the value of the key. */
#define CMD_KEY_RM (1ULL<<3) /* Deletes the key. */
/* The follwing refer to user data inside the value of the key, not the metadata
* like LRU, type, cardinality. It refers to the logical operation on the user's
* data (actual input strings / TTL), being used / returned / copied / changed,
* It doesn't refer to modification or returning of metadata (like type, count,
* presence of data). Any write that's not INSERT or DELETE, would be an UPADTE.
* Each key-spec may have one of the writes with or without access, or none: */
#define CMD_KEY_ACCESS (1ULL<<4) /* Returns, copies or uses the user data from
* the value of the key. */
#define CMD_KEY_UPDATE (1ULL<<5) /* Updates data to the value, new value may
* depend on the old value. */
#define CMD_KEY_INSERT (1ULL<<6) /* Adds data to the value with no chance of,
* modification or deletion of existing data. */
#define CMD_KEY_DELETE (1ULL<<7) /* Explicitly deletes some content
* from the value of the key. */
```
Unrelated changes:
- generate-command-code.py is only compatible with python3 (modified the shabang)
- generate-command-code.py print file on json parsing error
- rename `shard_channel` key-spec flag to just `channel`.
- add INCOMPLETE flag in input spec of SORT and SORT_RO
Modules can now register sockets/pipe to the Redis main thread event loop and do network operations asynchronously. Previously, modules had to maintain an event loop and another thread for asynchronous network operations.
Also, if a module is calling API functions after doing some network operations, it had to synchronize its event loop thread's access with Redis main thread by locking the GIL, causing contention on the lock. After this commit, no synchronization is needed as module can operate in Redis main thread context. So, this commit may improve the performance for some use cases.
Added three functions to the module API:
* RedisModule_EventLoopAdd(int fd, int mask, RedisModuleEventLoopFunc func, void *user_data)
* RedisModule_EventLoopDel(int fd, int mask)
* RedisModule_EventLoopAddOneShot(RedisModuleEventLoopOneShotFunc func, void *user_data) - This function can be called from other threads to trigger callback on Redis main thread. Callback will be triggered only once. If Redis main thread is sleeping, this call will wake up the Redis main thread.
Event loop callbacks are called by Redis main thread after locking the GIL. Inside callbacks, modules can operate as if they are holding the GIL.
Added REDISMODULE_EVENT_EVENTLOOP event with two subevents:
* REDISMODULE_SUBEVENT_EVENTLOOP_BEFORE_SLEEP
* REDISMODULE_SUBEVENT_EVENTLOOP_AFTER_SLEEP
These events are for modules that want to participate in the before and after sleep action. e.g It might be useful to implement batching : Read data from the network, write all to a file in one go on BEFORE_SLEEP event.
Following discussion on: https://github.com/redis/redis/issues/9899#issuecomment-1014689385
Raise error if unknows parameter is given to `FUNCTION LOAD`.
Before the fix:
```
127.0.0.1:6379> function load LUA lib2 foo bar "local function test1() return 5 end redis.register_function('test1', test1)"
OK
```
After the fix:
```
127.0.0.1:6379> function load LUA lib2 foo bar "local function test1() return 5 end redis.register_function('test1', test1)"
(error) ERR Unkowns option given: foo
```
1. enable diskless replication by default
2. add a new config named repl-diskless-sync-max-replicas that enables
replication to start before the full repl-diskless-sync-delay was
reached.
3. put replica online sooner on the master (see below)
4. test suite uses repl-diskless-sync-delay of 0 to be faster
5. a few tests that use multiple replica on a pre-populated master, are
now using the new repl-diskless-sync-max-replicas
6. fix possible timing issues in a few cluster tests (see below)
put replica online sooner on the master
----------------------------------------------------
there were two tests that failed because they needed for the master to
realize that the replica is online, but the test code was actually only
waiting for the replica to realize it's online, and in diskless it could
have been before the master realized it.
changes include two things:
1. the tests wait on the right thing
2. issues in the master, putting the replica online in two steps.
the master used to put the replica as online in 2 steps. the first
step was to mark it as online, and the second step was to enable the
write event (only after getting ACK), but in fact the first step didn't
contains some of the tasks to put it online (like updating good slave
count, and sending the module event). this meant that if a test was
waiting to see that the replica is online form the point of view of the
master, and then confirm that the module got an event, or that the
master has enough good replicas, it could fail due to timing issues.
so now the full effect of putting the replica online, happens at once,
and only the part about enabling the writes is delayed till the ACK.
fix cluster tests
--------------------
I added some code to wait for the replica to sync and avoid race
conditions.
later realized the sentinel and cluster tests where using the original 5
seconds delay, so changed it to 0.
this means the other changes are probably not needed, but i suppose
they're still better (avoid race conditions)
since `info commandstats` already shows sub-commands, we should do the same in `info latencystats`.
similarly, the LATENCY HISTOGRAM command now shows sub-commands (with their full name) when:
* asking for all commands
* asking for a specific container command
* asking for a specific sub-command)
Co-authored-by: Oran Agra <oran@redislabs.com>
These two tests have a high probability of failure
on MacOS. Or it takes many retries to succeed.
Keys often expire before we can access them.
So this time we try to avoid this by reducing the time
of the first `after`, or removeing the first `after`.
The results of doing `20/81` and `0/101` are:
- PEXPIRE (20/81): 1069/1949
- PEXPIREAT (20/81): 1093/1949
- PEXPIRE (0/101): 31936 / 31936
- PEXPIREAT (0/101): 31936 / 31936
The first number is the number of times that the
test succeeded without any retries.
The second number is the total number of executions.
And we can see that `0/101` doesn't even need an extra
retries. Also reduces the time required for testing.
So in the end we chose `0/100`, i.e. remove the first `after`.
As for `PEXPIREAT`, there is no failure, but we still changed
it together, using `0/201`, after 2W tests, none of them failed.
### Describe
Fix crash found by CI, Introduced by #9849.
When we do any operation on the quicklist, we should make sure that all nodes
of the quicklist should not be in the recompressed state.
### Issues
This PR fixes two issues with incorrect recompression.
1. The current quicklist node is full and the previous node isn't full,
the current node is not recompressed correctly after inserting elements into the previous node.
2. The current quicklist node is full and the next node isn't full,
the current node is not recompressed correctly after inserting elements into the next node.
### Test
Add two tests to cover incorrect compression issues.
### Other
Fix unittest test failure caused by assertion introduced by #9849.
# Redis Functions Flags
Following the discussion on #10025 Added Functions Flags support.
The PR is divided to 2 sections:
* Add named argument support to `redis.register_function` API.
* Add support for function flags
## `redis.register_function` named argument support
The first part of the PR adds support for named argument on `redis.register_function`, example:
```
redis.register_function{
function_name='f1',
callback=function()
return 'hello'
end,
description='some desc'
}
```
The positional arguments is also kept, which means that it still possible to write:
```
redis.register_function('f1', function() return 'hello' end)
```
But notice that it is no longer possible to pass the optional description argument on the positional
argument version. Positional argument was change to allow passing only the mandatory arguments
(function name and callback). To pass more arguments the user must use the named argument version.
As with positional arguments, the `function_name` and `callback` is mandatory and an error will be
raise if those are missing. Also, an error will be raise if an unknown argument name is given or the
arguments type is wrong.
Tests was added to verify the new syntax.
## Functions Flags
The second part of the PR is adding functions flags support.
Flags are given to Redis when the engine calls `functionLibCreateFunction`, supported flags are:
* `no-writes` - indicating the function perform no writes which means that it is OK to run it on:
* read-only replica
* Using FCALL_RO
* If disk error detected
It will not be possible to run a function in those situations unless the function turns on the `no-writes` flag
* `allow-oom` - indicate that its OK to run the function even if Redis is in OOM state, if the function will
not turn on this flag it will not be possible to run it if OOM reached (even if the function declares `no-writes`
and even if `fcall_ro` is used). If this flag is set, any command will be allow on OOM (even those that is
marked with CMD_DENYOOM). The assumption is that this flag is for advance users that knows its
meaning and understand what they are doing, and Redis trust them to not increase the memory usage.
(e.g. it could be an INCR or a modification on an existing key, or a DEL command)
* `allow-state` - indicate that its OK to run the function on stale replica, in this case we will also make
sure the function is only perform `stale` commands and raise an error if not.
* `no-cluster` - indicate to disallow running the function if cluster is enabled.
Default behaviure of functions (if no flags is given):
1. Allow functions to read and write
2. Do not run functions on OOM
3. Do not run functions on stale replica
4. Allow functions on cluster
### Lua API for functions flags
On Lua engine, it is possible to give functions flags as `flags` named argument:
```
redis.register_function{function_name='f1', callback=function() return 1 end, flags={'no-writes', 'allow-oom'}, description='description'}
```
The function flags argument must be a Lua table that contains all the requested flags, The following
will result in an error:
* Unknown flag
* Wrong flag type
Default behaviour is the same as if no flags are used.
Tests were added to verify all flags functionality
## Additional changes
* mark FCALL and FCALL_RO with CMD_STALE flag (unlike EVAL), so that they can run if the function was
registered with the `allow-stale` flag.
* Verify `CMD_STALE` on `scriptCall` (`redis.call`), so it will not be possible to call commands from script while
stale unless the command is marked with the `CMD_STALE` flags. so that even if the function is allowed while
stale we do not allow it to bypass the `CMD_STALE` flag of commands.
* Flags section was added to `FUNCTION LIST` command to provide the set of flags for each function:
```
> FUNCTION list withcode
1) 1) "library_name"
2) "test"
3) "engine"
4) "LUA"
5) "description"
6) (nil)
7) "functions"
8) 1) 1) "name"
2) "f1"
3) "description"
4) (nil)
5) "flags"
6) (empty array)
9) "library_code"
10) "redis.register_function{function_name='f1', callback=function() return 1 end}"
```
* Added API to get Redis version from within a script, The redis version can be provided using:
1. `redis.REDIS_VERSION` - string representation of the redis version in the format of MAJOR.MINOR.PATH
2. `redis.REDIS_VERSION_NUM` - number representation of the redis version in the format of `0x00MMmmpp`
(`MM` - major, `mm` - minor, `pp` - patch). The number version can be used to check if version is greater or less
another version. The string version can be used to return to the user or print as logs.
This new API is provided to eval scripts and functions, it also possible to use this API during functions loading phase.
Force create a BASE file (use a foreground `rewriteAppendOnlyFile`) when redis starts from an
empty data set and `appendonly` is yes.
The reasoning is that normally, after redis is running for some time, and the AOF has gone though
a few rewrites, there's always a base rdb file. and the scenario where the base file is missing, is
kinda rare (happens only at empty startup), so this change normalizes it.
But more importantly, there are or could be some complex modules that are started with some
configuration, when they create persistence they write that configuration to RDB AUX fields, so
that can can always know with which configuration the persistence file they're loading was
created (could be critical). there is (was) one scenario in which they could load their persisted data,
and that configuration was missing, and this change fixes it.
Add a new module event: REDISMODULE_SUBEVENT_PERSISTENCE_SYNC_AOF_START, similar to
REDISMODULE_SUBEVENT_PERSISTENCE_AOF_START which is async.
Co-authored-by: Oran Agra <oran@redislabs.com>
Use `getFullCommandName` to get the full name of the command.
It can also get the full name of the subcommand, like "script|help".
Before:
```
> SCRIPT HELP
(error) NOPERM this user has no permissions to run the 'help' command or its subcommand
> ACL LOG
7) "object"
8) "help"
```
After:
```
> SCRIPT HELP
(error) NOPERM this user has no permissions to run the 'script|help' command
> ACL LOG
7) "object"
8) "script|help"
```
Fix#10094
This commit adds some tests that the test cases will
access the keys with expiration time set in the script call.
There was no test case for this part before. See #10080
Also there is a test will cover #1525. we block the time so
that the key can not expire in the middle of the script execution.
Other changes:
1. Delete `evalTimeSnapshot` and just use `scriptTimeSnapshot` in it's place.
2. Some cleanups to scripting.tcl.
3. better names for tests that run in a loop to make them distinctable
Syntax:
`COMMAND DOCS [<command name> ...]`
Background:
Apparently old version of hiredis (and thus also redis-cli) can't
support more than 7 levels of multi-bulk nesting.
The solution is to move all the doc related metadata from COMMAND to a
new COMMAND DOCS sub-command.
The new DOCS sub-command returns a map of commands (not an array like in COMMAND),
And the same goes for the `subcommands` field inside it (also contains a map)
Besides that, the remaining new fields of COMMAND (hints, key-specs, and
sub-commands), are placed in the outer array rather than a nested map.
this was done mainly for consistency with the old format.
Other changes:
---
* Allow COMMAND INFO with no arguments, which returns all commands, so that we can some day deprecated
the plain COMMAND (no args)
* Reduce the amount of deferred replies from both COMMAND and COMMAND
DOCS, especially in the inner loops, since these create many small
reply objects, which lead to many small write syscalls and many small
TCP packets.
To make this easier, when populating the command table, we count the
history, args, and hints so we later know their size in advance.
Additionally, the movablekeys flag was moved into the flags register.
* Update generate-commands-json.py to take the data from both command, it
now executes redis-cli directly, instead of taking input from stdin.
* Sub-commands in both COMMAND (and COMMAND INFO), and also COMMAND DOCS,
show their full name. i.e. CONFIG
* GET will be shown as `config|get` rather than just `get`.
This will be visible both when asking for `COMMAND INFO config` and COMMAND INFO config|get`, but is
especially important for the later.
i.e. imagine someone doing `COMMAND INFO slowlog|get config|get` not being able to distinguish between the two
items in the array response.
It used to return `$-1` in RESP2, now we will return `*-1`.
This is a bug in redis 6.2 when COUNT was added, the `COUNT`
option was introduced in #8179. Fix#10089.
the documentation of [LPOP](https://redis.io/commands/lpop) says
```
When called without the count argument:
Bulk string reply: the value of the first element, or nil when key does not exist.
When called with the count argument:
Array reply: list of popped elements, or nil when key does not exist.
```
The following error commands will crash redis-server:
```
> get|
Error: Server closed the connection
> get|set
Error: Server closed the connection
> get|other
```
The reason is in #9504, we use `lookupCommandBySds` for find the
container command. And it split the command (argv[0]) with `|`.
If we input something like `get|other`, after the split, `get`
will become a valid command name, pass the `ERR unknown command`
check, and finally crash in `addReplySubcommandSyntaxError`
In this case we do not need to split the command name with `|`
and just look in the commands dict to find if `argv[0]` is a
container command.
So this commit introduce a new function call `isContainerCommandBySds`
that it will return true if a command name is a container command.
Also with the old code, there is a incorrect error message:
```
> config|get set
(error) ERR Unknown subcommand or wrong number of arguments for 'set'. Try CONFIG|GET HELP.
```
The crash was reported in #10070.
Fix#9410
Crucial for the ms and sequence deltas, but I changed all
calls, just in case (e.g. "flags")
Before this commit:
`ms_delta` and `seq_delta` could have overflown, causing `currid` to be wrong,
which in turn would cause `streamTrim` to trim the entire rax node (see new test)
# Redis Function Libraries
This PR implements Redis Functions Libraries as describe on: https://github.com/redis/redis/issues/9906.
Libraries purpose is to provide a better code sharing between functions by allowing to create multiple
functions in a single command. Functions that were created together can safely share code between
each other without worrying about compatibility issues and versioning.
Creating a new library is done using 'FUNCTION LOAD' command (full API is described below)
This PR introduces a new struct called libraryInfo, libraryInfo holds information about a library:
* name - name of the library
* engine - engine used to create the library
* code - library code
* description - library description
* functions - the functions exposed by the library
When Redis gets the `FUNCTION LOAD` command it creates a new empty libraryInfo.
Redis passes the `CODE` to the relevant engine alongside the empty libraryInfo.
As a result, the engine will create one or more functions by calling 'libraryCreateFunction'.
The new funcion will be added to the newly created libraryInfo. So far Everything is happening
locally on the libraryInfo so it is easy to abort the operation (in case of an error) by simply
freeing the libraryInfo. After the library info is fully constructed we start the joining phase by
which we will join the new library to the other libraries currently exist on Redis.
The joining phase make sure there is no function collision and add the library to the
librariesCtx (renamed from functionCtx). LibrariesCtx is used all around the code in the exact
same way as functionCtx was used (with respect to RDB loading, replicatio, ...).
The only difference is that apart from function dictionary (maps function name to functionInfo
object), the librariesCtx contains also a libraries dictionary that maps library name to libraryInfo object.
## New API
### FUNCTION LOAD
`FUNCTION LOAD <ENGINE> <LIBRARY NAME> [REPLACE] [DESCRIPTION <DESCRIPTION>] <CODE>`
Create a new library with the given parameters:
* ENGINE - REPLACE Engine name to use to create the library.
* LIBRARY NAME - The new library name.
* REPLACE - If the library already exists, replace it.
* DESCRIPTION - Library description.
* CODE - Library code.
Return "OK" on success, or error on the following cases:
* Library name already taken and REPLACE was not used
* Name collision with another existing library (even if replace was uses)
* Library registration failed by the engine (usually compilation error)
## Changed API
### FUNCTION LIST
`FUNCTION LIST [LIBRARYNAME <LIBRARY NAME PATTERN>] [WITHCODE]`
Command was modified to also allow getting libraries code (so `FUNCTION INFO` command is no longer
needed and removed). In addition the command gets an option argument, `LIBRARYNAME` allows you to
only get libraries that match the given `LIBRARYNAME` pattern. By default, it returns all libraries.
### INFO MEMORY
Added number of libraries to `INFO MEMORY`
### Commands flags
`DENYOOM` flag was set on `FUNCTION LOAD` and `FUNCTION RESTORE`. We consider those commands
as commands that add new data to the dateset (functions are data) and so we want to disallows
to run those commands on OOM.
## Removed API
* FUNCTION CREATE - Decided on https://github.com/redis/redis/issues/9906
* FUNCTION INFO - Decided on https://github.com/redis/redis/issues/9899
## Lua engine changes
When the Lua engine gets the code given on `FUNCTION LOAD` command, it immediately runs it, we call
this run the loading run. Loading run is not a usual script run, it is not possible to invoke any
Redis command from within the load run.
Instead there is a new API provided by `library` object. The new API's:
* `redis.log` - behave the same as `redis.log`
* `redis.register_function` - register a new function to the library
The loading run purpose is to register functions using the new `redis.register_function` API.
Any attempt to use any other API will result in an error. In addition, the load run is has a time
limit of 500ms, error is raise on timeout and the entire operation is aborted.
### `redis.register_function`
`redis.register_function(<function_name>, <callback>, [<description>])`
This new API allows users to register a new function that will be linked to the newly created library.
This API can only be called during the load run (see definition above). Any attempt to use it outside
of the load run will result in an error.
The parameters pass to the API are:
* function_name - Function name (must be a Lua string)
* callback - Lua function object that will be called when the function is invokes using fcall/fcall_ro
* description - Function description, optional (must be a Lua string).
### Example
The following example creates a library called `lib` with 2 functions, `f1` and `f1`, returns 1 and 2 respectively:
```
local function f1(keys, args)
return 1
end
local function f2(keys, args)
return 2
end
redis.register_function('f1', f1)
redis.register_function('f2', f2)
```
Notice: Unlike `eval`, functions inside a library get the KEYS and ARGV as arguments to the
functions and not as global.
### Technical Details
On the load run we only want the user to be able to call a white list on API's. This way, in
the future, if new API's will be added, the new API's will not be available to the load run
unless specifically added to this white list. We put the while list on the `library` object and
make sure the `library` object is only available to the load run by using [lua_setfenv](https://www.lua.org/manual/5.1/manual.html#lua_setfenv) API. This API allows us to set
the `globals` of a function (and all the function it creates). Before starting the load run we
create a new fresh Lua table (call it `g`) that only contains the `library` API (we make sure
to set global protection on this table just like the general global protection already exists
today), then we use [lua_setfenv](https://www.lua.org/manual/5.1/manual.html#lua_setfenv)
to set `g` as the global table of the load run. After the load run finished we update `g`
metatable and set `__index` and `__newindex` functions to be `_G` (Lua default globals),
we also pop out the `library` object as we do not need it anymore.
This way, any function that was created on the load run (and will be invoke using `fcall`) will
see the default globals as it expected to see them and will not have the `library` API anymore.
An important outcome of this new approach is that now we can achieve a distinct global table
for each library (it is not yet like that but it is very easy to achieve it now). In the future we can
decide to remove global protection because global on different libraries will not collide or we
can chose to give different API to different libraries base on some configuration or input.
Notice that this technique was meant to prevent errors and was not meant to prevent malicious
user from exploit it. For example, the load run can still save the `library` object on some local
variable and then using in `fcall` context. To prevent such a malicious use, the C code also make
sure it is running in the right context and if not raise an error.
Callers of redisFork() are logging `strerror(errno)` on failure.
`errno` is not set when there is already a child process, causing printing
current value of errno which was set before `redisFork()` call.
Setting errno to EEXIST on this failure to provide more meaningful error message.
# Short description
The Redis extended latency stats track per command latencies and enables:
- exporting the per-command percentile distribution via the `INFO LATENCYSTATS` command.
**( percentile distribution is not mergeable between cluster nodes ).**
- exporting the per-command cumulative latency distributions via the `LATENCY HISTOGRAM` command.
Using the cumulative distribution of latencies we can merge several stats from different cluster nodes
to calculate aggregate metrics .
By default, the extended latency monitoring is enabled since the overhead of keeping track of the
command latency is very small.
If you don't want to track extended latency metrics, you can easily disable it at runtime using the command:
- `CONFIG SET latency-tracking no`
By default, the exported latency percentiles are the p50, p99, and p999.
You can alter them at runtime using the command:
- `CONFIG SET latency-tracking-info-percentiles "0.0 50.0 100.0"`
## Some details:
- The total size per histogram should sit around 40 KiB. We only allocate those 40KiB when a command
was called for the first time.
- With regards to the WRITE overhead As seen below, there is no measurable overhead on the achievable
ops/sec or full latency spectrum on the client. Including also the measured redis-benchmark for unstable
vs this branch.
- We track from 1 nanosecond to 1 second ( everything above 1 second is considered +Inf )
## `INFO LATENCYSTATS` exposition format
- Format: `latency_percentiles_usec_<CMDNAME>:p0=XX,p50....`
## `LATENCY HISTOGRAM [command ...]` exposition format
Return a cumulative distribution of latencies in the format of a histogram for the specified command names.
The histogram is composed of a map of time buckets:
- Each representing a latency range, between 1 nanosecond and roughly 1 second.
- Each bucket covers twice the previous bucket's range.
- Empty buckets are not printed.
- Everything above 1 sec is considered +Inf.
- At max there will be log2(1000000000)=30 buckets
We reply a map for each command in the format:
`<command name> : { `calls`: <total command calls> , `histogram` : { <bucket 1> : latency , < bucket 2> : latency, ... } }`
Co-authored-by: Oran Agra <oran@redislabs.com>
Following #10038.
This PR introduces two changes.
1. Show the elapsed time of a single test in the test output, in order to have a more
detailed understanding of the changes in test run time.
2. Speedup two tests related to `key-load-delay` configuration.
other tests do not seem to be affected by #10003.
Creating fork (or even a foreground SAVE) during a transaction breaks the atomicity of the transaction.
In addition to that, it could mess up the propagated transaction to the AOF file.
This change blocks SAVE, PSYNC, SYNC and SHUTDOWN from being executed inside MULTI-EXEC.
It does that by adding a command flag, so that modules can flag their commands with that flag too.
Besides it changes BGSAVE, BGREWRITEAOF, and CONFIG SET appendonly, to turn the
scheduled flag instead of forking righ taway.
Other changes:
* expose `protected`, `no-async-loading`, and `no_multi` flags in COMMAND command
* add a test to validate propagation of FLUSHALL inside a transaction.
* add a test to validate how CONFIG SET that errors reacts in a transaction
Co-authored-by: Oran Agra <oran@redislabs.com>
This would mean that the effects of `CONFIG SET maxmemory` may not be visible once the command returns.
That could anyway happen since incremental eviction was added in redis 6.2 (see #7653)
We do this to fix one of the propagation bugs about eviction see #9890 and #10014.
Implement Multi-Part AOF mechanism to avoid overheads during AOFRW.
Introducing a folder with multiple AOF files tracked by a manifest file.
The main issues with the the original AOFRW mechanism are:
* buffering of commands that are processed during rewrite (consuming a lot of RAM)
* freezes of the main process when the AOFRW completes to drain the remaining part of the buffer and fsync it.
* double disk IO for the data that arrives during AOFRW (had to be written to both the old and new AOF files)
The main modifications of this PR:
1. Remove the AOF rewrite buffer and related code.
2. Divide the AOF into multiple files, they are classified as two types, one is the the `BASE` type,
it represents the full amount of data (Maybe AOF or RDB format) after each AOFRW, there is only
one `BASE` file at most. The second is `INCR` type, may have more than one. They represent the
incremental commands since the last AOFRW.
3. Use a AOF manifest file to record and manage these AOF files mentioned above.
4. The original configuration of `appendfilename` will be the base part of the new file name, for example:
`appendonly.aof.1.base.rdb` and `appendonly.aof.2.incr.aof`
5. Add manifest-related TCL tests, and modified some existing tests that depend on the `appendfilename`
6. Remove the `aof_rewrite_buffer_length` field in info.
7. Add `aof-disable-auto-gc` configuration. By default we're automatically deleting HISTORY type AOFs.
It also gives users the opportunity to preserve the history AOFs. just for testing use now.
8. Add AOFRW limiting measure. When the AOFRW failures reaches the threshold (3 times now),
we will delay the execution of the next AOFRW by 1 minute. If the next AOFRW also fails, it will be
delayed by 2 minutes. The next is 4, 8, 16, the maximum delay is 60 minutes (1 hour). During the limit
period, we can still use the 'bgrewriteaof' command to execute AOFRW immediately.
9. Support upgrade (load) data from old version redis.
10. Add `appenddirname` configuration, as the directory name of the append only files. All AOF files and
manifest file will be placed in this directory.
11. Only the last AOF file (BASE or INCR) can be truncated. Otherwise redis will exit even if
`aof-load-truncated` is enabled.
Co-authored-by: Oran Agra <oran@redislabs.com>
This commit implements a sharded pubsub implementation based off of shard channels.
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
To avoid data loss, this commit adds a grace period for lagging replicas to
catch up the replication offset.
Done:
* Wait for replicas when shutdown is triggered by SIGTERM and SIGINT.
* Wait for replicas when shutdown is triggered by the SHUTDOWN command. A new
blocked client type BLOCKED_SHUTDOWN is introduced, allowing multiple clients
to call SHUTDOWN in parallel.
Note that they don't expect a response unless an error happens and shutdown is aborted.
* Log warning for each replica lagging behind when finishing shutdown.
* CLIENT_PAUSE_WRITE while waiting for replicas.
* Configurable grace period 'shutdown-timeout' in seconds (default 10).
* New flags for the SHUTDOWN command:
- NOW disables the grace period for lagging replicas.
- FORCE ignores errors writing the RDB or AOF files which would normally
prevent a shutdown.
- ABORT cancels ongoing shutdown. Can't be combined with other flags.
* New field in the output of the INFO command: 'shutdown_in_milliseconds'. The
value is the remaining maximum time to wait for lagging replicas before
finishing the shutdown. This field is present in the Server section **only**
during shutdown.
Not directly related:
* When shutting down, if there is an AOF saving child, it is killed **even** if AOF
is disabled. This can happen if BGREWRITEAOF is used when AOF is off.
* Client pause now has end time and type (WRITE or ALL) per purpose. The
different pause purposes are *CLIENT PAUSE command*, *failover* and
*shutdown*. If clients are unpaused for one purpose, it doesn't affect client
pause for other purposes. For example, the CLIENT UNPAUSE command doesn't
affect client pause initiated by the failover or shutdown procedures. A completed
failover or a failed shutdown doesn't unpause clients paused by the CLIENT
PAUSE command.
Notes:
* DEBUG RESTART doesn't wait for replicas.
* We already have a warning logged when a replica disconnects. This means that
if any replica connection is lost during the shutdown, it is either logged as
disconnected or as lagging at the time of exit.
Co-authored-by: Oran Agra <oran@redislabs.com>
Preventing COFIG SET maxmemory from propagating is just the tip of the iceberg.
Module that performs a write operation in a notification can cause any
command to be propagated, based on server.dirty
We need to come up with a better solution.
It turns out that libc malloc can return an allocation of a different size on requests of the same size.
this means that matching MEMORY USAGE of one key to another copy of the same data can fail.
Solution:
Keep running the test that calls MEMORY USAGE, but ignore the response.
We do that by introducing a new utility function to get the memory usage, which always returns 1
when the allocator is not jemalloc.
Other changes:
Some formatting for datatype2.tcl
Co-authored-by: Oran Agra <oran@redislabs.com>
Following #9656, this script generates a "commands.json" file from the output
of the new COMMAND. The output of this script is used in redis/redis-doc#1714
and by redis/redis-io#259. This also converts a couple of rogue dashes (in
'key-specs' and 'multiple-token' flags) to underscores (continues #9959).
PR #9890 may have introduced a problem.
There are tests that use MULTI-EXEC to make sure two BGSAVE / BGREWRITEAOF are executed together.
But now it's not valid to run run commands that create a snapshot inside a transaction (gonna be blocked soon)
This PR modifies the test not to rely on MULTI-EXEC.
Co-authored-by: Oran Agra <oran@redislabs.com>
There's a race between testing DBSIZE and the thread starting.
If the thread hadn't started by the time we checked DBISZE, no
keys will have been evicted.
The correct way is to check the evicted_keys stat.
Follow the conclusions to support Functions in redis cluster (#9899)
Added 2 new FUNCTION sub-commands:
1. `FUNCTION DUMP` - dump a binary payload representation of all the functions.
2. `FUNCTION RESTORE <PAYLOAD> [FLUSH|APPEND|REPLACE]` - give the binary payload extracted
using `FUNCTION DUMP`, restore all the functions on the given payload. Restore policy can be given to
control how to handle existing functions (default is APPEND):
* FLUSH: delete all existing functions.
* APPEND: appends the restored functions to the existing functions. On collision, abort.
* REPLACE: appends the restored functions to the existing functions. On collision,
replace the old function with the new function.
Modify `redis-cli --cluster add-node` to use `FUNCTION DUMP` to get existing functions from
one of the nodes in the cluster, and `FUNCTION RESTORE` to load the same set of functions
to the new node. `redis-cli` will execute this step before sending the `CLUSTER MEET` command
to the new node. If `FUNCTION DUMP` returns an error, assume the current Redis version do not
support functions and skip `FUNCTION RESTORE`. If `FUNCTION RESTORE` fails, abort and do not send
the `CLUSTER MEET` command. If the new node already contains functions (before the `FUNCTION RESTORE`
is sent), abort and do not add the node to the cluster. Test was added to verify
`redis-cli --cluster add-node` works as expected.
Use case insensitive string comparison for function names (like we do for commands and configs)
In addition, add verification that the functions only use the following characters: [a-zA-Z0-9_]
The mess:
Some parts use alsoPropagate for late propagation, others using an immediate one (propagate()),
causing edge cases, ugly/hacky code, and the tendency for bugs
The basic idea is that all commands are propagated via alsoPropagate (i.e. added to a list) and the
top-most call() is responsible for going over that list and actually propagating them (and wrapping
them in MULTI/EXEC if there's more than one command). This is done in the new function,
propagatePendingCommands.
Callers to propagatePendingCommands:
1. top-most call() (we want all nested call()s to add to the also_propagate array and just the top-most
one to propagate them) - via `afterCommand`
2. handleClientsBlockedOnKeys: it is out of call() context and it may propagate stuff - via `afterCommand`.
3. handleClientsBlockedOnKeys edge case: if the looked-up key is already expired, we will propagate the
expire but will not unblock any client so `afterCommand` isn't called. in that case, we have to propagate
the deletion explicitly.
4. cron stuff: active-expire and eviction may also propagate stuff
5. modules: the module API allows to propagate stuff from just about anywhere (timers, keyspace notifications,
threads). I could have tried to catch all the out-of-call-context places but it seemed easier to handle it in one
place: when we free the context. in the spirit of what was done in call(), only the top-most freeing of a module
context may cause propagation.
6. modules: when using a thread-safe ctx it's not clear when/if the ctx will be freed. we do know that the module
must lock the GIL before calling RM_Replicate/RM_Call so we propagate the pending commands when
releasing the GIL.
A "known limitation", which were actually a bug, was fixed because of this commit (see propagate.tcl):
When using a mix of RM_Call with `!` and RM_Replicate, the command would propagate out-of-order:
first all the commands from RM_Call, and then the ones from RM_Replicate
Another thing worth mentioning is that if, in the past, a client would issue a MULTI/EXEC with just one
write command the server would blindly propagate the MULTI/EXEC too, even though it's redundant.
not anymore.
This commit renames propagate() to propagateNow() in order to cause conflicts in pending PRs.
propagatePendingCommands is the only caller of propagateNow, which is now a static, internal helper function.
Optimizations:
1. alsoPropagate will not add stuff to also_propagate if there's no AOF and replicas
2. alsoPropagate reallocs also_propagagte exponentially, to save calls to memmove
Bugfixes:
1. CONFIG SET can create evictions, sending notifications which can cause to dirty++ with modules.
we need to prevent it from propagating to AOF/replicas
2. We need to set current_client in RM_Call. buggy scenario:
- CONFIG SET maxmemory, eviction notifications, module hook calls RM_Call
- assertion in lookupKey crashes, because current_client has CONFIG SET, which isn't CMD_WRITE
3. minor: in eviction, call propagateDeletion after notification, like active-expire and all commands
(we always send a notification before propagating the command)
## background
Till now CONFIG SET was blocked during loading.
(In the not so distant past, GET was disallowed too)
We recently (not released yet) added an async-loading mode, see #9323,
and during that time it'll serve CONFIG SET and any other command.
And now we realized (#9770) that some configs, and commands are dangerous
during async-loading.
## changes
* Allow most CONFIG SET during loading (both on async-loading and normal loading)
* Allow CONFIG REWRITE and CONFIG RESETSTAT during loading
* Block a few config during loading (`appendonly`, `repl-diskless-load`, and `dir`)
* Block a few commands during loading (list below)
## the blocked commands:
* SAVE - obviously we don't wanna start a foregreound save during loading 8-)
* BGSAVE - we don't mind to schedule one, but we don't wanna fork now
* BGREWRITEAOF - we don't mind to schedule one, but we don't wanna fork now
* MODULE - we obviously don't wanna unload a module during replication / rdb loading
(MODULE HELP and MODULE LIST are not blocked)
* SYNC / PSYNC - we're in the middle of RDB loading from master, must not allow sync
requests now.
* REPLICAOF / SLAVEOF - we're in the middle of replicating, maybe it makes sense to let
the user abort it, but he couldn't do that so far, i don't wanna take any risk of bugs due to odd state.
* CLUSTER - only allow [HELP, SLOTS, NODES, INFO, MYID, LINKS, KEYSLOT, COUNTKEYSINSLOT,
GETKEYSINSLOT, RESET, REPLICAS, COUNT_FAILURE_REPORTS], for others, preserve the status quo
## other fixes
* processEventsWhileBlocked had an issue when being nested, this could happen with a busy script
during async loading (new), but also in a busy script during AOF loading (old). this lead to a crash in
the scenario described in #6988
If a test fails at `wait_for_blocked_clients_count` after the `PAUSE` command,
It won't send `UNPAUSE` to server, leading to the server hanging until timeout,
which is bad and hard to debug sometimes when developing.
This PR tries to fix this.
Timeout in `CLIENT PAUSE` shortened from 1e5 seconds(extremely long) to 50~100 seconds.
The issue with MAY_REPLICATE is that all automatic mechanisms to handle
write commands will not work. This require have a special treatment for:
* Not allow those commands to be executed on RO replica.
* Allow those commands to be executed on RO replica from primary connection.
* Allow those commands to be executed on the RO replica from AOF.
By setting those commands as WRITE commands we are getting all those properties from Redis.
Test was added to verify that those properties work as expected.
In addition, rearrange when and where functions are flushed. Before this PR functions were
flushed manually on `rdbLoadRio` and cleaned manually on failure. This contradicts the
assumptions that functions are data and need to be created/deleted alongside with the
data. A side effect of this, for example, `debug reload noflush` did not flush the data but
did flush the functions, `debug loadaof` flush the data but not the functions.
This PR move functions deletion into `emptyDb`. `emptyDb` (renamed to `emptyData`) will
now accept an additional flag, `NOFUNCTIONS` which specifically indicate that we do not
want to flush the functions (on all other cases, functions will be flushed). Used the new flag
on FLUSHALL and FLUSHDB only! Tests were added to `debug reload` and `debug loadaof`
to verify that functions behave the same as the data.
Notice that because now functions will be deleted along side with the data we can not allow
`CLUSTER RESET` to be called from within a function (it will cause the function to be released
while running), this PR adds `NO_SCRIPT` flag to `CLUSTER RESET` so it will not be possible
to be called from within a function. The other cluster commands are allowed from within a
function (there are use-cases that uses `GETKEYSINSLOT` to iterate over all the keys on a
given slot). Tests was added to verify `CLUSTER RESET` is denied from within a script.
Another small change on this PR is that `RDBFLAGS_ALLOW_DUP` is also applicable on functions.
When loading functions, if this flag is set, we will replace old functions with new ones on collisions.
# Background
The main goal of this PR is to remove relevant logics on Lua script verbatim replication,
only keeping effects replication logic, which has been set as default since Redis 5.0.
As a result, Lua in Redis 7.0 would be acting the same as Redis 6.0 with default
configuration from users' point of view.
There are lots of reasons to remove verbatim replication.
Antirez has listed some of the benefits in Issue #5292:
>1. No longer need to explain to users side effects into scripts.
They can do whatever they want.
>2. No need for a cache about scripts that we sent or not to the slaves.
>3. No need to sort the output of certain commands inside scripts
(SMEMBERS and others): this both simplifies and gains speed.
>4. No need to store scripts inside the RDB file in order to startup correctly.
>5. No problems about evicting keys during the script execution.
When looking back at Redis 5.0, antirez and core team decided to set the config
`lua-replicate-commands yes` by default instead of removing verbatim replication
directly, in case some bad situations happened. 3 years later now before Redis 7.0,
it's time to remove it formally.
# Changes
- configuration for lua-replicate-commands removed
- created config file stub for backward compatibility
- Replication script cache removed
- this is useless under script effects replication
- relevant statistics also removed
- script persistence in RDB files is also removed
- Propagation of SCRIPT LOAD and SCRIPT FLUSH to replica / AOF removed
- Deterministic execution logic in scripts removed (i.e. don't run write commands
after random ones, and sorting output of commands with random order)
- the flags indicating which commands have non-deterministic results are kept as hints to clients.
- `redis.replicate_commands()` & `redis.set_repl()` changed
- now `redis.replicate_commands()` does nothing and return an 1
- ...and then `redis.set_repl()` can be issued before `redis.replicate_commands()` now
- Relevant TCL cases adjusted
- DEBUG lua-always-replicate-commands removed
# Other changes
- Fix a recent bug comparing CLIENT_ID_AOF to original_client->flags instead of id. (introduced in #9780)
Co-authored-by: Oran Agra <oran@redislabs.com>
- add needs:debug flag for some tests
- disable "save" in external tests (speedup?)
- use debug_digest proc instead of debug command directly so it can be skipped
- use OBJECT ENCODING instead of DEBUG OBJECT to get encoding
- add a proc for OBJECT REFCOUNT so it can be skipped
- move a bunch of tests in latency_monitor tests to happen later so that latency monitor has some values in it
- add missing close_replication_stream calls
- make sure to close the temp client if DEBUG LOG fails
Block sensitive configs and commands by default.
* `enable-protected-configs` - block modification of configs with the new `PROTECTED_CONFIG` flag.
Currently we add this flag to `dbfilename`, and `dir` configs,
all of which are non-mutable configs that can set a file redis will write to.
* `enable-debug-command` - block the `DEBUG` command
* `enable-module-command` - block the `MODULE` command
These have a default value set to `no`, so that these features are not
exposed by default to client connections, and can only be set by modifying the config file.
Users can change each of these to either `yes` (allow all access), or `local` (allow access from
local TCP connections and unix domain connections)
Note that this is a **breaking change** (specifically the part about MODULE command being disabled by default).
I.e. we don't consider DEBUG command being blocked as an issue (people shouldn't have been using it),
and the few configs we protected are unlikely to have been set at runtime anyway.
On the other hand, it's likely to assume some users who use modules, load them from the config file anyway.
Note that's the whole point of this PR, for redis to be more secure by default and reduce the attack surface on
innocent users, so secure defaults will necessarily mean a breaking change.
some languages can build a json-like object by parsing a textual json,
but it works poorly when attributes contain hyphens
example in JS:
```
let j = JSON.parse(json)
j['key-name'] <- works
j.key-name <= illegal syntax
```
Introduce memory management on cluster link buffers:
* Introduce a new `cluster-link-sendbuf-limit` config that caps memory usage of cluster bus link send buffers.
* Introduce a new `CLUSTER LINKS` command that displays current TCP links to/from peers.
* Introduce a new `mem_cluster_links` field under `INFO` command output, which displays the overall memory usage by all current cluster links.
* Introduce a new `total_cluster_links_buffer_limit_exceeded` field under `CLUSTER INFO` command output, which displays the accumulated count of cluster links freed due to `cluster-link-sendbuf-limit`.
Added `FUNCTION FLUSH` command. The new sub-command allows delete all the functions.
An optional `[SYNC|ASYNC]` argument can be given to control whether or not to flush the
functions synchronously or asynchronously. if not given the default flush mode is chosen by
`lazyfree-lazy-user-flush` configuration values.
Add the missing `functions.tcl` test to the list of tests that are executed in test_helper.tcl,
and call FUNCTION FLUSH in between servers in external mode
Support doing `CONFIG GET <x> <y> <z>`, each of them can also be
a pattern with wildcards.
This avoids duplicates in the result by looping over the configs and for
each once checking all the patterns, once a match is found for a pattern
we move on to the next config.
Delete the hardcoded command table and replace it with an auto-generated table, based
on a JSON file that describes the commands (each command must have a JSON file).
These JSON files are the SSOT of everything there is to know about Redis commands,
and it is reflected fully in COMMAND INFO.
These JSON files are used to generate commands.c (using a python script), which is then
committed to the repo and compiled.
The purpose is:
* Clients and proxies will be able to get much more info from redis, instead of relying on hard coded logic.
* drop the dependency between Redis-user and the commands.json in redis-doc.
* delete help.h and have redis-cli learn everything it needs to know just by issuing COMMAND (will be
done in a separate PR)
* redis.io should stop using commands.json and learn everything from Redis (ultimately one of the release
artifacts should be a large JSON, containing all the information about all of the commands, which will be
generated from COMMAND's reply)
* the byproduct of this is:
* module commands will be able to provide that info and possibly be more of a first-class citizens
* in theory, one may be able to generate a redis client library for a strictly typed language, by using this info.
### Interface changes
#### COMMAND INFO's reply change (and arg-less COMMAND)
Before this commit the reply at index 7 contained the key-specs list
and reply at index 8 contained the sub-commands list (Both unreleased).
Now, reply at index 7 is a map of:
- summary - short command description
- since - debut version
- group - command group
- complexity - complexity string
- doc-flags - flags used for documentation (e.g. "deprecated")
- deprecated-since - if deprecated, from which version?
- replaced-by - if deprecated, which command replaced it?
- history - a list of (version, what-changed) tuples
- hints - a list of strings, meant to provide hints for clients/proxies. see https://github.com/redis/redis/issues/9876
- arguments - an array of arguments. each element is a map, with the possibility of nesting (sub-arguments)
- key-specs - an array of keys specs (already in unstable, just changed location)
- subcommands - a list of sub-commands (already in unstable, just changed location)
- reply-schema - will be added in the future (see https://github.com/redis/redis/issues/9845)
more details on these can be found in https://github.com/redis/redis-doc/pull/1697
only the first three fields are mandatory
#### API changes (unreleased API obviously)
now they take RedisModuleCommand opaque pointer instead of looking up the command by name
- RM_CreateSubcommand
- RM_AddCommandKeySpec
- RM_SetCommandKeySpecBeginSearchIndex
- RM_SetCommandKeySpecBeginSearchKeyword
- RM_SetCommandKeySpecFindKeysRange
- RM_SetCommandKeySpecFindKeysKeynum
Currently, we did not add module API to provide additional information about their commands because
we couldn't agree on how the API should look like, see https://github.com/redis/redis/issues/9944.
### Somehow related changes
1. Literals should be in uppercase while placeholder in lowercase. Now all the GEO* command
will be documented with M|KM|FT|MI and can take both lowercase and uppercase
### Unrelated changes
1. Bugfix: no_madaory_keys was absent in COMMAND's reply
2. expose CMD_MODULE as "module" via COMMAND
3. have a dedicated uint64 for ACL categories (instead of having them in the same uint64 as command flags)
Co-authored-by: Itamar Haber <itamar@garantiadata.com>
When CONFIG SET fails, print the name of the config that failed.
This is helpful since config set is now variadic.
however, there are cases where several configs have the same apply
function, and we can't be sure which one of them caused the failure.
This caused a crash when adding elements larger than 2GB to a set (same goes for hash keys). See #8455.
Details:
* The fix makes the dict hash functions receive a `size_t` instead of an `int`. In practice the dict hash functions
call siphash which receives a `size_t` and the callers to the hash function pass a `size_t` to it so the fix is trivial.
* The issue was recreated by attempting to add a >2gb value to a set. Appropriate tests were added where I create
a set with large elements and check basic functionality on it (SADD, SCARD, SPOP, etc...).
* When I added the tests I also refactored a bit all the tests code which is run under the `--large-memory` flag.
This removed code duplication for the test framework's `write_big_bulk` and `write_big_bulk` code and also takes
care of not allocating the test frameworks helper huge string used by these tests when not run under `--large-memory`.
* I also added the _violoations.tcl_ unit tests to be part of the entire test suite and leaned up non relevant list related
tests that were in there. This was done in this PR because most of the _violations_ tests are "large memory" tests.
A test failure was reported in Daily CI (FreeBSD).
`XREAD: XADD + DEL should not awake client`
```
*** [err]: XREAD: XADD + DEL should not awake client in tests/unit/type/stream.tcl
Expected [lindex 0 0] eq {s1} (context: type eval line 11 cmd {assert {[lindex $res 0 0] eq {s1}}} proc ::test)
```
It seems that `r` is executed before `rd` enters the blocking
state. And ended up getting a empty reply by timeout.
We use `wait_for_blocked_clients_count` to wait for the
blocking client to be ready and avoid this situation.
Also fixed other test cases that may have the same issue.
Added `HIDDEN_CONFIG` to hide debug / dev / testing configs from CONFIG GET
when it is used with a wildcard.
These are not documented in redis.conf so now CONFIG GET only works when they
are explicitly specified.
The current configs are:
```
key-load-delay
loading-process-events-interval-bytes
rdb-key-save-delay
use-exit-on-panic
watchdog-period
```
When disabling redis oom-score-adj managment we restore the
base value read before enabling oom-score-adj management.
This fixes an issue introduced in #9748 where updating
`oom-score-adj-values` while `oom-score-adj` was set to `no`
would write the base oom score adj value read on startup to `/proc`.
This is a bug since while `oom-score-adj` is disabled we should
never write to proc and let external processes manage it.
Added appropriate tests.
# Redis Function
This PR added the Redis Functions capabilities that were suggested on #8693.
The PR also introduce a big refactoring to the current Lua implementation
(i.e `scripting.c`). The main purpose of the refactoring is to have better
code sharing between the Lua implementation that exists today on Redis
(`scripting.c`) and the new Lua engine that is introduced on this PR.
The refactoring includes code movements and file name changes as well as some
logic changes that need to be carefully reviewed. To make the review easier,
the PR was split into multiple commits. Each commit is deeply described later on
but the main concept is that some commits are just moving code around without
making any logical changes, those commits are less likely to cause any issues
or regressions and can be reviewed fast. Other commits, which perform code and
logic changes, need to be reviewed carefully, but those commits were created
after the code movements so it's pretty easy to see what was changed. To sum up,
it is highly recommended to review this PR commit by commit as it will be easier
to see the changes, it is also recommended to read each commit description
(written below) to understand what was changed on the commit and whether or not
it's just a huge code movement or a logic changes.
## Terminology
Currently, the terminology in Redis is not clearly defined. Scripts refer to Lua
scripts and eval also refers only to Lua. Introducing Redis Function requires
redefining those terms to be able to clearly understand what is been discussed
on each context.
* eval - legacy Lua script implementation.
* Function - new scripting implementation (currently implemented in Lua but in
the future, it might be other languages like javascript).
* Engine - the component that is responsible for executing functions.
* Script - Function or legacy Lua (executed with `eval` or `evalsha`)
## Refactoring New Structure
Today, the entire scripting logic is located on `scripting.c`. This logic can
be split into 3 main groups:
1. Script management - responsible for storing the scripts that were sent to
Redis and retrieving them when they need to be run (base on the script sha
on the current implementation).
2. Script invocation - invoke the script given on `eval` or `evalsha` command
(this part includes finding the relevant script, preparing the arguments, ..)
3. Interact back with Redis (command invocation)
Those 3 groups are tightly coupled on `scripting.c`. Redis Functions also need
to use those groups logics, for example, to interact back with Redis or to
execute Lua code. The refactoring attempts to split those 3 groups and define
APIs so that we can reuse the code both on legacy Lua scripts and Redis Functions.
In order to do so we define the following units:
1. script.c: responsible for interaction with Redis from within a script.
2. script_lua.c: responsible to execute Lua code, uses `script.c` to interact
with Redis from within the Lua code.
3. function_lua.c: contains the Lua engine implementation, uses `script_lua.c`
to execute the Lua code.
4. functions.c: Contains Redis Functions implementation (`FUNCTION` command,),
uses `functions_lua.c` if the function it wants to invoke needs the Lua
engine.
4. eval.c: the original `scripting.c` contains the Lua legacy implementation and
was refactored to use `script_lua.c` to invoke the Lua code.
## Commits breakdown
Notice: Some small commits are omitted from this list as they are small and
insignificant (for example build fixes)
### First commit - code movements
This commit rename `scripting.c` -> `eval.c` and introduce the new `script_lua.c`
unit. The commit moves relevant code from `eval.c` (`scripting.c`) to
`script_lua.c`, the purpose of moving the code is so that later we will be able
to re-use the code on the Lua engine (`function_lua.c`). The commit only moves
the code without modifying even a single line, so there is a very low risk of
breaking anything and it also makes it much easier to see the changes on the
following commits.
Because the commit does not change the code (only moves it), it does not compile.
But we do not care about it as the only purpose here is to make the review
processes simpler.
### Second commit - move legacy Lua variables into `eval.c`
Today, all Lua-related variables are located on the server struct. The commit
attempt to identify those variable and take them out from the server struct,
leaving only script related variables (variables that later need to be used
also by engines)
The following variable where renamed and left on the server struct:
* lua_caller -> script_caller
* lua_time_limit -> script_time_limit
* lua_timedout -> script_timedout
* lua_oom -> script_oom
* lua_disable_deny_script -> script_disable_deny_script
* in_eval -> in_script
The following variables where moved to lctx under eval.c
* lua
* lua_client
* lua_cur_script
* lua_scripts
* lua_scripts_mem
* lua_replicate_commands
* lua_write_dirty
* lua_random_dirty
* lua_multi_emitted
* lua_repl
* lua_kill
* lua_time_start
* lua_time_snapshot
This commit is in a low risk of introducing any issues and it is just moving
variables around and not changing any logic.
### Third commit - introducing script unit
This commit introduces the `script.c` unit. Its purpose (as described above) is
to provide an API for scripts to interact with Redis. Interaction includes
mostly executing commands, but also other functionalities. The interaction is
done using a `ScriptRunCtx` object that needs to be created by the user and
initialized using `scriptPrepareForRun`. A detailed list of functionalities
expose by the unit:
1. Calling commands (including all the validation checks such as
acl, cluster, read only run, ...)
2. Set Resp
3. Set Replication method (AOF/REPLICATION/NONE)
4. Call Redis back on long-running scripts to allow Redis to reply to clients
and perform script kill
The commit introduces the new unit and uses it on eval commands to interact with
Redis.
### Fourth commit - Moved functionality of invoke Lua code to `script_lua.c`
This commit moves the logic of invoking the Lua code into `script_lua.c` so
later it can be used also by Lua engine (`function_lua.c`). The code is located
on `callFunction` function and assumes the Lua function already located on the
top of the Lua stack. This commit also change `eval.c` to use the new
functionality to invoke Lua code.
### Fith commit - Added Redis Functions unit (`functions.c`) and Lua engine
(`function_lua.c`)
Added Redis Functions unit under `functions.c`, included:
1. FUNCTION command:
* FUNCTION CREATE
* FUNCTION CALL
* FUNCTION DELETE
* FUNCTION KILL
* FUNCTION INFO
* FUNCTION STATS
2. Register engines
In addition, this commit introduces the first engine that uses the Redis
Functions capabilities, the Lua engine (`function_lua.c`)
## API Changes
### `lua-time-limit`
configuration was renamed to `script-time-limit` (keep `lua-time-limit` as alias
for backward compatibility).
### Error log changes
When integrating with Redis from within a Lua script, the `Lua` term was removed
from all the error messages and instead we write only `script`. For example:
`Wrong number of args calling Redis command From Lua script` -> `Wrong number
of args calling Redis command From script`
### `info memory` changes:
Before stating all the changes made to memory stats we will try to explain the
reason behind them and what we want to see on those metrics:
* memory metrics should show both totals (for all scripting frameworks), as well
as a breakdown per framework / vm.
* The totals metrics should have "human" metrics while the breakdown shouldn't.
* We did try to maintain backward compatibility in some way, that said we did
make some repurpose to existing metrics where it looks reasonable.
* We separate between memory used by the script framework (part of redis's
used_memory), and memory used by the VM (not part of redis's used_memory)
A full breakdown of `info memory` changes:
* `used_memory_lua` and `used_memory_lua_human` was deprecated,
`used_memory_vm_eval` has the same meaning as `used_memory_lua`
* `used_memory_scripts` was renamed to `used_memory_scripts_eval`
* `used_memory_scripts` and `used_memory_scripts_human` were repurposed and now
return the total memory used by functions and eval (not including vm memory,
only code cache, and structs).
* `used_memory_vm_function` was added and represents the total memory used by
functions vm's
* `used_memory_functions` was added and represents the total memory by functions
(not including vm memory, only code cache, and structs)
* `used_memory_vm_total` and `used_memory_vm_total_human` was added and
represents the total memory used by vm's (functions and eval combined)
### `functions.caches`
`functions.caches` field was added to `memory stats`, representing the memory
used by engines that are not functions (this memory includes data structures
like dictionaries, arrays, ...)
## New API
### FUNCTION CREATE
Usage: FUNCTION CREATE `ENGINE` `NAME` `[REPLACE]` `[DESC <DESCRIPTION>]` `<CODE>`
* `ENGINE` - The name of the engine to use to create the script.
* `NAME` - the name of the function that can be used later to call the function
using `FUNCTION CALL` command.
* `REPLACE` - if given, replace the given function with the existing function
(if exists).
* `DESCRIPTION` - optional argument describing the function and what it does
* `CODE` - function code.
The command will return `OK` if created successfully or error in the following
cases:
* The given engine name does not exist
* The function name is already taken and `REPLACE` was not used.
* The given function failed on the compilation.
### FCALL and FCALL_RO
Usage: FCALL/FCALL_RO `NAME` `NUM_KEYS key1 key2` … ` arg1 arg2`
Call and execute the function specified by `NAME`. The function will receive
all arguments given after `NUM_KEYS`. The return value from the function will
be returned to the user as a result.
* `NAME` - Name of the function to run.
* The rest is as today with EVALSHA command.
The command will return an error in the following cases:
* `NAME` does not exist
* The function itself returned an error.
The `FCALL_RO` is equivalent to `EVAL_RO` and allows only read-only commands to
be invoked from the script.
### FUNCTION DELETE
Usage: FUNCTION DELETE `NAME`
Delete a function identified by `NAME`. Return `OK` on success or error on one
of the following:
* The given function does not exist
### FUNCTION INFO
Usage: FUNCTION INFO `NAME` [WITHCODE]
Return information about a function by function name:
* Function name
* Engine name
* Description
* Raw code (only if WITHCODE argument is given)
### FUNCTION LIST
Usage: FUNCTION LIST
Return general information about all the functions:
* Function name
* Engine name
* Description
### FUNCTION STATS
Usage: FUNCTION STATS
Return information about the current running function:
* Function name
* Command that was used to invoke the function
* Duration in MS that the function is already running
If no function is currently running, this section is just a RESP nil.
Additionally, return a list of all the available engines.
### FUNCTION KILL
Usage: `FUNCTION KILL`
Kill the currently executing function. The command will fail if the function
already initiated a write command.
## Notes
Note: Function creation/deletion is replicated to AOF but AOFRW is not
implemented sense its going to be removed: #9794
Redis function unit is located inside functions.c
and contains Redis Function implementation:
1. FUNCTION commands:
* FUNCTION CREATE
* FCALL
* FCALL_RO
* FUNCTION DELETE
* FUNCTION KILL
* FUNCTION INFO
2. Register engine
In addition, this commit introduce the first engine
that uses the Redis Function capabilities, the
Lua engine.
After the introduction of `Multiparam config set` in #9748,
there are two tests cases failed.
```
[exception]: Executing test client: ERR Config set failed - Failed to set current oom_score_adj. Check server logs..
ERR Config set failed - Failed to set current oom_score_adj. Check server logs.
```
`CONFIG sanity` test failed on the `config set oom-score-adj-values`
which is a "special" config that does not catch no-op changes.
And then it will update `oom-score-adj` which not supported in
MacOs. We solve it by adding `oom-score*` to the `skip_configs` list.
```
*** [err]: CONFIG SET rollback on apply error in tests/unit/introspection.tcl
Expected an error but nothing was caught
```
`CONFIG SET rollback on apply error` test failed on the
`config set port $used_port`. In theory, it should throw the
error `Unable to listen on this port*`. But it failed on MacOs.
We solve it by adding `-myaddr 127.0.0.1` to the socket call.
Script unit is a new unit located on script.c.
Its purpose is to provides an API for functions (and eval)
to interact with Redis. Interaction includes mostly
executing commands, but also functionalities like calling
Redis back on long scripts or check if the script was killed.
The interaction is done using a scriptRunCtx object that
need to be created by the user and initialized using scriptPrepareForRun.
Detailed list of functionalities expose by the unit:
1. Calling commands (including all the validation checks such as
acl, cluster, read only run, ...)
2. Set Resp
3. Set Replication method (AOF/REPLICATION/NONE)
4. Call Redis back to on long running scripts to allow Redis reply
to clients and perform script kill
The commit introduce the new unit and uses it on eval commands to
interact with Redis.
We can now do: `config set maxmemory 10m repl-backlog-size 5m`
## Basic algorithm to support "transaction like" config sets:
1. Backup all relevant current values (via get).
2. Run "verify" and "set" on everything, if we fail run "restore".
3. Run "apply" on everything (optional optimization: skip functions already run). If we fail run "restore".
4. Return success.
### restore
1. Run set on everything in backup. If we fail log it and continue (this puts us in an undefined
state but we decided it's better than the alternative of panicking). This indicates either a bug
or some unsupported external state.
2. Run apply on everything in backup (optimization: skip functions already run). If we fail log
it (see comment above).
3. Return error.
## Implementation/design changes:
* Apply function are idempotent (have no effect if they are run more than once for the same config).
* No indication in set functions if we're reading the config or running from the `CONFIG SET` command
(removed `update` argument).
* Set function should set some config variable and assume an (optional) apply function will use that
later to apply. If we know this setting can be safely applied immediately and can always be reverted
and doesn't depend on any other configuration we can apply immediately from within the set function
(and not store the setting anywhere). This is the case of this `dir` config, for example, which has no
apply function. No apply function is need also in the case that setting the variable in the `server` struct
is all that needs to be done to make the configuration take effect. Note that the original concept of `update_fn`,
which received the old and new values was removed and replaced by the optional apply function.
* Apply functions use settings written to the `server` struct and don't receive any inputs.
* I take care that for the generic (non-special) configs if there's no change I avoid calling the setter (possible
optimization: avoid calling the apply function as well).
* Passing the same config parameter more than once to `config set` will fail. You can't do `config set my-setting
value1 my-setting value2`.
Note that getting `save` in the context of the conf file parsing to work here as before was a pain.
The conf file supports an aggregate `save` definition, where each `save` line is added to the server's
save params. This is unlike any other line in the config file where each line overwrites any previous
configuration. Since we now support passing multiple save params in a single line (see top comments
about `save` in https://github.com/redis/redis/pull/9644) we should deprecate the aggregate nature of
this config line and perhaps reduce this ugly code in the future.
Adds the ability to autogenerate the sequence part of the millisecond-only explicit ID specified for `XADD`. This is useful in case added entries have an externally-provided timestamp without sub-millisecond resolution.
The issue can only happened with a bad Lua script that claims to return
a big number while actually return data which is not a big number (contains
chars that are not digits). Such thing will not cause an issue unless the big
number value contains `\r\n` and then it messes the resp3 structure. The fix
changes all the appearances of '\r\n' with spaces.
Such an issue can also happened on simple string or error replies but those
already handle it the same way this PR does (replace `\r\n` with spaces).
Other replies type are not vulnerable to this issue because they are not
counting on free text that is terminated with `\r\n` (either it contains the
bulk length like string reply or they are typed reply that can not inject free
text like boolean or number).
The issue only exists on unstable branch, big number reply on Lua script
was not yet added to any official release.
* Fix CLIENT KILL kill all clients with id 0 or with skipme
CLIENT KILL with ID argument should only kill the client with the provided ID. In old code,
CLIENT KILL with id 0 will kill all the connected clients.
Co-authored-by: Ofir Luzon <ofirluzon@gmail.com>
This test relies on that `XREAD BLOCK 20000 STREAMS s1{t} s2{t} s3{t} $ $ $`
is executed by redis before `XADD s2{t} * new abcd1234`. A ` wait_for_blocked_client`
is needed between the two to ensure the order, otherwise `XADD s2{t} * new abcd1234`
might be executed first due to network delay causing a test failure.
Co-authored-by: xiaolei <xiaolei@91jkys.com>
This pr is following #9779 .
## Describe of feature
Now when we turn on the `list-compress-depth` configuration, the list will compress
the ziplist between `[list-compress-depth, -list-compress-depth]`.
When we need to use the compressed data, we will first decompress it, then use it,
and finally compress it again.
It's controlled by `quicklistNode->recompress`, which is designed to avoid the need to
re-traverse the entire quicklist for compression after each decompression, we only need
to recompress the quicklsitNode being used.
In order to ensure the correctness of recompressing, we should normally let
quicklistDecompressNodeForUse and quicklistCompress appear in pairs, otherwise,
it may lead to the head and tail being compressed or the middle ziplist not being
compressed correctly, which is exactly the problem this pr needs to solve.
## Solution
1. Reset `quicklistIter` after insert and replace.
The quicklist node will be compressed in `quicklistInsertAfter`, `quicklistInsertBefore`,
`quicklistReplaceAtIndex`, so we can safely reset the quicklistIter to avoid it being used again
2. `quicklistIndex` will return an iterator that can be used to recompress the current node after use.
## Test
1. In the `Stress Tester for #3343-Similar Errors` test, when the server crashes or when
`valgrind` or `asan` error is detected, print violating commands.
2. Add a crash test due to wrongly recompressing after `lrem`.
3. Remove `insert before with 0 elements` and `insert after with 0 elements`,
Now we forbid any operation on an NULL quicklistIter.
In order to test the situation where multiple clients are
blocked, we set up multiple clients to execute some blocking
commands. These tests depend on the order of command processing.
Those tests are based on the wrong assumption that the command
send first will be executed by the server first, which is obviously
wrong in some network delyas.
This commit ensures orderly execution of commands by waiting
and judging the number of blocked clients each time.
Fix#9850
This commit 0f8b634cd (CVE-2021-32626 released in 6.2.6, 6.0.16, 5.0.14)
fixes an invalid memory write issue by using `lua_checkstack` API to make
sure the Lua stack is not overflow. This fix was added on 3 places:
1. `luaReplyToRedisReply`
2. `ldbRedis`
3. `redisProtocolToLuaType`
On the first 2 functions, `lua_checkstack` is handled gracefully while the
last is handled with an assert and a statement that this situation can
not happened (only with misbehave module):
> the Redis reply might be deep enough to explode the LUA stack (notice
that currently there is no such command in Redis that returns such a nested
reply, but modules might do it)
The issue that was discovered is that user arguments is also considered part
of the stack, and so the following script (for example) make the assertion reachable:
```
local a = {}
for i=1,7999 do
a[i] = 1
end
return redis.call("lpush", "l", unpack(a))
```
This is a regression because such a script would have worked before and now
its crashing Redis. The solution is to clear the function arguments from the Lua
stack which makes the original assumption true and the assertion unreachable.
Writable replicas now no longer use the values of expired keys. Expired keys are
deleted when lookupKeyWrite() is used, even on a writable replica. Previously,
writable replicas could use the value of an expired key in write commands such
as INCR, SUNIONSTORE, etc..
This commit also sorts out the mess around the functions lookupKeyRead() and
lookupKeyWrite() so they now indicate what we intend to do with the key and
are not affected by the command calling them.
Multi-key commands like SUNIONSTORE, ZUNIONSTORE, COPY and SORT with the
store option now use lookupKeyRead() for the keys they're reading from (which will
not allow reading from logically expired keys).
This commit also fixes a bug where PFCOUNT could return a value of an
expired key.
Test modules commands have their readonly and write flags updated to correctly
reflect their lookups for reading or writing. Modules are not required to
correctly reflect this in their command flags, but this change is made for
consistency since the tests serve as usage examples.
Fixes#6842. Fixes#7475.
Remove lcsGetKeys to clean up the remaining STRALGO after #9733.
i.e. it still used a getkeys_proc which was still looking for the KEYS or STRINGS arguments
In #9323, when `repl-diskless-load` is enabled and set to `swapdb`,
if the master replication ID hasn't changed, we can load data-set
asynchronously, and serving read commands during the full resync.
In `diskless loading short read` test, after a loading successfully,
we will wait for the loading to stop and continue the for loop.
After the introduction of `async_loading`, we also need to check it.
Otherwise the next loop will start too soon, may trigger a timing issue.
In #8287, some overflow checks have been added. But when
`when *= 1000` overflows, it will become a positive number.
And the check not able to catch it. The key will be added with
a short expiration time and will deleted a few seconds later.
In #9601, will check the overflow after `*=` and return an
error first, and avoiding this situation.
In this commit, added some tests to cover those code paths.
Found it in #9825, and close it.
Some people complain that QUIT is missing from help/command table.
Not appearing in COMMAND command, command stats, ACL, etc.
and instead, there's a hack in processCommand with a comment that looks outdated.
Note that it is [documented](https://redis.io/commands/quit)
At the same time, HOST: and POST are there in the command table although these are not real commands.
They would appear in the COMMAND command, and even in commandstats.
Other changes:
1. Initialize the static logged_time static var in securityWarningCommand
2. add `no-auth` flag to RESET so it can always be executed.
The `PEXPIRE/PSETEX/PEXPIREAT can set sub-second expires` test is
a very time sensitive test, it used to occasionally fail on MacOS.
It will perform there internal tests in a loop, as long as one
fails, it will try to excute again in the next loop.
oranagra suggested that we can split it into three individual tests,
so that if one fails, we do not need to retry the others. And maybe
it will increase the chances of success dramatically.
Each is executed 500 times, and the number of retries is collected:
```
PSETEX, total: 500, sum: 745, min: 0, max: 13, avg: 1.49
PEXPIRE, total: 500, sum: 575, min: 0, max: 16, avg: 1.15
PEXPIREAT, total: 500, sum: 0, min: 0, max: 0, avg: 0.0
ALL(old_way), total: 500, sum: 8090, min: 0, max: 138, avg: 16.18
```
And we can see the threshold is very low.
Splitting the test also makes the code better to maintain.
Co-authored-by: Oran Agra <oran@redislabs.com>
Background:
Following the upgrade to jemalloc 5.2, there was a test that used to be flaky and
started failing consistently (on 32bit), so we disabled it (see #9645).
This is a test that i introduced in #7289 when i attempted to solve a rare stagnation
problem, and it later turned out i failed to solve it, ans what's more i added a test that
caused it to be not so rare, and as i mentioned, now in jemalloc 5.2 it became consistent on 32bit.
Stagnation can happen when all the slabs of the bin are equally utilized, so the decision
to move an allocation from a relatively empty slab to a relatively full one, will never
happen, and in that test all the slabs are at 50% utilization, so the defragger could just
keep scanning the keyspace and not move anything.
What this PR changes:
* First, finally in jemalloc 5.2 we have the count of non-full slabs, so when we compare
the utilization of the current slab, we can compare it to the average utilization of the non-full
slabs in our bin, instead of the total average of our bin. this takes the full slabs out of the game,
since they're not candidates for migration (neither source nor target).
* Secondly, We add some 12% (100/8) to the decision to defrag an allocation, this is the part
that aims to avoid stagnation, and it's especially important since the above mentioned change
can get us closer to stagnation.
* Thirdly, since jemalloc 5.2 adds sharded bins, we take into account all shards (something
that's missing from the original PR that merged it), this isn't expected to make any difference
since anyway there should be just one shard.
How this was benchmarked.
What i did was run the memefficiency test unit with `--verbose` and compare the defragger hits
and misses the tests reported.
At first, when i took into consideration only the non-full slabs, it got a lot worse (i got into
stagnation, or just got a lot of misses and a lot of hits), but when i added the 10% i got back
to results that were slightly better than the ones of the jemalloc 5.1 branch. i.e. full defragmentation
was achieved with fewer hits (relocations), and fewer misses (keyspace scans).
Recently we started using list-compress-depth in tests (was completely untested till now).
Turns this triggered test failures with the external mode, since the tests left the setting enabled
and then it was used in other tests (specifically the fuzzer named "Stress tester for #3343-alike bugs").
This PR fixes the issue of the `recompress` flag being left set by mistake, which caused the code to
later to compress the head or tail nodes (which should never be compressed)
The solution is to reset the recompress flag when it should have been (when it was decided not to compress).
Additionally we're adding some assertions and improve the tests so in order to catch other similar bugs.
Currently PING returns different status when server is not serving data,
for example when `LOADING` or `BUSY`.
But same was not true for `MASTERDOWN`
This commit makes PING reply with `MASTERDOWN` when
replica-serve-stale-data=no and link is MASTER is down.
Drop the STRALGO command, now LCS is a command of its own and it only works on keys (not input strings).
The motivation is that STRALGO's syntax was really messed-up...
- assumes all (future) string algorithms will take similar arguments
- mixes command that takes keys and one that doesn't in the same command.
- make it nearly impossible to expose the right key spec in COMMAND INFO (issues cluster clients)
- hard for cluster clients to determine the key names (firstkey, lastkey, etc)
- hard for ACL / flags (is it a read command?)
This is a breaking change.
Moves ZPOP ... 0 fast exit path after type check to reply with
WRONGTYPE. In the past it will return an empty array.
Also now count is not allowed to be negative.
see #9680
before:
```
127.0.0.1:6379> set zset str
OK
127.0.0.1:6379> zpopmin zset 0
(empty array)
127.0.0.1:6379> zpopmin zset -1
(empty array)
```
after:
```
127.0.0.1:6379> set zset str
OK
127.0.0.1:6379> zpopmin zset 0
(error) WRONGTYPE Operation against a key holding the wrong kind of value
127.0.0.1:6379> zpopmin zset -1
(error) ERR value is out of range, must be positive
```
Redis supports inserting data over 4GB into string (and recently for lists too, see #9357),
But LZF compression used in RDB files (see `rdbcompression` config), and in quicklist
(see `list-compress-depth` config) does not support compress/decompress data over
UINT32_MAX, which will result in corrupting the rdb after compression.
Internal changes:
1. Modify the `unsigned int` parameter of `lzf_compress/lzf_decompress` to `size_t`.
2. Modify the variable types in `lzf_compress` involving offsets and lengths to `size_t`.
3. Set LZF_USE_OFFSETS to 0.
When LZF_USE_OFFSETS is 1, lzf store offset into `LZF_HSLOT`(32bit).
Even in 64-bit, `LZF_USE_OFFSETS` defaults to 1, because lzf assumes that it only
compresses and decompresses data smaller than UINT32_MAX.
But now we need to make lzf support 64-bit, turning on `LZF_USE_OFFSETS` will make
it impossible to store 64-bit offsets or pointers.
BTW, disable LZF_USE_OFFSETS also brings a few performance improvements.
Tests:
1. Add test for compress/decompress string large than UINT32_MAX.
2. Add unittest for compress/decompress quicklistNode.
Two issues:
1. In many tests we simply forgot to close the connections we created, which doesn't matter for normal tests where the server is killed, but creates a leak on external server tests.
2. When calling `start_server` on external test we create a fresh connection instead of really starting a new server, but never clean it at the end.
I have seen this CI failure twice on MacOS:
*** [err]: PEXPIRE/PSETEX/PEXPIREAT can set sub-second expires in tests/unit/expire.tcl
Expected 'somevalue {} somevalue {} somevalue {}' to equal or match '{} {} {} {} somevalue {}'
I did some loop test in my own daily CI, the results show that is
not particularly stable. Change the threshold from 30 to 50.
In both tests, "diskless loading short read" and "diskless loading short read with module",
the timeout of waiting for the replica to respond to a short read and log it, is too short.
Also, add --dump-logs in runtest-moduleapi for valgrind runs.
For diskless replication in swapdb mode, considering we already spend replica memory
having a backup of current db to restore in case of failure, we can have the following benefits
by instead swapping database only in case we succeeded in transferring db from master:
- Avoid `LOADING` response during failed and successful synchronization for cases where the
replica is already up and running with data.
- Faster total time of diskless replication, because now we're moving from Transfer + Flush + Load
time to Transfer + Load only. Flushing the tempDb is done asynchronously after swapping.
- This could be implemented also for disk replication with similar benefits if consumers are willing
to spend the extra memory usage.
General notes:
- The concept of `backupDb` becomes `tempDb` for clarity.
- Async loading mode will only kick in if the replica is syncing from a master that has the same
repl-id the one it had before. i.e. the data it's getting belongs to a different time of the same timeline.
- New property in INFO: `async_loading` to differentiate from the blocking loading
- Slot to Key mapping is now a field of `redisDb` as it's more natural to access it from both server.db
and the tempDb that is passed around.
- Because this is affecting replicas only, we assume that if they are not readonly and write commands
during replication, they are lost after SYNC same way as before, but we're still denying CONFIG SET
here anyways to avoid complications.
Considerations for review:
- We have many cases where server.loading flag is used and even though I tried my best, there may
be cases where async_loading should be checked as well and cases where it shouldn't (would require
very good understanding of whole code)
- Several places that had different behavior depending on the loading flag where actually meant to just
handle commands coming from the AOF client differently than ones coming from real clients, changed
to check CLIENT_ID_AOF instead.
**Additional for Release Notes**
- Bugfix - server.dirty was not incremented for any kind of diskless replication, as effect it wouldn't
contribute on triggering next database SAVE
- New flag for RM_GetContextFlags module API: REDISMODULE_CTX_FLAGS_ASYNC_LOADING
- Deprecated RedisModuleEvent_ReplBackup. Starting from Redis 7.0, we don't fire this event.
Instead, we have the new RedisModuleEvent_ReplAsyncLoad holding 3 sub-events: STARTED,
ABORTED and COMPLETED.
- New module flag REDISMODULE_OPTIONS_HANDLE_REPL_ASYNC_LOAD for RedisModule_SetModuleOptions
to allow modules to declare they support the diskless replication with async loading (when absent, we fall
back to disk-based loading).
Co-authored-by: Eduardo Semprebon <edus@saxobank.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Introduced in #8179, this fixes the command's replies in the 0 count edge case.
[BREAKING] changes the reply type when count is 0 to an empty array (instead of nil)
Moves LPOP ... 0 fast exit path after type check to reply with WRONGTYPE
Redis lists are stored in quicklist, which is currently a linked list of ziplists.
Ziplists are limited to storing elements no larger than 4GB, so when bigger
items are added they're getting truncated.
This PR changes quicklists so that they're capable of storing large items
in quicklist nodes that are plain string buffers rather than ziplist.
As part of the PR there were few other changes in redis:
1. new DEBUG sub-commands:
- QUICKLIST-PACKED-THRESHOLD - set the threshold of for the node type to
be plan or ziplist. default (1GB)
- QUICKLIST <key> - Shows low level info about the quicklist encoding of <key>
2. rdb format change:
- A new type was added - RDB_TYPE_LIST_QUICKLIST_2 .
- container type (packed / plain) was added to the beginning of the rdb object
(before the actual node list).
3. testing:
- Tests that requires over 100MB will be by default skipped. a new flag was
added to 'runtest' to run the large memory tests (not used by default)
Co-authored-by: sundb <sundbcn@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Add new no-mandatory-keys flag to support COMMAND GETKEYS of commands
which have no mandatory keys.
In the past we would have got this error:
```
127.0.0.1:6379> command getkeys eval "return 1" 0
(error) ERR Invalid arguments specified for command
```
The issue was that setting maxmemory to used_memory and expecting
eviction is insufficient, since we need to take
mem_not_counted_for_evict into consideration.
This test got broken by #9166
The External tests started failing recently for unclear reason:
```
*** [err]: Tracking invalidation message of eviction keys should be before response in tests/unit/tracking.tcl
Expected '0' to be equal to 'invalidate volatile-key' (context: type eval line 21 cmd {assert_equal $res {invalidate volatile-key}} proc ::test)
```
I suspect the issue is that the used_memory sample is taken while a lazy free is still being processed.
The module test in reply.tcl was introduced by #8521 but didn't run until recently (see #9639)
and then it started failing with valgrind.
This is because valgrind uses 64 bit long double (unlike most other platforms that have at least 80 bits)
But besides valgrind, the tests where also incompatible with ARM32, which also uses 64 bit long doubles.
We now use appropriate value to avoid issues with either valgrind or ARM32
In all the double tests, i use 3.141, which is safe since since addReplyDouble uses
`%.17Lg` which is able to represent this value without adding any digits due to precision loss.
In the long double, since we use `%.17Lf` in ld2string, it preserves 17 significant
digits, rather than 17 digit after the decimal point (like in `%.17Lg`).
So to make these similar, i use value lower than 1 (no digits left of
the period)
Lastly, we have the same issue with TCL (no long doubles) so we read
raw protocol in that test.
Note that the only error before this fix (in both valgrind and ARM32 is this:
```
*** [err]: RM_ReplyWithLongDouble: a float reply in tests/unit/moduleapi/reply.tcl
Expected '3.141' to be equal to '3.14100000000000001' (context: type eval line 2 cmd {assert_equal 3.141 [r rw.longdouble 3.141]} proc ::test)
```
so the changes to debug.c and scripting.tcl aren't really needed, but i consider them a cleanup
(i.e. scripting.c validated a different constant than the one that's sent to it from debug.c).
Another unrelated change is to add the RESP version to the repeated tests in reply.tcl
The previous code did not check whether COUNT is set.
So we can use `lmpop 2 key1 key2 left count 1 count 2`.
This situation can occur in LMPOP/BLMPOP/ZMPOP/BZMPOP commands.
LMPOP/BLMPOP introduced in #9373, ZMPOP/BZMPOP introduced in #9484.
Let modules use additional type of RESP3 response (unused by redis so far)
Also fix tests that where introduced in #8521 but didn't actually run.
Co-authored-by: Oran Agra <oran@redislabs.com>
## Background
For redis master, one replica uses one copy of replication buffer, that is a big waste of memory,
more replicas more waste, and allocate/free memory for every reply list also cost much.
If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with
replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big,
master may be OOM when there are many replicas that separately keep much memory.
Because replication buffers of different replica client are the same, one simple idea is that
all replicas only use one replication buffer, that will effectively save memory.
Since replication backlog content is the same as replicas' output buffer, now we
can discard replication backlog memory and use global shared replication buffer
to implement replication backlog mechanism.
## Implementation
I create one global "replication buffer" which contains content of replication stream.
The structure of "replication buffer" is similar to the reply list that exists in every client.
But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields.
```c
/* Replication buffer blocks is the list of replBufBlock.
*
* +--------------+ +--------------+ +--------------+
* | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 |
* +--------------+ +--------------+ +--------------+
* | / \
* | / \
* | / \
* Repl Backlog Replia_A Replia_B
*
* Each replica or replication backlog increments only the refcount of the
* 'ref_repl_buf_node' which it points to. So when replica walks to the next
* node, it should first increase the next node's refcount, and when we trim
* the replication buffer nodes, we remove node always from the head node which
* refcount is 0. If the refcount of the head node is not 0, we must stop
* trimming and never iterate the next node. */
/* Similar with 'clientReplyBlock', it is used for shared buffers between
* all replica clients and replication backlog. */
typedef struct replBufBlock {
int refcount; /* Number of replicas or repl backlog using. */
long long id; /* The unique incremental number. */
long long repl_offset; /* Start replication offset of the block. */
size_t size, used;
char buf[];
} replBufBlock;
```
So now when we feed replication stream into replication backlog and all replicas, we only need
to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of
replication backlog and replicas to references of the global replication buffer blocks. And we also
need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim
replication backlog if exceeding `repl-backlog-size`.
When sending reply to replicas, we also need to iterate replication buffer blocks and send its
content, when totally sending one block for replica, we decrease current node count and
increase the next current node count, and then free the block which reference is 0 from the
head of replication buffer blocks.
Since now we use linked list to manage replication backlog, it may cost much time for iterating
all linked list nodes to find corresponding replication buffer node. So we create a rax tree to
store some nodes for index, but to avoid rax tree occupying too much memory, i record
one per 64 nodes for index.
Currently, to make partial resynchronization as possible as much, we always let replication
backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting
if slow replicas that reference vast replication buffer blocks, and this method doesn't increase
memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced
replication buffer blocks when we need to trim backlog for exceeding backlog size setting,
we trim backlog incrementally (free 64 blocks per call now), and make it faster in
`beforeSleep` (free 640 blocks).
### Other changes
- `mem_total_replication_buffers`: we add this field in INFO command, it means the total
memory of replication buffers used.
- `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory
is not 0, but it still may be 0, since replication backlog and replicas share one global replication
buffer, only if replication buffer memory is more than the repl backlog setting size, we consider
the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption
of repl backlog.
- Key eviction
Since all replicas and replication backlog share global replication buffer, we think only the
part of exceeding backlog size the extra separate consumption of replicas.
Because we trim backlog incrementally in the background, backlog size may exceeds our
setting if slow replicas that reference vast replication buffer blocks disconnect.
To avoid massive eviction loop, we don't count the delayed freed replication backlog into
used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory.
- `client-output-buffer-limit` check for replica clients
It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size
config (partial sync will succeed and then replica will get disconnected). Such a configuration is
ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption
implications since the replica client will share the backlog buffers memory.
- Drop replication backlog after loading data if needed
We always create replication backlog if server is a master, we need it because we put DELs in
it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb,
it is not possible to support partial resynchronization, to avoid extra memory of replication backlog,
we drop it.
- Multi IO threads
Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled,
to guarantee data accessing thread safe, we must let main thread handle sending the output buffer
to all replicas. But before, other IO threads could handle sending output buffer of all replicas.
## Other optimizations
This solution resolve some other problem:
- When replicas disconnect with master since of out of output buffer limit, releasing the output
buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now,
it doesn't cause freezing.
- This implementation may mitigate reply list copy cost time(also freezes server) when one replication
has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy
reference info, it is very light.
- If we set replication backlog size big, it also may cost much time to copy replication backlog into
replica's output buffer. But this commit eliminates this problem.
- Resizing replication backlog size doesn't empty current replication backlog content.
Before this commit, module blocked clients did not carry through the original RESP version, resulting with RESP3 clients receiving unexpected RESP2 replies.
Following #9483 the daily CI exposed a few problems.
* The cluster creation code (uses redis-cli) is complicated to test with TLS enabled.
for now i'm just skipping them since the tests we run there don't really need that kind of coverage
* cluster port binding failures
note that `find_available_port` already looks for a free cluster port
but the code in `wait_server_started` couldn't detect the failure of binding
(the text it greps for wasn't found in the log)