The check in fileIsManifest misjudged the manifest file. For example,
if resp aof contains "file", it will be considered a manifest file and
the check will fail:
```
*3
$3
set
$4
file
$4
file
```
In #12951, if the preamble aof also contains it, it will also fail.
Fixes#12951.
the bug was happening if the the word "file" is mentioned
in the first 1024 lines of the AOF. and now as soon as it finds
a non-comment line it'll break (if it contains "file" or doesn't)
Recently I saw in CI that reply-schemas-validator fails here:
```
Failed validating 'minimum' in schema[1]['properties']['groups']['items']['properties']['consumers']['items']['properties']['active-time']:
{'description': 'Last time this consumer was active (successful '
'reading/claiming).',
'minimum': 0,
'type': 'integer'}
On instance['groups'][0]['consumers'][0]['active-time']:
-1729380548878722639
```
The reason is that in fuzzer, we may restore corrupted active-time,
which will cause the reply schema CI to fail.
The fuzzer can cause corrupt the state in many places, which will
bugs that mess up the reply, so we decided to skip logreqres.
Also, seen-time is the same type as active-time, adding the minimum.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Fix a daily test failure because alpine doesn't support stack traces and
add in an extra assertion related to making sure the stack trace was
printed twice.
see discussion from after https://github.com/redis/redis/pull/12453 was
merged
----
This PR replaces signals that are not considered async-signal-safe
(AS-safe) with safe calls.
#### **1. serverLog() and serverLogFromHandler()**
`serverLog` uses unsafe calls. It was decided that we will **avoid**
`serverLog` calls by the signal handlers when:
* The signal is not fatal, such as SIGALRM. In these cases, we prefer
using `serverLogFromHandler` which is the safe version of `serverLog`.
Note they have different prompts:
`serverLog`: `62220:M 26 Oct 2023 14:39:04.526 # <msg>`
`serverLogFromHandler`: `62220:signal-handler (1698331136) <msg>`
* The code was added recently. Calls to `serverLog` by the signal
handler have been there ever since Redis exists and it hasn't caused
problems so far. To avoid regression, from now we should use
`serverLogFromHandler`
#### **2. `snprintf` `fgets` and `strtoul`(base = 16) -------->
`_safe_snprintf`, `fgets_async_signal_safe`, `string_to_hex`**
The safe version of `snprintf` was taken from
[here](8cfc4ca5e7/src/mc_util.c (L754))
#### **3. fopen(), fgets(), fclose() --------> open(), read(), close()**
#### **4. opendir(), readdir(), closedir() --------> open(),
syscall(SYS_getdents64), close()**
#### **5. Threads_mngr sync mechanisms**
* waiting for the thread to generate stack trace: semaphore -------->
busy-wait
* `globals_rw_lock` was removed: as we are not using malloc and the
semaphore anymore we don't need to protect `ThreadsManager_cleanups`.
#### **6. Stacktraces buffer**
The initial problem was that we were not able to safely call malloc
within the signal handler.
To solve that we created a buffer on the stack of `writeStacktraces` and
saved it in a global pointer, assuming that under normal circumstances,
the function `writeStacktraces` would complete before any thread
attempted to write to it. However, **if threads lag behind, they might
access this global pointer after it no longer belongs to the
`writeStacktraces` stack, potentially corrupting memory.**
To address this, various solutions were discussed
[here](https://github.com/redis/redis/pull/12658#discussion_r1390442896)
Eventually, we decided to **create a pipe** at server startup that will
remain valid as long as the process is alive.
We chose this solution due to its minimal memory usage, and since
`write()` and `read()` are atomic operations. It ensures that stack
traces from different threads won't mix.
**The stacktraces collection process is now as follows:**
* Cleaning the pipe to eliminate writes of late threads from previous
runs.
* Each thread writes to the pipe its stacktrace
* Waiting for all the threads to mark completion or until a timeout (2
sec) is reached
* Reading from the pipe to print the stacktraces.
#### **7. Changes that were considered and eventually were dropped**
* replace watchdog timer with a POSIX timer:
according to [settimer man](https://linux.die.net/man/2/setitimer)
> POSIX.1-2008 marks getitimer() and setitimer() obsolete, recommending
the use of the POSIX timers API
([timer_gettime](https://linux.die.net/man/2/timer_gettime)(2),
[timer_settime](https://linux.die.net/man/2/timer_settime)(2), etc.)
instead.
However, although it is supposed to conform to POSIX std, POSIX timers
API is not supported on Mac.
You can take a look here at the Linux implementation:
[here](c7562ee135)
To avoid messing up the code, and uncertainty regarding compatibility,
it was decided to drop it for now.
* avoid using sds (uses malloc) in logConfigDebugInfo
It was considered to print config info instead of using sds, however
apparently, `logConfigDebugInfo` does more than just print the sds, so
it was decided this fix is out of this issue scope.
#### **8. fix Signal mask check**
The check `signum & sig_mask` intended to indicate whether the signal is
blocked by the thread was incorrect. Actually, the bit position in the
signal mask corresponds to the signal number. We fixed this by changing
the condition to: `sig_mask & (1L << (sig_num - 1))`
#### **9. Unrelated changes**
both `fork.tcl `and `util.tcl` implemented a function called
`count_log_message` expecting different parameters. This caused
confusion when trying to run daily tests with additional test parameters
to run a specific test.
The `count_log_message` in `fork.tcl` was removed and the calls were
replaced with calls to `count_log_message` located in `util.tcl`
---------
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
recently there are some incidents of hanged tests in the CI
when we try to reproduce them, we get an assertion, not a hang.
maybe the server logs will reveal some info.
## Crash fix
### Current behavior
We might crash if we fail to collect some of the threads' output. If it exceeds timeout for example.
The threads mngr API guarantees that the output array length will be `tids_len`, however, some
indices can be NULL, in case it fails to collect some of the threads' outputs.
When we use the threads mngr to collect the threads' stacktraces, we rely on this and skip NULL
entries. Since the output array was allocated with malloc, instead of NULL, it contained garbage,
so we got a segmentation fault when trying to read this garbage. (in debug.c:writeStacktraces() )
### fix
Allocate the global output array with zcalloc.
### To reproduce the bug, you'll have to change the code:
**in threadsmngr:ThreadsManager_runOnThreads():**
make sure the g_output_array allocation is initialized with garbage and not 0s
(add `memset(g_output_array, 2, sizeof(void*) * tids_len);` below the allocation).
Force one of the threads to write to the array:
add a global var: `static redisAtomic size_t return_now = 0;`
add to `invoke_callback()` before writing to the output array:
```
size_t i_return;
atomicGetIncr(return_now, i_return, 1);
if(i_return == 1) return;
```
compile, start the server with `--enable-debug-command local` and run `redis-cli debug assert`
The assertion triggers the the stacktrace collection.
Expect to get 2 prints of the stack trace - since we get the segmentation fault after we return from
the threads mngr, it can be safely triggered again.
## Added global variables r/w lock in ThreadsManager
To avoid a situation where the main thread runs `ThreadsManager_cleanups` while threads are still
invoking the signal handler, we use a r/w lock.
For cleanups, we will acquire the write lock.
The threads will acquire the read lock to enable them to write simultaneously.
If we fail to acquire the read lock, it means cleanups are in progress and we return immediately.
After acquiring the lock we can safely check that the global output array wasn't nullified and proceed
to write to it.
This way we ensure the threads are not modifying the global variables/ trying to write to the output
array after they were zeroed/nullified/destroyed(the semaphore).
## other minor logging change
1. removed logging if the semaphore times out because the threads can still write to the output array
after this check. Instead, we print the total number of printed stacktraces compared to the exacted
number (len_tids).
2. use noinline attribute to make sure the uplevel number of ignored stack trace entries stays correct.
3. improve testing
Co-authored-by: Oran Agra <oran@redislabs.com>
In some tests, the code manually searches for a log message, and it
uses tail -1 with a delay of 1 second, which can miss the expected line.
Also, because the aof tests use start_server_aof and not start_server,
the test name doesn't log into the server log.
To fix the above, I made the following changes:
- Change the start_server_aof to wrap the start_server.
This will add the created aof server to the servers list, and make
srv() and wait_for_log_messages() available for the tests.
- Introduce a new option for start_server.
'wait_ready' - an option to let the caller start the test code without
waiting for the server to be ready. useful for tests on a server that
is expected to exit on startup.
- Create a new start_server_aof_ex.
The new proc also accept options as argument and make use of the
new 'short_life' option for tests that are expected to exit on startup
because of some error in the aof file(s).
Because of the above, I had to change many lines and replace every
local srv variable (a server config) usage with the srv().
In this PR we are adding the functionality to collect all the process's threads' backtraces.
## Changes made in this PR
### **introduce threads mngr API**
The **threads mngr API** which has 2 abilities:
* `ThreadsManager_init() `- register to SIGUSR2. called on the server start-up.
* ` ThreadsManager_runOnThreads()` - receives a list of a pid_t and a callback, tells every
thread in the list to invoke the callback, and returns the output collected by each invocation.
**Elaborating atomicvar API**
* `atomicIncrGet(var,newvalue_var,count) `-- Increment and get the atomic counter new value
* `atomicFlagGetSet` -- Get and set the atomic counter value to 1
### **Always set SIGALRM handler**
SIGALRM handler prints the process's stacktrace to the log file. Up until now, it was set only if the
`server.watchdog_period` > 0. This can be also useful if debugging is needed. However, in situations
where the server can't get requests, (a deadlock, for example) we weren't able to change the signal handler.
To make it available at run time we set SIGALRM handler on server startup. The signal handler name was
changed to a more general `sigalrmSignalHandler`.
### **Print all the process' threads' stacktraces**
`logStackTrace()` now calls `writeStacktraces()`, instead of logging the current thread stacktrace.
`writeStacktraces()`:
* On Linux systems we use the threads manager API to collect the backtraces of all the process' threads.
To get the `tids` list (threads ids) we read the `/proc/<redis-server-pid>/tasks` file which includes a list of directories.
Each directory name corresponds to one tid (including the main thread). For each thread, we also need to check if it
can get the signal from the threads manager (meaning it is not blocking/ignoring that signal). We send the threads
manager this tids list and `collect_stacktrace_data()` callback, which collects the thread's backtrace addresses,
its name, and tid.
* On other systems, the behavior remained as it was (writing only the current thread stacktrace to the log file).
## compatibility notes
1. **The threads mngr API is only supported in linux.**
2. glibc earlier than 2.3 We use `syscall(SYS_gettid)` and `syscall(SYS_tgkill...)` because their dedicated
alternatives (`gettid()` and `tgkill`) were added in glibc 2.3.
## Output example
Each thread backtrace will have the following format:
`<tid> <thread_name> [additional_info]`
* **tid**: as read from the `/proc/<redis-server-pid>/tasks` file
* **thread_name**: the tread name as it is registered in the os/
* **additional_info**: Sometimes we want to add specific information about one of the threads. currently.
it is only used to mark the thread that handles the backtraces collection by adding "*".
In case of crash - this also indicates which thread caused the crash. The handling thread in won't
necessarily appear first.
```
------ STACK TRACE ------
EIP:
/lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc]
67089 redis-server *
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb9437790]
/lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc]
redis-server *:6379(+0x75e0c)[0xaaaac2fe5e0c]
redis-server *:6379(aeProcessEvents+0x18c)[0xaaaac2fe6c00]
redis-server *:6379(aeMain+0x24)[0xaaaac2fe7038]
redis-server *:6379(main+0xe0c)[0xaaaac3001afc]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xffffb91d73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffb91d74cc]
redis-server *:6379(_start+0x30)[0xaaaac2fe0370]
67093 bio_lazy_free
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]
67091 bio_close_file
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]
67092 bio_aof
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]
67089:signal-handler (1693824528) --------
```
Recently, the option of sending an argument from stdin using `-x` flag
was added to redis-benchmark (this option is available in redis-cli as well).
However, using the `-x` option for sending a blobs that contains null-characters
doesn't work as expected - the argument is trimmed in the first occurrence of
`\X00` (unlike in redis-cli).
This PR aims to fix this issue and add the support for every binary string input,
by sending arguments length to `redisFormatCommandArgv` when processing
redis-benchmark command, so we won't treat the arguments as C-strings.
Additionally, we add a simple test coverage for `-x` (without binary strings, and
also remove an excessive server started in tests, and make sure to select db 0
so that `r` and the benchmark work on the same db.
Co-authored-by: Oran Agra <oran@redislabs.com>
This will increase the size of an already large COB (one already passed
the threshold for disconnection)
This could also mean that we'll attempt to write that data to the socket
and the replica will manage to read it, which will result in an
undesired partial sync (undesired for the test)
A single SPOP with command with count argument resulted in many SPOP
commands being propagated to the replica.
This is inefficient because the key name is repeated many times, and is also
being looked-up many times.
also it results in high QPS metrics on the replica.
To solve that, we flush batches of 1024 fields per SPOP command.
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
When master reboot from RDB, if rsi in RDB is valid we should not free replication backlog, even if master_repl_offset or repl-offset is 0.
Since if master doesn't send any data to replicas master_repl_offset is 0, it's a valid number.
A clear example:
1. start a master and apply some write commands, the master's master_repl_offset is 0 since it has no replicas.
2. stop write commands on master, and start another instance and replicaof the master, trigger an FULLRESYNC
3. the master's master_repl_offset is still 0 (set a large number for repl-ping-replica-period), do BGSAVE and restart the master
4. master load master_repl_offset from RDB's rsi and it's still 0, and we should make sure replica can partially resync with master.
In order to speed up tests, avoid saving an RDB (mostly notable on shutdown),
except for tests that explicitly test the RDB mechanism
In addition, use `shutdown-on-sigterm force` to prevetn shutdown from failing
in case the server is in the middle of the initial AOFRW
Also a a test that checks that the `shutdown-on-sigterm default` is to refuse
shutdown if there's an initial AOFRW
Co-authored-by: Guy Benoish <guy.benoish@redislabs.com>
The MacOS CI in github actions often hangs without any logs. GH argues that
it's due to resource utilization, either running out of disk space, memory, or CPU
starvation, and thus the runner is terminated.
This PR contains multiple attempts to resolve this:
1. introducing pause_process instead of SIGSTOP, which waits for the process
to stop before resuming the test, possibly resolving race conditions in some tests,
this was a suspect since there was one test that could result in an infinite loop in that
case, in practice this didn't help, but still a good idea to keep.
2. disable the `save` config in many tests that don't need it, specifically ones that use
heavy writes and could create large files.
3. change the `populate` proc to use short pipeline rather than an infinite one.
4. use `--clients 1` in the macos CI so that we don't risk running multiple resource
demanding tests in parallel.
5. enable `--verbose` to be repeated to elevate verbosity and print more info to stdout
when a test or a server starts.
Now that the command argument specs are available at runtime (#9656), this PR addresses
#8084 by implementing a complete solution for command-line hinting in `redis-cli`.
It correctly handles nearly every case in Redis's complex command argument definitions, including
`BLOCK` and `ONEOF` arguments, reordering of optional arguments, and repeated arguments
(even when followed by mandatory arguments). It also validates numerically-typed arguments.
It may not correctly handle all possible combinations of those, but overall it is quite robust.
Arguments are only matched after the space bar is typed, so partial word matching is not
supported - that proved to be more confusing than helpful. When the user's current input
cannot be matched against the argument specs, hinting is disabled.
Partial support has been implemented for legacy (pre-7.0) servers that do not support
`COMMAND DOCS`, by falling back to a statically-compiled command argument table.
On startup, if the server does not support `COMMAND DOCS`, `redis-cli` will now issue
an `INFO SERVER` command to retrieve the server version (unless `HELLO` has already
been sent, in which case the server version will be extracted from the reply to `HELLO`).
The server version will be used to filter the commands and arguments in the command table,
removing those not supported by that version of the server. However, the static table only
includes core Redis commands, so with a legacy server hinting will not be supported for
module commands. The auto generated help.h and the scripts that generates it are gone.
Command and argument tables for the server and CLI use different structs, due primarily
to the need to support different runtime data. In order to generate code for both, macros
have been added to `commands.def` (previously `commands.c`) to make it possible to
configure the code generation differently for different use cases (one linked with redis-server,
and one with redis-cli).
Also adding a basic testing framework for the command hints based on new (undocumented)
command line options to `redis-cli`: `--test_hint 'INPUT'` prints out the command-line hint for
a given input string, and `--test_hint_file <filename>` runs a suite of test cases for the hinting
mechanism. The test suite is in `tests/assets/test_cli_hint_suite.txt`, and it is run from
`tests/integration/redis-cli.tcl`.
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The reason is in reply-schemas-validator, the resp of the
client we create will be client_default_resp (currently 3):
```
client *createClient(connection *conn) {
client *c = zmalloc(sizeof(client));
#ifdef LOG_REQ_RES
reqresReset(c, 0);
c->resp = server.client_default_resp;
#else
c->resp = 2;
#endif
}
```
But current_resp3 in redis-cli will be inconsistent with it,
the test adds a simple hello 3 to avoid this failure, test
was added in #11873.
Added help descriptions for dont-pre-clean option, it was
added in #10273
The message "Reading messages... (press Ctrl-C to quit)" is replaced by
"Reading messages... (press Ctrl-C to quit or any key to type command)".
This allows users to subscribe to more channels, to try out UNSUBSCRIBE and to
combine pubsub with other features such as push messages from client tracking.
The "Reading messages" info message is displayed in the bottom of the output in a
distinct style and moves downward as more messages appear. When any key is pressed,
the info message is replaced by the prompt with for entering commands.
After entering a command and the reply is displayed, the "Reading messages" info
messages appears again. This is added to the repl loop in redis-cli and in the
corresponding place for non-interactive mode.
An indication "(subscribed mode)" is included in the prompt when entering commands
in subscribed mode.
Also:
* Fixes a problem that UNSUBSCRIBE hanged when used with RESP3 and push callback,
without first entering subscribe mode. It hanged because UNSUBSCRIBE gets one or
more push replies but no in-band reply.
* Exit subscribed mode after RESET.
In #11666, we added a while loop and will split a big reply
node to multiple nodes. The update of tail->repl_offset may
be wrong. Like before #11666, we would have created at most
one new reply node, and now we will create multiple nodes if
it is a big reply node.
Now we are creating more than one node, and the tail->repl_offset
of all the nodes except the last one are incorrect. Because we
update master_repl_offset at the beginning, and then use it to
update the tail->repl_offset. This would have lead to an assertion
during PSYNC, a test was added to validate that case.
Besides that, the calculation of size was adjusted to fix
tests that failed due to a combination of a very low backlog size,
and some thresholds of that get violated because of the relatively
high overhead of replBufBlock. So now if the backlog size / 16 is too
small, we'll take PROTO_REPLY_CHUNK_BYTES instead.
Co-authored-by: Oran Agra <oran@redislabs.com>
This can happen when a key almost equal or larger than the
client output buffer limit of the replica is written.
Example:
1. DB is empty
2. Backlog size is 1 MB
3. Client out put buffer limit is 2 MB
4. Client writes a 3 MB key
5. The shared replication buffer will have a single node which contains
the key written above, and it exceeds the backlog size.
At this point the client output buffer usage calculation will report the
replica buffer to be 3 MB (or more) even after sending all the data to
the replica.
The primary drops the replica connection for exceeding the limits,
the replica reconnects and successfully executes partial sync but the
primary will drop the connection again because the buffer usage is still
3 MB. This happens over and over.
To mitigate the problem, this fix limits the maximum size of a single
backlog node to be (repl_backlog_size/16). This way a single node can't
exceed the limits of the COB (the COB has to be larger than the
backlog).
It also means that if the backlog has some excessive data it can't trim,
it would be at most about 6% overuse.
other notes:
1. a loop was added in feedReplicationBuffer which caused a massive LOC
change due to indentation, the actual changes are just the `min(max` and the loop.
3. an unrelated change in an existing test to speed up a server termination which took 10 seconds.
Co-authored-by: Oran Agra <oran@redislabs.com>
Work in progress towards implementing a reply schema as part of COMMAND DOCS, see #9845
Since ironing the details of the reply schema of each and every command can take a long time, we
would like to merge this PR when the infrastructure is ready, and let this mature in the unstable branch.
Meanwhile the changes of this PR are internal, they are part of the repo, but do not affect the produced build.
### Background
In #9656 we add a lot of information about Redis commands, but we are missing information about the replies
### Motivation
1. Documentation. This is the primary goal.
2. It should be possible, based on the output of COMMAND, to be able to generate client code in typed
languages. In order to do that, we need Redis to tell us, in detail, what each reply looks like.
3. We would like to build a fuzzer that verifies the reply structure (for now we use the existing
testsuite, see the "Testing" section)
### Schema
The idea is to supply some sort of schema for the various replies of each command.
The schema will describe the conceptual structure of the reply (for generated clients), as defined in RESP3.
Note that the reply structure itself may change, depending on the arguments (e.g. `XINFO STREAM`, with
and without the `FULL` modifier)
We decided to use the standard json-schema (see https://json-schema.org/) as the reply-schema.
Example for `BZPOPMIN`:
```
"reply_schema": {
"oneOf": [
{
"description": "Timeout reached and no elements were popped.",
"type": "null"
},
{
"description": "The keyname, popped member, and its score.",
"type": "array",
"minItems": 3,
"maxItems": 3,
"items": [
{
"description": "Keyname",
"type": "string"
},
{
"description": "Member",
"type": "string"
},
{
"description": "Score",
"type": "number"
}
]
}
]
}
```
#### Notes
1. It is ok that some commands' reply structure depends on the arguments and it's the caller's responsibility
to know which is the relevant one. this comes after looking at other request-reply systems like OpenAPI,
where the reply schema can also be oneOf and the caller is responsible to know which schema is the relevant one.
2. The reply schemas will describe RESP3 replies only. even though RESP3 is structured, we want to use reply
schema for documentation (and possibly to create a fuzzer that validates the replies)
3. For documentation, the description field will include an explanation of the scenario in which the reply is sent,
including any relation to arguments. for example, for `ZRANGE`'s two schemas we will need to state that one
is with `WITHSCORES` and the other is without.
4. For documentation, there will be another optional field "notes" in which we will add a short description of
the representation in RESP2, in case it's not trivial (RESP3's `ZRANGE`'s nested array vs. RESP2's flat
array, for example)
Given the above:
1. We can generate the "return" section of all commands in [redis-doc](https://redis.io/commands/)
(given that "description" and "notes" are comprehensive enough)
2. We can generate a client in a strongly typed language (but the return type could be a conceptual
`union` and the caller needs to know which schema is relevant). see the section below for RESP2 support.
3. We can create a fuzzer for RESP3.
### Limitations (because we are using the standard json-schema)
The problem is that Redis' replies are more diverse than what the json format allows. This means that,
when we convert the reply to a json (in order to validate the schema against it), we lose information (see
the "Testing" section below).
The other option would have been to extend the standard json-schema (and json format) to include stuff
like sets, bulk-strings, error-string, etc. but that would mean also extending the schema-validator - and that
seemed like too much work, so we decided to compromise.
Examples:
1. We cannot tell the difference between an "array" and a "set"
2. We cannot tell the difference between simple-string and bulk-string
3. we cannot verify true uniqueness of items in commands like ZRANGE: json-schema doesn't cover the
case of two identical members with different scores (e.g. `[["m1",6],["m1",7]]`) because `uniqueItems`
compares (member,score) tuples and not just the member name.
### Testing
This commit includes some changes inside Redis in order to verify the schemas (existing and future ones)
are indeed correct (i.e. describe the actual response of Redis).
To do that, we added a debugging feature to Redis that causes it to produce a log of all the commands
it executed and their replies.
For that, Redis needs to be compiled with `-DLOG_REQ_RES` and run with
`--reg-res-logfile <file> --client-default-resp 3` (the testsuite already does that if you run it with
`--log-req-res --force-resp3`)
You should run the testsuite with the above args (and `--dont-clean`) in order to make Redis generate
`.reqres` files (same dir as the `stdout` files) which contain request-response pairs.
These files are later on processed by `./utils/req-res-log-validator.py` which does:
1. Goes over req-res files, generated by redis-servers, spawned by the testsuite (see logreqres.c)
2. For each request-response pair, it validates the response against the request's reply_schema
(obtained from the extended COMMAND DOCS)
5. In order to get good coverage of the Redis commands, and all their different replies, we chose to use
the existing redis test suite, rather than attempt to write a fuzzer.
#### Notes about RESP2
1. We will not be able to use the testing tool to verify RESP2 replies (we are ok with that, it's time to
accept RESP3 as the future RESP)
2. Since the majority of the test suite is using RESP2, and we want the server to reply with RESP3
so that we can validate it, we will need to know how to convert the actual reply to the one expected.
- number and boolean are always strings in RESP2 so the conversion is easy
- objects (maps) are always a flat array in RESP2
- others (nested array in RESP3's `ZRANGE` and others) will need some special per-command
handling (so the client will not be totally auto-generated)
Example for ZRANGE:
```
"reply_schema": {
"anyOf": [
{
"description": "A list of member elements",
"type": "array",
"uniqueItems": true,
"items": {
"type": "string"
}
},
{
"description": "Members and their scores. Returned in case `WITHSCORES` was used.",
"notes": "In RESP2 this is returned as a flat array",
"type": "array",
"uniqueItems": true,
"items": {
"type": "array",
"minItems": 2,
"maxItems": 2,
"items": [
{
"description": "Member",
"type": "string"
},
{
"description": "Score",
"type": "number"
}
]
}
}
]
}
```
### Other changes
1. Some tests that behave differently depending on the RESP are now being tested for both RESP,
regardless of the special log-req-res mode ("Pub/Sub PING" for example)
2. Update the history field of CLIENT LIST
3. Added basic tests for commands that were not covered at all by the testsuite
### TODO
- [x] (maybe a different PR) add a "condition" field to anyOf/oneOf schemas that refers to args. e.g.
when `SET` return NULL, the condition is `arguments.get||arguments.condition`, for `OK` the condition
is `!arguments.get`, and for `string` the condition is `arguments.get` - https://github.com/redis/redis/issues/11896
- [x] (maybe a different PR) also run `runtest-cluster` in the req-res logging mode
- [x] add the new tests to GH actions (i.e. compile with `-DLOG_REQ_RES`, run the tests, and run the validator)
- [x] (maybe a different PR) figure out a way to warn about (sub)schemas that are uncovered by the output
of the tests - https://github.com/redis/redis/issues/11897
- [x] (probably a separate PR) add all missing schemas
- [x] check why "SDOWN is triggered by misconfigured instance replying with errors" fails with --log-req-res
- [x] move the response transformers to their own file (run both regular, cluster, and sentinel tests - need to
fight with the tcl including mechanism a bit)
- [x] issue: module API - https://github.com/redis/redis/issues/11898
- [x] (probably a separate PR): improve schemas: add `required` to `object`s - https://github.com/redis/redis/issues/11899
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Hanna Fadida <hanna.fadida@redislabs.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Shaya Potter <shaya@redislabs.com>
* Make it clear that current_client is the root client that was called by
external connection
* add executing_client which is the client that runs the current command
(can be a module or a script)
* Remove script_caller that was used for commands that have CLIENT_SCRIPT
to get the client that called the script. in most cases, that's the current_client,
and in others (when being called from a module), it could be an intermediate
client when we actually want the original one used by the external connection.
bugfixes:
* RM_Call with C flag should log ACL errors with the requested user rather than
the one used by the original client, this also solves a crash when RM_Call is used
with C flag from a detached thread safe context.
* addACLLogEntry would have logged info about the script_caller, but in case the
script was issued by a module command we actually want the current_client. the
exception is when RM_Call is called from a timer event, in which case we don't
have a current_client.
behavior changes:
* client side tracking for scripts now tracks the keys that are read by the script
instead of the keys that are declared by the caller for EVAL
other changes:
* Log both current_client and executing_client in the crash log.
* remove prepareLuaClient and resetLuaClient, being dead code that was forgotten.
* remove scriptTimeSnapshot and snapshot_time and instead add cmd_time_snapshot
that serves all commands and is reset only when execution nesting starts.
* remove code to propagate CLIENT_FORCE_REPL from the executed command
to the script caller since scripts aren't propagated anyway these days and anyway
this flag wouldn't have had an effect since CLIENT_PREVENT_PROP is added by scriptResetRun.
* fix a module GIL violation issue in afterSleep that was introduced in #10300 (unreleased)
If we have integer scores on the sorted set we're not using the fastest way
to reply by calling `d2string` which uses `double2ll` and `ll2string` when it can,
instead of `fpconv_dtoa`.
This results by some 50% performance improvement in certain cases of integer
scores for both RESP2 and RESP3, and no apparent impact on double scores.
Co-authored-by: Oran Agra <oran@redislabs.com>
Failure happens in FreeBSD daily:
```
*** [err]: Test replication with parallel clients writing in different DBs in tests/integration/replication-4.tcl
Expected [::redis::redisHandle2 dbsize] > 0 (context: type eval line 19 cmd {assert {[$master dbsize] > 0}} proc ::test)
```
The test is failing because db 9 has no data (default db), and
according to the log, we can see that db 9 does not have a key:
```
### Starting test Test replication with parallel clients writing in different DBs in tests/integration/replication-4.tcl
3338:S 03 Feb 2023 00:15:18.723 - DB 11: 1 keys (0 volatile) in 4 slots HT.
3338:S 03 Feb 2023 00:15:18.723 - DB 12: 141 keys (0 volatile) in 256 slots HT.
```
We use `wait_for_condition` to ensure that parallel clients have
written data before calling stop_bg_complex_data. At the same time,
`wait_for_condition` is also used to remove the above `after 1000`,
which can save time in most cases.
Test on x86 + TLS fail with this error:
```
*** [err]: Slave is able to detect timeout during handshake in tests/integration/replication.tcl
Replica is not able to detect timeout
```
The replica logs is:
```
### Starting test Slave is able to detect timeout during handshake in tests/integration/replication.tcl
7681:S 05 Jan 2023 00:21:56.635 * Non blocking connect for SYNC fired the event.
7681:S 05 Jan 2023 00:21:56.638 * Master replied to PING, replication can continue...
7681:S 05 Jan 2023 00:21:56.638 * Trying a partial resynchronization (request ef70638885500aad12dd673c68ca1541116a59fe:1).
7681:S 05 Jan 2023 00:22:56.894 # Failed to read response from the server: error:0A000126:SSL routines::unexpected eof while reading
7681:S 05 Jan 2023 00:22:56.894 # Master did not reply to PSYNC, will try later
```
This is another issue that appeared after #11640 was merged. This PR try to fix it.
The idea is to make it stable in `wait_bgsave`, for example, it may wait until the
next psync retry in the following situation: `Master did not reply to PSYNC, will try later`
Other than that, the change will make the test more consistent / predictable since
it'll mean the master is always frozen in the desired state (waiting for repl-diskless-sync-delay
to happen, rather than earlier stages of the handshake).
Test on ARM + TLS often fail with this error:
```
*** [err]: Slave is able to detect timeout during handshake in tests/integration/replication.tcl
Replica is not able to detect timeout
```
https://github.com/redis/redis-extra-ci/actions/runs/3727554226/jobs/6321797837
The replica logs show that in this case the replica got timeout before even getting a response to the PING command (instead of the SYNC command).
it should have shown these:
```
* MASTER <-> REPLICA sync started
* REPLICAOF 127.0.0.1:22112 enabled ....
### Starting test Slave enters handshake in tests/integration/replication.tcl
* Non blocking connect for SYNC fired the event.
```
then:
```
* Master replied to PING, replication can continue...
* Trying a partial resynchronization (request 50da9eff70d774f4e6cb723eb4b091440f215772:1).
```
and then hang for 5 seconds:
```
# Timeout connecting to the MASTER...
* Reconnecting to MASTER 127.0.0.1:21112 after failure
```
but instead it got this (looks like it disconnected too early, and then tried to re-connect):
```
10890:M 19 Dec 2022 01:32:54.794 * Ready to accept connections tls
10890:M 19 Dec 2022 01:32:54.809 - Accepted 127.0.0.1:41047
10890:M 19 Dec 2022 01:32:54.878 - Reading from client: error:0A000126:SSL routines::unexpected eof while reading
10890:M 19 Dec 2022 01:32:54.925 - Accepted 127.0.0.1:39207
10890:S 19 Dec 2022 01:32:55.463 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
10890:S 19 Dec 2022 01:32:55.463 * Connecting to MASTER 127.0.0.1:24126
10890:S 19 Dec 2022 01:32:55.463 * MASTER <-> REPLICA sync started
10890:S 19 Dec 2022 01:32:55.463 * REPLICAOF 127.0.0.1:24126 enabled (user request from 'id=4 addr=127.0.0.1:39207 laddr=127.0.0.1:24125 fd=8 name= age=1 idle=0 flags=N db=9 sub=0 psub=0 ssub=0 multi=-1 qbuf=43 qbuf-free=20431 argv-mem=21 multi-mem=0 rbs=1024 rbp=5 obl=0 oll=0 omem=0 tot-mem=22317 events=r cmd=slaveof user=default redir=-1 resp=2')
### Starting test Slave enters handshake in tests/integration/replication.tcl
10890:S 19 Dec 2022 01:32:55.476 * Non blocking connect for SYNC fired the event.
10890:S 19 Dec 2022 01:33:00.701 # Failed to read response from the server: (null) <- note this!!
10890:S 19 Dec 2022 01:33:00.701 # Master did not respond to command during SYNC handshake
10890:S 19 Dec 2022 01:33:01.002 * Connecting to MASTER 127.0.0.1:24126
10890:S 19 Dec 2022 01:33:01.002 * MASTER <-> REPLICA sync started
### Starting test Slave is able to detect timeout during handshake in tests/integration/replication.tcl
10890:S 19 Dec 2022 01:33:05.497 * Non blocking connect for SYNC fired the event.
10890:S 19 Dec 2022 01:33:05.500 * Master replied to PING, replication can continue...
10890:S 19 Dec 2022 01:33:05.510 * Trying a partial resynchronization (request 947e1956372a0e6c819cfec51c42cc7979b0c221:1).
10890:S 19 Dec 2022 01:34:05.833 # Failed to read response from the server: error:0A000126:SSL routines::unexpected eof while reading
10890:S 19 Dec 2022 01:34:05.833 # Master did not reply to PSYNC, will try later
```
This PR sets enables the 5 seconds timeout at a later stage to try and prevent the early disconnection.
*TL;DR*
---------------------------------------
Following the discussion over the issue [#7551](https://github.com/redis/redis/issues/7551)
We decided to refactor the client blocking code to eliminate some of the code duplications
and to rebuild the infrastructure better for future key blocking cases.
*In this PR*
---------------------------------------
1. reprocess the command once a client becomes unblocked on key (instead of running
custom code for the unblocked path that's different than the one that would have run if
blocking wasn't needed)
2. eliminate some (now) irrelevant code for handling unblocking lists/zsets/streams etc...
3. modify some tests to intercept the error in cases of error on reprocess after unblock (see
details in the notes section below)
4. replace '$' on the client argv with current stream id. Since once we reprocess the stream
XREAD we need to read from the last msg and not wait for new msg in order to prevent
endless block loop.
5. Added statistics to the info "Clients" section to report the:
* `total_blocking_keys` - number of blocking keys
* `total_blocking_keys_on_nokey` - number of blocking keys which have at least 1 client
which would like
to be unblocked on when the key is deleted.
6. Avoid expiring unblocked key during unblock. Previously we used to lookup the unblocked key
which might have been expired during the lookup. Now we lookup the key using NOTOUCH and
NOEXPIRE to avoid deleting it at this point, so propagating commands in blocked.c is no longer needed.
7. deprecated command flags. We decided to remove the CMD_CALL_STATS and CMD_CALL_SLOWLOG
and make an explicit verification in the call() function in order to decide if stats update should take place.
This should simplify the logic and also mitigate existing issues: for example module calls which are
triggered as part of AOF loading might still report stats even though they are called during AOF loading.
*Behavior changes*
---------------------------------------------------
1. As this implementation prevents writing dedicated code handling unblocked streams/lists/zsets,
since we now re-process the command once the client is unblocked some errors will be reported differently.
The old implementation used to issue
``UNBLOCKED the stream key no longer exists``
in the following cases:
- The stream key has been deleted (ie. calling DEL)
- The stream and group existed but the key type was changed by overriding it (ie. with set command)
- The key not longer exists after we swapdb with a db which does not contains this key
- After swapdb when the new db has this key but with different type.
In the new implementation the reported errors will be the same as if the command was processed after effect:
**NOGROUP** - in case key no longer exists, or **WRONGTYPE** in case the key was overridden with a different type.
2. Reprocessing the command means that some checks will be reevaluated once the
client is unblocked.
For example, ACL rules might change since the command originally was executed and
will fail once the client is unblocked.
Another example is OOM condition checks which might enable the command to run and
block but fail the command reprocess once the client is unblocked.
3. One of the changes in this PR is that no command stats are being updated once the
command is blocked (all stats will be updated once the client is unblocked). This implies
that when we have many clients blocked, users will no longer be able to get that information
from the command stats. However the information can still be gathered from the client list.
**Client blocking**
---------------------------------------------------
the blocking on key will still be triggered the same way as it is done today.
in order to block the current client on list of keys, the call to
blockForKeys will still need to be made which will perform the same as it is today:
* add the client to the list of blocked clients on each key
* keep the key with a matching list node (position in the global blocking clients list for that key)
in the client private blocking key dict.
* flag the client with CLIENT_BLOCKED
* update blocking statistics
* register the client on the timeout table
**Key Unblock**
---------------------------------------------------
Unblocking a specific key will be triggered (same as today) by calling signalKeyAsReady.
the implementation in that part will stay the same as today - adding the key to the global readyList.
The reason to maintain the readyList (as apposed to iterating over all clients blocked on the specific key)
is in order to keep the signal operation as short as possible, since it is called during the command processing.
The main change is that instead of going through a dedicated code path that operates the blocked command
we will just call processPendingCommandsAndResetClient.
**ClientUnblock (keys)**
---------------------------------------------------
1. Unblocking clients on keys will be triggered after command is
processed and during the beforeSleep
8. the general schema is:
9. For each key *k* in the readyList:
```
For each client *c* which is blocked on *k*:
in case either:
1. *k* exists AND the *k* type matches the current client blocking type
OR
2. *k* exists and *c* is blocked on module command
OR
3. *k* does not exists and *c* was blocked with the flag
unblock_on_deleted_key
do:
1. remove the client from the list of clients blocked on this key
2. remove the blocking list node from the client blocking key dict
3. remove the client from the timeout list
10. queue the client on the unblocked_clients list
11. *NEW*: call processCommandAndResetClient(c);
```
*NOTE:* for module blocked clients we will still call the moduleUnblockClientByHandle
which will queue the client for processing in moduleUnblockedClients list.
**Process Unblocked clients**
---------------------------------------------------
The process of all unblocked clients is done in the beforeSleep and no change is planned
in that part.
The general schema will be:
For each client *c* in server.unblocked_clients:
* remove client from the server.unblocked_clients
* set back the client readHandler
* continue processing the pending command and input buffer.
*Some notes regarding the new implementation*
---------------------------------------------------
1. Although it was proposed, it is currently difficult to remove the
read handler from the client while it is blocked.
The reason is that a blocked client should be unblocked when it is
disconnected, or we might consume data into void.
2. While this PR mainly keep the current blocking logic as-is, there
might be some future additions to the infrastructure that we would
like to have:
- allow non-preemptive blocking of client - sometimes we can think
that a new kind of blocking can be expected to not be preempt. for
example lets imagine we hold some keys on disk and when a command
needs to process them it will block until the keys are uploaded.
in this case we will want the client to not disconnect or be
unblocked until the process is completed (remove the client read
handler, prevent client timeout, disable unblock via debug command etc...).
- allow generic blocking based on command declared keys - we might
want to add a hook before command processing to check if any of the
declared keys require the command to block. this way it would be
easier to add new kinds of key-based blocking mechanisms.
Co-authored-by: Oran Agra <oran@redislabs.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
This test sometimes fails:
```
*** [err]: PSYNC2: Partial resync after Master restart using RDB aux fields with expire in tests/integration/psync2-master-restart.tcl
Expected [status ::redis::redisHandle24 sync_partial_ok] == 1 (context: type eval line 49 cmd {assert {[status $replica sync_partial_ok] == 1}} proc ::test)
```
This is because the default repl-timeout value is 10s, sometimes the test
got timeout, then it will do a reconnect, it will incr the sync_partial_ok
counter, and then cause the test to fail.In this fix, we set the repl-timeout
to a very large number to make sure we won't get the timeout.
There is a race in the test:
```
*** [err]: Diskless load swapdb (async_loading): new database is exposed after swapping in tests/integration/replication.tcl
Expected 'myvalue' to be equal to '' (context: type eval line 3 cmd {assert_equal [$replica GET mykey] ""} proc ::test)
```
When doing `$replica GET mykey`, the replica is using the old database.
The reason may be that when doing `master client kill type replica`,
the replica did not yet realize it got disconnected from the master.
So the check of master_link_status fails, and the replica did not
finish the swapdb and the loading.
In that case, i think the solution is to check the sync_full stat on
the master and wait for it to get incremented from the previous value.
i.e. the way to know that we're done with the full sync is not to check
that our state is up (could be up if we check too early), but rather
check that the sync_full counter got incremented.
During the reviewing, we found another race, in Aborted testType,
the `$master config set rdb-key-save-delay 10000` is done after we
already initiated the disconnection, so there's a chance that the replica
will attempt to reconnect before that call, in which case if we fork() before
it, the config will not take effect. Move it to above the disconnection.
Co-authored-by: Oran Agra <oran@redislabs.com>
attach_to_replication_stream already stops pings, but it stops them on
the server we connect to, and in this case it's a replica, and we need
to stop them on the real master.
There is a timing issue in the test, happens with valgrind:
```
*** [err]: diskless fast replicas drop during rdb pipe in tests/integration/replication.tcl
log message of '"*Loading DB in memory*"' not found in ./tests/tmp/server.3580.246/stdout after line: 0 till line: 39
```
The server logs:
```
43465:S 03 Dec 2022 01:26:25.664 * Trying a partial resynchronization (request 15155fa24af0539b70428f9b41f4f7129d774560:1).
43465:S 03 Dec 2022 01:26:35.133 * Full resync from master: 8ddf5a3f7c8ca1061c6b29aa84e7c985c5b29c61:680
```
From the logs, we can see it took almost 10s to get full resync response,
happens with valgrind. it's extremely slow. So i guess it's just an
insufficient wait_for_condition timeout.
Set the time to 15s, and modify other similar places at the same time.
This test sets the master ping interval to 1 hour, in order to avoid
pings in the replicatoin stream incrementing the replication offset,
however, it didn't increase the repl-timeout so on slow machines
where the test took more than 60 seconds, the replicas would drop
and reconnect.
```
*** [err]: PSYNC2: Partial resync after restart using RDB aux fields in tests/integration/psync2.tcl
Replica didn't partial sync
```
The test would detect 4 additional partial syncs where it expects
only one.
In replica, the key expired before master's `INCR` was arrived, so INCR
creates a new key in the replica and the test failed.
```
*** [err]: Replication of an expired key does not delete the expired key in tests/integration/replication-4.tcl
Expected '0' to be equal to '1' (context: type eval line 13 cmd {assert_equal 0 [$slave exists k]} proc ::test)
```
This test is very likely to do a false positive if the `wait_for_ofs_sync`
takes longer than the expiration time, so give it a few more chances.
The test was introduced in #9572.
This payload produces a set with duplicate elements (listpack encoding):
```
restore _key 0 "\x14\x25\x25\x00\x00\x00\x0A\x00\x06\x01\x82\x5F\x35\x03\x04\x01\x82\x5F\x31\x03\x82\x5F\x33\x03\x00\x01\x82\x5F\x39\x03\x82\x5F\x33\x03\x08\x01\x02\x01\xFF\x0B\x00\x31\xBE\x7D\x41\x01\x03\x5B\xEC"
smembers key
1) "6"
2) "_5"
3) "4"
4) "_1"
5) "_3" ---> dup
6) "0"
7) "_9"
8) "_3" ---> dup
9) "8"
10) "2"
```
This kind of sets will cause SDIFF to hang, SDIFF generated a broken
protocol and left the client hung. (Expected ten elements, but only
got nine elements due to the duplication.)
If we set `sanitize-dump-payload` to yes, we will be able to find
the duplicate elements and report "ERR Bad data format".
Discovered and discussed in #11290.
This PR also improve prints when corrupt-dump-fuzzer hangs, it will
print the cmds and the payload, an example like:
```
Testing integration/corrupt-dump-fuzzer
[TIMEOUT]: clients state report follows.
sock6 => (SPAWNED SERVER) pid:28884
Killing still running Redis server 28884
commands caused test to hang:
SDIFF __key
payload that caused test to hang: "\x14\balabala"
```
Co-authored-by: Oran Agra <oran@redislabs.com>
The following example will create an empty set (listpack encoding):
```
> RESTORE key 0
"\x14\x25\x25\x00\x00\x00\x00\x00\x02\x01\x82\x5F\x37\x03\x06\x01\x82\x5F\x35\x03\x82\x5F\x33\x03\x00\x01\x82\x5F\x31\x03\x82\x5F\x39\x03\x04\xA9\x08\x01\xFF\x0B\x00\xA3\x26\x49\xB4\x86\xB0\x0F\x41"
OK
> SCARD key
(integer) 0
> SRANDMEMBER key
Error: Server closed the connection
```
In the spirit of #9297, skip empty set when loading RDB_TYPE_SET_LISTPACK.
Introduced in #11290
Improve memory efficiency of list keys
## Description of the feature
The new listpack encoding uses the old `list-max-listpack-size` config
to perform the conversion, which we can think it of as a node inside a
quicklist, but without 80 bytes overhead (internal fragmentation included)
of quicklist and quicklistNode structs.
For example, a list key with 5 items of 10 chars each, now takes 128 bytes
instead of 208 it used to take.
## Conversion rules
* Convert listpack to quicklist
When the listpack length or size reaches the `list-max-listpack-size` limit,
it will be converted to a quicklist.
* Convert quicklist to listpack
When a quicklist has only one node, and its length or size is reduced to half
of the `list-max-listpack-size` limit, it will be converted to a listpack.
This is done to avoid frequent conversions when we add or remove at the bounding size or length.
## Interface changes
1. add list entry param to listTypeSetIteratorDirection
When list encoding is listpack, `listTypeIterator->lpi` points to the next entry of current entry,
so when changing the direction, we need to use the current node (listTypeEntry->p) to
update `listTypeIterator->lpi` to the next node in the reverse direction.
## Benchmark
### Listpack VS Quicklist with one node
* LPUSH - roughly 0.3% improvement
* LRANGE - roughly 13% improvement
### Both are quicklist
* LRANGE - roughly 3% improvement
* LRANGE without pipeline - roughly 3% improvement
From the benchmark, as we can see from the results
1. When list is quicklist encoding, LRANGE improves performance by <5%.
2. When list is listpack encoding, LRANGE improves performance by ~13%,
the main enhancement is brought by `addListListpackRangeReply()`.
## Memory usage
1M lists(key:0~key:1000000) with 5 items of 10 chars ("hellohello") each.
shows memory usage down by 35.49%, from 214MB to 138MB.
## Note
1. Add conversion callback to support doing some work before conversion
Since the quicklist iterator decompresses the current node when it is released, we can
no longer decompress the quicklist after we convert the list.
Fix a few issues with the recent #11463
* use exitFromChild instead of exit
* test should ignore defunct process since that's what we expect to
happen for thees child processes when the parent dies.
* fix typo
Co-authored-by: Binbin <binloveplay1314@qq.com>
During a diskless sync, if the master main process crashes, the child would
have hung in `write`. This fix closes the read fd on the child side, so that if the
parent crashes, the child will get a write error and exit.
This change also fixes disk-based replication, BGSAVE and AOFRW.
In that case the child wouldn't have been hang, it would have just kept
running until done which may be pointless.
There is a certain degree of risk here. in case there's a BGSAVE child that could
maybe succeed and the parent dies for some reason, the old code would have let
the child keep running and maybe succeed and avoid data loss.
On the other hand, if the parent is restarted, it would have loaded an old rdb file
(or none), and then the child could reach the end and rename the rdb file (data
conflicting with what the parent has), or also have a race with another BGSAVE
child that the new parent started.
Note that i removed a comment saying a write error will be ignored in the child
and handled by the parent (this comment was very old and i don't think relevant).
All commands / use cases that heavily rely on double to a string representation conversion,
(e.g. meaning take a double-precision floating-point number like 1.5 and return a string like "1.5" ),
could benefit from a performance boost by swapping snprintf(buf,len,"%.17g",value) by the
equivalent [fpconv_dtoa](https://github.com/night-shift/fpconv) or any other algorithm that ensures
100% coverage of conversion.
This is a well-studied topic and Projects like MongoDB. RedPanda, PyTorch leverage libraries
( fmtlib ) that use the optimized double to string conversion underneath.
The positive impact can be substantial. This PR uses the grisu2 approach ( grisu explained on
https://www.cs.tufts.edu/~nr/cs257/archive/florian-loitsch/printf.pdf section 5 ).
test suite changes:
Despite being compatible, in some cases it produces a different result from printf, and some tests
had to be adjusted.
one case is that `%.17g` (which means %e or %f which ever is shorter), chose to use `5000000000`
instead of 5e+9, which sounds like a bug?
In other cases, we changed TCL to compare numbers instead of strings to ignore minor rounding
issues (`expr 0.8 == 0.79999999999999999`)
Fix `Test replication with lazy expire` test to not timeout the wait command.
This fix will allow the test to pass on slow environments and when running with valgrind.
The PR reverts the changes made on #10969.
The reason for revert was trigger because of occasional test failure
that started after the PR was merged.
The issue is that if there is a lazy expire during the command invocation,
the `del` command is added to the replication stream after the command
placeholder. So the logical order on the primary is:
* Delete the key (lazy expiration)
* Command invocation
But the replication stream gets it the other way around:
* Command invocation (because the command is written into the placeholder)
* Delete the key (lazy expiration)
So if the command write to the key that was just lazy expired we will get
inconsistency between primary and replica.
One solution we considered is to add another lazy expire replication stream
and write all the lazy expire there. Then when replicating, we will replicate the
lazy expire replication stream first. This will solve this specific test failure but
we realize that the issues does not ends here and the more we dig the more
problems we find.One of the example we thought about (that can actually
crashes Redis) is as follow:
* User perform SINTERSTORE
* When Redis tries to fetch the second input key it triggers lazy expire
* The lazy expire trigger a module logic that deletes the first input key
* Now Redis hold the robj of the first input key that was actually freed
We believe we took the wrong approach and we will come up with another
PR that solve the problem differently, for now we revert the changes so we
will not have the tests failure.
Notice that not the entire code was revert, some parts of the PR are changes
that we would like to keep. The changes that **was** reverted are:
* Saving a placeholder for replication at the beginning of the command (`call` function)
* Order of the replication stream on active expire and eviction (we will decide how
to handle it correctly on follow up PR)
* `Spop` changes are no longer needed (because we reverted the placeholder code)
Changes that **was not** reverted:
* On expire/eviction, wrap the `del` and the notification effect in a multi exec.
* `PropagateNow` function can still accept a special dbid, -1, indicating not to replicate select.
* Keep optimisation for reusing the `alsoPropagate` array instead of allocating it each time.
Tests:
* All tests was kept and only few tests was modify to work correctly with the changes
* Test was added to verify that the revert fixes the issues.
This is the history of aof-race related changes:
1. added in 3aa4b00970
2. disabled in dcdfd005a0
3. enabled in 5c63922691
4. disabled in 53a2af3941
This PR refreshes the aof-race test, re-enable it.
Closes#10971
The kill above is sometimes successful and sometimes already too late.
The PING in pysnc wrong offset test got rejected by bgsaveerr because
lastbgsave_status is C_ERR.
In theory, using diskless can avoid PING being affected, because when
the replica is dropped, we will kill the child with SIGUSR1, and this
will not affect lastbgsave_status.
Anyway, this kill is not particularly needed here, dropping the kill
is the best one, since we do have the waitForBgsave, so just let it
take care of the bgsave. No need for fast termination.