The following four configurations are renamed to align with Redis style:
1. server_cpulist renamed to server-cpulist
2. bio_cpulist renamed to bio-cpulist
3. aof_rewrite_cpulist renamed to aof-rewrite-cpulist
4. bgsave_cpulist renamed to bgsave-cpulist
The original names are retained as aliases to ensure compatibility with
old configuration files. We recommend users to gradually transition to
using the new configuration names to maintain consistency in style.
Introduced in #11695 .
The tryResizeHashTables function gets stuck on the last non-empty slot
while iterating through dictionaries. It does not restart from the
beginning. The reason for this issue is a problem with the usage of
dbIteratorNextDict:
/* Returns next dictionary from the iterator, or NULL if iteration is complete. */
dict *dbIteratorNextDict(dbIterator *dbit) {
if (dbit->next_slot == -1) return NULL;
dbit->slot = dbit->next_slot;
dbit->next_slot = dbGetNextNonEmptySlot(dbit->db, dbit->slot, dbit->keyType);
return dbGetDictFromIterator(dbit);
}
When iterating to the last non-empty slot, next_slot is set to -1,
causing it to loop indefinitely on that slot. We need to modify the code
to ensure that after iterating to the last non-empty slot, it returns to
the first non-empty slot.
BTW, function tryResizeHashTables is actually iterating over slots
that have keys. However, in its implementation, it leverages the
dbIterator (which is a key iterator) to obtain slot and dictionary
information. While this approach works fine, but it is not very
intuitive. This PR also improves readability by changing the iteration
to directly iterate over slots, thereby enhancing clarity.
When we register notification or server event in RedisModule_OnLoad, but
RedisModule_OnLoad eventually fails, triggering notification or server
event
will cause the server to crash.
If the loading fails on a later stage of moduleLoad, we do call
moduleUnload
which handles all un-registration, but when it fails on the
RedisModule_OnLoad
call, we only un-register several specific things and these were
missing:
- moduleUnsubscribeNotifications
- moduleUnregisterFilters
- moduleUnsubscribeAllServerEvents
Refactored the code to reuse the code from moduleUnload.
Fixes#12808.
see discussion from after https://github.com/redis/redis/pull/12453 was
merged
----
This PR replaces signals that are not considered async-signal-safe
(AS-safe) with safe calls.
#### **1. serverLog() and serverLogFromHandler()**
`serverLog` uses unsafe calls. It was decided that we will **avoid**
`serverLog` calls by the signal handlers when:
* The signal is not fatal, such as SIGALRM. In these cases, we prefer
using `serverLogFromHandler` which is the safe version of `serverLog`.
Note they have different prompts:
`serverLog`: `62220:M 26 Oct 2023 14:39:04.526 # <msg>`
`serverLogFromHandler`: `62220:signal-handler (1698331136) <msg>`
* The code was added recently. Calls to `serverLog` by the signal
handler have been there ever since Redis exists and it hasn't caused
problems so far. To avoid regression, from now we should use
`serverLogFromHandler`
#### **2. `snprintf` `fgets` and `strtoul`(base = 16) -------->
`_safe_snprintf`, `fgets_async_signal_safe`, `string_to_hex`**
The safe version of `snprintf` was taken from
[here](8cfc4ca5e7/src/mc_util.c (L754))
#### **3. fopen(), fgets(), fclose() --------> open(), read(), close()**
#### **4. opendir(), readdir(), closedir() --------> open(),
syscall(SYS_getdents64), close()**
#### **5. Threads_mngr sync mechanisms**
* waiting for the thread to generate stack trace: semaphore -------->
busy-wait
* `globals_rw_lock` was removed: as we are not using malloc and the
semaphore anymore we don't need to protect `ThreadsManager_cleanups`.
#### **6. Stacktraces buffer**
The initial problem was that we were not able to safely call malloc
within the signal handler.
To solve that we created a buffer on the stack of `writeStacktraces` and
saved it in a global pointer, assuming that under normal circumstances,
the function `writeStacktraces` would complete before any thread
attempted to write to it. However, **if threads lag behind, they might
access this global pointer after it no longer belongs to the
`writeStacktraces` stack, potentially corrupting memory.**
To address this, various solutions were discussed
[here](https://github.com/redis/redis/pull/12658#discussion_r1390442896)
Eventually, we decided to **create a pipe** at server startup that will
remain valid as long as the process is alive.
We chose this solution due to its minimal memory usage, and since
`write()` and `read()` are atomic operations. It ensures that stack
traces from different threads won't mix.
**The stacktraces collection process is now as follows:**
* Cleaning the pipe to eliminate writes of late threads from previous
runs.
* Each thread writes to the pipe its stacktrace
* Waiting for all the threads to mark completion or until a timeout (2
sec) is reached
* Reading from the pipe to print the stacktraces.
#### **7. Changes that were considered and eventually were dropped**
* replace watchdog timer with a POSIX timer:
according to [settimer man](https://linux.die.net/man/2/setitimer)
> POSIX.1-2008 marks getitimer() and setitimer() obsolete, recommending
the use of the POSIX timers API
([timer_gettime](https://linux.die.net/man/2/timer_gettime)(2),
[timer_settime](https://linux.die.net/man/2/timer_settime)(2), etc.)
instead.
However, although it is supposed to conform to POSIX std, POSIX timers
API is not supported on Mac.
You can take a look here at the Linux implementation:
[here](c7562ee135)
To avoid messing up the code, and uncertainty regarding compatibility,
it was decided to drop it for now.
* avoid using sds (uses malloc) in logConfigDebugInfo
It was considered to print config info instead of using sds, however
apparently, `logConfigDebugInfo` does more than just print the sds, so
it was decided this fix is out of this issue scope.
#### **8. fix Signal mask check**
The check `signum & sig_mask` intended to indicate whether the signal is
blocked by the thread was incorrect. Actually, the bit position in the
signal mask corresponds to the signal number. We fixed this by changing
the condition to: `sig_mask & (1L << (sig_num - 1))`
#### **9. Unrelated changes**
both `fork.tcl `and `util.tcl` implemented a function called
`count_log_message` expecting different parameters. This caused
confusion when trying to run daily tests with additional test parameters
to run a specific test.
The `count_log_message` in `fork.tcl` was removed and the calls were
replaced with calls to `count_log_message` located in `util.tcl`
---------
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
We have some test cases of swapdb with watchkey but missing seperate
basic swapdb test cases, unhappy path and flushdb after swapdb. So added
the test cases in keyspace.tcl.
CI reports that this test failed, the reason is because during
the command processing, the node processed PING/PONG, resulting
in ping_sent or pong_received mismatch.
Change to use MULTI to avoid timing issue. The test was introduced
in #12224.
When using DB iterator, it will use dictInitSafeIterator to init a old safe
dict iterator. When dbIteratorNext is used, it will jump to the next slot db
dict when we are done a dict. During this process, we do not have any calls to
dictResumeRehashing, which causes the dict's pauserehash to always be > 0.
And at last, it will be returned directly in dictRehashMilliseconds, which causes
us to have slot dict in a state where rehash cannot be completed.
In the "expire scan should skip dictionaries with lot's of empty buckets" test,
adding a `keys *` can reproduce the problem stably. `keys *` will call dbIteratorNext
to trigger a traversal of all slot dicts.
Added dbReleaseIterator and dbIteratorInitNextSafeIterator methods to call dictResetIterator.
Issue was introduced in #11695.
Test recently added fails on timeout in valgrind in GH actions.
Locally with valgrind the test finishes within 1.5 sec(s). Couldn't find
any issue due to lack of reproducibility. Increasing the timeout and
adding an additional log to the test to understand how many keys
were left at the end.
Redis 7.2 (#9406) introduced a new modules event, `RedisModuleEvent_Key`.
This new event allows the module to read the key data just before it is removed
from the database (either deleted, expired, evicted, or overwritten).
When the key is removed from the database, either by active expire or eviction.
The new event was not called as part of an execution unit. This can cause an
issue if the module registers a post notification job inside the event. This job will
not be executed atomically with the expiration/eviction operation and will not
replicated inside a Multi/Exec. Moreover, the post notification job will be executed
right after the event where it is still not safe to perform any write operation, this will
violate the promise that post notification job will be called atomically with the
operation that triggered it and **only when it is safe to write**.
This PR fixes the issue by wrapping each expiration/eviction of a key with an execution
unit. This makes sure the entire operation will run atomically and all the post notification
jobs will be executed at the end where it is safe to write.
Tests were modified to verify the fix.
This change overcomes many stability issues experienced with the
vmactions action.
We need to limit VMs to 8GB for better stability, as the 13GB default
seems to hang them occasionally.
Shell code has been simplified since this action seem to use `bash -e`
which will abort on non-zero exit codes anyway.
Reverts the skipping defrag tests in cluster mode (done in #12672.
instead it skips only some defrag tests that are relevant for cluster modes.
The test now run well after investigating and making the changes in #12674 and #12694.
Co-authored-by: Oran Agra <oran@redislabs.com>
Optimize the performance of SCAN commands when a match pattern can only contain keys from a
single slot in cluster mode. This can happen when the pattern contains a hash tag before any
wildcard matchers or when the key contains no matchers.
Test failure on freebsd CI:
```
*** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl
scan didn't handle slot skipping logic.
```
Observation:
expiry of keys might happen before the empty buckets are formed and won't help with the expiry skip logic validation.
Solution:
Disable expiration until the empty buckets are formed.
Fixing issues started after #11695 when the defrag tests are being executed in cluster mode too.
For some reason, it looks like the defragmentation is over too quickly, before the test is able to
detect that it's running.
so now instead of waiting to see that it's active, we wait to see that it did some work
```
[err]: Active defrag big list: cluster in tests/unit/memefficiency.tcl
defrag not started.
[err]: Active defrag big keys: cluster in tests/unit/memefficiency.tcl
defrag didn't stop.
```
Temporarily disabling few of the defrag tests in cluster mode to make the daily run stable:
Active defrag eval scripts
Active defrag big keys
Active defrag big list
Active defrag edge case
Dictionary iterator logic in the `tryResizeHashTables` method is picking the next
(incorrect) dictionary while the cursor is at a given slot. This could lead to some
dictionary/slot getting skipped from resizing.
Also stabilize the test.
problem introduced recently in #11695
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
when a server in the test suite crashes and is restarted by redstart_server, we didn't clean it's pid from the list.
we can see that when the corrupt-dump-fuzzer hangs, it has a long list of servers to lean, but in fact they're all already dead.
In #11568 we removed the NOSCRIPT flag from commands and keep the BLOCKING flag.
Aiming to allow them in scripts and let them implicitly behave in the non-blocking way.
In that sense, the old behavior was to allow LPOP and reject BLPOP, and the new behavior,
is to allow BLPOP too, and fail it only in case it ends up blocking.
So likewise, so far we allowed XREAD and rejected XREAD BLOCK, and we will now allow
that too, and only reject it if it ends up blocking.
recently there are some incidents of hanged tests in the CI
when we try to reproduce them, we get an assertion, not a hang.
maybe the server logs will reveal some info.
## Crash fix
### Current behavior
We might crash if we fail to collect some of the threads' output. If it exceeds timeout for example.
The threads mngr API guarantees that the output array length will be `tids_len`, however, some
indices can be NULL, in case it fails to collect some of the threads' outputs.
When we use the threads mngr to collect the threads' stacktraces, we rely on this and skip NULL
entries. Since the output array was allocated with malloc, instead of NULL, it contained garbage,
so we got a segmentation fault when trying to read this garbage. (in debug.c:writeStacktraces() )
### fix
Allocate the global output array with zcalloc.
### To reproduce the bug, you'll have to change the code:
**in threadsmngr:ThreadsManager_runOnThreads():**
make sure the g_output_array allocation is initialized with garbage and not 0s
(add `memset(g_output_array, 2, sizeof(void*) * tids_len);` below the allocation).
Force one of the threads to write to the array:
add a global var: `static redisAtomic size_t return_now = 0;`
add to `invoke_callback()` before writing to the output array:
```
size_t i_return;
atomicGetIncr(return_now, i_return, 1);
if(i_return == 1) return;
```
compile, start the server with `--enable-debug-command local` and run `redis-cli debug assert`
The assertion triggers the the stacktrace collection.
Expect to get 2 prints of the stack trace - since we get the segmentation fault after we return from
the threads mngr, it can be safely triggered again.
## Added global variables r/w lock in ThreadsManager
To avoid a situation where the main thread runs `ThreadsManager_cleanups` while threads are still
invoking the signal handler, we use a r/w lock.
For cleanups, we will acquire the write lock.
The threads will acquire the read lock to enable them to write simultaneously.
If we fail to acquire the read lock, it means cleanups are in progress and we return immediately.
After acquiring the lock we can safely check that the global output array wasn't nullified and proceed
to write to it.
This way we ensure the threads are not modifying the global variables/ trying to write to the output
array after they were zeroed/nullified/destroyed(the semaphore).
## other minor logging change
1. removed logging if the semaphore times out because the threads can still write to the output array
after this check. Instead, we print the total number of printed stacktraces compared to the exacted
number (len_tids).
2. use noinline attribute to make sure the uplevel number of ignored stack trace entries stays correct.
3. improve testing
Co-authored-by: Oran Agra <oran@redislabs.com>
In some tests, the code manually searches for a log message, and it
uses tail -1 with a delay of 1 second, which can miss the expected line.
Also, because the aof tests use start_server_aof and not start_server,
the test name doesn't log into the server log.
To fix the above, I made the following changes:
- Change the start_server_aof to wrap the start_server.
This will add the created aof server to the servers list, and make
srv() and wait_for_log_messages() available for the tests.
- Introduce a new option for start_server.
'wait_ready' - an option to let the caller start the test code without
waiting for the server to be ready. useful for tests on a server that
is expected to exit on startup.
- Create a new start_server_aof_ex.
The new proc also accept options as argument and make use of the
new 'short_life' option for tests that are expected to exit on startup
because of some error in the aof file(s).
Because of the above, I had to change many lines and replace every
local srv variable (a server config) usage with the srv().
The problem is that WAITAOF could have hang in case commands were
propagated only to replicas.
This can happen if a module uses RM_Call with the REDISMODULE_ARGV_NO_AOF flag.
In that case, master_repl_offset would increase, but there would be nothing to fsync, so
in the absence of other traffic, fsynced_reploff_pending would stay the static, and WAITAOF can hang.
This commit updates fsynced_reploff_pending to the latest offset in flushAppendOnlyFile in case
there's nothing to fsync. i.e. in case it's behind because of the above mentions case it'll be refreshed
and release the WAITAOF.
Other changes:
Fix a race in wait.tcl (client getting blocked vs. the fsync thread)
If we set `fsynced_reploff_pending` in `startAppendOnly`, and the fork doesn't start
immediately (e.g. there's another fork active at the time), any subsequent commands
will increment `server.master_repl_offset`, but will not cause a fsync (given they were
executed before the fork started, they just ended up in the RDB part of it)
Therefore, any WAITAOF will wait on the new master_repl_offset, but it will time out
because no fsync will be executed.
Release notes:
```
WAITAOF could timeout in the absence of write traffic in case a new AOF is created and
an AOFRW can't immediately start.
This can happen by the appendonly config is changed at runtime, but also after FLUSHALL,
and replica full sync.
```
In this PR we are adding the functionality to collect all the process's threads' backtraces.
## Changes made in this PR
### **introduce threads mngr API**
The **threads mngr API** which has 2 abilities:
* `ThreadsManager_init() `- register to SIGUSR2. called on the server start-up.
* ` ThreadsManager_runOnThreads()` - receives a list of a pid_t and a callback, tells every
thread in the list to invoke the callback, and returns the output collected by each invocation.
**Elaborating atomicvar API**
* `atomicIncrGet(var,newvalue_var,count) `-- Increment and get the atomic counter new value
* `atomicFlagGetSet` -- Get and set the atomic counter value to 1
### **Always set SIGALRM handler**
SIGALRM handler prints the process's stacktrace to the log file. Up until now, it was set only if the
`server.watchdog_period` > 0. This can be also useful if debugging is needed. However, in situations
where the server can't get requests, (a deadlock, for example) we weren't able to change the signal handler.
To make it available at run time we set SIGALRM handler on server startup. The signal handler name was
changed to a more general `sigalrmSignalHandler`.
### **Print all the process' threads' stacktraces**
`logStackTrace()` now calls `writeStacktraces()`, instead of logging the current thread stacktrace.
`writeStacktraces()`:
* On Linux systems we use the threads manager API to collect the backtraces of all the process' threads.
To get the `tids` list (threads ids) we read the `/proc/<redis-server-pid>/tasks` file which includes a list of directories.
Each directory name corresponds to one tid (including the main thread). For each thread, we also need to check if it
can get the signal from the threads manager (meaning it is not blocking/ignoring that signal). We send the threads
manager this tids list and `collect_stacktrace_data()` callback, which collects the thread's backtrace addresses,
its name, and tid.
* On other systems, the behavior remained as it was (writing only the current thread stacktrace to the log file).
## compatibility notes
1. **The threads mngr API is only supported in linux.**
2. glibc earlier than 2.3 We use `syscall(SYS_gettid)` and `syscall(SYS_tgkill...)` because their dedicated
alternatives (`gettid()` and `tgkill`) were added in glibc 2.3.
## Output example
Each thread backtrace will have the following format:
`<tid> <thread_name> [additional_info]`
* **tid**: as read from the `/proc/<redis-server-pid>/tasks` file
* **thread_name**: the tread name as it is registered in the os/
* **additional_info**: Sometimes we want to add specific information about one of the threads. currently.
it is only used to mark the thread that handles the backtraces collection by adding "*".
In case of crash - this also indicates which thread caused the crash. The handling thread in won't
necessarily appear first.
```
------ STACK TRACE ------
EIP:
/lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc]
67089 redis-server *
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb9437790]
/lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc]
redis-server *:6379(+0x75e0c)[0xaaaac2fe5e0c]
redis-server *:6379(aeProcessEvents+0x18c)[0xaaaac2fe6c00]
redis-server *:6379(aeMain+0x24)[0xaaaac2fe7038]
redis-server *:6379(main+0xe0c)[0xaaaac3001afc]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xffffb91d73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffb91d74cc]
redis-server *:6379(_start+0x30)[0xaaaac2fe0370]
67093 bio_lazy_free
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]
67091 bio_close_file
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]
67092 bio_aof
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]
67089:signal-handler (1693824528) --------
```
Recently, the option of sending an argument from stdin using `-x` flag
was added to redis-benchmark (this option is available in redis-cli as well).
However, using the `-x` option for sending a blobs that contains null-characters
doesn't work as expected - the argument is trimmed in the first occurrence of
`\X00` (unlike in redis-cli).
This PR aims to fix this issue and add the support for every binary string input,
by sending arguments length to `redisFormatCommandArgv` when processing
redis-benchmark command, so we won't treat the arguments as C-strings.
Additionally, we add a simple test coverage for `-x` (without binary strings, and
also remove an excessive server started in tests, and make sure to select db 0
so that `r` and the benchmark work on the same db.
Co-authored-by: Oran Agra <oran@redislabs.com>
The new test added in #12476 causes reply-schemas-validator to fail.
When doing `catch {r get key}`, the req-res output is:
```
3
get
3
key
12
__argv_end__
$100000
aaaaaaaaaaaaaaaaaaaa...4
info
5
stats
12
__argv_end__
=1670
txt:# Stats
...
```
And we can see the link after `$100000`, there is a 4 in the last,
it break the req-res-log-validator script since the format is wrong.
The reason i guess is after the client reconnection (after the output
buf limit), we will not add newlines, but append args directly.
Since obuf-limits.tcl is doing the same thing, and it had the logreqres:skip
flag, so this PR is following it.
ZRANGE BYSCORE/BYLEX with [LIMIT offset count] option was
using every level in skiplist to jump to the first/last node in range,
but only use level[0] in skiplist to locate the node at offset, resulting
in sub-optimal performance using LIMIT:
```
while (ln && offset--) {
if (reverse) {
ln = ln->backward;
} else {
ln = ln->level[0].forward;
}
}
```
It could be slow when offset is very big. We can get the total rank of
the offset location and use skiplist to jump to it. It is an improvement
from O(offset) to O(log rank).
Below shows how this is implemented (if the offset is positve):
Use the skiplist to seach for the first element in the range, record its
rank `rank_0`, so we can have the rank of the target node `rank_t`.
Meanwhile we record the last node we visited which has zsl->level-1
levels and its rank `rank_1`. Then we start from the zsl->level-1 node,
use skiplist to go forward `rank_t-rank_1` nodes to reach the target node.
It is very similiar when the offset is reversed.
Note that if `rank_t` is very close to `rank_0`, we just start from the first
element in range and go node by node, this for the case when zsl->level-1
node is to far away and it is quicker to reach the target node by node.
Here is a test using a random generated zset including 10000 elements
(with different positive scores), doing a bench mark which compares how
fast the `ZRANGE` command is exucuted before and after the optimization.
The start score is set to 0 and the count is set to 1 to make sure that
most of the time is spent on locating the offset.
```
memtier_benchmark -h 127.0.0.1 -p 6379 --command="zrange test 0 +inf byscore limit <offset> 1"
```
| offset | QPS(unstable) | QPS(optimized) |
|--------|--------|--------|
| 10 | 73386.02 | 74819.82 |
| 1000 | 48084.96 | 73177.73 |
| 2000 | 31156.79 | 72805.83 |
| 5000 | 10954.83 | 71218.21 |
With the result above, we can see that the original code is greatly
slowed down when offset gets bigger, and with the optimization the
speed is almost not affected.
Similiar results are generated when testing reversed offset:
```
memtier_benchmark -h 127.0.0.1 -p 6379 --command="zrange test +inf 0 byscore rev limit <offset> 1"
```
| offset | QPS(unstable) | QPS(optimized) |
|--------|--------|--------|
| 10 | 74505.14 | 71653.67 |
| 1000 | 46829.25 | 72842.75 |
| 2000 | 28985.48 | 73669.01 |
| 5000 | 11066.22 | 73963.45 |
And the same conclusion is drawn from the tests of ZRANGE BYLEX.
This PR adds a new Module API int RM_AddACLCategory(RedisModuleCtx *ctx, const char *category_name) to add a new ACL command category.
Here, we initialize the ACLCommandCategories array by allocating space for 64 categories and duplicate the 21 default categories from the predefined array 'ACLDefaultCommandCategories' into the ACLCommandCategories array while ACL initialization. Valid ACL category names can only contain alphanumeric characters, underscores, and dashes.
The API when called, checks for the onload flag, category name validity, and for duplicate category name if present. If the conditions are satisfied, the API adds the new category to the trailing end of the ACLCommandCategories array and assigns the acl_categories flag bit according to the index at which the category is added.
If any error is encountered the errno is set accordingly by the API.
---------
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Add these INFO metrics:
* client_query_buffer_limit_disconnections
* client_output_buffer_limit_disconnections
Sometimes it is useful to monitor whether clients reaches size limit of
query buffer and output buffer, to decide whether we need to adjust the
buffer size limit or reduce client query payload.
This test failed several times:
```
*** [err]: LATENCY GRAPH can output the event graph in tests/unit/latency-monitor.tcl
Expected '478' to be more than or equal to '500' (context: type eval
line 8 cmd {assert_morethan_equal $high 500} proc ::test)
```
Not sure why, adding some verbose printing that'll print the command
result on the next time.
BITCOUNT and BITPOS with non-existing key will return 0 even the
arguments are error, before this commit:
```
> flushall
OK
> bitcount s 0
(integer) 0
> bitpos s 0 0 1 hello
(integer) 0
> set s 1
OK
> bitcount s 0
(error) ERR syntax error
> bitpos s 0 0 1 hello
(error) ERR syntax error
```
The reason is that we judged non-existing before parameter checking and
returned. This PR fixes it, and after this commit:
```
> flushall
OK
> bitcount s 0
(error) ERR syntax error
> bitpos s 0 0 1 hello
(error) ERR syntax error
```
Also BITPOS made the same fix as #12394, check for wrong argument, before
checking for key.
```
> lpush mylist a b c
(integer) 3
> bitpos mylist 1 a b
(error) WRONGTYPE Operation against a key holding the wrong kind of value
```
Generally, In any command we first check for the argument and then check if key exist.
Some of the examples are
```
127.0.0.1:6379> getrange no-key invalid1 invalid2
(error) ERR value is not an integer or out of range
127.0.0.1:6379> setbit no-key 1 invalid
(error) ERR bit is not an integer or out of range
127.0.0.1:6379> xrange no-key invalid1 invalid2
(error) ERR Invalid stream ID specified as stream command argument
```
**Before change**
```
bitcount no-key invalid1 invalid2
0
```
**After change**
```
bitcount no-key invalid1 invalid2
(error) ERR value is not an integer or out of range
```
In our test case, now we missed some test coverage for client sub-commands.
This pr goal is to add some test coverage cases of the following commands:
Client caching
Client kill
Client no-evict
Client pause
Client reply
Client tracking
Client setname
At the very least, this is useful to make sure there are no leaks and crashes in these code paths.
When adding a new ACL rule was added, an attempt was made to remove
any "overlapping" rules. However, there when a match was found, the search
was not resumed at the right location, but instead after the original position of
the original command.
For example, if the current rules were `-config +config|get` and a rule `+config`
was added. It would identify that `-config` was matched, but it would skip over
`+config|get`, leaving the compacted rule `-config +config`. This would be evaluated
safely, but looks weird.
This bug can only be triggered with subcommands, since that is the only way to
have sequential matching rules. Resolves#12470. This is also only present in 7.2.
I think there was also a minor risk of removing another valid rule, since it would start
the search of the next command at an arbitrary point. I couldn't find a valid offset that
would have cause a match using any of the existing commands that have subcommands
with another command.
Fix the assertion when a busy script (timeout) signal ready keys (like LPUSH),
and then an arbitrary client's `allow-busy` command steps into `handleClientsBlockedOnKeys`
try wake up clients blocked on keys (like BLPOP).
Reproduction process:
1. start a redis with aof
`./redis-server --appendonly yes`
2. exec blpop
`127.0.0.1:6379> blpop a 0`
3. use another client call a busy script and this script push the blocked key
`127.0.0.1:6379> eval "redis.call('lpush','a','b') while(1) do end" 0`
4. user a new client call an allow-busy command like auth
`127.0.0.1:6379> auth a`
BTW, this issue also break the atomicity of script.
This bug has been around for many years, the old versions only have the
atomic problem, only 7.0/7.2 has the assertion problem.
Co-authored-by: Oran Agra <oran@redislabs.com>
This PR mainly fixes a possible integer overflow in `json_append_string()`.
When we use `cjson.encoding()` to encode a string larger than 2GB, at specific
compilation flags, an integer overflow may occur leading to truncation, resulting
in the part of the string larger than 2GB not being encoded.
On the other hand, this overflow doesn't cause any read or write out-of-range or segment fault.
1) using -O0 for lua_cjson (`make LUA_DEBUG=yes`)
In this case, `i` will overflow and leads to truncation.
When `i` reaches `INT_MAX+1` and overflows to INT_MIN, when compared to
len, `i` (1000000..00) is expanded to 64 bits signed integer (1111111.....000000) .
At this point i will be greater than len and jump out of the loop, so `for (i = 0; i < len; i++)`
will loop up to 2^31 times, and the part of larger than 2GB will be truncated.
```asm
`i` => -0x24(%rbp)
<+253>: addl $0x1,-0x24(%rbp) ; overflow if i large than 2^31
<+257>: mov -0x24(%rbp),%eax
<+260>: movslq %eax,%rdx ; move a 32-bit value with sign extension into a 64-bit signed
<+263>: mov -0x20(%rbp),%rax
<+267>: cmp %rax,%rdx ; check `i < len`
<+270>: jb 0x212600 <json_append_string+148>
```
2) using -O2/-O3 for lua_cjson (`make LUA_DEBUG=no`, **the default**)
In this case, because singed integer overflow is an undefined behavior, `i` will not overflow.
`i` will be optimized by the compiler and use 64-bit registers for all subsequent instructions.
```asm
<+180>: add $0x1,%rbx ; Using 64-bit register `rbx` for i++
<+184>: lea 0x1(%rdx),%rsi
<+188>: mov %rsi,0x10(%rbp)
<+192>: mov %al,(%rcx,%rdx,1)
<+195>: cmp %rbx,(%rsp) ; check `i < len`
<+199>: ja 0x20b63a <json_append_string+154>
```
3) using 32bit
Because `strbuf_ensure_empty_length()` preallocates memory of length (len * 6 + 2),
in 32-bit `cjson.encode()` can only handle strings smaller than ((2 ^ 32) - 3 ) / 6.
So 32bit is not affected.
Also change `i` in `strbuf_append_string()` to `size_t`.
Since its second argument `str` is taken from the `char2escape` string array which is never
larger than 6, so `strbuf_append_string()` is not at risk of overflow (the bug was unreachable).
When doing merge selector, we should check whether the merge
has started (i.e., whether open_bracket_start is -1) every time.
Otherwise, encountering an illegal selector pattern could succeed
and also cause memory leaks, for example:
```
acl setuser test1 (+PING (+SELECT (+DEL )
```
The above would leak memory and succeed with only DEL being applied,
and would now error after the fix.
Co-authored-by: Oran Agra <oran@redislabs.com>
Additional test coverage for incr/decr operation.
integer number could be present in raw encoding format due to operation like append. A incr/decr operation following it optimize the string to int encoding format.