Authenticated users can use string matching commands with a
specially crafted pattern to trigger a denial-of-service attack on Redis,
causing it to hang and consume 100% CPU time.
Co-authored-by: Tom Levy <tomlevy93@gmail.com>
Avoid calling unwatchAllKeys when running touchAllWatchedKeysInDb (which was unnecessary)
This can potentially lead to use-after-free and memory corruption when the next entry
pointer held by the watched keys iterator is freed when unwatching all keys of a specific client.
found with address sanitizer, added a test which will not always fail (depending on the random
dict hashing seed)
problem introduced in #9829 (Reids 7.0)
Co-authored-by: Oran Agra <oran@redislabs.com>
Currently there is no BUG. However during some internal code changes
I found that it can happen (for example in case new code will not update
the buf_peak) which can currently lead to memory overrun which is much
harder to detect and root cause.
Why did I please the assert here? The reason is to be able to have the
buf_peak value without the risk of it being overriden by the peak_reset
The test sporadically failed with valgrind trying to match
`no client named obuf-client1 found*`
in the log it looks like `obuf-client1` was indeed dropped,
so i'm guessing it's because CLIENT LIST was processed first.
When no-touch mode is enabled, the client will not touch LRU/LFU of the
keys it accesses, except when executing command `TOUCH`.
This allows inspecting or modifying the key-space without affecting their eviction.
Changes:
- A command `CLIENT NO-TOUCH ON|OFF` to switch on and off this mode.
- A client flag `#define CLIENT_NOTOUCH (1ULL<<45)`, which can be shown
with `CLIENT INFO`, by the letter "T" in the "flags" field.
- Clear `NO-TOUCH` flag in `clearClientConnectionState`, which is used by `RESET`
command and resetting temp clients used by modules.
- Also clear `NO-EVICT` flag in `clearClientConnectionState`, this might have been an
oversight, spotted by @madolson.
- A test using `DEBUG OBJECT` command to verify that LRU stat is not touched when
no-touch mode is on.
Co-authored-by: chentianjie <chentianjie@alibaba-inc.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Co-authored-by: sundb <sundbcn@gmail.com>
We noticed that `client evicted due to client tracking prefixes`
takes over 200 seconds with valgrind.
We combine three prefixes in each command, this will probably
save us half the testing time.
Before: normal: 3508ms, valgrind: 289503ms -> 290s
With three prefixes, normal: 1500ms, valgrind: 135742ms -> 136s
Since we did not actually count the memory usage of all prefixes, see
getClientMemoryUsage, so we can not use larger prefixes to speed up the
test here. Also this PR cleaned up some spaces (IDE jobs) and typos.
There is a built in limit to client side tracking keys, which when exceeded will invalidate keys. This occurs in two places, one in the server cron and other before executing a command. If it happens in the second scenario, the invalidations will be queued for later since current client is set. This queue is never drained if a command is not executed (through call) such as a multi-exec command getting queued. This results in a later server assert crashing.
A simple HELLO command to a password protected Redis server replies
with an error with another command suggestion. This omits protocol version
from HELLO command arguments which causes another error.
This PR adds the protocol version in the command suggestion.
This test case is to cover a edge scenario: when a writable replica enabled AOF
at the same time, active expiry keys which was created in writable replicas should
propagate to the AOF file, and some versions might crash (fixed by #11615).
For details, please refer to #11778
stabilize the test introduced in #11248
* remove random aspect of the test by using DEBUG POPULATE instead of redis-benchmark
* disable rdbcompression, so that the rdb file is always about 1GB.
when fadvise was disabled, i get about 1GB in the page cace
when enabled i get less than 200KB
so for now, i'll keep the 500kb threshold.
We have cases where we print information (might be important but by
no means an error indicator) with the LL_WARNING level.
Demoting these to LL_NOTICE:
- oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
- User requested shutdown...
This is also true for cases that we encounter a rare but normal situation.
Demoting to LL_NOTICE. Examples:
- AOF was enabled but there is already another background operation. An AOF background was scheduled to start when possible.
- Connection with master lost.
base on yoav-steinberg's https://github.com/redis/redis/pull/10650#issuecomment-1112280554
and yossigo's https://github.com/redis/redis/pull/10650#pullrequestreview-967677676
* Make it clear that current_client is the root client that was called by
external connection
* add executing_client which is the client that runs the current command
(can be a module or a script)
* Remove script_caller that was used for commands that have CLIENT_SCRIPT
to get the client that called the script. in most cases, that's the current_client,
and in others (when being called from a module), it could be an intermediate
client when we actually want the original one used by the external connection.
bugfixes:
* RM_Call with C flag should log ACL errors with the requested user rather than
the one used by the original client, this also solves a crash when RM_Call is used
with C flag from a detached thread safe context.
* addACLLogEntry would have logged info about the script_caller, but in case the
script was issued by a module command we actually want the current_client. the
exception is when RM_Call is called from a timer event, in which case we don't
have a current_client.
behavior changes:
* client side tracking for scripts now tracks the keys that are read by the script
instead of the keys that are declared by the caller for EVAL
other changes:
* Log both current_client and executing_client in the crash log.
* remove prepareLuaClient and resetLuaClient, being dead code that was forgotten.
* remove scriptTimeSnapshot and snapshot_time and instead add cmd_time_snapshot
that serves all commands and is reset only when execution nesting starts.
* remove code to propagate CLIENT_FORCE_REPL from the executed command
to the script caller since scripts aren't propagated anyway these days and anyway
this flag wouldn't have had an effect since CLIENT_PREVENT_PROP is added by scriptResetRun.
* fix a module GIL violation issue in afterSleep that was introduced in #10300 (unreleased)
In #9373, actually need to replace `$rd $pop blist1{t} blist2{t} 1`
with `bpop_command_two_key $rd $pop blist1{t} blist2{t} 1` but forgot
to delete the latter.
This doesn't affect the test, because the later assert_error "WRONGTYPE"
is expected (and right). And if we read $rd again, it will get the
wrong result, like 'ERR unknown command 'BLMPOP_LEFT' | 'BLMPOP_RIGHT'
In this PR, we use function pointer *isPresent replace the variable "present" in auxFieldHandler, so that in the future, when we have more aux fields, we could decide if the aux field is displayed or not.
All of the POP commands must not decr length below 0.
So, get_fsl will delete the key if the length is 0 (unless
the caller wished to create if doesn't exist)
Other:
1. Use REDISMODULE_WRITE where needed (POP commands)
2. Use wait_for_blokced_clients in test
Unrelated:
Use quotes instead of curly braces in zset.tcl, for variable expansion
Starting from Redis 7.0 (#9890) we started wrapping everything a command
propagates with MULTI/EXEC. The problem is that both SCAN and RANDOMKEY can
lazy-expire arbitrary keys (similar behavior to active-expire), and put DELs in a transaction.
Fix: When these commands are called without a parent exec-unit (e.g. not in EVAL or
MULTI) we avoid wrapping their DELs in a transaction (for the same reasons active-expire
and eviction avoids a transaction)
This PR adds a per-command flag that indicates that the command may touch arbitrary
keys (not the ones in the arguments), and uses that flag to avoid the MULTI-EXEC.
For now, this flag is internal, since we're considering other solutions for the future.
Note for cluster mode: if SCAN/RANDOMKEY is inside EVAL/MULTI it can still cause the
same situation (as it always did), but it won't cause a CROSSSLOT because replicas and AOF
do not perform slot checks.
The problem with the above is mainly for 3rd party ecosystem tools that propagate commands
from master to master, or feed an AOF file with redis-cli into a master.
This PR aims to fix the regression in redis 7.0, and we opened #11792 to try to handle the
bigger problem with lazy expire better for another release.
# Background
The RDB file is usually generated and used once and seldom used again, but the content would reside in page cache until OS evicts it. A potential problem is that once the free memory exhausts, the OS have to reclaim some memory from page cache or swap anonymous page out, which may result in a jitters to the Redis service.
Supposing an exact scenario, a high-capacity machine hosts many redis instances, and we're upgrading the Redis together. The page cache in host machine increases as RDBs are generated. Once the free memory drop into low watermark(which is more likely to happen in older Linux kernel like 3.10, before [watermark_scale_factor](https://lore.kernel.org/lkml/1455813719-2395-1-git-send-email-hannes@cmpxchg.org/) is introduced, the `low watermark` is linear to `min watermark`, and there'is not too much buffer space for `kswapd` to be wake up to reclaim memory), a `direct reclaim` happens, which means the process would stall to wait for memory allocation.
# What the PR does
The PR introduces a capability to reclaim the cache when the RDB is operated. Generally there're two cases, read and write the RDB. For read it's a little messy to address the incremental reclaim, so the reclaim is done in one go in background after the load is finished to avoid blocking the work thread. For write, incremental reclaim amortizes the work of reclaim so no need to put it into background, and the peak watermark of cache can be reduced in this way.
Two cases are addresses specially, replication and restart, for both of which the cache is leveraged to speed up the processing, so the reclaim is postponed to a right time. To do this, a flag is added to`rdbSave` and `rdbLoad` to control whether the cache need to be kept, with the default value false.
# Something deserve noting
1. Though `posix_fadvise` is the POSIX standard, but only few platform support it, e.g. Linux, FreeBSD 10.0.
2. In Linux `posix_fadvise` only take effect on writeback-ed pages, so a `sync`(or `fsync`, `fdatasync`) is needed to flush the dirty page before `posix_fadvise` if we reclaim write cache.
# About test
A unit test is added to verify the effect of `posix_fadvise`.
In integration test overall cache increase is checked, as well as the cache backed by RDB as a specific TCL test is executed in isolated Github action job.
The PR adds support for the following flags on RedisModule_OpenKey:
* REDISMODULE_OPEN_KEY_NONOTIFY - Don't trigger keyspace event on key misses.
* REDISMODULE_OPEN_KEY_NOSTATS - Don't update keyspace hits/misses counters.
* REDISMODULE_OPEN_KEY_NOEXPIRE - Avoid deleting lazy expired keys.
* REDISMODULE_OPEN_KEY_NOEFFECTS - Avoid any effects from fetching the key
In addition, added `RM_GetOpenKeyModesAll`, which returns the mask of all
supported OpenKey modes. This allows the module to check, in runtime, which
OpenKey modes are supported by the current Redis instance.
Return an error when loadAppendOnlyFiles fails instead of
exiting. DEBUF LOADAOF command is only meant to be used by
the test suite, and only by tests that generated an AOF file
first. So this change is ok (considering that the caller is
likely to catch this error and die).
This actually revert part of the code in #9012, and now
DEBUG LOADAOF behaves the same as DEBUG RELOAD (returns an
error when the load fails).
Plus remove a `after 2000` in a test, which can save times (looks like copy paste error).
If we have integer scores on the sorted set we're not using the fastest way
to reply by calling `d2string` which uses `double2ll` and `ll2string` when it can,
instead of `fpconv_dtoa`.
This results by some 50% performance improvement in certain cases of integer
scores for both RESP2 and RESP3, and no apparent impact on double scores.
Co-authored-by: Oran Agra <oran@redislabs.com>
Failure happens in FreeBSD daily:
```
*** [err]: Test replication with parallel clients writing in different DBs in tests/integration/replication-4.tcl
Expected [::redis::redisHandle2 dbsize] > 0 (context: type eval line 19 cmd {assert {[$master dbsize] > 0}} proc ::test)
```
The test is failing because db 9 has no data (default db), and
according to the log, we can see that db 9 does not have a key:
```
### Starting test Test replication with parallel clients writing in different DBs in tests/integration/replication-4.tcl
3338:S 03 Feb 2023 00:15:18.723 - DB 11: 1 keys (0 volatile) in 4 slots HT.
3338:S 03 Feb 2023 00:15:18.723 - DB 12: 141 keys (0 volatile) in 256 slots HT.
```
We use `wait_for_condition` to ensure that parallel clients have
written data before calling stop_bg_complex_data. At the same time,
`wait_for_condition` is also used to remove the above `after 1000`,
which can save time in most cases.
There is a timing issue in the new ACL log test:
```
*** [err]: ACL LOG aggregates similar errors together and assigns unique entry-id to new errors in tests/unit/acl.tcl
Expected 1675382873989 < 1675382873989 (context: type eval line 15 cmd {assert {$timestamp_last_update_original < $timestamp_last_updated_after_update}} proc ::test)
```
Looking at the test code, we will check the `timestamp-last-updated` before
and after a new ACL error occurs. Actually `WRONGPASS` errors can be executed
very quickly on fast machines. For example, in the this case, the execution is
completed within one millisecond.
The error is easy to reproduce, if we reduce the number of the for loops, for
example set to 2, and using --loop and --stop. Avoid this timing issue by adding
an `after 1` before the new errors.
The test was introduced in #11477.
Added 3 fields to the ACL LOG - adds entry_id, timestamp_created and timestamp_last_updated, which updates similar existing log error entries. The pair - entry_id, timestamp_created is a unique identifier of this entry, in case the node dies and is restarted, it can detect that if it's a new series.
The primary use case of Unique id is to uniquely identify the error messages and not to detect if the server has restarted.
entry-id is the sequence number of the entry (starting at 0) since the server process started. Can also be used to check if items were "lost" if they fell between periods.
timestamp-created is the unix-time in ms at the time the entry was first created.
timestamp-last-updated is the unix-time in ms at the time the entry was last updated
Time_created gives the absolute time which better accounts for network time as compared to time since. It can also be older than 60 secs and presently there is no field that can display the original time of creation once the error entry is updated.
The reason of timestamp_last_updated field is that it provides a more precise value for the “last time” an error was seen where as, presently it is only in the 60 second period.
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Currently while a sharded pubsub message publish tries to propagate the message across the cluster, a NULL check is missing for clusterLink. clusterLink could be NULL if the link is causing memory beyond the set threshold cluster-link-sendbuf-limit and server terminates the link.
This change introduces two things:
Avoids the engine crashes on the publishing node if a message is tried to be sent to a node and the link is NULL.
Adds a debugging tool CLUSTERLINK KILL to terminate the clusterLink between two nodes.
Previously, jemalloc was explicitly configured to build in `gnu99` mode. As a result, `<stdatomic.h>` was presumed to be unavailable and never used.
This commit removes explicit build flags configuration and lets `autoconf` determine the supported build flags. In addition, we also no longer build C++ jemalloc code.
Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>
Test on x86 + TLS fail with this error:
```
*** [err]: Slave is able to detect timeout during handshake in tests/integration/replication.tcl
Replica is not able to detect timeout
```
The replica logs is:
```
### Starting test Slave is able to detect timeout during handshake in tests/integration/replication.tcl
7681:S 05 Jan 2023 00:21:56.635 * Non blocking connect for SYNC fired the event.
7681:S 05 Jan 2023 00:21:56.638 * Master replied to PING, replication can continue...
7681:S 05 Jan 2023 00:21:56.638 * Trying a partial resynchronization (request ef70638885500aad12dd673c68ca1541116a59fe:1).
7681:S 05 Jan 2023 00:22:56.894 # Failed to read response from the server: error:0A000126:SSL routines::unexpected eof while reading
7681:S 05 Jan 2023 00:22:56.894 # Master did not reply to PSYNC, will try later
```
This is another issue that appeared after #11640 was merged. This PR try to fix it.
The idea is to make it stable in `wait_bgsave`, for example, it may wait until the
next psync retry in the following situation: `Master did not reply to PSYNC, will try later`
Other than that, the change will make the test more consistent / predictable since
it'll mean the master is always frozen in the desired state (waiting for repl-diskless-sync-delay
to happen, rather than earlier stages of the handshake).
In #7875 (Redis 6.2), we changed the sds alloc to be the usable allocation
size in order to:
> reduce the need for realloc calls by making the sds implicitly take over
the internal fragmentation
This change was done most sds functions, excluding `sdsRemoveFreeSpace` and
`sdsResize`, the reason is that in some places (e.g. clientsCronResizeQueryBuffer)
we call sdsRemoveFreeSpace when we see excessive free space and want to trim it.
so if we don't trim it exactly to size, the caller may still see excessive free space and
call it again and again.
However, this resulted in some excessive calls to realloc, even when there's no need
and it's gonna be a no-op (e.g. when reducing 15 bytes allocation to 13).
It turns out that a call for realloc with jemalloc can be expensive even if it ends up
doing nothing, so this PR adds a check using `je_nallocx`, which is cheap to avoid
the call for realloc.
in addition to that this PR unifies sdsResize and sdsRemoveFreeSpace into common
code. the difference between them was that sdsResize would avoid using SDS_TYPE_5,
since it want to keep the string ready to be resized again, while sdsRemoveFreeSpace
would permit using SDS_TYPE_5 and get an optimal memory consumption.
now both methods take a `would_regrow` argument that makes it more explicit.
the only actual impact of that is that in clientsCronResizeQueryBuffer we call both sdsResize
and sdsRemoveFreeSpace for in different cases, and we now prevent the use of SDS_TYPE_5 in both.
The new test that was added to cover this concern used to pass before this PR as well,
this PR is just a performance optimization and cleanup.
Benchmark:
`redis-benchmark -c 100 -t set -d 512 -P 10 -n 100000000`
on i7-9850H with jemalloc, shows improvement from 1021k ops/sec to 1067k (average of 3 runs).
some 4.5% improvement.
Co-authored-by: Oran Agra <oran@redislabs.com>
This change improves the performance of cluster slots by removing the deferring lengths that are used. Deferring lengths are used in two contexts, the first is for determining the number of replicas that serve a slot (Added in 6.2 as part of a different performance improvement) and the second is for determining the extra networking options for each node (Added in 7.0). For continuous slots, (e.g. 0-8196) this improvement is very negligible, however it becomes more significant when slots are not continuous (e.g. 0 2 4 6 etc) which can happen in production for various users.
The `cluster slots` command is deprecated in favor of `cluster shards`, but since most clients don't support the new command yet I think it's important to not degrade performance here.
Benchmarking shows about 2x improvement, however I wasn't able to get a coherent TPS number since the benchmark process was being saturated long before Redis was, so had to run with multiple benchmarks and merge results. If needed I can add this to our memtier framework. Instead the next section shows the number of usec per call from the benchmark results, which shows significant improvement as well as having a more coherent response in the CoB.
| | New Code | Old Code | % Improvements
|----|----|----- |-----
| Uniform slots| usec_per_call=10.46 | usec_per_call=11.03 | 5.7%
| Worst case (Only even slots)| usec_per_call=963.80 | usec_per_call=2950.99 | 307%
This change also removes some extra white space that I added a when making a code change for adding hostnames.
Redis 7.0 introduced new logic in expireIfNeeded() where a read-only replica would never consider a key as expired when replicating commands from the master. See acf3495. This was done by checking server.current_client with server.master. However, we should instead check for CLIENT_MASTER flag for this logic to be more robust and consistent with the rest of the Redis code base.
The command:
sentinel config set option value
and
sentinel config get option
They should include at least 4 arguments instead of 3,
This PR fixes this issue.
the only impact on the client is a different error message
If a dict has only keys, and no use of values, then a key can be stored directly in a
dict's hashtable. The key replaces the dictEntry. To distinguish between a key and
a dictEntry, we only use this optimization if the key is odd, i.e. if the key has the least
significant bit set. This is true for sds strings, since the sds header is always an odd
number of bytes.
Dict entries are used as a fallback when there is a hash collision. A special dict entry
without a value (only key and next) is used so we save one word in this case too.
This saves 24 bytes per set element for larges sets, and also gains some speed improvement
as a side effect (less allocations and cache misses).
A quick test adding 1M elements to a set using the command below resulted in memory
usage of 28.83M, compared to 46.29M on unstable.
That's 18 bytes per set element on average.
eval 'for i=1,1000000,1 do redis.call("sadd", "myset", "x"..i) end' 0
Other changes:
Allocations are ensured to have at least 8 bits alignment on all systems. This affects 32-bit
builds compiled without HAVE_MALLOC_SIZE (not jemalloc or glibc) in which Redis
stores the size of each allocation, after this change in 8 bytes instead of previously 4 bytes
per allocation. This is done so we can reliably use the 3 least significant bits in a pointer to
encode stuff.
Related to the hang reported in #11671
Currently, redis can disconnect a client due to reaching output buffer limit,
it'll also avoid feeding that output buffer with more data, but it will keep
running the loop in the command (despite the client already being marked for
disconnection)
This PR is an attempt to mitigate the problem, specifically for commands that
are easy to abuse, specifically: KEYS, HRANDFIELD, SRANDMEMBER, ZRANDMEMBER.
The RAND family of commands can take a negative COUNT argument (which is not
bound to the number of elements in the key), so it's enough to create a key
with one field, and then these commands can be used to hang redis.
For KEYS the caller can use the existing keyspace in redis (if big enough).
Authenticated users issuing specially crafted SETRANGE and SORT(_RO)
commands can trigger an integer overflow, resulting with Redis attempting
to allocate impossible amounts of memory and abort with an OOM panic.
This change increase the frequency of the failover log from 5 minutes to 10 seconds. This log is only emitted when a replica has an outstanding election is progress, and waiting 5 minutes for the next log makes debugging and alarming on the log messages too slow. It also now prints out the number of votes the replica has currently received as well as the number of votes it needs to achieve quorum so that we can track the progress if it's running slowly.
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
This PR refactors the abstraction of the dictEntry by making it opaque. This enables future optimizations of the dict implementation without affecting the code using it.
The PR contains 5 commits. More detailed commit messages are found in each commit.
* Make dictEntry opaque
* Let active expire cycle use dictScan instead of messing with internals
* activeDefragSdsDict use scan instead of iterator and drop dictSetNext
* Remove the bucket-cb from dictScan and move dictEntry defrag to dictScanDefrag
* Move stat_active_defrag_hits increment to activeDefragAlloc
This change deletes the dictGetNext and dictGetNextRef functions, so the
dict API doesn't expose the next field at all.
The bucket function in dictScan is deleted. A separate dictScanDefrag function
is added which takes a defrag alloc function to defrag-reallocate the dict entries.
"Dirty" code accessing the dict internals in active defrag is removed.
An 'afterReplaceEntry' is added to dictType, which allows the dict user
to keep the dictEntry metadata up to date after reallocation/defrag/move.
Additionally, for updating the cluster slot-to-key mapping, after a dictEntry
has been reallocated, we need to know which db a dict belongs to, so we store
a pointer to the db in a new metadata section in the dict struct, which is
a new mechanism similar to dictEntry metadata. This adds some complexity but
provides better isolation.