Background:
Following the upgrade to jemalloc 5.2, there was a test that used to be flaky and
started failing consistently (on 32bit), so we disabled it (see #9645).
This is a test that i introduced in #7289 when i attempted to solve a rare stagnation
problem, and it later turned out i failed to solve it, ans what's more i added a test that
caused it to be not so rare, and as i mentioned, now in jemalloc 5.2 it became consistent on 32bit.
Stagnation can happen when all the slabs of the bin are equally utilized, so the decision
to move an allocation from a relatively empty slab to a relatively full one, will never
happen, and in that test all the slabs are at 50% utilization, so the defragger could just
keep scanning the keyspace and not move anything.
What this PR changes:
* First, finally in jemalloc 5.2 we have the count of non-full slabs, so when we compare
the utilization of the current slab, we can compare it to the average utilization of the non-full
slabs in our bin, instead of the total average of our bin. this takes the full slabs out of the game,
since they're not candidates for migration (neither source nor target).
* Secondly, We add some 12% (100/8) to the decision to defrag an allocation, this is the part
that aims to avoid stagnation, and it's especially important since the above mentioned change
can get us closer to stagnation.
* Thirdly, since jemalloc 5.2 adds sharded bins, we take into account all shards (something
that's missing from the original PR that merged it), this isn't expected to make any difference
since anyway there should be just one shard.
How this was benchmarked.
What i did was run the memefficiency test unit with `--verbose` and compare the defragger hits
and misses the tests reported.
At first, when i took into consideration only the non-full slabs, it got a lot worse (i got into
stagnation, or just got a lot of misses and a lot of hits), but when i added the 10% i got back
to results that were slightly better than the ones of the jemalloc 5.1 branch. i.e. full defragmentation
was achieved with fewer hits (relocations), and fewer misses (keyspace scans).
Recently we started using list-compress-depth in tests (was completely untested till now).
Turns this triggered test failures with the external mode, since the tests left the setting enabled
and then it was used in other tests (specifically the fuzzer named "Stress tester for #3343-alike bugs").
This PR fixes the issue of the `recompress` flag being left set by mistake, which caused the code to
later to compress the head or tail nodes (which should never be compressed)
The solution is to reset the recompress flag when it should have been (when it was decided not to compress).
Additionally we're adding some assertions and improve the tests so in order to catch other similar bugs.
Currently PING returns different status when server is not serving data,
for example when `LOADING` or `BUSY`.
But same was not true for `MASTERDOWN`
This commit makes PING reply with `MASTERDOWN` when
replica-serve-stale-data=no and link is MASTER is down.
Drop the STRALGO command, now LCS is a command of its own and it only works on keys (not input strings).
The motivation is that STRALGO's syntax was really messed-up...
- assumes all (future) string algorithms will take similar arguments
- mixes command that takes keys and one that doesn't in the same command.
- make it nearly impossible to expose the right key spec in COMMAND INFO (issues cluster clients)
- hard for cluster clients to determine the key names (firstkey, lastkey, etc)
- hard for ACL / flags (is it a read command?)
This is a breaking change.
Moves ZPOP ... 0 fast exit path after type check to reply with
WRONGTYPE. In the past it will return an empty array.
Also now count is not allowed to be negative.
see #9680
before:
```
127.0.0.1:6379> set zset str
OK
127.0.0.1:6379> zpopmin zset 0
(empty array)
127.0.0.1:6379> zpopmin zset -1
(empty array)
```
after:
```
127.0.0.1:6379> set zset str
OK
127.0.0.1:6379> zpopmin zset 0
(error) WRONGTYPE Operation against a key holding the wrong kind of value
127.0.0.1:6379> zpopmin zset -1
(error) ERR value is out of range, must be positive
```
Add --accurate to unit tests (new feature recently added)
Add --no-latency to valgrind run (was present only for modules)
add --no-latency to macos and freebsd runs (was not present for modules)
add --timeout to freebsd (same one we have for valgrind)
in `scan 0 match ""` case, pat is empty sds(patlen is 0), I don't think should access the
first character directly in this case(even though the first character is ' \0 '), for the
code readability, I switch the two positions of judgment logic.
Redis supports inserting data over 4GB into string (and recently for lists too, see #9357),
But LZF compression used in RDB files (see `rdbcompression` config), and in quicklist
(see `list-compress-depth` config) does not support compress/decompress data over
UINT32_MAX, which will result in corrupting the rdb after compression.
Internal changes:
1. Modify the `unsigned int` parameter of `lzf_compress/lzf_decompress` to `size_t`.
2. Modify the variable types in `lzf_compress` involving offsets and lengths to `size_t`.
3. Set LZF_USE_OFFSETS to 0.
When LZF_USE_OFFSETS is 1, lzf store offset into `LZF_HSLOT`(32bit).
Even in 64-bit, `LZF_USE_OFFSETS` defaults to 1, because lzf assumes that it only
compresses and decompresses data smaller than UINT32_MAX.
But now we need to make lzf support 64-bit, turning on `LZF_USE_OFFSETS` will make
it impossible to store 64-bit offsets or pointers.
BTW, disable LZF_USE_OFFSETS also brings a few performance improvements.
Tests:
1. Add test for compress/decompress string large than UINT32_MAX.
2. Add unittest for compress/decompress quicklistNode.
The client flags is a 64 bit integer, but the temporary cached value on the stack of call() is 32 bit.
luckily this doesn't lead to any bugs since the only flags used against this variables are below 32 bit.
Two issues:
1. In many tests we simply forgot to close the connections we created, which doesn't matter for normal tests where the server is killed, but creates a leak on external server tests.
2. When calling `start_server` on external test we create a fresh connection instead of really starting a new server, but never clean it at the end.
I have seen this CI failure twice on MacOS:
*** [err]: PEXPIRE/PSETEX/PEXPIREAT can set sub-second expires in tests/unit/expire.tcl
Expected 'somevalue {} somevalue {} somevalue {}' to equal or match '{} {} {} {} somevalue {}'
I did some loop test in my own daily CI, the results show that is
not particularly stable. Change the threshold from 30 to 50.
- Added sanitizer support. `address`, `undefined` and `thread` sanitizers are available.
- To build Redis with desired sanitizer : `make SANITIZER=undefined`
- There were some sanitizer findings, cleaned up codebase
- Added tests with address and undefined behavior sanitizers to daily CI.
- Added tests with address sanitizer to the per-PR CI (smoke out mem leaks sooner).
Basically, there are three types of issues :
**1- Unaligned load/store** : Most probably, this issue may cause a crash on a platform that
does not support unaligned access. Redis does unaligned access only on supported platforms.
**2- Signed integer overflow.** Although, signed overflow issue can be problematic time to time
and change how compiler generates code, current findings mostly about signed shift or simple
addition overflow. For most platforms Redis can be compiled for, this wouldn't cause any issue
as far as I can tell (checked generated code on godbolt.org).
**3 -Minor leak** (redis-cli), **use-after-free**(just before calling exit());
UB means nothing guaranteed and risky to reason about program behavior but I don't think any
of the fixes here worth backporting. As sanitizers are now part of the CI, preventing new issues
will be the real benefit.
On test failure store the external redis server logs as CI artifacts so we can review them.
Write test name to server log for external server tests.
This is attempted and silently failed in case external server doesn't support it.
Note that in non-external server mode we use a more robust method of writing to the log which doesn't depend on the
server actually running/working. This isn't possible for externl servers and required for some complex tests which are
skipped in external mode anyway.
Cleanup: remove dup code.
First, avoid using --accurate on the freebsd CI, we only care about
systematic issues there due to being different platform, but not
accuracy
Secondly, when looking at the test which timed out it seems silly and
outdated:
- it used KEYS to attempt to trigger lazy expiry, but KEYS doesn't do
that anymore.
- it used some hard coded sleeps rather than waiting for things to
happen and exiting ASAP
We saw some tests sporadically time out on valgrind (namely the ones
from #9323).
Increasing valgrind timeout from 20 mins to 40 mins in CI.
And fixing an outdated help message.
In both tests, "diskless loading short read" and "diskless loading short read with module",
the timeout of waiting for the replica to respond to a short read and log it, is too short.
Also, add --dump-logs in runtest-moduleapi for valgrind runs.
During diskless replication, the check for broken EOF mark is misplaced
and should be earlier. Now we do not swap db, we do proper cleanup and
correctly raise module events on this kind of failure.
This issue existed prior to #9323, but before, the side effect was not restoring
backup and not raising the correct module events on this failure.
* Clean up EINTR handling so EINTR will not change connection state to begin with.
* On TLS, catch EINTR and return it as-is before going through OpenSSL error handling (which seems to not distinguish it from EAGAIN).
Optimized port detection for tcl, use 'socket -server' instead of 'socket' to rule out port on TIME_WAIT
Co-authored-by: chendianqiang <chendianqiang@meituan.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
This refactors all `CONFIG SET`s and conf file loading arguments go through
the generic config handling interface.
Refactoring changes:
- All config params go through the `standardConfig` interface (some stuff which
is only related to the config file and not the `CONFIG` command still has special
handling for rewrite/config file parsing, `loadmodule`, for example.) .
- Added `MULTI_ARG_CONFIG` flag for configs to signify they receive a variable
number of arguments instead of a single argument. This is used to break up space
separated arguments to `CONFIG SET` so the generic setter interface can pass
multiple arguments to the setter function. When parsing the config file we also break
up anything after the config name into multiple arguments to the setter function.
Interface changes:
- A side effect of the above interface is that the `bind` argument in the config file can
be empty (no argument at all) this is treated the same as passing an single empty
string argument (same as `save` already used to work).
- Support rewrite and setting `watchdog-period` from config file (was only supported
by the CONFIG command till now).
- Another side effect is that the `save T X` config argument now supports multiple
Time-Changes pairs in a single line like its `CONFIG SET` counterpart. So in the
config file you can either do:
```
save 3600 1
save 600 10
```
or do
```
save 3600 1 600 10
```
Co-authored-by: Bjorn Svensson <bjorn.a.svensson@est.tech>
For diskless replication in swapdb mode, considering we already spend replica memory
having a backup of current db to restore in case of failure, we can have the following benefits
by instead swapping database only in case we succeeded in transferring db from master:
- Avoid `LOADING` response during failed and successful synchronization for cases where the
replica is already up and running with data.
- Faster total time of diskless replication, because now we're moving from Transfer + Flush + Load
time to Transfer + Load only. Flushing the tempDb is done asynchronously after swapping.
- This could be implemented also for disk replication with similar benefits if consumers are willing
to spend the extra memory usage.
General notes:
- The concept of `backupDb` becomes `tempDb` for clarity.
- Async loading mode will only kick in if the replica is syncing from a master that has the same
repl-id the one it had before. i.e. the data it's getting belongs to a different time of the same timeline.
- New property in INFO: `async_loading` to differentiate from the blocking loading
- Slot to Key mapping is now a field of `redisDb` as it's more natural to access it from both server.db
and the tempDb that is passed around.
- Because this is affecting replicas only, we assume that if they are not readonly and write commands
during replication, they are lost after SYNC same way as before, but we're still denying CONFIG SET
here anyways to avoid complications.
Considerations for review:
- We have many cases where server.loading flag is used and even though I tried my best, there may
be cases where async_loading should be checked as well and cases where it shouldn't (would require
very good understanding of whole code)
- Several places that had different behavior depending on the loading flag where actually meant to just
handle commands coming from the AOF client differently than ones coming from real clients, changed
to check CLIENT_ID_AOF instead.
**Additional for Release Notes**
- Bugfix - server.dirty was not incremented for any kind of diskless replication, as effect it wouldn't
contribute on triggering next database SAVE
- New flag for RM_GetContextFlags module API: REDISMODULE_CTX_FLAGS_ASYNC_LOADING
- Deprecated RedisModuleEvent_ReplBackup. Starting from Redis 7.0, we don't fire this event.
Instead, we have the new RedisModuleEvent_ReplAsyncLoad holding 3 sub-events: STARTED,
ABORTED and COMPLETED.
- New module flag REDISMODULE_OPTIONS_HANDLE_REPL_ASYNC_LOAD for RedisModule_SetModuleOptions
to allow modules to declare they support the diskless replication with async loading (when absent, we fall
back to disk-based loading).
Co-authored-by: Eduardo Semprebon <edus@saxobank.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Introduced in #8179, this fixes the command's replies in the 0 count edge case.
[BREAKING] changes the reply type when count is 0 to an empty array (instead of nil)
Moves LPOP ... 0 fast exit path after type check to reply with WRONGTYPE
When repl-diskless-load is enabled, the connection is set to the blocking state.
The connection may be interrupted by a signal during a system call.
This would have resulted in a disconnection and possibly a reconnection loop.
Co-authored-by: Oran Agra <oran@redislabs.com>
there was a chance that by the time the assertion is executed,
the replica already manages to reconnect.
now we make sure the replica is unable to re-connect to the master.
additionally, we wait for some gossip from the disconnected replica,
to see that it doesn't mess things up.
unrelated: fix a typo when trying to exhaust the backlog, one that
didn't have any harmful implications
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Redis lists are stored in quicklist, which is currently a linked list of ziplists.
Ziplists are limited to storing elements no larger than 4GB, so when bigger
items are added they're getting truncated.
This PR changes quicklists so that they're capable of storing large items
in quicklist nodes that are plain string buffers rather than ziplist.
As part of the PR there were few other changes in redis:
1. new DEBUG sub-commands:
- QUICKLIST-PACKED-THRESHOLD - set the threshold of for the node type to
be plan or ziplist. default (1GB)
- QUICKLIST <key> - Shows low level info about the quicklist encoding of <key>
2. rdb format change:
- A new type was added - RDB_TYPE_LIST_QUICKLIST_2 .
- container type (packed / plain) was added to the beginning of the rdb object
(before the actual node list).
3. testing:
- Tests that requires over 100MB will be by default skipped. a new flag was
added to 'runtest' to run the large memory tests (not used by default)
Co-authored-by: sundb <sundbcn@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Add new no-mandatory-keys flag to support COMMAND GETKEYS of commands
which have no mandatory keys.
In the past we would have got this error:
```
127.0.0.1:6379> command getkeys eval "return 1" 0
(error) ERR Invalid arguments specified for command
```
When using SETNX and SETXX we could end up doing key lookup twice.
This presents a small inefficiency price.
Also once we have statistics of write hit and miss they'll be wrong (recording the same key hit twice)
The issue was that setting maxmemory to used_memory and expecting
eviction is insufficient, since we need to take
mem_not_counted_for_evict into consideration.
This test got broken by #9166
The External tests started failing recently for unclear reason:
```
*** [err]: Tracking invalidation message of eviction keys should be before response in tests/unit/tracking.tcl
Expected '0' to be equal to 'invalidate volatile-key' (context: type eval line 21 cmd {assert_equal $res {invalidate volatile-key}} proc ::test)
```
I suspect the issue is that the used_memory sample is taken while a lazy free is still being processed.
Since the loop in incrementalTrimReplicationBacklog checks the size of histlen,
we cannot afford to update it only when the loop exits, this may cause deleting
much more replication blocks, and replication backlog may be less than setting size.
introduce in #9166
Co-authored-by: sundb <sundbcn@gmail.com>
After PR #9166 , replication backlog is not a real block of memory, just contains a
reference points to replication buffer's block and the blocks index (to accelerate
search offset when partial sync), so we need update both replication buffer's block's
offset and replication backlog blocks index's offset when master restart from RDB,
since the `server.master_repl_offset` is changed.
The implications of this bug was just a slow search, but not a replication failure.
So it looks like sampling set loglines [count_log_lines -2] was
executed too late, and the replication managed to complete before that.
```
*** [err]: diskless no replicas drop during rdb pipe in tests/integration/replication.tcl
log message of '"*Diskless rdb transfer, done reading from pipe, 2 replicas still up*"' not found in ./tests/tmp/server.6124.69/stdout after line: 52 till line: 52
```
Changes:
1. when we search the master log file, we start to search from before we sent the REPLICAOF
command, to prevent a race in which the replication completed before we sampled the log line count.
2. we don't need to sample the replica loglines sine it's a fresh resplica that's just been started, so the message
we're looking for is the first occurrence in the log, we can start search from 0.
Co-authored-by: Oran Agra <oran@redislabs.com>
Test failed on freebsd:
```
*** [err]: Make the old master a replica of the new one and check conditions in tests/integration/psync2-pingoff.tcl
Expected '162' to be equal to '176' (context: type eval line 18 cmd {assert_equal [status $R(0) master_repl_offset] [status $R(1) master_repl_offset]} proc ::test)
```
There are two possible race conditions in the test.
1. The code waits for sync_full to increment, and assumes that means the
master did the fork. But in fact there are cases the master will increment
that sync_full counter (after replica asks for sync), but will see that
there's already a fork running and will delay the fork creation.
In this case the INCR will be executed before the fork happens, so it'll
not be in the command stream. Solve that by waiting for `master_link_status: up`
on the replica before the INCR.
2. The repl-ping-replica-period is still high (1 second), so there's a chance the
master will send an additional PING between the two calls to INFO (the line that
fails is the one that samples INFO from both servers). So there's a chance one of
them will have an incremented offset due to PING and the other won't have it yet.
In theory we can wait for the repl_offset to match, but then we risk facing a
situation where that race will hide an offset mis-match. so instead, i think we
should just change repl-ping-replica-period to prevent further pings from being pushed.
Co-authored-by: Oran Agra <oran@redislabs.com>
The module test in reply.tcl was introduced by #8521 but didn't run until recently (see #9639)
and then it started failing with valgrind.
This is because valgrind uses 64 bit long double (unlike most other platforms that have at least 80 bits)
But besides valgrind, the tests where also incompatible with ARM32, which also uses 64 bit long doubles.
We now use appropriate value to avoid issues with either valgrind or ARM32
In all the double tests, i use 3.141, which is safe since since addReplyDouble uses
`%.17Lg` which is able to represent this value without adding any digits due to precision loss.
In the long double, since we use `%.17Lf` in ld2string, it preserves 17 significant
digits, rather than 17 digit after the decimal point (like in `%.17Lg`).
So to make these similar, i use value lower than 1 (no digits left of
the period)
Lastly, we have the same issue with TCL (no long doubles) so we read
raw protocol in that test.
Note that the only error before this fix (in both valgrind and ARM32 is this:
```
*** [err]: RM_ReplyWithLongDouble: a float reply in tests/unit/moduleapi/reply.tcl
Expected '3.141' to be equal to '3.14100000000000001' (context: type eval line 2 cmd {assert_equal 3.141 [r rw.longdouble 3.141]} proc ::test)
```
so the changes to debug.c and scripting.tcl aren't really needed, but i consider them a cleanup
(i.e. scripting.c validated a different constant than the one that's sent to it from debug.c).
Another unrelated change is to add the RESP version to the repeated tests in reply.tcl