in the past few days i've seen two failures in the valgrind daily test.
*** [err]: slave fails full sync and diskless load swapdb recovers it in tests/integration/replication.tcl
Replica didn't get into loading mode
can't reproduce it, but i'm hoping it's just too slow (to start loading within 5 seconds)
Since we measure the COW size in this test by changing some keys and reading
the reported COW size, we need to ensure that the "dismiss mechanism" (#8974)
will not free memory and reduce the COW size.
For that, this commit changes the size of the keys to 512B (less than a page).
and because some keys may fall into the same page, we are modifying ten keys
on each iteration and check for at least 50% change in the COW size.
This is similar to the recent addition of LMPOP/BLMPOP (#9373), but zset.
Syntax for the new ZMPOP command:
`ZMPOP numkeys [<key> ...] MIN|MAX [COUNT count]`
Syntax for the new BZMPOP command:
`BZMPOP timeout numkeys [<key> ...] MIN|MAX [COUNT count]`
Some background:
- ZPOPMIN/ZPOPMAX take only one key, and can return multiple elements.
- BZPOPMIN/BZPOPMAX take multiple keys, but return only one element from just one key.
- ZMPOP/BZMPOP can take multiple keys, and can return multiple elements from just one key.
Note that ZMPOP/BZMPOP can take multiple keys, it eventually operates on just on key.
And it will propagate as ZPOPMIN or ZPOPMAX with the COUNT option.
As new commands, if we can not pop any elements, the response like:
- ZMPOP: Return a NIL in both RESP2 and RESP3, unlike ZPOPMIN/ZPOPMAX return emptyarray.
- BZMPOP: Return a NIL in both RESP2 and RESP3 when timeout is reached, like BZPOPMIN/BZPOPMAX.
For the normal response is nested arrays in RESP2 and RESP3:
```
ZMPOP/BZMPOP
1) keyname
2) 1) 1) member1
2) score1
2) 1) member2
2) score2
In RESP2:
1) "myzset"
2) 1) 1) "three"
2) "3"
2) 1) "two"
2) "2"
In RESP3:
1) "myzset"
2) 1) 1) "three"
2) (double) 3
2) 1) "two"
2) (double) 2
```
- Add `-u <uri>` command line option to support `redis://` URI scheme.
- included server connection information object (`struct cliConnInfo`),
used to describe an ip:port pair, db num user input, and user:pass to
avoid a large number of function arguments.
- Using sds on connection info strings for redis-benchmark/redis-cli
Co-authored-by: yoav-steinberg <yoav@monfort.co.il>
The main idea is how to allow a master to load replication info from RDB file when rebooting, if master can load replication info it means that replicas may have the chance to psync with master, it can save much traffic.
The key point is we need guarantee safety and consistency, so there
are two differences between master and replica:
1. master would load the replication info as secondary ID and
offset, in case other masters have the same replid.
2. when master loading RDB, it would propagate expired keys as DEL
command to replication backlog, then replica can receive these
commands to delete stale keys.
p.s. the expired keys when RDB loading is useful for users, so
we show it as `rdb_last_load_keys_expired` and `rdb_last_load_keys_loaded` in info persistence.
Moreover, after load replication info, master should update
`no_replica_time` in case loading RDB cost too long time.
Part two of implementing #8702 (zset), after #8887.
## Description of the feature
Replaced all uses of ziplist with listpack in t_zset, and optimized some of the code to optimize performance.
## Rdb format changes
New `RDB_TYPE_ZSET_LISTPACK` rdb type.
## Rdb loading improvements:
1) Pre-expansion of dict for validation of duplicate data for listpack and ziplist.
2) Simplifying the release of empty key objects when RDB loading.
3) Unify ziplist and listpack data verify methods for zset and hash, and move code to rdb.c.
## Interface changes
1) New `zset-max-listpack-entries` config is an alias for `zset-max-ziplist-entries` (same with `zset-max-listpack-value`).
2) OBJECT ENCODING will return listpack instead of ziplist.
## Listpack improvements:
1) Add `lpDeleteRange` and `lpDeleteRangeWithEntry` functions to delete a range of entries from listpack.
2) Improve the performance of `lpCompare`, converting from string to integer is faster than converting from integer to string.
3) Replace `snprintf` with `ll2string` to improve performance in converting numbers to strings in `lpGet()`.
## Zset improvements:
1) Improve the performance of `zzlFind` method, use `lpFind` instead of `lpCompare` in a loop.
2) Use `lpDeleteRangeWithEntry` instead of `lpDelete` twice to delete a element of zset.
## Tests
1) Add some unittests for `lpDeleteRange` and `lpDeleteRangeWithEntry` function.
2) Add zset RDB loading test.
3) Add benchmark test for `lpCompare` and `ziplsitCompare`.
4) Add empty listpack zset corrupt dump test.
We want to add COUNT option for BLPOP.
But we can't do it without breaking compatibility due to the command arguments syntax.
So this commit introduce two new commands.
Syntax for the new LMPOP command:
`LMPOP numkeys [<key> ...] LEFT|RIGHT [COUNT count]`
Syntax for the new BLMPOP command:
`BLMPOP timeout numkeys [<key> ...] LEFT|RIGHT [COUNT count]`
Some background:
- LPOP takes one key, and can return multiple elements.
- BLPOP takes multiple keys, but returns one element from just one key.
- LMPOP can take multiple keys and return multiple elements from just one key.
Note that LMPOP/BLMPOP can take multiple keys, it eventually operates on just one key.
And it will propagate as LPOP or RPOP with the COUNT option.
As a new command, it still return NIL if we can't pop any elements.
For the normal response is nested arrays in RESP2 and RESP3, like:
```
LMPOP/BLMPOP
1) keyname
2) 1) element1
2) element2
```
I.e. unlike BLPOP that returns a key name and one element so it uses a flat array,
and LPOP that returns multiple elements with no key name, and again uses a flat array,
this one has to return a nested array, and it does for for both RESP2 and RESP3 (like SCAN does)
Some discuss can see: #766#8824
* Delay to discard cache master when full synchronization
* Don't disconnect with replicas before loading transferred RDB when full sync
Previously, once replica need to start full synchronization with master,
it will discard cached master whatever full synchronization is failed or
not.
Now we discard cached master only when transferring RDB is finished
and start to change data space, this make replica could start partial
resynchronization with another new master if new master is failed
during full synchronization.
Until now, giving a negative index seeks from the end of a list and a
positive seeks from the beginning. This change makes it seek from
the nearest end, regardless of the sign of the given index.
quicklistIndex is used by all list commands which operate by index.
LINDEX key 999999 in a list if 1M elements is greately optimized by
this change. Latency is cut by 75%.
LINDEX key -1000000 in a list of 1M elements, likewise.
LRANGE key -1 -1 is affected by this, since LRANGE converts the
indices to positive numbers before seeking.
The tests for corrupt dumps are updated to make sure the corrup
data is seeked in the same direction as before.
1. The output of --help:
* On the Usage line, just write [OPTIONS] [COMMAND ARGS...] instead listing
only a few arbitrary options and no command.
* For --cluster, describe that if the command is supplied on the command line,
the key must contain "{tag}". Otherwise, the command will not be sent to the
right cluster node.
* For -r, add a note that if -r is omitted, all commands in a benchmark will
use the same key. Also align the description.
* For -t, describe that -t is ignored if a command is supplied on the command
line.
2. Print a warning if -t is present when a specific command is supplied.
3. Print all warnings and errors to stderr.
4. Remove -e from calls in redis-benchmark test suite.
We only run OOM related tests on x86_64 and aarch64, as jemalloc on other
platforms (notably s390x) may actually succeed very large allocations. As
a result the test may hang for a very long time at the cleanup phase,
iterating as many as 2^61 hash table slots.
Part one of implementing #8702 (taking hashes first before other types)
## Description of the feature
1. Change ziplist encoded hash objects to listpack encoding.
2. Convert existing ziplists on RDB loading time. an O(n) operation.
## Rdb format changes
1. Add RDB_TYPE_HASH_LISTPACK rdb type.
2. Bump RDB_VERSION to 10
## Interface changes
1. New `hash-max-listpack-entries` config is an alias for `hash-max-ziplist-entries` (same with `hash-max-listpack-value`)
2. OBJECT ENCODING will return `listpack` instead of `ziplist`
## Listpack improvements:
1. Support direct insert, replace integer element (rather than convert back and forth from string)
3. Add more listpack capabilities to match the ziplist ones (like `lpFind`, `lpRandomPairs` and such)
4. Optimize element length fetching, avoid multiple calculations
5. Use inline to avoid function call overhead.
## Tests
1. Add a new test to the RDB load time conversion
2. Adding the listpack unit tests. (based on the one in ziplist.c)
3. Add a few "corrupt payload: fuzzer findings" tests, and slightly modify existing ones.
Co-authored-by: Oran Agra <oran@redislabs.com>
This commit mainly fixes empty keys due to RDB loading and restore command,
which was omitted in #9297.
1) When loading quicklsit, if all the ziplists in the quicklist are empty, NULL will be returned.
If only some of the ziplists are empty, then we will skip the empty ziplists silently.
2) When loading hash zipmap, if zipmap is empty, sanitization check will fail.
3) When loading hash ziplist, if ziplist is empty, NULL will be returned.
4) Add RDB loading test with sanitize.
Replication client no longer checks incoming command length against the client-query-buffer-limit. This makes the master able to replicate commands longer than replica's configured client-query-buffer-limit
The execution of the RPOPLPUSH command by the fuzzer created junk keys,
that were later being selected by RANDOMKEY and modified.
This also meant that lists were statistically tested more than other
files.
Fix the fuzzer not to pass junk key names to RPOPLPUSH, and add a check
that detects that new keys are not added by the fuzzer to detect future
similar issues.
Recently we found two issues in the fuzzer tester: #9302#9285
After fixing them, more problems surfaced and this PR (as well as #9297) aims to fix them.
Here's a list of the fixes
- Prevent an overflow when allocating a dict hashtable
- Prevent OOM when attempting to allocate a huge string
- Prevent a few invalid accesses in listpack
- Improve sanitization of listpack first entry
- Validate integrity of stream consumer groups PEL
- Validate integrity of stream listpack entry IDs
- Validate ziplist tail followed by extra data which start with 0xff
Co-authored-by: sundb <sundbcn@gmail.com>
When we load rdb or restore command, if we encounter a length of 0, it will result in the creation of an empty key.
This could either be a corrupt payload, or a result of a bug (see #8453 )
This PR mainly fixes the following:
1) When restore command will return `Bad data format` error.
2) When loading RDB, we will silently discard the key.
Co-authored-by: Oran Agra <oran@redislabs.com>
The psync2 test has failed several times recently.
In #9159 we only solved half of the problem.
i.e. reordering of the replica that's already connected to
the newly promoted master.
Consider this scenario:
0 slaveof 2
1 slaveof 2
3 slaveof 2
4 slaveof 1
0 slaveof no one, became a new master got a new replid
2 slaveof 0, partial resync and got the new replid
3 reconnect 2, inherit the new replid
3 slaveof 4, use the new replid and got a full resync
And another scenario:
1 slaveof 3
2 slaveof 4
3 slaveof 0
4 slaveof 0
4 slaveof no one, became a new master got a new replid
2 reconnect 4, inherit the new replid
2 slaveof 1, use the new replid and got a full resync
So maybe we should reattach replicas in the right order.
i.e. In the above example, if it would have reattached 1, 3 and 0 to
the new chain formed by 4 before trying to attach 2 to 1, it would succeed.
This commit break the SLAVEOF loop into two loops. (ideas from oran)
First loop that uses random to decide who replicates from who.
Second loop that does the actual SLAVEOF command.
In the second loop, we make sure to execute it in the right order,
and after each SLAVEOF, wait for it to be connected before we proceed.
Co-authored-by: Oran Agra <oran@redislabs.com>
## Backgroud
As we know, after `fork`, one process will copy pages when writing data to these
pages(CoW), and another process still keep old pages, they totally cost more memory.
For redis, we suffered that redis consumed much memory when the fork child is serializing
key/values, even that maybe cause OOM.
But actually we find, in redis fork child process, the child process don't need to keep some
memory and parent process may write or update that, for example, child process will never
access the key-value that is serialized but users may update it in parent process.
So we think it may reduce COW if the child process release memory that it is not needed.
## Implementation
For releasing key value in child process, we may think we call `decrRefCount` to free memory,
but i find the fork child process still use much memory when we don't write any data to redis,
and it costs much more time that slows down bgsave. Maybe because memory allocator doesn't
really release memory to OS, and it may modify some inner data for this free operation, especially
when we free small objects.
Moreover, CoW is based on pages, so it is a easy way that we only free the memory bulk that is
not less than kernel page size. madvise(MADV_DONTNEED) can quickly release specified region
pages to OS bypassing memory allocator, and allocator still consider that this memory still is used
and don't change its inner data.
There are some buffers we can release in the fork child process:
- **Serialized key-values**
the fork child process never access serialized key-values, so we try to free them.
Because we only can release big bulk memory, and it is time consumed to iterate all
items/members/fields/entries of complex data type. So we decide to iterate them and
try to release them only when their average size of item/member/field/entry is more
than page size of OS.
- **Replication backlog**
Because replication backlog is a cycle buffer, it will be changed quickly if redis has heavy
write traffic, but in fork child process, we don't need to access that.
- **Client buffers**
If clients have requests during having the fork child process, clients' buffer also be changed
frequently. The memory includes client query buffer, output buffer, and client struct used memory.
To get child process peak private dirty memory, we need to count peak memory instead
of last used memory, because the child process may continue to release memory (since
COW used to only grow till now, the last was equivalent to the peak).
Also we're adding a new `current_cow_peak` info variable (to complement the existing
`current_cow_size`)
Co-authored-by: Oran Agra <oran@redislabs.com>
When redis-cli received ASK, it used string matching wrong and didn't
handle it.
When we access a slot which is in migrating state, it maybe
return ASK. After redirect to the new node, we need send ASKING
command before retry the command. In this PR after redis-cli receives
ASK, we send a ASKING command before send the origin command
after reconnecting.
Other changes:
* Make redis-cli -u and -c (unix socket and cluster mode) incompatible
with one another.
* When send command fails, we avoid the 2nd reconnect retry and just
print the error info. Users will decide how to do next.
See #9277.
* Add a test faking two redis nodes in TCL to just send ASK and OK in
redis protocol to test ASK behavior.
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Oran Agra <oran@redislabs.com>
In some cases large replies on slow systems may only be partially read
by the test suite, resulting with parsing errors.
This fix is still timing sensitive but should greatly reduce the chances
of this happening.
1. redis-cli can output --rdb data to stdout
but redis-cli also write some messages to stdout which will mess up the rdb.
2. Make redis-cli flush stdout when printing a reply
This was needed in order to fix a hung in redis-cli test that uses
--replica.
Note that printf does flush when there's a newline, but fwrite does not.
3. fix the redis-cli --replica test which used to pass previously
because it didn't really care what it read, and because redis-cli
used printf to print these other things to stdout.
4. improve redis-cli --replica test to run with both diskless and disk-based.
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Viktor Söderqvist <viktor@zuiderkwast.se>
*** [err]: PSYNC2: total sum of full synchronizations is exactly 4 intests/integration/psync2.tcl
Expected 5 == 4 (context: type eval line 8 cmd {assert {$sum == 4}} proc::test)
Sometime the test got an unexpected full sync since a replica switch to master,
before the new master change propagated the new replid to all replicas,
a replica attempted to sync with it using a wrong replid and triggered a full resync.
Consider this scenario:
1 slaveof 4 full resync
0 slaveof 4 full resync
2 slaveof 0 full resync
3 slaveof 1 full resync
1 slaveof no one, replid changed
3 reconnect 1, did a partial resyn and got the new replid
Before 2 inherits the new replid.
3 slaveof 2
3 try to do a partial resyn with 2.
But their replication ids are inconsistent, so a full resync happens.
:) A special thank you for oran and helping me in this test case.
Co-authored-by: Oran Agra <oran@redislabs.com>
# replication-3.tcl
had a test timeout failure with valgrind on daily CI:
```
*** [err]: SLAVE can reload "lua" AUX RDB fields of duplicated scripts in tests/integration/replication-3.tcl
Replication not started.
```
replication took more than 70 seconds.
https://github.com/redis/redis/runs/2854037905?check_suite_focus=true
on my machine it takes only about 30, but i can see how 50 seconds isn't enough.
# replication.tcl
loading was over too quickly in freebsd daily CI:
```
*** [err]: slave fails full sync and diskless load swapdb recovers it in tests/integration/replication.tcl
Expected '0' to be equal to '1' (context: type eval line 44 cmd {assert_equal [s -1 loading] 1} proc ::start_server)
```
# rdb.tcl
loading was over too quickly.
increase the time loading takes, and decrease the amount of work we try to achieve in that time.
Today when we load the AOF on startup, the loadAppendOnlyFile checks if
the file is openning for reading.
This check is redundent (dead code) as we open the AOF file for writing at initServer,
and the file will always be existing for the loadAppendOnlyFile.
In this commit:
- remove all the exit(1) from loadAppendOnlyFile, as it is the caller
responsibility to decide what to do in case of failure.
- move the opening of the AOF file for writing, to be after we loading it.
- avoid return -ERR in DEBUG LOADAOF, when the AOF is existing but empty
This PR adds a spell checker CI action that will fail future PRs if they introduce typos and spelling mistakes.
This spell checker is based on blacklist of common spelling mistakes, so it will not catch everything,
but at least it is also unlikely to cause false positives.
Besides that, the PR also fixes many spelling mistakes and types, not all are a result of the spell checker we use.
Here's a summary of other changes:
1. Scanned the entire source code and fixes all sorts of typos and spelling mistakes (including missing or extra spaces).
2. Outdated function / variable / argument names in comments
3. Fix outdated keyspace masks error log when we check `config.notify-keyspace-events` in loadServerConfigFromString.
4. Trim the white space at the end of line in `module.c`. Check: https://github.com/redis/redis/pull/7751
5. Some outdated https link URLs.
6. Fix some outdated comment. Such as:
- In README: about the rdb, we used to said create a `thread`, change to `process`
- dbRandomKey function coment (about the dictGetRandomKey, change to dictGetFairRandomKey)
- notifyKeyspaceEvent fucntion comment (add type arg)
- Some others minor fix in comment (Most of them are incorrectly quoted by variable names)
7. Modified the error log so that users can easily distinguish between TCP and TLS in `changeBindAddr`
This commit revives the improves the ability to run the test suite against
external servers, instead of launching and managing `redis-server` processes as
part of the test fixture.
This capability existed in the past, using the `--host` and `--port` options.
However, it was quite limited and mostly useful when running a specific tests.
Attempting to run larger chunks of the test suite experienced many issues:
* Many tests depend on being able to start and control `redis-server` themselves,
and there's no clear distinction between external server compatible and other
tests.
* Cluster mode is not supported (resulting with `CROSSSLOT` errors).
This PR cleans up many things and makes it possible to run the entire test suite
against an external server. It also provides more fine grained controls to
handle cases where the external server supports a subset of the Redis commands,
limited number of databases, cluster mode, etc.
The tests directory now contains a `README.md` file that describes how this
works.
This commit also includes additional cleanups and fixes:
* Tests can now be tagged.
* Tag-based selection is now unified across `start_server`, `tags` and `test`.
* More information is provided about skipped or ignored tests.
* Repeated patterns in tests have been extracted to common procedures, both at a
global level and on a per-test file basis.
* Cleaned up some cases where test setup was based on a previous test executing
(a major anti-pattern that repeats itself in many places).
* Cleaned up some cases where test teardown was not part of a test (in the
future we should have dedicated teardown code that executes even when tests
fail).
* Fixed some tests that were flaky running on external servers.
Till now, on replica full-sync we used to transfer absolute time for TTL,
however when a command arrived (EXPIRE or EXPIREAT),
we used to propagate it as is to replicas (possibly with relative time),
but always translate it to EXPIREAT (absolute time) to AOF.
This commit changes that and will always use absolute time for propagation.
see discussion in #8433
Furthermore, we Introduce new commands: `EXPIRETIME/PEXPIRETIME`
that allow extracting the absolute TTL time from a key.
In diskless replication, we create a read pipe for the RDB, between the child and the parent.
When we close this pipe (fd), the read handler also needs to be removed from the event loop (if it still registered).
Otherwise, next time we will use the same fd, the registration will be fail (panic), because
we will use EPOLL_CTL_MOD (the fd still register in the event loop), on fd that already removed from epoll_ctl
When test stop 'load handler' by killing the process that generating the load,
some commands that already in the input buffer, still might be processed by the server.
This may cause some instability in tests, that count on that no more commands
processed after we stop the `load handler'
In this commit, new proc 'wait_load_handlers_disconnected' added, to verify that no more
cammands from any 'load handler' prossesed, by checking that the clients who
genreate the load is disconnceted.
Also, replacing check of dbsize with wait_for_ofs_sync before comparing debug digest, as
it would fail in case the last key the workload wrote was an overridden key (not a new one).
Affected tests
Race fix:
- failover command to specific replica works
- Connect multiple replicas at the same time (issue #141), master diskless=$mdl, replica diskless=$sdl
- AOF rewrite during write load: RDB preamble=$rdbpre
Cleanup and speedup:
- Test replication with blocking lists and sorted sets operations
- Test replication with parallel clients writing in different DBs
- Test replication partial resync: $descr (diskless: $mdl, $sdl, reconnect: $reconnect
In github actions CI with valgrind, i saw that even the fast replica
(one that wasn't paused), didn't get to complete the replication fast
enough, and ended up getting disconnected by timeout.
Additionally, due to a typo in uname, we didn't get to actually run the
CPU efficiency part of the test.
Starting redis 6.0 (part of the TLS feature), diskless master uses pipe from the fork
child so that the parent is the one sending data to the replicas.
This mechanism has an issue in which a hung replica will cause the master to wait
for it to read the data sent to it forever, thus preventing the fork child from terminating
and preventing the creations of any other forks.
This PR adds a timeout mechanism, much like the ACK-based timeout,
we disconnect replicas that aren't reading the RDB file fast enough.
Another test race condition in the macos tests.
the test was waiting for PINGs to be generated and put on the replication stream,
but waiting for 1 or 2 seconds doesn't really guarantee that.
then the test that expected 6 full syncs, found only 4
the corrupt-dump-fuzzer test found a case where an access to a corrupt
stream would have caused accessing to uninitialized memory.
now it'll panic instead.
The issue was that there was a stream that says it has more than 0
records, but looking for the max ID came back empty handed.
p.s. when sanitize-dump-payload is used, this corruption is detected,
and the RESTORE command is gracefully rejected.
Since redis 6.2, redis immediately tries to connect to the master, not
waiting for replication cron.
in the slow freebsd CI, this test failed and master_link_status was
already "up" when INFO was called.
* The `redis-cli --scan` output should honor output mode (set explicitly or implicitly), and quote key names when not in raw mode.
* Technically this is a breaking change, but it should be very minor since raw mode is by default on for non-tty output.
* It should only affect TTY output (human users) or non-tty output if `--no-raw` is specified.
* Added `--quoted-input` option to treat all arguments as potentially quoted strings.
* Added `--quoted-pattern` option to accept a potentially quoted pattern.
Unquoting is applied to potentially quoted input only if single or double quotes are used.
Fixes#8561, #8563
When sanitizing the stream listpack, we need to count the deleted records too.
otherwise the last line that checks the next pointer fails.
Add test to cover that state in the stream tests.
* Remove linux/version.h dependency.
This introduces unnecessary dependencies, and generally not a good idea
as the platform we build on may be different than the platform we run
on.
To determine if sync_file_range exists we can simply rely on header file
hints.
* Fix setproctitle() on libmusl.
The previous ifdef checks were a bit too strict for no apparent
reason.
* Fix tests failure on Linux with no backtrace.
* Add alpine daily CI job.