Apparently the "leaks" took reports a different error string about process
that's not found in each version of MacOS.
This cause the test suite to fail on some OS versions, since some tests terminate
the process before looking for leaks.
Instead of looking at the error string, we now look at the (documented) exit code.
Add a new set of defrag functions that take a defrag context and allow
defragmenting memory blocks and RedisModuleStrings.
Modules can register a defrag callback which will be invoked when the
defrag process handles globals.
Modules with custom data types can also register a datatype-specific
defrag callback which is invoked for keys that require defragmentation.
The callback and associated functions support both one-step and
multi-step options, depending on the complexity of the key as exposed by
the free_effort callback.
This adds a new `tls-client-cert-file` and `tls-client-key-file`
configuration directives which make it possible to use different
certificates for the TLS-server and TLS-client functions of Redis.
This is an optional directive. If it is not specified the `tls-cert-file`
and `tls-key-file` directives are used for TLS client functions as well.
Also, `utils/gen-test-certs.sh` now creates additional server-only and client-only certs and will skip intensive operations if target files already exist.
when using --baseport to run two tests suite in parallel (different
folders), we need to also make sure the port used by the testsuite to
communicate with it's workers is unique. otherwise the attept to find a
free port connects to the other test suite and messes it.
maybe one day we need to attempt to bind, instead of connect when tring
to find a free port.
The test creates keys with various encodings, DUMP them, corrupt the payload
and RESTORES it.
It utilizes the recently added use-exit-on-panic config to distinguish between
asserts and segfaults.
If the restore succeeds, it runs random commands on the key to attempt to
trigger a crash.
It runs in two modes, one with deep sanitation enabled and one without.
In the first one we don't expect any assertions or segfaults, in the second one
we expect assertions, but no segfaults.
We also check for leaks and invalid reads using valgrind, and if we find them
we print the commands that lead to that issue.
Changes in the code (other than the test):
- Replace a few NPD (null pointer deference) flows and division by zero with an
assertion, so that it doesn't fail the test. (since we set the server to use
`exit` rather than `abort` on assertion).
- Fix quite a lot of flows in rdb.c that could have lead to memory leaks in
RESTORE command (since it now responds with an error rather than panic)
- Add a DEBUG flag for SET-SKIP-CHECKSUM-VALIDATION so that the test don't need
to bother with faking a valid checksum
- Remove a pile of code in serverLogObjectDebugInfo which is actually unsafe to
run in the crash report (see comments in the code)
- fix a missing boundary check in lzf_decompress
test suite infra improvements:
- be able to run valgrind checks before the process terminates
- rotate log files when restarting servers
When loading an encoded payload we will at least do a shallow validation to
check that the size that's encoded in the payload matches the size of the
allocation.
This let's us later use this encoded size to make sure the various offsets
inside encoded payload don't reach outside the allocation, if they do, we'll
assert/panic, but at least we won't segfault or smear memory.
We can also do 'deep' validation which runs on all the records of the encoded
payload and validates that they don't contain invalid offsets. This lets us
detect corruptions early and reject a RESTORE command rather than accepting
it and asserting (crashing) later when accessing that payload via some command.
configuration:
- adding ACL flag skip-sanitize-payload
- adding config sanitize-dump-payload [yes/no/clients]
For now, we don't have a good way to ensure MIGRATE in cluster resharding isn't
being slowed down by these sanitation, so i'm setting the default value to `no`,
but later on it should be set to `clients` by default.
changes:
- changing rdbReportError not to `exit` in RESTORE command
- adding a new stat to be able to later check if cluster MIGRATE isn't being
slowed down by sanitation.
Test support for the new map, null and push message types. Map objects are parsed as a list of lists of key value pairs.
for instance: user => john password => 123
will be parsed to the following TCL list:
{{user john} {password 123}}
Also added the following tests:
Redirection still works with RESP3
Able to use a RESP3 client as a redirection client
No duplicate invalidation messages when turning BCAST mode on after normal tracking
Server is able to evacuate enough keys when num of keys surpasses limit by more than defined initial effort
Different clients using different protocols can track the same key
OPTOUT tests
OPTIN tests
Clients can redirect to the same connection
tracking-redir-broken test
HELLO 3 checks
Invalidation messages still work when using RESP3, with and without redirection
Switching to RESP3 doesn't disturb previous tracked keys
Tracking info is correct
Flushall and flushdb produce invalidation messages
These tests achieve 100% line coverage for tracking.c using lcov.
The tests sometimes fail to find a log message.
Recently i added a print that shows the log files that are searched
and it shows that the message was in deed there.
The only reason i can't think of for this seach to fail, is we we
happened to read an incomplete line, which didn't match our pattern and
then on the next iteration we would continue reading from the line after
it.
The fix is to always re-evaluation the previous line.
- add test suite coverage for redis-benchmark
- add --version (similar to what redis-cli has)
- fix bug sending more requests than intended when pipeline > 1.
- when done sending requests, avoid freeing client in the write handler, in theory before
responses are received (probably dead code since the read handler will call clientDone first)
Co-authored-by: Oran Agra <oran@redislabs.com>
Adding [B]LMOVE <src> <dst> RIGHT|LEFT RIGHT|LEFT. deprecating [B]RPOPLPUSH.
Note that when receiving a BRPOPLPUSH we'll still propagate an RPOPLPUSH,
but on BLMOVE RIGHT LEFT we'll propagate an LMOVE
improvement to existing tests
- Replace "after 1000" with "wait_for_condition" when wait for
clients to block/unblock.
- Add a pre-existing element to target list on basic tests so
that we can check if the new element was added to the correct
side of the list.
- check command stats on the replica to make sure the right
command was replicated
Co-authored-by: Oran Agra <oran@redislabs.com>
if there are nested tests and nested servers, we need to restore the
previous value of cur_test when a test exist.
example:
```
test{test 1} {
start_server {
test{test 1.1 - master only} {
}
start_server {
test{test 1.2 - with replication} {
}
}
}
}
```
when `test 1.1 - master only exists`, we're still inside `test 1`
1) cur_test: when restart_server, "no such variable" error occurs
./runtest --single integration/rdb
test {client freed during loading}
SET ::cur_test
restart_server
kill_server
test "Check for memory leaks (pid $pid)"
SET ::cur_test
UNSET ::cur_test
UNSET ::cur_test // This global variable has been unset.
2) `ps --ppid` not available on macOS platform, can be replaced with
`pgrep -P pid`.
There is an inherent race condition in port allocation for spawned
servers. If a server fails to start because a port is taken, a new port
is allocated. This fixes a problem where the logs are not truncated and
as a result a large number of unmonitored servers are started.
Starting redis 6.0 and the changes we made to the diskless master to be
suitable for TLS, I made the master avoid reaping (wait3) the pid of the
child until we know all replicas are done reading their rdb.
I did that in order to avoid a state where the rdb_child_pid is -1 but
we don't yet want to start another fork (still busy serving that data to
replicas).
It turns out that the solution used so far was problematic in case the
fork child was being killed (e.g. by the kernel OOM killer), in that
case there's a chance that we currently disabled the read event on the
rdb pipe, since we're waiting for a replica to become writable again.
and in that scenario the master would have never realized the child
exited, and the replica will remain hung too.
Note that there's no mechanism to detect a hung replica while it's in
rdb transfer state.
The solution here is to add another pipe which is used by the parent to
tell the child it is safe to exit. this mean that when the child exits,
for whatever reason, it is safe to reap it.
Besides that, i'm re-introducing an adjustment to REPLCONF ACK which was
part of #6271 (Accelerate diskless master connections) but was dropped
when that PR was rebased after the TLS fork/pipe changes (5a47794).
Now that RdbPipeCleanup no longer calls checkChildrenDone, and the ACK
has chance to detect that the child exited, it should be the one to call
it so that we don't have to wait for cron (server.hz) to do that.
- redirect valgrind reports to a dedicated file rather than console
- try to avoid killing instances with SIGKILL so that we get the memory
leak report (killing with SIGTERM before resorting to SIGKILL)
- search for valgrind reports when done, print them and fail the tests
- add --dont-clean option to keep the logs on exit
- fix exit error code when crash is found (would have exited with 0)
changes that affect the normal redis test suite:
- refactor check_valgrind_errors into two functions one to search and
one to report
- move the search half into util.tcl to serve the cluster tests too
- ignore "address range perms" valgrind warnings which seem non relevant.
in some cases a command that returns an error possibly due to a timing
issue causes the tcl code to crash and thus prevents the rest of the
tests from running. this adds an option to make the test proceed despite
the crash.
maybe it should be the default mode some day.
- skip full units
- skip a single test (not just a list of tests)
- when skipping tag, skip spinning up servers, not just the tests
- skip tags when running against an external server too
- allow using multiple tags (split them)
During long running scripts or loading RDB/AOF, we may need to do some
defragging. Since processEventsWhileBlocked is called periodically at
unknown intervals, and many cron jobs either depend on run_with_period
(including active defrag), or rely on being called at server.hz rate
(i.e. active defrag knows ho much time to run by looking at server.hz),
the whileBlockedCron may have to run a loop triggering the cron jobs in it
(currently only active defrag) several times.
Other changes:
- Adding a test for defrag during aof loading.
- Changing key-load-delay config to take negative values for fractions
of a microsecond sleep
This is a rebased version of #3078 originally by shaharmor
with the following patches by TysonAndre made after rebasing
to work with the updated C API:
1. Add 2 more unit tests
(wrong argument count error message, integer over 64 bits)
2. Use addReplyArrayLen instead of addReplyMultiBulkLen.
3. Undo changes to src/help.h - for the ZMSCORE PR,
I heard those should instead be automatically
generated from the redis-doc repo if it gets updated
Motivations:
- Example use case: Client code to efficiently check if each element of a set
of 1000 items is a member of a set of 10 million items.
(Similar to reasons for working on #7593)
- HMGET and ZMSCORE already exist. This may lead to developers deciding
to implement functionality that's best suited to a regular set with a
data type of sorted set or hash map instead, for the multi-get support.
Currently, multi commands or lua scripting to call sismember multiple times
would almost definitely be less efficient than a native smismember
for the following reasons:
- Need to fetch the set from the string every time
instead of reusing the C pointer.
- Using pipelining or multi-commands would result in more bytes sent
and received by the client for the repeated SISMEMBER KEY sections.
- Need to specially encode the data and decode it from the client
for lua-based solutions.
- Proposed solutions using Lua or SADD/SDIFF could trigger writes to
memory, which is undesirable on a redis replica server
or when commands get replicated to replicas.
Co-Authored-By: Shahar Mor <shahar@peer5.com>
Co-Authored-By: Tyson Andre <tysonandre775@hotmail.com>
Syntax: `ZMSCORE KEY MEMBER [MEMBER ...]`
This is an extension of #2359
amended by Tyson Andre to work with the changed unstable API,
add more tests, and consistently return an array.
- It seemed as if it would be more likely to get reviewed
after updating the implementation.
Currently, multi commands or lua scripting to call zscore multiple times
would almost definitely be less efficient than a native ZMSCORE
for the following reasons:
- Need to fetch the set from the string every time instead of reusing the C
pointer.
- Using pipelining or multi-commands would result in more bytes sent by
the client for the repeated `ZMSCORE KEY` sections.
- Need to specially encode the data and decode it from the client
for lua-based solutions.
- The fastest solution I've seen for large sets(thousands or millions)
involves lua and a variadic ZADD, then a ZINTERSECT, then a ZRANGE 0 -1,
then UNLINK of a temporary set (or lua). This is still inefficient.
Co-authored-by: Tyson Andre <tysonandre775@hotmail.com>
- the test now waits for specific set of log messages rather than wait for
timeout looking for just one message.
- we don't wanna sample the current length of the log after an action, due
to a race, we need to start the search from the line number of the last
message we where waiting for.
- when attempting to trigger a full sync, use multi-exec to avoid a race
where the replica manages to re-connect before we completed the set of
actions that should force a full sync.
- fix verify_log_message which was broken and unused
in cases where you have
test name {
start_server {
start_server {
assert
}
}
}
the exception will be thrown to the test proc, and the servers are
supposed to be killed on the way out. but it seems there was always a
bug of not cleaning the server stack, and recently (#7404) we started
relying on that stack in order to kill them, so with that bug sometimes
we would have tried to kill the same server twice, and leave one alive.
luckly, in most cases the pattern is:
start_server {
test name {
}
}
in the majority of the cases (on this rarely used feature) we want to
stop and be able to connect to the shard with redis-cli.
since these are two different processes interracting with the tty we
need to stop both, and we'll have to hit enter twice, but it's not that
bad considering it is rarely used.
tests were sensitive to additional log lines appearing in the log
causing the search to come empty handed.
instead of just looking for the n last log lines, capture the log lines
before performing the action, and then search from that offset.
* tests/valgrind: don't use debug restart
DEBUG REATART causes two issues:
1. it uses execve which replaces the original process and valgrind doesn't
have a chance to check for errors, so leaks go unreported.
2. valgrind report invalid calls to close() which we're unable to resolve.
So now the tests use restart_server mechanism in the tests, that terminates
the old server and starts a new one, new PID, but same stdout, stderr.
since the stderr can contain two or more valgrind report, it is not enough
to just check for the absence of leaks, we also need to check for some known
errors, we do both, and fail if we either find an error, or can't find a
report saying there are no leaks.
other changes:
- when killing a server that was already terminated we check for leaks too.
- adding DEBUG LEAK which was used to test it.
- adding --trace-children to valgrind, although no longer needed.
- since the stdout contains two or more runs, we need slightly different way
of checking if the new process is up (explicitly looking for the new PID)
- move the code that handles --wait-server to happen earlier (before
watching the startup message in the log), and serve the restarted server too.
* squashme - CR fixes
i.e. don't start the search from scratch hitting the used ones again.
this will also reduce the likelihood of collisions (if there are any
left) by increasing the time until we re-use a port we did use in the
past.
apparently when running tests in parallel (the default of --clients 16),
there's a chance for two tests to use the same port.
specifically, one test might shutdown a master and still have the
replica up, and then another test will re-use the port number of master
for another master, and then that replica will connect to the master of
the other test.
this can cause a master to count too many full syncs and fail a test if
we run the tests with --single integration/psync2 --loop --stop
see Probmem 2 in #7314
sometimes we have several assertions with the same condition in the same test
at different stages, and when these fail (the ones that print the condition
text) you don't know which one it was. other assertions didn't print the
condition text (variable names), just the expected and unexpected values.
So now, all assertions print context line, and conditin text.
besides, one of the major differences between 'assert' and 'assert_equal',
is that the later is able to print the value that doesn't match the expected.
if there is a rare non-reproducible failure, it is helpful to know what was
the value the test encountered and how far it was from the threshold.
So now, adding assert_lessthan and assert_range that can be used in some places.
were we used just 'assert { a > b }' so far.
* Introduce a connection abstraction layer for all socket operations and
integrate it across the code base.
* Provide an optional TLS connections implementation based on OpenSSL.
* Pull a newer version of hiredis with TLS support.
* Tests, redis-cli updates for TLS support.
now that replica can read rdb directly from the socket, it should avoid exiting
on short read and instead try to re-sync.
this commit tries to have minimal effects on non-diskless rdb reading.
and includes a test that tries to trigger this scenario on various read cases.
The implementation of the diskless replication was currently diskless only on the master side.
The slave side was still storing the received rdb file to the disk before loading it back in and parsing it.
This commit adds two modes to load rdb directly from socket:
1) when-empty
2) using "swapdb"
the third mode of using diskless slave by flushdb is risky and currently not included.
other changes:
--------------
distinguish between aof configuration and state so that we can re-enable aof only when sync eventually
succeeds (and not when exiting from readSyncBulkPayload after a failed attempt)
also a CONFIG GET and INFO during rdb loading would have lied
When loading rdb from the network, don't kill the server on short read (that can be a network error)
Fix rdb check when performed on preamble AOF
tests:
run replication tests for diskless slave too
make replication test a bit more aggressive
Add test for diskless load swapdb
* allowing --single to be repeated
* adding --only so that only a specific test inside a unit can be run
* adding --skiptill useful to resume a test that crashed passed the problematic unit.
useful together with --clients 1
* adding --skipfile to use a file containing list of tests names to skip
* printing the names of the tests that are skiped by skipfile or denytags
* adding --config to add config file options from command line
* fail the test (exit code) in case of timeout.
* add --wait-server to allow attaching a debugger
* add --dont-clean to keep log files when tests are done
This replaces individual ziplist vs. linkedlist representations
for Redis list operations.
Big thanks for all the reviews and feedback from everybody in
https://github.com/antirez/redis/pull/2143
start_server now uses return value from Tcl exec to get the server pid,
however this introduces errors that depend from timing: a lot of the
testing code base assumed the server to be actually up and running when
server_start returns.
So the old code that waits to see the pid in the log file was restored.
Previously the PID format was:
[PID] Timestamp
But it recently changed to:
PID:X Timestamp
The tcl testing framework was grabbing the PID from \[\d+\], but
that's not valid anymore.
Now we grab the pid from "PID: <PID>" in the part of Redis startup
output to the right of the ASCII logo.
Better handling of connection errors in order to update the table and
recovery, populate the startup nodes table after fetching the list of
nodes.
More work to do about it, it is still not as reliable as
redis-rb-cluster implementation which is the minimal reference
implementation for Redis Cluster clients.
Sometimes the process is still there but no longer in a state that can
be checked (after being killed). This used to happen after a call to
SHUTDOWN NOSAVE in the scripting unit, causing a false positive.
It is now possible to kill and restart sentinel or redis instances for
more real-world testing.
The 01 unit tests the capability of Sentinel to update the configuration
of Sentinels rejoining the cluster, however the test is pretty trivial
and more tests should be added.
Some inline test moved into server_is_up procedure.
Also find_available_port was moved into util since it is going
to be used for the Sentinel test as well.
Due to changes in recent releases of osx leaks utility, the osx leak
detection no longer worked. Now it is fixed in a way that should be
backward compatible.
A new stress test was added to stress test the code converting a ziplist
into an hash table.
In this commit also randomValue helper function was modified to also
return negative values.
wait_for_condition is now used instead of the usual "after 1000" (that
is the way to sleep in Tcl). This should avoid to find the replica in
a state where it is loading the RDB in memory, returning -LOADING error.
This test used to fail when running the test over valgrind, due to the
added latencies.
Due to a change in the format of the bug report in case of crash of
failed assertion the test suite was no longer able to properly log it.
Instead just a protocol error was logged by the Redis TCL client that
provided no clue about the actual problem.
This commit resolves the issue by logging everything from the first line
of the log including the string REDIS BUG REPORT, till the end of the
file.
Now it uses the new wait_for_condition testing primitive.
Also wait_for_condition implementation was fixed in this commit to properly
escape the expr command and its argument.
A new primitive wait_for_condition was introduced in the scripting
engine that makes waiting for events simpler, so that it is simpler to
write tests that are more resistant to timing issues.
networking related stuff moved into networking.c
moved more code
more work on layout of source code
SDS instantaneuos memory saving. By Pieter and Salvatore at VMware ;)
cleanly compiling again after the first split, now splitting it in more C files
moving more things around... work in progress
split replication code
splitting more
Sets split
Hash split
replication split
even more splitting
more splitting
minor change