some tests use attach_to_replication_stream to watch what's propagated
to replicas, but in some cases the periodic ping may slip in and fail
the test.
we disable that ping by setting the period to once an hour (tests should
not run for that long).
other change is so that the next time this oom-score-adj test fails,
we'll see the value (assert_equals prints it)
This Commit pushes forward the observability on overall error statistics and command statistics within redis-server:
It extends INFO COMMANDSTATS to have
- failed_calls in - so we can keep track of errors that happen from the command itself, broken by command.
- rejected_calls - so we can keep track of errors that were triggered outside the commmand processing per se
Adds a new section to INFO, named ERRORSTATS that enables keeping track of the different errors that
occur within redis ( within processCommand and call ) based on the reply Error Prefix ( The first word
after the "-", up to the first space ).
This commit also fixes RM_ReplyWithError so that it can be correctly identified as an error reply.
This adds a new `tls-client-cert-file` and `tls-client-key-file`
configuration directives which make it possible to use different
certificates for the TLS-server and TLS-client functions of Redis.
This is an optional directive. If it is not specified the `tls-cert-file`
and `tls-key-file` directives are used for TLS client functions as well.
Also, `utils/gen-test-certs.sh` now creates additional server-only and client-only certs and will skip intensive operations if target files already exist.
when using --baseport to run two tests suite in parallel (different
folders), we need to also make sure the port used by the testsuite to
communicate with it's workers is unique. otherwise the attept to find a
free port connects to the other test suite and messes it.
maybe one day we need to attempt to bind, instead of connect when tring
to find a free port.
The test creates keys with various encodings, DUMP them, corrupt the payload
and RESTORES it.
It utilizes the recently added use-exit-on-panic config to distinguish between
asserts and segfaults.
If the restore succeeds, it runs random commands on the key to attempt to
trigger a crash.
It runs in two modes, one with deep sanitation enabled and one without.
In the first one we don't expect any assertions or segfaults, in the second one
we expect assertions, but no segfaults.
We also check for leaks and invalid reads using valgrind, and if we find them
we print the commands that lead to that issue.
Changes in the code (other than the test):
- Replace a few NPD (null pointer deference) flows and division by zero with an
assertion, so that it doesn't fail the test. (since we set the server to use
`exit` rather than `abort` on assertion).
- Fix quite a lot of flows in rdb.c that could have lead to memory leaks in
RESTORE command (since it now responds with an error rather than panic)
- Add a DEBUG flag for SET-SKIP-CHECKSUM-VALIDATION so that the test don't need
to bother with faking a valid checksum
- Remove a pile of code in serverLogObjectDebugInfo which is actually unsafe to
run in the crash report (see comments in the code)
- fix a missing boundary check in lzf_decompress
test suite infra improvements:
- be able to run valgrind checks before the process terminates
- rotate log files when restarting servers
When loading an encoded payload we will at least do a shallow validation to
check that the size that's encoded in the payload matches the size of the
allocation.
This let's us later use this encoded size to make sure the various offsets
inside encoded payload don't reach outside the allocation, if they do, we'll
assert/panic, but at least we won't segfault or smear memory.
We can also do 'deep' validation which runs on all the records of the encoded
payload and validates that they don't contain invalid offsets. This lets us
detect corruptions early and reject a RESTORE command rather than accepting
it and asserting (crashing) later when accessing that payload via some command.
configuration:
- adding ACL flag skip-sanitize-payload
- adding config sanitize-dump-payload [yes/no/clients]
For now, we don't have a good way to ensure MIGRATE in cluster resharding isn't
being slowed down by these sanitation, so i'm setting the default value to `no`,
but later on it should be set to `clients` by default.
changes:
- changing rdbReportError not to `exit` in RESTORE command
- adding a new stat to be able to later check if cluster MIGRATE isn't being
slowed down by sanitation.
- add test suite coverage for redis-benchmark
- add --version (similar to what redis-cli has)
- fix bug sending more requests than intended when pipeline > 1.
- when done sending requests, avoid freeing client in the write handler, in theory before
responses are received (probably dead code since the read handler will call clientDone first)
Co-authored-by: Oran Agra <oran@redislabs.com>
We're already using bg_unlink in several places to delete the rdb file in the background,
and avoid paying the cost of the deletion from our main thread.
This commit uses bg_unlink to remove the temporary rdb file in the background too.
However, in case we delete that rdb file just before exiting, we don't actually wait for the
background thread or the main thread to delete it, and just let the OS clean up after us.
i.e. we open the file, unlink it and exit with the fd still open.
Furthermore, rdbRemoveTempFile can be called from a thread and was using snprintf which is
not async-signal-safe, we now use ll2string instead.
in some cases a command that returns an error possibly due to a timing
issue causes the tcl code to crash and thus prevents the rest of the
tests from running. this adds an option to make the test proceed despite
the crash.
maybe it should be the default mode some day.
- skip full units
- skip a single test (not just a list of tests)
- when skipping tag, skip spinning up servers, not just the tests
- skip tags when running against an external server too
- allow using multiple tags (split them)
Add Linux kernel OOM killer control option.
This adds the ability to control the Linux OOM killer oom_score_adj
parameter for all Redis processes, depending on the process role (i.e.
master, replica, background child).
A oom-score-adj global boolean flag control this feature. In addition,
specific values can be configured using oom-score-adj-values if
additional tuning is required.
in the majority of the cases (on this rarely used feature) we want to
stop and be able to connect to the shard with redis-cli.
since these are two different processes interracting with the tty we
need to stop both, and we'll have to hit enter twice, but it's not that
bad considering it is rarely used.
* Tests: fix and reintroduce redis-cli tests.
These tests have been broken and disabled for 10 years now!
* TLS: add remaining redis-cli support.
This adds support for the redis-cli --pipe, --rdb and --replica options
previously unsupported in --tls mode.
* Fix writeConn().
In order to support the use of multi-exec in pipeline, it is important that
MULTI and EXEC are never rejected and it is easy for the client to know if the
connection is still in multi state.
It was easy to make sure MULTI and DISCARD never fail (done by previous
commits) since these only change the client state and don't do any actual
change in the server, but EXEC is a different story.
Since in the past, it was possible for clients to handle some EXEC errors and
retry the EXEC, we now can't affort to return any error on EXEC other than
EXECABORT, which now carries with it the real reason for the abort too.
Other fixes in this commit:
- Some checks that where performed at the time of queuing need to be re-
validated when EXEC runs, for instance if the transaction contains writes
commands, it needs to be aborted. there was one check that was already done
in execCommand (-READONLY), but other checks where missing: -OOM, -MISCONF,
-NOREPLICAS, -MASTERDOWN
- When a command is rejected by processCommand it was rejected with addReply,
which was not recognized as an error in case the bad command came from the
master. this will enable to count or MONITOR these errors in the future.
- make it easier for tests to create additional (non deferred) clients.
- add tests for the fixes of this commit.
apparently when running tests in parallel (the default of --clients 16),
there's a chance for two tests to use the same port.
specifically, one test might shutdown a master and still have the
replica up, and then another test will re-use the port number of master
for another master, and then that replica will connect to the master of
the other test.
this can cause a master to count too many full syncs and fail a test if
we run the tests with --single integration/psync2 --loop --stop
see Probmem 2 in #7314
this test is time sensitive and it sometimes fail to pass below the
latency threshold, even on strong machines.
this test was the reson we're running just 2 parallel tests in the
github actions CI, revering this.
* Introduce a connection abstraction layer for all socket operations and
integrate it across the code base.
* Provide an optional TLS connections implementation based on OpenSSL.
* Pull a newer version of hiredis with TLS support.
* Tests, redis-cli updates for TLS support.
* allowing --single to be repeated
* adding --only so that only a specific test inside a unit can be run
* adding --skiptill useful to resume a test that crashed passed the problematic unit.
useful together with --clients 1
* adding --skipfile to use a file containing list of tests names to skip
* printing the names of the tests that are skiped by skipfile or denytags
* adding --config to add config file options from command line
* fail the test (exit code) in case of timeout.
* add --wait-server to allow attaching a debugger
* add --dont-clean to keep log files when tests are done
other fixes / improvements:
- LUA script memory isn't taken from zmalloc (taken from libc malloc)
so it can cause high fragmentation ratio to be displayed (which is false)
- there was a problem with "fragmentation" info being calculated from
RSS and used_memory sampled at different times (now sampling them together)
other details:
- adding a few more allocator info fields to INFO and MEMORY commands
- improve defrag test to measure defrag latency of big keys
- increasing the accuracy of the defrag test (by looking at real grag info)
this way we can use an even lower threshold and still avoid false positives
- keep the old (total) "fragmentation" field unchanged, but add new ones for spcific things
- add these the MEMORY DOCTOR command
- deduct LUA memory from the rss in case of non jemalloc allocator (one for which we don't "allocator active/used")
- reduce sampling rate of the rss and allocator info
Experimentally verified that it can trigger the issue reverting the fix.
At least on my system... Being the bug time/backlog dependant, it is
very hard to tell if this test will be able to trigger the problem
consistently, however even if it triggers the problem once in a while,
we'll see it in the CI environment at http://ci.redis.io.
Some inline test moved into server_is_up procedure.
Also find_available_port was moved into util since it is going
to be used for the Sentinel test as well.
The Redis test uses a server-clients model in order to parallelize the
execution of different tests. However in recent versions of osx not
setting the channel to a binary encoding caused issues even if AFAIK no
binary data is really sent via this channel. However now the channels
are deliberately set to a binary encoding and this solves the issue.
The exact issue was the test not terminating and giving the impression
of running forever, since test clients or servers were unable to
exchange the messages to continue.
The Redis implementation is tested against Tcl implementations of the
same operation. Both fuzzing and testing of specific aspects of the
commands behavior are performed.