Commit Graph

100 Commits

Author SHA1 Message Date
Allen Farris
0d18a1e85f
implement FAILOVER command (#8315)
Implement FAILOVER command, which coordinates failover
between the server and one of its replicas.
2021-01-28 13:18:05 -08:00
filipe oliveira
90b9f08e5d
Add errorstats info section, Add failed_calls and rejected_calls to commandstats (#8217)
This Commit pushes forward the observability on overall error statistics and command statistics within redis-server:

It extends INFO COMMANDSTATS to have
- failed_calls in - so we can keep track of errors that happen from the command itself, broken by command.
- rejected_calls - so we can keep track of errors that were triggered outside the commmand processing per se

Adds a new section to INFO, named ERRORSTATS that enables keeping track of the different errors that
occur within redis ( within processCommand and call ) based on the reply Error Prefix ( The first word
after the "-", up to the first space ).

This commit also fixes RM_ReplyWithError so that it can be correctly identified as an error reply.
2020-12-31 16:53:43 +02:00
Yossi Gottlieb
63c1303cfb
Modules: add defrag API support. (#8149)
Add a new set of defrag functions that take a defrag context and allow
defragmenting memory blocks and RedisModuleStrings.

Modules can register a defrag callback which will be invoked when the
defrag process handles globals.

Modules with custom data types can also register a datatype-specific
defrag callback which is invoked for keys that require defragmentation.
The callback and associated functions support both one-step and
multi-step options, depending on the complexity of the key as exposed by
the free_effort callback.
2020-12-13 09:56:01 +02:00
Yossi Gottlieb
00db1b5579
Fix failing macOS tests due to wc differences. (#8161) 2020-12-08 16:22:16 +02:00
Oran Agra
c31055db61 Sanitize dump payload: fuzz tester and fixes for segfaults and leaks it exposed
The test creates keys with various encodings, DUMP them, corrupt the payload
and RESTORES it.
It utilizes the recently added use-exit-on-panic config to distinguish between
 asserts and segfaults.
If the restore succeeds, it runs random commands on the key to attempt to
trigger a crash.

It runs in two modes, one with deep sanitation enabled and one without.
In the first one we don't expect any assertions or segfaults, in the second one
we expect assertions, but no segfaults.
We also check for leaks and invalid reads using valgrind, and if we find them
we print the commands that lead to that issue.

Changes in the code (other than the test):
- Replace a few NPD (null pointer deference) flows and division by zero with an
  assertion, so that it doesn't fail the test. (since we set the server to use
  `exit` rather than `abort` on assertion).
- Fix quite a lot of flows in rdb.c that could have lead to memory leaks in
  RESTORE command (since it now responds with an error rather than panic)
- Add a DEBUG flag for SET-SKIP-CHECKSUM-VALIDATION so that the test don't need
  to bother with faking a valid checksum
- Remove a pile of code in serverLogObjectDebugInfo which is actually unsafe to
  run in the crash report (see comments in the code)
- fix a missing boundary check in lzf_decompress

test suite infra improvements:
- be able to run valgrind checks before the process terminates
- rotate log files when restarting servers
2020-12-06 14:54:34 +02:00
Oran Agra
ca1c182567 Sanitize dump payload: ziplist, listpack, zipmap, intset, stream
When loading an encoded payload we will at least do a shallow validation to
check that the size that's encoded in the payload matches the size of the
allocation.
This let's us later use this encoded size to make sure the various offsets
inside encoded payload don't reach outside the allocation, if they do, we'll
assert/panic, but at least we won't segfault or smear memory.

We can also do 'deep' validation which runs on all the records of the encoded
payload and validates that they don't contain invalid offsets. This lets us
detect corruptions early and reject a RESTORE command rather than accepting
it and asserting (crashing) later when accessing that payload via some command.

configuration:
- adding ACL flag skip-sanitize-payload
- adding config sanitize-dump-payload [yes/no/clients]

For now, we don't have a good way to ensure MIGRATE in cluster resharding isn't
being slowed down by these sanitation, so i'm setting the default value to `no`,
but later on it should be set to `clients` by default.

changes:
- changing rdbReportError not to `exit` in RESTORE command
- adding a new stat to be able to later check if cluster MIGRATE isn't being
  slowed down by sanitation.
2020-12-06 14:54:34 +02:00
Oran Agra
4e2e5be201
Attempt to fix sporadic test failures due to wait_for_log_messages (#7955)
The tests sometimes fail to find a log message.
Recently i added a print that shows the log files that are searched
and it shows that the message was in deed there.
The only reason i can't think of for this seach to fail, is we we
happened to read an incomplete line, which didn't match our pattern and
then on the next iteration we would continue reading from the line after
it.

The fix is to always re-evaluation the previous line.
2020-10-26 11:55:24 +02:00
Oran Agra
c96ece9f5e
improve verbose logging on failed test. print log file lines (#7938) 2020-10-22 11:34:54 +03:00
Yossi Gottlieb
ef92f507dd
Fix tests failure on busybox systems. (#7916) 2020-10-18 14:50:29 +03:00
Felipe Machado
c3f9e01794
Adds new pop-push commands (LMOVE, BLMOVE) (#6929)
Adding [B]LMOVE <src> <dst> RIGHT|LEFT RIGHT|LEFT. deprecating [B]RPOPLPUSH.

Note that when receiving a BRPOPLPUSH we'll still propagate an RPOPLPUSH,
but on BLMOVE RIGHT LEFT we'll propagate an LMOVE

improvement to existing tests
- Replace "after 1000" with "wait_for_condition" when wait for
  clients to block/unblock.
- Add a pre-existing element to target list on basic tests so
  that we can check if the new element was added to the correct
  side of the list.
- check command stats on the replica to make sure the right
  command was replicated

Co-authored-by: Oran Agra <oran@redislabs.com>
2020-10-08 08:33:17 +03:00
bodong.ybd
f22fa9594d Tests: Some fixes for macOS
1) cur_test: when restart_server, "no such variable" error occurs
  ./runtest --single integration/rdb
  test {client freed during loading}
      SET ::cur_test
      restart_server
        kill_server
          test "Check for memory leaks (pid $pid)"
          SET ::cur_test
          UNSET ::cur_test
      UNSET ::cur_test // This global variable has been unset.

2) `ps --ppid` not available on macOS platform, can be replaced with
`pgrep -P pid`.
2020-09-08 14:27:53 +08:00
Oran Agra
573246f73c
if diskless repl child is killed, make sure to reap the pid (#7742)
Starting redis 6.0 and the changes we made to the diskless master to be
suitable for TLS, I made the master avoid reaping (wait3) the pid of the
child until we know all replicas are done reading their rdb.

I did that in order to avoid a state where the rdb_child_pid is -1 but
we don't yet want to start another fork (still busy serving that data to
replicas).

It turns out that the solution used so far was problematic in case the
fork child was being killed (e.g. by the kernel OOM killer), in that
case there's a chance that we currently disabled the read event on the
rdb pipe, since we're waiting for a replica to become writable again.
and in that scenario the master would have never realized the child
exited, and the replica will remain hung too.
Note that there's no mechanism to detect a hung replica while it's in
rdb transfer state.

The solution here is to add another pipe which is used by the parent to
tell the child it is safe to exit. this mean that when the child exits,
for whatever reason, it is safe to reap it.

Besides that, i'm re-introducing an adjustment to REPLCONF ACK which was
part of #6271 (Accelerate diskless master connections) but was dropped
when that PR was rebased after the TLS fork/pipe changes (5a47794).
Now that RdbPipeCleanup no longer calls checkChildrenDone, and the ACK
has chance to detect that the child exited, it should be the one to call
it so that we don't have to wait for cron (server.hz) to do that.
2020-09-06 16:43:57 +03:00
Oran Agra
2b998de460
Improve valgrind support for cluster tests (#7725)
- redirect valgrind reports to a dedicated file rather than console
- try to avoid killing instances with SIGKILL so that we get the memory
  leak report (killing with SIGTERM before resorting to SIGKILL)
- search for valgrind reports when done, print them and fail the tests
- add --dont-clean option to keep the logs on exit
- fix exit error code when crash is found (would have exited with 0)

changes that affect the normal redis test suite:
- refactor check_valgrind_errors into two functions one to search and
  one to report
- move the search half into util.tcl to serve the cluster tests too
- ignore "address range perms" valgrind warnings which seem non relevant.
2020-09-06 11:11:49 +03:00
Oran Agra
1b7ba44e79 test infra - wait_done_loading
reduce code duplication in aof.tcl.
move creation of clients into the test so that it can be skipped
2020-09-06 09:59:19 +03:00
Oran Agra
9d527d076b test infra - write test name to logfile 2020-09-06 09:59:19 +03:00
Oran Agra
9ef8d2f671
Run active defrag while blocked / loading (#7726)
During long running scripts or loading RDB/AOF, we may need to do some
defragging. Since processEventsWhileBlocked is called periodically at
unknown intervals, and many cron jobs either depend on run_with_period
(including active defrag), or rely on being called at server.hz rate
(i.e. active defrag knows ho much time to run by looking at server.hz),
the whileBlockedCron may have to run a loop triggering the cron jobs in it
(currently only active defrag) several times.

Other changes:
- Adding a test for defrag during aof loading.
- Changing key-load-delay config to take negative values for fractions
  of a microsecond sleep
2020-09-03 08:47:29 +03:00
Oran Agra
109b5ccdcd
Fix failing tests due to issues with wait_for_log_message (#7572)
- the test now waits for specific set of log messages rather than wait for
  timeout looking for just one message.
- we don't wanna sample the current length of the log after an action, due
  to a race, we need to start the search from the line number of the last
  message we where waiting for.
- when attempting to trigger a full sync, use multi-exec to avoid a race
  where the replica manages to re-connect before we completed the set of
  actions that should force a full sync.
- fix verify_log_message which was broken and unused
2020-07-28 11:15:29 +03:00
Remi Collet
3f2fbc4c61
Fix deprecated tail syntax in tests (#7543) 2020-07-21 09:07:54 +03:00
Oran Agra
8e76e13472
stabilize tests that look for log lines (#7367)
tests were sensitive to additional log lines appearing in the log
causing the search to come empty handed.

instead of just looking for the n last log lines, capture the log lines
before performing the action, and then search from that offset.
2020-07-10 08:28:22 +03:00
Oran Agra
1cf33a46d5 tests: find_available_port start search from next port
i.e. don't start the search from scratch hitting the used ones again.
this will also reduce the likelihood of collisions (if there are any
left) by increasing the time until we re-use a port we did use in the
past.
2020-05-27 16:12:35 +03:00
Oran Agra
e258a1c087 tests: each test client work on a distinct port range
apparently when running tests in parallel (the default of --clients 16),
there's a chance for two tests to use the same port.
specifically, one test might shutdown a master and still have the
replica up, and then another test will re-use the port number of master
for another master, and then that replica will connect to the master of
the other test.

this can cause a master to count too many full syncs and fail a test if
we run the tests with --single integration/psync2 --loop --stop

see Probmem 2 in #7314
2020-05-26 11:17:08 +03:00
Yossi Gottlieb
b087dd1db6 TLS: Connections refactoring and TLS support.
* Introduce a connection abstraction layer for all socket operations and
integrate it across the code base.
* Provide an optional TLS connections implementation based on OpenSSL.
* Pull a newer version of hiredis with TLS support.
* Tests, redis-cli updates for TLS support.
2019-10-07 21:06:13 +03:00
Oran Agra
c56b4ddc6f prevent diskless replica from terminating on short read
now that replica can read rdb directly from the socket, it should avoid exiting
on short read and instead try to re-sync.

this commit tries to have minimal effects on non-diskless rdb reading.
and includes a test that tries to trigger this scenario on various read cases.
2019-07-17 16:46:22 +02:00
Oran Agra
2de544cfcc diskless replication on slave side (don't store rdb to file), plus some other related fixes
The implementation of the diskless replication was currently diskless only on the master side.
The slave side was still storing the received rdb file to the disk before loading it back in and parsing it.

This commit adds two modes to load rdb directly from socket:
1) when-empty
2) using "swapdb"
the third mode of using diskless slave by flushdb is risky and currently not included.

other changes:
--------------
distinguish between aof configuration and state so that we can re-enable aof only when sync eventually
succeeds (and not when exiting from readSyncBulkPayload after a failed attempt)
also a CONFIG GET and INFO during rdb loading would have lied

When loading rdb from the network, don't kill the server on short read (that can be a network error)

Fix rdb check when performed on preamble AOF

tests:
run replication tests for diskless slave too
make replication test a bit more aggressive
Add test for diskless load swapdb
2019-07-08 15:37:48 +03:00
Oran Agra
d0850369c4 fix small test suite race conditions 2018-11-12 10:26:10 +02:00
antirez
3d7d20b7f3 Test: fix lshuffle by providing the "K" combinator. 2018-07-13 17:52:39 +02:00
antirez
967ad3643c Test: add lshuffle in the Tcl utility functions set. 2018-07-13 17:51:03 +02:00
antirez
175707e550 Test: csvdump now scans all DBs. 2015-08-05 12:27:15 +02:00
Mariano Pérez Rodríguez
5afe1e37c7 Stop tests from leaving a black background
Uses ANSI "default background" color code after closing tests
so any non-black terminals don't remain polluted.

Fixes #1649
Closes #1912
2014-08-25 10:14:03 +02:00
antirez
e01195e90d Test: AOF rewrite during write load. 2014-07-10 11:25:12 +02:00
antirez
54157bc49e Test: find_available_port: check that cluster port is free as well.
The function will only return ports that have also port+10000 free, so
that Redis Cluster instances can be executed at the returned port.
2014-06-30 12:08:24 +02:00
antirez
34c404e069 Test: colorstr moved to util.tcl. 2014-02-17 17:36:50 +01:00
antirez
a1dca2efab Test: code to test server availability refactored.
Some inline test moved into server_is_up procedure.
Also find_available_port was moved into util since it is going
to be used for the Sentinel test as well.
2014-02-17 16:44:57 +01:00
antirez
d1f2d0733c Test: randomInt() behavior commented. 2013-06-25 15:32:37 +02:00
antirez
c0de45924c New test: hash ziplist -> hashtable encoding conversion.
A new stress test was added to stress test the code converting a ziplist
into an hash table.

In this commit also randomValue helper function was modified to also
return negative values.
2012-06-11 15:19:46 +02:00
antirez
80e808b6d6 EVAL replication test: less false positives.
wait_for_condition is now used instead of the usual "after 1000" (that
is the way to sleep in Tcl). This should avoid to find the replica in
a state where it is loading the RDB in memory, returning -LOADING error.

This test used to fail when running the test over valgrind, due to the
added latencies.
2012-06-02 23:29:57 +02:00
antirez
bc70b8e5f4 Tests modified to account for INFO fields renaming.
Commit 33e1db36fa modified the name of a
few INFO fields. This commit changes the Redis test to account for this
changes.
2012-05-25 15:20:59 +02:00
antirez
2bcd18a2e9 Redis test: include bug report on crash.
Due to a change in the format of the bug report in case of crash of
failed assertion the test suite was no longer able to properly log it.
Instead just a protocol error was logged by the Redis TCL client that
provided no clue about the actual problem.

This commit resolves the issue by logging everything from the first line
of the log including the string REDIS BUG REPORT, till the end of the
file.
2012-05-22 13:13:24 +02:00
antirez
bf758397a1 more valgrind (and other archs) friendly testing of floating number related features. 2011-11-16 14:40:50 +01:00
Pieter Noordhuis
6f8a32d5c7 Be less verbose in testing; improve error handling 2010-12-10 16:13:21 +01:00
antirez
4d7e125519 minor test suite bug fixed 2010-11-04 10:48:49 +01:00
antirez
6146329f1f replication test with expires 2010-08-03 13:38:39 +02:00
antirez
a0573260b0 better random dataset creation function in test. master-slave replication test now is able to save the two datasets in CSV when an inconsistency is detected. 2010-07-28 14:08:46 +02:00
antirez
dd3f505ff5 Consistency test improved 2010-07-27 14:42:11 +02:00
antirez
b056ca39f2 improved random dataset creation in test: del, sunionstore, zunionstore 2010-07-06 18:30:38 +02:00
Pieter Noordhuis
53cbf66caf initial tests for AOF (and small changes to server.tcl to support these) 2010-05-19 14:54:20 +02:00
Pieter Noordhuis
fdfb02e7ff print warnings in redis log when a test raises an exception (very likely to be caused by something like a failed assertion) 2010-05-15 23:48:08 +02:00
Pieter Noordhuis
85ecc65edc initial rough integration test for replication 2010-05-14 20:50:58 +02:00
Pieter Noordhuis
9cf9e6f197 proc to retrieve values from INFO properties 2010-05-14 20:48:57 +02:00
antirez
ab72b4833d minor fixes to the new test suite, html doc updated 2010-05-14 18:48:33 +02:00