redict

mirror of https://codeberg.org/redict/redict.git synced 2025-01-23 08:38:27 -05:00

Author	SHA1	Message	Date
Oran Agra	c17e597d05	Accelerate diskless master connections, and general re-connections (#6271) Diskless master has some inherent latencies. 1) fork starts with delay from cron rather than immediately 2) replica is put online only after an ACK. but the ACK was sent only once a second. 3) but even if it would arrive immediately, it will not register in case cron didn't yet detect that the fork is done. Besides that, when a replica disconnects, it doesn't immediately attempt to re-connect, it waits for replication cron (once per second). in case it was already online, it may be important to try to re-connect as soon as possible, so that the backlog at the master doesn't vanish. In case it disconnected during rdb transfer, one can argue that it's not very important to re-connect immediately, but this is needed for the "diskless loading short read" test to be able to run 100 iterations in 5 seconds, rather than 3 (waiting for replication cron re-connection) changes in this commit: 1) sync command starts a fork immediately if no sync_delay is configured 2) replica sends REPLCONF ACK when done reading the rdb (rather than on 1s cron) 3) when a replica unexpectedly disconnects, it immediately tries to re-connect rather than waiting 1s 4) when a child exits, if there is another replica waiting, we spawn a new one right away, instead of waiting for 1s replicationCron. 5) added a call to connectWithMaster from replicationSetMaster. which is called from the REPLICAOF command but also in 3 places in cluster.c, in all of these the connection attempt will now be immediate instead of delayed by 1 second. side note: we can add a call to rdbPipeReadHandler in replconfCommand when getting a REPLCONF ACK from the replica to solve a race where the replica got the entire rdb and EOF marker before we detected that the pipe was closed. 
in the test i did see this race happen in about one of some 300 runs, but i concluded that this race is unlikely in real life (where the replica is on another host and we're more likely to first detect the pipe was closed). the test runs 100 iterations in 3 seconds, so in some cases it'll take 4 seconds instead (waiting for another REPLCONF ACK). Removing unneeded startBgsaveForReplication from updateSlavesWaitingForBgsave Now that CheckChildrenDone is calling the new replicationStartPendingFork (extracted from serverCron) there's actually no need to call startBgsaveForReplication from updateSlavesWaitingForBgsave anymore, since as soon as updateSlavesWaitingForBgsave returns, CheckChildrenDone is calling replicationStartPendingFork that handles that anyway. The code in updateSlavesWaitingForBgsave had a bug in which it ignored repl-diskless-sync-delay, but removing that code shows that this bug was hiding another bug, which is that the max_idle should have used >= and not >, this one second delay has a big impact on my new test.	2020-08-06 16:53:06 +03:00
Oran Agra	824bd2ac11	fix new rdb test failing on timing issues (#7604) apparently on github actions sometimes 500ms is not enough	2020-08-04 08:53:50 +03:00
Oran Agra	109b5ccdcd	Fix failing tests due to issues with wait_for_log_message (#7572) - the test now waits for specific set of log messages rather than wait for timeout looking for just one message. - we don't wanna sample the current length of the log after an action, due to a race, we need to start the search from the line number of the last message we were waiting for. - when attempting to trigger a full sync, use multi-exec to avoid a race where the replica manages to re-connect before we completed the set of actions that should force a full sync. - fix verify_log_message which was broken and unused	2020-07-28 11:15:29 +03:00
Oran Agra	8a57969fd7	Stabilize bgsave test that sometimes fails with valgrind (#7559) on ci.redis.io the test fails a lot, reporting that bgsave didn't end. increasing the timeout we wait for that bgsave to get aborted. in addition to that, i also verify that it indeed got aborted by checking that the save counter wasn't reset. add another test to verify that a successful bgsave indeed resets the change counter.	2020-07-23 13:06:24 +03:00
Yossi Gottlieb	f57e844b2e	Tests: drop TCL 8.6 dependency. (#7548) This re-implements the redis-cli --pipe test so it no longer depends on a close feature available only in TCL 8.6. Basically what this test does is run redis-cli --pipe, generate a bunch of commands and pipe them through redis-cli, and inspect the result in both Redis and the redis-cli output. To do that, we need to close stdin for redis-cli to indicate we're done so it can flush its buffers and exit. TCL has bi-directional channels and only offers a way to "one-way close" a channel with TCL 8.6. To work around that, we now generate the commands into a file and feed that file to redis-cli directly. As we're writing to an actual file, the number of commands is now reduced.	2020-07-21 14:17:14 +03:00
Oran Agra	254c962554	redis-cli tests, fix valgrind timing issue (#7519) this test when run with valgrind on github actions takes 160 seconds	2020-07-14 18:04:08 +03:00
Oran Agra	e5227aab89	fix recently added time sensitive tests failing with valgrind (#7512) interestingly the latency monitor test fails because valgrind is slow enough so that the time inside PEXPIREAT command from the moment of the first mstime() call to get the basetime until checkAlreadyExpired calls mstime() again is more than 1ms, and that test was too sensitive. using this opportunity to speed up the test (unrelated to the failure) the fix is just the longer time passed to PEXPIRE.	2020-07-13 16:40:03 +03:00
Yossi Gottlieb	d9f970d8d3	TLS: Add missing redis-cli options. (#7456) * Tests: fix and reintroduce redis-cli tests. These tests have been broken and disabled for 10 years now! * TLS: add remaining redis-cli support. This adds support for the redis-cli --pipe, --rdb and --replica options previously unsupported in --tls mode. * Fix writeConn().	2020-07-10 10:25:55 +03:00
Oran Agra	8e76e13472	stabilize tests that look for log lines (#7367) tests were sensitive to additional log lines appearing in the log causing the search to come empty handed. instead of just looking for the n last log lines, capture the log lines before performing the action, and then search from that offset.	2020-07-10 08:28:22 +03:00
Oran Agra	69ade87325	tests/valgrind: don't use debug restart (#7404) * tests/valgrind: don't use debug restart DEBUG RESTART causes two issues: 1. it uses execve which replaces the original process and valgrind doesn't have a chance to check for errors, so leaks go unreported. 2. valgrind reports invalid calls to close() which we're unable to resolve. So now the tests use restart_server mechanism in the tests, that terminates the old server and starts a new one, new PID, but same stdout, stderr. since the stderr can contain two or more valgrind reports, it is not enough to just check for the absence of leaks, we also need to check for some known errors, we do both, and fail if we either find an error, or can't find a report saying there are no leaks. other changes: - when killing a server that was already terminated we check for leaks too. - adding DEBUG LEAK which was used to test it. - adding --trace-children to valgrind, although no longer needed. - since the stdout contains two or more runs, we need slightly different way of checking if the new process is up (explicitly looking for the new PID) - move the code that handles --wait-server to happen earlier (before watching the startup message in the log), and serve the restarted server too. * squashme - CR fixes	2020-07-10 08:26:52 +03:00
Oran Agra	c480af9007	fix pingoff test race	2020-05-31 15:51:52 +03:00
antirez	23f2b4d0a8	Test: take PSYNC2 test master timeout high during switch. This will likely avoid false positives due to trailing pings.	2020-05-28 10:47:30 +02:00
Oran Agra	2a8af8e675	adjust revived meaningful offset tests these tests create several edge cases that are otherwise uncovered (at least not consistently) by the test suite, so although they're no longer testing what they were meant to test, it's still a good idea to keep them in hope that they'll expose some issue in the future.	2020-05-28 09:10:51 +03:00
Oran Agra	90f3856fd5	revive meaningful offset tests	2020-05-28 08:21:24 +03:00
antirez	484cfc3d76	Another meaningful offset test removed.	2020-05-27 12:50:02 +02:00
antirez	32d0df0c1f	Remove the PSYNC2 meaningful offset test.	2020-05-27 12:47:34 +02:00
antirez	091fb64681	Test: PSYNC2 test can now show server logs.	2020-05-25 20:26:29 +02:00
Qu Chen	42f5da5d2d	Disconnect chained replicas when the replica performs PSYNC with the master always to avoid replication offset mismatch between master and chained replicas.	2020-05-21 18:42:10 -07:00
Oran Agra	75c11d7fec	fix valgrind test failure in replication test in `b4416280c` i added more keys to that test to make it run longer but in valgrind this now means the test times out, give valgrind more time.	2020-05-18 10:26:53 +03:00
antirez	0ca2f4f824	Merge branch 'unstable' of github.com:/antirez/redis into unstable	2020-05-17 18:24:48 +02:00
antirez	96bb0c9471	Improve the PSYNC2 test reliability.	2020-05-17 18:24:34 +02:00
Oran Agra	357aace895	add regression test for the race in #7205 with the original version of 6.0.0, this test detects an excessive full sync. with the fix in `1a7cd2c0e`, this test detects memory corruption, especially when using libc allocator with or without valgrind.	2020-05-17 18:26:02 +03:00
antirez	c38fd1f661	Merge branch 'free_clients_during_loading' into unstable	2020-05-14 11:28:08 +02:00
Oran Agra	b4416280cf	fix unstable replication test this test which has coverage for various flows of diskless master was failing randomly from time to time. the failure was: [err]: diskless all replicas drop during rdb pipe in tests/integration/replication.tcl log message of 'Diskless rdb transfer, last replica dropped, killing fork child' not found what seemed to have happened is that the master didn't detect that all replicas dropped by the time the replication ended, it thought that one replica is still connected. now the test takes a few seconds longer but it seems stable.	2020-05-12 08:59:09 +03:00
Oran Agra	905e28ee87	fix redis 6.0 not freeing closed connections during loading. This bug was introduced by a recent change in which readQueryFromClient is using freeClientAsync, and despite the fact that now freeClientsInAsyncFreeQueue is in beforeSleep, that's not enough since it's not called during loading in processEventsWhileBlocked. furthermore, afterSleep was called in that case but beforeSleep wasn't. This bug also caused slowness since the level-triggered mode of epoll kept signaling these connections as readable causing us to keep doing connRead again and again for all of these, which keep accumulating. now both before and after sleep are called, but not all of their actions are performed during loading, some are only reserved for the main loop. fixes issue #7215	2020-05-11 11:33:46 +03:00
Oran Agra	deee2c1ef2	add daily github actions with libc malloc and valgrind * fix memory leaks with diskless replica short read. * fix a few timing issues with valgrind runs * fix issue with valgrind and watchdog schedule signal about the valgrind WD issue: the stack trace test in logging.tcl, has issues with valgrind: ==28808== Can't extend stack to 0x1ffeffdb38 during signal delivery for thread 1: ==28808== too small or bad protection modes it seems to be some valgrind bug with SA_ONSTACK. SA_ONSTACK seems unneeded since WD is not recursive (SA_NODEFER was removed), also, not sure if it's even valid without a call to sigaltstack()	2020-05-04 09:52:20 +03:00
Oran Agra	d31c0c5264	fix loading race in psync2 tests	2020-04-28 09:18:01 +03:00
Oran Agra	4447ddc8bb	Keep track of meaningful replication offset in replicas too Now both master and replicas keep track of the last replication offset that contains meaningful data (ignoring the trailing pings), and both trim that tail from the replication backlog, and the offset which they try to use for psync. the implication is that if someone missed some pings, or even has excessive pings that the promoted replica has, it'll still be able to psync (avoid full sync). the downside (which was already committed) is that replicas running old code may fail to psync, since the promoted replica trims pings from its backlog. This commit adds a test that reproduces several cases of promotions and demotions with stale and non-stale pings Background: The meaningful offset on the master was added recently to solve a problem where the master is left all alone, injecting PINGs into its backlog when no one is listening and then gets demoted and tries to replicate from a replica that didn't have any of the PINGs (or at least not the last ones). however, consider this case: master A has two replicas (B and C) replicating directly from it. there's no traffic at all, and also no network issues, just many pings in the tail of the backlog. now B gets promoted, A becomes a replica of B, and C remains a replica of A. when A gets demoted, it trims the pings from its backlog, and successfully replicates from B. however, C is still aware of these PINGs, when it'll disconnect and re-connect to A, it'll ask for something that's not in the backlog anymore (since A trimmed the tail of its backlog), and be forced to do a full sync (something it didn't have to do before the meaningful offset fix). Besides that, the psync2 test was always failing randomly here and there, it turns out the reason was PINGs. 
Investigating it shows the following scenario: cycle 1: redis #1 is master, and all the rest are direct replicas of #1 cycle 2: redis #2 is promoted to master, #1 is a replica of #2 and #3 is replica of #1 now we see that when #1 is demoted it prints: 17339:S 21 Apr 2020 11:16:38.523 * Using the meaningful offset 3929963 instead of 3929977 to exclude the final PINGs (14 bytes difference) 17339:S 21 Apr 2020 11:16:39.391 * Trying a partial resynchronization (request e2b3f8817735fdfe5fa4626766daa938b61419e5:3929964). 17339:S 21 Apr 2020 11:16:39.392 * Successful partial resynchronization with master. and when #3 connects to the demoted #2, #2 says: 17339:S 21 Apr 2020 11:16:40.084 * Partial resynchronization not accepted: Requested offset for secondary ID was 3929978, but I can reply up to 3929964 so the issue here is that the meaningful offset feature saved the day for the demoted master (since it needs to sync from a replica that didn't get the last ping), but it didn't help one of the other replicas which did get the last ping.	2020-04-27 15:52:23 +02:00
antirez	c4d7f30e25	PSYNC2: meaningful offset test.	2020-03-25 15:43:34 +01:00
Oran Agra	27641ee490	fix for flaky psync2 test *** [err]: PSYNC2: total sum of full synchronizations is exactly 4 in tests/integration/psync2.tcl Expected 5 == 4 (context: type eval line 6 cmd {assert {$sum == 4}} proc ::test) issue was that sometimes the test got an unexpected full sync since it tried to switch to the replica before it was in sync with its master.	2020-03-05 16:55:14 +02:00
zhaozhao.zz	58554396d6	incrbyfloat: fix issue #5256 ttl lost after propagate	2019-12-18 15:44:51 +08:00
Yossi Gottlieb	0db3b0a0ff	Merge remote-tracking branch 'upstream/unstable' into tls	2019-10-16 17:08:07 +03:00
Daniel Dai	98600c9a11	update typo	2019-10-09 14:15:31 -04:00
Yossi Gottlieb	61733ded14	TLS: Configuration options. Add configuration options for TLS protocol versions, ciphers/cipher suites selection, etc.	2019-10-07 21:07:27 +03:00
Oran Agra	6b6294807c	TLS: Implement support for write barrier.	2019-10-07 21:06:30 +03:00
Oran Agra	5a47794606	diskless replication rdb transfer uses pipe, and writes to sockets from the parent process. misc: - handle SSL_has_pending by iterating through these in beforeSleep, and setting timeout of 0 to aeProcessEvents - fix issue with epoll signaling EPOLLHUP and EPOLLERR only to the write handlers. (needed to detect the rdb pipe was closed) - add key-load-delay config for testing - trim connShutdown which is no longer needed - rioFdsetWrite -> rioFdWrite - simplified since there's no longer need to write to multiple FDs - don't detect rdb child exited (don't call wait3) until we detect the pipe is closed - Cleanup bad optimization from rio.c, add another one	2019-10-07 21:06:30 +03:00
Yossi Gottlieb	b087dd1db6	TLS: Connections refactoring and TLS support. * Introduce a connection abstraction layer for all socket operations and integrate it across the code base. * Provide an optional TLS connections implementation based on OpenSSL. * Pull a newer version of hiredis with TLS support. * Tests, redis-cli updates for TLS support.	2019-10-07 21:06:13 +03:00
Salvatore Sanfilippo	6129758558	Merge branch 'unstable' into modules_fork	2019-09-27 11:24:06 +02:00
Oran Agra	83e87bac76	Fix lastbgsave_status, when new child signal handler get intended kill And add a test for that.	2019-09-26 15:16:34 +03:00
Oran Agra	c56b4ddc6f	prevent diskless replica from terminating on short read now that replica can read rdb directly from the socket, it should avoid exiting on short read and instead try to re-sync. this commit tries to have minimal effects on non-diskless rdb reading. and includes a test that tries to trigger this scenario on various read cases.	2019-07-17 16:46:22 +02:00
Oran Agra	2de544cfcc	diskless replication on slave side (don't store rdb to file), plus some other related fixes The implementation of the diskless replication was currently diskless only on the master side. The slave side was still storing the received rdb file to the disk before loading it back in and parsing it. This commit adds two modes to load rdb directly from socket: 1) when-empty 2) using "swapdb" the third mode of using diskless slave by flushdb is risky and currently not included. other changes: -------------- distinguish between aof configuration and state so that we can re-enable aof only when sync eventually succeeds (and not when exiting from readSyncBulkPayload after a failed attempt) also a CONFIG GET and INFO during rdb loading would have lied When loading rdb from the network, don't kill the server on short read (that can be a network error) Fix rdb check when performed on preamble AOF tests: run replication tests for diskless slave too make replication test a bit more aggressive Add test for diskless load swapdb	2019-07-08 15:37:48 +03:00
Oran Agra	ba809f26d4	make replication tests more stable on slow machines solving few replication related tests race conditions which fail on slow machines bugfix in slave buffers test: since the test is executed twice, each time with a different commands count, the threshold for the delta can't be a constant.	2019-05-05 08:25:01 +03:00
antirez	009a929269	Remove debugging printf from replication.tcl test.	2018-12-12 11:55:30 +01:00
antirez	4cf8fdbbd3	Slave removal: remove slave from integration tests descriptions.	2018-09-11 15:32:28 +02:00
antirez	febe102bf6	Test: processing of master stream in slave -BUSY state. See #5297.	2018-08-31 16:45:02 +02:00
antirez	7a30be1237	Minor improvements to PR #5187.	2018-07-31 17:30:12 +02:00
Jack Drogon	93238575f7	Fix typo	2018-07-03 18:19:46 +02:00
Oran Agra	de495ee7ab	minor fix in creating a stream NACK for rdb and defrag tests	2018-06-27 15:34:17 +03:00
Oran Agra	5616d4c603	add active defrag support for streams	2018-06-27 15:00:41 +03:00
antirez	e72190252e	Test RDB stream encoding saving/loading.	2018-06-19 16:29:15 +02:00

159 Commits