redict

mirror of https://codeberg.org/redict/redict.git synced 2025-01-22 08:08:53 -05:00

Author	SHA1	Message	Date
Binbin	9b6dded421	Fix empty rehashing list in swapdb mode (#12770 ) In swapdb mode, the temp db does not init the rehashing list, the change added in #12764 caused cluster ci to fail.	2023-11-16 11:18:25 +02:00
Binbin	4366bbaa61	Empty rehashing list in emptyDbStructure (#12764 ) This is currently harmless, since we have already cleared the dict before, it will reset the rehashidx to -1, and in incrementallyRehash we will call dictIsRehashing to check. It would be nice to empty the list to avoid meaningless attempts, and the code is also unified to reduce misunderstandings.	2023-11-15 07:55:34 +02:00
Binbin	fe36306340	Fix DB iterator not resetting pauserehash causing dict being unable to rehash (#12757 ) When using the DB iterator, dictInitSafeIterator is used to init an old safe dict iterator. When dbIteratorNext is done with one dict, it jumps to the next slot's db dict. During this process there are no calls to dictResumeRehashing, so the dict's pauserehash always stays > 0. As a result, dictRehashMilliseconds returns immediately, which leaves the slot dict in a state where rehashing can never complete. In the "expire scan should skip dictionaries with lot's of empty buckets" test, adding a `keys *` reproduces the problem reliably. `keys *` calls dbIteratorNext to trigger a traversal of all slot dicts. Added dbReleaseIterator and dbIteratorInitNextSafeIterator methods to call dictResetIterator. Issue was introduced in #11695.	2023-11-14 14:28:46 +02:00
Yossi Gottlieb	a9e73c00bc	Reduce FreeBSD daily scope. (#12758 ) The full test is very flaky running on a VM inside GitHub worker, so we have to settle for only building and running a small smoke test.	2023-11-13 17:22:09 +02:00
Roshan Khatri	88e83e517b	Add DEBUG_ASSERTIONS option to custom assert (#12667 ) This PR introduces a new macro, serverAssertWithInfoDebug, to do complex assertions only for debugging. The main intention is to allow running complex operations during tests without impacting runtime performance. This assertion is enabled when setting DEBUG_ASSERTIONS. The DEBUG_ASSERTIONS flag is set for the daily and CI variants of `test-sanitizer-address`.	2023-11-11 20:31:34 -08:00
Harkrishn Patro	9ca8490315	Increase timeout for expiry cluster tests (#12752 ) Test recently added fails on timeout in valgrind in GH actions. Locally with valgrind the test finishes within 1.5 sec(s). Couldn't find any issue due to lack of reproducibility. Increasing the timeout and adding an additional log to the test to understand how many keys were left at the end.	2023-11-11 12:01:04 +02:00
zhaozhao.zz	6258edebf0	reset bucket_count when empty db (#12750 ) Introduced in #12697 , should reset bucket_count when empty db, or the overhead memory usage of db can be miscalculated.	2023-11-10 15:52:57 +02:00
zhaozhao.zz	cf6ed3feeb	fix the wrong judgement for activerehashing in standalone mode (#12741 ) Introduced by #11695, the judgement should be dictIsRehashing.	2023-11-09 11:30:50 +02:00
Binbin	53294e537c	Fix genClusterDebugString minor sds leaks (#12739 ) This function now will only be called in printCrashReport, so this is just a cleanup.	2023-11-08 19:14:36 +02:00
Meir Shpilraien (Spielrein)	0ffb9d2ea9	Before evicted and before expired server events are not executed inside an execution unit. (#12733 ) Redis 7.2 (#9406) introduced a new modules event, `RedisModuleEvent_Key`. This new event allows the module to read the key data just before it is removed from the database (either deleted, expired, evicted, or overwritten). When the key was removed from the database by active expire or eviction, the new event was not called as part of an execution unit. This can cause an issue if the module registers a post notification job inside the event. This job will not be executed atomically with the expiration/eviction operation and will not be replicated inside a Multi/Exec. Moreover, the post notification job will be executed right after the event, when it is still not safe to perform any write operation. This violates the promise that a post notification job will be called atomically with the operation that triggered it and only when it is safe to write. This PR fixes the issue by wrapping each expiration/eviction of a key with an execution unit. This makes sure the entire operation will run atomically and all the post notification jobs will be executed at the end, where it is safe to write. Tests were modified to verify the fix.	2023-11-08 09:28:22 +02:00
Yossi Gottlieb	6223355cf3	Use cross-platform-actions for FreeBSD support. (#12732 ) This change overcomes many stability issues experienced with the vmactions action. We need to limit VMs to 8GB for better stability, as the 13GB default seems to hang them occasionally. Shell code has been simplified since this action seem to use `bash -e` which will abort on non-zero exit codes anyway.	2023-11-06 18:07:14 +02:00
dingrui	a888503b4f	Remove unnecessary argument(tp) in gettimeofday() call for retrieving timezone (#12722 ) Changes the `gettimeofday` caller by removing an unused optional output argument. This has two benefits: - simpler code, with the unnecessary argument discarded. - possibly faster, due to the implementation in the kernel.	2023-11-06 15:10:09 +02:00
Chen Tianjie	282b82e9d2	Handle all CLUSTER_REDIR_ error code when verifying script. (#12707 ) Clarify the errors related to the cluster mode in the script, return the command that encountered an execution error along with the specific error message. --------- Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2023-11-06 17:48:58 +08:00
Wen Hui	28b6155ba5	Fix the bug that write redis sensitive command information to redis_cli historyfile (#11489 ) Currently, we do not write the following sensitive commands into the ~/.rediscli_history file: ACL SETUSER username [rule [rule ...]] AUTH [username] password HELLO [AUTH username password] MIGRATE host port <key | ""> destination-db timeout [[AUTH password | AUTH2 username password]] CONFIG SET masterauth master-password CONFIG SET masteruser username CONFIG SET requirepass foobared However, we still write the following sensitive commands into the ~/.rediscli_history file: ACL GETUSER username Sentinel CONFIG set sentinel-pass password Sentinel CONFIG set sentinel-user username Sentinel set mastername auth-pass password Sentinel set mastername auth-user username This change adds the commands of the second list to be skipped from being written to the history file.	2023-11-05 14:20:15 +02:00
Roshan Khatri	15a048d4f0	re-enable defrag tests in cluster mode. (#12710 ) Reverts the skipping of defrag tests in cluster mode (done in #12672). Instead, it skips only some defrag tests that are relevant for cluster mode. The tests now run well after investigating and making the changes in #12674 and #12694. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-11-02 13:55:48 +02:00
dependabot[bot]	0ce74872c4	Bump actions/setup-node from 3 to 4 (#12708 ) Bumps [actions/setup-node](https://github.com/actions/setup-node) from 3 to 4. - [Release notes](https://github.com/actions/setup-node/releases) - [Commits](https://github.com/actions/setup-node/compare/v3...v4) --- updated-dependencies: - dependency-name: actions/setup-node dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-11-02 12:22:50 +02:00
Viktor Söderqvist	8878817d89	Optimize SCAN with MATCH when pattern implies cluster slot (#12536 ) Optimize the performance of SCAN commands when a match pattern can only contain keys from a single slot in cluster mode. This can happen when the pattern contains a hash tag before any wildcard matchers or when the key contains no matchers.	2023-11-01 00:06:49 -07:00
Chen Tianjie	e9f312e087	Change stat_client_qbuf_limit_disconnections to atomic. (#12711 ) In #12476 server.stat_client_qbuf_limit_disconnections was added. It is written in readQueryFromClient, which may be called by multiple threads when io-threads and io-threads-do-reads are turned on. Somehow we neglected to make it an atomic variable.	2023-11-01 10:57:24 +08:00
Viktor Söderqvist	8d675950e6	Don't crash when adding a forgotten node to blacklist twice (#12702 ) Add a defensive check to prevent double-freeing a node from the cluster blacklist.	2023-10-31 07:20:06 -07:00
erpeng	4bbb2b0152	Optimize CPU cache efficiency on dict while it's being rehashed (#5692 ) When looking up a key while the dict is rehashing, we currently have to look in both tables (ht[0] and ht[1]). If we compare the key's index to rehashidx and index < rehashidx, we can conclude: 1. the dict is rehashing (rehashidx is -1 if it is not rehashing) 2. the key cannot be found in ht[0], so we just continue to look for the key in ht[1] The possible performance gain here is not in skipping the loop over the linked list (which is empty), but rather in skipping the lookup in the table (which could be a cache miss). --------- Co-authored-by: zhangshihua003 <zhangshihua003@ke.com> Co-authored-by: sundb <sundbcn@gmail.com> Co-authored-by: Oran Agra <oran@redislabs.com> Co-authored-by: judeng <abc3844@126.com>	2023-10-31 09:57:26 +02:00
Roshan Khatri	f7fa481156	Optimize finding the slot for a given key count in a fenwick tree (#12704 ) This PR optimizes the time complexity of findSlotByKeyIndex from O(log^2(N)) to O(log(N)) by using the tree structure of binary index tree to find a slot in one search of the index.	2023-10-27 17:15:19 -07:00
Harkrishn Patro	4145d628b4	Reduce dbBuckets operation time complexity from O(N) to O(1) (#12697 ) As part of #11695 independent dictionaries were introduced per slot. Time complexity to discover total no. of buckets across all dictionaries increased to O(N) with straightforward implementation of iterating over all dictionaries and adding dictBuckets of each. To optimize the time complexity, we could maintain a global counter at db level to keep track of the count of buckets and update it on the start and end of rehashing. --------- Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>	2023-10-27 22:05:40 +03:00
Roshan Khatri	7d68208a6e	Reset later item flag after defrag later is done (#12694 ) Fixing issues described in #12672, started after #11695 Related to #12674 Fixes the `defrag didn't stop' issue. In some cases of how the keys were stored in memory defrag_later_item_in_progress was not getting reset once we finish defragging the later items and we move to the next slot. This stopped the scan to happen in the later slots and did not get	2023-10-27 13:56:15 +03:00
Oran Agra	ba900f6cb8	Fix fd leak causing deleted files to remain open and eat disk space (#12693 ) This was introduced in v7.2 by #11248	2023-10-25 20:54:02 +03:00
Binbin	372ea21875	Update comment around propagateDeletion (#12687 ) Fix some outdated comments and add comment for moduleNotifyKeyspaceEvent we added in #11084 since it seems a bit implicit. --------- Co-authored-by: Oran Agra <oran@redislabs.com>	2023-10-24 13:10:03 +03:00
Harkrishn Patro	3fac869f02	Fix test, disable expiration until empty buckets are formed (#12689 ) Test failure on freebsd CI: ``` *** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl scan didn't handle slot skipping logic. ``` Observation: expiry of keys might happen before the empty buckets are formed and won't help with the expiry skip logic validation. Solution: Disable expiration until the empty buckets are formed.	2023-10-24 11:29:40 +03:00
Harkrishn Patro	26eb4ce397	Fix defrag test (#12674 ) Fixing issues started after #11695 when the defrag tests are being executed in cluster mode too. For some reason, it looks like the defragmentation is over too quickly, before the test is able to detect that it's running. so now instead of waiting to see that it's active, we wait to see that it did some work ``` [err]: Active defrag big list: cluster in tests/unit/memefficiency.tcl defrag not started. [err]: Active defrag big keys: cluster in tests/unit/memefficiency.tcl defrag didn't stop. ```	2023-10-22 11:56:45 +03:00
Harkrishn Patro	becd50d0da	Disable flaky defrag tests affecting daily run (#12672 ) Temporarily disabling few of the defrag tests in cluster mode to make the daily run stable: Active defrag eval scripts Active defrag big keys Active defrag big list Active defrag edge case	2023-10-19 21:12:58 +03:00
Harkrishn Patro	f3bf8485d8	Fix resize hash table dictionary iterator (#12660 ) Dictionary iterator logic in the `tryResizeHashTables` method is picking the next (incorrect) dictionary while the cursor is at a given slot. This could lead to some dictionary/slot getting skipped from resizing. Also stabilize the test. problem introduced recently in #11695	2023-10-19 13:58:32 +03:00
Oran Agra	03345ddc7f	Fix issue of listen before chmod on Unix sockets (CVE-2023-45145) (#12671 ) Before this commit, Unix socket setup performed chmod(2) on the socket file after calling listen(2). Depending on what umask is used, this could leave the file with the wrong permissions for a short period of time. As a result, another process could exploit this race condition and establish a connection that would otherwise not be possible. We now make sure the socket permissions are set up prior to calling listen(2). (cherry picked from commit 1119ecae6fd8796fa337df2212f09173ab6c7b0a) Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>	2023-10-18 14:00:00 +03:00
sundb	3c734b8e9d	Add new compilation CI for macos-11 and macos-13 (#12666 ) As discussed in #12611, add a build CI for macOS 11 and 13 to avoid compatibility breakage introduced by future macOS SDK versions.	2023-10-18 13:25:52 +03:00
meiravgri	d27c7413a9	remove heap allocations from signal handlers. (#12655 ) Using heap allocation during signal handlers is unsafe. The purpose of this PR is to replace all the heap allocations done within the signal handlers raised upon server crash and assertions. These were added in #12453. writeStacktraces(): allocates the stacktraces output array on the calling thread's stack and assigns the address to a global variable. It calls `ThreadsManager_runOnThreads()`, which makes each thread invoke `collect_stacktrace_data()`: each thread writes to a different location in the above array to allow sync writes. get_ready_to_signal_threads_tids(): instead of allocating the `tids` array, it receives it as a fixed size array parameter, allocated on the stack of the calling function, and returns the number of valid threads. The array size is hard-coded to 50. `ThreadsManager_runOnThreads():` To avoid the outputs array allocation, the callback signature was changed. Now it should return void. The return type of this function has also changed to int - it returns 1 if successful, and 0 otherwise. Other unsafe calls will be handled in following PRs	2023-10-16 17:21:49 +03:00
Vitaly	0270abda82	Replace cluster metadata with slot specific dictionaries (#11695 ) This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data. ## Important changes * Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms. * getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time. * Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree. 
* scan API - in order to perform a scan across the entire DB, the cursor now needs to save not only the position within the dictionary but also the slot id. In this change we append the slot id into the LSB of the cursor so it can be passed around between client and the server. This has an interesting side effect: you'll now be able to start scanning a specific slot by simply providing the slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth noting the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue. * Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple times, hence we are relying on the cached slot id in the client during command execution. All operations that access random keys should either pass in the known slot or recompute the slot. * Slot info in RDB - in order to resize individual dictionaries correctly while loading RDB, it's not enough to know the total number of keys (of course we could approximate the number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains the number of keys in each slot, which can be used as a hint during loading. * DB size - besides the `DBSIZE` API, we need to know the size of the DB in many places. In order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep the DBSIZE operation O(1). This is also kept for O(1) expires computation as well. ## Performance This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. 
For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict. RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load. ## Interface changes * Removed `overhead.hashtable.slot-to-keys` from `MEMORY STATS` * Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored. * New RDB version to support the new op code for SLOT information. --------- Co-authored-by: Vitaly Arbuzov <arvit@amazon.com> Co-authored-by: Harkrishn Patro <harkrisp@amazon.com> Co-authored-by: Roshan Khatri <rvkhatri@amazon.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Oran Agra <oran@redislabs.com>	2023-10-14 23:58:26 -07:00
Oran Agra	f0c1c730d4	test suite: clean server pids after server crashed (#12639 ) When a server in the test suite crashes and is restarted by restart_server, we didn't clean its pid from the list. We can see that when the corrupt-dump-fuzzer hangs, it has a long list of servers to clean, but in fact they're all already dead.	2023-10-13 16:28:52 +03:00
Harkrishn Patro	b784c5375e	Unsubscribe all clients from replica for shard channel if the master ownership changes (#12577 ) Unsubscribe all clients from replica for shard channel if the master ownership changes	2023-10-12 20:48:27 -07:00
Ye Lin Aung	b705049a7a	Replace `emptyDb()` with new `emptyData()` (#12646 ) The function was renamed, but the comments were outdated.	2023-10-12 15:34:08 +03:00
zhaozhao.zz	77a65e82b2	support XREAD[GROUP] with BLOCK option in scripts (#12596 ) In #11568 we removed the NOSCRIPT flag from commands and keep the BLOCKING flag. Aiming to allow them in scripts and let them implicitly behave in the non-blocking way. In that sense, the old behavior was to allow LPOP and reject BLPOP, and the new behavior, is to allow BLPOP too, and fail it only in case it ends up blocking. So likewise, so far we allowed XREAD and rejected XREAD BLOCK, and we will now allow that too, and only reject it if it ends up blocking.	2023-10-12 10:54:50 +03:00
Binbin	e5ef161374	Fix crash when running rebalance command in a mixed cluster of 7.0 and 7.2 (#12604 ) In #10536 we introduced the assert. Some older server versions (like 7.0) don't gossip shard_id, so we will not add the node to cluster->shards, and node->shard_id is filled in randomly and may not be found here. As a result, if we add a 7.2 node to a 7.0 cluster and allocate slots to the 7.2 node, the 7.2 node will crash when it hits this assert. Somewhat like #12538. In this PR, we remove the assert and replace it with an unconditional removal.	2023-10-11 22:15:25 -07:00
Binbin	4de4fcf280	Fix redis-cli pubsub_mode and connect minor prompt / crash issue (#12571 ) When entering pubsub mode and using the redis-cli only connect command, we need to reset pubsub_mode because we switch to a different connection. This will affect the prompt when the connection is successful, and redis-cli will crash when the connect fails: ``` 127.0.0.1:6379> subscribe ch 1) "subscribe" 2) "ch" 3) (integer) 1 127.0.0.1:6379(subscribed mode)> connect 127.0.0.1 6380 127.0.0.1:6380(subscribed mode)> ping PONG 127.0.0.1:6380(subscribed mode)> connect a b Could not connect to Redis at a:0: Name or service not known Segmentation fault ```	2023-10-11 10:45:38 +03:00
Binbin	8d92f7f2b7	Support NO ONE block in REPLICAOF command json (#12633 ) The current commands.json doesn't mention the special NO ONE arguments. This change is also applied to SLAVEOF	2023-10-10 11:10:40 +03:00
Oran Agra	b810384c62	dump server logs on hang corrupt dump fuzzer test Recently there have been some incidents of hung tests in the CI. When we try to reproduce them, we get an assertion, not a hang. Maybe the server logs will reveal some info.	2023-10-08 16:19:31 +03:00
Jachin	a2b0701d2c	Fix compile on macOS 13 (#12611 ) Use the __MAC_OS_X_VERSION_MIN_REQUIRED macro to detect the macOS system version instead of using MAC_OS_X_VERSION_10_6. From MacOSX14.0.sdk, the default definitions of MAC_OS_X_VERSION_xxx have been removed in usr/include/AvailabilityMacros.h. It includes AvailabilityVersions.h, where the following condition must be met: `#if (!defined(_POSIX_C_SOURCE) && !defined(_XOPEN_SOURCE)) || defined(_DARWIN_C_SOURCE)` Only then will MAC_OS_X_VERSION_xxx be defined. However, in the project, _DARWIN_C_SOURCE is not defined, which leads to the loss of the definition for MAC_OS_X_VERSION_10_6.	2023-10-08 11:12:50 +03:00
Oran Agra	fe37e4fc87	Cleanup nested module keyspace notifications (#12630 ) Recently we added a way for the module to declare that it wishes to receive nested KSN, by setting ALLOW_NESTED_KEYSPACE_NOTIFICATIONS. but it looks like this flow has a bug, clearing the `active` member when it was previously set. however, since nesting is permitted, this bug has no implications, since regardless of the active member, the notification is permitted.	2023-10-05 13:50:17 +03:00
YaacovHazan	2cf50ddbad	Fix 'load corrupted rdb with no CRC' test (#12629 ) After the change in #12626 (`2e0f6724e`), the is_alive proc gets pid and not server config. This PR aligns it in 'load corrupted rdb with no CRC' test.	2023-10-03 11:09:25 +03:00
Madelyn Olson	31c3172d9b	Better standardize around assertions (#12539 ) We use the C standard assert() in various places in the codebase, which requires NDEBUG to be undefined. We introduced the redisassert.h file in order to allow low level files to access the assert that maps to serverPanic, but this was only applied tactically and is not available broadly. This PR removes all usage of the standard library asserts and replaces them with an assert that maps to serverPanic. It makes us immune to accidentally setting the NDEBUG flag preventing assertions. I also marked the server asserts as "likely" to not execute. I spot checked various points in the code, and it didn't change the code layout on my x86 mac, but it is more consistent with redisassert.h and seems more correct overall.	2023-10-02 18:58:44 -07:00
Madelyn Olson	9d31768cbb	Fix a couple of tabs that caused misindentation (#12541 ) Fixed some usages of tabs which caused weird indentation in the code. Tried to find all of the places so there was one PR. I ignored all of the usages of tabs which don't really affect readability.	2023-10-02 16:44:09 -07:00
meiravgri	4ba9e18ef0	fix crash in crash-report and other improvements (#12623 ) ## Crash fix ### Current behavior We might crash if we fail to collect some of the threads' output. If it exceeds timeout for example. The threads mngr API guarantees that the output array length will be `tids_len`, however, some indices can be NULL, in case it fails to collect some of the threads' outputs. When we use the threads mngr to collect the threads' stacktraces, we rely on this and skip NULL entries. Since the output array was allocated with malloc, instead of NULL, it contained garbage, so we got a segmentation fault when trying to read this garbage. (in debug.c:writeStacktraces() ) ### fix Allocate the global output array with zcalloc. ### To reproduce the bug, you'll have to change the code: in threadsmngr:ThreadsManager_runOnThreads(): make sure the g_output_array allocation is initialized with garbage and not 0s (add `memset(g_output_array, 2, sizeof(void) tids_len);` below the allocation). Force one of the threads to write to the array: add a global var: `static redisAtomic size_t return_now = 0;` add to `invoke_callback()` before writing to the output array: ``` size_t i_return; atomicGetIncr(return_now, i_return, 1); if(i_return == 1) return; ``` compile, start the server with `--enable-debug-command local` and run `redis-cli debug assert` The assertion triggers the the stacktrace collection. Expect to get 2 prints of the stack trace - since we get the segmentation fault after we return from the threads mngr, it can be safely triggered again. ## Added global variables r/w lock in ThreadsManager To avoid a situation where the main thread runs `ThreadsManager_cleanups` while threads are still invoking the signal handler, we use a r/w lock. For cleanups, we will acquire the write lock. The threads will acquire the read lock to enable them to write simultaneously. If we fail to acquire the read lock, it means cleanups are in progress and we return immediately. 
After acquiring the lock we can safely check that the global output array wasn't nullified and proceed to write to it. This way we ensure the threads are not modifying the global variables or trying to write to the output array after they were zeroed/nullified/destroyed (the semaphore). ## other minor logging change 1. removed logging if the semaphore times out because the threads can still write to the output array after this check. Instead, we print the total number of printed stacktraces compared to the expected number (len_tids). 2. use noinline attribute to make sure the uplevel number of ignored stack trace entries stays correct. 3. improve testing Co-authored-by: Oran Agra <oran@redislabs.com>	2023-10-02 20:02:02 +03:00
YaacovHazan	2e0f6724e0	Stabilization and improvements around aof tests (#12626 ) In some tests, the code manually searches for a log message, and it uses tail -1 with a delay of 1 second, which can miss the expected line. Also, because the aof tests use start_server_aof and not start_server, the test name doesn't log into the server log. To fix the above, I made the following changes: - Change the start_server_aof to wrap the start_server. This will add the created aof server to the servers list, and make srv() and wait_for_log_messages() available for the tests. - Introduce a new option for start_server. 'wait_ready' - an option to let the caller start the test code without waiting for the server to be ready. useful for tests on a server that is expected to exit on startup. - Create a new start_server_aof_ex. The new proc also accept options as argument and make use of the new 'short_life' option for tests that are expected to exit on startup because of some error in the aof file(s). Because of the above, I had to change many lines and replace every local srv variable (a server config) usage with the srv().	2023-10-02 08:20:53 +03:00
guybe7	c2a4b78491	WAITAOF: Update fsynced_reploff_pending even if there's nothing to fsync (#12622 ) The problem is that WAITAOF could hang in case commands were propagated only to replicas. This can happen if a module uses RM_Call with the REDISMODULE_ARGV_NO_AOF flag. In that case, master_repl_offset would increase, but there would be nothing to fsync, so in the absence of other traffic, fsynced_reploff_pending would stay static, and WAITAOF can hang. This commit updates fsynced_reploff_pending to the latest offset in flushAppendOnlyFile in case there's nothing to fsync, i.e. if it's behind because of the above-mentioned case, it'll be refreshed and release the WAITAOF. Other changes: Fix a race in wait.tcl (client getting blocked vs. the fsync thread)	2023-09-28 17:19:20 +03:00
guybe7	bfa3931a04	WAITAOF: Update fsynced_reploff_pending just before starting the initial AOFRW fork (#12620 ) If we set `fsynced_reploff_pending` in `startAppendOnly`, and the fork doesn't start immediately (e.g. there's another fork active at the time), any subsequent commands will increment `server.master_repl_offset`, but will not cause a fsync (given they were executed before the fork started, they just ended up in the RDB part of it). Therefore, any WAITAOF will wait on the new master_repl_offset, but it will time out because no fsync will be executed. Release notes: ``` WAITAOF could timeout in the absence of write traffic in case a new AOF is created and an AOFRW can't immediately start. This can happen when the appendonly config is changed at runtime, but also after FLUSHALL, and replica full sync. ```	2023-09-28 17:05:53 +03:00
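
Commits f7fa481156 (#12704) and 0270abda82 (#11695) above describe finding the slot that holds the N-th key through a binary index (Fenwick) tree of per-slot key counts. The following is a minimal sketch of that single-pass O(log N) descent, not the Redis implementation: the names `slot_index`, `slot_index_add` and `slot_index_find` and the 16-slot size are assumptions made for illustration (Redis Cluster uses 16384 slots).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative size only; must be a power of two for the descent below. */
#define NUM_SLOTS 16

/* 1-based Fenwick tree over per-slot key counts (hypothetical type).
 * tree[i] holds the sum of counts for slots (i - lowbit(i), i]. */
typedef struct {
    long long tree[NUM_SLOTS + 1];
} slot_index;

/* Add `delta` keys to the 0-based `slot`. */
static void slot_index_add(slot_index *idx, int slot, long long delta) {
    assert(slot >= 0 && slot < NUM_SLOTS);
    for (int i = slot + 1; i <= NUM_SLOTS; i += i & -i)
        idx->tree[i] += delta;
}

/* Return the 0-based slot containing the key with 1-based rank `target`.
 * The caller must ensure 1 <= target <= total number of keys.
 * One top-down pass over the tree: O(log NUM_SLOTS). */
static int slot_index_find(const slot_index *idx, long long target) {
    int pos = 0;
    for (int step = NUM_SLOTS; step > 0; step >>= 1) {
        int next = pos + step;
        /* Move right only while the covered prefix still holds
         * fewer than `target` keys. */
        if (next <= NUM_SLOTS && idx->tree[next] < target) {
            pos = next;
            target -= idx->tree[next];
        }
    }
    /* `pos` slots lie entirely before the target key. */
    return pos;
}
```

With 10 keys in slot 0 and 3 keys in slot 5, `slot_index_find(&idx, 11)` lands on slot 5, matching the "11th key" example in the #11695 message. A prefix-sum query inside a binary search would instead cost O(log^2 N), which is the complexity #12704 says it removed.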


12203 Commits
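
Commit 4bbb2b0152 (#5692) in the log above skips the ht[0] probe during rehashing when the key's bucket index is already below rehashidx. A minimal sketch of that decision follows, assuming a toy two-table dict with power-of-two sizes; `toy_dict` and `first_table_to_probe` are hypothetical names, not Redis identifiers.

```c
#include <stddef.h>

/* Hypothetical, simplified view of a dict with two hash tables. */
typedef struct {
    size_t size[2];   /* bucket counts of ht[0] and ht[1], powers of two */
    long rehashidx;   /* -1 when not rehashing, else next ht[0] bucket to move */
} toy_dict;

/* Return which table a lookup should probe first for `hash`.
 * Buckets of ht[0] below rehashidx have already been migrated and are
 * empty, so reading them would only risk a needless cache miss. */
static int first_table_to_probe(const toy_dict *d, unsigned long long hash) {
    if (d->rehashidx == -1)
        return 0;                               /* not rehashing: only ht[0] */
    size_t idx0 = (size_t)(hash & (d->size[0] - 1));
    return ((long)idx0 < d->rehashidx) ? 1 : 0; /* migrated bucket: go to ht[1] */
}
```

When the function returns 0 during rehashing, the lookup still has to fall through to ht[1] if the key is not found in ht[0]; the sketch covers only the choice of the first probe.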