redict/tests/integration/logging.tcl

tags {"external:skip"} {
set system_name [string tolower [exec uname -s]]
set backtrace_supported [system_backtrace_supported]
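# Note: system_backtrace_supported is a helper defined by the test harness (not
# in this file); it reports whether this platform and build can produce
# backtraces in crash reports.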
set threads_mngr_supported 0 ;# Do we support printing stack trace from all threads, not just the one that got the signal?
if {$system_name eq {linux}} {
    set threads_mngr_supported 1
}
# look for the DEBUG command in the backtrace, used when we triggered
# a stack trace print while we know redis is running that command.
proc check_log_backtrace_for_debug {log_pattern} {
    # search for the final line of the stack trace generation to make sure it completed.
    set pattern "* STACK TRACE DONE *"
    set res [wait_for_log_messages 0 \"$pattern\" 0 100 100]
    set res [wait_for_log_messages 0 \"$log_pattern\" 0 100 100]
    if {$::verbose} { puts $res}
    # If the stacktrace is printed more than once, it means redis crashed during crash report generation
assert_equal [count_log_message 0 "STACK TRACE -"] 1
upvar threads_mngr_supported threads_mngr_supported
# the following checks are only done if we support printing stack trace from all threads
if {$threads_mngr_supported} {
assert_equal [count_log_message 0 "setupStacktracePipe failed"] 0
assert_equal [count_log_message 0 "failed to open /proc/"] 0
assert_equal [count_log_message 0 "failed to find SigBlk or/and SigIgn"] 0
# the following are skipped since valgrind is slow and a timeout can happen
if {!$::valgrind} {
assert_equal [count_log_message 0 "wait_threads(): waiting threads timed out"] 0
# make sure redis prints stack trace for all threads. we know 3 threads are idle in bio.c
assert_equal [count_log_message 0 "bioProcessBackgroundJobs"] 3
}
}
set pattern "*debugCommand*"
set res [wait_for_log_messages 0 \"$pattern\" 0 100 100]
if {$::verbose} { puts $res}
}
# used when backtrace_supported == 0
proc check_crash_log {log_pattern} {
    set res [wait_for_log_messages 0 \"$log_pattern\" 0 50 100]
    if {$::verbose} { puts $res }
}
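# The crash tests below pick one of these two helpers depending on backtrace
# support, store it in check_cb, and invoke it with the log pattern they expect,
# e.g. $check_cb "*crashed by signal*".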
# test the watchdog and the stack trace report from multiple threads
if {$backtrace_supported} {
    set server_path [tmpdir server.log]
    start_server [list overrides [list dir $server_path]] {
        test "Server is able to generate a stack trace on selected systems" {
            r config set watchdog-period 200
            r debug sleep 1
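            # watchdog-period is in milliseconds, so the 1 second DEBUG SLEEP on the
            # main thread exceeds the 200ms period and the watchdog prints a stack
            # trace ("WATCHDOG TIMER EXPIRED") to the log.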
            check_log_backtrace_for_debug "*WATCHDOG TIMER EXPIRED*"
            # make sure redis is still alive
            assert_equal "PONG" [r ping]
        }
    }
}
# Valgrind will complain that the process was terminated by a signal, so skip it.
if {!$::valgrind} {
    if {$backtrace_supported} {
        set check_cb check_log_backtrace_for_debug
    } else {
        set check_cb check_crash_log
    }
    # test being killed by a SIGABRT from outside
    set server_path [tmpdir server1.log]
    start_server [list overrides [list dir $server_path crash-memcheck-enabled no]] {
        test "Crash report generated on SIGABRT" {
            set pid [s process_id]
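            # Use a deferred client so DEBUG SLEEP is sent without waiting for a
            # reply; the server is still blocked inside the command when we signal
            # it, so the debug command should show up in the captured stack trace.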
            r deferred 1
            r debug sleep 10 ;# so that we see the function in the stack trace
            r flush
            after 100 ;# wait for redis to get into the sleep
            exec kill -SIGABRT $pid
            $check_cb "*crashed by signal*"
        }
    }
    # test DEBUG SEGFAULT
    set server_path [tmpdir server2.log]
    start_server [list overrides [list dir $server_path crash-memcheck-enabled no]] {
        test "Crash report generated on DEBUG SEGFAULT" {
            catch {r debug segfault}
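            # DEBUG SEGFAULT crashes the server on purpose, so the command never
            # returns normally; catch swallows the resulting connection error.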
            $check_cb "*crashed by signal*"
        }
    }
    # test DEBUG SIGALRM being non-fatal
    set server_path [tmpdir server3.log]
    start_server [list overrides [list dir $server_path]] {
        test "Stacktraces generated on SIGALRM" {
            set pid [s process_id]
            r deferred 1
            r debug sleep 10 ;# so that we see the function in the stack trace
            r flush
            after 100 ;# wait for redis to get into the sleep
            exec kill -SIGALRM $pid
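            # Unlike SIGABRT above, SIGALRM is handled as a non-fatal signal: the
            # server logs the stack traces and keeps running, so the reply to the
            # deferred DEBUG SLEEP can still be read below.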
            $check_cb "*Received SIGALRM*"
            r read
            r deferred 0
            # make sure redis is still alive
            assert_equal "PONG" [r ping]
        }
    }
}
# test DEBUG ASSERT
if {$backtrace_supported} {
    set server_path [tmpdir server4.log]
    # Use exit() instead of abort() upon assertion so Valgrind tests won't fail.
    start_server [list overrides [list dir $server_path use-exit-on-panic yes crash-memcheck-enabled no]] {
        test "Generate stacktrace on assertion" {
            catch {r debug assert}
            check_log_backtrace_for_debug "*ASSERTION FAILED*"
        }
    }
}
}