redict/tests/integration/logging.tcl
meiravgri 4ba9e18ef0
fix crash in crash-report and other improvements (#12623)
## Crash fix
### Current behavior
We might crash if we fail to collect some of the threads' output, for example when collection exceeds the timeout.

The threads manager API guarantees that the output array length will be `tids_len`; however, some
entries can be NULL when it fails to collect the corresponding threads' outputs.

When we use the threads manager to collect the threads' stacktraces, we rely on this guarantee and skip
NULL entries. However, since the output array was allocated with malloc, unfilled entries contained
garbage rather than NULL, so we got a segmentation fault when dereferencing them
(in debug.c:writeStacktraces()).

### Fix
Allocate the global output array with zcalloc, so that unfilled entries are NULL.
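A minimal standalone sketch of the failure mode and the fix, using plain malloc/calloc instead of the actual zmalloc/zcalloc wrappers and hypothetical variable names: zero-initializing the array makes the NULL-skip in the consumer safe even when a thread never fills its slot.
```
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t tids_len = 4;
    /* calloc() zero-initializes every slot, so entries no thread ever filled
     * read back as NULL; with malloc() they would hold garbage pointers. */
    char **output = calloc(tids_len, sizeof(char *));

    output[0] = "stacktrace of thread 0";     /* pretend only one thread reported */

    for (size_t i = 0; i < tids_len; i++) {
        if (output[i] == NULL) continue;      /* safe skip of missing outputs */
        printf("thread %zu: %s\n", i, output[i]);
    }
    free(output);
    return 0;
}
```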

### To reproduce the bug, you'll have to change the code:
**In threadsmngr:ThreadsManager_runOnThreads():**
make sure the `g_output_array` allocation is initialized with garbage rather than zeros
(add `memset(g_output_array, 2, sizeof(void*) * tids_len);` right below the allocation).

Make one of the threads skip writing its entry to the output array:
add a global var: `static redisAtomic size_t return_now = 0;`
and add the following to `invoke_callback()` before it writes to the output array:
```
    size_t i_return;
    atomicGetIncr(return_now, i_return, 1);
    if(i_return == 1) return;
```
Compile, start the server with `--enable-debug-command local`, and run `redis-cli debug assert`.
The assertion triggers the stacktrace collection.
Expect to see the stack trace printed twice: since the segmentation fault happens after we return from
the threads manager, the collection can safely be triggered again.

## Added a global-variables r/w lock in ThreadsManager
To avoid a situation where the main thread runs `ThreadsManager_cleanups` while threads are still
invoking the signal handler, we use an r/w lock.
For cleanups, we acquire the write lock.
The threads acquire the read lock, which still allows them to write to the output array simultaneously.
If a thread fails to acquire the read lock, it means cleanups are in progress and it returns immediately.
After acquiring the lock, a thread can safely check that the global output array wasn't nullified and
proceed to write to it.
This way we ensure the threads are not modifying the global variables or writing to the output array
after they were zeroed, nullified, or destroyed (in the case of the semaphore).
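A hedged sketch of this locking scheme with pthreads, using illustrative names (`g_lock`, `g_output_array`, `thread_report`, `cleanups`) rather than the actual ThreadsManager code:
```
#include <pthread.h>
#include <stdlib.h>

static pthread_rwlock_t g_lock = PTHREAD_RWLOCK_INITIALIZER;
static void **g_output_array = NULL;   /* assumed to be allocated elsewhere */

/* Called from each signal handler: readers don't exclude each other, so all
 * threads can hold the read lock and fill their own slots concurrently. */
void thread_report(size_t index, void *stacktrace) {
    if (pthread_rwlock_tryrdlock(&g_lock) != 0)
        return;                        /* cleanups hold the write lock, bail out */
    if (g_output_array != NULL)        /* globals not nullified yet, safe to write */
        g_output_array[index] = stacktrace;
    pthread_rwlock_unlock(&g_lock);
}

/* Called from the main thread: the write lock waits for all readers, so no
 * thread can touch the globals while they are freed and nullified. */
void cleanups(void) {
    pthread_rwlock_wrlock(&g_lock);
    free(g_output_array);
    g_output_array = NULL;
    pthread_rwlock_unlock(&g_lock);
}
```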

## Other minor logging changes
1. Removed logging when the semaphore times out, because the threads can still write to the output array
  after this check. Instead, we print the total number of printed stacktraces compared to the expected
  number (`tids_len`).
2. Used the noinline attribute to make sure the number of skipped upper stack trace entries stays correct
  (see the sketch after this list).
3. Improved testing.
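Illustrative only (hypothetical names, not the actual debug.c code): a function marked noinline keeps its own stack frame, so a hardcoded count of top frames to skip in the backtrace output remains valid regardless of compiler inlining decisions.
```
#include <execinfo.h>
#include <stdio.h>

/* Without noinline the compiler may fold this helper into its caller,
 * changing how many top frames belong to the collection machinery and
 * breaking a fixed "skip N frames" offset. */
__attribute__((noinline)) static void collect_and_print(void) {
    void *frames[32];
    int n = backtrace(frames, 32);
    /* Skip frame 0 (this function itself); the real code skips a fixed
     * number of internal frames in the same way. */
    backtrace_symbols_fd(frames + 1, n - 1, fileno(stdout));
}

int main(void) {
    collect_and_print();
    return 0;
}
```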

Co-authored-by: Oran Agra <oran@redislabs.com>
2023-10-02 20:02:02 +03:00


tags {"external:skip"} {
set system_name [string tolower [exec uname -s]]
set backtrace_supported 0
set threads_mngr_supported 0
# We only support darwin or Linux with glibc
if {$system_name eq {darwin}} {
set backtrace_supported 1
} elseif {$system_name eq {linux}} {
set threads_mngr_supported 1
# Avoid the test on libmusl, which does not support backtrace
# and on static binaries (ldd exit code 1) where we can't detect libmusl
catch {
set ldd [exec ldd src/redis-server]
if {![string match {*libc.*musl*} $ldd]} {
set backtrace_supported 1
}
}
}
proc check_log_backtrace_for_debug {log_pattern} {
set res [wait_for_log_messages 0 \"$log_pattern\" 0 100 100]
if {$::verbose} { puts $res}
# If the stacktrace is printed more than once, it means redis crashed during crash report generation
assert_equal [count_log_message 0 "STACK TRACE"] 1
set pattern "*debugCommand*"
set res [wait_for_log_messages 0 \"$pattern\" 0 100 100]
if {$::verbose} { puts $res}
}
# used when backtrace_supported == 0
proc check_crash_log {log_pattern} {
set res [wait_for_log_messages 0 \"$log_pattern\" 0 50 100]
if {$::verbose} { puts $res }
}
# test the watchdog and the stack trace report from multiple threads
if {$backtrace_supported} {
set server_path [tmpdir server.log]
start_server [list overrides [list dir $server_path]] {
test "Server is able to generate a stack trace on selected systems" {
r config set watchdog-period 200
r debug sleep 1
if {$threads_mngr_supported} {
assert_equal [count_log_message 0 "failed to open /proc/"] 0
assert_equal [count_log_message 0 "failed to find SigBlk or/and SigIgn"] 0
assert_equal [count_log_message 0 "threads_mngr: waiting for threads' output was interrupted by signal"] 0
assert_equal [count_log_message 0 "threads_mngr: waiting for threads' output timed out"] 0
assert_equal [count_log_message 0 "bioProcessBackgroundJobs"] 3
}
check_log_backtrace_for_debug "*WATCHDOG TIMER EXPIRED*"
# make sure redis is still alive
assert_equal "PONG" [r ping]
}
}
}
# Valgrind will complain that the process terminated by a signal, skip it.
if {!$::valgrind} {
if {$backtrace_supported} {
set check_cb check_log_backtrace_for_debug
} else {
set check_cb check_crash_log
}
# test being killed by a SIGABRT from outside
set server_path [tmpdir server1.log]
start_server [list overrides [list dir $server_path crash-memcheck-enabled no]] {
test "Crash report generated on SIGABRT" {
set pid [s process_id]
r deferred 1
r debug sleep 10 ;# so that we see the function in the stack trace
r flush
after 100 ;# wait for redis to get into the sleep
exec kill -SIGABRT $pid
$check_cb "*crashed by signal*"
}
}
# test DEBUG SEGFAULT
set server_path [tmpdir server2.log]
start_server [list overrides [list dir $server_path crash-memcheck-enabled no]] {
test "Crash report generated on DEBUG SEGFAULT" {
catch {r debug segfault}
$check_cb "*crashed by signal*"
}
}
# test DEBUG SIGALRM being non-fatal
set server_path [tmpdir server3.log]
start_server [list overrides [list dir $server_path]] {
test "Stacktraces generated on SIGALRM" {
set pid [s process_id]
r deferred 1
r debug sleep 10 ;# so that we see the function in the stack trace
r flush
after 100 ;# wait for redis to get into the sleep
exec kill -SIGALRM $pid
$check_cb "*Received SIGALRM*"
r read
r deferred 0
# make sure redis is still alive
assert_equal "PONG" [r ping]
}
}
}
# test DEBUG ASSERT
if {$backtrace_supported} {
set server_path [tmpdir server4.log]
# Use exit() instead of abort() upon assertion so Valgrind tests won't fail.
start_server [list overrides [list dir $server_path use-exit-on-panic yes crash-memcheck-enabled no]] {
test "Generate stacktrace on assertion" {
catch {r debug assert}
check_log_backtrace_for_debug "*ASSERTION FAILED*"
}
}
}
}