Commit Graph

12 Commits

Author SHA1 Message Date
meiravgri
4ba9e18ef0
fix crash in crash-report and other improvements (#12623)
## Crash fix
### Current behavior
We might crash if we fail to collect some of the threads' output. If it exceeds timeout for example.

The threads mngr API guarantees that the output array length will be `tids_len`, however, some
indices can be NULL, in case it fails to collect some of the threads' outputs.

When we use the threads mngr to collect the threads' stacktraces, we rely on this and skip NULL
entries. Since the output array was allocated with malloc, instead of NULL, it contained garbage,
so we got a segmentation fault when trying to read this garbage. (in debug.c:writeStacktraces() )

### fix
Allocate the global output array with zcalloc.

### To reproduce the bug, you'll have to change the code:
**in threadsmngr:ThreadsManager_runOnThreads():**
make sure the g_output_array allocation is initialized with garbage and not 0s 
(add `memset(g_output_array, 2, sizeof(void*) * tids_len);` below the allocation).

Force one of the threads to write to the array:
add a global var: `static redisAtomic size_t return_now = 0;` 
add to `invoke_callback()` before writing to the output array:
```
    size_t i_return;
    atomicGetIncr(return_now, i_return, 1);
    if(i_return == 1) return;
```
compile, start the server with `--enable-debug-command local` and run `redis-cli debug assert`
The assertion triggers the the stacktrace collection. 
Expect to get 2 prints of the stack trace - since we get the segmentation fault after we return from
the threads mngr, it can be safely triggered again.

## Added global variables r/w lock in ThreadsManager
To avoid a situation where the main thread runs `ThreadsManager_cleanups` while threads are still
invoking the signal handler, we use a r/w lock.
For cleanups, we will acquire the write lock.
The threads will acquire the read lock to enable them to write simultaneously.
If we fail to acquire the read lock, it means cleanups are in progress and we return immediately.
After acquiring the lock we can safely check that the global output array wasn't nullified and proceed
to write to it.
This way we ensure the threads are not modifying the global variables/ trying to write to the output
array after they were zeroed/nullified/destroyed(the semaphore).

## other minor logging change
1. removed logging if the semaphore times out because the threads can still write to the output array
  after this check. Instead, we print the total number of printed stacktraces compared to the exacted
  number (len_tids).
2. use noinline attribute to make sure the uplevel number of ignored stack trace entries stays correct.
3. improve testing

Co-authored-by: Oran Agra <oran@redislabs.com>
2023-10-02 20:02:02 +03:00
meiravgri
cc2be63997
Print stack trace from all threads in crash report (#12453)
In this PR we are adding the functionality to collect all the process's threads' backtraces.

## Changes made in this PR

### **introduce threads mngr API**
The **threads mngr API** which has 2 abilities:
* `ThreadsManager_init() `- register to SIGUSR2. called on the server start-up.
* ` ThreadsManager_runOnThreads()` - receives a list of a pid_t and a callback, tells every
  thread in the list to invoke the callback, and returns the output collected by each invocation.
**Elaborating atomicvar API**
* `atomicIncrGet(var,newvalue_var,count) `-- Increment and get the atomic counter new value
* `atomicFlagGetSet` -- Get and set the atomic counter value to 1

### **Always set SIGALRM handler**
SIGALRM handler prints the process's stacktrace to the log file. Up until now, it was set only if the
`server.watchdog_period` > 0. This can be also useful if debugging is needed. However, in situations
where the server can't get requests, (a deadlock, for example) we weren't able to change the signal handler.
To make it available at run time we set SIGALRM handler on server startup. The signal handler name was
changed to a more general `sigalrmSignalHandler`.

### **Print all the process' threads' stacktraces**

`logStackTrace()` now calls `writeStacktraces()`, instead of logging the current thread stacktrace.
`writeStacktraces()`:
* On Linux systems we use the threads manager API to collect the backtraces of all the process' threads.
  To get the `tids` list (threads ids) we read the `/proc/<redis-server-pid>/tasks` file which includes a list of directories.
  Each directory name corresponds to one tid (including the main thread). For each thread, we also need to check if it
  can get the signal from the threads manager (meaning it is not blocking/ignoring that signal). We send the threads
  manager this tids list and `collect_stacktrace_data()` callback, which collects the thread's backtrace addresses,
  its name, and tid.
* On other systems, the behavior remained as it was (writing only the current thread stacktrace to the log file).

## compatibility notes
1. **The threads mngr API is only supported in linux.** 
2. glibc earlier than 2.3 We use `syscall(SYS_gettid)` and `syscall(SYS_tgkill...)` because their dedicated
  alternatives (`gettid()` and `tgkill`) were added in glibc 2.3.

## Output example

Each thread backtrace will have the following format:
`<tid> <thread_name> [additional_info]`
* **tid**: as read from the `/proc/<redis-server-pid>/tasks` file
* **thread_name**: the tread name as it is registered in the os/
* **additional_info**: Sometimes we want to add specific information about one of the threads. currently.
  it is only used to mark the thread that handles the backtraces collection by adding "*".
  In case of crash - this also indicates which thread caused the crash. The handling thread in won't
  necessarily appear first.

```
------ STACK TRACE ------
EIP:
/lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc]

67089 redis-server *
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb9437790]
/lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc]
redis-server *:6379(+0x75e0c)[0xaaaac2fe5e0c]
redis-server *:6379(aeProcessEvents+0x18c)[0xaaaac2fe6c00]
redis-server *:6379(aeMain+0x24)[0xaaaac2fe7038]
redis-server *:6379(main+0xe0c)[0xaaaac3001afc]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xffffb91d73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffb91d74cc]
redis-server *:6379(_start+0x30)[0xaaaac2fe0370]

67093 bio_lazy_free
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]

67091 bio_close_file
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]

67092 bio_aof
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]
67089:signal-handler (1693824528) --------
```
2023-09-24 09:47:23 +03:00
Oran Agra
b56bb72033
Avoid ASAN test errors on crash report tests (#11605)
Clang Address Sanitizer tests started reporting unknown-crash on these
tests due to the memcheck, disable the memcheck to avoid that noise.
2022-12-11 18:20:42 +02:00
Christian Krieg
032619b82b
Fixing test to consider statically linked binaries (#10835)
The test calls `ldd` on `redis-server` in order to find out whether the binary
was linked against `libmusl`; However, `ldd` returns a value different from `0`
when statically linking the binaries agains libc-musl, because `redis-server` is
not a dynamic executable (as given by the exception thrown by the failing test),
and `make test` terminates with an error::

   $ ldd src/redis-server
       not a dynamic executable
   $ echo $?
   1

This commit fixes the test by ignoring such failures.

Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>
2022-06-09 12:59:33 +03:00
Petr Vaněk
f22bfe86b6
Update musl libc detection pattern (#10826)
This change fixes failing `integration/logging.tcl` test in Gentoo with
musl libc, where `ldd` returns
```
libc.so => /lib/ld-musl-x86_64.so.1 (0x7f9d5f171000)
```
unlike Alpine's
```
libc.musl-x86_64.so.1 => /lib/ld-musl-x86_64.so.1 (0x7f82cfa16000)
```
The solution is to extend matching pattern introduced in #8532.
2022-06-07 18:47:01 +03:00
Binbin
b947049f85
Fix timing issue in logging.tcl with FreeBSD (#9910)
A test failure was reported in Daily CI.
`Crash report generated on SIGABRT` with FreeBSD.

```
*** [err]: Crash report generated on SIGABRT in tests/integration/logging.tcl
Expected [string match *crashed by signal* ### Starting...(logs) in tests/integration/logging.tcl]
```

It look like `tail -1000` was executed too early, before it
printed out all the crash logs. We can give it a few more
chances by using `wait_for_log_messages`.

Other changes:
1. In `Server is able to generate a stack trace on selected systems`,
use `wait_for_log_messages`to reduce the lines of code. And if it
fails, there are more detailed logs that can be printed.

2. In `Crash report generated on DEBUG SEGFAULT`, we also use
`wait_for_log_messages` to avoid possible timing issues.
2021-12-07 12:02:58 +02:00
Ozan Tezcan
b91d8b289b
Add sanitizer support and clean up sanitizer findings (#9601)
- Added sanitizer support. `address`, `undefined` and `thread` sanitizers are available.  
- To build Redis with desired sanitizer : `make SANITIZER=undefined`
- There were some sanitizer findings, cleaned up codebase
- Added tests with address and undefined behavior sanitizers to daily CI.
- Added tests with address sanitizer to the per-PR CI (smoke out mem leaks sooner).

Basically, there are three types of issues : 

**1- Unaligned load/store** : Most probably, this issue may cause a crash on a platform that
does not support unaligned access. Redis does unaligned access only on supported platforms.

**2- Signed integer overflow.** Although, signed overflow issue can be problematic time to time
and change how compiler generates code, current findings mostly about signed shift or simple
addition overflow. For most platforms Redis can be compiled for, this wouldn't cause any issue
as far as I can tell (checked generated code on godbolt.org).

 **3 -Minor leak** (redis-cli), **use-after-free**(just before calling exit());

UB means nothing guaranteed and risky to reason about program behavior but I don't think any
of the fixes here worth backporting. As sanitizers are now part of the CI, preventing new issues
will be the real benefit.
2021-11-11 13:51:33 +02:00
Yossi Gottlieb
8a86bca5ed
Improve test suite to handle external servers better. (#9033)
This commit revives the improves the ability to run the test suite against
external servers, instead of launching and managing `redis-server` processes as
part of the test fixture.

This capability existed in the past, using the `--host` and `--port` options.
However, it was quite limited and mostly useful when running a specific tests.
Attempting to run larger chunks of the test suite experienced many issues:

* Many tests depend on being able to start and control `redis-server` themselves,
and there's no clear distinction between external server compatible and other
tests.
* Cluster mode is not supported (resulting with `CROSSSLOT` errors).

This PR cleans up many things and makes it possible to run the entire test suite
against an external server. It also provides more fine grained controls to
handle cases where the external server supports a subset of the Redis commands,
limited number of databases, cluster mode, etc.

The tests directory now contains a `README.md` file that describes how this
works.

This commit also includes additional cleanups and fixes:

* Tests can now be tagged.
* Tag-based selection is now unified across `start_server`, `tags` and `test`.
* More information is provided about skipped or ignored tests.
* Repeated patterns in tests have been extracted to common procedures, both at a
  global level and on a per-test file basis.
* Cleaned up some cases where test setup was based on a previous test executing
  (a major anti-pattern that repeats itself in many places).
* Cleaned up some cases where test teardown was not part of a test (in the
  future we should have dedicated teardown code that executes even when tests
  fail).
* Fixed some tests that were flaky running on external servers.
2021-06-09 15:13:24 +03:00
Yossi Gottlieb
95ea74549c
Fix failed tests on Linux Alpine and add a CI job. (#8532)
* Remove linux/version.h dependency.

This introduces unnecessary dependencies, and generally not a good idea
as the platform we build on may be different than the platform we run
on.

To determine if sync_file_range exists we can simply rely on header file
hints.

* Fix setproctitle() on libmusl.

The previous ifdef checks were a bit too strict for no apparent
reason.

* Fix tests failure on Linux with no backtrace.

* Add alpine daily CI job.
2021-02-23 12:57:45 +02:00
Meir Shpilraien (Spielrein)
762be79f0b
Disable new SIGABRT test on valgrind (#8013)
The crash reports cause false-positive warnings when run with valgrind.
2020-11-04 13:13:55 +02:00
Meir Shpilraien (Spielrein)
f210e197f3
Added crash report on SIGABRT (#8004)
The reason that we want to get a full crash report on SIGABRT
is that the jmalloc, when detecting a corruption, calls abort().
This will cause the Redis to exist silently without any report
and without any way to analyze what happened.
2020-11-03 14:59:21 +02:00
antirez
e1fce55237 Added regression test for issue #2371. 2015-02-10 14:40:27 +01:00