redict/tests/unit
meiravgri 2e854bccc6
Fix async safety in signal handlers (#12658)
See the discussion that followed the merge of
https://github.com/redis/redis/pull/12453.
----
This PR replaces calls that are not considered async-signal-safe
(AS-safe) with safe alternatives.

#### **1. serverLog() and serverLogFromHandler()**
`serverLog` uses unsafe calls. It was decided that signal handlers will
**avoid** calling `serverLog` when:
* The signal is not fatal, such as SIGALRM. In these cases, we prefer
`serverLogFromHandler`, which is the safe version of `serverLog`.
Note that the two produce different line prefixes:
`serverLog`: `62220:M 26 Oct 2023 14:39:04.526 # <msg>`
`serverLogFromHandler`: `62220:signal-handler (1698331136) <msg>`
* The code was added recently. Calls to `serverLog` from signal
handlers have existed for as long as Redis has, and they have not caused
problems so far. To avoid regressions, new code should use
`serverLogFromHandler` from now on.
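
A minimal sketch of how a handler-safe log line in the `serverLogFromHandler` format could be produced without `snprintf` or any allocation. The helper names (`append_ulong`, `append_str`, `format_handler_line`, `log_from_handler`) are hypothetical and are not taken from the Redis source; only `getpid()` and `write()` are called, both of which POSIX lists as async-signal-safe.

```c
#include <stddef.h>
#include <unistd.h>

/* Append the decimal form of v to buf at pos; returns the new position.
 * Plain arithmetic only, no libc formatting. One byte is kept for the NUL. */
static size_t append_ulong(char *buf, size_t cap, size_t pos, unsigned long v) {
    char tmp[24];
    size_t n = 0;
    do { tmp[n++] = (char)('0' + v % 10); v /= 10; } while (v != 0);
    while (n > 0 && pos + 1 < cap) buf[pos++] = tmp[--n];
    return pos;
}

/* Append a NUL-terminated string, truncating if the buffer is full. */
static size_t append_str(char *buf, size_t cap, size_t pos, const char *s) {
    while (*s != '\0' && pos + 1 < cap) buf[pos++] = *s++;
    return pos;
}

/* Hypothetical helper: builds "<pid>:signal-handler (<ts>) <msg>\n" in buf
 * and returns its length (excluding the terminating NUL). */
size_t format_handler_line(char *buf, size_t cap, unsigned long pid,
                           unsigned long ts, const char *msg) {
    size_t pos = 0;
    if (cap == 0) return 0;
    pos = append_ulong(buf, cap, pos, pid);
    pos = append_str(buf, cap, pos, ":signal-handler (");
    pos = append_ulong(buf, cap, pos, ts);
    pos = append_str(buf, cap, pos, ") ");
    pos = append_str(buf, cap, pos, msg);
    pos = append_str(buf, cap, pos, "\n");
    buf[pos] = '\0';
    return pos;
}

/* Emit the line with write(2) from a stack buffer. */
void log_from_handler(int fd, const char *msg, unsigned long ts) {
    char line[256];
    size_t len = format_handler_line(line, sizeof line,
                                     (unsigned long)getpid(), ts, msg);
    ssize_t rc = write(fd, line, len);
    (void)rc; /* nothing useful can be done about a failed write here */
}
```

The timestamp is passed in by the caller to keep the sketch small; how the real function obtains it is not shown here.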

#### **2. `snprintf` `fgets` and `strtoul`(base = 16) -------->
`_safe_snprintf`, `fgets_async_signal_safe`, `string_to_hex`**
The safe version of `snprintf` was taken from
[here](8cfc4ca5e7/src/mc_util.c (L754))
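
As an illustration of the `strtoul` (base 16) replacement, the following is a sketch of a hex parser that uses only character arithmetic. The name `parse_hex` and its return convention are assumptions made for this example; the actual `string_to_hex` in the PR may differ, and overflow handling is deliberately left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for strtoul(s, NULL, 16): parses leading hex digits
 * (optionally prefixed with "0x") and stores the value in *out.
 * Returns the number of hex digits consumed, 0 if there were none.
 * Overflow is not detected in this sketch. */
size_t parse_hex(const char *s, unsigned long *out) {
    unsigned long v = 0;
    size_t digits = 0;
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) s += 2;
    for (;; s++) {
        unsigned d;
        if (*s >= '0' && *s <= '9') d = (unsigned)(*s - '0');
        else if (*s >= 'a' && *s <= 'f') d = (unsigned)(*s - 'a' + 10);
        else if (*s >= 'A' && *s <= 'F') d = (unsigned)(*s - 'A' + 10);
        else break; /* first non-hex character ends the number */
        v = (v << 4) | d;
        digits++;
    }
    *out = v;
    return digits;
}
```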

#### **3. fopen(), fgets(), fclose() --------> open(), read(), close()**
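
A sketch of what reading a line through a raw file descriptor can look like once `fopen()`/`fgets()` are off the table. `read_line_fd` is a hypothetical name, and reading one byte per `read()` call is chosen here for simplicity, not because the PR is known to do it this way.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical line reader: reads one byte at a time from fd until a newline,
 * end of input, or a full buffer. Returns the number of bytes stored
 * (excluding the terminating NUL), or -1 on a read error. */
long read_line_fd(int fd, char *buf, size_t cap) {
    size_t pos = 0;
    if (cap == 0) return -1;
    while (pos + 1 < cap) {
        char c;
        ssize_t n = read(fd, &c, 1);
        if (n < 0) return -1; /* EINTR handling omitted for brevity */
        if (n == 0) break;    /* end of input */
        buf[pos++] = c;
        if (c == '\n') break; /* keep the newline, like fgets() does */
    }
    buf[pos] = '\0';
    return (long)pos;
}
```

The descriptor would come from `open()` and be released with `close()`; no `FILE` buffer is allocated along the way.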

#### **4. opendir(), readdir(), closedir() --------> open(),
syscall(SYS_getdents64), close()**
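
A Linux-only sketch of walking a directory with `syscall(SYS_getdents64)` in place of `opendir()`/`readdir()`. The record layout follows the getdents(2) man page; `count_dir_entries` is a hypothetical helper written for this example and only counts entries.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() and O_DIRECTORY on glibc */
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Layout of the records returned by getdents64 (see getdents(2)). */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen; /* size of this record in bytes */
    unsigned char  d_type;
    char           d_name[];
};

/* Hypothetical helper: counts directory entries (including "." and "..")
 * using a stack buffer only. Returns -1 on error. */
long count_dir_entries(const char *path) {
    uint64_t storage[512];        /* 4 KiB, aligned for the record struct */
    char *buf = (char *)storage;
    long count = 0;
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0) return -1;
    for (;;) {
        long n = syscall(SYS_getdents64, fd, buf, sizeof storage);
        if (n < 0) { close(fd); return -1; }
        if (n == 0) break;        /* end of directory */
        for (long off = 0; off < n; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);
            count++;
            off += d->d_reclen;   /* records are variable length */
        }
    }
    close(fd);
    return count;
}
```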

#### **5. Threads_mngr sync mechanisms**
* Waiting for the threads to generate their stack traces: semaphore -------->
busy-wait
* `globals_rw_lock` was removed: since we no longer use malloc or the
semaphore, we don't need to protect `ThreadsManager_cleanups`.

#### **6. Stacktraces buffer**
The initial problem was that we were not able to safely call malloc
within the signal handler.
To solve that we created a buffer on the stack of `writeStacktraces` and
saved it in a global pointer, assuming that under normal circumstances,
the function `writeStacktraces` would complete before any thread
attempted to write to it. However, **if threads lag behind, they might
access this global pointer after it no longer belongs to the
`writeStacktraces` stack, potentially corrupting memory.**
To address this, various solutions were discussed
[here](https://github.com/redis/redis/pull/12658#discussion_r1390442896).
Eventually, we decided to **create a pipe** at server startup that
remains valid for as long as the process is alive.
We chose this solution because of its minimal memory usage, and because
`write()` and `read()` on a pipe are atomic (for writes, POSIX guarantees
this up to `PIPE_BUF` bytes), which ensures that stack traces from
different threads won't mix.

**The stacktraces collection process is now as follows:**
* Clean the pipe to eliminate writes from late threads of previous
runs.
* Each thread writes its stacktrace to the pipe.
* Wait for all the threads to mark completion, or until a timeout (2
sec) is reached.
* Read from the pipe to print the stacktraces.
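
The pipe-related steps above can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and making both ends non-blocking is a choice made for this example so that draining an empty pipe returns immediately.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical: create the pipe once at startup. Both ends are made
 * non-blocking so neither draining nor a late writer can stall the process. */
int trace_pipe_init(int fds[2]) {
    if (pipe(fds) != 0) return -1;
    for (int i = 0; i < 2; i++) {
        int flags = fcntl(fds[i], F_GETFL);
        if (flags < 0 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    return 0;
}

/* Step 1: discard whatever late threads from a previous run left behind. */
void trace_pipe_drain(int rfd) {
    char scratch[256];
    while (read(rfd, scratch, sizeof scratch) > 0) { }
}

/* Final step: read up to cap bytes; returns 0 if the pipe is empty. */
long trace_pipe_read(int rfd, char *buf, size_t cap) {
    ssize_t n = read(rfd, buf, cap);
    return n < 0 ? 0 : (long)n;
}
```

Between the drain and the read, each thread would `write()` its trace to the write end and the collector would wait for completion or the timeout.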

#### **7. Changes that were considered and eventually were dropped**
* Replace the watchdog timer with a POSIX timer:
according to the [setitimer man page](https://linux.die.net/man/2/setitimer)

> POSIX.1-2008 marks getitimer() and setitimer() obsolete, recommending
the use of the POSIX timers API
([timer_gettime](https://linux.die.net/man/2/timer_gettime)(2),
[timer_settime](https://linux.die.net/man/2/timer_settime)(2), etc.)
instead.

However, although it is supposed to conform to the POSIX standard, the
POSIX timers API is not supported on Mac.
You can take a look at the Linux implementation
[here](c7562ee135).
To avoid complicating the code, and given the uncertainty regarding
compatibility, it was decided to drop it for now.

* Avoid using sds (which uses malloc) in `logConfigDebugInfo`:
printing the config info directly instead of using sds was considered.
However, `logConfigDebugInfo` apparently does more than just print the
sds, so it was decided that this fix is outside the scope of this issue.

#### **8. Fix signal mask check**
The check `signum & sig_mask`, intended to indicate whether the signal is
blocked by the thread, was incorrect: it is the bit position in the
signal mask that corresponds to the signal number. We fixed this by
changing the condition to: `sig_mask & (1L << (sig_num - 1))`
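
A small sketch contrasting the two conditions, using plain numbers so it does not depend on platform signal constants. The function names are hypothetical, and an unsigned mask with `1UL` is used here in place of the `1L` in the text to keep the shift well defined for high bits.

```c
/* sig_mask is a bitmask in which bit (n - 1) represents signal number n. */

/* Old check: ANDs the mask with the signal number itself (incorrect). */
int is_blocked_old(unsigned long sig_mask, int sig_num) {
    return (sig_mask & (unsigned long)sig_num) != 0;
}

/* Fixed check: tests the bit at position (sig_num - 1). */
int is_blocked_fixed(unsigned long sig_mask, int sig_num) {
    return (sig_mask & (1UL << (sig_num - 1))) != 0;
}
```

With only signal 10 blocked, the mask is `1UL << 9` (512): the old check computes `512 & 10 == 0` and misses it. With only signal 2 blocked, the mask is `2`: the old check computes `2 & 10 != 0` and wrongly reports signal 10 as blocked.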

#### **9. Unrelated changes**
Both `fork.tcl` and `util.tcl` implemented a function called
`count_log_message`, each expecting different parameters. This caused
confusion when trying to run the daily tests with additional test
parameters to run a specific test.
The `count_log_message` in `fork.tcl` was removed, and its calls were
replaced with calls to the `count_log_message` located in `util.tcl`.

---------

Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2023-11-23 13:22:20 +02:00
..
cluster Replace cluster metadata with slot specific dictionaries (#11695) 2023-10-14 23:58:26 -07:00
moduleapi Fix async safety in signal handlers (#12658) 2023-11-23 13:22:20 +02:00
type Bump codespell from 2.2.4 to 2.2.5 (#12557) 2023-09-08 16:10:17 +03:00
acl-v2.tcl fix false success and a memory leak for ACL selector with bad parenthesis combination (#12452) 2023-08-02 10:46:06 +03:00
acl.tcl Fixed a bug where sequential matching ACL rules weren't compressed (#12472) 2023-08-10 09:58:53 +03:00
aofrw.tcl Attempt to solve MacOS CI issues in GH Actions (#12013) 2023-04-12 09:19:21 +03:00
auth.tcl Fix race condition in tests/unit/auth.tcl (#12444) 2023-08-01 18:03:33 +03:00
bitfield.tcl Add BITFIELD_RO basic tests for non-repl use cases (#12187) 2023-05-18 12:16:46 +03:00
bitops.tcl BITCOUNT and BITPOS with non-existing key and illegal arguments should return error, not 0 (#11734) 2023-08-21 19:48:30 +03:00
client-eviction.tcl Attempt to solve MacOS CI issues in GH Actions (#12013) 2023-04-12 09:19:21 +03:00
dump.tcl Fix SPOP/RESTORE propagation when doing lazy free (#12320) 2023-06-16 08:14:11 -07:00
expire.tcl Fix DB iterator not resetting pauserehash causing dict being unable to rehash (#12757) 2023-11-14 14:28:46 +02:00
functions.tcl Tests: Add missing key declaration in scripts (#11134) 2022-08-16 22:04:22 +03:00
geo.tcl adding geo command edge cases tests (#12274) 2023-06-20 12:50:03 +03:00
hyperloglog.tcl Hyperloglog avoid allocate more than 'server.hll_sparse_max_bytes' bytes of memory for sparse representation (#11438) 2022-11-28 17:35:31 +02:00
info-command.tcl Make INFO command variadic (#6891) 2022-02-08 13:14:42 +02:00
info.tcl Use cross-platform-actions for FreeBSD support. (#12732) 2023-11-06 18:07:14 +02:00
introspection-2.tcl Fix possible crash in command getkeys (#12380) 2023-07-03 12:45:18 +03:00
introspection.tcl Added tests for Client commands (#10276) 2023-08-20 19:17:51 +03:00
keyspace.tcl Adding missing SWAPDB related test cases. (#12769) 2023-11-19 12:44:48 +02:00
latency-monitor.tcl Add printing for LATENCY related tests (#12514) 2023-08-27 11:42:55 +03:00
lazyfree.tcl attempt to fix tracking test issue with external tests due to lazy free (#9722) 2021-11-02 16:42:53 +02:00
limits.tcl Improve test suite to handle external servers better. (#9033) 2021-06-09 15:13:24 +03:00
maxmemory.tcl postpone the initialization of oject's lru&lfu until it is added to the db as a value object (#11626) 2023-05-24 09:40:11 +03:00
memefficiency.tcl re-enable defrag tests in cluster mode. (#12710) 2023-11-02 13:55:48 +02:00
multi.tcl multi.tcl: reset readraw at the end of the test (#12123) 2023-05-04 11:58:31 +03:00
networking.tcl Add reply_schema to command json files (internal for now) (#10273) 2023-03-11 10:14:16 +02:00
obuf-limits.tcl Add reply_schema to command json files (internal for now) (#10273) 2023-03-11 10:14:16 +02:00
oom-score-adj.tcl Return 0 when config set out-of-range oom-score-adj-values (#10601) 2022-04-19 11:31:15 +03:00
other.tcl Support TLS service when "tls-cluster" is not enabled and persist both plain and TLS port in nodes.conf (#12233) 2023-06-26 07:43:38 -07:00
pause.tcl Bump codespell to 2.2.4, fix typos and outupdated comments (#11911) 2023-03-16 08:50:32 +02:00
printver.tcl Print version info before running the test 2011-05-20 11:44:54 +02:00
protocol.tcl Add reply_schema to command json files (internal for now) (#10273) 2023-03-11 10:14:16 +02:00
pubsub.tcl Fix broken protocol when PUBLISH emits local push inside MULTI (#12326) 2023-06-20 20:41:41 +03:00
pubsubshard.tcl Fix the bug that CLIENT REPLY OFF|SKIP cannot receive push notifications (#11875) 2023-03-12 17:50:44 +02:00
querybuf.tcl Pause cron to prevent premature shrinking in querybuf test (#12126) 2023-05-04 13:02:08 +03:00
quit.tcl flushSlavesOutputBuffers should not write to replicas scheduled to drop (#12242) 2023-06-12 14:05:34 +03:00
replybufsize.tcl Introduce debug command to disable reply buffer resizing (#10360) 2022-03-01 14:40:29 +02:00
scan.tcl Optimize SCAN with MATCH when pattern implies cluster slot (#12536) 2023-11-01 00:06:49 -07:00
scripting.tcl support XREAD[GROUP] with BLOCK option in scripts (#12596) 2023-10-12 10:54:50 +03:00
shutdown.tcl Tests: Do not save an RDB by default and add a SIGTERM default AOFRW test (#12064) 2023-04-18 16:14:26 +03:00
slowlog.tcl minor optimization for slowlog get (#12103) 2023-04-25 10:17:21 +03:00
sort.tcl Update sort_ro reply_schema to mention the null reply (#12534) 2023-08-31 06:36:35 +03:00
tls.tcl Add support for reading encrypted keyfiles. (#8644) 2021-03-22 13:27:46 +02:00
tracking.tcl Bump codespell from 2.2.4 to 2.2.5 (#12557) 2023-09-08 16:10:17 +03:00
violations.tcl Run large-memory tests as solo. (#10626) 2022-04-24 17:29:35 +03:00
wait.tcl WAITAOF: Update fsynced_reploff_pending even if there's nothing to fsync (#12622) 2023-09-28 17:19:20 +03:00