Fix#12785 and other race condition issues.
See the following isolated comments.
The following report was obtained using SANITIZER thread.
```sh
make SANITIZER=thread
./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate
```
1. Fixed thread-safe issue in RM_UnblockClient()
Related discussion:
https://github.com/redis/redis/pull/12817#issuecomment-1831181220
* When blocking a client in a module using `RM_BlockClientOnKeys()` or
`RM_BlockClientOnKeysWithFlags()`
with a timeout_callback, calling RM_UnblockClient() in module threads
can lead to race conditions
in `updateStatsOnUnblock()`.
- Introduced:
Version: 6.2
PR: #7491
- Touch:
`server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`,
and `server.latency_events`
- Harm Level: High
Potentially corrupts the memory data of `cmd->latency_histogram`,
`server.slowlog`, and `server.latency_events`
- Solution:
Differentiate whether the call to moduleBlockedClientTimedOut() comes
from the module or the main thread.
Since we can't know if RM_UnblockClient() comes from module threads, we
always assume it does and
let `updateStatsOnUnblock()` asynchronously update the unblock status.
* When error reply is called in timeout_callback(), ctx is not
thread-safe, eventually lead to race conditions in `afterErrorReply`.
- Introduced:
Version: 6.2
PR: #8217
- Touch
`server.stat_total_error_replies`, `server.errors`,
- Harm Level: High
Potentially corrupts the memory data of `server.errors`
- Solution:
Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`,
and asynchronously reply errors to the client.
2. Made RM_Reply*() family API thread-safe
Related discussion:
https://github.com/redis/redis/pull/12817#discussion_r1408707239
Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch
server.current_client
- Introduced:
Version: 7.2.0
PR: #12326
- Harm Level: None
Since the module fake client won't have the `CLIENT_PUSHING` flag, even
if we touch server.current_client,
we can still exit after `c->flags & CLIENT_PUSHING`.
- Solution
Checking `c->flags & CLIENT_PUSHING` earlier.
3. Made freeClient() thread-safe
Fix#12785
- Introduced:
Version: 4.0
Commit:
3fcf959e60
- Harm Level: Moderate
* Trigger assertion
It happens when the module thread calls freeClient while the io-thread
is in progress,
which just triggers an assertion, and doesn't make any race condiaions.
* Touch `server.current_client`, `server.stat_clients_type_memory`, and
`clientMemUsageBucket->clients`.
It happens between the main thread and the module threads, may cause
data corruption.
1. Error reset `server.current_client` to NULL, but theoretically this
won't happen,
because the module has already reset `server.current_client` to old
value before entering freeClient.
2. corrupts `clientMemUsageBucket->clients` in
updateClientMemUsageAndBucket().
3. Causes server.stat_clients_type_memory memory statistics to be
inaccurate.
- Solution:
* No longer counts memory usage on fake clients, to avoid updating
`server.stat_clients_type_memory` in freeClient.
* No longer resetting `server.current_client` in unlinkClient, because
the fake client won't be evicted or disconnected in the mid of the
process.
* Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is
not a fake client.
4. Fixed free client args without GIL
Related discussion:
https://github.com/redis/redis/pull/12817#discussion_r1408706695
When freeing retained strings in the module thread (refcount decr), or
using them in some way (refcount incr), we should do so while holding
the GIL,
otherwise, they might be simultaneously freed while the main thread is
processing the unblock client state.
- Introduced:
Version: 6.2.0
PR: #8141
- Harm Level: Low
Trigger assertion or double free or memory leak.
- Solution:
Documenting that module API users need to ensure any access to these
retained strings is done with the GIL locked
5. Fix adding fake client to server.clients_pending_write
It will incorrectly log the memory usage for the fake client.
Related discussion:
https://github.com/redis/redis/pull/12817#issuecomment-1851899163
- Introduced:
Version: 4.0
Commit:
9b01b64430
- Harm Level: None
Only result in NOP
- Solution:
* Don't add fake client into server.clients_pending_write
* Add c->conn assertion for updateClientMemUsageAndBucket() and
updateClientMemoryUsage() to avoid same
issue in the future.
So now it will be the responsibility of the caller of both of them to
avoid passing in fake client.
6. Fix calling RM_BlockedClientMeasureTimeStart() and
RM_BlockedClientMeasureTimeEnd() without GIL
- Introduced:
Version: 6.2
PR: #7491
- Harm Level: Low
Causes inaccuracies in command latency histogram and slow logs, but does
not corrupt memory.
- Solution:
Module API users, if know that non-thread-safe APIs will be used in
multi-threading, need to take responsibility for protecting them with
their own locks instead of the GIL, as using the GIL is too expensive.
### Other issue
1. RM_Yield is not thread-safe, fixed via #12905.
### Summarize
1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and
`RM_Yield`, potentially preventing memory corruption, data disorder, or
assertion.
2. Updated docs and module test to clarify module API users'
responsibility for locking non-thread-safe APIs in multi-threading, such
as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(),
RM_RetainString(), and RM_HoldString().
### About backpot to 7.2
1. The implement of (1) is not too satisfying, would like to get more
eyes.
2. (2), (3) can be safely for backport
3. (4), (6) just modifying the module tests and updating the
documentation, no need for a backpot.
4. (5) is harmless, no need for a backpot.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Before #9497, before redis-server was shut down, we did not manually shut down all the clients,
which would have prevented valgrind from detecting a memory leak in the client's argc.
fixes test issue introduced in #9167
1. invalid reads due to accessing non-retained string (passed as unblock context).
2. leaking module blocked client context, see #6922 for info.
Modules that use background threads with thread safe contexts are likely
to use RM_BlockClient() without a timeout function, because they do not
set up a timeout.
Before this commit, `CLIENT UNBLOCK` would result with a crash as the
`NULL` timeout callback is called. Beyond just crashing, this is also
logically wrong as it may throw the module into an unexpected client
state.
This commits makes `CLIENT UNBLOCK` on such clients behave the same as
any other client that is not in a blocked state and therefore cannot be
unblocked.
- removes time sensitive checks from block on background tests during leak checks.
- fix uninitialized variable on RedisModuleBlockedClient() when calling
RM_BlockedClientMeasureTimeEnd() without RM_BlockedClientMeasureTimeStart()
This commit enables tracking time of the background tasks and on replies,
opening the door for properly tracking commands that rely on blocking / background
work via the slowlog, latency history, and commandstats.
Some notes:
- The time spent blocked waiting for key changes, or blocked on synchronous
replication is not accounted for.
- **This commit does not affect latency tracking of commands that are non-blocking
or do not have background work.** ( meaning that it all stays the same with exception to
`BZPOPMIN`,`BZPOPMAX`,`BRPOP`,`BLPOP`, etc... and module's commands that rely
on background threads ).
- Specifically for latency history command we've added a new event class named
`command-unblocking` that will enable latency monitoring on commands that spawn
background threads to do the work.
- For blocking commands we're now considering the total time of a command as the
time spent on call() + the time spent on replying when unblocked.
- For Modules commands that rely on background threads we're now considering the
total time of a command as the time spent on call (main thread) + the time spent on
the background thread ( if marked within `RedisModule_MeasureTimeStart()` and
`RedisModule_MeasureTimeEnd()` ) + the time spent on replying (main thread)
To test for this feature we've added a `unit/moduleapi/blockonbackground` test that relies on
a module that blocks the client and sleeps on the background for a given time.
- check blocked command that uses RedisModule_MeasureTimeStart() is tracking background time
- check blocked command that uses RedisModule_MeasureTimeStart() is tracking background time even in timeout
- check blocked command with multiple calls RedisModule_MeasureTimeStart() is tracking the total background time
- check blocked command without calling RedisModule_MeasureTimeStart() is not reporting background time