fix#10439. see https://github.com/redis/redis/pull/9872
When executing SHUTDOWN we pause the client so we can un-pause it
if the shutdown fails.
this could happen during the timeout, if the shutdown is aborted, but could
also happen from withing the initial `call()` to shutdown, if the rdb save fails.
in that case when we return to `call()`, we'll crash if `c->cmd` has been set to NULL.
The call stack is:
```
unblockClient(c)
replyToClientsBlockedOnShutdown()
cancelShutdown()
finishShutdown()
prepareForShutdown()
shutdownCommand()
```
what's special about SHUTDOWN in that respect is that it can be paused,
and then un-paused before the original `call()` returns.
tests where added for both failed shutdown, and a followup successful one.
To avoid data loss, this commit adds a grace period for lagging replicas to
catch up the replication offset.
Done:
* Wait for replicas when shutdown is triggered by SIGTERM and SIGINT.
* Wait for replicas when shutdown is triggered by the SHUTDOWN command. A new
blocked client type BLOCKED_SHUTDOWN is introduced, allowing multiple clients
to call SHUTDOWN in parallel.
Note that they don't expect a response unless an error happens and shutdown is aborted.
* Log warning for each replica lagging behind when finishing shutdown.
* CLIENT_PAUSE_WRITE while waiting for replicas.
* Configurable grace period 'shutdown-timeout' in seconds (default 10).
* New flags for the SHUTDOWN command:
- NOW disables the grace period for lagging replicas.
- FORCE ignores errors writing the RDB or AOF files which would normally
prevent a shutdown.
- ABORT cancels ongoing shutdown. Can't be combined with other flags.
* New field in the output of the INFO command: 'shutdown_in_milliseconds'. The
value is the remaining maximum time to wait for lagging replicas before
finishing the shutdown. This field is present in the Server section **only**
during shutdown.
Not directly related:
* When shutting down, if there is an AOF saving child, it is killed **even** if AOF
is disabled. This can happen if BGREWRITEAOF is used when AOF is off.
* Client pause now has end time and type (WRITE or ALL) per purpose. The
different pause purposes are *CLIENT PAUSE command*, *failover* and
*shutdown*. If clients are unpaused for one purpose, it doesn't affect client
pause for other purposes. For example, the CLIENT UNPAUSE command doesn't
affect client pause initiated by the failover or shutdown procedures. A completed
failover or a failed shutdown doesn't unpause clients paused by the CLIENT
PAUSE command.
Notes:
* DEBUG RESTART doesn't wait for replicas.
* We already have a warning logged when a replica disconnects. This means that
if any replica connection is lost during the shutdown, it is either logged as
disconnected or as lagging at the time of exit.
Co-authored-by: Oran Agra <oran@redislabs.com>
This commit revives the improves the ability to run the test suite against
external servers, instead of launching and managing `redis-server` processes as
part of the test fixture.
This capability existed in the past, using the `--host` and `--port` options.
However, it was quite limited and mostly useful when running a specific tests.
Attempting to run larger chunks of the test suite experienced many issues:
* Many tests depend on being able to start and control `redis-server` themselves,
and there's no clear distinction between external server compatible and other
tests.
* Cluster mode is not supported (resulting with `CROSSSLOT` errors).
This PR cleans up many things and makes it possible to run the entire test suite
against an external server. It also provides more fine grained controls to
handle cases where the external server supports a subset of the Redis commands,
limited number of databases, cluster mode, etc.
The tests directory now contains a `README.md` file that describes how this
works.
This commit also includes additional cleanups and fixes:
* Tests can now be tagged.
* Tag-based selection is now unified across `start_server`, `tags` and `test`.
* More information is provided about skipped or ignored tests.
* Repeated patterns in tests have been extracted to common procedures, both at a
global level and on a per-test file basis.
* Cleaned up some cases where test setup was based on a previous test executing
(a major anti-pattern that repeats itself in many places).
* Cleaned up some cases where test teardown was not part of a test (in the
future we should have dedicated teardown code that executes even when tests
fail).
* Fixed some tests that were flaky running on external servers.
We're already using bg_unlink in several places to delete the rdb file in the background,
and avoid paying the cost of the deletion from our main thread.
This commit uses bg_unlink to remove the temporary rdb file in the background too.
However, in case we delete that rdb file just before exiting, we don't actually wait for the
background thread or the main thread to delete it, and just let the OS clean up after us.
i.e. we open the file, unlink it and exit with the fd still open.
Furthermore, rdbRemoveTempFile can be called from a thread and was using snprintf which is
not async-signal-safe, we now use ll2string instead.