redict/tests/unit/shutdown.tcl
Viktor Söderqvist 45a155bd0f
Wait for replicas when shutting down (#9872)
To avoid data loss, this commit adds a grace period for lagging replicas to
catch up with the replication offset.

Done:

* Wait for replicas when shutdown is triggered by SIGTERM and SIGINT.

* Wait for replicas when shutdown is triggered by the SHUTDOWN command. A new
  blocked client type BLOCKED_SHUTDOWN is introduced, allowing multiple clients
  to call SHUTDOWN in parallel. Note that these clients don't expect a response
  unless an error happens and the shutdown is aborted.

* Log warning for each replica lagging behind when finishing shutdown.

* Pause client writes (CLIENT_PAUSE_WRITE) while waiting for replicas.

* Configurable grace period 'shutdown-timeout' in seconds (default 10).

* New flags for the SHUTDOWN command:

    - NOW disables the grace period for lagging replicas.

    - FORCE ignores errors writing the RDB or AOF files which would normally
      prevent a shutdown.

    - ABORT cancels an ongoing shutdown. Can't be combined with other flags.

* New field in the output of the INFO command: 'shutdown_in_milliseconds'. The
  value is the remaining maximum time to wait for lagging replicas before
  finishing the shutdown. This field is present in the Server section **only**
  during shutdown.
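
The grace period described above can be sketched as a small, self-contained Tcl model. This is only an illustration and not the server's implementation: the proc names `lagging_replicas`, `shutdown_in_milliseconds` (borrowed from the INFO field name) and `can_finish_shutdown` are hypothetical, and replication offsets are passed in as plain integers.

```tcl
# Toy model of the shutdown grace period; not the server's implementation.
# All proc names here are hypothetical.

# Return the indexes of replicas whose offset is behind the master's.
proc lagging_replicas {master_offset replica_offsets} {
    set lagging {}
    set idx 0
    foreach offset $replica_offsets {
        if {$offset < $master_offset} {
            lappend lagging $idx
        }
        incr idx
    }
    return $lagging
}

# Remaining grace period in milliseconds, never negative.
proc shutdown_in_milliseconds {now_ms deadline_ms} {
    return [expr {max(0, $deadline_ms - $now_ms)}]
}

# Shutdown may finish once every replica has caught up
# or the grace period has expired.
proc can_finish_shutdown {master_offset replica_offsets now_ms deadline_ms} {
    if {[llength [lagging_replicas $master_offset $replica_offsets]] == 0} {
        return 1
    }
    return [expr {[shutdown_in_milliseconds $now_ms $deadline_ms] == 0}]
}

# Example: shutdown-timeout of 10 seconds, shutdown started at t = 0 ms.
set deadline_ms [expr {10 * 1000}]
puts [can_finish_shutdown 500 {500 480} 2000 $deadline_ms]   ;# one replica lags, time remains
puts [can_finish_shutdown 500 {500 500} 2000 $deadline_ms]   ;# all replicas caught up
puts [can_finish_shutdown 500 {500 480} 10000 $deadline_ms]  ;# grace period expired
```

In this model, the replicas still returned by `lagging_replicas` when the deadline passes correspond to the ones that would be logged as lagging when the shutdown finishes.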

Not directly related:

* When shutting down, if there is an AOF saving child, it is killed **even** if AOF
  is disabled. This can happen if BGREWRITEAOF is used when AOF is off.

* Client pause now has end time and type (WRITE or ALL) per purpose. The
  different pause purposes are *CLIENT PAUSE command*, *failover* and
  *shutdown*. If clients are unpaused for one purpose, it doesn't affect client
  pause for other purposes. For example, the CLIENT UNPAUSE command doesn't
  affect client pause initiated by the failover or shutdown procedures. A completed
  failover or a failed shutdown doesn't unpause clients paused by the CLIENT
  PAUSE command.
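
The per-purpose pause bookkeeping described above can be modelled roughly as follows. This is a minimal sketch under stated assumptions: the proc names and the purpose keys are hypothetical, and the rule that the effective pause is the most restrictive unexpired type (ALL over WRITE) is an assumption of the sketch, not something this commit message states.

```tcl
# Toy model of per-purpose client pauses; proc names are hypothetical.
# The state is a dict: purpose -> {type end_ms}, where type is WRITE or ALL.

proc pause_set {state purpose type end_ms} {
    dict set state $purpose [list $type $end_ms]
    return $state
}

proc pause_clear {state purpose} {
    if {[dict exists $state $purpose]} {
        dict unset state $purpose
    }
    return $state
}

# Assumption: the effective pause is the most restrictive type among the
# purposes that have not yet expired (ALL beats WRITE).
proc effective_pause {state now_ms} {
    set result NONE
    dict for {purpose entry} $state {
        lassign $entry type end_ms
        if {$now_ms >= $end_ms} continue
        if {$type eq "ALL"} { return ALL }
        set result WRITE
    }
    return $result
}

set state [dict create]
set state [pause_set $state client_pause_command ALL 5000]
set state [pause_set $state shutdown WRITE 10000]

# Unpausing for one purpose leaves the other purposes untouched.
set state [pause_clear $state client_pause_command]
puts [effective_pause $state 1000]
```

After the CLIENT PAUSE entry is cleared, the shutdown entry is still present, so writes remain paused in this model until the shutdown entry expires or is cleared.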

Notes:

* DEBUG RESTART doesn't wait for replicas.

* We already have a warning logged when a replica disconnects. This means that
  if any replica connection is lost during the shutdown, it is either logged as
  disconnected or as lagging at the time of exit.

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-01-02 09:50:15 +02:00

start_server {tags {"shutdown external:skip"}} {
    test {Temp rdb will be deleted if we use bg_unlink when shutdown} {
        for {set i 0} {$i < 20} {incr i} {
            r set $i $i
        }
        # It will cost 2s (20 * 100ms) to dump rdb
        r config set rdb-key-save-delay 100000
        # Child is dumping rdb
        r bgsave
        after 100
        set dir [lindex [r config get dir] 1]
        set child_pid [get_child_pid 0]
        set temp_rdb [file join $dir temp-${child_pid}.rdb]
        # Temp rdb must exist
        assert {[file exists $temp_rdb]}
        catch {r shutdown nosave}
        # Make sure the server was killed
        catch {set rd [redis_deferring_client]} e
        assert_match {*connection refused*} $e
        # Temp rdb file must be deleted
        assert {![file exists $temp_rdb]}
    }
}
start_server {tags {"shutdown external:skip"}} {
    test {SHUTDOWN ABORT can cancel SIGTERM} {
        r debug pause-cron 1
        set pid [s process_id]
        exec kill -SIGTERM $pid
        after 10; # Give signal handler some time to run
        r shutdown abort
        verify_log_message 0 "*Shutdown manually aborted*" 0
        r debug pause-cron 0
        r ping
    } {PONG}
    test {Temp rdb will be deleted in signal handle} {
        for {set i 0} {$i < 20} {incr i} {
            r set $i $i
        }
        # It will cost 2s (20 * 100ms) to dump rdb
        r config set rdb-key-save-delay 100000
        set pid [s process_id]
        set temp_rdb [file join [lindex [r config get dir] 1] temp-${pid}.rdb]
        # Trigger a shutdown which will save an rdb
        exec kill -SIGINT $pid
        # Wait for creation of temp rdb
        wait_for_condition 50 10 {
            [file exists $temp_rdb]
        } else {
            fail "Can't trigger rdb save on shutdown"
        }
        # Insist on immediate shutdown, temp rdb file must be deleted
        exec kill -SIGINT $pid
        # Wait for the rdb file to be deleted
        wait_for_condition 50 10 {
            ![file exists $temp_rdb]
        } else {
            fail "Temp rdb file was not deleted after the second SIGINT"
        }
    }
}
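
The tests above poll with `wait_for_condition 50 10`, that is, up to 50 attempts spaced 10 ms apart, for a budget of roughly 500 ms. The polling pattern itself can be sketched in plain Tcl as below. `poll_until` and `file_ready` are hypothetical names used only for this illustration; this is not the test suite's own helper, and the counter-based condition merely stands in for a check such as "the temp rdb file exists".

```tcl
# Hypothetical polling helper, similar in spirit to wait_for_condition.
# Evaluates $condition in the caller's scope up to $maxtries times,
# sleeping $delay_ms between attempts. Returns 1 on success, 0 on timeout.
proc poll_until {maxtries delay_ms condition} {
    for {set try 0} {$try < $maxtries} {incr try} {
        if {[uplevel 1 [list expr $condition]]} {
            return 1
        }
        after $delay_ms
    }
    return 0
}

# Stand-in for a real check: becomes true on the third evaluation.
set checks 0
proc file_ready {} {
    global checks
    return [expr {[incr checks] >= 3}]
}

if {![poll_until 50 10 {[file_ready]}]} {
    error "condition not met within 500 ms"
}
puts "condition met after $checks checks"
```

Returning a status instead of failing inside the helper leaves the failure message to the caller, which is where a message specific to each wait (creation versus deletion of the file) belongs.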