Sentinel: check Slave INFO state more often when disconnected.

During the initial handshake with the master a slave will report to have a very high disconnection time from its master (since technically it was disconnected since forever, so the current UNIX time in seconds is reported). However when the slave is connected again the Sentinel may re-scan the INFO output again only after 10 seconds, which is a long time. During this time Sentinels will consider this instance unable to failover, so a useless delay is introduced. Actaully this hardly happened in the practice because when a slave's master is down, the INFO period for slaves changes to 1 second. However when a manual failover is attempted immediately after adding slaves (like in the case of the Sentinel unit test), this problem may happen. This commit changes the INFO period to 1 second even in the case the slave's master is not down, but the slave reported to be disconnected from the master (by publishing, last time we checked, a master disconnection time field in INFO). This change is required as a result of an unrelated change in the replication code that adds a small delay in the master-slave first synchronization.
2025-01-22 16:18:28 -05:00 · 2016-07-22 10:51:25 +02:00 · 2016-07-22 10:51:25 +02:00 · 3e9ce38b0a
commit 3e9ce38b0a
parent 0a628e5102
2 changed files with 10 additions and 3 deletions
--- a/src/sentinel.c
+++ b/src/sentinel.c
@ -2579,9 +2579,15 @@ void sentinelSendPeriodicCommands(sentinelRedisInstance *ri) {
    /* If this is a slave of a master in O_DOWN condition we start sending
     * it INFO every second, instead of the usual SENTINEL_INFO_PERIOD
     * period. In this state we want to closely monitor slaves in case they
-     * are turned into masters by another Sentinel, or by the sysadmin. */
+     * are turned into masters by another Sentinel, or by the sysadmin.
+     *
+     * Similarly we monitor the INFO output more often if the slave reports
+     * to be disconnected from the master, so that we can have a fresh
+     * disconnection time figure. */
    if ((ri->flags & SRI_SLAVE) &&
-        (ri->master->flags & (SRI_O_DOWN|SRI_FAILOVER_IN_PROGRESS))) {
+        ((ri->master->flags & (SRI_O_DOWN|SRI_FAILOVER_IN_PROGRESS)) ||
+         (ri->master_link_down_time != 0)))
+    {
        info_period = 1000;
    } else {
        info_period = SENTINEL_INFO_PERIOD;
--- a/tests/sentinel/tests/05-manual.tcl
+++ b/tests/sentinel/tests/05-manual.tcl
@ -6,7 +6,8 @@ test "Manual failover works" {
    set old_port [RI $master_id tcp_port]
    set addr [S 0 SENTINEL GET-MASTER-ADDR-BY-NAME mymaster]
    assert {[lindex $addr 1] == $old_port}
-    S 0 SENTINEL FAILOVER mymaster
+    catch {S 0 SENTINEL FAILOVER mymaster} reply
+    assert {$reply eq "OK"}
    foreach_sentinel_id id {
        wait_for_condition 1000 50 {
            [lindex [S $id SENTINEL GET-MASTER-ADDR-BY-NAME mymaster] 1] != $old_port