Commit Graph

236 Commits

Author SHA1 Message Date
antirez
4c0f8c4e5a Sentinel: parse new INFO replication output correctly.
Sentinel was not able to detect slaves when connected to a very recent
version of Redis master since a previos non-backward compatible change
to INFO broken the parsing of the slaves ip:port INFO output.

This fixes issue #1164
2013-06-20 10:23:23 +02:00
antirez
e5ef85c444 Sentinel: changes to tilt mode.
Tilt mode was too aggressive (not processing INFO output), this
resulted in a few problems:

1) Redirections were not followed when in tilt mode. This opened a
   window to misinform clients about the current master when a Sentinel
   was in tilt mode and a fail over happened during the time it was not
   able to update the state.

2) It was possible for a Sentinel exiting tilt mode to detect a false
   fail over start, if a slave rebooted with a wrong configuration
   about at the same time. This used to happen since in tilt mode we
   lose the information that the runid changed (reboot).

   Now instead the Sentinel in tilt mode will still remove the instance
   from the list of slaves if it changes state AND runid at the same
   time.

Both are edge conditions but the changes should overall improve the
reliability of Sentinel.
2013-04-30 15:08:29 +02:00
antirez
ef05a78e7e Sentinel: more sensible delay in master demote after tilt. 2013-04-30 15:08:22 +02:00
antirez
48ede0d84d Sentinel: only demote old master into slave under certain conditions.
We used to always turn a master into a slave if the DEMOTE flag was set,
as this was a resurrecting master instance.

However the following race condition is possible for a Sentinel that
got partitioned or internal issues (tilt mode), and was not able to
refresh the state in the meantime:

1) Sentinel X is running, master is instance "A".
3) "A" fails, sentinels will promote slave "B" as master.
2) Sentinel X goes down because of a network partition.
4) "A" returns available, Sentinels will demote it as a slave.
5) "B" fails, other Sentinels will promote slave "A" as master.
6) At this point Sentinel X comes back.

When "X" comes back he thinks that:

"B" is the master.
"A" is the slave to demote.

We want to avoid that Sentinel "X" will demote "A" into a slave.
We also want that Sentinel "X" will detect that the conditions changed
and will reconfigure itself to monitor the right master.

There are two main ways for the Sentinel to reconfigure itself after
this event:

1) If "B" is reachable and already configured as a slave by other
sentinels, "X" will perform a redirection to "A".
2) If there are not the conditions to demote "A", the fact that "A"
reports to be a master will trigger a failover detection in "X", that
will end into a reconfiguraiton to monitor "A".

However if the Sentinel was not reachable, its state may not be updated,
so in case it titled, or was partiitoned from the master instance of the
slave to demote, the new implementation waits some time (enough to
guarantee we can detect the new INFO, and new DOWN conditions).

If after some time still there are not the right condiitons to demote
the instance, the DEMOTE flag is cleared.
2013-04-26 17:02:13 +02:00
antirez
1965e22aa1 Sentinel: always redirect on master->slave transition.
Sentinel redirected to the master if the instance changed runid or it
was the first time we got INFO, and a role change was detected from
master to slave.

While this is a good idea in case of slave->master, since otherwise we
could detect a failover without good reasons just after a reboot with a
slave with a wrong configuration, in the case of master->slave
transition is much better to always perform the redirection for the
following reasons:

1) A Sentinel may go down for some time. When it is back online there is
no other way to understand there was a failover.
2) Pointing clients to a slave seems to be always the wrong thing to do.
3) There is no good rationale about handling things differently once an
instance is rebooted (runid change) in that case.
2013-04-24 11:30:17 +02:00
antirez
8e222c888f Sentinel: turn old master into a slave when it comes back. 2013-04-19 16:47:24 +02:00
antirez
089cbe643f Sentinel: advertise the promoted slave address only after successful setup. 2013-01-31 17:19:21 +01:00
guiquanz
9d09ce3981 Fixed many typos. 2013-01-19 10:59:44 +01:00
antirez
4365e5b2d3 BSD license added to every C source and header file. 2012-11-08 18:31:32 +01:00
antirez
db100c4671 Sentinel: Support for AUTH. 2012-09-26 18:59:54 +02:00
antirez
9bd0e097aa Sentinel: reply -IDONTKNOW to get-master-addr-by-name on lack of info.
If we don't have any clue about a master since it never replied to INFO
so far, reply with an -IDONTKNOW error to SENTINEL
get-master-addr-by-name requests.
2012-09-04 16:06:53 +02:00
antirez
8bdde086ac Sentinel: more easy master redirection if master is a slave.
Before this commit Sentienl used to redirect master ip/addr if the
current instance reported to be a slave only if this was the first INFO
output received, and the role was found to be slave.

Now instead also if we find that the runid is different, and the
reported role is slave, we also redirect to the reported master ip/addr.

This unifies the behavior of Sentinel in the case of a reboot (where it
will see the first INFO output with the wrong role and will perform the
redirection), with the behavior of Sentinel in the case of a change in
what it sees in the INFO output of the master.
2012-09-04 15:52:04 +02:00
antirez
6276434ad2 Sentinel: do not crash against slaves not publishing the runid.
Older versions of Redis (before 2.4.17) don't publish the runid field in
INFO. This commit makes Sentinel able to handle that without crashing.
2012-08-30 18:01:52 +02:00
antirez
58186b9dcf Sentinel: INFO command implementation. 2012-08-29 12:44:24 +02:00
antirez
3ec701e059 Sentinel: Sentinel-side support for slave priority.
The slave priority that is now published by Redis in INFO output is
now used by Sentinel in order to select the slave with minimum priority
for promotion, and in order to consider slaves with priority set to 0 as
not able to play the role of master (they will never be promoted by
Sentinel).

The "slave-priority" field is now one of the fileds that Sentinel
publishes when describing an instance via the SENTINEL commands such as
"SENTINEL slaves mastername".
2012-08-28 17:45:01 +02:00
antirez
c14e0ecafd Sentinel: suppress harmless warning by initializing 'table' to NULL.
Note that the assertion guarantees that one of the if branches setting
table is always entered.
2012-08-28 12:56:05 +02:00
antirez
850789ce73 Sentinel: send SCRIPT KILL on -BUSY reply and SDOWN instance.
From the point of view of Redis an instance replying -BUSY is down,
since it is effectively not able to reply to user requests. However
a looping script is a recoverable condition in Redis if the script still
did not performed any write to the dataset. In that case performing a
fail over is not optimal, so Sentinel now tries to restore the normal server
condition killing the script with a SCRIPT KILL command.

If the script already performed some write before entering an infinite
(or long enough to timeout) loop, SCRIPT KILL will not work and the
fail over will be triggered anyway.
2012-08-24 12:29:54 +02:00
antirez
01477753e6 Sentinel: fixed a crash on script execution.
The call to sentinelScheduleScriptExecution() lacked the final NULL
argument to signal the end of arguments. This resulted into a crash.
2012-08-24 12:10:24 +02:00
antirez
cada7f9671 Sentinel: SENTINEL FAILOVER command implemented.
This command can be used in order to force a Sentinel instance to start
a failover for the specified master, as leader, forcing the failover
even if the master is up.

The commit also adds some minor refactoring and other improvements to
functions already implemented that make them able to work when the
master is not in SDOWN condition. For instance slave selection
assumed that we ask INFO every second to every slave, this is true
only when the master is in SDOWN condition, so slave selection did not
worked when the master was not in SDOWN condition.
2012-08-03 12:41:27 +02:00
antirez
6275004ca6 Sentinel: client reconfiguration script execution.
This commit adds support to optionally execute a script when one of the
following events happen:

* The failover starts (with a slave already promoted).
* The failover ends.
* The failover is aborted.

The script is called with enough parameters (documented in the example
sentinel.conf file) to provide information about the old and new ip:port
pair of the master, the role of the sentinel (leader or observer) and
the name of the master.

The goal of the script is to inform clients of the configuration change
in a way specific to the environment Sentinel is running, that can't be
implemented in a genereal way inside Sentinel itself.
2012-08-02 18:40:30 +02:00
antirez
fd92b366b0 Sentinel: when leader in wait-start, sense another leader as race.
When we are in wait start, if another leader (or any other external
entity) turns a slave into a master, abort the failover, and detect it
as an observer.

Note that the wait-start state is mainly there for this reason but the
abort was yet not implemented.

This adds a new sentinel event -failover-abort-race.
2012-07-31 17:11:26 +02:00
antirez
91c15ed1b5 Sentinel: sentinelRefreshInstanceInfo() comments improved a bit. 2012-07-31 16:18:15 +02:00
antirez
75084e057d Sentinel: abort failover when in wait-start if master is back.
When we are a Leader Sentinel in wait-start state, starting with this
commit the failover is aborted if the master returns online.

This improves the way we handle a notable case of net split, that is the
split between Sentinels and Redis servers, that will be a very common
case of split becase Sentinels will often be installed in the client's
network and servers can be in a differnt arm of the network.

When Sentinels and Redis servers are isolated the master is in ODOWN
condition since the Sentinels can agree about this state, however the
failover does not start since there are no good slaves to promote (in
this specific case all the slaves are unreachable).

However when the split is resolved, Sentinels may sense the slave back
a moment before they sense the master is back, so the failover may start
without a good reason (since the master is actually working too).

Now this condition is reversible, so the failover will be aborted
immediately after if the master is detected to be working again, that
is, not in SDOWN nor in ODOWN condition.
2012-07-31 10:19:34 +02:00
antirez
7f5bdba434 Merge remote-tracking branch 'origin/unstable' into unstable 2012-07-28 20:55:17 +02:00
antirez
3f194a9d25 Sentinel: scripts execution engine improved.
We no longer use a vanilla fork+execve but take a queue of jobs of
scripts to execute, with retry on error, timeouts, and so forth.

Currently this is used only for notifications but soon the ability to
also call clients reconfiguration scripts will be added.
2012-07-28 20:54:27 +02:00
Jan-Erik Rediger
c6c19c8372 Include sys/wait.h to avoid compiler warning
gcc warned about an implicit declaration of function 'wait3'. 
Including this header fixes this.
2012-07-28 12:33:01 +03:00
antirez
ce7b838fb9 Sentinel: don't start a failover as leader if there is no good slave. 2012-07-26 12:09:40 +02:00
antirez
baace5fc42 Sentinel: ability to execute notification scripts. 2012-07-25 16:33:37 +02:00
antirez
672102c2ce Sentinel: abort failover if no good slave is available.
The previous behavior of the state machine was to wait some time and
retry the slave selection, but this is not robust enough against drastic
changes in the conditions of the monitored instances.

What we do now when the slave selection fails is to abort the failover
and return back monitoring the master. If the ODOWN condition is still
present a new failover will be triggered and so forth.

This commit also refactors the code we use to abort a failover.
2012-07-25 11:32:19 +02:00
antirez
9e5bef38e6 Sentinel: reset pending_commands in a more generic way. 2012-07-24 18:57:26 +02:00
antirez
a23a5b6c7d Prevent a spurious +sdown event on switch.
When we reset the master we should start with clean timestamps for ping
replies otherwise we'll detect a spurious +sdown event, because on
+master-switch event the previous master instance was probably in +sdown
condition. Since we updated the address we should count time from
scratch again.

Also this commit makes sure to explicitly reset the count of pending
commands, now we can do this because of the new way the hiredis link
is closed.
2012-07-24 18:46:04 +02:00
antirez
d918e6f127 Sentinel: debugging message removed. 2012-07-24 18:20:05 +02:00
antirez
75fb6e5b8a Sentinel: changes to connection handling and redirection.
We disconnect the Redis instances hiredis link in a more robust way now.
Also we change the way we perform the redirection for the +switch-master
event, that is not just an instance reset with an address change.

Using the same system we now implement the +redirect-to-master event
that is triggered by an instance that is configured to be master but
found to be a slave at the first INFO reply. In that case we monitor the
master instead, logging the incident as an event.
2012-07-24 18:15:44 +02:00
antirez
2179c26916 Sentinel: check that instance still exists in reply callbacks.
We can't be sure the instance object still exists when the reply
callback is called.
2012-07-24 16:37:57 +02:00
antirez
d876d6feac Sentinel: more robust failover detection as observer.
Sentinel observers detect failover checking if a slave attached to the
monitored master turns into its replication state from slave to master.
However while this change may in theory only happen after a SLAVEOF NO
ONE command, in practie it is very easy to reboot a slave instance with
a wrong configuration that turns it into a master, especially if it was
a past master before a successfull failover.

This commit changes the detection policy so that if an instance goes
from slave to master, but at the same time the runid has changed, we
sense a reboot, and in that case we don't detect a failover at all.

This commit also introduces the "reboot" sentinel event, that is logged
at "warning" level (so this will trigger an admin notification).

The commit also fixes a problem in the disconnect handler that assumed
that the instance object always existed, that is not the case. Now we
no longer assume that redisAsyncFree() will call the disconnection
handler before returning.
2012-07-24 12:42:40 +02:00
antirez
6b5daa2df2 First implementation of Redis Sentinel.
This commit implements the first, beta quality implementation of Redis
Sentinel, a distributed monitoring system for Redis with notification
and automatic failover capabilities.

More info at http://redis.io/topics/sentinel
2012-07-23 13:14:44 +02:00