Commit Graph

186 Commits

Author SHA1 Message Date
antirez
47750998a6 Sentinel: more aggressive failover start desynchronization.
Sentinel needs to avoid split brain conditions due to multiple sentinels
trying to get voted at the exact same time.

So far some desynchronization was provided by fluctuating server.hz,
that is the frequency of the timer function call. However the
desynchonization provided in this way was not enough when using many
Sentinel instances, especially when a large quorum value is used in
order to force a greater degree of agreement (more than N/2+1).

It was verified that it was likely to trigger a split brain
condition, forcing the system to try again after a timeout.
Usually the system will succeed after a few retries, but this is not
optimal.

This commit desynchronizes instances in a more effective way to make it
likely that the first attempt will be successful.
2014-03-04 17:09:36 +01:00
antirez
b15411df98 Sentinel: log quorum with +monitor event. 2014-02-24 17:10:20 +01:00
antirez
6b373edb77 Sentinel: generate +monitor events at startup. 2014-02-24 16:33:55 +01:00
antirez
3b7a757468 Sentinel: log +monitor and +set events.
Now that we have a runtime configuration system, it is very important to
be able to log how the Sentinel configuration changes over time because
of API calls.
2014-02-24 16:33:43 +01:00
antirez
25cebf7285 Sentinel: added missing exit(1) after checking for config file. 2014-02-24 16:22:52 +01:00
antirez
b1c1386374 Sentinel: IDONTKNOW error removed.
This error was conceived for the older version of Sentinel that worked
via master redirection and that was not able to get configuration
updates from other Sentinels via the Pub/Sub channel of masters or
slaves.

This reply does not make sense today, every Sentinel should reply with
the best information it has currently. The error will make even more
sense in the future since the plan is to allow Sentinels to update the
configuration of other Sentinels via gossip with a direct chat without
the prerequisite that they have at least a monitored instance in common.
2014-02-22 17:34:46 +01:00
antirez
7d7b3810e7 Sentinel: report instances role switch events.
This is useful mostly for debugging of issues.
2014-02-20 12:13:52 +01:00
antirez
7cec9e48ce Sentinel: SENTINEL_SLAVE_RECONF_RETRY_PERIOD -> RECONF_TIMEOUT
Rename define to match the new meaning.
2014-02-18 10:27:38 +01:00
antirez
18b8bad53c Sentinel: fix slave promotion timeout.
If we can't reconfigure a slave in time during failover, go forward as
anyway the slave will be fixed by Sentinels in the future, once they
detect it is misconfigured.

Otherwise a failover in progress may never terminate if for some reason
the slave is uncapable to sync with the master while at the same time
it is not disconnected.
2014-02-18 08:50:57 +01:00
antirez
e1b77b61f3 Sentinel: better specify startup errors due to config file.
Now it logs the file name if it is not accessible. Also there is a
different error for the missing config file case, and for the non
writable file case.
2014-02-17 16:44:49 +01:00
antirez
2d6eb68993 Sentinel: allow SHUTDOWN command in Sentinel mode. 2014-02-07 11:22:24 +01:00
antirez
3ff1bb4b2e Sentinel: check arity for SENTINEL MASTER command.
This fixes issue #1530.
2014-01-31 10:13:38 +01:00
antirez
d5763dceaf SENTINEL SET master quorum implemented. 2014-01-14 09:23:26 +01:00
antirez
fe86f890b0 SENTINEL SET: error on bad option name + flush config on error. 2014-01-13 11:55:57 +01:00
antirez
f822516e43 SENTINEL SET implemented.
The new command allows to change master-specific configurations
at runtime. All the settable parameters can be retrivied via the
SENTINEL MASTER command, so there is no equivalent "GET" command.
2014-01-13 11:53:29 +01:00
antirez
3cdcaff069 Sentinel: fix wrong arity error message. 2014-01-13 11:05:13 +01:00
antirez
964f6b17e9 Sentinel: SENTINEL REMOVE command added.
The command totally removes a monitored master.
2014-01-10 15:39:36 +01:00
antirez
cf2835519e Sentinel: releaseSentinelRedisInstance() top comment fixed.
The claim about unlinking the instance from the connected hash tables
was the opposite of the reality. Also the current actual behavior is
safer in most cases, so it is better to manually unlink when needed.
2014-01-10 15:33:42 +01:00
antirez
9d0f46c6f5 Sentinel: flush config on disk when new master is added. 2014-01-10 15:22:06 +01:00
antirez
39f9f449b0 Sentinel: SENTINEL MONITOR command implemented.
It allows to add new masters to monitor at runtime.
2014-01-10 15:18:24 +01:00
antirez
c42e4bd0b6 Sentinel: added SENTINEL MASTER <name> command.
With SENTINEL MASTERS it was already possible to list all the configured
masters, but not a specific one.
2014-01-10 14:41:52 +01:00
antirez
2bb9cd464e Add all the configurable fields to addReplySentinelRedisInstance().
Note: the auth password with the master is voluntarily not exposed.
2014-01-10 14:31:41 +01:00
antirez
5a7d04ee7b Trip comment to 80 cols in SentinelCommand(). 2014-01-10 14:13:04 +01:00
antirez
5320148883 Sentinel: dead code removed. 2013-12-13 11:01:13 +01:00
antirez
2eb781b35b dict.c: added optional callback to dictEmpty().
Redis hash table implementation has many non-blocking features like
incremental rehashing, however while deleting a large hash table there
was no way to have a callback called to do some incremental work.

This commit adds this support, as an optiona callback argument to
dictEmpty() that is currently called at a fixed interval (one time every
65k deletions).
2013-12-10 18:46:24 +01:00
antirez
c590549e40 Sentinel: fix reported role info sampling.
The way the role change was recoded was not sane and too much
convoluted, causing the role information to be not always updated.

This commit fixes issue #1445.
2013-12-06 12:46:56 +01:00
antirez
2b414a4b5f Sentinel: fix reported role fields when master is reset.
When there is a master address switch, the reported role must be set to
master so that we have a chance to re-sample the INFO output to check if
the new address is reporting the right role.

Otherwise if the role was wrong, it will be sensed as wrong even after
the address switch, and for enough time according to the role change
time, for Sentinel consider the master SDOWN.

This fixes isue #1446, that describes the effects of this bug in
practice.
2013-12-06 11:37:46 +01:00
antirez
11e81a1e9a Fixed grammar: before H the article is a, not an. 2013-12-05 16:35:32 +01:00
antirez
f80cf7363a Sentinel: don't write HZ when flushing config.
See issue #1419.
2013-12-02 15:56:10 +01:00
antirez
dffebbc904 Sentinel: better time desynchronization.
Sentinels are now desynchronized in a better way changing the time
handler frequency between 10 and 20 HZ. This way on average a
desynchronization of 25 milliesconds is produced that should be larger
enough compared to network latency, avoiding most split-brain condition
during the vote.

Now that the clocks are desynchronized, to have larger random delays when
performing operations can be easily achieved in the following way.
Take as example the function that starts the failover, that is
called with a frequency between 10 and 20 HZ and will start the
failover every time there are the conditions. By just adding as an
additional condition something like rand()%4 == 0, we can amplify the
desynchronization between Sentinel instances easily.

See issue #1419.
2013-12-02 12:29:42 +01:00
antirez
0addf8aff1 Sentinel: log vote received from other Sentinels. 2013-11-28 15:23:46 +01:00
huangz1990
86a540a66e fix a bug in sentinel.c about pub/sub link 2013-11-26 19:55:51 +08:00
antirez
6f4fd55762 Sentinel: fixes inverted strcmp() test preventing config updates.
The result of this one-char bug was pretty serious, if the new master
had the same port of the previous master, but just a different IP
address, non-leader Sentinels would not be able to recognize the
configuration change.

This commit fixes issue #1394.

Many thanks to @shanemadden that reported the bug and helped
investigating it.
2013-11-25 10:59:53 +01:00
antirez
8d547ebd56 Sentinel: fix type specifier for Hello msg generation.
This fixes issue #1395.
2013-11-25 10:24:34 +01:00
antirez
cc6053681f Sentinel: different comments updated to new implementation. 2013-11-21 16:22:59 +01:00
antirez
685e79998c Sentinel: cleanup around SENTINEL_INFO_VALIDITY_TIME. 2013-11-21 16:05:41 +01:00
antirez
489d889726 Sentinel: removed mem leak and useless code. 2013-11-21 15:43:55 +01:00
antirez
f55ad3038f Sentinel: manual failover works again. 2013-11-21 12:39:47 +01:00
antirez
297de1ab26 Sentinel: test for writable config file.
This commit introduces a funciton called when Sentinel is ready for
normal operations to avoid putting Sentinel specific stuff in redis.c.
2013-11-21 12:28:15 +01:00
antirez
d920177f8d Sentinel: check for disconnected links in sentinelSendHello().
Does not fix any bug as the test is performed by the caller, but better
to have the check.
2013-11-21 11:35:50 +01:00
antirez
8810167d13 Sentinel: Hello message sending code refactored. 2013-11-21 11:31:06 +01:00
antirez
0101c2bcfe Sentinel: select slave with best (greater) replication offset. 2013-11-20 16:05:36 +01:00
antirez
a6ebd910d8 Sentinel: take the replication offset in slaves state. 2013-11-20 15:53:21 +01:00
antirez
37a51a2568 Sentinel: distinguish between is-master-down-by-addr requests.
Some are just to know if the master is down, and in this case the runid
in the request is set to "*", others are actually in order to seek for a
vote and get elected. In the latter case the runid is set to the runid
of the instance seeking for the vote.
2013-11-19 16:50:04 +01:00
antirez
b22d1beea0 Sentinel: various fixes to leader election implementation. 2013-11-19 16:20:42 +01:00
antirez
1f9728cb20 Sentinel: failover script execution fixed. 2013-11-19 12:34:46 +01:00
antirez
90635488ce Sentinel: no longer used defines removed. 2013-11-19 11:24:36 +01:00
antirez
0a35f65301 Sentinel: when writing config on disk, remember sentinels runid. 2013-11-19 11:11:43 +01:00
antirez
5450833d02 Sentinel: arity of known-sentinel/slave is 4 not 3. 2013-11-19 11:03:47 +01:00
antirez
b8a94463b7 Sentinel: rewriteConfigSentinelOption() sub-iterators var typo fixed. 2013-11-19 10:59:50 +01:00
antirez
16237d78c8 Sentinel: call sentinelFlushConfig() to persist state when needed.
Also the sentinel configuration rewriting was modified in order to
account for failover in progress, where we need to provide the promoted
slave address as master address, and the old master address as one of
the slaves address.
2013-11-19 10:55:43 +01:00
antirez
e257ab2bfe Sentinel: sentinelFlushConfig() to CONFIG REWRITE + fsync. 2013-11-19 10:13:04 +01:00
antirez
5998769c28 Sentinel: CONFIG REWRITE support for Sentinel config. 2013-11-19 09:48:12 +01:00
antirez
47df12d5d9 Sentinel: can-failover option removed, many comments fixed. 2013-11-19 09:28:47 +01:00
antirez
232cdb95ab Sentinel: added config options useful to take state on config rewrite.
We'll use CONFIG REWRITE (internally) in order to store the new
configuration of a Sentinel after the internal state changes. In order
to do so, we need configuration options (that usually the user will not
touch at all) about config epoch of the master, Sentinels and Slaves
known for this master, and so forth.
2013-11-18 16:03:03 +01:00
antirez
3a374b0511 Sentinel: failover abort function simplified. 2013-11-18 11:43:35 +01:00
antirez
e0750acf11 Sentinel: slaves reconfig delay modified.
The time Sentinel waits since the slave is detected to be configured to
the wrong master, before reconfiguring it, is now the failover_timeout
time as this makes more sense in order to give the Sentinel performing
the failover enoung time to reconfigure the slaves slowly (if required
by the configuration).

Also we now PUBLISH more frequently the new configuraiton as this allows
to switch the reapprearing master back to slave faster.
2013-11-18 11:37:24 +01:00
antirez
83316f515c Sentinel: failover restart time is now multiple of failover timeout.
Also defaulf failover timeout changed to 3 minutes as the failover is a
fairly fast procedure most of the times, unless there are a very big
number of slaves and the user picked to configure them sequentially (in
that case the user should change the failover timeout accordingly).
2013-11-18 11:30:08 +01:00
antirez
3a56013acb Sentinel: state machine and timeouts simplified. 2013-11-18 11:12:58 +01:00
antirez
4be53b1c5d Sentinel: election timeout define. 2013-11-18 10:08:06 +01:00
antirez
69d826a354 Sentinel: fix address of master in Hello messages.
Once we switched configuration during a failover, we should advertise
the new address.

This was a serious race condition as the Sentinel performing the
failover for a moment advertised the old address with the new
configuration epoch: once trasmitted to the other Sentinels the broken
configuration would remain there forever, until the next failover
(because a greater configuration epoch is required to overwrite an older
one).
2013-11-14 10:25:55 +01:00
antirez
e4c65e72c6 Sentinel: master address selection in get-master-address refactored. 2013-11-14 10:23:54 +01:00
antirez
c0d7229364 Sentinel: fix conditional to only affect slaves with wrong master. 2013-11-14 10:23:05 +01:00
antirez
dfbd9c5aeb Sentinel: simplify and refactor slave reconfig code. 2013-11-14 00:36:43 +01:00
antirez
64ad6648a8 Sentinel: reconfigure slaves to right master. 2013-11-14 00:29:38 +01:00
antirez
3e27d678da Sentinel: remember last time slave changed master. 2013-11-14 00:20:15 +01:00
antirez
8297745fa6 Sentinel: redirect-to-master is not ok with new algorithm.
Now Sentinel believe the current configuration is always the winner and
should be applied by Sentinels instead of trying to adapt our view of
the cluster based on what we observe.

So the only way to modify what a Sentinel believe to be the truth is to
win an election and advertise the new configuration via Pub / Sub with a
greater configuration epoch.
2013-11-13 17:03:48 +01:00
antirez
76a88f56e5 Sentinel: safer slave reconfig, master reported role should match. 2013-11-13 17:02:09 +01:00
antirez
ddaad9fe2d Sentinel: role reporting fixed and added in SENTINEL output. 2013-11-13 16:39:57 +01:00
antirez
a0afa66f4b Sentinel: being a master and reporting as slave is considered SDOWN. 2013-11-13 16:28:56 +01:00
antirez
17718fdcba Sentinel: make sure role_reported is always updated. 2013-11-13 16:21:58 +01:00
antirez
46a053d34b Sentinel: track role change time. Wait before reconfigurations. 2013-11-13 16:18:23 +01:00
antirez
9e40c46f5e Sentinel: fix no-down check in master->slave conversion code. 2013-11-13 13:43:59 +01:00
antirez
ae35b7e240 Sentinel: readd slaves back after a master reset. 2013-11-13 13:01:11 +01:00
antirez
6bd4f6bffe Sentinel: sentinelResetMaster() new flag to avoid removing set of sentinels.
This commit also removes some dead code and cleanup generic flags.
2013-11-13 10:30:45 +01:00
antirez
1569af1f23 Sentinel: receive Pub/Sub messages from slaves. 2013-11-12 23:07:33 +01:00
antirez
dfa5f8b777 Sentinel: change event name when converting master to slave. 2013-11-12 23:00:17 +01:00
antirez
24158d1488 Sentinel: added config-epoch to SENTINEL masters output. 2013-11-12 17:22:04 +01:00
antirez
d2bc6dc39a Sentinel: new failover algo, desync slaves and update config epoch. 2013-11-12 17:07:31 +01:00
antirez
4a128b949d Sentinel: when starting failover seek for votes ASAP. 2013-11-12 16:38:02 +01:00
antirez
e6b9d5e97e Sentinel: +new-epoch events. 2013-11-12 13:35:25 +01:00
antirez
54c447be52 Sentinel: wait some time between failover attempts. 2013-11-12 13:30:31 +01:00
antirez
ab4b2ec88f Sentinel: allow to vote for myself. 2013-11-12 11:32:40 +01:00
antirez
b6b65b29c0 Sentinel: fix PUBLISH to masters and slaves. 2013-11-12 11:12:48 +01:00
antirez
90ab62fd5e Sentinel: epoch introduced in leader vote. 2013-11-12 11:09:35 +01:00
antirez
8c1bf9a2bd Sentinel: leadership handling changes WIP.
Changes to leadership handling.

Now the leader gets selected by every Sentinel, for a specified epoch,
when the SENTINEL is-master-down-by-addr is sent.

This command now includes the runid and the currentEpoch of the instance
seeking for a vote. The Sentinel only votes a single time in a given
epoch.

Still a work in progress, does not even compile at this stage.
2013-11-11 18:30:14 +01:00
antirez
0bac36d0a1 Sentinel: handle Hello messages received via slaves correctly.
Even when messages are received via the slave, we should perform
operations (like adding a new Sentinel) in the context of the master.
2013-11-11 17:12:27 +01:00
antirez
9e1b27d49e Sentinel: remove code not useful in the new design. 2013-11-11 12:06:11 +01:00
antirez
b93b0adc89 Sentinel: epoch introduced.
Sentinel state now includes the idea of current epoch and config epoch.
In the Hello message, that is now published both on masters and slaves,
a Sentinel no longer just advertises itself but also broadcasts its
current view of the configuration: the master name / ip / port and its
current epoch.

Sentinels receiving such information switch to the new master if the
configuration epoch received is newer and the ip / port of the master
are indeed different compared to the previos ones.
2013-11-11 11:05:58 +01:00
antirez
80da056c29 Sentinel: sentinelSendSlaveOf() was missing a var and the prototype. 2013-11-06 11:23:53 +01:00
antirez
23800d9e49 Sentinel: increment pending_commands counter in two more places.
AUTH and SCRIPT KILL were sent without incrementing the pending commands
counter. Clearly this needs some kind of wrapper doing it for the caller
in order to be less bug prone.
2013-11-06 11:21:44 +01:00
antirez
671c1dfb56 Sentinel: always send CONFIG REWRITE when changing instance role.
This change makes Sentinel less fragile about a number of failure modes.

This commit also fixes a different bug as a side effect, SLAVEOF command
was sent multiple times without incrementing the pending commands count.
2013-11-06 11:13:27 +01:00
antirez
fb9b76fe14 Cluster: slave node now uses the new protocol to get elected. 2013-09-26 11:13:17 +02:00
antirez
6ea8e0949c sdsrange() does not need to return a value.
Actaully the string is modified in-place and a reallocation is never
needed, so there is no need to return the new sds string pointer as
return value of the function, that is now just "void".
2013-07-24 11:21:39 +02:00
antirez
73ae8558c1 Sentinel: embed IPv6 address into [] when naming slave/sentinel instance. 2013-07-11 16:38:40 +02:00
antirez
3fc7f324d2 Sentinel: use comma as separator to publish hello messages.
We use comma to play well with IPv6 addresses, but the implementation is
still able to parse the old messages separated by colons.
2013-07-11 16:37:47 +02:00
antirez
5c5ebb0b9a Sentinel: make sure published addr/id buffer is large enough.
With ipv6 support we need more space, so we account for the IP address
max size plus what we need for the Run ID, port, flags.
2013-07-10 14:44:38 +02:00
antirez
631d656a94 All IP string repr buffers are now REDIS_IP_STR_LEN bytes. 2013-07-09 11:32:52 +02:00
Geoff Garside
e04fdf26fe Add IPv6 support to sentinel.c.
This has been done by exposing the anetSockName() function anet.c
to be used when the sentinel is publishing its existence to the masters.

This implementation is very unintelligent as it will likely break if used
with IPv6 as the nested colons will break any parsing of the PUBLISH string
by the master.
2013-07-08 16:08:36 +02:00
Geoff Garside
2345cee335 Update calls to anetResolve to include buffer size 2013-07-08 15:57:22 +02:00