Commit Graph

186 Commits

Author SHA1 Message Date
Salvatore Sanfilippo
61fb441c8c Merge pull request #2386 from inkel/sentinel-add-client-command
Support CLIENT commands in Redis Sentinel
2015-03-13 18:23:36 +01:00
Salvatore Sanfilippo
e00cb78f67 Merge pull request #2054 from mattsta/fix-set-sentinel-quorum
Sentinel: Add initial quorum bounds check
2015-02-25 10:09:40 +01:00
Salvatore Sanfilippo
46bd13b806 Merge pull request #1966 from mattsta/fix-sentinel-info
Sentinel: Improve INFO command behavior
2015-02-24 17:20:09 +01:00
Leandro López (inkel)
d5e01519e5 Support CLIENT commands in Redis Sentinel
When trying to debug sentinel connections or max connections errors it
would be very useful to have the ability to see the list of connected
clients to a running sentinel. At the same time it would be very helpful
to be able to name each sentinel connection or kill offending clients.

This commits adds the already defined CLIENT commands back to Redis
Sentinel.
2015-02-02 18:16:18 -03:00
Matt Stancliff
d956d809ac Fix three simple clang analyzer warnings 2014-12-23 09:31:04 -05:00
Matt Stancliff
ad41a7c404 Add addReplyBulkSds() function
Refactor a common pattern into one function so we don't
end up with copy/paste programming.
2014-12-23 09:31:02 -05:00
Matt Stancliff
32bba43ac7 Add 'age' value to SENTINEL INFO-CACHE 2014-12-22 21:17:04 -05:00
antirez
bbf0736c4e sdsformatip() removed.
Specialized single-use function. Not the best match for sds.c btw.
Also genClientPeerId() is no longer static: we need symbols.
2014-12-11 18:29:04 +01:00
antirez
ce269ad3c5 AnetFormatIP(): renamed, commented, now sticks to IP:port format.
A few code style changes + consistent format: not nice for humans but
better for parsers.
2014-12-11 18:20:30 +01:00
Matt Stancliff
391fc9b633 Sentinel: Improve INFO command behavior
Improvements:
  - Return empty string if asking for non-existing section (INFO foo)
  - Fix potential memory leak (caused by sdsempty() then returned if >2 args)
  - Clean up argument parsing
  - Allow "all" as valid section (same as "default" or zero args currently)
  - Move strcasecmp to end of evaluation chain in conditionals

Also, since we're C99, I moved some variable declarations to be closer
to where they are actually used (saves us from needing to free an empty info
if detect argument errors up front).

Closes #1915
Closes #1966
2014-12-11 10:49:16 -05:00
Matt Stancliff
491881e13b Cleanup all IP formatting code
Instead of manually checking for strchr(n,':') everywhere,
we can use our new centralized IP formatting functions.
2014-12-11 10:12:18 -05:00
antirez
d8158771b5 Sentinel: INFO-CACHE comments reworked a bit.
Changed in order to make them more review friendly, based on the
experience of reviewing the code myself.
2014-12-10 11:15:13 +01:00
antirez
c83a917286 Sentinel: INFO-CACHE GCC minior code cleanup.
I guess the initial goal of the initialization was to suppress GCC
warning, but if we have to initialize, we can do it with the base-case
value instead of NULL which is never retained.
2014-12-10 11:12:26 +01:00
antirez
0422321617 Sentinel: removed useless flag var from INFO-CACHE. 2014-12-10 11:05:37 +01:00
antirez
7576a27d58 Sentinel: INFO-CACHE reply format command shortened. 2014-12-10 11:04:24 +01:00
Matt Stancliff
f8c73e38b5 Add SENTINEL INFO-CACHE [masters...]
Sentinel queries the INFO from every master and from every replica of
every master.

We can cache the INFO results in Sentinel so Sentinel can be a single
place to quickly get all INFO output for an entire Sentinel monitoring
group.

This commit gives us SENTINEL INFO-CACHE in two forms:
  - SENTINEL INFO-CACHE — returns all masters and all replicas
  - SENTINEL INFO-CACHE master0 master1 ... masterN — vararg specify masters

Results are returned as a multibulk reply with two top-level entries
for each master.  The first entry for each master is the name of the master.
The second entry is a nested multibulk reply with the contents of INFO,
first for the master, then an additional entry for each of the
replicas.
2014-11-20 16:56:30 -05:00
Matt Stancliff
6739ef4447 Sentinel: Add initial quorum bounds check
Fixes #2054
2014-11-20 16:30:17 -05:00
Matt Stancliff
12d0195b30 Clean up text throughout project
- Remove trailing newlines from redis.conf
  - Fix comment misspelling
  - Clarifies zipEncodeLength usage and a C API mention (#1243, #1242)
  - Fix cluster typos (inspired by @papanikge #1507)
  - Fix rewite -> rewrite in a few places (inspired by #682)

Closes #1243, #1242, #1507
2014-09-29 06:49:07 -04:00
antirez
f5efa9bbad Sentinel sentinelGetLeader() top comment improved. 2014-09-11 19:27:45 +02:00
antirez
f4be6f16f2 Sentinel: fix computation of total number of votes.
The code to check the number of voters was never updated to follow the new
Sentinel specification, so the number of voters was computed using only
the set of Sentinels that provided a vote.

This means that there is a changing majority on partitions, even if
usually the issue is not triggered because of the configured quorum
check (what was broken was the other implicit check that requires anyway
half of the known sentinels to agree in order to start a failover).
2014-09-11 18:53:31 +02:00
antirez
0a6cbabb26 Sentinel: don't set announce-ip if is empty. 2014-09-04 11:45:58 +02:00
antirez
cd576a1aab Sentinel: announce ip/port changes + rewrite.
The original implementation was modified in order to allow to
selectively announce a different IP or port, and to rewrite the two
options in the config file after a rewrite.
2014-09-04 11:23:31 +02:00
Dara Kong
3d939266be sentinel: Decouple bind address from address sent to other sentinels
There are instances such as EC2 where the bind address is private
(behind a NAT) and cannot be accessible from WAN.

https://groups.google.com/d/msg/redis-db/PVVvjO4nMd0/P3oWC036v3cJ
2014-09-04 10:54:21 +02:00
Matt Stancliff
67e414c7b8 Sentinel: Abort Hello quicker if not connected
We can save a little work by aborting when we enter the function
if we're disconnected.
2014-09-01 16:34:06 +02:00
Matt Stancliff
7e63dd23f3 Rename two 'buf' vars to 'ip' for better clarity
Clearly ip[32] is wrong, but it's less clear that buf[32] was wrong
without further reading.
2014-08-25 10:16:20 +02:00
Eiichi Sato
c38884ceac Sentinel: fix bufsize to support IPv6 address
Closes #1914
2014-08-25 10:15:43 +02:00
antirez
edca2b14d2 Remove warnings and improve integer sign correctness. 2014-08-13 11:44:38 +02:00
antirez
e3bae84606 Sentinel implementation of ROLE. 2014-06-23 12:07:41 +02:00
Matt Stancliff
5cd83ef539 Sentinel: bind source address
Some deployments need traffic sent from a specific address.  This
change uses the same policy as Cluster where the first listed bindaddr
becomes the source address for outgoing Sentinel communication.

Fixes #1667
2014-06-23 11:44:35 +02:00
antirez
41f12ac988 Sentinel: send hello messages ASAP after config change.
Eventual configuration convergence is guaranteed by our periodic hello
messages to all the instances, however when there are important notices
to share, better make a phone call. With this commit we force an hello
message to other Sentinal and Redis instances within the next 100
milliseconds of a config update, which is practically better than
waiting a few seconds.
2014-06-19 15:17:06 +02:00
antirez
94bc467328 Sentinel: handle SRI_PROMOTED flag correctly.
Lack of check of the SRI_PROMOTED flag caused Sentienl to act with the
promoted slave turned into a master during failover like if it was a
normal instance.

Normally this problem was not apparent because during real failovers the
old master is down so the bugged code path was not entered, however with
manual failovers via the SENTINEL FAILOVER command, the problem was
easily triggered.

This commit prevents promoted slaves from getting reconfigured, moreover
we now explicitly check that during a failover the slave turning into a
master is the one we selected for promotion and not a different one.
2014-06-19 10:28:27 +02:00
antirez
2c17591224 Sentinel: send SLAVEOF with MULTI, CLIENT KILL, CONFIG REWRITE.
This implements the new Sentinel-Client protocol for the Sentinel part:
now instances are reconfigured using a transaction that ensures that the
config is rewritten in the target instance, and that clients lose the
connection with the instance, in order to be forced to: ask Sentinel,
reconnect to the instance, and verify the instance role with the new
ROLE command.
2014-06-17 11:03:21 +02:00
antirez
8a588ac14d More trailing spaces in sentinel.c removed. 2014-05-28 15:46:05 +02:00
antirez
01e3f9ba1d Remove trailing spaces from sentinel.c. 2014-05-20 14:22:42 +02:00
antirez
2102778606 Sentinel: log when a failover will be attempted again.
When a Sentinel performs a failover (successful or not), or when a
Sentinel votes for a different Sentinel trying to start a failover, it
sets a min delay before it will try to get elected for a failover.

While not strictly needed, because if multiple Sentinels will try
to failover the same master at the same time, only one configuration
will eventually win, this serialization is practically very useful.
Normal failovers are cleaner: one Sentinel starts to failover, the
others update their config when the Sentinel performing the failover
is able to get the selected slave to move from the role of slave to the
one of master.

However currently this timeout was implicit, so users could see
Sentinels not reacting, after a failed failover, for some time, without
giving any feedback in the logs to the poor sysadmin waiting for clues.

This commit makes Sentinels more verbose about the delay: when a master
is down and a failover attempt is not performed because the delay has
still not elaped, something like that will be logged:

    Next failover delay: I will not start a failover
    before Thu May  8 16:48:59 2014
2014-05-08 16:38:53 +02:00
antirez
931beae9b0 Sentinel: generate +config-update-from event when a new config is received.
This event makes clear, before the switch-master event is generated,
that a Sentinel received a configuration update from another Sentinel.
2014-05-08 15:59:34 +02:00
antirez
35667d75c3 Fixed undefined variable value with certain code paths.
In sentinelFlushConfig() fd could be undefined when the following if
statement was true:

        if (rewrite_status == -1) goto werr;

This could cause random file descriptors to get closed.
2014-03-24 21:07:44 +01:00
Matt Stancliff
4290455145 Sentinel: Notify user when config can't be saved 2014-03-24 13:54:14 -04:00
Salvatore Sanfilippo
906c4d77c0 Merge pull request #1617 from mattsta/remove-unused-warning
Cluster: remove variable causing warning
2014-03-24 18:33:22 +01:00
Matt Stancliff
67ed5f00aa Cluster: remove variable causing warning
GCC-4.9 warned about this, but clang didn't.

This commit fixes warning:
sentinel.c: In function 'sentinelReceiveHelloMessages':
sentinel.c:2156:43: warning: variable 'master' set but not used [-Wunused-but-set-variable]
     sentinelRedisInstance *ri = c->data, *master;
2014-03-18 15:35:09 -04:00
antirez
b9e90a70fa Sentinel: sentinelRefreshInstanceInfo() minor refactoring.
Test sentinel.tilt condition on top and return if it is true.
This allows to remove the check for the tilt condition in the remaining
code paths of the function.
2014-03-18 15:35:47 +01:00
antirez
218cc5fc39 Sentinel: propagate down-after-ms changes to slaves and sentinels. 2014-03-18 14:37:44 +01:00
antirez
bb6d850160 Sentinel: down-after-milliseconds is not master-specific.
addReplySentinelRedisInstance() modified so that this field is displayed
for all the kind of instances: Sentinels, Masters, Slaves.
2014-03-18 11:21:17 +01:00
antirez
ae0b7680b3 Sentinel failure detection implementation improved.
Failure detection in Sentinel is ping-pong based. It used to work by
remembering the last time a valid PONG reply was received, and checking
if the reception time was too old compared to the current current time.

PINGs were sent at a fixed interval of 1 second.

This works in a decent way, but does not scale well when we want to set
very small values of "down-after-milliseconds" (this is the node
timeout basically).

This commit reiplements the failure detection making a number of
changes. Some changes are inspired to Redis Cluster failure detection
code:

* A new last_ping_time field is added in representation of instances.
  If non zero, we have an active ping that was sent at the specified
  time. When a valid reply to ping is received, the field is zeroed
  again.
* last_ping_time is not reset when we reconnect the link or send a new
  ping, so from our point of view it represents the time we started
  waiting for the instance to reply to our pings without receiving a
  reply.
* last_ping_time is now used in order to check if the instance is
  timed out. This means that we can have a node timeout of 100
  milliseconds and yet the system will work well since the new check is
  not bound to the period used to send pings.
* Pings are now sent every second, or often if the value of
  down-after-milliseconds is less than one second. With a lower limit of
  10 HZ ping frequency.
* Link reconnection code was improved. This is used in order to try to
  reconnect the link when we are at 50% of the node timeout without a
  valid reply received yet. However the old code triggered unnecessary
  reconnections when the node timeout was very small. Now that should be
  ok.

The new code passes the tests but more testing is needed and more unit
tests stressing the failure detector, so currently this is merged only
in the unstable branch.
2014-03-17 18:33:45 +01:00
antirez
3a2ff55617 Sentinel: use CLIENT SETNAME when connecting to Redis.
This makes debugging / monitoring of Sentinels simpler since you can
identify sentinels in CLIENT LIST output of Redis instances.
2014-03-15 14:59:23 +01:00
Matt Stancliff
584052ee6b Fix segfault from accessing array out of bounds
argc == 2; argv[2] == crash
2014-03-14 17:38:05 -04:00
antirez
ed813863f0 Sentinel: be safe under crash-recovery assumptions.
Sentinel's main safety argument is that there are no two configurations
for the same master with the same version (configuration epoch).

For this to be true Sentinels require to be authorized by a majority.
Additionally Sentinels require to do two important things:

* Never vote again for the same epoch.
* Never exchange an old vote for a fresh one.

The first prerequisite, in a crash-recovery system model, requires to
persist the master->leader_epoch on durable storage before to reply to
messages. This was not the case.

We also make sure to persist the current epoch in order to never reply
to stale votes requests from other Sentinels, after a recovery.

The configuration is persisted by making use of fsync(), this is
considered in the context of this code a good enough guarantee that
after a restart our durable state is restored, however this may not
always be the case depending on the kind of hardware and operating
system used.
2014-03-14 14:58:44 +01:00
antirez
365094028b Sentinel: fake PUBLISH command to receive HELLO messages.
Now the way HELLO messages are received is unified.
Now it is no longer needed for Sentinels to converge to the higher
configuration for a master to be able to chat via some Redis instance,
the are able to directly exchanges configurations.

Note that this commit does not include the (trivial) change needed to
send HELLO messages to Sentinel instances as well, since for an error I
committed the change in the previous commit that refactored hello
messages processing into a separated function.
2014-03-14 11:07:42 +01:00
antirez
9dfe426fc8 Sentinel: HELLO processing refactored into sentinelProcessHelloMessage(). 2014-03-14 11:07:42 +01:00
Jan-Erik Rediger
5f5118bdad Small typo fixed 2014-03-05 00:41:02 +01:00