When we are preparing an handshake with the slave we can't touch the
connection buffer as it'll be used to accumulate differences between
the sent RDB file and what arrives next from clients.
So in short we can't use addReply() family functions.
However we just use write(2) because we know that the socket buffer is
empty, since a prerequisite for SYNC to work is that the static buffer
and the output list are empty, and in general it is not expected that a
client SYNCs after doing some heavy I/O with the master.
However a short write connection is explicitly handled to avoid
fragility (we simply close the connection and the slave will retry).
SELECT was still transmitted to slaves using the inline protocol, that
is conceived mostly for humans to type into telnet sessions, and is
notably not understood by redis-cli --slave.
Now the new protocol is used instead.
A Redis master sends PING commands to slaves from time to time: doing
this ensures that even if absence of writes, the master->slave channel
remains active and the slave can feel the master presence, instead of
closing the connection for timeout.
This commit changes the way PINGs are sent to slaves in order to use the
standard interface used to replicate all the other commands, that is,
the function replicationFeedSlaves().
With this change the stream of commands sent to every slave is exactly
the same regardless of their exact state (Transferring RDB for first
synchronization or slave already online). With the previous
implementation the PING was only sent to online slaves, with the result
that the output stream from master to slaves was not identical for all
the slaves: this is a problem if we want to implement partial resyncs in
the future using a global replication stream offset.
TL;DR: this commit should not change the behaviour in practical terms,
but is just something in preparation for partial resynchronization
support.
Before this commit every Redis slave had its own selected database ID
state. This was not actually useful as the emitted stream of commands
is identical for all the slaves.
Now the the currently selected database is a global state that is set to
-1 when a new slave is attached, in order to force the SELECT command to
be re-emitted for all the slaves.
This change is useful in order to implement replication partial
resynchronization in the future, as makes sure that the stream of
commands received by slaves, including SELECT commands, are exactly the
same for every slave connected, at any time.
In this way we could have a global offset that can identify a specific
piece of the master -> slaves stream of commands.
Further details from @antirez:
It was reported by @StopForumSpam on Twitter that the Redis replication
link was strangely using multiple TCP packets for multiple commands.
This wastes a lot of bandwidth and is due to the TCP_NODELAY option we
enable on the socket after accepting a new connection.
However the master -> slave channel is a one-way channel since Redis
replication is asynchronous, so there is no point in trying to reduce
the latency, we should aim to reduce the bandwidth. For this reason this
commit introduces the ability to disable the nagle algorithm on the
socket after a successful SYNC.
This feature is off by default because the delay can be up to 40
milliseconds with normally configured Linux kernels.
Issue #828 shows how Redis was not correctly undoing a non-blocking
connection attempt with the previous master when the master was set to a
new address using the SLAVEOF command.
This was also a result of lack of refactoring, so now there is a
function to cancel the non blocking handshake with the master.
The new function is now used when SLAVEOF NO ONE is called or when
SLAVEOF is used to set the master to a different address.
REDIS_HZ is the frequency our serverCron() function is called with.
A more frequent call to this function results into less latency when the
server is trying to handle very expansive background operations like
mass expires of a lot of keys at the same time.
Redis 2.4 used to have an HZ of 10. This was good enough with almost
every setup, but the incremental key expiration algorithm was working a
bit better under *extreme* pressure when HZ was set to 100 for Redis
2.6.
However for most users a latency spike of 30 milliseconds when million
of keys are expiring at the same time is acceptable, on the other hand a
default HZ of 100 in Redis 2.6 was causing idle instances to use some
CPU time compared to Redis 2.4. The CPU usage was in the order of 0.3%
for an idle instance, however this is a shame as more energy is consumed
by the server, if not important resources.
This commit introduces HZ as a runtime parameter, that can be queried by
INFO or CONFIG GET, and can be modified with CONFIG SET. At the same
time the default frequency is set back to 10.
In this way we default to a sane value of 10, but allows users to
easily switch to values up to 500 for near real-time applications if
needed and if they are willing to pay this small CPU usage penalty.
The new message now contains an hint about modifying the repl-timeout
configuration directive if the problem persists.
This should normally not be needed, because while the master generates
the RDB file it makes sure to send newlines to the replication channel
to prevent timeouts. However there are times when masters running on
very slow systems can completely stop for seconds during the RDB saving
process. In such a case enlarging the timeout value can fix the problem.
See issue #695 for an example of this problem in an EC2 deployment.
During the first synchronization step of the replication process, a Redis
slave connects with the master in a non blocking way. However once the
connection is established the replication continues sending the REPLCONF
command, and sometimes the AUTH command if needed. Those commands are
send in a partially blocking way (blocking with timeout in the order of
seconds).
Because it is common for a blocked master to accept connections even if
it is actually not able to reply to the slave requests, it was easy for
a slave to block if the master had serious issues, but was still able to
accept connections in the listening socket.
For this reason we now send an asynchronous PING request just after the
non blocking connection ended in a successful way, and wait for the
reply before to continue with the replication process. It is very
unlikely that a master replying to PING can't reply to the other
commands.
This solution was proposed by Didier Spezia (Thanks!) so that we don't
need to turn all the replication process into a non blocking affair, but
still the probability of a slave blocked is minimal even in the event of
a failing master.
Also we now use getsockopt(SO_ERROR) in order to check errors ASAP
in the event handler, instead of waiting for actual I/O to return an
error.
This commit fixes issue #632.
This fixes issue #539.
Basically if there is enough free memory the OS may buffer the RDB file
that the slave transfers on disk from the master. The file may
actually be flused on disk at once by the operating system when it gets
closed by Redis, causing the close system call to block for a long time.
This patch is a modified version of one provided by yoav-steinberg of
@garantiadata (the original version was posted in the issue #539
comments), and tries to flush the OS buffers incrementally (every 8 MB
of loaded data).
REDIS_REPL_PING_SLAVE_PERIOD controls how often the master should
transmit a heartbeat (PING) to its slaves. This period, which defaults
to 10, is measured in seconds.
Redis 2.4 masters used to ping their slaves every ten seconds, just like
it says on the tin.
The Redis 2.6 masters I have been experimenting with, on the other hand,
ping their slaves *every second*. (master_last_io_seconds_ago never
approaches 10.) I think the ping period was inadvertently slashed to
one-tenth of its nominal value around the time REDIS_HZ was introduced.
This commit reintroduces correct ping schedule behaviour.
The REPLCONF command is an internal command (not designed to be directly
used by normal clients) that allows a slave to set some replication
related state in the master before issuing SYNC to start the
replication.
The initial motivation for this command, and the only reason currently
it is used by the implementation, is to let the slave instance
communicate its listening port to the slave, so that the master can
show all the slaves with their listening ports in the "replication"
section of the INFO output.
This allows clients to auto discover and query all the slaves attached
into a master.
Currently only a single option of the REPLCONF command is supported, and
it is called "listening-port", so the slave now starts the replication
process with something like the following chat:
REPLCONF listening-prot 6380
SYNC
Note that this works even if the master is an older version of Redis and
does not understand REPLCONF, because the slave ignores the REPLCONF
error.
In the future REPLCONF can be used for partial replication and other
replication related features where there is the need to exchange
information between master and slave.
NOTE: This commit also fixes a bug: the INFO outout already carried
information about slaves, but the port was broken, and was obtained
with getpeername(2), so it was actually just the ephemeral port used
by the slave to connect to the master as a client.
The user @jokea noticed that the following line of code into
replication.c made little sense:
addReplySds(slave,sdsempty());
Investigating a bit I found that this was introduced by commit 6208b3a7
three years ago in the early stages of Redis. The code apparently is not
useful at all, so I'm removing it.
This change will not be backported into 2.4 so that in the rare case
this should introduce a bug, we'll have a chance to detect it into the
development branch. However following the code path it seems like the
code is not useful at all, so the risk is truly small.
networking related stuff moved into networking.c
moved more code
more work on layout of source code
SDS instantaneuos memory saving. By Pieter and Salvatore at VMware ;)
cleanly compiling again after the first split, now splitting it in more C files
moving more things around... work in progress
split replication code
splitting more
Sets split
Hash split
replication split
even more splitting
more splitting
minor change