This prevents the kernel from putting too much stuff in the output
buffers, doing too heavy I/O all at once. So the goal of this commit is
to split the disk pressure due to the AOF rewrite process into smaller
spikes.
Please see issue #1019 for more information.
We used to copy this value into the server.cluster structure, however this
was not necessary.
The reason why we don't directly use server.cluster->node_timeout is
that things that can be configured via redis.conf need to be directly
available in the server structure as server.cluster is allocated later
only if needed in order to reduce the memory footprint of non-cluster
instances.
When a BGSAVE fails, Redis used to flood itself trying to BGSAVE at
every next cron call, that is either 10 or 100 times per second
depending on configuration and server version.
This commit does not allow a new automatic BGSAVE attempt to be
performed before a few seconds delay (currently 5).
This avoids both the auto-flood problem and filling the disk with
logs at a serious rate.
The five seconds limit, considering a log entry of 200 bytes, will use
less than 4 MB of disk space per day that is reasonable, the sysadmin
should notice before of catastrofic events especially since by default
Redis will stop serving write queries after the first failed BGSAVE.
This fixes issue #849
A slave node set this flag for itself when, after receiving authorization
from the majority of nodes, it turns itself into a master.
At the same time now this flag is tested by nodes receiving a PING
message before reconfiguring after a failover event. This makes the
system more robust: even if currently there is no way to manually turn
a slave into a master it is possible that we'll have such a feature in
the future, or that simply because of misconfiguration a node joins the
cluster as master while others believe it's a slave. This alone is now
no longer enough to trigger reconfiguration as other nodes will check
for the PROMOTED flag.
The PROMOTED flag is cleared every time the node is turned back into a
replica of some other node.
Sender flags were not propagated for the sender, but only for nodes in
the gossip section. This is odd and in the next commits we'll need to
get updated flags for the sender node, so this commit adds a new field
in the cluster messages header.
The message header is the same size as we reused some free space that
was marked as 'unused' because of alignment concerns.
Redis Cluster can cope with a minority of nodes not informed about the
failure of a master in time for some reason (netsplit or node not
functioning properly, blocked, ...) however to wait a few seconds before
to start the failover will make most "normal" failovers simpler as the
FAIL message will propagate before the slave election happens.
This is the first step to lower the CPU usage when many databases are
configured. The other is to also process a limited number of DBs per
call in the active expire cycle.
A new server.orig_commands table was added to the server structure, this
contains a copy of the commant table unaffected by rename-command
statements in redis.conf.
A new API lookupCommandOrOriginal() was added that checks both tables,
new first, old later, so that rewriteClientCommandVector() and friends
can lookup commands with their new or original name in order to fix the
client->cmd pointer when the argument vector is renamed.
This fixes the segfault of issue #986, but does not fix a wider range of
problems resulting from renaming commands that actually operate on data
and are registered into the AOF file or propagated to slaves... That is
command renaming should be handled with care.
This is the unix time at which we set the FAIL flag for the node.
It is only valid if FAIL is set.
The idea is to use it in order to make the cluster more robust, for
instance in order to revert a FAIL state if it is long-standing but
still slots are assigned to this node, that is, no one is going to fix
these slots apparently.
Before a relatively slow popcount() operation was needed every time we
needed to get the number of slots served by a given cluster node.
Now we just need to check an integer that is taken in sync with the
bitmap.
This commit allows Redis to set a process name that includes the binding
address and the port number in order to make operations simpler.
Redis children processes doing AOF rewrites or RDB saving change the
name into redis-aof-rewrite and redis-rdb-bgsave respectively.
This in general makes harder to kill the wrong process because of an
error and makes simpler to identify saving children.
This feature was suggested by Arnaud GRANAL in the Redis Google Group,
Arnaud also pointed me to the setproctitle.c implementation includeed in
this commit.
This feature should work on all the Linux, OSX, and all the three major
BSD systems.
A §Redis Cluster node used to mark a node as failing when itself
detected a failure for that node, and a single acknowledge was received
about the possible failure state.
The new API will be used in order to possible to require that N other
nodes have a PFAIL or FAIL state for a given node for a node to set it
as failing.
This makes us able to avoid allocating the cluster state structure if
cluster is not enabled, but still we can handle the configuration
directive that sets the cluster config filename.
A Redis master sends PING commands to slaves from time to time: doing
this ensures that even if absence of writes, the master->slave channel
remains active and the slave can feel the master presence, instead of
closing the connection for timeout.
This commit changes the way PINGs are sent to slaves in order to use the
standard interface used to replicate all the other commands, that is,
the function replicationFeedSlaves().
With this change the stream of commands sent to every slave is exactly
the same regardless of their exact state (Transferring RDB for first
synchronization or slave already online). With the previous
implementation the PING was only sent to online slaves, with the result
that the output stream from master to slaves was not identical for all
the slaves: this is a problem if we want to implement partial resyncs in
the future using a global replication stream offset.
TL;DR: this commit should not change the behaviour in practical terms,
but is just something in preparation for partial resynchronization
support.
Before this commit every Redis slave had its own selected database ID
state. This was not actually useful as the emitted stream of commands
is identical for all the slaves.
Now the the currently selected database is a global state that is set to
-1 when a new slave is attached, in order to force the SELECT command to
be re-emitted for all the slaves.
This change is useful in order to implement replication partial
resynchronization in the future, as makes sure that the stream of
commands received by slaves, including SELECT commands, are exactly the
same for every slave connected, at any time.
In this way we could have a global offset that can identify a specific
piece of the master -> slaves stream of commands.
Further details from @antirez:
It was reported by @StopForumSpam on Twitter that the Redis replication
link was strangely using multiple TCP packets for multiple commands.
This wastes a lot of bandwidth and is due to the TCP_NODELAY option we
enable on the socket after accepting a new connection.
However the master -> slave channel is a one-way channel since Redis
replication is asynchronous, so there is no point in trying to reduce
the latency, we should aim to reduce the bandwidth. For this reason this
commit introduces the ability to disable the nagle algorithm on the
socket after a successful SYNC.
This feature is off by default because the delay can be up to 40
milliseconds with normally configured Linux kernels.
When keyspace events are enabled, the overhead is not sever but
noticeable, so this commit introduces the ability to select subclasses
of events in order to avoid to generate events the user is not
interested in.
The events can be selected using redis.conf or CONFIG SET / GET.
decrRefCount used to get its argument as a void* pointer in order to be
used as destructor where a 'void free_object(void*)' prototype is
expected. However this made simpler to introduce bugs by freeing the
wrong pointer. This commit fixes the argument type and introduces a new
wrapper called decrRefCountVoid() that can be used when the void*
argument is needed.
Sometimes it is much simpler to debug complex Redis installations if it
is possible to assign clients a name that is displayed in the CLIENT
LIST output.
This is the case, for example, for "leaked" connections. The ability to
provide a name to the client makes it quite trivial to understand what
is the part of the code implementing the client not releasing the
resources appropriately.
Behavior:
CLIENT SETNAME: set a name for the client, or remove the current
name if an empty name is set.
CLIENT GETNAME: get the current name, or a nil.
CLIENT LIST: now displays the client name if any.
Thanks to Mark Gravell for pushing this idea forward.
REDIS_HZ is the frequency our serverCron() function is called with.
A more frequent call to this function results into less latency when the
server is trying to handle very expansive background operations like
mass expires of a lot of keys at the same time.
Redis 2.4 used to have an HZ of 10. This was good enough with almost
every setup, but the incremental key expiration algorithm was working a
bit better under *extreme* pressure when HZ was set to 100 for Redis
2.6.
However for most users a latency spike of 30 milliseconds when million
of keys are expiring at the same time is acceptable, on the other hand a
default HZ of 100 in Redis 2.6 was causing idle instances to use some
CPU time compared to Redis 2.4. The CPU usage was in the order of 0.3%
for an idle instance, however this is a shame as more energy is consumed
by the server, if not important resources.
This commit introduces HZ as a runtime parameter, that can be queried by
INFO or CONFIG GET, and can be modified with CONFIG SET. At the same
time the default frequency is set back to 10.
In this way we default to a sane value of 10, but allows users to
easily switch to values up to 500 for near real-time applications if
needed and if they are willing to pay this small CPU usage penalty.
To store the keys we block for during a blocking pop operation, in the
case the client is blocked for more data to arrive, we used a simple
linear array of redis objects, in the blockingState structure:
robj **keys;
int count;
However in order to fix issue #801 we also use a dictionary in order to
avoid to end in the blocked clients queue for the same key multiple
times with the same client.
The dictionary was only temporary, just to avoid duplicates, but since
we create / destroy it there is no point in doing this duplicated work,
so this commit simply use a dictionary as the main structure to store
the keys we are blocked for. So instead of the previous fields we now
just have:
dict *keys;
This simplifies the code and reduces the work done by the server during
a blocking POP operation.