SUSv3 says that:
The useconds argument shall be less than one million. If the value of
useconds is 0, then the call has no effect.
and actually NetBSD's implementation rejects such a value with EINVAL.
Use nanosleep(), which has no such limitation, instead.
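A minimal sketch of the replacement, assuming the delay is expressed in microseconds as with usleep(); the helper name is just illustrative:
#include <time.h>
/* Sleep for 'usec' microseconds using nanosleep(), which unlike
 * usleep() under SUSv3 also accepts delays of one second or more. */
static void sleep_microseconds(long long usec) {
    struct timespec req;
    req.tv_sec = usec / 1000000;
    req.tv_nsec = (usec % 1000000) * 1000;
    nanosleep(&req, NULL);
}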
NetBSD-current's libc has a function named popcount.
Hiding these extensions using feature macros is not possible because
Redis uses other extensions covered by the same feature macro,
e.g. inet_aton.
Also the logfile option was modified to always have an explicit value
and to log to stdout when an empty string is used as the log file.
Previously there was special handling of the string "stdout" that set
the logfile to NULL, and this always required special handling elsewhere.
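For example, in redis.conf (the path below is just illustrative):
logfile ""
logs to standard output, while
logfile /var/log/redis.log
logs to the given file.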
This reverts commit 2c75f2cf1a.
After further analysis, it is very unlikely that we'll raise the
string size limit above 512 MB and that, at the same time, such big
strings will be used on 32-bit systems.
Better to revert to size_t so that 32-bit processors are not forced
to use a 64-bit counter in normal operations, which is currently
completely useless.
When the PONG delay exceeds half the cluster node timeout, the link gets
disconnected (and later automatically reconnected) in order to ensure
that it's not just a dead connection issue.
However this operation is only performed if the link is old enough, in
order to avoid disconnecting the same link again and again (and, among
other problems, never receiving the PONG because of that).
Note: when the link is reconnected, the 'ping_sent' field is not updated
even if a new ping is sent using the new connection, so we can still
reliably detect a node ping timeout.
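A minimal sketch of the check, with illustrative names, all times in milliseconds, and the "old enough" threshold assumed here to be the node timeout itself:
/* Return non-zero if the link should be dropped: a PONG has been
 * pending for more than half the node timeout, and the link itself is
 * older than the node timeout, so we don't drop it again and again. */
int shouldDropLinkSketch(long long now, long long ping_sent,
                         long long link_ctime, long long node_timeout) {
    if (ping_sent == 0) return 0;                     /* no PONG pending */
    if (now - ping_sent <= node_timeout/2) return 0;  /* not late enough */
    if (now - link_ctime <= node_timeout) return 0;   /* link too young */
    return 1;
}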
This is just to make the code exactly like the above instance used for
requirepass. There is no actual change, nor did the original code
violate the Redis coding style.
There was a race condition in the AOF rewrite code that, with bad enough
timing, could cause a volatile key just about to expire to be turned
into a non-volatile key. The bug was never reported to cause actual
issues, but was found analytically by a user on the Redis mailing list:
https://groups.google.com/forum/?fromgroups=#!topic/redis-db/Kvh2FAGK4Uk
This commit fixes issue #1079.
Tilt mode was too aggressive (not processing INFO output); this
resulted in a few problems:
1) Redirections were not followed when in tilt mode. This opened a
window to misinform clients about the current master when a Sentinel
was in tilt mode and a failover happened during the time it was not
able to update the state.
2) It was possible for a Sentinel exiting tilt mode to detect a false
failover start, if a slave rebooted with a wrong configuration at
about the same time. This used to happen since in tilt mode we
lose the information that the runid changed (reboot).
Now instead the Sentinel in tilt mode will still remove the instance
from the list of slaves if it changes state AND runid at the same
time.
Both are edge conditions but the changes should overall improve the
reliability of Sentinel.
We used to always turn a master into a slave if the DEMOTE flag was set,
as this was a resurrecting master instance.
However the following race condition is possible for a Sentinel that
got partitioned or had internal issues (tilt mode), and was not able to
refresh its state in the meantime:
1) Sentinel X is running, master is instance "A".
2) Sentinel X goes down because of a network partition.
3) "A" fails, Sentinels will promote slave "B" to master.
4) "A" becomes available again, Sentinels will demote it to a slave.
5) "B" fails, the other Sentinels will promote slave "A" to master.
6) At this point Sentinel X comes back.
When "X" comes back it thinks that:
"B" is the master.
"A" is the slave to demote.
We want to avoid that Sentinel "X" will demote "A" into a slave.
We also want that Sentinel "X" will detect that the conditions changed
and will reconfigure itself to monitor the right master.
There are two main ways for the Sentinel to reconfigure itself after
this event:
1) If "B" is reachable and already configured as a slave by other
sentinels, "X" will perform a redirection to "A".
2) If the conditions to demote "A" are not met, the fact that "A"
reports itself as a master will trigger a failover detection in "X",
which will end in a reconfiguration to monitor "A".
However if the Sentinel was not reachable, its state may not be updated,
so in case it tilted, or was partitioned from the master instance of the
slave to demote, the new implementation waits some time (enough to
guarantee we can detect the new INFO, and new DOWN conditions).
If after this time the right conditions to demote the instance are
still not met, the DEMOTE flag is cleared.
Sentinel used to redirect to the master only if the instance changed
runid or it was the first time we got INFO, and a role change was
detected from master to slave.
While this is a good idea in the case of slave->master, since otherwise
we could detect a failover without good reasons just after the reboot
of a slave with a wrong configuration, in the case of the master->slave
transition it is much better to always perform the redirection for the
following reasons:
1) A Sentinel may go down for some time. When it is back online there is
no other way to understand there was a failover.
2) Pointing clients to a slave seems to always be the wrong thing to do.
3) There is no good rationale for handling things differently in that
case once an instance is rebooted (runid change).
This prevents the kernel from accumulating too much data in the output
buffers and then doing too much I/O all at once. So the goal of this commit is
to split the disk pressure due to the AOF rewrite process into smaller
spikes.
Please see issue #1019 for more information.
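A minimal sketch of the idea, with an illustrative helper and threshold, not the actual implementation:
#include <unistd.h>
#define REWRITE_AUTOSYNC_BYTES (32*1024*1024)   /* illustrative threshold */
/* Write to the rewrite file and fsync every REWRITE_AUTOSYNC_BYTES
 * written, so dirty pages are flushed in small chunks instead of in a
 * single big burst at the end of the rewrite. */
static ssize_t rewriteWriteSketch(int fd, const void *buf, size_t len) {
    static size_t written_since_sync = 0;
    ssize_t nwritten = write(fd, buf, len);
    if (nwritten <= 0) return nwritten;
    written_since_sync += (size_t)nwritten;
    if (written_since_sync >= REWRITE_AUTOSYNC_BYTES) {
        fsync(fd);
        written_since_sync = 0;
    }
    return nwritten;
}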
Previously redis-cli never tried to raise an error when an unrecognized
switch was encountered, as everything after the initial options is to be
transmitted to the server.
However this is too liberal, as there are no commands starting with "-".
So the new behavior is to produce an error if there is an unrecognized
switch starting with "-". This should not break past redis-cli usages
but should prevent broken options from being silently discarded.
As soon as the first token not starting with "-" is encountered, all the
rest is considered to be part of the command, so you can still use
strings starting with "-" as values, like in:
redis-cli --port 6380 set foo --my-value
We used to copy this value into the server.cluster structure, but this
was not necessary.
The reason why we don't directly use server.cluster->node_timeout is
that things that can be configured via redis.conf need to be directly
available in the server structure, since server.cluster is allocated
later, and only if needed, in order to reduce the memory footprint of
non-cluster instances.
Commit d728ec6 introduced the concept of sending a ping to every node
that has not been pinged for node_timeout/2 seconds.
However the code was located in a place that was not executed because of
a previous conditional causing the loop to re-iterate.
This caused false positives in node availability detection.
The current code is still not perfect, as a node may be detected to be
in PFAIL state even if it does not reply for just node_timeout/2
seconds, which is not correct. There is a plan to improve this code ASAP.
When a BGSAVE fails, Redis used to flood itself trying to BGSAVE at
every next cron call, that is either 10 or 100 times per second
depending on configuration and server version.
This commit does not allow a new automatic BGSAVE attempt to be
performed before a delay of a few seconds (currently 5) has elapsed.
This avoids both the auto-flood problem and filling the disk with
logs at a serious rate.
The five seconds limit, considering a log entry of 200 bytes, will use
less than 4 MB of disk space per day, which is reasonable: the sysadmin
should notice the problem before any catastrophic event, especially
since by default Redis will stop serving write queries after the first
failed BGSAVE.
This fixes issue #849
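A minimal sketch of the guard, with illustrative names and times in seconds:
#define BGSAVE_RETRY_DELAY 5   /* seconds, as described above */
/* Return non-zero if an automatic BGSAVE may be attempted: either the
 * last try succeeded, or at least BGSAVE_RETRY_DELAY seconds passed
 * since the last failed attempt. */
int allowAutomaticBgsaveSketch(long long now, long long last_try_time,
                               int last_try_ok) {
    if (last_try_ok) return 1;
    return now - last_try_time >= BGSAVE_RETRY_DELAY;
}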
This commit fixes two corner cases for the TTL command.
1) When the key was already logically expired (expire time older
than current time) the command returned -1 instead of -2.
2) When the key existed and the TTL was found to be exactly 0
(the key was just about to expire), the command reported -1 (that is, no
expire) instead of a TTL of zero (that is, about to expire).
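A minimal sketch of the fixed logic, assuming expire_ms is the absolute expire time in milliseconds or -1 when no expire is set:
/* Return -2 if the key is missing or logically expired, -1 if it has
 * no expire, otherwise the remaining TTL in milliseconds (possibly 0,
 * meaning "about to expire"). */
long long ttlSketch(int key_exists, long long expire_ms, long long now_ms) {
    if (!key_exists) return -2;
    if (expire_ms == -1) return -1;
    long long ttl = expire_ms - now_ms;
    return (ttl < 0) ? -2 : ttl;
}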
MULTI/EXEC is now propagated to the AOF / Slaves only once we encounter
the first command that is not a read-only one inside the transaction.
The old behavior was to always propagate an empty MULTI/EXEC block when
the transaction was composed just of read-only commands, or was even
completely empty. This created two problems:
1) It's a waste of bandwidth on the replication link and of space
inside the AOF file.
2) We used to always increment server.dirty to force the propagation of
the EXEC command, resulting in RDB saves being triggered more often
than needed.
Note: even read-only commands may trigger writes that will be
propagated: when we access a key that is found to be expired, Redis
will synthesize a DEL operation. However there is no need for this DEL
to stay inside the transaction itself, only to be correctly ordered.
So for instance something like:
MULTI
GET foo
SET key zap
EXEC
May be propagated into:
DEL foo
MULTI
SET key zap
EXEC
While the DEL is outside the transaction, the commands are delivered in
the right order and it is not possible for other commands to be inserted
between DEL and MULTI.
Redis-tools is a collection of tools, no longer maintained, that was
intended as a way to economically make sense of Redis in the pre-VMware
sponsorship era. However it included a nice redis-stat utility; this
commit imports one of its functionalities into redis-cli, as it seems
to be pretty useful.
Usage: redis-cli --stat
The output format is similar to vmstat, but with Redis-specific
information of course.
From the point of view of the monitored instance, only INFO is used in
order to grab data.
This is needed in order to colorize the output as a next step.
We use conventions in output messages such as
>>> This is an action
*** This is a warning
[ERR] This is an error
[OK] That's fine
And so forth, so that a color can be associated by checking the first
few characters of the message.
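A minimal sketch of the association; the color choices here are purely illustrative:
#include <string.h>
/* Map an output message to a color name by checking its prefix. */
const char *colorForMessageSketch(const char *msg) {
    if (strncmp(msg, ">>>", 3) == 0) return "cyan";    /* action  */
    if (strncmp(msg, "***", 3) == 0) return "yellow";  /* warning */
    if (strncmp(msg, "[ERR]", 5) == 0) return "red";   /* error   */
    if (strncmp(msg, "[OK]", 4) == 0) return "green";  /* fine    */
    return "default";
}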
When a master turns into a slave after a failover event, make sure to
clear the assigned slots before setting up the replication, as a slave
should never claim slots in an explicit way, but just take over the
master slots when replacing its master.
A slave node sets this flag for itself when, after receiving authorization
from the majority of nodes, it turns itself into a master.
At the same time now this flag is tested by nodes receiving a PING
message before reconfiguring after a failover event. This makes the
system more robust: even if currently there is no way to manually turn
a slave into a master it is possible that we'll have such a feature in
the future, or that simply because of misconfiguration a node joins the
cluster as master while others believe it's a slave. This alone is now
no longer enough to trigger reconfiguration as other nodes will check
for the PROMOTED flag.
The PROMOTED flag is cleared every time the node is turned back into a
replica of some other node.
Sender flags were not propagated for the sender, but only for nodes in
the gossip section. This is odd and in the next commits we'll need to
get updated flags for the sender node, so this commit adds a new field
in the cluster messages header.
The message header is the same size as we reused some free space that
was marked as 'unused' because of alignment concerns.
So when the failing master node is back in touch with the cluster,
instead of remaining unused it is converted into a replica of the
new master, ready to perform the failover if the new master node
fails at some point.
Note that as a side effect clients with a stale configuration are no
longer an issue either, as the node converted into a slave will not
accept queries but will redirect clients accordingly.
The code handling a master that turns into a slave, or the other way
around, was improved in order to avoid repeating the same operations.
The readability and conceptual simplicity were also improved.
Redis Cluster can cope with a minority of nodes not being informed in
time about the failure of a master for some reason (netsplit or node
not functioning properly, blocked, ...), however waiting a few seconds
before starting the failover will make most "normal" failovers simpler,
as the FAIL message will propagate before the slave election happens.
server.repl_down_since used to be initialized to the current time at
startup. This is wrong since the replication never started. Clients
testing this field to check if data is up to date should never believe
data is recent if we never ever connected to our master.
This fixes cases where the RDB file does exist but can't be accessed for
any reason. For instance, when the Redis process doesn't have enough
permissions on the file.
activeExpireCycle() tries to test just a few DBs per iteration so that
it scales if there are many configured DBs in the Redis instance.
However this commit makes it a bit smarter for when one or a few of
those DBs are under expiration pressure and there are many, many keys
to expire.
What we do is to remember if in the last iteration we had to return
because we ran out of time. In that case, in the next iteration we'll
test all the configured DBs, so that we are sure we'll test the DB
under pressure again.
Before this commit, after a mass-expire in a given DB the function
tested just a few of the next DBs, possibly empty ones, a few per
iteration, so it took a long time for the function to reach the DB
under pressure again. This resulted in a lot of memory being used by
already expired keys that were never accessed by clients.
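A minimal sketch of the idea, with illustrative names and a hypothetical helper standing in for the real expire work:
#define EXPIRE_DBS_PER_CALL 16      /* illustrative per-call limit */
int expireSomeKeysFromNextDb(void); /* hypothetical: returns non-zero
                                       when the time limit is reached */
void activeExpireCycleSketch(int dbnum) {
    static int timelimit_exit = 0;  /* did the last call run out of time? */
    int dbs_to_test = timelimit_exit ? dbnum : EXPIRE_DBS_PER_CALL;
    if (dbs_to_test > dbnum) dbs_to_test = dbnum;
    timelimit_exit = 0;
    for (int j = 0; j < dbs_to_test; j++) {
        if (expireSomeKeysFromNextDb()) {
            timelimit_exit = 1;     /* next call will scan all DBs */
            return;
        }
    }
}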
This small number of DBs is set to 16, so in the default
configuration Redis should behave exactly like in the past.
However the difference is that when the user configures a very large
number of DBs we don't do an O(N) operation, consuming a non-trivial
amount of CPU per serverCron() iteration.
This is the first step to lower the CPU usage when many databases are
configured. The other is to also process a limited number of DBs per
call in the active expire cycle.
A new server.orig_commands table was added to the server structure;
this contains a copy of the command table unaffected by rename-command
statements in redis.conf.
A new API lookupCommandOrOriginal() was added that checks both tables,
new first, old later, so that rewriteClientCommandVector() and friends
can look up commands by their new or original name in order to fix the
client->cmd pointer when the argument vector is rewritten.
This fixes the segfault of issue #986, but does not fix a wider range of
problems resulting from renaming commands that actually operate on data
and are registered into the AOF file or propagated to slaves... That is,
command renaming should be handled with care.
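A minimal sketch of the described lookup, expressed in terms of the command table dicts mentioned above (treat it as a sketch, not the literal implementation):
/* Check the current (possibly renamed) command table first, then the
 * original one, so client->cmd can be fixed even when rename-command
 * was used in redis.conf. */
struct redisCommand *lookupCommandOrOriginalSketch(sds name) {
    struct redisCommand *cmd = dictFetchValue(server.commands, name);
    if (cmd == NULL) cmd = dictFetchValue(server.orig_commands, name);
    return cmd;
}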
Usually this does not happen since we trim for " \t\r\n", but if there
are other chars that return true with isspace(), we may end up with an
empty argv. Better to handle the condition in an explicit way.
This makes programs not checking the return value for NULL much safer
since with this change:
1) It is still possible to iterate the zero-length result without
crashes.
2) sdssplitargs_free will work against NULL and 0 count.
Previously an empty input string also resulted in the function
returning NULL, making it harder for the caller to distinguish between
an error and an empty string without checking the original input
string length.
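A sketch of the calling pattern this change enables, using the names mentioned above (the argument order of sdssplitargs_free is assumed):
/* With this change, NULL only means a parsing error, while an empty
 * input yields a zero-length result that can be iterated and freed
 * like any other. */
void processLineSketch(const char *line) {
    int count;
    sds *argv = sdssplitargs(line, &count);
    if (argv == NULL) return;       /* real parsing error */
    for (int j = 0; j < count; j++) {
        /* ... use argv[j] ... */
    }
    sdssplitargs_free(argv, count); /* fine with a zero-length result too */
}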
If we have a master in FAIL state that's reachable again, and apparently
no one is going to serve its slots, clear the FAIL flag and let the
cluster continue with its operations again.
This is the unix time at which we set the FAIL flag for the node.
It is only valid if FAIL is set.
The idea is to use it in order to make the cluster more robust, for
instance in order to revert a FAIL state if it is long-standing but
slots are still assigned to this node, that is, apparently no one is
going to fix these slots.
Usually we try to send just one ping every second; however, when we
detect we are going to have unreliable failure detection because we
can't ping some node in time, we send an additional ping.
This should only happen with very large clusters or when the node
timeout is set to a very low value.
This should improve things in two ways:
1) Prevent timeouts caused by the execution of long commands.
2) Improve detection of real connection errors.
This is mostly effective only on Linux because of the bogus default
keepalive settings. On Linux we have OS-specific calls to set the
keepalive interval to reasonable values.
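A minimal sketch of the kind of call involved, with an illustrative interval; the Linux-only options tune how quickly dead peers are detected:
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
/* Enable TCP keepalive on fd; on Linux also shrink the probe timings
 * so that dead connections are detected reasonably fast. */
static int setKeepAliveSketch(int fd, int interval) {
    int yes = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &yes, sizeof(yes)) == -1)
        return -1;
#ifdef __linux__
    int idle = interval;            /* seconds of idle before first probe */
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
    int intvl = interval/3;         /* seconds between probes */
    if (intvl == 0) intvl = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    int cnt = 3;                    /* failed probes before dropping */
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
#endif
    return 0;
}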
As stated in the comment this is usually due to a resharding in
progress, so the client should still be redirected to the old node,
which will handle the redirection elsewhere.
Before, a relatively slow popcount() operation was needed every time
we needed to get the number of slots served by a given cluster node.
Now we just need to check an integer that is kept in sync with the
bitmap.
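A minimal sketch of keeping the counter in sync when a slot bit is set, with illustrative names:
/* Set 'slot' in the node's slots bitmap and bump the cached counter
 * only when the bit actually changes, so the counter always matches
 * the bitmap without needing popcount(). Returns the old bit value. */
int setSlotBitSketch(unsigned char *slots, int *numslots, int slot) {
    int old = (slots[slot/8] >> (slot%8)) & 1;
    slots[slot/8] |= (unsigned char)(1 << (slot%8));
    if (!old) (*numslots)++;
    return old;
}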
This caused a segfault on some Linux systems and was GCC-specific.
Commit modified by @antirez:
1) Stripped away the part to set the proc title via config for now.
2) Handle initialization of setproctitle only when the replacement
is used.
3) Don't require GCC now that the attribute constructor is no
longer used.
This commit allows Redis to set a process name that includes the binding
address and the port number in order to make operations simpler.
Redis children processes doing AOF rewrites or RDB saving change the
name into redis-aof-rewrite and redis-rdb-bgsave respectively.
This in general makes it harder to kill the wrong process by mistake
and makes it simpler to identify saving children.
This feature was suggested by Arnaud GRANAL in the Redis Google Group.
Arnaud also pointed me to the setproctitle.c implementation included in
this commit.
This feature should work on Linux, OSX, and all three major BSD
systems.
This is not very important as the cleanup is performed anyway when the
function counting the number of reports is called. However with this
change, if only part of the nodes that reported the failure report that
the node is back ok, we'll clean up the older entries ASAP. In complex
net split scenarios, and when we are dealing with clusters having
nodes in the order of ~1000, this can save some CPU.
Not sure why I set a limit to 1 million keys; there is no reason for
this artificial limit, and anyway it is a stupid limit because it is
already high enough to create latency issues. So let's let users shoot
themselves in the foot, because maybe they actually know what they are
doing.
A Redis Cluster node used to mark a node as failing when it itself
detected a failure for that node and a single acknowledgment about the
possible failure state was received.
The new API will be used in order to make it possible to require that N
other nodes have a PFAIL or FAIL state for a given node before setting
it as failing.