This commit makes the fast collection cycle time configurable; at the
same time it does not allow a new fast collection cycle to run for an
amount of time equal to the max duration of the fast collection cycle.
The main idea here is that when we are no longer able to expire keys
at the rate they are created, we can't block more in the normal expire
cycle, as this would result in too big latency spikes.
For this reason the commit introduces a "fast" expire cycle that does
not run for more than 1 millisecond but is called in the beforeSleep()
hook of the event loop, so much more often, and with a frequency bound
to the frequency of executed commands.
The fast expire cycle is only called when the standard expiration
algorithm runs out of time, that is, consumed more than
REDIS_EXPIRELOOKUPS_TIME_PERC of CPU in a given cycle without being able
to bring the number of already expired keys that are not yet collected
below 25% of the total number of keys.
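A minimal sketch of the guard, assuming hypothetical constant and
function names (illustrative, not the actual implementation):

    void activeExpireCycle(int type) {
        static long long last_fast_cycle = 0; /* last fast cycle start */

        if (type == ACTIVE_EXPIRE_CYCLE_FAST) {
            /* Refuse to start a new fast cycle until a pause at least
             * as long as the fast cycle max duration has elapsed. */
            if (ustime() < last_fast_cycle +
                           ACTIVE_EXPIRE_CYCLE_FAST_DURATION*2) return;
            last_fast_cycle = ustime();
        }
        /* ... sample and expire keys within the time budget ... */
    }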
You can test this commit with different loads, but a simple way is to
use the following:
Extreme load with pipelining:
redis-benchmark -r 100000000 -n 100000000 \
-P 32 set ele:rand:000000000000 foo ex 2
Remove the -P 32 option in order to avoid the pipelining for a more
real-world load.
In another terminal tab you can monitor the Redis behavior with:
redis-cli -i 0.1 -r -1 info keyspace
and
redis-cli --latency-history
Note: this commit will make Redis print a lot of debug messages, so it
is not a good idea to use it in production.
Thanks to @run and @badboy for spotting this.
Trivia: clang was not able to provide a warning about that when
compiling.
This closes #1024 and #1207, committing the change myself as the pull
requests no longer apply cleanly after other changes to the same
function.
Actually the string is modified in-place and a reallocation is never
needed, so there is no need to return the new sds string pointer as
the return value of the function, which is now just "void".
Now that EMBSTR encoding exists we calculate the amount of memory used
by the SDS part of a Redis String object in two different ways:
1) For raw string object, the size of the allocation is considered.
2) For embstr objects, the length of the string itself is used.
The new function takes care of this logic.
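A minimal sketch of that logic, assuming the usual sds helpers (the
function name here is illustrative):

    /* Return the amount of memory used by the sds part of a string
     * object, using the right metric for each encoding. */
    size_t stringObjectSdsUsedMemory(robj *o) {
        switch(o->encoding) {
        case REDIS_ENCODING_RAW: return sdsAllocSize(o->ptr);
        case REDIS_ENCODING_EMBSTR: return sdslen(o->ptr);
        default: return 0; /* integer encoding uses no sds string */
        }
    }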
This function missed proper handling of reply_bytes when gluing to the
previous object was used. The issue was introduced with the new EMBSTR
string object encoding.
This fixes issue #1208.
Previously two string encodings were used for string objects:
1) REDIS_ENCODING_RAW: a string object with obj->ptr pointing to an sds
string.
2) REDIS_ENCODING_INT: a string object where the obj->ptr void pointer
is cast to a long.
This commit introduces an experimental new encoding called
REDIS_ENCODING_EMBSTR that implements an object represented by an sds
string that is not modifiable but allocated in the same memory chunk as
the robj structure itself.
The chunk looks like the following:
+--------------+-----------+------------+--------+----+
| robj data... | robj->ptr | sds header | string | \0 |
+--------------+-----+-----+------------+--------+----+
                     |                  ^
                     +------------------+
The robj->ptr points to the contiguous sds string data, so the object
can be manipulated with the same functions used to manipulate plain
string objects; however we need just one malloc and one free in order
to allocate or release this kind of object. Moreover it has better
cache locality.
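A sketch of how such an object can be allocated in a single chunk
(field names follow the robj/sdshdr structures; treat this as
illustrative, not the exact implementation):

    robj *createEmbeddedStringObject(char *ptr, size_t len) {
        /* One allocation: robj + sds header + string + null term. */
        robj *o = zmalloc(sizeof(robj)+sizeof(struct sdshdr)+len+1);
        struct sdshdr *sh = (void*)(o+1);

        o->type = REDIS_STRING;
        o->encoding = REDIS_ENCODING_EMBSTR;
        o->ptr = sh+1; /* the string part, usable as a normal sds */
        o->refcount = 1;

        sh->len = len;
        sh->free = 0; /* not modifiable, so no free space */
        memcpy(sh->buf,ptr,len);
        sh->buf[len] = '\0';
        return o;
    }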
This new allocation strategy should benefit both the memory usage and
the performance. A performance gain between 60 and 70% was observed
during micro-benchmarks; however there is more work to do to evaluate
the performance impact and the memory usage behavior.
There are systems that, when printing +/- infinite values with
printf-family functions, will not use the usual "inf" and "-inf", but
different strings.
Handle that explicitly.
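A hedged sketch of the explicit handling (names are illustrative):

    #include <math.h>
    #include <stdio.h>

    int doubleToString(char *buf, size_t buflen, double value) {
        if (isinf(value)) {
            /* The printf output for infinity is not portable, so
             * generate the string by hand. */
            return snprintf(buf,buflen,value > 0 ? "inf" : "-inf");
        }
        return snprintf(buf,buflen,"%.17g",value);
    }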
Fixes issue #930.
This fixes issue #1194, which contains many details.
However in short, it was possible for ZADD to reject score values that
were nevertheless possible to obtain with multiple calls to ZINCRBY,
like in the following example:
redis 127.0.0.1:6379> zadd k 2.5e-308 m
(integer) 1
redis 127.0.0.1:6379> zincrby k -2.4e-308 m
"9.9999999999999694e-310"
redis 127.0.0.1:6379> zscore k m
"9.9999999999999694e-310"
redis 127.0.0.1:6379> zadd k 9.9999999999999694e-310 m1
(error) ERR value is not a valid float
The problem was due to strtod() returning ERANGE in the following case
specified by POSIX:
"If the correct value would cause an underflow, a value whose magnitude
is no greater than the smallest normalized positive number in the return
type shall be returned and errno set to [ERANGE].".
Now instead the returned value is accepted even when ERANGE is
returned, as long as the return value of the function is not negative
HUGE_VAL, positive HUGE_VAL, or zero.
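A sketch of the relaxed check (illustrative, not the exact code):

    #include <errno.h>
    #include <math.h>
    #include <stdlib.h>

    int parseScore(const char *s, double *out) {
        char *eptr;
        errno = 0;
        double value = strtod(s,&eptr);
        if (eptr == s || *eptr != '\0') return -1; /* not a number */
        /* Only reject ERANGE for a real overflow, or an underflow all
         * the way down to zero: denormals are now accepted. */
        if (errno == ERANGE &&
            (value == HUGE_VAL || value == -HUGE_VAL || value == 0))
            return -1;
        *out = value;
        return 0;
    }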
Note that we only do it when STORE is not used, otherwise we want an
absolutely locale independent and binary safe sorting in order to ensure
AOF / replication consistency.
This is probably an unexpected behavior violating the least surprise
rule, but there is currently no other simple / good alternative.
compareStringObject was not always giving the same result when
comparing the same two strings encoded as integers or as sds strings,
since it switched to strcmp() when at least one of the strings was not
sds encoded.
For instance the two strings "123" and "123\x00456", where the first
string was integer encoded, would cause the old implementation of
compareStringObject() to return 0 as if the strings were equal, while
instead the second string is "greater" than the first in a binary
comparison.
The same comparison, but with "123" encoded as an sds string, would
instead return a value < 0, as is correct. It is not impossible that the
above caused some obscure bug, since the comparison was not always
deterministic, and compareStringObject() is used in the implementation
of skiplists, hash tables, and so forth.
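A sketch of a binary-safe comparison that first normalizes
integer-encoded objects to their string form (names assumed):

    int compareStringObjectsBinary(robj *a, robj *b) {
        char bufa[128], bufb[128];
        char *astr, *bstr;
        size_t alen, blen, minlen;
        int cmp;

        if (a->encoding == REDIS_ENCODING_INT) {
            alen = ll2string(bufa,sizeof(bufa),(long)a->ptr);
            astr = bufa;
        } else {
            astr = a->ptr;
            alen = sdslen(astr);
        }
        if (b->encoding == REDIS_ENCODING_INT) {
            blen = ll2string(bufb,sizeof(bufb),(long)b->ptr);
            bstr = bufb;
        } else {
            bstr = b->ptr;
            blen = sdslen(bstr);
        }
        /* memcmp() instead of strcmp(): embedded NUL bytes count. */
        minlen = (alen < blen) ? alen : blen;
        cmp = memcmp(astr,bstr,minlen);
        if (cmp == 0) return (int)alen-(int)blen;
        return cmp;
    }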
At the same time, this commit introduced collateStringObject(), which
can be used by the SORT command to return sorted strings using
collation instead of binary comparison. See next commit.
The function returns a unique identifier for the client, as ip:port for
IPv4 and IPv6 clients, or as path:0 for Unix socket clients.
See the top comment in the function for more info.
This has been done by exposing the anetSockName() function in anet.c
to be used when the sentinel is publishing its existence to the masters.
This implementation is very unintelligent: it will likely break if
used with IPv6, as the nested colons will break any parsing of the
PUBLISH string by the master.
Any places which I feel might want to be updated to work differently
with IPv6 have been marked with a comment starting "IPV6:".
Currently the only comments address places where an IP address is
combined with a port using the standard : separated form. These may want
to be changed when printing IPv6 addresses to wrap the address in []
such as
[2001:db8::c0:ffee]:6379
instead of
2001:db8::c0:ffee:6379
as the latter format is a technically valid IPv6 address and it is hard
to distinguish the IPv6 address component from the port unless you know
the port is supposed to be there.
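A sketch of how such formatting could look when building a printable
peer id (helper name is illustrative):

    /* IPv6 addresses contain colons, so wrap them in brackets before
     * appending the port to keep the ip:port output parseable. */
    void formatPeerId(char *buf, size_t buflen, char *ip, int port) {
        if (strchr(ip,':'))
            snprintf(buf,buflen,"[%s]:%d",ip,port);
        else
            snprintf(buf,buflen,"%s:%d",ip,port);
    }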
In two places buffers have been created with a size of 128 bytes which
could be reduced to INET6_ADDRSTRLEN to still hold a full IP address.
These places have been marked as they are presently big enough to handle
the needs of storing a printable IPv6 address.
Changes the sockaddr_in to a sockaddr_storage. Attempts to convert the
IP address into an AF_INET or AF_INET6 before returning an "Invalid IP
address" error. Handles converting the sockaddr from either AF_INET or
AF_INET6 back into a string for storage in the clusterNode ip field.
Change the sockaddr_in to sockaddr_storage which is capable of storing
both AF_INET and AF_INET6 sockets. Uses the sockaddr_storage ss_family
to correctly return the printable IP address and port.
Function makes the assumption that the buffer is of at least
REDIS_CLUSTER_IPLEN bytes in size.
Change the sockaddr_in to sockaddr_storage which is capable of storing
both AF_INET and AF_INET6 sockets. Uses the sockaddr_storage ss_family
to correctly return the printable IP address and port.
Change the sockaddr_in to sockaddr_storage which is capable of storing
both AF_INET and AF_INET6 sockets. Uses the sockaddr_storage ss_family
to correctly return the printable IP address and port.
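A sketch of the ss_family dispatch these changes rely on (variable
names are illustrative):

    struct sockaddr_storage sa;
    socklen_t salen = sizeof(sa);

    if (getpeername(fd,(struct sockaddr*)&sa,&salen) == 0) {
        if (sa.ss_family == AF_INET) {
            struct sockaddr_in *s = (struct sockaddr_in*)&sa;
            inet_ntop(AF_INET,&s->sin_addr,ip,iplen);
            *port = ntohs(s->sin_port);
        } else {
            struct sockaddr_in6 *s = (struct sockaddr_in6*)&sa;
            inet_ntop(AF_INET6,&s->sin6_addr,ip,iplen);
            *port = ntohs(s->sin6_port);
        }
    }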
Change the getaddrinfo(3) hints family from AF_INET to AF_UNSPEC to
allow resolution of IPv6 addresses as well as IPv4 addresses. The
function will return the IP address of whichever address family is
preferred by the operating system. Most current operating systems
will prefer AF_INET6 over AF_INET.
Unfortunately without attempting to establish a connection to the
remote address we can't know if the host is capable of using the
returned IP address. It might be desirable to have anetResolve
accept an additional argument specifying the AF_INET/AF_INET6 address
the caller would like to receive. Currently though it does not appear
as though the anetResolve function is ever used within Redis.
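A sketch of the AF_UNSPEC resolution (error handling trimmed, buffer
names illustrative):

    struct addrinfo hints, *info;

    memset(&hints,0,sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* both AF_INET and AF_INET6 */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host,NULL,&hints,&info) != 0) return ANET_ERR;
    if (info->ai_family == AF_INET) {
        struct sockaddr_in *sa = (struct sockaddr_in*)info->ai_addr;
        inet_ntop(AF_INET,&sa->sin_addr,ipbuf,ipbuf_len);
    } else {
        struct sockaddr_in6 *sa = (struct sockaddr_in6*)info->ai_addr;
        inet_ntop(AF_INET6,&sa->sin6_addr,ipbuf,ipbuf_len);
    }
    freeaddrinfo(info);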
Add REDIS_CLUSTER_IPLEN macro to define the size of the clusterNode ip
character array. Additionally use this macro in inet_ntop(3) calls where
the size of the array was being defined manually.
The REDIS_CLUSTER_IPLEN is defined as INET_ADDRSTRLEN which defines the
correct size of a buffer to store an IPv4 address in. The
INET_ADDRSTRLEN macro itself is defined in the <netinet/in.h> header
file and should be portable across the majority of systems.
Using sizeof with an array will only return expected results if the
array is created in the scope of the function where sizeof is used. This
commit changes the inet_ntop calls so that they use the fixed buffer
value as defined in redis.h which is 16.
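A short illustration of the pitfall:

    char ip[16];
    inet_ntop(AF_INET,&addr,ip,sizeof(ip)); /* ok: sizeof(ip) is 16 */

    void fill_ip(char ip[16]) {
        /* Here 'ip' has decayed to char*, so sizeof(ip) is the size
         * of a pointer, not 16: hence the explicit constant. */
    }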
Replace inet_ntoa(3) calls with the more future proof inet_ntop(3)
function which is capable of handling additional address families.
API Change: anetTcpAccept() and anetPeerToString() require an
additional argument specifying the length of the character buffer the
IP address is written to, in order to comply with inet_ntop(3)
function semantics.
Change anetTcpServer() function to use getaddrinfo(3) to perform
address resolution, socket creation and binding. Resolved addresses
are limited to those reachable by the AF_INET address family.
Change anetTcpGenericConnect() function to use getaddrinfo(3) to
perform address resolution, socket creation and connection. Resolved
addresses are limited to those reachable by the AF_INET family.
Change anetResolve() function to use getaddrinfo(3) to resolve hostnames.
Resolved hostnames are limited to those reachable by the AF_INET address
family.
API Change: anetResolve() requires an additional argument specifying
the length of the character buffer the IP address is written to, in
order to comply with inet_ntop(3) semantics. inet_ntop(3) replaces
inet_ntoa(3)
as it has been designed to be compatible with more address families.
bind(2) can't be called multiple times against the same socket;
multiple sockets are required to bind multiple interfaces. Silly me.
This reverts commit bd234d62bb.
When in --pipe mode, after all the data transfer to the server is
complete, now redis-cli waits at most the specified number of
seconds (30 by default, use 0 to wait forever) without receiving any
reply at all from the server. After this time limit the operation is
aborted with an error.
That's related to issue #681.
If the protocol read from stdin happened to contain garbage (invalid
random chars), in the previous implementation it was possible to end
up with something like:
dksfjdksjflskfjl*2\r\n$4\r\nECHO....
That is invalid, as the *2 should start on a new line. Now we prefix
the ECHO with a CRLF that has no effect on the server but prevents
this issue most of the time.
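With the fix, the stream reaching the server looks like the following
(illustrative):

dksfjdksjflskfjl\r\n*2\r\n$4\r\nECHO...

The CRLF terminates whatever garbage inline command the server was
accumulating, so the following *2 is parsed at the start of a new
line.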
Of course if the offending wrong sequence is something like:
$3248772349\r\n
No one is going to save us as Redis will wait for data in the context of
a big argument, so this fix does not cover all the cases.
This partially fixes issue #681.
Clients using SYNC to replicate are older implementations, such as
redis-cli --slave, and are not designed to acknowledge the master with
REPLCONF ACK commands, so we don't have any feedback and should not
disconnect them on timeout.
This code is only responsible for maintaining an LRU-evicted, fixed
length cache of the SHA1 digests of scripts that we are sure all the
slaves received.
In this commit only the implementation is provided, but the Redis core
does not use it to actually send EVALSHA to slaves when possible.
The old REDIS_CMD_FORCE_REPLICATION flag was removed from the
implementation of Redis, now there is a new API to force specific
executions of a command to be propagated to AOF / Replication link:
void forceCommandPropagation(int flags);
The new API is also compatible with Lua scripting, so a script that
executes commands that are forced to be propagated will itself be
propagated accordingly, even if it performs no change to the data.
As a side effect, this new design fixes the issue with scripts not able
to propagate PUBLISH to slaves (issue #873).
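A hedged usage sketch (the command body and flag names are
illustrative):

    void publishCommand(redisClient *c) {
        int receivers = pubsubPublishMessage(c->argv[1],c->argv[2]);
        /* No keyspace change happened, but the effects must still
         * reach the slaves and the AOF. */
        forceCommandPropagation(REDIS_PROPAGATE_AOF|REDIS_PROPAGATE_REPL);
        addReplyLongLong(c,receivers);
    }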
Currently it implements three subcommands:
PUBSUB CHANNELS [<pattern>] List channels with non-zero subscribers.
PUBSUB NUMSUB [channel_1 ...] List number of subscribers for channels.
PUBSUB NUMPAT Return number of subscribed patterns.
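For example, with one client subscribed to "news.tech" (output is
illustrative):

redis 127.0.0.1:6379> pubsub channels news.*
1) "news.tech"
redis 127.0.0.1:6379> pubsub numsub news.tech
1) "news.tech"
2) (integer) 1
redis 127.0.0.1:6379> pubsub numpat
(integer) 0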
Sentinel was not able to detect slaves when connected to a very recent
version of a Redis master, since a previous non-backward-compatible
change to INFO broke the parsing of the slaves' ip:port in the INFO
output.
This fixes issue #1164.
When the semantics changed from logfile = NULL to logfile = "" to log
into standard output, no proper change was made to logStackTrace() to
make it able to work with the new setup.
This commit fixes the issue.
This feature allows the user to specify the minimum number of
connected replicas having a lag less than or equal to the specified
number of seconds for writes to be accepted.
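A configuration sketch with illustrative values (hedged, the directive
names may differ):

min-slaves-to-write 3
min-slaves-max-lag 10

With such a setup, writes are refused unless at least 3 slaves have a
lag of at most 10 seconds.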
There is a new 'lag' field in the list of slaves, in the
"replication" section of the INFO output.
Also the format was changed in a backward-incompatible way in order to
make it easier to parse if new fields are added in the future, as the
new format is comma separated but has named fields (no longer
positional fields).
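An illustrative example of the new format:

slave0:ip=127.0.0.1,port=6380,state=online,offset=12345,lag=0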
Now masters, using the time at which the last REPLCONF ACK was received,
are able to explicitly disconnect slaves that are no longer responding.
Previously the only chance was to see a very long output buffer, which
was highly suboptimal.
ACKs can be also used as a base for synchronous replication. However in
that case they'll be explicitly requested by the master when the client
sends a request that needs to be replicated synchronously.
This special command is used by the slave to inform the master of the
amount of replication stream it has currently consumed.
It does not return anything, so we do not consume the additional
bandwidth the master would need to send back a reply.
The master can do a number of things knowing the amount of stream
processed, such as understanding the "lag" in bytes of the slave, verify
if a given command was already processed by the slave, and so forth.
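For example, the slave periodically sends something like (the offset
value is illustrative):

REPLCONF ACK 143310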
When the master sends commands, there is no need for the slave to
reply. Redis used to queue the reply in the output buffer and discard
it later; this is a waste of work and it is not clear why it was this
way (I sincerely don't remember).
This commit changes it so that the reply is not queued at all.
All tests passing.
SUSv3 says that:
The useconds argument shall be less than one million. If the value of
useconds is 0, then the call has no effect.
and actually NetBSD's implementation rejects such a value with EINVAL.
Use nanosleep(2), which has no such limitation, instead.
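A sketch of the replacement (wrapper name is illustrative):

    #include <time.h>

    void sleep_microseconds(long long usec) {
        struct timespec req;
        req.tv_sec = usec/1000000;
        req.tv_nsec = (usec%1000000)*1000;
        /* nanosleep(2) accepts any interval, unlike SUSv3 usleep(). */
        nanosleep(&req,NULL);
    }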
NetBSD-current's libc has a function named popcount.
Hiding these extensions using feature macros is not possible because
Redis uses other extensions covered by the same feature macro,
e.g. inet_aton.
Also the logfile option was modified to always have an explicit value
and to log to stdout when an empty string is used as log file.
Previously there was special handling of the string "stdout" that set
the logfile to NULL, which always required some special handling.
This reverts commit 2c75f2cf1a.
After further analysis, it is very unlikely that we'll raise the
string size limit to > 512MB, and at the same time such big strings
will be used in 32 bit systems.
Better to revert to size_t so that 32 bit processors will not be
forced to use a 64 bit counter in normal operations, which is
currently completely useless.
When the PONG delay is half the cluster node timeout, the link gets
disconnected (and later automatically reconnected) in order to ensure
that it's not just a dead connection issue.
However this operation is only performed if the link is old enough, in
order to avoid disconnecting the same link again and again (and among
the other problems, never receiving the PONG because of that).
Note: when the link is reconnected, the 'ping_sent' field is not updated
even if a new ping is sent using the new connection, so we can still
reliably detect a node ping timeout.
This is just to make the code exactly like the above instance used for
requirepass. There is no actual change, nor did the original code
violate the Redis coding style.
There was a race condition in the AOF rewrite code that, with bad enough
timing, could cause a volatile key just about to expire to be turned
into a non-volatile key. The bug was never reported to cause actual
issues, but was found analytically by a user in the Redis mailing list:
https://groups.google.com/forum/?fromgroups=#!topic/redis-db/Kvh2FAGK4Uk
This commit fixes issue #1079.
Tilt mode was too aggressive (not processing INFO output); this
resulted in a few problems:
1) Redirections were not followed when in tilt mode. This opened a
window to misinform clients about the current master when a Sentinel
was in tilt mode and a failover happened during the time it was not
able to update the state.
2) It was possible for a Sentinel exiting tilt mode to detect a false
failover start, if a slave rebooted with a wrong configuration
about at the same time. This used to happen since in tilt mode we
lose the information that the runid changed (reboot).
Now instead the Sentinel in tilt mode will still remove the instance
from the list of slaves if it changes state AND runid at the same
time.
Both are edge conditions but the changes should overall improve the
reliability of Sentinel.
We used to always turn a master into a slave if the DEMOTE flag was set,
as this was a resurrecting master instance.
However the following race condition is possible for a Sentinel that
got partitioned or had internal issues (tilt mode), and was not able to
refresh the state in the meantime:
1) Sentinel X is running, master is instance "A".
2) "A" fails, sentinels will promote slave "B" as master.
3) Sentinel X goes down because of a network partition.
4) "A" returns available, Sentinels will demote it as a slave.
5) "B" fails, other Sentinels will promote slave "A" as master.
6) At this point Sentinel X comes back.
When "X" comes back he thinks that:
"B" is the master.
"A" is the slave to demote.
We want to avoid that Sentinel "X" will demote "A" into a slave.
We also want that Sentinel "X" will detect that the conditions changed
and will reconfigure itself to monitor the right master.
There are two main ways for the Sentinel to reconfigure itself after
this event:
1) If "B" is reachable and already configured as a slave by other
sentinels, "X" will perform a redirection to "A".
2) If the conditions to demote "A" are not met, the fact that "A"
reports to be a master will trigger a failover detection in "X", which
will end in a reconfiguration to monitor "A".
However if the Sentinel was not reachable, its state may not be
updated, so in case it tilted, or was partitioned from the master
instance of the slave to demote, the new implementation waits some
time (enough to guarantee we can detect the new INFO, and new DOWN
conditions).
If after some time there are still not the right conditions to demote
the instance, the DEMOTE flag is cleared.
Sentinel redirected to the master if the instance changed runid or it
was the first time we got INFO, and a role change was detected from
master to slave.
While this is a good idea in the case of slave->master, since
otherwise we could detect a failover without good reasons just after
the reboot of a slave with a wrong configuration, in the case of the
master->slave transition it is much better to always perform the
redirection for the following reasons:
1) A Sentinel may go down for some time. When it is back online there is
no other way to understand there was a failover.
2) Pointing clients to a slave seems to be always the wrong thing to do.
3) There is no good rationale about handling things differently once an
instance is rebooted (runid change) in that case.
This prevents the kernel from putting too much stuff in the output
buffers, doing too heavy I/O all at once. So the goal of this commit is
to split the disk pressure due to the AOF rewrite process into smaller
spikes.
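A sketch of the idea; the 32 MB threshold and the names here are
assumptions, not necessarily what the commit uses:

    #define AOF_AUTOSYNC_BYTES (1024*1024*32) /* flush every 32 MB */

    /* Wrap the writes performed by the rewrite so that kernel buffers
     * are flushed incrementally instead of all at once at the end. */
    ssize_t aofRewriteWrite(int fd, char *buf, size_t len, size_t *written) {
        ssize_t nwritten = write(fd,buf,len);
        if (nwritten > 0) {
            *written += nwritten;
            if (*written >= AOF_AUTOSYNC_BYTES) {
                fsync(fd);
                *written = 0;
            }
        }
        return nwritten;
    }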
Please see issue #1019 for more information.
Previously redis-cli never tried to raise an error when an unrecognized
switch was encountered, as everything after the initial options is to be
transmitted to the server.
However this is too liberal, as there are no commands starting with "-".
So the new behavior is to produce an error if there is an unrecognized
switch starting with "-". This should not break past redis-cli usage
but should prevent broken options from being silently discarded.
As soon as the first token not starting with "-" is encountered, all
the rest is considered to be part of the command, so you can still use
strings starting with "-" as values, like in:
redis-cli --port 6380 set foo --my-value
We used to copy this value into the server.cluster structure, however this
was not necessary.
The reason why we don't directly use server.cluster->node_timeout is
that things that can be configured via redis.conf need to be directly
available in the server structure as server.cluster is allocated later
only if needed in order to reduce the memory footprint of non-cluster
instances.
Commit d728ec6 introduced the concept of sending a ping to every node
that did not receive one in the last node_timeout/2 seconds.
However the code was located in a place that was not executed because of
a previous conditional causing the loop to re-iterate.
This caused false positives in nodes availability detection.
The current code is still not perfect, as a node may be detected to be
in PFAIL state even if it does not reply for just node_timeout/2
seconds, which is not correct. There is a plan to improve this code
ASAP.
When a BGSAVE fails, Redis used to flood itself trying to BGSAVE at
every next cron call, that is either 10 or 100 times per second
depending on configuration and server version.
This commit does not allow a new automatic BGSAVE attempt to be
performed before a few seconds delay (currently 5).
This avoids both the auto-flood problem and filling the disk with
logs at a serious rate.
The five seconds limit, considering a log entry of 200 bytes, will use
less than 4 MB of disk space per day, which is reasonable; the
sysadmin should notice it before catastrophic events happen,
especially since by default Redis will stop serving write queries
after the first failed BGSAVE.
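A sketch of the guard, with constant and field names that are
assumptions (not necessarily the actual ones):

    #define REDIS_BGSAVE_RETRY_DELAY 5 /* seconds */

    /* In the cron: skip the automatic BGSAVE attempt if the previous
     * one failed and not enough time has elapsed since then. */
    if (server.dirty >= sp->changes &&
        server.unixtime-server.lastsave > sp->seconds &&
        (server.unixtime-server.lastbgsave_try >
         REDIS_BGSAVE_RETRY_DELAY ||
         server.lastbgsave_status == REDIS_OK))
    {
        rdbSaveBackground(server.rdb_filename);
    }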
This fixes issue #849.
This commit fixes two corner cases for the TTL command.
1) When the key was already logically expired (expire time older
than current time) the command returned -1 instead of -2.
2) When the key was existing and the expire was found to be exactly 0
(the key was just about to expire), the command reported -1 (that is, no
expire) instead of a TTL of zero (that is, about to expire).
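An illustrative session showing the fixed behavior for an already
expired key:

redis 127.0.0.1:6379> set foo bar
OK
redis 127.0.0.1:6379> pexpire foo 1
(integer) 1
redis 127.0.0.1:6379> ttl foo
(integer) -2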
MULTI/EXEC is now propagated to the AOF / Slaves only once we encounter
the first command that is not a read-only one inside the transaction.
The old behavior was to always propagate an empty MULTI/EXEC block when
the transaction was composed just of read only commands, or even
completely empty. This created two problems:
1) It's a bandwidth waste in the replication link and a space waste
inside the AOF file.
2) We used to always increment server.dirty to force the propagation of
the EXEC command, resulting into triggering RDB saves more often
than needed.
Note: even read-only commands may trigger writes that will be
propagated, when we access a key that is found expired and Redis
synthesizes a DEL operation. However there is no need for this to stay
inside the transaction itself; it only needs to be ordered.
So for instance something like:
MULTI
GET foo
SET key zap
EXEC
May be propagated into:
DEL foo
MULTI
SET key zap
EXEC
While the DEL is outside the transaction, the commands are delivered in
the right order and it is not possible for other commands to be inserted
between DEL and MULTI.