Report the actual port used for the listening attempt instead of
server.port.
Originally, Redis would just listen on server.port.
But, with clustering, Redis uses a Cluster Port too,
so we can't say server.port is always where we are listening.
If you tried to launch Redis with a too-high port number (any
port where Port+10000 > 65535), Redis would refuse to start, but
only print an error saying it can't connect to the Redis port.
This patch fixes much confusions.
The code tried to obtain the configuration file absolute path after
processing the configuration file. However if config file was a relative
path and a "dir" statement was processed reading the config, the absolute
path obtained was wrong.
With this fix the absolute path is obtained before processing the
configuration while the server is still in the original directory where
it was executed.
server.unixtime and server.mstime are cached less precise timestamps
that we use every time we don't need an accurate time representation and
a syscall would be too slow for the number of calls we require.
Such an example is the initialization and update process of the last
interaction time with the client, that is used for timeouts.
However rdbLoad() can take some time to load the DB, but at the same
time it did not updated the time during DB loading. This resulted in the
bug described in issue #1535, where in the replication process the slave
loads the DB, creates the redisClient representation of its master, but
the timestamp is so old that the master, under certain conditions, is
sensed as already "timed out".
Thanks to @yoav-steinberg and Redis Labs Inc for the bug report and
analysis.
A system similar to the RDB write error handling is used, in which when
we can't write to the AOF file, writes are no longer accepted until we
are able to write again.
For fsync == always we still abort on errors since there is currently no
easy way to avoid replying with success to the user otherwise, and this
would violate the contract with the user of only acknowledging data
already secured on disk.
The API is one of the bulding blocks of CLUSTER FAILOVER command that
executes a manual failover in Redis Cluster. However exposed as a
command that the user can call directly, it makes much simpler to
upgrade a standalone Redis instance using a slave in a safer way.
The commands works like that:
CLIENT PAUSE <milliesconds>
All the clients that are not slaves and not in MONITOR state are paused
for the specified number of milliesconds. This means that slaves are
normally served in the meantime.
At the end of the specified amount of time all the clients are unblocked
and will continue operations normally. This command has no effects on
the population of the slow log, since clients are not blocked in the
middle of operations but only when there is to process new data.
Note that while the clients are unblocked, still new commands are
accepted and queued in the client buffer, so clients will likely not
block while writing to the server while the pause is active.
In high RPS environments, the default listen backlog is not sufficient, so
giving users the power to configure it is the right approach, especially
since it requires only minor modifications to the code.
A client can enter a special cluster read-only mode using the READONLY
command: if the client read from a slave instance after this command,
for slots that are actually served by the instance's master, the queries
will be processed without redirection, allowing clients to read from
slaves (but without any kind fo read-after-write guarantee).
The READWRITE command can be used in order to exit the readonly state.
When a slave was disconnected from its master the replication offset was
reported as -1. Now it is reported as the replication offset of the
previous master, so that failover can be performed using this value in
order to try to select a slave with more processed data from a set of
slaves of the old master.
During the refactoring of blocking operations, commit
82b672f633, a bug was introduced where
a milliseconds time is compared to a seconds time, so all the clients
always appear to timeout if timeout is set to non-zero value.
Thanks to Jonathan Leibiusky for finding the bug and helping verifying
the cause and fix.
Some are just to know if the master is down, and in this case the runid
in the request is set to "*", others are actually in order to seek for a
vote and get elected. In the latter case the runid is set to the runid
of the instance seeking for the vote.
All the internal state of cluster involving time is now using mstime_t
and mstime() in order to use milliseconds resolution.
Also the clusterCron() function is called with a 10 hz frequency instead
of 1 hz.
The cluster node_timeout must be also configured in milliseconds by the
user in redis.conf.
Example:
db0:keys=221913,expires=221913,avg_ttl=655
The algorithm uses a running average with only two samples (current and
previous). Keys found to be expired are considered at TTL zero even if
the actual TTL can be negative.
The TTL is reported in milliseconds.
We don't want to repeat a fast cycle too soon, the previous code was
broken, we need to wait two times the period *since* the start of the
previous cycle in order to avoid there is an even space between cycles:
.-> start .-> second start
| |
+-------------+-------------+--------------+
| first cycle | pause | second cycle |
+-------------+-------------+--------------+
The second and first start must be PERIOD*2 useconds apart hence the *2
in the new code.
This commit makes the fast collection cycle time configurable, at
the same time it does not allow to run a new fast collection cycle
for the same amount of time as the max duration of the fast
collection cycle.
The main idea here is that when we are no longer to expire keys at the
rate the are created, we can't block more in the normal expire cycle as
this would result in too big latency spikes.
For this reason the commit introduces a "fast" expire cycle that does
not run for more than 1 millisecond but is called in the beforeSleep()
hook of the event loop, so much more often, and with a frequency bound
to the frequency of executed commnads.
The fast expire cycle is only called when the standard expiration
algorithm runs out of time, that is, consumed more than
REDIS_EXPIRELOOKUPS_TIME_PERC of CPU in a given cycle without being able
to take the number of already expired keys that are yet not collected
to a number smaller than 25% of the number of keys.
You can test this commit with different loads, but a simple way is to
use the following:
Extreme load with pipelining:
redis-benchmark -r 100000000 -n 100000000 \
-P 32 set ele:rand:000000000000 foo ex 2
Remove the -P32 in order to avoid the pipelining for a more real-world
load.
In another terminal tab you can monitor the Redis behavior with:
redis-cli -i 0.1 -r -1 info keyspace
and
redis-cli --latency-history
Note: this commit will make Redis printing a lot of debug messages, it
is not a good idea to use it in production.
Previously two string encodings were used for string objects:
1) REDIS_ENCODING_RAW: a string object with obj->ptr pointing to an sds
stirng.
2) REDIS_ENCODING_INT: a string object where the obj->ptr void pointer
is casted to a long.
This commit introduces a experimental new encoding called
REDIS_ENCODING_EMBSTR that implements an object represented by an sds
string that is not modifiable but allocated in the same memory chunk as
the robj structure itself.
The chunk looks like the following:
+--------------+-----------+------------+--------+----+
| robj data... | robj->ptr | sds header | string | \0 |
+--------------+-----+-----+------------+--------+----+
| ^
+-----------------------+
The robj->ptr points to the contiguous sds string data, so the object
can be manipulated with the same functions used to manipulate plan
string objects, however we need just on malloc and one free in order to
allocate or release this kind of objects. Moreover it has better cache
locality.
This new allocation strategy should benefit both the memory usage and
the performances. A performance gain between 60 and 70% was observed
during micro-benchmarks, however there is more work to do to evaluate
the performance impact and the memory usage behavior.
This code is only responsible to take an LRU-evicted fixed length cache
of SHA1 that we are sure all the slaves received.
In this commit only the implementation is provided, but the Redis core
does not use it to actually send EVALSHA to slaves when possible.
The old REDIS_CMD_FORCE_REPLICATION flag was removed from the
implementation of Redis, now there is a new API to force specific
executions of a command to be propagated to AOF / Replication link:
void forceCommandPropagation(int flags);
The new API is also compatible with Lua scripting, so a script that will
execute commands that are forced to be propagated, will also be
propagated itself accordingly even if no change to data is operated.
As a side effect, this new design fixes the issue with scripts not able
to propagate PUBLISH to slaves (issue #873).
Currently it implements three subcommands:
PUBSUB CHANNELS [<pattern>] List channels with non-zero subscribers.
PUBSUB NUMSUB [channel_1 ...] List number of subscribers for channels.
PUBSUB NUMPAT Return number of subscribed patterns.
This feature allows the user to specify the minimum number of
connected replicas having a lag less or equal than the specified
amount of seconds for writes to be accepted.
There is a new 'lag' information in the list of slaves, in the
"replication" section of the INFO output.
Also the format was changed in a backward incompatible way in order to
make it more easy to parse if new fields are added in the future, as the
new format is comma separated but has named fields (no longer positional
fields).
Also the logfile option was modified to always have an explicit value
and to log to stdout when an empty string is used as log file.
Previously there was special handling of the string "stdout" that set
the logfile to NULL, this always required some special handling.
When a BGSAVE fails, Redis used to flood itself trying to BGSAVE at
every next cron call, that is either 10 or 100 times per second
depending on configuration and server version.
This commit does not allow a new automatic BGSAVE attempt to be
performed before a few seconds delay (currently 5).
This avoids both the auto-flood problem and filling the disk with
logs at a serious rate.
The five seconds limit, considering a log entry of 200 bytes, will use
less than 4 MB of disk space per day that is reasonable, the sysadmin
should notice before of catastrofic events especially since by default
Redis will stop serving write queries after the first failed BGSAVE.
This fixes issue #849
server.repl_down_since used to be initialized to the current time at
startup. This is wrong since the replication never started. Clients
testing this filed to check if data is uptodate should never believe
data is recent if we never ever connected to our master.
This fixes cases where the RDB file does exist but can't be accessed for
any reason. For instance, when the Redis process doesn't have enough
permissions on the file.
activeExpireCycle() tries to test just a few DBs per iteration so that
it scales if there are many configured DBs in the Redis instance.
However this commit makes it a bit smarter when one a few of those DBs
are under expiration pressure and there are many many keys to expire.
What we do is to remember if in the last iteration had to return because
we ran out of time. In that case the next iteration we'll test all the
configured DBs so that we are sure we'll test again the DB under
pressure.
Before of this commit after some mass-expire in a given DB the function
tested just a few of the next DBs, possibly empty, a few per iteration,
so it took a long time for the function to reach again the DB under
pressure. This resulted in a lot of memory being used by already expired
keys and never accessed by clients.
This small number of DBs is set to 16 so actually in the default
configuraiton Redis should behave exactly like in the past.
However the difference is that when the user configures a very large
number of DBs we don't do an O(N) operation, consuming a non trivial
amount of CPU per serverCron() iteration.
This is the first step to lower the CPU usage when many databases are
configured. The other is to also process a limited number of DBs per
call in the active expire cycle.
A new server.orig_commands table was added to the server structure, this
contains a copy of the commant table unaffected by rename-command
statements in redis.conf.
A new API lookupCommandOrOriginal() was added that checks both tables,
new first, old later, so that rewriteClientCommandVector() and friends
can lookup commands with their new or original name in order to fix the
client->cmd pointer when the argument vector is renamed.
This fixes the segfault of issue #986, but does not fix a wider range of
problems resulting from renaming commands that actually operate on data
and are registered into the AOF file or propagated to slaves... That is
command renaming should be handled with care.
This cased a segfault in some Linux system and was GCC-specific.
Commit modified by @antirez:
1) Stripped away the part to set the proc title via config for now.
2) Handle initialization of setproctitle only when the replacement
is used.
3) Don't require GCC now that the attribute constructor is no
longer used.
This commit allows Redis to set a process name that includes the binding
address and the port number in order to make operations simpler.
Redis children processes doing AOF rewrites or RDB saving change the
name into redis-aof-rewrite and redis-rdb-bgsave respectively.
This in general makes harder to kill the wrong process because of an
error and makes simpler to identify saving children.
This feature was suggested by Arnaud GRANAL in the Redis Google Group,
Arnaud also pointed me to the setproctitle.c implementation includeed in
this commit.
This feature should work on all the Linux, OSX, and all the three major
BSD systems.
SELECT was still transmitted to slaves using the inline protocol, that
is conceived mostly for humans to type into telnet sessions, and is
notably not understood by redis-cli --slave.
Now the new protocol is used instead.
Before this commit every Redis slave had its own selected database ID
state. This was not actually useful as the emitted stream of commands
is identical for all the slaves.
Now the the currently selected database is a global state that is set to
-1 when a new slave is attached, in order to force the SELECT command to
be re-emitted for all the slaves.
This change is useful in order to implement replication partial
resynchronization in the future, as makes sure that the stream of
commands received by slaves, including SELECT commands, are exactly the
same for every slave connected, at any time.
In this way we could have a global offset that can identify a specific
piece of the master -> slaves stream of commands.
Further details from @antirez:
It was reported by @StopForumSpam on Twitter that the Redis replication
link was strangely using multiple TCP packets for multiple commands.
This wastes a lot of bandwidth and is due to the TCP_NODELAY option we
enable on the socket after accepting a new connection.
However the master -> slave channel is a one-way channel since Redis
replication is asynchronous, so there is no point in trying to reduce
the latency, we should aim to reduce the bandwidth. For this reason this
commit introduces the ability to disable the nagle algorithm on the
socket after a successful SYNC.
This feature is off by default because the delay can be up to 40
milliseconds with normally configured Linux kernels.
When keyspace events are enabled, the overhead is not sever but
noticeable, so this commit introduces the ability to select subclasses
of events in order to avoid to generate events the user is not
interested in.
The events can be selected using redis.conf or CONFIG SET / GET.
This commit fixes issue #875 that was caused by the following events:
1) There is an active child doing BGSAVE.
2) flushall is called (or any other condition that makes Redis killing
the saving child process).
3) An error is sensed by Redis as the child exited with an error (killed
by a singal), that stops accepting write commands until a BGSAVE happens
to be executed with success.
Whitelisting SIGUSR1 and making sure Redis always uses this signal in
order to kill its own children fixes the issue.
When a SIGTERM is received Redis schedules a shutdown. However if it
fails to perform the shutdown it must be clear the shutdown_asap flag
otehrwise it will try again and again possibly making the server
unusable.
The Redis Slow Log always used to log the slow commands executed inside
a MULTI/EXEC block. However also EXEC was logged at the end, which is
perfectly useless.
Now EXEC is no longer logged and a test was added to test this behavior.
This fixes issue #759.
REDIS_HZ is the frequency our serverCron() function is called with.
A more frequent call to this function results into less latency when the
server is trying to handle very expansive background operations like
mass expires of a lot of keys at the same time.
Redis 2.4 used to have an HZ of 10. This was good enough with almost
every setup, but the incremental key expiration algorithm was working a
bit better under *extreme* pressure when HZ was set to 100 for Redis
2.6.
However for most users a latency spike of 30 milliseconds when million
of keys are expiring at the same time is acceptable, on the other hand a
default HZ of 100 in Redis 2.6 was causing idle instances to use some
CPU time compared to Redis 2.4. The CPU usage was in the order of 0.3%
for an idle instance, however this is a shame as more energy is consumed
by the server, if not important resources.
This commit introduces HZ as a runtime parameter, that can be queried by
INFO or CONFIG GET, and can be modified with CONFIG SET. At the same
time the default frequency is set back to 10.
In this way we default to a sane value of 10, but allows users to
easily switch to values up to 500 for near real-time applications if
needed and if they are willing to pay this small CPU usage penalty.
The idea is to be able to identify a build in a unique way, so for
instance after a bug report we can recognize that the build is the one
of a popular Linux distribution and perform the debugging in the same
environment.
EVALSHA used to crash if the SHA1 was not lowercase (Issue #783).
Fixed using a case insensitive dictionary type for the sha -> script
map used for replication of scripts.
After the transcation starts with a MULIT, the previous behavior was to
return an error on problems such as maxmemory limit reached. But still
to execute the transaction with the subset of queued commands on EXEC.
While it is true that the client was able to check for errors
distinguish QUEUED by an error reply, MULTI/EXEC in most client
implementations uses pipelining for speed, so all the commands and EXEC
are sent without caring about replies.
With this change:
1) EXEC fails if at least one command was not queued because of an
error. The EXECABORT error is used.
2) A generic error is always reported on EXEC.
3) The client DISCARDs the MULTI state after a failed EXEC, otherwise
pipelining multiple transactions would be basically impossible:
After a failed EXEC the next transaction would be simply queued as
the tail of the previous transaction.
By caching TCP connections used by MIGRATE to chat with other Redis
instances a 5x performance improvement was measured with
redis-benchmark against small keys.
This can dramatically speedup cluster resharding and other processes
where an high load of MIGRATE commands are used.
With COPY now MIGRATE does not remove the key from the source instance.
With REPLACE it uses RESTORE REPLACE on the target host so that even if
the key already eixsts in the target instance it will be overwritten.
The options can be used together.
The REPLACE option deletes an existing key with the same name (if any)
and materializes the new one. The default behavior without RESTORE is to
return an error if a key already exists.
So instead to reply with a generic error like:
-ERR ... wrong kind of value ...
now it replies with:
-WRONGTYPE ... wrong kind of value ...
This makes this particular error easy to check without resorting to
(fragile) pattern matching of the error string (however the error string
used to be consistent already).
Client libraries should return a specific exeption type for this error.
Most of the commit is about fixing unit tests.
After the wait3() syscall we used to do something like that:
if (pid == server.rdb_child_pid) {
backgroundSaveDoneHandler(exitcode,bysignal);
} else {
....
}
So the AOF rewrite was handled in the else branch without actually
checking if the pid really matches. This commit makes the check explicit
and logs at WARNING level if the pid returned by wait3() does not match
neither the RDB or AOF rewrite child.
In some system, notably osx, the 3.5 GB limit was too far and not able
to prevent a crash for out of memory. The 3 GB limit works better and it
is still a lot of memory within a 4 GB theorical limit so it's not going
to bore anyone :-)
This fixes issue #711
Before of this commit it used to be like this:
MULTI
EXEC
... actual commands of the transaction ...
Because after all that is the natural order of things. Transaction
commands are queued and executed *only after* EXEC is called.
However this makes debugging with MONITOR a mess, so the code was
modified to provide a coherent output.
What happens is that MULTI is rendered in the MONITOR output as far as
possible, instead EXEC is propagated only after the transaction is
executed, or even in the case it fails because of WATCH, so in this case
you'll simply see:
MULTI
EXEC
An empty transaction.
If the server is password protected we need to accept AUTH when there is
a server busy (-BUSY) condition, otherwise it will be impossible to send
SHUTDOWN NOSAVE or SCRIPT KILL.
This fixes issue #708.
This commit warns the user with a log at "warning" level if:
1) After the server startup the maxmemory limit was found to be < 1MB.
2) After a CONFIG SET command modifying the maxmemory setting the limit
is set to a value that is smaller than the currently used memory.
The behaviour of the Redis server is unmodified, and this wil not make
the CONFIG SET command or a wrong configuration in redis.conf less
likely to create problems, but at least this will make aware most users
about a possbile error they committed without resorting to external
help.
However no warning is issued if, as a result of loading the AOF or RDB
file, we are very near the maxmemory setting, or key eviction will be
needed in order to go under the specified maxmemory setting. The reason
is that in servers configured as a cache with an aggressive
maxmemory-policy most of the times restarting the server will cause this
condition to happen if persistence is not switched off.
This fixes issue #429.
SRANDMEMBER called with just the key argument can just return a single
random element from a Redis Set. However many users need to return
multiple unique elements from a Set, this is not a trivial problem to
handle in the client side, and for truly good performance a C
implementation was required.
After many requests for this feature it was finally implemented.
The problem implementing this command is the strategy to follow when
the number of elements the user asks for is near to the number of
elements that are already inside the set. In this case asking random
elements to the dictionary API, and trying to add it to a temporary set,
may result into an extremely poor performance, as most add operations
will be wasted on duplicated elements.
For this reason this implementation uses a different strategy in this
case: the Set is copied, and random elements are returned to reach the
specified count.
The code actually uses 4 different algorithms optimized for the
different cases.
If the count is negative, the command changes behavior and allows for
duplicated elements in the returned subset.
Redis provides support for blocking operations such as BLPOP or BRPOP.
This operations are identical to normal LPOP and RPOP operations as long
as there are elements in the target list, but if the list is empty they
block waiting for new data to arrive to the list.
All the clients blocked waiting for th same list are served in a FIFO
way, so the first that blocked is the first to be served when there is
more data pushed by another client into the list.
The previous implementation of blocking operations was conceived to
serve clients in the context of push operations. For for instance:
1) There is a client "A" blocked on list "foo".
2) The client "B" performs `LPUSH foo somevalue`.
3) The client "A" is served in the context of the "B" LPUSH,
synchronously.
Processing things in a synchronous way was useful as if "A" pushes a
value that is served by "B", from the point of view of the database is a
NOP (no operation) thing, that is, nothing is replicated, nothing is
written in the AOF file, and so forth.
However later we implemented two things:
1) Variadic LPUSH that could add multiple values to a list in the
context of a single call.
2) BRPOPLPUSH that was a version of BRPOP that also provided a "PUSH"
side effect when receiving data.
This forced us to make the synchronous implementation more complex. If
client "B" is waiting for data, and "A" pushes three elemnents in a
single call, we needed to propagate an LPUSH with a missing argument
in the AOF and replication link. We also needed to make sure to
replicate the LPUSH side of BRPOPLPUSH, but only if in turn did not
happened to serve another blocking client into another list ;)
This were complex but with a few of mutually recursive functions
everything worked as expected... until one day we introduced scripting
in Redis.
Scripting + synchronous blocking operations = Issue #614.
Basically you can't "rewrite" a script to have just a partial effect on
the replicas and AOF file if the script happened to serve a few blocked
clients.
The solution to all this problems, implemented by this commit, is to
change the way we serve blocked clients. Instead of serving the blocked
clients synchronously, in the context of the command performing the PUSH
operation, it is now an asynchronous and iterative process:
1) If a key that has clients blocked waiting for data is the subject of
a list push operation, We simply mark keys as "ready" and put it into a
queue.
2) Every command pushing stuff on lists, as a variadic LPUSH, a script,
or whatever it is, is replicated verbatim without any rewriting.
3) Every time a Redis command, a MULTI/EXEC block, or a script,
completed its execution, we run the list of keys ready to serve blocked
clients (as more data arrived), and process this list serving the
blocked clients.
4) As a result of "3" maybe more keys are ready again for other clients
(as a result of BRPOPLPUSH we may have push operations), so we iterate
back to step "3" if it's needed.
The new code has a much simpler semantics, and a simpler to understand
implementation, with the disadvantage of not being able to "optmize out"
a PUSH+BPOP as a No OP.
This commit will be tested with care before the final merge, more tests
will be added likely.
SORT is able to return (faster than when ordering) unordered output if
the "BY" clause is used with a constant value. However we try to play
well with scripting requirements of determinism providing always sorted
outputs when SORT (and other similar commands) are called by Lua
scripts.
However we used the general mechanism in place in scripting in order to
reorder SORT output, that is, if the command has the "S" flag set, the
Lua scripting engine will take an additional step when converting a
multi bulk reply to Lua value, calling a Lua sorting function.
This is suboptimal as we can do it faster inside SORT itself.
This is also broken as issue #545 shows us: basically when SORT is used
with a constant BY, and additionally also GET is used, the Lua scripting
engine was trying to order the output as a flat array, while it was
actually a list of key-value pairs.
What we do know is to recognized if the caller of SORT is the Lua client
(since we can check this using the REDIS_LUA_CLIENT flag). If so, and if
a "don't sort" condition is triggered by the BY option with a constant
string, we force the lexicographical sorting.
This commit fixes this bug and improves the performance, and at the same
time simplifies the implementation. This does not mean I'm smart today,
it means I was stupid when I committed the original implementation ;)
A Redis slave can now be configured with a priority, that is an integer
number that is shown in INFO output and can be get and set using the
redis.conf file or the CONFIG GET/SET command.
This field is used by Sentinel during slave election. A slave with lower
priority is preferred. A slave with priority zero is never elected (and
is considered to be impossible to elect even if it is the only slave
available).
A next commit will add support in the Sentinel side as well.
This fixes issue #539.
Basically if there is enough free memory the OS may buffer the RDB file
that the slave transfers on disk from the master. The file may
actually be flused on disk at once by the operating system when it gets
closed by Redis, causing the close system call to block for a long time.
This patch is a modified version of one provided by yoav-steinberg of
@garantiadata (the original version was posted in the issue #539
comments), and tries to flush the OS buffers incrementally (every 8 MB
of loaded data).
The previous implementation of zmalloc.c was not able to handle out of
memory in an application-specific way. It just logged an error on
standard error, and aborted.
The result was that in the case of an actual out of memory in Redis
where malloc returned NULL (In Linux this actually happens under
specific overcommit policy settings and/or with no or little swap
configured) the error was not properly logged in the Redis log.
This commit fixes this problem, fixing issue #509.
Now the out of memory is properly reported in the Redis log and a stack
trace is generated.
The approach used is to provide a configurable out of memory handler
to zmalloc (otherwise the default one logging the event on the
standard output is used).
This commit implements the first, beta quality implementation of Redis
Sentinel, a distributed monitoring system for Redis with notification
and automatic failover capabilities.
More info at http://redis.io/topics/sentinel
Redis loading data from disk, and a Redis slave disconnected from its
master with serve-stale-data disabled, are two conditions where
commands are normally refused by Redis, returning an error.
However there is no reason to disable Pub/Sub commands as well, given
that this layer does not interact with the dataset. To allow Pub/Sub in
as many contexts as possible is especially interesting now that Redis
Sentinel uses Pub/Sub of a Redis master as a communication channel
between Sentinels.
This commit allows Pub/Sub to be used in the above two contexts where
it was previously denied.
Behaves like rdb_last_bgsave_status -- even down to reporting 'ok' when
no rewrite has been done yet. (You might want to check that
aof_last_rewrite_time_sec is not -1.)
The REPLCONF command is an internal command (not designed to be directly
used by normal clients) that allows a slave to set some replication
related state in the master before issuing SYNC to start the
replication.
The initial motivation for this command, and the only reason currently
it is used by the implementation, is to let the slave instance
communicate its listening port to the slave, so that the master can
show all the slaves with their listening ports in the "replication"
section of the INFO output.
This allows clients to auto discover and query all the slaves attached
into a master.
Currently only a single option of the REPLCONF command is supported, and
it is called "listening-port", so the slave now starts the replication
process with something like the following chat:
REPLCONF listening-prot 6380
SYNC
Note that this works even if the master is an older version of Redis and
does not understand REPLCONF, because the slave ignores the REPLCONF
error.
In the future REPLCONF can be used for partial replication and other
replication related features where there is the need to exchange
information between master and slave.
NOTE: This commit also fixes a bug: the INFO outout already carried
information about slaves, but the port was broken, and was obtained
with getpeername(2), so it was actually just the ephemeral port used
by the slave to connect to the master as a client.
The way we compared the authentication password using strcmp() allowed
an attacker to gain information about the password using a well known
class of attacks called "timing attacks".
The bug appears to be practically not exploitable in most modern systems
running Redis since even using multiple bytes of differences in the
input at a time instead of one the difference in running time in in the
order of 10 nanoseconds, making it hard to exploit even on LAN. However
attacks always get better so we are providing a fix ASAP.
The new implementation uses two fixed length buffers and a constant time
comparison function, with the goal of:
1) Completely avoid leaking information about the content of the
password, since the comparison is always performed between 512
characters and without conditionals.
2) Partially avoid leaking information about the length of the
password.
About "2" we still have a stage in the code where the real password and
the user provided password are copied in the static buffers, we also run
two strlen() operations against the two inputs, so the running time
of the comparison is a fixed amount plus a time proportional to
LENGTH(A)+LENGTH(B). This means that the absolute time of the operation
performed is still related to the length of the password in some way,
but there is no way to change the input in order to get a difference in
the execution time in the comparison that is not just proportional to
the string provided by the user (because the password length is fixed).
Thus in practical terms the user should try to discover LENGTH(PASSWORD)
looking at the whole execution time of the AUTH command and trying to
guess a proportionality between the whole execution time and the
password length: this appears to be mostly unfeasible in the real world.
Also protecting from this attack is not very useful in the case of Redis
as a brute force attack is anyway feasible if the password is too short,
while with a long password makes it not an issue that the attacker knows
the length.
The 'persistence' section of INFO output now contains additional four
fields related to RDB and AOF persistence:
rdb_last_bgsave_time_sec Duration of latest BGSAVE in sec.
rdb_current_bgsave_time_sec Duration of current BGSAVE in sec.
aof_last_rewrite_time_sec Duration of latest AOF rewrite in sec.
aof_current_rewrite_time_sec Duration of current AOF rewrite in sec.
The 'current' fields are set to -1 if a BGSAVE / AOF rewrite is not in
progress. The 'last' fileds are set to -1 if no previous BGSAVE / AOF
rewrites were performed.
Additionally a few fields in the persistence section were renamed for
consistency:
changes_since_last_save -> rdb_changes_since_last_save
bgsave_in_progress -> rdb_bgsave_in_progress
last_save_time -> rdb_last_save_time
last_bgsave_status -> rdb_last_bgsave_status
bgrewriteaof_in_progress -> aof_rewrite_in_progress
bgrewriteaof_scheduled -> aof_rewrite_scheduled
After the renaming, fields in the persistence section start with rdb_ or
aof_ prefix depending on the persistence method they describe.
The field 'loading' and related fields are not prefixed because they are
unique for both the persistence methods.
The motivation for this new commands is to be search in the usage of
Redis for real time statistics. See the article "Fast real time metrics
using Redis".
http://blog.getspool.com/2011/11/29/fast-easy-realtime-metrics-using-redis-bitmaps/
In general Redis strings when used as bitmaps using the SETBIT/GETBIT
command provide a very space-efficient and fast way to store statistics.
For instance in a web application with users, every user can be
associated with a key that shows every day in which the user visited the
web service. This information can be really valuable to extract user
behaviour information.
With Redis bitmaps doing this is very simple just saying that a given
day is 0 (the data the service was put online) and all the next days are
1, 2, 3, and so forth. So with SETBIT it is possible to set the bit
corresponding to the current day every time the user visits the site.
It is possible to take the count of the bit sets on the run, this is
extremely easy using a Lua script. However a fast bit count native
operation can be useful, especially if it can operate on ranges, or when
the string is small like in the case of days (even if you consider many
years it is still extremely little data).
For this reason BITOP was introduced. The command counts the number of
bits set to 1 in a string, with optional range:
BITCOUNT key [start end]
The start/end parameters are similar to GETRANGE. If omitted the whole
string is tested.
Population counting is more useful when bit-level operations like AND,
OR and XOR are avaialble. For instance I can test multiple users to see
the number of days three users visited the site at the same time. To do
this we can take the AND of all the bitmaps, and then count the set bits.
For this reason the BITOP command was introduced:
BITOP [AND|OR|XOR|NOT] dest_key src_key1 src_key2 src_key3 ... src_keyN
In the special case of NOT (that inverts the bits) only one source key
can be passed.
The judicious use of BITCOUNT and BITOP combined can lead to interesting
use cases with very space efficient representation of data.
The implementation provided is still not tested and optimized for speed,
next commits will introduce unit tests. Later the implementation will be
profiled to see if it is possible to gain an important amount of speed
without making the code much more complex.
The INFO output, persistence section, already contained the field
describing the size of the current AOF buffer to flush on disk. However
the other AOF buffer, used to accumulate changes during an AOF rewrite,
was not mentioned in the INFO output.
This commit introduces a new field called aof_rewrite_buffer_length with
the length of the rewrite buffer.
During the AOF rewrite process, the parent process needs to accumulate
the new writes in an in-memory buffer: when the child will terminate the
AOF rewriting process this buffer (that ist the difference between the
dataset when the rewrite was started, and the current dataset) is
flushed to the new AOF file.
We used to implement this buffer using an sds.c string, but sds.c has a
2GB limit. Sometimes the dataset can be big enough, the amount of writes
so high, and the rewrite process slow enough that we overflow the 2GB
limit, causing a crash, documented on github by issue #504.
In order to prevent this from happening, this commit introduces a new
system to accumulate writes, implemented by a linked list of blocks of
10 MB each, so that we also avoid paying the reallocation cost.
Note that theoretically modern operating systems may implement realloc()
simply as a remaping of the old pages, thus with very good performances,
see for instance the mremap() syscall on Linux. However this is not
always true, and jemalloc by default avoids doing this because there are
issues with the current implementation of mremap().
For this reason we are using a linked list of blocks instead of a single
block that gets reallocated again and again.
The changes in this commit lacks testing, that will be performed before
merging into the unstable branch. This fix will not enter 2.4 because it
is too invasive. However 2.4 will log a warning when the AOF rewrite
buffer is near to the 2GB limit.
activeExpireCycle() can consume no more than a few milliseconds per
iteration. This commit improves the precision of the check for the time
elapsed in two ways:
1) We check every 16 iterations instead of the main loop instead of 256.
2) We reset iterations at the start of the function and not every time
we switch to the next database, so the check is correctly performed
every 16 iterations.
A previous commit introduced REDIS_HZ define that changes the frequency
of calls to the serverCron() Redis function. This commit improves
different related things:
1) Software watchdog: now the minimal period can be set according to
REDIS_HZ. The minimal period is two times the timer period, that is:
(1000/REDIS_HZ)*2 milliseconds
2) The incremental rehashing is now performed in the expires dictionary
as well.
3) The activeExpireCycle() function was improved in different ways:
- Now it checks if it already used too much time using microseconds
instead of milliseconds for better precision.
- The time limit is now calculated correctly, in the previous version
the division was performed before of the multiplication resulting in
a timelimit of 0 if HZ was big enough.
- Databases with less than 1% of buckets fill in the hash table are
skipped, because getting random keys is too expensive in this
condition.
4) tryResizeHashTables() is now called at every timer call, we need to
match the number of calls we do to the expired keys colleciton cycle.
5) REDIS_HZ was raised to 100.
Redis uses a function called serverCron() that is very similar to the
timer interrupt of an operating system. This function is used to handle
a number of asynchronous things, like active expired keys collection,
clients timeouts, update of statistics, things related to the cluster
and replication, triggering of BGSAVE and AOF rewrite process, and so
forth.
In the past the timer was called 1 time per second. At some point it was
raised to 10 times per second, but it still was fixed and could not be
changed even at compile time, because different functions called from
serverCron() assumed a given fixed frequency.
This commmit makes the frequency configurable, so that it is simpler to
pick a good tradeoff between overhead of this function (that is usually
very small) and the responsiveness of Redis during a few critical
circumstances where a lot of work is done inside the timer.
An example of such a critical condition is mass-expire of a lot of keys
in the same second. Up to a given percentage of CPU time is used to
perform expired keys collection per expire cylce. Now changing the
REDIS_HZ macro it is possible to do less work but more times per second
in order to block the server for less time.
If this patch will work well in our tests it will enter Redis 2.6-final.
If a large amonut of keys are all expiring about at the same time, the
"active" expired keys collection cycle used to block as far as the
percentage of already expired keys was >= 25% of the total population of
keys with an expire set.
This could block the server even for many seconds in order to reclaim
memory ASAP. The new algorithm uses at max a small amount of
milliseconds per cycle, even if this means reclaiming the memory less
promptly it also means a more responsive server.
We used to reply -ERR ... message ..., now the reply is
instead -MASTERDOWN ... message ... so that it can be distinguished
easily by the other error conditions.
This commit reverts most of c575766202, in
order to use back main stack for signal handling.
The main reason is that otherwise it is completely pointless that we do
a lot of efforts to print the stack trace on crash, and the content of
the stack and registers as well. Using an alternate stack broken this
feature completely.