During the AOF rewrite process, the parent process needs to accumulate
the new writes in an in-memory buffer: when the child will terminate the
AOF rewriting process this buffer (that ist the difference between the
dataset when the rewrite was started, and the current dataset) is
flushed to the new AOF file.
We used to implement this buffer using an sds.c string, but sds.c has a
2GB limit. Sometimes the dataset can be big enough, the amount of writes
so high, and the rewrite process slow enough that we overflow the 2GB
limit, causing a crash, documented on github by issue #504.
In order to prevent this from happening, this commit introduces a new
system to accumulate writes, implemented by a linked list of blocks of
10 MB each, so that we also avoid paying the reallocation cost.
Note that theoretically modern operating systems may implement realloc()
simply as a remaping of the old pages, thus with very good performances,
see for instance the mremap() syscall on Linux. However this is not
always true, and jemalloc by default avoids doing this because there are
issues with the current implementation of mremap().
For this reason we are using a linked list of blocks instead of a single
block that gets reallocated again and again.
The changes in this commit lacks testing, that will be performed before
merging into the unstable branch. This fix will not enter 2.4 because it
is too invasive. However 2.4 will log a warning when the AOF rewrite
buffer is near to the 2GB limit.
A previous commit introduced REDIS_HZ define that changes the frequency
of calls to the serverCron() Redis function. This commit improves
different related things:
1) Software watchdog: now the minimal period can be set according to
REDIS_HZ. The minimal period is two times the timer period, that is:
(1000/REDIS_HZ)*2 milliseconds
2) The incremental rehashing is now performed in the expires dictionary
as well.
3) The activeExpireCycle() function was improved in different ways:
- Now it checks if it already used too much time using microseconds
instead of milliseconds for better precision.
- The time limit is now calculated correctly, in the previous version
the division was performed before of the multiplication resulting in
a timelimit of 0 if HZ was big enough.
- Databases with less than 1% of buckets fill in the hash table are
skipped, because getting random keys is too expensive in this
condition.
4) tryResizeHashTables() is now called at every timer call, we need to
match the number of calls we do to the expired keys colleciton cycle.
5) REDIS_HZ was raised to 100.
Redis uses a function called serverCron() that is very similar to the
timer interrupt of an operating system. This function is used to handle
a number of asynchronous things, like active expired keys collection,
clients timeouts, update of statistics, things related to the cluster
and replication, triggering of BGSAVE and AOF rewrite process, and so
forth.
In the past the timer was called 1 time per second. At some point it was
raised to 10 times per second, but it still was fixed and could not be
changed even at compile time, because different functions called from
serverCron() assumed a given fixed frequency.
This commmit makes the frequency configurable, so that it is simpler to
pick a good tradeoff between overhead of this function (that is usually
very small) and the responsiveness of Redis during a few critical
circumstances where a lot of work is done inside the timer.
An example of such a critical condition is mass-expire of a lot of keys
in the same second. Up to a given percentage of CPU time is used to
perform expired keys collection per expire cylce. Now changing the
REDIS_HZ macro it is possible to do less work but more times per second
in order to block the server for less time.
If this patch will work well in our tests it will enter Redis 2.6-final.
If a large amonut of keys are all expiring about at the same time, the
"active" expired keys collection cycle used to block as far as the
percentage of already expired keys was >= 25% of the total population of
keys with an expire set.
This could block the server even for many seconds in order to reclaim
memory ASAP. The new algorithm uses at max a small amount of
milliseconds per cycle, even if this means reclaiming the memory less
promptly it also means a more responsive server.
We used to reply -ERR ... message ..., now the reply is
instead -MASTERDOWN ... message ... so that it can be distinguished
easily by the other error conditions.
Two limits are added:
1) Up to SLOWLOG_ENTRY_MAX_ARGV arguments are logged.
2) Up to SLOWLOG_ENTRY_MAX_STRING bytes per argument are logged.
3) slowlog-max-len is set to 128 by default (was 1024).
The number of remaining arguments / bytes is logged in the entry
so that the user can understand better the nature of the logged command.
This new field counts all the times Redis is configured with AOF enabled and
fsync policy 'everysec', but the previous fsync performed by the
background thread was not able to complete within two seconds, forcing
Redis to perform a write against the AOF file while the fsync is still
in progress (likely a blocking operation).
This commit introduces support for read only slaves via redis.conf and CONFIG GET/SET commands. Also various semantical fixes are implemented here:
1) MULTI/EXEC with only read commands now work where the server is into a state where writes (or commands increasing memory usage) are not allowed. Before this patch everything inside a transaction would fail in this conditions.
2) Scripts just calling read-only commands will work against read only
slaves, when the server is out of memory, or when persistence is into an
error condition. Before the patch EVAL always failed in this condition.
The Run ID is a field that identifies a single execution of the Redis
server. It can be useful for many purposes as it makes easy to detect if
the instance we are talking about is the same, or if it is a different
one or was rebooted. An application of run_id will be in the partial
synchronization of replication, where a slave may request a partial sync
from a given offset only if it is talking with the same master. Another
application is in failover and monitoring scripts.
Redis now refuses accepting write queries if RDB persistence is
configured, but RDB snapshots can't be generated for some reason.
The status of the latest background save operation is now exposed
in the INFO output as well. This fixes issue #90.
The new code uses a more generic data structure to describe redis operations.
The new design allows for multiple alsoPropagate() calls within the scope of a
single command, that is useful in different contexts. For instance there
when there are multiple clients doing BRPOPLPUSH against the same list,
and a variadic LPUSH is performed against this list, the blocked clients
will both be served, and we should correctly replicate multiple LPUSH
commands after the replication of the current command.
1) sendReplyToClient() now no longer stops transferring data to a single
client in the case we are out of memory (maxmemory-wise).
2) in processCommand() the idea of we being out of memory is no longer
the naive zmalloc_used_memory() > server.maxmemory. To say if we can
accept or not write queries is up to the return value of
freeMemoryIfNeeded(), that has full control about that.
3) freeMemoryIfNeeded() now does its math without considering output
buffers size. But at the same time it can't let the output buffers to
put us too much outside the max memory limit, so at the same time it
makes sure there is enough effort into delivering the output buffers to
the slaves, calling the write handler directly.
This three changes are the result of many tests, I found (partially
empirically) that is the best way to address the problem, but maybe
we'll find better solutions in the future.