This change attempts to switch to a hash function which mitigates
the effects of the HashDoS attack (a denial of service attack trying
to force data structures into their worst case behavior) while at the
same time providing Redis with a hash function that does not expect the
input data to be word aligned, a condition no longer true now that sds.c
strings have a variable length header.
Note that even when using a hash function for which collisions cannot
be generated without knowing the seed, implementation details or the
exposure of the seed in an indirect way (for example the ability to add
elements to a Set and check the order in which Redis returns them with
SMEMBERS) may make the attacker's life simpler in the process of trying
to guess the correct seed. In that case the next step would be to switch
to a log(N) data structure when too many items are detected in a single
bucket: this seems like overkill in the case of Redis.
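A minimal, self-contained sketch of the seeding idea follows. The keyed
hash below is only a byte-at-a-time stand-in for illustration (it is not
SipHash and is not collision resistant), and all names are hypothetical:
the seed is read once at startup so colliding keys cannot be precomputed
by an attacker, and the input is consumed one byte at a time, so no
alignment of the input buffer is required.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint8_t hash_seed[16];   /* process-wide secret seed */

    /* Read the seed once at startup. */
    static void initHashSeed(void) {
        FILE *fp = fopen("/dev/urandom", "rb");
        if (fp == NULL || fread(hash_seed, sizeof(hash_seed), 1, fp) != 1)
            memset(hash_seed, 0xAB, sizeof(hash_seed)); /* weak fallback */
        if (fp) fclose(fp);
    }

    /* Keyed, byte-at-a-time hash: works on unaligned input and produces
     * an endian-neutral result. Stand-in for siphash(key, len, seed). */
    static uint64_t keyedHash(const void *key, size_t len) {
        const uint8_t *p = key;
        uint64_t h = 0xcbf29ce484222325ULL;
        for (size_t i = 0; i < sizeof(hash_seed); i++)
            h = (h ^ hash_seed[i]) * 0x100000001b3ULL;
        for (size_t i = 0; i < len; i++)
            h = (h ^ p[i]) * 0x100000001b3ULL;
        return h;
    }

    int main(void) {
        initHashSeed();
        printf("%016llx\n", (unsigned long long)keyedHash("foo", 3));
        return 0;
    }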
SPEED REGRESSION TESTS:
In order to verify that switching from MurmurHash to SipHash had
no impact on speed, a set of benchmarks involving fast insertion
of 5 million keys was performed.
The results show that Redis with SipHash, under high pipelining
conditions, is about 4% slower compared to using the previous hash
function. However this could partially be related to the fact that the
current implementation does not attempt to hash whole words at a time
but reads single bytes, in order to produce an output which is
endian-neutral and at the same time works on systems where unaligned
memory accesses are a problem.
Further x86-specific optimizations should be tested; the function
may easily reach the same level as MurmurHash2 if a few optimizations
are performed.
This is of great interest because it allows us to print debugging
information that can be useful when debugging, as in the
following example:
serverPanic("Unexpected encoding for object %d, %d",
obj->type, obj->encoding);
The commit improves ziplistRepr() and adds a new debugging subcommand so
that we can trigger the dump directly from the Redis API.
This command capability was used while investigating issue #3684.
The gist of the changes is that now, partial resynchronizations between
slaves and masters (without the need of a full resync with RDB transfer
and so forth) work in a number of cases where they were previously
impossible. For instance:
1. When a slave is promoted to master, the slaves of the old master can
partially resynchronize with the new master.
2. Chained slaves (slaves of slaves) can be moved to replicate to other
slaves or the master itself, without requiring a full resync.
3. The master itself, after being turned into a slave, is able to
partially resynchronize with the new master, when it joins replication
again.
In order to obtain this, the following main changes were made:
* Slaves now also have a replication backlog, not just masters.
* Same stream replication for all the slaves and sub-slaves. The
replication stream is identical from the top level master to its slaves
and is also the same from the slaves to their sub-slaves and so forth.
This means that if a slave is later promoted to master, it has the
same replication backlog, and can partially resynchronize with its
slaves (that were previously slaves of the old master).
* A given replication history is no longer identified by the `runid` of
a Redis node. There is instead a `replication ID` which changes every
time the instance has a new history no longer coherent with the past
one. So, for example, slaves publish the same replication history as
their master; however when they are turned into masters, they publish
a new replication ID, but still remember the old ID, so that they are
able to partially resynchronize with slaves of the old master (up to a
given offset). A sketch of this bookkeeping is shown after this list.
* The replication protocol was slightly modified so that a new extended
+CONTINUE reply from the master is able to inform the slave of a
replication ID change.
* REPLCONF CAPA is used in order to notify masters that a slave is able
to understand the new +CONTINUE reply.
* The RDB file was extended with an auxiliary field that is able to
select a given DB after loading in the slave, so that the slave can
continue receiving the replication stream from the point it was
disconnected without requiring the master to insert "SELECT" statements.
This is useful in order to guarantee the "same stream" property, because
the slave must be able to accumulate an identical backlog.
* Slave pings to sub-slaves are now sent in a special form when the
top-level master is disconnected, in order not to interfere with the
replication stream. We just use out of band "\n" bytes as in other parts
of the Redis protocol.
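The following is a minimal, self-contained sketch of the replication ID
bookkeeping mentioned above (field and function names are simplified
assumptions, not the actual replication.c code): when a slave is
promoted it starts publishing a new ID, but remembers the old ID and the
offset up to which it is valid, so that former siblings can still
partially resynchronize.

    #include <stdio.h>
    #include <string.h>

    #define REPLID_LEN 40

    struct replState {
        char replid[REPLID_LEN+1];       /* current replication ID */
        char replid2[REPLID_LEN+1];      /* previous ID, still accepted... */
        long long second_replid_offset;  /* ...up to this offset */
        long long master_repl_offset;    /* current replication offset */
    };

    /* Called when a slave is turned into a master: keep the old history
     * reachable for partial resync, then start a new one. */
    static void shiftReplicationId(struct replState *r, const char *newid) {
        memcpy(r->replid2, r->replid, sizeof(r->replid2));
        r->second_replid_offset = r->master_repl_offset + 1;
        snprintf(r->replid, sizeof(r->replid), "%s", newid);
    }

    /* A PSYNC request can be served partially if it refers either to our
     * current history, or to the previous one within the valid offset. */
    static int canPartialResync(const struct replState *r,
                                const char *reqid, long long reqoffset) {
        if (strcmp(reqid, r->replid) == 0) return 1;
        if (strcmp(reqid, r->replid2) == 0 &&
            reqoffset <= r->second_replid_offset) return 1;
        return 0;
    }

    int main(void) {
        struct replState r = {"old-id", "", 0, 1000};
        shiftReplicationId(&r, "new-id");
        printf("partial resync ok: %d\n", canPartialResync(&r, "old-id", 900));
        return 0;
    }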
An old design document is available here:
https://gist.github.com/antirez/ae068f95c0d084891305
However the implementation is not identical to the description because,
during the work to implement it, different changes were needed in order
to make things work well.
The old test, designed to perform an invertible transformation on the
bits in order to avoid touching the original memory content, was not as
effective as redis-server --test-memory: the former often reported OK
while the latter was able to spot the error.
So the test was substituted with one that may perform better; however
the new one must back up the memory being tested, so it tests memory in
small pieces. This limits its effectiveness because of the CPU caches.
However some attempt is made to trash the CPU cache between the fill
and the check stages, but unfortunately not for the addressing test.
We'll see if this test is able to find errors where the old one failed.
There are some cases of printing an unsigned integer with the %d
conversion specifier and vice versa (a signed integer with the %u
specifier).
Patch by Sergey Polovko. Backported to Redis from Disque.
Currently this feature is only accessible via DEBUG for testing, since
otherwise, depending on the instance configuration, a given script works
or is broken, which is against the Redis philosophy.
The command reports information about the hash table internal state
representing the specified database ID.
This can be used in order to investigate rehashings, memory usage issues
and for other debugging purposes.
It's possible large objects could be larger than 'int', so let's
upgrade all size counters to ssize_t.
This also fixes rdbSaveObject serialized bytes calculation.
Since entire serializations of data structures can be large,
we don't want to limit their calculated size to a 32 bit signed max.
This commit increases object size calculation and
cascades the change back up to serializedlength printing.
Before:
127.0.0.1:6379> debug object hihihi
... encoding:quicklist serializedlength:-2147483559 ...
After:
127.0.0.1:6379> debug object hihihi
... encoding:quicklist serializedlength:2147483737 ...
Adds: ql_compressed (boolean, 1 if compression enabled for list, 0
otherwise)
Adds: ql_uncompressed_size (actual uncompressed size of all quicklistNodes)
Adds: ql_ziplist_max (quicklist max ziplist fill factor)
Compression ratio of the list is then ql_uncompressed_size / serializedlength
We report ql_uncompressed_size for all quicklists because serializedlength
is a _compressed_ representation anyway.
Sample output from a large list:
127.0.0.1:6379> llen abc
(integer) 38370061
127.0.0.1:6379> debug object abc
Value at:0x7ff97b51d140 refcount:1 encoding:quicklist serializedlength:19878335 lru:9718164 lru_seconds_idle:5 ql_nodes:21945 ql_avg_node:1748.46 ql_ziplist_max:-2 ql_compressed:0 ql_uncompressed_size:1643187761
(1.36s)
The 1.36s result time is because rdbSavedObjectLen() is serializing the
object, not because of any new stats reporting.
If we run DEBUG OBJECT on a compressed list, DEBUG OBJECT takes almost *zero*
time because rdbSavedObjectLen() reuses already-compressed ziplists:
127.0.0.1:6379> debug object abc
Value at:0x7fe5c5800040 refcount:1 encoding:quicklist serializedlength:19878335 lru:9718109 lru_seconds_idle:5 ql_nodes:21945 ql_avg_node:1748.46 ql_ziplist_max:-2 ql_compressed:1 ql_uncompressed_size:1643187761
Lets the user set how many nodes to *not* compress.
We can specify a compression "depth" of how many nodes
to leave uncompressed on each end of the quicklist.
Depth 0 = disable compression.
Depth 1 = only leave head/tail uncompressed.
- (read as: "skip 1 node on each end of the list before compressing")
Depth 2 = leave head, head->next, tail->prev, tail uncompressed.
- ("skip 2 nodes on each end of the list before compressing")
Depth 3 = Depth 2 + head->next->next + tail->prev->prev
- ("skip 3 nodes...")
etc.
This also:
- updates RDB storage to use native quicklist compression (if node is
already compressed) instead of uncompressing, generating the RDB string,
then re-compressing the quicklist node.
- internalizes the "fill" parameter for the quicklist so we don't
need to pass it to _every_ function. Now it's just a property of
the list.
- allows a runtime-configurable compression option, so we can
expose a compression parameter in the configuration file (see the
example after this list) if people want to trade slight
request-per-second performance for up to 90%+ memory savings in some
situations.
- updates the quicklist tests to do multiple passes: 200k+ tests now.
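As an illustration of the depth setting, assuming the redis.conf
directive name this option is exposed as (list-compress-depth), the
configuration would look like this:

    # Keep the head and tail node of each list uncompressed; compress
    # everything in between (depth 1).
    list-compress-depth 1

    # Disable list compression entirely (depth 0, the default).
    list-compress-depth 0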
Added fields 'ql_nodes' and 'ql_avg_per_node'.
ql_nodes is the number of quicklist nodes in the quicklist.
ql_avg_per_node is the average fill level of each quicklist node (LLEN / ql_nodes).
Sample output:
127.0.0.1:6379> DEBUG object b
Value at:0x7fa42bf2fed0 refcount:1 encoding:quicklist serializedlength:18489 lru:8983768 lru_seconds_idle:3 ql_nodes:430 ql_avg_per_node:511.73
127.0.0.1:6379> llen b
(integer) 220044
Key expiration on slaves is orchestrated by the master. Sometimes the
master will send the synthesized DEL to expire keys on the slave with a
non trivial delay (when the key is not accessed, only the incremental
expiry algorithm will expire it in the background).
During that time, a key is logically expired, but slaves still return
the key if you GET (or whatever) it. This is bad behavior.
However we can't simply trust the slave view of the key, since we need
the master to be able to send write commands to update the slave data
set, and DELs should only happen when the key is expired in the master
in order to ensure consistency.
However 99.99% of the issues with this behavior occur when a client that
is not a master sends a read-only command. In this case we are safe and
can consider the key as non existing.
This commit does a few changes in order to make this sane:
1. lookupKeyRead() is modified in order to return NULL if the above
conditions are met.
2. Calls to lookupKeyRead() in commands actually writing to the data set
are replaced with calls to lookupKeyWrite().
There are redundant checks, so for example, if in "2" something was
overlooked, we should still be safe, since anyway, when the master
writes, the behavior is to not care about what expireIfNeeded()
returns.
This commit is related to #1768, #1770, #2131.
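A minimal, self-contained sketch of the lookupKeyRead() logic described
above (simplified stand-ins for the Redis internals; helper and field
names are assumptions, not the actual db.c code): a logically expired
key is reported as missing to read-only callers on a slave, while the
actual deletion is still left to the DEL the master will send.

    #include <stddef.h>
    #include <stdio.h>
    #include <time.h>

    typedef struct { const char *name; long long expire_at_ms; } keyEntry;
    typedef struct { int is_slave; int caller_is_master; } serverState;

    static long long mstime(void) { return (long long)time(NULL) * 1000; }

    static int keyIsLogicallyExpired(const keyEntry *k) {
        return k->expire_at_ms != -1 && k->expire_at_ms < mstime();
    }

    /* Read path: on a slave we cannot delete the key ourselves, but if the
     * caller is not our master we can still safely report it as missing. */
    static const keyEntry *lookupKeyRead(const keyEntry *k,
                                         const serverState *s) {
        if (k == NULL) return NULL;
        if (keyIsLogicallyExpired(k)) {
            if (!s->is_slave) return NULL;        /* master: expired for real */
            if (!s->caller_is_master) return NULL; /* slave + normal client */
            /* slave + master as caller: fall through, the master knows best */
        }
        return k;
    }

    int main(void) {
        keyEntry k = {"foo", 1};                  /* expired long ago */
        serverState slave = {1, 0};
        printf("visible: %s\n", lookupKeyRead(&k, &slave) ? "yes" : "no");
        return 0;
    }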
Because of (not so) recent Redis changes, the LRU time unit reported
internally is now milliseconds, not seconds, but the DEBUG OBJECT output
was still claiming seconds while providing milliseconds.
However OBJECT IDLETIME was working as expected, which is the correct
API to use.
Both upstart and systemd provide a way for daemons to
be supervised, as well as a mechanism for them to
signal their readiness status.
This patch provides compatibility with this functionality while
not interfering with other methods.
With this, it will be possible to use `expect stop` with upstart
and `Type=notify` with systemd.
A more detailed explanation of the mechanism can be found here:
http://spootnik.org/entries/2014/11/09_pid-tracking-in-modern-init-systems.html
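For the systemd side, readiness is signaled by writing READY=1 to the
datagram socket named in the NOTIFY_SOCKET environment variable. A
minimal, self-contained sketch of that notification (not the exact Redis
code; the function name is made up) follows. For the upstart `expect
stop` case the equivalent is for the daemon to raise SIGSTOP on itself
once ready, after which upstart resumes it with SIGCONT.

    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Returns 1 if the notification was sent, 0 otherwise (e.g. when not
     * running under systemd, in which case NOTIFY_SOCKET is not set). */
    static int notifySystemdReady(void) {
        const char *path = getenv("NOTIFY_SOCKET");
        if (path == NULL || path[0] == '\0') return 0;

        struct sockaddr_un sa;
        if (strlen(path) >= sizeof(sa.sun_path)) return 0;
        memset(&sa, 0, sizeof(sa));
        sa.sun_family = AF_UNIX;
        strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);
        if (sa.sun_path[0] == '@') sa.sun_path[0] = '\0'; /* abstract name */
        socklen_t addrlen = offsetof(struct sockaddr_un, sun_path) + strlen(path);

        int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
        if (fd == -1) return 0;
        const char *msg = "READY=1";
        ssize_t sent = sendto(fd, msg, strlen(msg), 0,
                              (struct sockaddr *)&sa, addrlen);
        close(fd);
        return sent == (ssize_t)strlen(msg);
    }

    int main(void) { return notifySystemdReady() ? 0 : 1; }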
The old DEBUG POPULATE form for automatic creation of test keys is:
DEBUG POPULATE <count>
Now an additional form is available:
DEBUG POPULATE <count> <prefix>
When prefix is not specified, it defaults to "key", so the keys are
named incrementally from key:0 to key:<count-1>. Otherwise the specified
prefix is used instead of "key".
The command is useful in order to populate different Redis instances
with key names guaranteed not to collide. There are other debugging
uses, for example it is possible to add N additional keys using a count
of N and a random prefix at every call.
If we are in the signal handler, we don't want to handle
the signal again. In extreme cases, this can cause a stack overflow
and segfault Redis.
Fixes #1771
This commit adds peer ID caching in the client structure, plus an API
change and the use of sdsMakeRoomFor() in order to improve the
reallocation pattern when generating the CLIENT LIST output.
Both changes account for a very significant speedup.
The new "error" subcommand of the DEBUG command can reply with an user
selected error, specified as its sole argument:
DEBUG ERROR "LOADING please wait..."
The error is generated by just prefixing the command argument with a "-"
character, and replacing newlines with spaces (since error replies can't
include newlines).
The goal of the command is to help in client libraries' unit tests by
making it simple to simulate a command call triggering a given error.
getKeysFromCommand() is designed to be called with the command arguments
passing the basic arity checks described in the command table.
DEBUG CMDKEYS must provide the same guarantees so that calling
getKeysFromCommand() is safe.
Examples:
redis 127.0.0.1:6379> debug cmdkeys set foo bar
1) "foo"
redis 127.0.0.1:6379> debug cmdkeys mget a b c
1) "a"
2) "b"
3) "c"
redis 127.0.0.1:6379> debug cmdkeys zunionstore foo 2 a b
1) "a"
2) "b"
3) "foo"
redis 127.0.0.1:6379> debug cmdkeys ping
(empty list or set)
The Redis hash table implementation has many non-blocking features like
incremental rehashing; however, while deleting a large hash table there
was no way to have a callback called to do some incremental work.
This commit adds this support, as an optional callback argument to
dictEmpty() that is currently called at a fixed interval (one time every
65k deletions).
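A minimal, self-contained sketch of the pattern (not the actual dict.c
code; the table layout, names and exact interval check are simplified
assumptions):

    #include <stdio.h>
    #include <stdlib.h>

    typedef void (emptyCallback)(void *privdata);

    /* Free every slot of a (simplified) table, invoking the callback once
     * every 65536 deletions so the caller can do incremental work, for
     * example handling events while a huge DB is being flushed. */
    static void tableEmpty(void **slots, size_t used, void *privdata,
                           emptyCallback *callback) {
        for (size_t i = 0; i < used; i++) {
            if (callback && (i & 65535) == 0) callback(privdata);
            free(slots[i]);
            slots[i] = NULL;
        }
    }

    static void progressCallback(void *privdata) {
        (void)privdata;
        fprintf(stderr, "still emptying...\n");
    }

    int main(void) {
        enum { N = 200000 };
        static void *slots[N];
        for (size_t i = 0; i < N; i++) slots[i] = malloc(16);
        tableEmpty(slots, N, NULL, progressCallback);
        return 0;
    }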
Previously two string encodings were used for string objects:
1) REDIS_ENCODING_RAW: a string object with obj->ptr pointing to an sds
string.
2) REDIS_ENCODING_INT: a string object where the obj->ptr void pointer
is cast to a long.
This commit introduces an experimental new encoding called
REDIS_ENCODING_EMBSTR that implements an object represented by an sds
string that is not modifiable but allocated in the same memory chunk as
the robj structure itself.
The chunk looks like the following:
+--------------+-----------+------------+--------+----+
| robj data... | robj->ptr | sds header | string | \0 |
+--------------+-----+-----+------------+--------+----+
                     |                  ^
                     +------------------+
The robj->ptr points to the contiguous sds string data, so the object
can be manipulated with the same functions used to manipulate plain
string objects, however we need just one malloc and one free in order to
allocate or release this kind of object. Moreover it has better cache
locality.
This new allocation strategy should benefit both memory usage and
performance. A performance gain between 60 and 70% was observed during
micro-benchmarks, however there is more work to do to evaluate the
performance impact and the memory usage behavior.
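A minimal, self-contained sketch of the single-allocation layout
(simplified stand-ins for the real robj and sds header, not the actual
object.c code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct sdshdr { int len; int free; char buf[]; };
    typedef struct robj { int type; int encoding; void *ptr; } robj;

    /* Object header, sds header and string bytes in one malloc'd chunk. */
    robj *createEmbeddedStringObject(const char *s, size_t len) {
        robj *o = malloc(sizeof(robj) + sizeof(struct sdshdr) + len + 1);
        struct sdshdr *sh = (void *)(o + 1); /* sds header right after robj */
        o->type = 0;                         /* "string" type in this sketch */
        o->encoding = 1;                     /* "embstr" in this sketch */
        o->ptr = sh->buf;                    /* points inside the same chunk */
        sh->len = (int)len;
        sh->free = 0;
        memcpy(sh->buf, s, len);
        sh->buf[len] = '\0';
        return o;                            /* released with a single free() */
    }

    int main(void) {
        robj *o = createEmbeddedStringObject("hello", 5);
        printf("%s\n", (char *)o->ptr);
        free(o);                             /* one allocation, one free */
        return 0;
    }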
When the semantics changed from logfile = NULL to logfile = "" to log
into standard output, no proper change was made to logStackTrace() to
make it able to work with the new setup.
This commit fixes the issue.
SUSv3 says that:
The useconds argument shall be less than one million. If the value of
useconds is 0, then the call has no effect.
and actually NetBSD's implementation rejects such a value with EINVAL.
So nanosleep(), which has no such limitation, is used instead.
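A minimal sketch of the kind of replacement (the helper name is made
up):

    #include <time.h>

    /* Sleep for the given number of microseconds; unlike usleep() on some
     * systems, nanosleep() does not require the value to be < 1000000. */
    static void sleep_microseconds(unsigned long long us) {
        struct timespec req;
        req.tv_sec = (time_t)(us / 1000000ULL);
        req.tv_nsec = (long)(us % 1000000ULL) * 1000L;
        nanosleep(&req, NULL);
    }

    int main(void) { sleep_microseconds(1500000); return 0; }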
REDIS_HZ is the frequency our serverCron() function is called with.
A more frequent call to this function results in less latency when the
server is trying to handle very expensive background operations like
mass expires of a lot of keys at the same time.
Redis 2.4 used to have an HZ of 10. This was good enough with almost
every setup, but the incremental key expiration algorithm worked a
bit better under *extreme* pressure when HZ was set to 100 for Redis
2.6.
However for most users a latency spike of 30 milliseconds when millions
of keys are expiring at the same time is acceptable; on the other hand a
default HZ of 100 in Redis 2.6 was causing idle instances to use some
CPU time compared to Redis 2.4. The CPU usage was in the order of 0.3%
for an idle instance, however this is a shame, as more energy is
consumed by the server, even if no important resources are.
This commit introduces HZ as a runtime parameter that can be queried by
INFO or CONFIG GET, and can be modified with CONFIG SET. At the same
time the default frequency is set back to 10.
In this way we default to a sane value of 10, but allow users to
easily switch to values up to 500 for near real-time applications, if
needed and if they are willing to pay this small CPU usage penalty.
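For example (the parameter is simply named "hz"):

    127.0.0.1:6379> CONFIG GET hz
    1) "hz"
    2) "10"
    127.0.0.1:6379> CONFIG SET hz 100
    OK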
The idea is to be able to identify a build in a unique way, so for
instance after a bug report we can recognize that the build is the one
of a popular Linux distribution and perform the debugging in the same
environment.
1) We no longer test location by location, otherwise the CPU write cache
completely defeats the purpose of the test.
2) We still need a memory test that operates in steps from the first to
the last location in order to never hit the cache, but that is still
able to retain the memory content.
This was tested using a Linux box containing a bad memory module with a
single bit error (always zero).
So the final solution has an error propagation step, which is:
1) Invert bits at every location.
2) Swap adjacent locations.
3) Swap adjacent locations again.
4) Invert bits at every location.
5) Swap adjacent locations.
6) Swap adjacent locations again.
Before and after these steps, and after step 4, a CRC64 checksum is computed.
If the three CRC64 checksums don't match, a memory error was detected.
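A minimal, self-contained sketch of the idea over a plain buffer (using
a trivial checksum rather than CRC64, and not the actual memtest.c
code):

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t checksum(const uint64_t *m, size_t words) {
        uint64_t h = 0;
        for (size_t i = 0; i < words; i++) h = h * 1000003 + m[i];
        return h;
    }

    static void invert(uint64_t *m, size_t words) {
        for (size_t i = 0; i < words; i++) m[i] = ~m[i];
    }

    static void swap_adjacent(uint64_t *m, size_t words) {
        for (size_t i = 0; i + 1 < words; i += 2) {
            uint64_t t = m[i]; m[i] = m[i+1]; m[i+1] = t;
        }
    }

    int main(void) {
        static uint64_t mem[4096];
        for (size_t i = 0; i < 4096; i++) mem[i] = i * 2654435761u;

        uint64_t c1 = checksum(mem, 4096);   /* before the steps */
        invert(mem, 4096);                   /* step 1 */
        swap_adjacent(mem, 4096);            /* step 2 */
        swap_adjacent(mem, 4096);            /* step 3 */
        invert(mem, 4096);                   /* step 4 */
        uint64_t c2 = checksum(mem, 4096);   /* after step 4 */
        swap_adjacent(mem, 4096);            /* step 5 */
        swap_adjacent(mem, 4096);            /* step 6 */
        uint64_t c3 = checksum(mem, 4096);   /* after all steps */

        puts(c1 == c2 && c2 == c3 ? "memory OK" : "memory error detected");
        return 0;
    }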
We use this new bio.c feature in order to stop our I/O threads if there
is a memory test to do on crash. In this case we don't want anything
other than the main thread to run, otherwise the other threads may mess
with the heap and the memory test would report a false positive.
The previous implementation of zmalloc.c was not able to handle out of
memory in an application-specific way. It just logged an error on
standard error, and aborted.
The result was that in the case of an actual out of memory in Redis
where malloc returned NULL (In Linux this actually happens under
specific overcommit policy settings and/or with no or little swap
configured) the error was not properly logged in the Redis log.
This commit fixes this problem, fixing issue #509.
Now the out of memory is properly reported in the Redis log and a stack
trace is generated.
The approach used is to provide a configurable out of memory handler
to zmalloc (otherwise the default one logging the event on the
standard output is used).
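A minimal, self-contained sketch of the pattern (not the actual
zmalloc.c code; the handler Redis installs logs through the normal
server log and generates a stack trace, while here we only print to
stderr):

    #include <stdio.h>
    #include <stdlib.h>

    static void (*zmalloc_oom_handler)(size_t) = NULL;

    /* Let the application decide how an out of memory event is handled. */
    void zmalloc_set_oom_handler(void (*handler)(size_t)) {
        zmalloc_oom_handler = handler;
    }

    void *zmalloc(size_t size) {
        void *p = malloc(size);
        if (p == NULL) {
            if (zmalloc_oom_handler) {
                zmalloc_oom_handler(size);  /* application-specific report */
            } else {
                fprintf(stderr,
                        "zmalloc: Out of memory trying to allocate %zu bytes\n",
                        size);
            }
            abort();
        }
        return p;
    }

    static void appOutOfMemoryHandler(size_t size) {
        fprintf(stderr, "[server log] Out Of Memory allocating %zu bytes!\n",
                size);
    }

    int main(void) {
        zmalloc_set_oom_handler(appOutOfMemoryHandler);
        free(zmalloc(32));
        return 0;
    }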
The ziplist -> hashtable conversion code is triggered every time a hash
value must be promoted to a full hash table because the number or size
of its elements reached the threshold.
If a problem in the ziplist causes the same field to be present
multiple times, the assertion of successful addition of the element
inside the hash table will fail, crashing the server with a failed
assertion, but providing little information about the problem.
This code adds a new logging function to perform the hex dump of binary
data, and makes sure that the ziplist -> hashtable conversion code uses
this new logging facility to dump the content of the ziplist when the
assertion fails.
This change was originally made in order to investigate issue #547.
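A minimal, self-contained sketch of such a hex dump helper (name and
output format are assumptions; the real function logs through the Redis
logging facility):

    #include <stddef.h>
    #include <stdio.h>

    /* Dump `len` bytes of binary data as hex, 32 bytes per line. */
    static void logHexDump(const char *descr, const void *value, size_t len) {
        const unsigned char *v = value;
        fprintf(stderr, "%s (hexdump of %zu bytes):\n", descr, len);
        for (size_t i = 0; i < len; i++) {
            fprintf(stderr, "%02x", v[i]);
            if ((i + 1) % 32 == 0 || i + 1 == len) fputc('\n', stderr);
        }
    }

    int main(void) {
        unsigned char ziplist[] = {0x0b, 0x00, 0x00, 0x00, 0x0a, 0xff};
        logHexDump("ziplist before conversion", ziplist, sizeof(ziplist));
        return 0;
    }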
A previous commit introduced REDIS_HZ define that changes the frequency
of calls to the serverCron() Redis function. This commit improves
different related things:
1) Software watchdog: now the minimal period can be set according to
REDIS_HZ. The minimal period is two times the timer period, that is:
(1000/REDIS_HZ)*2 milliseconds (for instance, 20 milliseconds when
REDIS_HZ is 100).
2) The incremental rehashing is now performed in the expires dictionary
as well.
3) The activeExpireCycle() function was improved in different ways:
- Now it checks if it already used too much time using microseconds
instead of milliseconds, for better precision.
- The time limit is now calculated correctly; in the previous version
the division was performed before the multiplication, resulting in
a time limit of 0 if HZ was big enough.
- Databases with less than 1% of the buckets filled in the hash table
are skipped, because getting random keys is too expensive in this
condition.
4) tryResizeHashTables() is now called at every timer call; we need to
match the number of calls we do to the expired keys collection cycle.
5) REDIS_HZ was raised to 100.
This fixes compilation on FreeBSD (and possibly other systems) by
not using ucontext_t at all if HAVE_BACKTRACE is not defined.
Also the ifdefs used to get the registers are modified to explicitly
test for the operating system at the first level, and the architecture
at the second level of nesting.
Now the rdbSave* functions return the number of bytes written (or
required to write) when serializing a Redis object, so writing to
/dev/null and using ftell (which doesn't work on FreeBSD) isn't needed
anymore.
networking related stuff moved into networking.c
moved more code
more work on layout of source code
SDS instantaneous memory saving. By Pieter and Salvatore at VMware ;)
cleanly compiling again after the first split, now splitting it in more C files
moving more things around... work in progress
split replication code
splitting more
Sets split
Hash split
replication split
even more splitting
more splitting
minor change