If a thread unblocks a client blocked in a module command, by using the
RedisMdoule_UnblockClient() API, the event loop may not be awaken until
the next timeout of the multiplexing API or the next unrelated I/O
operation on other clients. We actually want the client to be served
ASAP, so a mechanism is needed in order for the unblocking API to inform
Redis that there is a client to serve ASAP.
This commit fixes the issue using the old trick of the pipe: when a
client needs to be unblocked, a byte is written in a pipe. When we run
the list of clients blocked in modules, we consume all the bytes
written in the pipe. Writes and reads are performed inside the context
of the mutex, so no race is possible in which we consume the bytes that
are actually related to an awake request for a client that should still
be put into the list of clients to unblock.
It was verified that after the fix the server handles the blocked
clients with the expected short delay.
Thanks to @dvirsky for understanding there was such a problem and
reporting it.
Testing with Solaris C compiler (SunOS 5.11 11.2 sun4v sparc sun4v)
there were issues compiling due to atomicvar.h and running the
tests also failed because of "tail" usage not conform with Solaris
tail implementation. This commit fixes both the issues.
For performance reasons we use a reduced rounds variant of
SipHash. This should still provide enough protection and the
effects in the hash table distribution are non existing.
If some real world attack on SipHash 1-2 will be found we can
trivially switch to something more secure. Anyway it is a
big step forward from Murmurhash, for which it is trivial to
generate *seed independent* colliding keys... The speed
penatly introduced by SipHash 2-4, around 4%, was a too big
price to pay compared to the effectiveness of the HashDoS
attack against SipHash 1-2, and considering so far in the
Redis history, no such an incident ever happened even while
using trivially to collide hash functions.
1. Refactor memory overhead computation into a function.
2. Every 10 keys evicted, check if memory usage already reached
the target value directly, since we otherwise don't count all
the memory reclaimed by the background thread right now.
This change attempts to switch to an hash function which mitigates
the effects of the HashDoS attack (denial of service attack trying
to force data structures to worst case behavior) while at the same time
providing Redis with an hash function that does not expect the input
data to be word aligned, a condition no longer true now that sds.c
strings have a varialbe length header.
Note that it is possible sometimes that even using an hash function
for which collisions cannot be generated without knowing the seed,
special implementation details or the exposure of the seed in an
indirect way (for example the ability to add elements to a Set and
check the return in which Redis returns them with SMEMBERS) may
make the attacker's life simpler in the process of trying to guess
the correct seed, however the next step would be to switch to a
log(N) data structure when too many items in a single bucket are
detected: this seems like an overkill in the case of Redis.
SPEED REGRESION TESTS:
In order to verify that switching from MurmurHash to SipHash had
no impact on speed, a set of benchmarks involving fast insertion
of 5 million of keys were performed.
The result shows Redis with SipHash in high pipelining conditions
to be about 4% slower compared to using the previous hash function.
However this could partially be related to the fact that the current
implementation does not attempt to hash whole words at a time but
reads single bytes, in order to have an output which is endian-netural
and at the same time working on systems where unaligned memory accesses
are a problem.
Further X86 specific optimizations should be tested, the function
may easily get at the same level of MurMurHash2 if a few optimizations
are performed.
GCC will produce certain unaligned multi load-store instructions
that will be trapped by the Linux kernel since ARM v6 cannot
handle them with unaligned addresses. Better to use the slower
but safer implementation instead of generating the exception which
should be anyway very slow.
I'm not sure how much test Jemalloc gets on ARM, moreover
compiling Redis with Jemalloc support in not very powerful
devices, like most ARMs people will build Redis on, is extremely
slow. It is possible to enable Jemalloc build anyway if needed
by using "make MALLOC=jemalloc".
However note that in architectures supporting 64 bit unaligned
accesses memcpy(...,...,8) is likely translated to a simple
word memory movement anyway.
After investigating issue #3796, it was discovered that MIGRATE
could call migrateCloseSocket() after the original MIGRATE c->argv
was already rewritten as a DEL operation. As a result the host/port
passed to migrateCloseSocket() could be anything, often a NULL pointer
that gets deferenced crashing the server.
Now the socket is closed at an earlier time when there is a socket
error in a later stage where no retry will be performed, before we
rewrite the argument vector. Moreover a check was added so that later,
in the socket_err label, there is no further attempt at closing the
socket if the argument was rewritten.
This fix should resolve the bug reported in #3796.
Ziplists had a bug that was discovered while investigating a different
issue, resulting in a corrupted ziplist representation, and a likely
segmentation foult and/or data corruption of the last element of the
ziplist, once the ziplist is accessed again.
The bug happens when a specific set of insertions / deletions is
performed so that an entry is encoded to have a "prevlen" field (the
length of the previous entry) of 5 bytes but with a count that could be
encoded in a "prevlen" field of a since byte. This could happen when the
"cascading update" process called by ziplistInsert()/ziplistDelete() in
certain contitious forces the prevlen to be bigger than necessary in
order to avoid too much data moving around.
Once such an entry is generated, inserting a very small entry
immediately before it will result in a resizing of the ziplist for a
count smaller than the current ziplist length (which is a violation,
inserting code expects the ziplist to get bigger actually). So an FF
byte is inserted in a misplaced position. Moreover a realloc() is
performed with a count smaller than the ziplist current length so the
final bytes could be trashed as well.
SECURITY IMPLICATIONS:
Currently it looks like an attacker can only crash a Redis server by
providing specifically choosen commands. However a FF byte is written
and there are other memory operations that depend on a wrong count, so
even if it is not immediately apparent how to mount an attack in order
to execute code remotely, it is not impossible at all that this could be
done. Attacks always get better... and we did not spent enough time in
order to think how to exploit this issue, but security researchers
or malicious attackers could.
This header file is for libs, like ziplist.c, that we want to leave
almost separted from the core. The panic() calls will be easy to delete
in order to use such files outside, but the debugging info we gain are
very valuable compared to simple assertions where it is not possible to
print debugging info.
This is of great interest because allows us to print debugging
informations that could be of useful when debugging, like in the
following example:
serverPanic("Unexpected encoding for object %d, %d",
obj->type, obj->encoding);
Don't go over 80 cols. Start with captial letter, capital letter afer
point, end comment with a point and so forth. No actual code behavior
touched at all.
There were two cases outlined in issue #3512 and PR #3551 where
the Geo API returned unexpected results: empty strings where NULL
replies were expected, or a single null reply where an array was
expected. This violates the Redis principle that Redis replies for
existing keys or elements should be indistinguishable.
This is technically an API breakage so will be merged only into 4.0 and
specified in the changelog in the list of breaking compatibilities, even
if it is not very likely that actual code will be affected, hopefully,
since with the past behavior basically there was to acconut for *both*
the possibilities, and the new behavior is always one of the two, but
in a consistent way.
You can still force the logo in the normal logs.
For motivations, check issue #3112. For me the reason is that actually
the logo is nice to have in interactive sessions, but inside the logs
kinda loses its usefulness, but for the ability of users to recognize
restarts easily: for this reason the new startup sequence shows a one
liner ASCII "wave" so that there is still a bit of visual clue.
Startup logging was modified in order to log events in more obvious
ways, and to log more events. Also certain important informations are
now more easy to parse/grep since they are printed in field=value style.
The option --always-show-logo in redis.conf was added, defaulting to no.
This commit also contains other changes in order to conform the code to
the Redis core style, specifically 80 chars max per line, smart
conditionals in the same line:
if (that) do_this();
The new algorithm provides the same speed with a smaller error for
cardinalities in the range 0-100k. Before switching, the new and old
algorithm behavior was studied in details in the context of
issue #3677. You can find a few graphs and motivations there.
Otherwise for small cardinalities the algorithm will output something
like, for example, 4.99 for a candinality of 5, that will be converted
to 4 producing a huge error.
The commit improves ziplistRepr() and adds a new debugging subcommand so
that we can trigger the dump directly from the Redis API.
This command capability was used while investigating issue #3684.
After the fix for #3673 the ttl var is always initialized inside the
loop itself, so the early initialization is not needed.
Variables declaration also moved to a more local scope.
BACKGROUND AND USE CASEj
Redis slaves are normally write only, however the supprot a "writable"
mode which is very handy when scaling reads on slaves, that actually
need write operations in order to access data. For instance imagine
having slaves replicating certain Sets keys from the master. When
accessing the data on the slave, we want to peform intersections between
such Sets values. However we don't want to intersect each time: to cache
the intersection for some time often is a good idea.
To do so, it is possible to setup a slave as a writable slave, and
perform the intersection on the slave side, perhaps setting a TTL on the
resulting key so that it will expire after some time.
THE BUG
Problem: in order to have a consistent replication, expiring of keys in
Redis replication is up to the master, that synthesize DEL operations to
send in the replication stream. However slaves logically expire keys
by hiding them from read attempts from clients so that if the master did
not promptly sent a DEL, the client still see logically expired keys
as non existing.
Because slaves don't actively expire keys by actually evicting them but
just masking from the POV of read operations, if a key is created in a
writable slave, and an expire is set, the key will be leaked forever:
1. No DEL will be received from the master, which does not know about
such a key at all.
2. No eviction will be performed by the slave, since it needs to disable
eviction because it's up to masters, otherwise consistency of data is
lost.
THE FIX
In order to fix the problem, the slave should be able to tag keys that
were created in the slave side and have an expire set in some way.
My solution involved using an unique additional dictionary created by
the writable slave only if needed. The dictionary is obviously keyed by
the key name that we need to track: all the keys that are set with an
expire directly by a client writing to the slave are tracked.
The value in the dictionary is a bitmap of all the DBs where such a key
name need to be tracked, so that we can use a single dictionary to track
keys in all the DBs used by the slave (actually this limits the solution
to the first 64 DBs, but the default with Redis is to use 16 DBs).
This solution allows to pay both a small complexity and CPU penalty,
which is zero when the feature is not used, actually. The slave-side
eviction is encapsulated in code which is not coupled with the rest of
the Redis core, if not for the hook to track the keys.
TODO
I'm doing the first smoke tests to see if the feature works as expected:
so far so good. Unit tests should be added before merging into the
4.0 branch.
Before, if a previous key had a TTL set but the current one didn't, the
TTL was reused and thus resulted in wrong expirations set.
This behaviour was experienced, when `MigrateDefaultPipeline` in
redis-trib was set to >1
Fixes#3655
A bug was reported in the context in issue #3631. The root cause of the
bug was that certain neighbor boxes were zeroed after the "inside the
bounding box or not" check, simply because the bounding box computation
function was wrong.
A few debugging infos where enhanced and moved in other parts of the
code. A check to avoid steps=0 was added, but is unrelated to this
issue and I did not verified it was an actual bug in practice.