The old version could not handle the fact that "STREAMS" is a valid key
name for streams. Now we parse the command the same way the command
implementation does.
Related to #5028 and #4857.
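As an illustration of the stricter approach, here is a hedged C sketch
of option parsing for an XREAD-like command (simplified argv model and
hypothetical helper name, not the actual Redis key-extraction code):
options are consumed left to right exactly as the command implementation
would, so a key literally named "STREAMS" cannot be mistaken for the
option keyword.

    #include <strings.h>

    /* Hypothetical sketch, not the real key-extraction function: scan
     * options the way the command does; the keys are the first half of
     * the arguments that follow the STREAMS keyword. */
    static int find_streams_token(int argc, char **argv) {
        for (int i = 1; i < argc; i++) {
            if (!strcasecmp(argv[i], "COUNT") ||
                !strcasecmp(argv[i], "BLOCK")) {
                i++;                 /* skip the option's value */
            } else if (!strcasecmp(argv[i], "STREAMS")) {
                return i;            /* keys start at i+1 */
            } else {
                return -1;           /* syntax error */
            }
        }
        return -1;                   /* STREAMS keyword not found */
    }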
The loop allocated a buffer for the right number of key positions, then
overflowed it by going past the limit.
Related to #4857 and cause of the memory violation seen in #5028.
This fixes a crash with Redis Cluster when OBJECT is mis-used, because
getKeysUsingCommandTable() will call serverPanic() after detecting that
we are accessing an invalid argument when "OBJECT foo" is called.
This bug was introduced when OBJECT HELP was introduced, because the key
argument position is fixed at index 2 in the command table, however now
OBJECT may be called with an insufficient number of arguments to extract
the key.
The "Right Thing" would be to have a specific function to extract keys
from the OBJECT command, however this is kind of overkill, so I
preferred to make getKeysUsingCommandTable() more robust and just return
no keys when it's not possible to honor the command table: new commands
are often added, a number of them have a HELP subcommand violating the
normal form, and crashing for such a trivial reason or having many
command-specific key extraction functions is not great.
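A minimal sketch of the defensive check described above, assuming the
usual first-key/last-key/key-step command-table fields (simplified
signature, not the actual getKeysUsingCommandTable()):

    /* Sketch only: firstkey/lastkey/keystep come from the command table;
     * a negative lastkey is relative to argc. If the table points past
     * the arguments actually provided, report zero keys instead of
     * panicking. */
    static int get_keys_from_table(int firstkey, int lastkey, int keystep,
                                   int argc, int *keys, int maxkeys) {
        int numkeys = 0;
        if (firstkey == 0) return 0;         /* command takes no keys */
        int last = (lastkey < 0) ? argc + lastkey : lastkey;
        for (int i = firstkey; i <= last && numkeys < maxkeys; i += keystep) {
            if (i >= argc) return 0;         /* not enough args: no keys */
            keys[numkeys++] = i;
        }
        return numkeys;
    }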
With lists we need to signal only on key creation, but streams can
provide data to clients listening at every new item added.
To make this slightly more efficient we now track different classes of
blocked clients to avoid signaling keys when there is nobody listening.
A typical case is when the stream is used as a time series DB and
accessed only by range with XRANGE.
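A minimal sketch of that optimization, with illustrative names
(per-type counters of blocked clients; not the actual Redis data
structures):

    /* Illustrative sketch: keys are signaled as ready only when at
     * least one client is blocked on that class of keys. */
    enum { BLOCKED_LIST, BLOCKED_STREAM, BLOCKED_TYPES };

    static int blocked_clients_by_type[BLOCKED_TYPES];

    static void signal_key_as_ready(int blocked_type, const char *key,
                                    void (*mark_ready)(const char *key)) {
        if (blocked_clients_by_type[blocked_type] == 0) return;
        mark_ready(key);   /* queue the key so blocked clients are served */
    }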
First, use the access time to replace the decrement time in the LFU
fields.
LFUDecrAndReturn() should only compute the decremented counter, not
update the LFU fields; we update them in an explicit way.
The counter is halved once for every server.lfu_decay_time period
elapsed since the last access.
Every time a key is accessed, we update its LFU: the access time is
refreshed and the counter is incremented after calling
LFUDecrAndReturn().
If a key is overwritten, its LFU is also updated.
Then we can use the `OBJECT freq` command to get a key's frequency;
LFUDecrAndReturn() must be called by `OBJECT freq` as well, in case the
key has not been accessed for a long time, because we update the access
time only when the key is read or overwritten.
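A minimal C sketch of the decay rule described above (stand-alone
function with simplified parameters, not the actual Redis
LFUDecrAndReturn()):

    #include <stdint.h>

    /* Sketch only: the counter is halved once per elapsed decay period
     * and returned without touching the object; the caller decides when
     * to write the new access time and counter into the LFU fields. */
    static uint8_t lfu_decr_and_return(uint8_t counter,
                                       long long elapsed_minutes,
                                       long long lfu_decay_time) {
        if (lfu_decay_time == 0) return counter;   /* decay disabled */
        long long periods = elapsed_minutes / lfu_decay_time;
        while (periods-- > 0 && counter > 0)
            counter /= 2;                          /* halve per period */
        return counter;
    }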
This commit is a reinforcement of commit c1c99e9.
1. Replication information can be stored when the RDB file is
generated by a master using server.slaveseldb when server.repl_backlog
is not NULL, or by setting repl_stream_db to -1. That's safe, because
a NULL server.repl_backlog will trigger full synchronization, and the
master will then send a SELECT command into the replication stream.
2. Only do rdbSave* when rsiptr is not NULL;
if we do rdbSave* without rdbSaveInfo, the slave will miss repl-stream-db.
3. Save the replication information also in the case of the
SAVE command, the FLUSHALL command and DEBUG reload.
BACKGROUND AND USE CASE
Redis slaves are normally write only, however they support a "writable"
mode which is very handy when scaling reads on slaves that actually
need write operations in order to access data. For instance imagine
having slaves replicating certain Sets keys from the master. When
accessing the data on the slave, we want to perform intersections
between such Sets values. However we don't want to intersect each time:
caching the intersection for some time is often a good idea.
To do so, it is possible to setup a slave as a writable slave, and
perform the intersection on the slave side, perhaps setting a TTL on the
resulting key so that it will expire after some time.
THE BUG
Problem: in order to have consistent replication, expiring keys in
Redis replication is up to the master, which synthesizes DEL operations
to send in the replication stream. However slaves logically expire keys
by hiding them from read attempts by clients, so that if the master did
not promptly send a DEL, clients still see logically expired keys
as non existing.
Because slaves don't actively expire keys by actually evicting them but
just mask them from the point of view of read operations, if a key is
created in a writable slave, and an expire is set, the key will be
leaked forever:
1. No DEL will be received from the master, which does not know about
such a key at all.
2. No eviction will be performed by the slave, since eviction must be
disabled there: it's up to masters, otherwise consistency of data is
lost.
THE FIX
In order to fix the problem, the slave should be able to somehow tag
keys that were created on the slave side and have an expire set.
My solution involved using a unique additional dictionary created by
the writable slave only if needed. The dictionary is obviously keyed by
the key name that we need to track: all the keys that are set with an
expire directly by a client writing to the slave are tracked.
The value in the dictionary is a bitmap of all the DBs where such a key
name needs to be tracked, so that we can use a single dictionary to
track keys in all the DBs used by the slave (this actually limits the
solution to the first 64 DBs, but the default with Redis is to use 16
DBs).
This solution has a small complexity and CPU cost, which is zero when
the feature is not used. The slave-side eviction is encapsulated in code
which is not coupled with the rest of the Redis core, except for the
hook to track the keys.
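A minimal sketch of the DB bitmap idea, using hypothetical helper names
(not the actual Redis functions):

    #include <stdint.h>

    /* Sketch only: one uint64_t per tracked key name records which of
     * the first 64 DBs contain a slave-created key with an expire. */
    static void track_db(uint64_t *dbs, int dbid) {
        if (dbid < 64) *dbs |= 1ULL << dbid;
    }

    static void untrack_db(uint64_t *dbs, int dbid) {
        if (dbid < 64) *dbs &= ~(1ULL << dbid);
    }

    static int db_is_tracked(uint64_t dbs, int dbid) {
        return dbid < 64 && (dbs & (1ULL << dbid)) != 0;
    }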
TODO
I'm doing the first smoke tests to see if the feature works as expected:
so far so good. Unit tests should be added before merging into the
4.0 branch.
The gist of the changes is that now, partial resynchronizations between
slaves and masters (without the need of a full resync with RDB transfer
and so forth) work in a number of cases where it was impossible
in the past. For instance:
1. When a slave is promoted to master, the slaves of the old master can
partially resynchronize with the new master.
2. Chained slaves (slaves of slaves) can be moved to replicate to other
slaves or the master itself, without requiring a full resync.
3. The master itself, after being turned into a slave, is able to
partially resynchronize with the new master, when it joins replication
again.
In order to obtain this, the following main changes were made:
* Slaves also take a replication backlog, not just masters.
* Same stream replication for all the slaves and sub slaves. The
replication stream is identical from the top level master to its slaves
and is also the same from the slaves to their sub-slaves and so forth.
This means that if a slave is later promoted to master, it has the
same replication backlog, and can partially resynchronize with its
slaves (that were previously slaves of the old master).
* A given replication history is no longer identified by the `runid` of
a Redis node. There is instead a `replication ID` which changes every
time the instance has a new history no longer coherent with the past
one. So, for example, slaves publish the same replication history of
their master, however when they are turned into masters, they publish
a new replication ID, but still remember the old ID, so that they are
able to partially resynchronize with slaves of the old master (up to a
given offset).
* The replication protocol was slightly modified so that a new extended
+CONTINUE reply from the master is able to inform the slave of a
replication ID change.
* REPLCONF CAPA is used in order to notify masters that a slave is able
to understand the new +CONTINUE reply.
* The RDB file was extended with an auxiliary field that is able to
select a given DB after loading in the slave, so that the slave can
continue receiving the replication stream from the point it was
disconnected without requiring the master to insert "SELECT" statements.
This is useful in order to guarantee the "same stream" property, because
the slave must be able to accumulate an identical backlog.
* Slave pings to sub-slaves are now sent in a special form, when the
top-level master is disconnected, in order not to interfere with the
replication stream. We just use out of band "\n" bytes as in other parts
of the Redis protocol.
An old design document is available here:
https://gist.github.com/antirez/ae068f95c0d084891305
However the implementation is not identical to the description because
during the work to implement it, different changes were needed in order
to make things work well.
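As a rough illustration of the new acceptance rule for partial
resynchronization, here is a hedged C sketch (simplified fields and a
hypothetical function name, not the actual Redis code): a request can
be served partially if it refers to the current replication ID, or to
the previous one up to the offset where the history switched, and the
requested offset is still inside the backlog.

    #include <string.h>

    /* Sketch only: replid/replid2 are the current and previous
     * replication IDs; second_replid_offset is the offset up to which
     * the previous ID is still valid. */
    static int can_partial_resync(const char *req_replid,
                                  long long req_offset,
                                  const char *replid, const char *replid2,
                                  long long second_replid_offset,
                                  long long backlog_first,
                                  long long backlog_last) {
        int same_history =
            strcmp(req_replid, replid) == 0 ||
            (strcmp(req_replid, replid2) == 0 &&
             req_offset <= second_replid_offset);
        return same_history &&
               req_offset >= backlog_first && req_offset <= backlog_last;
    }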
This new command swaps two Redis databases, so that immediately all the
clients connected to a given DB will see the data of the other DB, and
the other way around. Example:
SWAPDB 0 1
This will swap DB 0 with DB 1. All the clients connected with DB 0 will
immediately see the new data, exactly like all the clients connected
with DB 1 will see the data that was formerly of DB 0.
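Conceptually the operation is just an exchange of the two databases'
internal pointers, which is why it is O(1) regardless of their size. A
minimal sketch with simplified stand-in types (not the real redisDb
structure):

    /* Sketch only: swapping the pointers is enough; clients keep their
     * DB index, but the content behind that index changes instantly. */
    typedef struct {
        void *keys;            /* main keyspace dictionary */
        void *expires;         /* keys with an expire set */
        void *blocking_keys;   /* keys clients are blocked on */
        void *ready_keys;      /* keys ready to unblock clients */
    } db_t;

    static void swap_dbs(db_t *a, db_t *b) {
        db_t tmp = *a;
        *a = *b;
        *b = tmp;
    }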
MOTIVATION AND HISTORY
---
The command was recently demanded by Pedro Melo, but was suggested in
the past multiple times, and always refused by me.
The reason why it was asked: Imagine you have clients operating in DB 0.
At the same time, you create a new version of the dataset in DB 1.
When the new version of the dataset is available, you immediately want
to swap the two views, so that the clients will transparently use the
new version of the data. At the same time you'll likely destroy the
DB 1 dataset (that contains the old data) and start to build a new
version, to repeat the process.
This is an interesting pattern, but the reason why I always opposed
implementing this was that FLUSHDB was a blocking command in Redis
before the Redis 4.0 improvements. Now we have FLUSHDB ASYNC that
releases the old data in O(1) from the point of view of the client, and
reclaims memory incrementally in a different thread.
At this point, the pattern can really be supported without latency
spikes, so I'm providing this implementation for the users to comment.
If a very compelling argument is made against this new command, it may
be removed.
BEHAVIOR WITH BLOCKING OPERATIONS
---
If a client is blocking for a list in a given DB, after the swap it will
still be blocked on the same DB ID, since this is the most logical thing
to do: if I was blocked waiting for a push to list "foo", even after the
swap I still want an LPUSH to reach the key "foo" in the same DB in
order to unblock.
However an interesting thing happens when a client is, for instance,
blocked waiting for new elements in list "foo" of DB 0, and then DB 0
and DB 1 are swapped with SWAPDB, and DB 1 happened to have a list
called "foo" containing elements. When this happens, this
implementation is able to correctly unblock the client.
It is possible that there are subtle corner cases that are not covered
in the implementation, but since the command is self-contained from the
POV of the implementation and the Redis core, it cannot cause anything
bad if not used.
Tests and documentation are yet to be provided.
It is possible to get better results by using the pool like in the LRU
case. Also from tests during the morning I believe the current
implementation has issues in the frequency decay function that should
decrease the counter at periodic intervals.
MOVE was not able to move the TTL: when a key was moved into a different
database number, it became persistent as if PERSIST was used.
In some incredible way (I guess almost nobody uses Redis MOVE) this bug
remained unnoticed inside Redis internals for many years.
Finally Andy Grunwald discovered it and opened an issue.
This commit fixes the bug and adds a regression test.
Closes #2765.
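A hedged sketch of the fix's shape (hypothetical callbacks standing in
for the real DB API): when the key is added to the destination DB, its
expire, if present, is copied as well instead of being dropped.

    /* Sketch only: expire_ms is -1 when the key has no TTL. */
    static void move_key(const char *key, long long expire_ms,
                         void (*dst_add)(const char *key),
                         void (*dst_set_expire)(const char *key,
                                                long long when),
                         void (*src_delete)(const char *key)) {
        dst_add(key);
        if (expire_ms != -1)
            dst_set_expire(key, expire_ms);   /* keep the TTL */
        src_delete(key);
    }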
The new return value is the number of existing keys among the ones
specified on the command line, counting the same key multiple times if
it is given multiple times (and it exists).
See PR #2667.
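A minimal sketch of the new counting semantics (simplified, with a
hypothetical key_exists() callback):

    /* Sketch only: each argument is counted independently, so
     * repeating an existing key increments the result every time. */
    static long long exists_count(int argc, char **argv,
                                  int (*key_exists)(const char *key)) {
        long long count = 0;
        for (int i = 1; i < argc; i++)
            if (key_exists(argv[i])) count++;
        return count;
    }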
Key expiration on slaves is orchestrated by the master. Sometimes the
master will send the synthesized DEL to expire keys on the slave with a
non trivial delay (when the key is not accessed, only the incremental
expiry algorithm will expire it in the background).
During that time, a key is logically expired, but slaves still return
the key if you GET (or whatever) it. This is a bad behavior.
However we can't simply trust the slave's view of the key, since we need
the master to be able to send write commands to update the slave data
set, and DELs should only happen when the key is expired in the master
in order to ensure consistency.
However 99.99% of the issues with this behavior happen when a client
which is not a master sends a read only command. In this case we are
safe and can consider the key as non existing.
This commit does a few changes in order to make this sane:
1. lookupKeyRead() is modified in order to return NULL if the above
conditions are met.
2. Calls to lookupKeyRead() in commands actually writing to the data set
are replaced with calls to lookupKeyWrite().
There are redundant checks, so for example, if in "2" something was
overlooked, we should still be safe, since anyway, when the master
writes, the behavior is to not care about what expireIfNeeded()
returns.
This commit is related to #1768, #1770, #2131.
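A minimal sketch of the rule, with simplified flags and a hypothetical
function name (not the actual lookupKeyRead()):

    /* Sketch only: on a slave, a logically expired key is hidden from
     * normal clients, while commands coming from the master keep
     * seeing it until the master's DEL arrives. */
    static void *lookup_key_read(void *val, int logically_expired,
                                 int i_am_slave, int caller_is_master) {
        if (val == NULL) return NULL;          /* key does not exist */
        if (logically_expired && i_am_slave && !caller_is_master)
            return NULL;                       /* treat as missing */
        return val;
    }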
Previously, "MOVE key somestring" would move the key to
DB 0 which is just unexpected and wrong.
String as DB == error.
Test added too.
Modified by @antirez in order to use the getLongLongFromObject() API
instead of strtol().
Fixes#1428
We only want to use the last STORE key, but we have to record that
we actually found a STORE key so we can increment the final return
key count.
Test added to prevent further regression.
Closes #1883, #1645, #1647
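A hedged sketch of the counting logic (simplified argument scan, not
the actual sortGetKeys()): only the last STORE destination is kept,
but having seen any STORE at all is what bumps the number of returned
keys from one to two.

    #include <strings.h>

    /* Sketch only: keys[0] is the key to sort; keys[1] is the last
     * STORE destination, if any. */
    static int sort_get_keys(int argc, char **argv, int *keys) {
        int found_store = 0;
        keys[0] = 1;                       /* the key being sorted */
        for (int i = 2; i < argc; i++) {
            if (!strcasecmp(argv[i], "BY") || !strcasecmp(argv[i], "GET")) {
                i += 1;                    /* skip the pattern argument */
            } else if (!strcasecmp(argv[i], "LIMIT")) {
                i += 2;                    /* skip offset and count */
            } else if (!strcasecmp(argv[i], "STORE") && i + 1 < argc) {
                found_store = 1;
                keys[1] = i + 1;           /* remember only the last one */
            }
        }
        return 1 + found_store;            /* 1, or 2 when STORE is used */
    }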
Behrad Zari discovered [1] and Josiah reported [2]: if you block
and wait for a list to exist, but the list is created by
a non-push command, the blocked client never gets notified.
This commit adds notification of blocked clients into
the DB layer and away from individual commands.
Lists can be created by [LR]PUSH, SORT..STORE, RENAME, MOVE,
and RESTORE. Previously, blocked client notifications were
only triggered by [LR]PUSH. Your client would never get
notified if a list were created by SORT..STORE or RENAME or
a RESTORE, etc.
Blocked client notification now happens in one unified place:
- dbAdd() triggers notification when adding a list to the DB
Two new tests are added that fail prior to this commit.
All tests pass.
Fixes #1668
[1]: https://groups.google.com/forum/#!topic/redis-db/k4oWfMkN1NU
[2]: #1668
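A minimal sketch of the unified notification point, with stand-in types
and a hypothetical signal callback (not the actual dbAdd() /
signalListAsReady() code):

    /* Sketch only: any code path that adds a list to the DB goes
     * through this function, so blocked clients are notified no matter
     * which command created the list. */
    enum obj_type { TYPE_STRING, TYPE_LIST, TYPE_SET, TYPE_HASH };

    typedef struct { enum obj_type type; } value_t;

    static void db_add(const char *key, value_t *val,
                       void (*signal_list_as_ready)(const char *key)) {
        /* ... insert key/val into the keyspace dictionary here ... */
        if (val->type == TYPE_LIST)
            signal_list_as_ready(key);   /* wake BLPOP/BRPOP waiters */
    }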
The previous code handling a lost slot (by another master with a higher
configuration for the slot) was defensive, considering it an error and
putting the cluster in an odd state requiring a redis-cli fix.
This was changed, because actually this only happens either in a
legitimate way, with failovers, or when the admin messed with the config
in order to reconfigure the cluster. So the new code instead will try to
make sure that the keys stored match the new slots map, by removing all
the keys in the slots we lost ownership of.
The function that deletes the keys from the lost slots is called only
if the node does not lose all its slots (resulting in a reconfiguration
as a slave of the node that got ownership). This is an optimization
since the replication code will anyway flush all the instance data in
a faster way.
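A hedged sketch of the new behavior (hypothetical helper names, not the
actual Redis Cluster code):

    /* Sketch only: delete local keys of the slots we lost, unless we
     * lost every slot, in which case the upcoming reconfiguration as a
     * slave will flush the whole dataset anyway. */
    static void on_slots_lost(const int *lost_slots, int num_lost,
                              int slots_still_owned,
                              void (*del_keys_in_slot)(int slot)) {
        if (slots_still_owned == 0) return;   /* full flush will happen */
        for (int i = 0; i < num_lost; i++)
            del_keys_in_slot(lost_slots[i]);
    }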
All the Redis functions that need to modify the string value of a key in
a destructive way (APPEND, SETBIT, SETRANGE, ...) require making the
object unshared (if refcount > 1) and encoded in raw format (if the
encoding is not already REDIS_ENCODING_RAW).
This was cut & pasted many times in multiple places of the code. This
commit puts the small logic needed into a function called
dbUnshareStringValue().
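A minimal sketch of the pattern the new helper centralizes, using
stand-in types and callbacks (not the actual Redis API):

    /* Sketch only: before an in-place modification, the object must be
     * private (refcount == 1) and raw-encoded; otherwise a raw copy is
     * created and installed in the DB in place of the original. */
    enum { ENC_RAW, ENC_INT, ENC_EMBSTR };

    typedef struct { int refcount; int encoding; char *ptr; } obj_t;

    static obj_t *unshare_string_value(obj_t *o,
                                       obj_t *(*dup_as_raw)(obj_t *o),
                                       void (*replace_in_db)(obj_t *newobj)) {
        if (o->refcount > 1 || o->encoding != ENC_RAW) {
            obj_t *copy = dup_as_raw(o);   /* private, raw-encoded copy */
            replace_in_db(copy);           /* the DB now points to the copy */
            return copy;
        }
        return o;                          /* already safe to modify */
    }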
For testing purposes it is handy to have a very high resolution of the
LRU clock, so that it is possible to experiment, with scripts running in
just a few seconds, how the eviction algorithm works.
This commit allows Redis to use the cached LRU clock, or a value
computed on demand, depending on the resolution. So normally we have the
good performance of a precomputed value, and a clock that wraps in many
days using the normal resolution, but if needed, changing a define will
switch behavior to a high resolution LRU clock.
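A hedged sketch of the selection logic (illustrative names and a
time(3)-based stand-in for the millisecond clock, not the actual Redis
macros):

    #include <stdint.h>
    #include <time.h>

    #define LRU_BITS 24
    #define LRU_CLOCK_MAX ((1 << LRU_BITS) - 1)

    /* Sketch only: compute the clock from the wall clock at the given
     * resolution (milliseconds per LRU tick). */
    static uint32_t compute_lru_clock(long long resolution_ms) {
        long long ms = (long long)time(NULL) * 1000;
        return (uint32_t)((ms / resolution_ms) & LRU_CLOCK_MAX);
    }

    /* If the server cron refreshes the cached clock at least once per
     * LRU tick, the cached value is precise enough; otherwise compute
     * it on demand for high resolution testing. */
    static uint32_t lru_clock(uint32_t cached_clock,
                              long long cron_period_ms,
                              long long resolution_ms) {
        if (cron_period_ms <= resolution_ms) return cached_clock;
        return compute_lru_clock(resolution_ms);
    }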
Previously we used zunionInterGetKeys(), however after this function was
fixed to account for the destination key (not needed when the API was
designed for "diskstore") the two sets of commands can no longer be
served by a single key-extraction function.
This API originated from the "diskstore" experiment, not for Redis
Cluster itself, so there were legacy/useless things trying to
differentiate between keys that are going to be overwritten and keys
that need to be fetched from disk (preloaded).
All of this is useless with Cluster, so it was removed, with the result
of simpler code.
The previous implementation wasn't taking into account
the storage key in position 1 being a requirement (it
was only counting the source keys in positions 3 to N).
Fixes antirez/redis#1581
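A hedged sketch of the corrected extraction for ZUNIONSTORE /
ZINTERSTORE-style commands (simplified signature, not the actual
zunionInterGetKeys()):

    /* Sketch only: the destination key at position 1 is reported along
     * with the numkeys source keys that start at position 3. */
    static int zunion_inter_get_keys(int numkeys_arg, int *keys,
                                     int maxkeys) {
        int n = 0;
        if (maxkeys < 1) return 0;
        keys[n++] = 1;                           /* destination key */
        for (int i = 0; i < numkeys_arg && n < maxkeys; i++)
            keys[n++] = 3 + i;                   /* source keys */
        return n;
    }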
Keys expiring in the middle of the execution of Lua scripts can create
inconsistencies between masters, slaves and / or AOF files. See the
following example:
if redis.call("exists",KEYS[1]) == 1
then
redis.call("incr","mycounter")
end
if redis.call("exists",KEYS[1]) == 1
then
return redis.call("incr","mycounter")
end
The script executes the same *if key exists then increment counter*
logic twice. However the two executions will work differently in the
master and the slaves, provided some unlucky timing happens.
In the master the first time the key may still exist, while the second
time the key may no longer exist. This will result in the counter being
incremented just one time. However as a side effect the master will
generate a synthetic `DEL` command in the replication channel in order
to force the slaves to expire the key (given that key expiration is
master-driven).
When the same script runs on the slave, the key will no longer be
there, so the script will not increment the counter at all.
The key idea used to implement the expire-at-first-lookup semantics was
provided by Marc Gravell.
The Redis hash table implementation has many non-blocking features like
incremental rehashing, however while deleting a large hash table there
was no way to have a callback called to do some incremental work.
This commit adds this support, as an optional callback argument to
dictEmpty() that is currently called at a fixed interval (one time every
65k deletions).
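A minimal sketch of the hook, with a flat array standing in for the
real dict internals (not the actual dictEmpty() code):

    #include <stddef.h>

    /* Sketch only: the callback fires once every 65536 deletions, so
     * the caller can perform incremental work (e.g. handle events)
     * while a huge table is being emptied. */
    static void clear_table(void **table, size_t size,
                            void (*free_entry)(void *entry),
                            void (*callback)(void *privdata),
                            void *privdata) {
        for (size_t i = 0; i < size; i++) {
            if (callback && (i & 65535) == 0) callback(privdata);
            if (table[i]) { free_entry(table[i]); table[i] = NULL; }
        }
    }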
This change makes Sentinel less fragile about a number of failure modes.
This commit also fixes a different bug as a side effect: the SLAVEOF
command was sent multiple times without incrementing the pending
commands count.
The previous implementation of SCAN parsed the cursor in the generic
function implementing SCAN, SSCAN, HSCAN and ZSCAN.
The actual higher-level command implementation only checked for empty
keys and returned ASAP in that case. The result was that inverting the
arguments of, for instance, SSCAN, and writing:

    SSCAN 0 key

instead of

    SSCAN key 0

resulted in no error, since 0 is very likely a non-existing key name.
The iterator just returned no elements at all.
In order to fix this issue the code was refactored to extract a
function that parses the cursor and returns an error if needed. Every
higher level command implementation now parses the cursor and later
checks if the key exists or not.
The previous implementation assumed that the first call always happens
with cursor set to 0, this may not be the case, and we want to return 0
anyway otherwise the (broken) client code will loop forever.
The new implementation is capable of iterating the keyspace but also
sets, hashes, and sorted sets, and can be used to implement SSCAN, ZSCAN
and HSCAN.
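A minimal sketch of the extracted cursor parsing (hypothetical helper
name, not the actual Redis function): the cursor is validated before
the key is even looked up, so "SSCAN 0 key" now fails instead of
silently iterating nothing.

    #include <errno.h>
    #include <stdlib.h>

    /* Sketch only: returns 0 and fills *cursor on success, -1 on a
     * malformed cursor so the caller can reply with an error. */
    static int parse_scan_cursor(const char *arg, unsigned long *cursor) {
        char *end;
        errno = 0;
        unsigned long v = strtoul(arg, &end, 10);
        if (errno != 0 || end == arg || *end != '\0') return -1;
        *cursor = v;
        return 0;
    }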
This commit fixes two corner cases for the TTL command.
1) When the key was already logically expired (expire time older
than current time) the command returned -1 instead of -2.
2) When the key existed and the expire was found to be exactly 0
(the key was just about to expire), the command reported -1 (that is, no
expire) instead of a TTL of zero (that is, about to expire).
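A minimal sketch of the corrected reply logic (simplified, not the
actual ttlGenericCommand()):

    /* Sketch only: expire_ms is -1 when no expire is set. A missing or
     * logically expired key reports -2; a key without an expire
     * reports -1; otherwise the remaining time, which may be 0. */
    static long long ttl_reply(int key_exists, long long expire_ms,
                               long long now_ms) {
        if (!key_exists || (expire_ms != -1 && expire_ms < now_ms))
            return -2;
        if (expire_ms == -1) return -1;
        return expire_ms - now_ms;   /* 0 means "about to expire" */
    }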
When keyspace events are enabled, the overhead is not severe but
noticeable, so this commit introduces the ability to select subclasses
of events in order to avoid generating events the user is not
interested in.
The events can be selected using redis.conf or CONFIG SET / GET.
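A minimal sketch of class-based filtering (illustrative flag values,
not the actual Redis constants): an event is emitted only if its class
is part of the configured mask, which is what the redis.conf / CONFIG
SET setting controls.

    /* Sketch only: each event class maps to one bit of the config. */
    #define NOTIFY_GENERIC (1<<0)
    #define NOTIFY_STRING  (1<<1)
    #define NOTIFY_LIST    (1<<2)
    #define NOTIFY_EXPIRED (1<<3)

    static int should_notify(int configured_classes, int event_class) {
        return (configured_classes & event_class) != 0;
    }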
This commit fixes issue #875 that was caused by the following events:
1) There is an active child doing BGSAVE.
2) FLUSHALL is called (or any other condition that makes Redis kill
the saving child process).
3) An error is sensed by Redis as the child exited with an error (killed
by a signal), so Redis stops accepting write commands until a BGSAVE is
executed with success.
Whitelisting SIGUSR1 and making sure Redis always uses this signal in
order to kill its own children fixes the issue.
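A hedged sketch of the check (standard wait-status macros; the helper
name is illustrative, not the actual Redis code):

    #include <signal.h>
    #include <sys/wait.h>

    /* Sketch only: a child we killed ourselves with SIGUSR1 is not
     * treated as a save error; any other abnormal termination is. */
    static int saving_child_failed(int statloc) {
        if (WIFSIGNALED(statloc) && WTERMSIG(statloc) == SIGUSR1)
            return 0;   /* killed on purpose: not an error */
        return !WIFEXITED(statloc) || WEXITSTATUS(statloc) != 0;
    }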
The previous behavior was to return -1 if:
1) Existing key but without an expire set.
2) Non existing key.
Now the second case is handled in a different way, and TTL will return
-2 if the key does not exist at all.
PTTL follows the same behavior as well.
Unfortunately we still had the lame atoi() without any error checking in
place, so "SELECT foo" would work as "SELECT 0". This was not a huge
problem per se, but some people expected that DBs could be strings and
not just numbers, and without an error you get the feeling that they
can be, even though that's not the actual behavior.
Now getLongFromObjectOrReply() is used, like almost everywhere else
across the code, generating an error if the argument is not an integer
or overflows the long type.
Thanks to @mipearson for reporting that on Twitter.
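A minimal sketch of strict index parsing, as a stand-alone stand-in for
what getLongFromObjectOrReply() provides (hypothetical helper name):

    #include <errno.h>
    #include <stdlib.h>

    /* Sketch only: returns 0 and fills *out on success, -1 on error,
     * so "SELECT foo" is rejected instead of selecting DB 0. */
    static int parse_db_index(const char *arg, long *out) {
        char *end;
        errno = 0;
        long v = strtol(arg, &end, 10);
        if (errno == ERANGE || end == arg || *end != '\0') return -1;
        *out = v;
        return 0;
    }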
Every matched key in a KEYS call is checked for expiration. When the key
is set to expire, the call to `getExpire` will assert that the key also
exists in the main dictionary. This in turn causes a rehashing step to
be executed. Rehashing a dictionary when there is an iterator active may
result in the iterator emitting duplicate entries, or not emitting some
entries at all. By using a safe iterator, the rehash step is omitted.