BACKGROUND AND USE CASEj
Redis slaves are normally write only, however the supprot a "writable"
mode which is very handy when scaling reads on slaves, that actually
need write operations in order to access data. For instance imagine
having slaves replicating certain Sets keys from the master. When
accessing the data on the slave, we want to peform intersections between
such Sets values. However we don't want to intersect each time: to cache
the intersection for some time often is a good idea.
To do so, it is possible to setup a slave as a writable slave, and
perform the intersection on the slave side, perhaps setting a TTL on the
resulting key so that it will expire after some time.
THE BUG
Problem: in order to have a consistent replication, expiring of keys in
Redis replication is up to the master, that synthesize DEL operations to
send in the replication stream. However slaves logically expire keys
by hiding them from read attempts from clients so that if the master did
not promptly sent a DEL, the client still see logically expired keys
as non existing.
Because slaves don't actively expire keys by actually evicting them but
just masking from the POV of read operations, if a key is created in a
writable slave, and an expire is set, the key will be leaked forever:
1. No DEL will be received from the master, which does not know about
such a key at all.
2. No eviction will be performed by the slave, since it needs to disable
eviction because it's up to masters, otherwise consistency of data is
lost.
THE FIX
In order to fix the problem, the slave should be able to tag keys that
were created in the slave side and have an expire set in some way.
My solution involved using an unique additional dictionary created by
the writable slave only if needed. The dictionary is obviously keyed by
the key name that we need to track: all the keys that are set with an
expire directly by a client writing to the slave are tracked.
The value in the dictionary is a bitmap of all the DBs where such a key
name need to be tracked, so that we can use a single dictionary to track
keys in all the DBs used by the slave (actually this limits the solution
to the first 64 DBs, but the default with Redis is to use 16 DBs).
This solution allows to pay both a small complexity and CPU penalty,
which is zero when the feature is not used, actually. The slave-side
eviction is encapsulated in code which is not coupled with the rest of
the Redis core, if not for the hook to track the keys.
TODO
I'm doing the first smoke tests to see if the feature works as expected:
so far so good. Unit tests should be added before merging into the
4.0 branch.
The gist of the changes is that now, partial resynchronizations between
slaves and masters (without the need of a full resync with RDB transfer
and so forth), work in a number of cases when it was impossible
in the past. For instance:
1. When a slave is promoted to mastrer, the slaves of the old master can
partially resynchronize with the new master.
2. Chained slalves (slaves of slaves) can be moved to replicate to other
slaves or the master itsef, without requiring a full resync.
3. The master itself, after being turned into a slave, is able to
partially resynchronize with the new master, when it joins replication
again.
In order to obtain this, the following main changes were operated:
* Slaves also take a replication backlog, not just masters.
* Same stream replication for all the slaves and sub slaves. The
replication stream is identical from the top level master to its slaves
and is also the same from the slaves to their sub-slaves and so forth.
This means that if a slave is later promoted to master, it has the
same replication backlong, and can partially resynchronize with its
slaves (that were previously slaves of the old master).
* A given replication history is no longer identified by the `runid` of
a Redis node. There is instead a `replication ID` which changes every
time the instance has a new history no longer coherent with the past
one. So, for example, slaves publish the same replication history of
their master, however when they are turned into masters, they publish
a new replication ID, but still remember the old ID, so that they are
able to partially resynchronize with slaves of the old master (up to a
given offset).
* The replication protocol was slightly modified so that a new extended
+CONTINUE reply from the master is able to inform the slave of a
replication ID change.
* REPLCONF CAPA is used in order to notify masters that a slave is able
to understand the new +CONTINUE reply.
* The RDB file was extended with an auxiliary field that is able to
select a given DB after loading in the slave, so that the slave can
continue receiving the replication stream from the point it was
disconnected without requiring the master to insert "SELECT" statements.
This is useful in order to guarantee the "same stream" property, because
the slave must be able to accumulate an identical backlog.
* Slave pings to sub-slaves are now sent in a special form, when the
top-level master is disconnected, in order to don't interfer with the
replication stream. We just use out of band "\n" bytes as in other parts
of the Redis protocol.
An old design document is available here:
https://gist.github.com/antirez/ae068f95c0d084891305
However the implementation is not identical to the description because
during the work to implement it, different changes were needed in order
to make things working well.
This new command swaps two Redis databases, so that immediately all the
clients connected to a given DB will see the data of the other DB, and
the other way around. Example:
SWAPDB 0 1
This will swap DB 0 with DB 1. All the clients connected with DB 0 will
immediately see the new data, exactly like all the clients connected
with DB 1 will see the data that was formerly of DB 0.
MOTIVATION AND HISTORY
---
The command was recently demanded by Pedro Melo, but was suggested in
the past multiple times, and always refused by me.
The reason why it was asked: Imagine you have clients operating in DB 0.
At the same time, you create a new version of the dataset in DB 1.
When the new version of the dataset is available, you immediately want
to swap the two views, so that the clients will transparently use the
new version of the data. At the same time you'll likely destroy the
DB 1 dataset (that contains the old data) and start to build a new
version, to repeat the process.
This is an interesting pattern, but the reason why I always opposed to
implement this, was that FLUSHDB was a blocking command in Redis before
Redis 4.0 improvements. Now we have FLUSHDB ASYNC that releases the
old data in O(1) from the point of view of the client, to reclaim memory
incrementally in a different thread.
At this point, the pattern can really be supported without latency
spikes, so I'm providing this implementation for the users to comment.
In case a very compelling argument will be made against this new command
it may be removed.
BEHAVIOR WITH BLOCKING OPERATIONS
---
If a client is blocking for a list in a given DB, after the swap it will
still be blocked in the same DB ID, since this is the most logical thing
to do: if I was blocked for a list push to list "foo", even after the
swap I want still a LPUSH to reach the key "foo" in the same DB in order
to unblock.
However an interesting thing happens when a client is, for instance,
blocked waiting for new elements in list "foo" of DB 0. Then the DB
0 and 1 are swapped with SWAPDB. However the DB 1 happened to have
a list called "foo" containing elements. When this happens, this
implementation can correctly unblock the client.
It is possible that there are subtle corner cases that are not covered
in the implementation, but since the command is self-contained from the
POV of the implementation and the Redis core, it cannot cause anything
bad if not used.
Tests and documentation are yet to be provided.
It is possible to get better results by using the pool like in the LRU
case. Also from tests during the morning I believe the current
implementation has issues in the frequency decay function that should
decrease the counter at periodic intervals.
MOVE was not able to move the TTL: when a key was moved into a different
database number, it became persistent like if PERSIST was used.
In some incredible way (I guess almost nobody uses Redis MOVE) this bug
remained unnoticed inside Redis internals for many years.
Finally Andy Grunwald discovered it and opened an issue.
This commit fixes the bug and adds a regression test.
Close#2765.
The new return value is the number of keys existing, among the ones
specified in the command line, counting the same key multiple times if
given multiple times (and if it exists).
See PR #2667.
Slaves key expire is orchestrated by the master. Sometimes the master
will send the synthesized DEL to expire keys on the slave with a non
trivial delay (when the key is not accessed, only the incremental expiry
algorithm will expire it in background).
During that time, a key is logically expired, but slaves still return
the key if you GET (or whatever) it. This is a bad behavior.
However we can't simply trust the slave view of the key, since we need
the master to be able to send write commands to update the slave data
set, and DELs should only happen when the key is expired in the master
in order to ensure consistency.
However 99.99% of the issues with this behavior is when a client which
is not a master sends a read only command. In this case we are safe and
can consider the key as non existing.
This commit does a few changes in order to make this sane:
1. lookupKeyRead() is modified in order to return NULL if the above
conditions are met.
2. Calls to lookupKeyRead() in commands actually writing to the data set
are repliaced with calls to lookupKeyWrite().
There are redundand checks, so for example, if in "2" something was
overlooked, we should be still safe, since anyway, when the master
writes the behavior is to don't care about what expireIfneeded()
returns.
This commit is related to #1768, #1770, #2131.
Previously, "MOVE key somestring" would move the key to
DB 0 which is just unexpected and wrong.
String as DB == error.
Test added too.
Modified by @antirez in order to use the getLongLongFromObject() API
instead of strtol().
Fixes#1428
We only want to use the last STORE key, but we have to record
we actually found a STORE key so we can increment the final return
key count.
Test added to prevent further regression.
Closes#1883, #1645, #1647
Behrad Zari discovered [1] and Josiah reported [2]: if you block
and wait for a list to exist, but the list creates from
a non-push command, the blocked client never gets notified.
This commit adds notification of blocked clients into
the DB layer and away from individual commands.
Lists can be created by [LR]PUSH, SORT..STORE, RENAME, MOVE,
and RESTORE. Previously, blocked client notifications were
only triggered by [LR]PUSH. Your client would never get
notified if a list were created by SORT..STORE or RENAME or
a RESTORE, etc.
Blocked client notification now happens in one unified place:
- dbAdd() triggers notification when adding a list to the DB
Two new tests are added that fail prior to this commit.
All test pass.
Fixes#1668
[1]: https://groups.google.com/forum/#!topic/redis-db/k4oWfMkN1NU
[2]: #1668
The previous code handling a lost slot (by another master with an higher
configuration for the slot) was defensive, considering it an error and
putting the cluster in an odd state requiring redis-cli fix.
This was changed, because actually this only happens either in a
legitimate way, with failovers, or when the admin messed with the config
in order to reconfigure the cluster. So the new code instead will try to
make sure that the keys stored match the new slots map, by removing all
the keys in the slots we lost ownership from.
The function that deletes the keys from the lost slots is called only
if the node does not lose all its slots (resulting in a reconfiguration
as a slave of the node that got ownership). This is an optimization
since the replication code will anyway flush all the instance data in
a faster way.
All the Redis functions that need to modify the string value of a key in
a destructive way (APPEND, SETBIT, SETRANGE, ...) require to make the
object unshared (if refcount > 1) and encoded in raw format (if encoding
is not already REDIS_ENCODING_RAW).
This was cut & pasted many times in multiple places of the code. This
commit puts the small logic needed into a function called
dbUnshareStringValue().
For testing purposes it is handy to have a very high resolution of the
LRU clock, so that it is possible to experiment with scripts running in
just a few seconds how the eviction algorithms works.
This commit allows Redis to use the cached LRU clock, or a value
computed on demand, depending on the resolution. So normally we have the
good performance of a precomputed value, and a clock that wraps in many
days using the normal resolution, but if needed, changing a define will
switch behavior to an high resolution LRU clock.