The gist of the changes is that now, partial resynchronizations between
slaves and masters (without the need of a full resync with RDB transfer
and so forth), work in a number of cases when it was impossible
in the past. For instance:
1. When a slave is promoted to mastrer, the slaves of the old master can
partially resynchronize with the new master.
2. Chained slalves (slaves of slaves) can be moved to replicate to other
slaves or the master itsef, without requiring a full resync.
3. The master itself, after being turned into a slave, is able to
partially resynchronize with the new master, when it joins replication
again.
In order to obtain this, the following main changes were operated:
* Slaves also take a replication backlog, not just masters.
* Same stream replication for all the slaves and sub slaves. The
replication stream is identical from the top level master to its slaves
and is also the same from the slaves to their sub-slaves and so forth.
This means that if a slave is later promoted to master, it has the
same replication backlong, and can partially resynchronize with its
slaves (that were previously slaves of the old master).
* A given replication history is no longer identified by the `runid` of
a Redis node. There is instead a `replication ID` which changes every
time the instance has a new history no longer coherent with the past
one. So, for example, slaves publish the same replication history of
their master, however when they are turned into masters, they publish
a new replication ID, but still remember the old ID, so that they are
able to partially resynchronize with slaves of the old master (up to a
given offset).
* The replication protocol was slightly modified so that a new extended
+CONTINUE reply from the master is able to inform the slave of a
replication ID change.
* REPLCONF CAPA is used in order to notify masters that a slave is able
to understand the new +CONTINUE reply.
* The RDB file was extended with an auxiliary field that is able to
select a given DB after loading in the slave, so that the slave can
continue receiving the replication stream from the point it was
disconnected without requiring the master to insert "SELECT" statements.
This is useful in order to guarantee the "same stream" property, because
the slave must be able to accumulate an identical backlog.
* Slave pings to sub-slaves are now sent in a special form, when the
top-level master is disconnected, in order to don't interfer with the
replication stream. We just use out of band "\n" bytes as in other parts
of the Redis protocol.
An old design document is available here:
https://gist.github.com/antirez/ae068f95c0d084891305
However the implementation is not identical to the description because
during the work to implement it, different changes were needed in order
to make things working well.
It's possible large objects could be larger than 'int', so let's
upgrade all size counters to ssize_t.
This also fixes rdbSaveObject serialized bytes calculation.
Since entire serializations of data structures can be large,
so we don't want to limit their calculated size to a 32 bit signed max.
This commit increases object size calculation and
cascades the change back up to serializedlength printing.
Before:
127.0.0.1:6379> debug object hihihi
... encoding:quicklist serializedlength:-2147483559 ...
After:
127.0.0.1:6379> debug object hihihi
... encoding:quicklist serializedlength:2147483737 ...
This commit introduces a new RDB data type called 'aux'. It is used in
order to insert inside an RDB file key-value pairs that may serve
different needs, without breaking backward compatibility when new
informations are embedded inside an RDB file. The contract between Redis
versions is to ignore unknown aux fields when encountered.
Aux fields can be used in order to:
1. Augment the RDB file with info like version of Redis that created the
RDB file, creation time, used memory while the RDB was created, and so
forth.
2. Add state about Redis inside the RDB file that we need to reload
later: replication offset, previos master run ID, in order to improve
failovers safety and allow partial resynchronization after a slave
restart.
3. Anything that we may want to add to RDB files without breaking the
ability of past versions of Redis to load the file.
The new opcode is an hint about the size of the dataset (keys and number
of expires) we are going to load for a given Redis database inside the
RDB file. Since hash tables are resized accordingly ASAP, useless
rehashing is avoided, speeding up load times significantly, in the order
of ~ 20% or more for larger data sets.
Related issue: #1719
Turns out it's a huge improvement during save/reload/migrate/restore
because, with compression enabled, we're compressing 4k or 8k
chunks of data consisting of multiple elements in one ziplist
instead of compressing series of smaller individual elements.
(additional commit notes by antirez@gmail.com):
The rdbIsObjectType() macro was not updated when the new RDB object type
of ziplist encoded hashes was added.
As a result RESTORE, that uses rdbLoadObjectType(), failed when a
ziplist encoded hash was loaded.
This does not affected normal RDB loading because in that case we use
the lower-level function rdbLoadType().
The commit also adds a regression test.