Replication backlog and replicas use one global shared replication buffer (#9166)
## Background
On a Redis master, each replica keeps its own copy of the replication buffer. That wastes a lot of memory,
and the waste grows with the number of replicas; allocating and freeing memory for every reply list is also costly.
If we set client-output-buffer-limit small and write traffic is heavy, the master may disconnect
replicas and fail to finish synchronization with them. If we set client-output-buffer-limit big,
the master may run out of memory (OOM) when many replicas each hold a large buffer.
Because the replication buffers of different replica clients hold the same content, one simple idea is to have
all replicas share a single replication buffer, which effectively saves memory.
Since the replication backlog content is also the same as the replicas' output buffers, we
can now drop the separate replication backlog memory and implement the backlog mechanism
on top of the global shared replication buffer.
## Implementation
I create one global "replication buffer" that holds the content of the replication stream.
Its structure is similar to the reply list that exists in every client,
but each list node is a `replBufBlock`, which has `id`, `repl_offset`, and `refcount` fields.
```c
/* Replication buffer blocks is the list of replBufBlock.
 *
 * +--------------+       +--------------+       +--------------+
 * | refcount = 1 |  ...  | refcount = 0 |  ...  | refcount = 2 |
 * +--------------+       +--------------+       +--------------+
 *      |                                            /       \
 *      |                                           /         \
 *      |                                          /           \
 *  Repl Backlog                                Replica_A   Replica_B
 *
 * Each replica or replication backlog increments only the refcount of the
 * 'ref_repl_buf_node' which it points to. So when replica walks to the next
 * node, it should first increase the next node's refcount, and when we trim
 * the replication buffer nodes, we remove node always from the head node which
 * refcount is 0. If the refcount of the head node is not 0, we must stop
 * trimming and never iterate the next node. */

/* Similar with 'clientReplyBlock', it is used for shared buffers between
 * all replica clients and replication backlog. */
typedef struct replBufBlock {
    int refcount;           /* Number of replicas or repl backlog using. */
    long long id;           /* The unique incremental number. */
    long long repl_offset;  /* Start replication offset of the block. */
    size_t size, used;
    char buf[];
} replBufBlock;
```
So now, to feed the replication stream into the replication backlog and all replicas, we only need
to feed it into the replication buffer via `feedReplicationBuffer`. In this function, we set some fields of the
replication backlog and of the replicas to reference the global replication buffer blocks. We also
check each replica's output buffer against `client-output-buffer-limit` and free the replica if it exceeds the limit,
and we trim the replication backlog if it exceeds `repl-backlog-size`.
When sending replies to a replica, we iterate over the replication buffer blocks and send their
content. Once a block has been sent completely to a replica, we decrease the refcount of its current node and
increase the refcount of the next node, and then free the blocks whose refcount is 0 from the
head of the replication buffer blocks.
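The reference handoff and head trimming described above can be sketched as follows. This is a minimal model under stated assumptions, not the code from this commit: `Block`, `BlockList`, `appendBlock`, `advanceReader`, and `trimFromHead` are hypothetical names, and the block type omits the data fields of `replBufBlock`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified block: the real replBufBlock also carries
 * id, repl_offset, size, used and the data buffer. */
typedef struct Block {
    int refcount;          /* Readers (replicas or backlog) pointing here. */
    struct Block *next;
} Block;

/* Hypothetical list of blocks; 'head' is the oldest block. */
typedef struct {
    Block *head, *tail;
    size_t len;
} BlockList;

/* Append a fresh, unreferenced block to the tail of the list. */
Block *appendBlock(BlockList *l) {
    Block *b = calloc(1, sizeof(*b));
    if (b == NULL) return NULL;
    if (l->tail) l->tail->next = b; else l->head = b;
    l->tail = b;
    l->len++;
    return b;
}

/* Move a reader from its current block to the next one: take the
 * reference on the next block first, then drop the old reference. */
Block *advanceReader(Block *cur) {
    assert(cur != NULL);
    Block *next = cur->next;
    if (next == NULL) return cur;   /* Nothing newer to read yet. */
    next->refcount++;
    cur->refcount--;
    return next;
}

/* Free blocks from the head while they are unreferenced. Stop at the
 * first referenced block and never look past it. Returns blocks freed. */
size_t trimFromHead(BlockList *l) {
    size_t freed = 0;
    while (l->head && l->head->refcount == 0) {
        Block *b = l->head;
        l->head = b->next;
        if (l->head == NULL) l->tail = NULL;
        free(b);
        l->len--;
        freed++;
    }
    return freed;
}
```

Because trimming only ever inspects the head, a single slow reader parked on an old block keeps every newer block alive, which is the behavior the diagram in the code comment describes.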
Since the replication backlog is now managed as a linked list, iterating over
all list nodes to find the replication buffer node for a given offset may take a long time. So we create a rax tree to
index some of the nodes; to keep the rax tree from occupying too much memory, I record
one node out of every 64 in the index.
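A sparse index of this kind can be illustrated with the sketch below. It is an assumption-laden stand-in: a sorted array replaces the rax tree, and `IdxBlock`, `SparseIndex`, `shouldIndex`, and `findBlock` are hypothetical names rather than functions from the commit. Only the "one entry per 64 blocks" ratio comes from the text.

```c
#include <stddef.h>

#define INDEX_EVERY 64   /* From the text: one index entry per 64 blocks. */

/* Hypothetical block with only the fields the lookup needs. */
typedef struct IdxBlock {
    long long id;            /* Incrementing block id. */
    long long repl_offset;   /* Start replication offset of the block. */
    size_t used;             /* Bytes stored in the block. */
    struct IdxBlock *next;
} IdxBlock;

/* Stand-in for the rax: indexed blocks sorted by repl_offset. */
typedef struct {
    IdxBlock **entries;
    size_t count;
} SparseIndex;

/* Assumed policy: index a block when its id is a multiple of 64. */
int shouldIndex(long long id) { return id % INDEX_EVERY == 0; }

/* Find the block that contains 'offset'. Binary-search the index for the
 * last indexed block starting at or before 'offset', then walk the list
 * from there, so at most about 64 blocks are visited after the search. */
IdxBlock *findBlock(const SparseIndex *idx, IdxBlock *head, long long offset) {
    IdxBlock *start = head;
    size_t lo = 0, hi = idx->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx->entries[mid]->repl_offset <= offset) {
            start = idx->entries[mid];
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    for (IdxBlock *b = start; b != NULL; b = b->next) {
        if (offset >= b->repl_offset &&
            offset < b->repl_offset + (long long)b->used) return b;
    }
    return NULL;   /* Offset is not covered by any block. */
}
```

The trade-off is the usual one for sparse indexes: the index holds roughly 1/64 of the nodes, and each lookup pays a short linear walk after the indexed jump.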
Currently, to make partial resynchronization succeed as often as possible, we always keep the replication
backlog as the last reference to the replication buffer blocks. The backlog size may exceed our setting
if slow replicas reference a large number of replication buffer blocks, but this doesn't increase
memory usage since they share the replication buffer. To avoid freezing the server while freeing unreferenced
replication buffer blocks when the backlog must be trimmed for exceeding the configured size,
we trim the backlog incrementally (currently 64 blocks per call), and trim faster in
`beforeSleep` (640 blocks).
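The idea of a bounded amount of trimming work per call can be sketched as below. This is a toy model with equal-sized blocks; `BacklogModel` and `trimIncrementally` are hypothetical names, and only the 64 and 640 budgets come from the text.

```c
#include <stddef.h>

#define TRIM_PER_CALL     64    /* Per-call budget quoted in the text. */
#define TRIM_BEFORE_SLEEP 640   /* Larger budget quoted for beforeSleep. */

/* Hypothetical model: only block counts and sizes are tracked. */
typedef struct {
    size_t block_size;   /* Bytes per block (all equal in this model). */
    size_t blocks;       /* Blocks currently held by the backlog. */
    size_t limit;        /* Configured backlog size in bytes. */
} BacklogModel;

/* Release at most 'max_blocks' blocks, and only while dropping one more
 * block still leaves at least 'limit' bytes of history. Returns the number
 * of blocks released, so one call never does unbounded work. */
size_t trimIncrementally(BacklogModel *b, size_t max_blocks) {
    size_t freed = 0;
    while (freed < max_blocks && b->blocks > 1 &&
           (b->blocks - 1) * b->block_size >= b->limit) {
        b->blocks--;
        freed++;
    }
    return freed;
}
```

With 1000 blocks of 1 KiB and a 16 KiB limit, a call with `TRIM_PER_CALL` releases 64 blocks and returns, leaving the rest for later calls instead of freeing 984 blocks at once.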
### Other changes
- `mem_total_replication_buffers`: a new field in the INFO command that reports the total
memory used by replication buffers.
- `mem_clients_slaves`: this may now be 0 even when a replica is slow and its output buffer
is not empty. Since the replication backlog and replicas share one global replication
buffer, only when replication buffer memory exceeds the configured repl backlog size do we count
the excess as replicas' memory. Otherwise, we attribute replication buffer memory to
the repl backlog.
- Key eviction
Since all replicas and the replication backlog share the global replication buffer, we treat only the
part exceeding the backlog size as extra, separate consumption by replicas.
Because we trim the backlog incrementally in the background, the backlog size may exceed our
setting after slow replicas that reference a large number of replication buffer blocks disconnect.
To avoid a massive eviction loop, we don't count the not-yet-freed replication backlog in
used memory even if there are no replicas, i.e. we also regard this memory as replicas' memory.
- `client-output-buffer-limit` check for replica clients
It doesn't make sense to set the replica clients' output buffer limit lower than the repl-backlog-size
config (partial sync would succeed and then the replica would get disconnected). Such a configuration is
ignored (the value of repl-backlog-size is used instead). This has no memory consumption
implications since the replica client shares the backlog buffer's memory.
- Drop replication backlog after loading data if needed
We always create the replication backlog if the server is a master. We need it because we put DELs in
it when loading expired keys from the RDB. But if the RDB doesn't have replication info, or there is no RDB,
partial resynchronization is not possible, so we drop the backlog to avoid the extra memory.
- Multi IO threads
Since all replicas and the replication backlog use the global replication buffer, if I/O threads are enabled,
the main thread must handle sending the output buffer to all replicas to keep data access thread-safe.
Previously, other I/O threads could also send the replicas' output buffers.
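Two of the rules in the list above reduce to small calculations, sketched here under stated assumptions. `replicasShareOfBuffer` and `effectiveReplicaHardLimit` are hypothetical helper names, not functions from the commit, and treating a limit of 0 as "no limit" is an assumption based on how the tests set `"replica 0 0 0"`.

```c
#include <stddef.h>

/* Memory attributed to replicas (as in mem_clients_slaves and the eviction
 * accounting): only the part of the shared buffer above the configured
 * backlog size. Anything at or below it is charged to the backlog. */
size_t replicasShareOfBuffer(size_t repl_buffer_mem, size_t repl_backlog_size) {
    return repl_buffer_mem > repl_backlog_size
               ? repl_buffer_mem - repl_backlog_size
               : 0;
}

/* Effective output buffer hard limit for a replica client: a limit lower
 * than the backlog size is ignored and the backlog size is used instead.
 * Assumption: 0 means "no limit" and is left untouched. */
size_t effectiveReplicaHardLimit(size_t configured_limit, size_t repl_backlog_size) {
    if (configured_limit != 0 && configured_limit < repl_backlog_size)
        return repl_backlog_size;
    return configured_limit;
}
```

For example, with a 100 MB backlog and a 512 KB replica limit (the values used in the last test of this file), the effective limit in this sketch becomes 100 MB, which is why the replica is not disconnected after a successful partial resync.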
## Other optimizations
This solution resolves some other problems:
- When replicas are disconnected from the master for exceeding the output buffer limit, releasing their output
buffers could freeze the server if `client-output-buffer-limit` for replicas was set big; now
it doesn't cause freezing.
- This implementation may mitigate the time cost of copying the reply list (which also freezes the server) when one replica
has a huge reply buffer and another replica copies that buffer for full synchronization. Now we just copy
reference info, which is very light.
- If the replication backlog size was set big, copying the replication backlog into
a replica's output buffer could also take a long time. This commit eliminates that problem.
- Resizing the replication backlog doesn't empty the current replication backlog content.
2021-10-25 14:24:31 +08:00
# This test group aims to test that all replicas share one global replication buffer,
# two replicas don't make replication buffer size double, and when there is no replica,
# replica buffer will shrink.
start_server {tags {"repl external:skip"}} {
start_server {} {
start_server {} {
start_server {} {
    set replica1 [srv -3 client]
    set replica2 [srv -2 client]
    set replica3 [srv -1 client]

    set master [srv 0 client]
    set master_host [srv 0 host]
    set master_port [srv 0 port]

    $master config set save ""
    $master config set repl-backlog-size 16384
    $master config set client-output-buffer-limit "replica 0 0 0"

    # Make sure replica3 is synchronized with master
    $replica3 replicaof $master_host $master_port
    wait_for_sync $replica3

    # Generating RDB will take some 100 seconds
    $master config set rdb-key-save-delay 1000000
    populate 100 "" 16

    # Make sure replica1 and replica2 are waiting bgsave
    $replica1 replicaof $master_host $master_port
    $replica2 replicaof $master_host $master_port
    wait_for_condition 50 100 {
        ([s rdb_bgsave_in_progress] == 1) &&
        [lindex [$replica1 role] 3] eq {sync} &&
        [lindex [$replica2 role] 3] eq {sync}
    } else {
        fail "fail to sync with replicas"
    }

    test {All replicas share one global replication buffer} {
        set before_used [s used_memory]
        populate 1024 "" 1024 ; # Write extra 1M data
        # New data uses 1M memory, but all replicas use only one
        # replication buffer, so all replicas output memory is not
        # more than double of replication buffer.
        set repl_buf_mem [s mem_total_replication_buffers]
        set extra_mem [expr {[s used_memory]-$before_used-1024*1024}]
        assert {$extra_mem < 2*$repl_buf_mem}

        # Kill replica1, replication_buffer will not become smaller
        catch {$replica1 shutdown nosave}
        wait_for_condition 50 100 {
            [s connected_slaves] eq {2}
        } else {
            fail "replica doesn't disconnect with master"
        }
        assert_equal $repl_buf_mem [s mem_total_replication_buffers]
    }

    test {Replication buffer will become smaller when no replica uses} {
        # Make sure replica3 catch up with the master
        wait_for_ofs_sync $master $replica3

        set repl_buf_mem [s mem_total_replication_buffers]
        # Kill replica2, replication_buffer will become smaller
        catch {$replica2 shutdown nosave}
        wait_for_condition 50 100 {
            [s connected_slaves] eq {1}
        } else {
            fail "replica2 doesn't disconnect with master"
        }
        assert {[expr $repl_buf_mem - 1024*1024] > [s mem_total_replication_buffers]}
    }
}
}
}
}

# This test group aims to test replication backlog size can outgrow the backlog
# limit config if there is a slow replica which keep massive replication buffers,
# and replicas could use this replication buffer (beyond backlog config) for
# partial re-synchronization. Of course, replication backlog memory also can
# become smaller when master disconnects with slow replicas since output buffer
# limit is reached.
start_server {tags {"repl external:skip"}} {
start_server {} {
start_server {} {
    set replica1 [srv -2 client]
    set replica1_pid [s -2 process_id]
    set replica2 [srv -1 client]
    set replica2_pid [s -1 process_id]

    set master [srv 0 client]
    set master_host [srv 0 host]
    set master_port [srv 0 port]

    $master config set save ""
    $master config set repl-backlog-size 16384
    $master config set client-output-buffer-limit "replica 0 0 0"
2021-10-29 13:04:12 +08:00
    # Executing 'debug digest' on master which has many keys costs much time
    # (especially in valgrind), this causes that replica1 and replica2 disconnect
    # with master.
    $master config set repl-timeout 1000
    $replica1 config set repl-timeout 1000
    $replica2 config set repl-timeout 1000

    $replica1 replicaof $master_host $master_port
    wait_for_sync $replica1

    test {Replication backlog size can outgrow the backlog limit config} {
        # Generating RDB will take 1000 seconds
        $master config set rdb-key-save-delay 1000000
        populate 1000 master 10000
        $replica2 replicaof $master_host $master_port
        # Make sure replica2 is waiting bgsave
        wait_for_condition 5000 100 {
            ([s rdb_bgsave_in_progress] == 1) &&
            [lindex [$replica2 role] 3] eq {sync}
        } else {
            fail "fail to sync with replicas"
        }
        # Replication actual backlog grow more than backlog setting since
        # the slow replica2 kept replication buffer.
        populate 10000 master 10000
        assert {[s repl_backlog_histlen] > [expr 10000*10000]}
    }

    # Wait replica1 catch up with the master
    wait_for_condition 1000 100 {
        [s -2 master_repl_offset] eq [s master_repl_offset]
    } else {
        fail "Replica offset didn't catch up with the master after too long time"
    }

    test {Replica could use replication buffer (beyond backlog config) for partial resynchronization} {
        # replica1 disconnects with master
        $replica1 replicaof [srv -1 host] [srv -1 port]
        # Write a mass of data that exceeds repl-backlog-size
        populate 10000 master 10000
        # replica1 reconnects with master
        $replica1 replicaof $master_host $master_port
        wait_for_condition 1000 100 {
            [s -2 master_repl_offset] eq [s master_repl_offset]
        } else {
            fail "Replica offset didn't catch up with the master after too long time"
        }

        # replica2 still waits for bgsave ending
        assert {[s rdb_bgsave_in_progress] eq {1} && [lindex [$replica2 role] 3] eq {sync}}
        # master accepted replica1 partial resync
        assert_equal [s sync_partial_ok] {1}
        assert_equal [$master debug digest] [$replica1 debug digest]
    }

    test {Replication backlog memory will become smaller if disconnecting with replica} {
        assert {[s repl_backlog_histlen] > [expr 2*10000*10000]}
        assert_equal [s connected_slaves] {2}

        exec kill -SIGSTOP $replica2_pid
        r config set client-output-buffer-limit "replica 128k 0 0"
        # trigger output buffer limit check
        r set key [string repeat A [expr 64*1024]]
        # master will close replica2's connection since replica2's output
        # buffer limit is reached, so there only is replica1.
        wait_for_condition 100 100 {
            [s connected_slaves] eq {1}
        } else {
            fail "master didn't disconnect with replica2"
        }

        # Since we trim replication backlog incrementally, replication backlog
        # memory may take time to be reclaimed.
        wait_for_condition 1000 100 {
            [s repl_backlog_histlen] < [expr 10000*10000]
        } else {
            fail "Replication backlog memory is not smaller"
        }
        exec kill -SIGCONT $replica2_pid
    }
}
}
}

test {Partial resynchronization is successful even client-output-buffer-limit is less than repl-backlog-size} {
    start_server {tags {"repl external:skip"}} {
        start_server {} {
            r config set save ""
            r config set repl-backlog-size 100mb
            r config set client-output-buffer-limit "replica 512k 0 0"

            set replica [srv -1 client]
            $replica replicaof [srv 0 host] [srv 0 port]
            wait_for_sync $replica

            set big_str [string repeat A [expr 10*1024*1024]] ;# 10mb big string
            r multi
            r client kill type replica
            r set key $big_str
            r set key $big_str
            r debug sleep 2 ;# wait for replica reconnecting
            r exec
            # When replica reconnects with master, master accepts partial resync,
            # and don't close replica client even client output buffer limit is
            # reached.
            r set key $big_str ;# trigger output buffer limit check
            wait_for_ofs_sync r $replica
            # master accepted replica partial resync
            assert_equal [s sync_full] {1}
            assert_equal [s sync_partial_ok] {1}

            r multi
            r set key $big_str
            r set key $big_str
            r exec
            # replica's reply buffer size is more than client-output-buffer-limit but
            # doesn't exceed repl-backlog-size, we don't close replica client.
            wait_for_condition 1000 100 {
                [s -1 master_repl_offset] eq [s master_repl_offset]
            } else {
                fail "Replica offset didn't catch up with the master after too long time"
            }
            assert_equal [s sync_full] {1}
            assert_equal [s sync_partial_ok] {1}
        }
    }
}