Client eviction (#8687)
### Description
A mechanism for disconnecting clients when the total memory used by all connected clients
is above a configured limit. This prevents key eviction or OOM caused by the memory that
accumulates across all clients. It complements the `client-output-buffer-limit` mechanism:
rather than considering a single client and only its output buffers, it takes into account
all memory used by all clients.
#### Design
The general design is as follows:
* We track memory usage of each client, taking into account all memory used by the
client (query buffer, output buffer, parsed arguments, etc...). This is kept up to date
after reading from the socket, after processing commands and after writing to the socket.
* Based on the used memory we sort all clients into buckets. Each bucket contains all
clients using up to twice the memory of the clients in the bucket below it; for example,
clients using up to 1mb, up to 2mb, up to 4mb, and so on (see the sketch after this list).
* Before processing a command and before sleep we check if we're over the configured
limit. If we are, we start disconnecting clients from the larger buckets downwards until
we're back under the limit.
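
As a rough Tcl-only sketch of this bucketing (the proc name is hypothetical and the real
grouping is done in the server's C code), using the bucket boundaries listed under
"Important code changes" below:

```tcl
# Hypothetical sketch: map a client's memory usage (in bytes) to a bucket index.
# Bucket 0 holds clients using less than 64k, each following bucket doubles the
# upper bound (64k..128k, 128k..256k, ...), and the last bucket holds 4g and up.
proc client_mem_bucket {mem} {
    set bucket 0
    set bound [expr {64 * 1024}]
    while {$mem >= $bound && $bucket < 17} {
        incr bucket                    ;# move to the next (2x larger) bucket
        set bound [expr {$bound * 2}]
    }
    return $bucket                     ;# bucket 17 catches clients using 4g and up
}
```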
#### Config
`maxmemory-clients`: the maximum memory all clients combined are allowed to consume;
above this threshold we start disconnecting clients.
This config can either be set to 0 (meaning no limit), a size in bytes (possibly with MB/GB
suffix), or as a percentage of `maxmemory` by using the `%` suffix (e.g. setting it to `10%`
would mean 10% of `maxmemory`).
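
For example (using the test suite's `r` client handle, as the tests below do; the same
values can be set in the config file):

```tcl
r config set maxmemory-clients 0      ;# no limit
r config set maxmemory-clients 3mb    ;# absolute limit (mb/gb suffixes allowed)
r config set maxmemory-clients 10%    ;# 10% of maxmemory
```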
#### Important code changes
* During development I encountered yet more situations where our io-threads access
global vars and needed to fix them. I also had to keep the clients sorted into the
memory buckets (which are global) while their memory usage changes in the io-thread.
To achieve this I decided to simplify how we check whether we're in an io-thread and
make it much more explicit. I removed the `CLIENT_PENDING_READ` flag that was used to
check if the client is in an io-thread (it wasn't used for anything else) and just used
the global `io_threads_op` variable the same way to check during writes.
* I optimized the cleanup of the client from the `clients_pending_read` list on client
freeing. We now store a pointer to the client's node in this list in the `client` struct
(`pending_read_list_node`), so we don't need to search the list.
* Added `evicted_clients` stat to `INFO` command.
* Added `CLIENT NO-EVICT ON|OFF` sub command to exclude a specific client from the
client eviction mechanism. Added a corresponding 'e' flag in the client info string.
* Added `multi-mem` field in the client info string to show how much memory is used up
by buffered multi commands.
* Client `tot-mem` now accounts for buffered multi-commands, pubsub patterns and
channels (partially), tracking prefixes (partially).
* The `CLIENT_CLOSE_ASAP` flag is now handled in a new `beforeNextClient()` function so
clients will be disconnected between processing different clients and not only before sleep.
This new function can be used in the future for work we want to do outside the command
processing loop but don't want to wait for all clients to be processed before we get to it.
Specifically I wanted to handle output-buffer-limit related closing before we process client
eviction in case the two race with each other.
* Added a `DEBUG CLIENT-EVICTION` command to print out info about the client eviction
buckets.
* Each client now holds a pointer to the client eviction memory usage bucket it belongs to
and a `listNode` for itself in that bucket, for quick removal.
* The global `io_threads_op` variable can now contain an `IO_THREADS_OP_IDLE` value,
indicating that no io-threading is currently being executed.
* In order to track the memory used by each client in real time, we can't rely on updating
these stats in `clientsCron()` alone anymore. So now I call `updateClientMemUsage()`
(used to be `clientsCronTrackClientsMemUsage()`) after command processing, after
writing data to pubsub clients, after writing the output buffer and after reading from the
socket (and maybe other places too). The function is written to be fast.
* Clients are evicted if needed (with appropriate log line) in `beforeSleep()` and before
processing a command (before performing oom-checks and key-eviction).
* All clients' memory usage buckets are grouped as follows (as sketched under Design above):
* All clients using less than 64k.
* 64K..128K
* 128K..256K
* ...
* 2G..4G
* All clients using 4g and up.
* Added client-eviction.tcl with a bunch of tests for the new mechanism.
* Extended maxmemory.tcl to test the interaction between maxmemory and
maxmemory-clients settings.
* Added an option to flag a numeric configuration variable as a "percent": if we encounter
a '%' after the number in the config file (or in a `config set` command), we consider it
valid. Such a number is stored internally as a negative value, so an integer value can be
interpreted as either a percent (negative) or an absolute value (positive). This is useful,
for example, when some numeric configuration can optionally be set to a percentage of
something else (see the sketch after this list).
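
A minimal Tcl sketch of that convention (the proc name is hypothetical; the real logic
lives in the C config code): a negative stored value is resolved as a percent of some base
such as `maxmemory`, while a non-negative value is taken as an absolute:

```tcl
# Hypothetical illustration of the "percent stored as a negative value" convention.
proc resolve_percent_config {stored base} {
    if {$stored < 0} {
        return [expr {$base * (-$stored) / 100}]   ;# e.g. -10 means 10% of the base
    }
    return $stored                                  ;# non-negative values are absolute
}
```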
Co-authored-by: Oran Agra <oran@redislabs.com>
start_server {tags {"maxmemory" "external:skip"}} {
    r config set maxmemory 11mb
    r config set maxmemory-policy allkeys-lru
    set server_pid [s process_id]

    proc init_test {client_eviction} {
        r flushdb

        set prev_maxmemory_clients [r config get maxmemory-clients]
        if $client_eviction {
            r config set maxmemory-clients 3mb
            r client no-evict on
        } else {
            r config set maxmemory-clients 0
        }

        r config resetstat

        # fill 5mb using 50 keys of 100kb
        for {set j 0} {$j < 50} {incr j} {
            r setrange $j 100000 x
        }
        assert_equal [r dbsize] 50
    }

    # Return true if the eviction occurred (client or key) based on argument
    proc check_eviction_test {client_eviction} {
        set evicted_keys [s evicted_keys]
        set evicted_clients [s evicted_clients]
        set dbsize [r dbsize]

        if $client_eviction {
            return [expr $evicted_clients > 0 && $evicted_keys == 0 && $dbsize == 50]
        } else {
            return [expr $evicted_clients == 0 && $evicted_keys > 0 && $dbsize < 50]
        }
    }

    # Assert the eviction test passed (and prints some debug info on verbose)
    proc verify_eviction_test {client_eviction} {
        set evicted_keys [s evicted_keys]
        set evicted_clients [s evicted_clients]
        set dbsize [r dbsize]

        if $::verbose {
            puts "evicted keys: $evicted_keys"
            puts "evicted clients: $evicted_clients"
            puts "dbsize: $dbsize"
        }

        assert [check_eviction_test $client_eviction]
    }

    foreach {client_eviction} {false true} {
        set clients {}
        test "eviction due to output buffers of many MGET clients, client eviction: $client_eviction" {
            init_test $client_eviction

            for {set j 0} {$j < 20} {incr j} {
                set rr [redis_deferring_client]
                lappend clients $rr
            }

            # Generate client output buffers via MGET until we can observe some effect on
            # keys / client eviction, or we time out.
            set t [clock seconds]
            while {![check_eviction_test $client_eviction] && [expr [clock seconds] - $t] < 20} {
                foreach rr $clients {
                    if {[catch {
                        $rr mget 1
                        $rr flush
                    } err]} {
                        lremove clients $rr
                    }
                }
            }

            verify_eviction_test $client_eviction
        }
        foreach rr $clients {
            $rr close
        }

        set clients {}
        test "eviction due to input buffer of a dead client, client eviction: $client_eviction" {
            init_test $client_eviction

            for {set j 0} {$j < 30} {incr j} {
                set rr [redis_deferring_client]
                lappend clients $rr
            }

            foreach rr $clients {
                if {[catch {
                    $rr write "*250\r\n"
                    for {set j 0} {$j < 249} {incr j} {
                        $rr write "\$1000\r\n"
                        $rr write [string repeat x 1000]
                        $rr write "\r\n"
                        $rr flush
                    }
                }]} {
                    lremove clients $rr
                }
            }

            verify_eviction_test $client_eviction
        }

        foreach rr $clients {
            $rr close
        }

        set clients {}
        test "eviction due to output buffers of pubsub, client eviction: $client_eviction" {
            init_test $client_eviction

            for {set j 0} {$j < 20} {incr j} {
                set rr [redis_client]
                lappend clients $rr
            }

            foreach rr $clients {
                $rr subscribe bla
            }

            # Generate client output buffers via PUBLISH until we can observe some effect on
            # keys / client eviction, or we time out.
            set bigstr [string repeat x 100000]
            set t [clock seconds]
            while {![check_eviction_test $client_eviction] && [expr [clock seconds] - $t] < 20} {
                if {[catch { r publish bla $bigstr } err]} {
                    if $::verbose {
                        puts "Error publishing: $err"
                    }
                }
            }

            verify_eviction_test $client_eviction
|
        }
        foreach rr $clients {
            $rr close
        }
    }
}
start_server {tags {"maxmemory external:skip"}} {
    test "Without maxmemory small integers are shared" {
        r config set maxmemory 0
        r set a 1
        assert_refcount_morethan a 1
    }

    test "With maxmemory and non-LRU policy integers are still shared" {
        r config set maxmemory 1073741824
        r config set maxmemory-policy allkeys-random
        r set a 1
        assert_refcount_morethan a 1
    }

    test "With maxmemory and LRU policy integers are not shared" {
        r config set maxmemory 1073741824
        r config set maxmemory-policy allkeys-lru
        r set a 1
        r config set maxmemory-policy volatile-lru
        r set b 1
        assert_refcount 1 a
        assert_refcount 1 b
        r config set maxmemory 0
    }

    foreach policy {
        allkeys-random allkeys-lru allkeys-lfu volatile-lru volatile-lfu volatile-random volatile-ttl
    } {
        test "maxmemory - is the memory limit honoured? (policy $policy)" {
            # make sure to start with a blank instance
            r flushall
            # Get the current memory limit and calculate a new limit.
            # We just add 100k to the current memory size so that it is
            # fast for us to reach that limit.
            set used [s used_memory]
            set limit [expr {$used+100*1024}]
            r config set maxmemory $limit
            r config set maxmemory-policy $policy
            # Now add keys until the limit is almost reached.
            set numkeys 0
            while 1 {
                r setex [randomKey] 10000 x
                incr numkeys
                if {[s used_memory]+4096 > $limit} {
                    assert {$numkeys > 10}
                    break
                }
            }
            # If we add the same number of keys already added again, we
            # should still be under the limit.
            for {set j 0} {$j < $numkeys} {incr j} {
                r setex [randomKey] 10000 x
            }
            assert {[s used_memory] < ($limit+4096)}
        }
    }

    foreach policy {
        allkeys-random allkeys-lru volatile-lru volatile-random volatile-ttl
    } {
        test "maxmemory - only allkeys-* should remove non-volatile keys ($policy)" {
            # make sure to start with a blank instance
            r flushall
            # Get the current memory limit and calculate a new limit.
            # We just add 100k to the current memory size so that it is
            # fast for us to reach that limit.
            set used [s used_memory]
            set limit [expr {$used+100*1024}]
            r config set maxmemory $limit
            r config set maxmemory-policy $policy
            # Now add keys until the limit is almost reached.
            set numkeys 0
            while 1 {
                r set [randomKey] x
                incr numkeys
                if {[s used_memory]+4096 > $limit} {
                    assert {$numkeys > 10}
                    break
                }
            }
            # If we add the same number of keys already added again and
            # the policy is allkeys-* we should still be under the limit.
            # Otherwise we should see an error reported by Redis.
            set err 0
            for {set j 0} {$j < $numkeys} {incr j} {
                if {[catch {r set [randomKey] x} e]} {
                    if {[string match {*used memory*} $e]} {
                        set err 1
                    }
                }
            }
            if {[string match allkeys-* $policy]} {
                assert {[s used_memory] < ($limit+4096)}
            } else {
                assert {$err == 1}
            }
        }
    }

    foreach policy {
        volatile-lru volatile-lfu volatile-random volatile-ttl
    } {
        test "maxmemory - policy $policy should only remove volatile keys." {
            # make sure to start with a blank instance
            r flushall
            # Get the current memory limit and calculate a new limit.
            # We just add 100k to the current memory size so that it is
            # fast for us to reach that limit.
            set used [s used_memory]
            set limit [expr {$used+100*1024}]
            r config set maxmemory $limit
            r config set maxmemory-policy $policy
            # Now add keys until the limit is almost reached.
            set numkeys 0
            while 1 {
                # Odd keys are volatile
                # Even keys are non volatile
                if {$numkeys % 2} {
                    r setex "key:$numkeys" 10000 x
                } else {
                    r set "key:$numkeys" x
                }
                if {[s used_memory]+4096 > $limit} {
                    assert {$numkeys > 10}
                    break
                }
                incr numkeys
            }
            # Now we add the same number of volatile keys already added.
            # We expect Redis to evict only volatile keys in order to make
            # space.
            set err 0
            for {set j 0} {$j < $numkeys} {incr j} {
                catch {r setex "foo:$j" 10000 x}
            }
            # We should still be under the limit.
            assert {[s used_memory] < ($limit+4096)}
            # However all our non volatile keys should be here.
            for {set j 0} {$j < $numkeys} {incr j 2} {
                assert {[r exists "key:$j"]}
            }
        }
    }
}

# Calculate query buffer memory of slave
proc slave_query_buffer {srv} {
    set clients [split [$srv client list] "\r\n"]
    set c [lsearch -inline $clients *flags=S*]
    if {[string length $c] > 0} {
        assert {[regexp {qbuf=([0-9]+)} $c - qbuf]}
        assert {[regexp {qbuf-free=([0-9]+)} $c - qbuf_free]}
        return [expr $qbuf + $qbuf_free]
    }
    return 0
}

proc test_slave_buffers {test_name cmd_count payload_len limit_memory pipeline} {
    start_server {tags {"maxmemory external:skip"}} {
    start_server {} {
        set slave_pid [s process_id]
        test "$test_name" {
            set slave [srv 0 client]
            set slave_host [srv 0 host]
            set slave_port [srv 0 port]
            set master [srv -1 client]
            set master_host [srv -1 host]
            set master_port [srv -1 port]

            # Disable slow log for master to avoid memory growth in slow env.
            $master config set slowlog-log-slower-than -1

            # add 100 keys of 100k (10MB total)
            for {set j 0} {$j < 100} {incr j} {
                $master setrange "key:$j" 100000 asdf
            }

            # make sure master doesn't disconnect slave because of timeout
            $master config set repl-timeout 1200 ;# 20 minutes (for valgrind and slow machines)
            $master config set maxmemory-policy allkeys-random
            $master config set client-output-buffer-limit "replica 100000000 100000000 300"
            $master config set repl-backlog-size [expr {10*1024}]
Added INFO LATENCYSTATS section: latency by percentile distribution/latency by cumulative distribution of latencies (#9462)
# Short description
The Redis extended latency stats track per-command latencies and enable:
- exporting the per-command percentile distribution via the `INFO LATENCYSTATS` command.
**( percentile distribution is not mergeable between cluster nodes ).**
- exporting the per-command cumulative latency distributions via the `LATENCY HISTOGRAM` command.
Using the cumulative distribution of latencies we can merge several stats from different cluster nodes
to calculate aggregate metrics.
By default, the extended latency monitoring is enabled since the overhead of keeping track of the
command latency is very small.
If you don't want to track extended latency metrics, you can easily disable it at runtime using the command:
- `CONFIG SET latency-tracking no`
By default, the exported latency percentiles are the p50, p99, and p999.
You can alter them at runtime using the command:
- `CONFIG SET latency-tracking-info-percentiles "0.0 50.0 100.0"`
## Some details:
- The total size per histogram should sit around 40 KiB. We only allocate those 40 KiB when a command
is called for the first time.
- With regards to the WRITE overhead: as seen below, there is no measurable overhead on the achievable
ops/sec or on the full latency spectrum on the client. This also includes the measured redis-benchmark for
unstable vs. this branch.
- We track from 1 nanosecond to 1 second ( everything above 1 second is considered +Inf )
## `INFO LATENCYSTATS` exposition format
- Format: `latency_percentiles_usec_<CMDNAME>:p0=XX,p50....`
## `LATENCY HISTOGRAM [command ...]` exposition format
Return a cumulative distribution of latencies in the format of a histogram for the specified command names.
The histogram is composed of a map of time buckets:
- Each representing a latency range, between 1 nanosecond and roughly 1 second.
- Each bucket covers twice the previous bucket's range.
- Empty buckets are not printed.
- Everything above 1 sec is considered +Inf.
- At most there will be log2(1000000000)=30 buckets.
We reply with a map for each command in the format:
`<command name> : { `calls`: <total command calls> , `histogram` : { <bucket 1> : latency , < bucket 2> : latency, ... } }`
Co-authored-by: Oran Agra <oran@redislabs.com>
2022-01-05 07:01:05 -05:00
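To illustrate the bucket scheme the message above describes (the real implementation relies on a histogram library; this sketch only reproduces the doubling-bucket arithmetic, and the function name is made up):
```c
/* Sketch: latencies from 1ns up to ~1s fall into buckets whose upper bound
 * doubles each time, so at most about log2(1e9) = 30 buckets are needed;
 * anything above one second is treated as +Inf. */
#include <stdio.h>
#include <stdint.h>

#define MAX_LATENCY_NS 1000000000ULL /* ~1 second */

/* Return the inclusive upper bound of the power-of-two bucket holding 'ns',
 * or 0 to signal the +Inf bucket. */
static uint64_t latency_bucket_upper_bound(uint64_t ns) {
    if (ns > MAX_LATENCY_NS) return 0;
    uint64_t bound = 1;
    while (bound < ns) bound *= 2;
    return bound;
}

int main(void) {
    /* 1500ns lands in the 1024..2048ns bucket; 3000ns in 2048..4096ns. */
    printf("%llu %llu\n",
           (unsigned long long)latency_bucket_upper_bound(1500),
           (unsigned long long)latency_bucket_upper_bound(3000));
    return 0;
}
```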
            # disable latency tracking
            $master config set latency-tracking no
            $slave config set latency-tracking no

            $slave slaveof $master_host $master_port
            wait_for_condition 50 100 {
                [s 0 master_link_status] eq {up}
            } else {
                fail "Replication not started."
            }

            # measure used memory after the slave connected and set maxmemory
            set orig_used [s -1 used_memory]
            set orig_client_buf [s -1 mem_clients_normal]
            set orig_mem_not_counted_for_evict [s -1 mem_not_counted_for_evict]
            set orig_used_no_repl [expr {$orig_used - $orig_mem_not_counted_for_evict}]
            set limit [expr {$orig_used - $orig_mem_not_counted_for_evict + 32*1024}]

            if {$limit_memory==1} {
                $master config set maxmemory $limit
            }

            # put the slave to sleep
            set rd_slave [redis_deferring_client]
            exec kill -SIGSTOP $slave_pid

            # send some 10mb worth of commands that don't increase the memory usage
            if {$pipeline == 1} {
                set rd_master [redis_deferring_client -1]
                for {set k 0} {$k < $cmd_count} {incr k} {
                    $rd_master setrange key:0 0 [string repeat A $payload_len]
                }
                for {set k 0} {$k < $cmd_count} {incr k} {
Replication backlog and replicas use one global shared replication buffer (#9166)
## Background
On a Redis master, each replica uses its own copy of the replication buffer, which is a big waste of
memory: more replicas mean more waste, and allocating/freeing memory for every reply list is also costly.
If we set client-output-buffer-limit small and write traffic is heavy, the master may disconnect
replicas and never finish synchronizing with them. If we set client-output-buffer-limit big, the
master may go OOM when there are many replicas that each keep a lot of memory.
Because the replication buffers of the different replica clients hold the same content, one simple idea
is to have all replicas use a single replication buffer, which effectively saves memory.
Since the replication backlog content is the same as the replicas' output buffers, we can now
discard the separate replication backlog memory and use the global shared replication buffer
to implement the replication backlog mechanism.
## Implementation
I created one global "replication buffer" which contains the content of the replication stream.
The structure of the "replication buffer" is similar to the reply list that exists in every client,
but the list node is a `replBufBlock`, which has `id`, `repl_offset` and `refcount` fields.
```c
/* Replication buffer blocks is the list of replBufBlock.
*
* +--------------+ +--------------+ +--------------+
* | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 |
* +--------------+ +--------------+ +--------------+
* | / \
* | / \
* | / \
* Repl Backlog Replica_A Replica_B
*
* Each replica or replication backlog increments only the refcount of the
* 'ref_repl_buf_node' which it points to. So when replica walks to the next
* node, it should first increase the next node's refcount, and when we trim
* the replication buffer nodes, we remove node always from the head node which
* refcount is 0. If the refcount of the head node is not 0, we must stop
* trimming and never iterate the next node. */
/* Similar with 'clientReplyBlock', it is used for shared buffers between
* all replica clients and replication backlog. */
typedef struct replBufBlock {
int refcount; /* Number of replicas or repl backlog using. */
long long id; /* The unique incremental number. */
long long repl_offset; /* Start replication offset of the block. */
size_t size, used;
char buf[];
} replBufBlock;
```
So now, instead of feeding the replication stream into the replication backlog and into every replica
separately, we only need to feed it into the shared replication buffer via `feedReplicationBuffer`. In this
function we point some fields of the replication backlog and of the replicas at the global replication
buffer blocks. We also need to check the replicas' output buffer limits and free them if they exceed
`client-output-buffer-limit`, and trim the replication backlog if it exceeds `repl-backlog-size`.
When sending replies to replicas we iterate the replication buffer blocks and send their content; once a
block has been fully sent to a replica we decrement the current node's refcount, increment the next
node's refcount, and then free blocks whose refcount is 0 from the head of the replication buffer list.
Since we now use a linked list to manage the replication backlog, it may take a long time to iterate
all the list nodes to find the corresponding replication buffer node. So we create a rax tree to index
some of the nodes, but to avoid the rax tree occupying too much memory we only record one node
per 64 as an index entry.
Currently, to make partial resynchronization as likely as possible, we always keep the replication
backlog as the last reference on the replication buffer blocks. The backlog size may exceed our setting
if slow replicas reference a large number of replication buffer blocks, but this method doesn't increase
memory usage since they share the replication buffer. To avoid freezing the server while freeing
unreferenced replication buffer blocks when we need to trim the backlog for exceeding the configured
backlog size, we trim the backlog incrementally (currently freeing 64 blocks per call), and do it faster in
`beforeSleep` (freeing 640 blocks).
### Other changes
- `mem_total_replication_buffers`: we add this field to the INFO command; it reports the total
memory used by replication buffers.
- `mem_clients_slaves`: now, even if a replica is slow to replicate and its output buffer memory
is not 0, this field may still report 0. Since the replication backlog and the replicas share one global
replication buffer, we consider only the part of the replication buffer that exceeds the configured repl
backlog size as replicas' memory; otherwise we attribute the replication buffer memory to the
repl backlog.
- Key eviction
Since all replicas and the replication backlog share the global replication buffer, we consider only the
part that exceeds the backlog size as the extra, separate consumption of the replicas.
Because we trim the backlog incrementally in the background, the backlog size may exceed our
setting if slow replicas that reference a large number of replication buffer blocks disconnect.
To avoid a massive eviction loop, we don't count the not-yet-freed replication backlog towards
used memory even if there are no replicas, i.e. we also regard this memory as replicas' memory.
- `client-output-buffer-limit` check for replica clients
It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size
config (partial sync will succeed and then replica will get disconnected). Such a configuration is
ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption
implications since the replica client will share the backlog buffers memory.
- Drop replication backlog after loading data if needed
We always create the replication backlog if the server is a master; we need it because we put DELs in
it when loading expired keys from the RDB. But if the RDB doesn't have replication info, or there is no
RDB at all, partial resynchronization is not possible, so to avoid the extra memory of the replication
backlog we drop it.
- Multi IO threads
Since all replicas and the replication backlog use the global replication buffer, when I/O threads are
enabled the main thread must handle sending the output buffer to all replicas in order to keep data
access thread safe. Previously, other I/O threads could send the replicas' output buffers.
## Other optimizations
This solution resolves some other problems:
- When a replica is disconnected from the master for exceeding the output buffer limit, releasing the
replica's output buffer could freeze the server if a big `client-output-buffer-limit` is configured for
replicas; now it doesn't cause freezing.
- This implementation also mitigates the cost of copying the reply list (which could also freeze the
server) when one replica has a huge reply buffer and another replica needs a copy of that buffer for
full synchronization. Now we just copy the reference info, which is very light.
- If the replication backlog size is set big, it could also take a long time to copy the replication backlog
into a replica's output buffer. This commit eliminates that problem.
- Resizing the replication backlog doesn't empty the current replication backlog content.
2021-10-25 02:24:31 -04:00
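A simplified sketch of the refcounting idea described above, using hypothetical names (`bufBlock`, `readerAdvance`, `trimHead`) rather than the actual `replBufBlock` helpers: a reader pins the next block before releasing its current one, so trimming can only ever free fully-consumed head blocks.
```c
#include <stdlib.h>
#include <stddef.h>

typedef struct bufBlock {
    struct bufBlock *next;
    int refcount;        /* number of replicas/backlog referencing this block */
    size_t size, used;
    char buf[];
} bufBlock;

typedef struct bufReader {
    bufBlock *node;      /* block this reader currently references */
    size_t pos;          /* how much of 'node' was already sent */
} bufReader;

/* Advance the reader once it has sent everything in its current block:
 * take a reference on the next block before dropping the current one. */
static void readerAdvance(bufReader *r) {
    if (r->node == NULL || r->pos < r->node->used || r->node->next == NULL) return;
    bufBlock *next = r->node->next;
    next->refcount++;      /* pin the next block first */
    r->node->refcount--;   /* then release the current one */
    r->node = next;
    r->pos = 0;
}

/* Trim fully-unreferenced blocks from the head of the list; stop at the first
 * block that still has readers, as described in the commit message above. */
static bufBlock *trimHead(bufBlock *head) {
    while (head && head->refcount == 0) {
        bufBlock *next = head->next;
        free(head);
        head = next;
    }
    return head;
}

int main(void) {
    /* Two blocks; one reader (e.g. the backlog) pinned on the first. */
    bufBlock *b1 = calloc(1, sizeof(*b1) + 16);
    bufBlock *b2 = calloc(1, sizeof(*b2) + 16);
    b1->next = b2; b1->used = 16; b1->refcount = 1;
    bufReader backlog = { .node = b1, .pos = 16 };

    readerAdvance(&backlog);          /* moves to b2: b1's refcount drops to 0 */
    bufBlock *head = trimHead(b1);    /* b1 is freed, head is now b2 */
    (void)head;
    free(b2);
    return 0;
}
```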
                    $rd_master read
                }
            } else {
                for {set k 0} {$k < $cmd_count} {incr k} {
                    $master setrange key:0 0 [string repeat A $payload_len]
                }
            }

            set new_used [s -1 used_memory]
            set slave_buf [s -1 mem_clients_slaves]
            set client_buf [s -1 mem_clients_normal]
            set mem_not_counted_for_evict [s -1 mem_not_counted_for_evict]
            set used_no_repl [expr {$new_used - $mem_not_counted_for_evict - [slave_query_buffer $master]}]
            # we need to exclude replies buffer and query buffer of replica from used memory.
            # removing the replica (output) buffers is done so that we are able to measure any other
            # changes to the used memory and see that they're insignificant (the test's purpose is to check that
            # the replica buffers are counted correctly, so the used memory growth after deducting them
            # should be nearly 0).
            # we remove the query buffers because on slow test platforms, they can accumulate many ACKs.
            set delta [expr {($used_no_repl - $client_buf) - ($orig_used_no_repl - $orig_client_buf)}]

            assert {[$master dbsize] == 100}
            assert {$slave_buf > 2*1024*1024} ;# some of the data may have been pushed to the OS buffers
            set delta_max [expr {$cmd_count / 2}] ;# 1 byte unaccounted for, with 1M commands will consume some 1MB
            assert {$delta < $delta_max && $delta > -$delta_max}

            $master client kill type slave
            set info_str [$master info memory]
            set killed_used [getInfoProperty $info_str used_memory]
            set killed_mem_not_counted_for_evict [getInfoProperty $info_str mem_not_counted_for_evict]
            set killed_slave_buf [s -1 mem_clients_slaves]
            # we need to exclude replies buffer and query buffer of slave from used memory after kill slave
            set killed_used_no_repl [expr {$killed_used - $killed_mem_not_counted_for_evict - [slave_query_buffer $master]}]
            set delta_no_repl [expr {$killed_used_no_repl - $used_no_repl}]
            assert {[$master dbsize] == 100}
            assert {$killed_slave_buf == 0}
            assert {$delta_no_repl > -$delta_max && $delta_no_repl < $delta_max}
        }
        # unfreeze slave process (after the 'test' succeeded or failed, but before we attempt to terminate the server
        exec kill -SIGCONT $slave_pid
    }
    }
}

# test that slave buffer are counted correctly
# we wanna use many small commands, and we don't wanna wait long
# so we need to use a pipeline (redis_deferring_client)
# that may cause query buffer to fill and induce eviction, so we disable it
test_slave_buffers {slave buffer are counted correctly} 1000000 10 0 1

# test that slave buffer don't induce eviction
# test again with fewer (and bigger) commands without pipeline, but with eviction
test_slave_buffers "replica buffer don't induce eviction" 100000 100 1 0

start_server {tags {"maxmemory external:skip"}} {
Limit the main db and expires dictionaries to expand (#7954)
As we know, Redis may reject users' requests or evict some keys if used memory is over maxmemory.
Dictionary expansion can make things worse: some big dictionaries, such as the main db and the
expires dict, may eat huge amounts of memory at once when allocating a new big hash table and end
up far above maxmemory after expanding.
There are related issues: #4213 #4583
In more detail, when a dict expands in Redis we allocate a new, bigger ht[1] that is generally double
the size of ht[0], so the size of ht[1] will be very big if ht[0] is already big. For the db dict, if we have
more than 64 million keys, ht[1] costs about 1GB when the dict expands.
If the sum of the used memory and the memory needed for the dict's new hash table exceeds
maxmemory, we shouldn't allow the dict to expand: even with key eviction enabled we still couldn't
add many more keys after eviction and rehashing, and what's worse, Redis will keep fewer keys when
only a little memory remains and it is spent on the new hash table instead of users' data. Moreover,
users can't write data into Redis at all if key eviction is disabled.
What this commit changes:
Add a new member function, expandAllowed, to the dict type; it provides a way for the caller to allow
or disallow expansion. We expose two parameters to this function: the additional memory needed for
expanding and the dict's current load factor, so users can implement a function that decides based on
them.
For the main db and expires dict types, these dictionaries may be very big and expanding them can
cost huge amounts of memory, so we implement a judgement function: we provisionally stop the dict
from expanding if used memory would be over maxmemory after the expansion, but to guarantee
Redis performance we still allow the dict to expand if its load factor exceeds the safe load factor
(see the sketch after this commit message).
Add test cases to verify that we don't allow the main db to expand when the remaining memory is not
enough, so that key eviction is avoided.
Other changes:
The new hash table size used when expanding: before this commit the size was double the dict's used
count, later passed through _dictNextPower. Since we actually aim to keep the dict load factor between
0.5 and 1.0, we now replace *2 with +1; because the first check is that used >= size, the outcome will
usually be the same as _dictNextPower(used+1). The only case where it differs is when dict_can_resize
is false during a fork, where _dictNextPower(used*2) would cause the dict to jump to *4 (i.e.
_dictNextPower(1025*2) returns 4096).
Fix rehash test cases due to the changed algorithm for the new hash table size on expand.
2020-12-06 04:53:04 -05:00
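A sketch of the kind of judgement callback described above, with assumed names and a placeholder safe-load-factor constant (the real dictType hook and server fields differ):
```c
#include <stddef.h>

#define HASHTABLE_MAX_SAFE_LOAD_FACTOR 1.618 /* assumed threshold, for illustration */

/* moreMem: bytes the new hash table would require; usedRatio: entries/buckets.
 * Returns non-zero if the dict should be allowed to expand. */
int dictExpandAllowedSketch(size_t moreMem, double usedRatio,
                            size_t usedMemory, size_t maxmemory) {
    if (usedRatio > HASHTABLE_MAX_SAFE_LOAD_FACTOR) return 1; /* keep lookups fast */
    if (maxmemory == 0) return 1;                             /* no memory limit set */
    return usedMemory + moreMem <= maxmemory;                 /* stay under maxmemory */
}
```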
    test {Don't rehash if used memory exceeds maxmemory after rehash} {
        r config set latency-tracking no
        r config set maxmemory 0
        r config set maxmemory-policy allkeys-random

        # Next rehash size is 8192, that will eat 64k memory
        populate 4096 "" 1

        set used [s used_memory]
        set limit [expr {$used + 10*1024}]
        r config set maxmemory $limit
        r set k1 v1
        # Next writing command will trigger evicting some keys if last
        # command trigger DB dict rehash
        r set k2 v2
        # There must be 4098 keys because redis doesn't evict keys.
        r dbsize
    } {4098}
}

start_server {tags {"maxmemory external:skip"}} {
    test {client tracking don't cause eviction feedback loop} {
        r config set latency-tracking no
        r config set maxmemory 0
        r config set maxmemory-policy allkeys-lru
        r config set maxmemory-eviction-tenacity 100

        # 10 clients listening on tracking messages
        set clients {}
        for {set j 0} {$j < 10} {incr j} {
            lappend clients [redis_deferring_client]
        }
        foreach rd $clients {
            $rd HELLO 3
            $rd read ; # Consume the HELLO reply
            $rd CLIENT TRACKING on
            $rd read ; # Consume the CLIENT reply
        }

        # populate 300 keys, with long key name and short value
        for {set j 0} {$j < 300} {incr j} {
            set key $j[string repeat x 1000]
            r set $key x

            # for each key, enable caching for this key
            foreach rd $clients {
                $rd get $key
                $rd read
            }
        }

        # we need to wait one second for the client querybuf excess memory to be
        # trimmed by cron, otherwise the INFO used_memory and CONFIG maxmemory
        # below (on slow machines) won't be "atomic" and won't trigger eviction.
        after 1100

        # set the memory limit which will cause a few keys to be evicted
        # we need to make sure to evict keynames of a total size of more than
        # 16kb since the (PROTO_REPLY_CHUNK_BYTES), only after that the
        # invalidation messages have a chance to trigger further eviction.
        set used [s used_memory]
        set limit [expr {$used - 40000}]
        r config set maxmemory $limit

        # make sure some eviction happened
        set evicted [s evicted_keys]
        if {$::verbose} { puts "evicted: $evicted" }

        # make sure we didn't drain the database
        assert_range [r dbsize] 200 300

        assert_range $evicted 10 50
        foreach rd $clients {
            $rd read ;# make sure we have some invalidation message waiting
            $rd close
        }

        # eviction continues (known problem described in #8069)
        # for now this test only make sures the eviction loop itself doesn't
        # have feedback loop
        set evicted [s evicted_keys]
        if {$::verbose} { puts "evicted: $evicted" }
    }
}
Sort out mess around propagation and MULTI/EXEC (#9890)
The mess:
Some parts use alsoPropagate for late propagation, while others use an immediate one (propagate()),
causing edge cases, ugly/hacky code, and a tendency for bugs.
The basic idea is that all commands are propagated via alsoPropagate (i.e. added to a list) and the
top-most call() is responsible for going over that list and actually propagating them (and wrapping
them in MULTI/EXEC if there's more than one command). This is done in the new function,
propagatePendingCommands.
Callers to propagatePendingCommands:
1. top-most call() (we want all nested call()s to add to the also_propagate array and just the top-most
one to propagate them) - via `afterCommand`
2. handleClientsBlockedOnKeys: it is out of call() context and it may propagate stuff - via `afterCommand`.
3. handleClientsBlockedOnKeys edge case: if the looked-up key is already expired, we will propagate the
expire but will not unblock any client so `afterCommand` isn't called. in that case, we have to propagate
the deletion explicitly.
4. cron stuff: active-expire and eviction may also propagate stuff
5. modules: the module API allows to propagate stuff from just about anywhere (timers, keyspace notifications,
threads). I could have tried to catch all the out-of-call-context places but it seemed easier to handle it in one
place: when we free the context. in the spirit of what was done in call(), only the top-most freeing of a module
context may cause propagation.
6. modules: when using a thread-safe ctx it's not clear when/if the ctx will be freed. we do know that the module
must lock the GIL before calling RM_Replicate/RM_Call so we propagate the pending commands when
releasing the GIL.
A "known limitation", which were actually a bug, was fixed because of this commit (see propagate.tcl):
When using a mix of RM_Call with `!` and RM_Replicate, the command would propagate out-of-order:
first all the commands from RM_Call, and then the ones from RM_Replicate
Another thing worth mentioning: in the past, if a client issued a MULTI/EXEC with just one
write command, the server would blindly propagate the MULTI/EXEC too, even though it's redundant.
Not anymore.
This commit renames propagate() to propagateNow() in order to cause conflicts in pending PRs.
propagatePendingCommands is the only caller of propagateNow, which is now a static, internal helper function.
Optimizations:
1. alsoPropagate will not add stuff to also_propagate if there is no AOF and no replicas
2. alsoPropagate reallocs also_propagate exponentially, to save calls to memmove
Bugfixes:
1. CONFIG SET can create evictions, sending notifications which can cause dirty++ with modules.
We need to prevent it from propagating to AOF/replicas.
2. We need to set current_client in RM_Call. buggy scenario:
- CONFIG SET maxmemory, eviction notifications, module hook calls RM_Call
- assertion in lookupKey crashes, because current_client has CONFIG SET, which isn't CMD_WRITE
3. minor: in eviction, call propagateDeletion after notification, like active-expire and all commands
(we always send a notification before propagating the command)
2021-12-22 17:03:48 -05:00
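A rough sketch of the "collect, then flush at the top-most call" idea described above, using hypothetical types and a stubbed replication feed (not the actual server.c code):
```c
#include <stdio.h>
#include <stddef.h>

typedef struct pendingCmd { const char *cmdline; } pendingCmd;

typedef struct propagationCtx {
    pendingCmd *cmds;
    size_t count;
    int nesting;              /* call() depth; only depth 0 flushes */
} propagationCtx;

/* Stand-in for feeding AOF/replicas; just prints the replicated command. */
static void replicationFeed(const char *cmdline) { printf("-> %s\n", cmdline); }

static void propagatePendingCommands(propagationCtx *ctx) {
    if (ctx->nesting != 0 || ctx->count == 0) return;
    int wrap = ctx->count > 1;            /* a single command needs no transaction */
    if (wrap) replicationFeed("MULTI");
    for (size_t i = 0; i < ctx->count; i++)
        replicationFeed(ctx->cmds[i].cmdline);
    if (wrap) replicationFeed("EXEC");
    ctx->count = 0;
}

int main(void) {
    pendingCmd batch[] = { {"SET x 1"}, {"DEL y"} };
    propagationCtx ctx = { batch, 2, 0 };
    propagatePendingCommands(&ctx);  /* prints MULTI, SET x 1, DEL y, EXEC */
    return 0;
}
```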
|
|
|
|
|
|
|
start_server {tags {"maxmemory" "external:skip"}} {
    test {propagation with eviction} {
        set repl [attach_to_replication_stream]

        r set asdf1 1
        r set asdf2 2
        r set asdf3 3

        r config set maxmemory-policy allkeys-lru
        r config set maxmemory 1

        wait_for_condition 5000 10 {
            [r dbsize] eq 0
        } else {
            fail "Not all keys have been evicted"
        }

        r config set maxmemory 0
        r config set maxmemory-policy noeviction

        r set asdf4 4

        assert_replication_stream $repl {
            {select *}
            {set asdf1 1}
            {set asdf2 2}
            {set asdf3 3}
            {del asdf*}
            {del asdf*}
            {del asdf*}
            {set asdf4 4}
        }
        close_replication_stream $repl

        r config set maxmemory 0
        r config set maxmemory-policy noeviction
    }
}
start_server {tags {"maxmemory" "external:skip"}} {
    test {propagation with eviction in MULTI} {
        set repl [attach_to_replication_stream]

        r config set maxmemory-policy allkeys-lru

        r multi
        r incr x
        r config set maxmemory 1
        r incr x
        assert_equal [r exec] {1 OK 2}

        wait_for_condition 5000 10 {
            [r dbsize] eq 0
        } else {
            fail "Not all keys have been evicted"
        }

        assert_replication_stream $repl {
            {multi}
            {select *}
            {incr x}
            {incr x}
            {exec}
            {del x}
        }
        close_replication_stream $repl

        r config set maxmemory 0
        r config set maxmemory-policy noeviction
    }
}
Fix replication inconsistency on modules that use key space notifications (#10969)
Fix replication inconsistency on modules that use key space notifications.
### The Problem
In general, key space notifications are invoked after the command logic has executed
(this is not always the case; a specific command that does not follow this rule is
discussed below). For example, `set x 1` triggers a `set` notification that is invoked
after the `set` logic has run, so if the notification logic fetches `x`, it sees the new
data that was written.
Consider a scenario in which the notification logic performs some write commands.
For example, the notification logic increments a counter, `incr x{counter}`, indicating
how many times `x` was changed.
The logical order of execution is as follows:
```
set x 1
incr x{counter}
```
The issue is that the `set x 1` command is added to the replication buffer at the end
of the command invocation (specifically, after the key space notification logic was
invoked and performed the `incr` command).
The replication stream / AOF therefore sees the commands in the wrong order:
```
incr x{counter}
set x 1
```
In this specific example the order matters little, but if, for example, the notification
had deleted `x`, we would end up with primary-replica inconsistency.
### The Solution
Put the command that causes the notification in its rightful place. In the above example,
the `set x 1` command logic executed before the notification logic, so it should be added
to the replication buffer before the commands invoked by the notification logic. To achieve
this without a major refactoring, we save a placeholder in the replication buffer; when the
command logic finishes, we check whether the command needs to be replicated and, if it
does, we use the placeholder to add it to the replication buffer instead of appending it to
the end.
To be efficient and avoid allocating memory on every command just to save the placeholder,
the replication buffer array was modified to reuse memory (instead of allocating it each time
we want to replicate commands). Also, to avoid saving a placeholder when it is not needed,
we do so only for WRITE or MAY_REPLICATE commands. A minimal sketch of the placeholder
idea is shown below.
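The following is a minimal, illustrative C sketch of the placeholder technique, assuming a
simplified replication buffer of textual entries; it is not the actual Redis implementation,
and the struct and function names here are invented for illustration.
```c
#include <stdlib.h>

typedef struct {
    const char *repr;   /* textual stand-in for the command to replicate */
    int used;           /* 0 = still an empty placeholder */
} replSlot;

typedef struct {
    replSlot *slots;
    int num;
    int capacity;
} replBuffer;

/* Append an entry. The array can be reused across commands by resetting
 * 'num' instead of freeing it, so no per-command allocation is needed. */
static void replBufferAppend(replBuffer *buf, const char *repr) {
    if (buf->num == buf->capacity) {
        buf->capacity = buf->capacity ? buf->capacity * 2 : 8;
        buf->slots = realloc(buf->slots, buf->capacity * sizeof(replSlot));
    }
    buf->slots[buf->num].repr = repr;
    buf->slots[buf->num].used = (repr != NULL);
    buf->num++;
}

/* Reserve an empty slot before running the command logic. Done only for
 * WRITE or MAY_REPLICATE commands, to avoid the cost otherwise. */
static int reservePlaceholder(replBuffer *buf) {
    replBufferAppend(buf, NULL);
    return buf->num - 1;
}

/* After the command logic ran: if the command turned out to need
 * replication, put it into the reserved slot so it precedes whatever the
 * notification handlers appended; otherwise the slot stays unused and is
 * skipped when the buffer is flushed. */
static void fillPlaceholder(replBuffer *buf, int idx, const char *repr) {
    buf->slots[idx].repr = repr;
    buf->slots[idx].used = 1;
}

/* Flow for "set x 1" whose notification handler runs "incr x{counter}":
 *
 *   int slot = reservePlaceholder(&buf);          // before the command logic
 *   ... execute SET, fire the keyspace notification ...
 *   replBufferAppend(&buf, "incr x{counter}");    // appended by the handler
 *   fillPlaceholder(&buf, slot, "set x 1");       // SET lands before INCR
 */
```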
#### Additional Fixes
* Expire and eviction notifications:
  * The logical order for expire/eviction is to first perform the expire/eviction and then the
    notification logic. The replication buffer got this the other way around (first the
    notification's effects and then the `del` command). The PR fixes this issue.
  * The notification's effects and the `del` command were not wrapped in `multi-exec`
    (when needed). The PR also fixes this issue.
* SPOP command:
  * On SPOP, the `spop` notification is fired before the command logic is executed. The
    change in this PR would have altered the replication order (first the `spop` command
    and then the notification logic) although the logical order is first the notification logic
    and then the `spop` logic. The right fix would have been to fire the notification after
    the command executes (like all other commands), but that could be considered a
    breaking change. To overcome this, the PR keeps the current behavior and changes
    the `spop` code to preserve the right logical order when pushing commands to the
    replication buffer. Another PR will follow to fix SPOP properly and match it to the
    other commands (we split it into two separate PRs so it will be easy to cherry-pick
    this PR to 7.0 if we choose to).
#### Unhandled Known Limitations
* Key miss event:
  * On a key miss event, if a module performs a write command in the event handler
    (using `RM_Call`), the `dirty` counter increases and the read command that caused
    the key miss event gets replicated to the replication stream and AOF. This problem
    can also happen with a write command that opens some keys but eventually decides
    not to perform any action. We decided not to handle this problem in this PR because
    the solution is complex and adds risk in case we want to cherry-pick this PR. We
    should decide whether to handle it in future PRs. For now, module writers are advised
    not to perform any write commands in a key miss event.
#### Testing
* We already have tests covering cases where a notification invokes write commands
  that are also added to the replication buffer; they were modified to verify that the
  replica gets the commands in the correct logical order.
* A test was added to verify that `spop` behavior is unchanged.
* A test was added to verify that the key miss event behaves as expected.
* A test was added to verify that the changes do not break lazy expiration.
#### Additional Changes
* The `propagateNow` function can accept a special dbid, -1, indicating not to replicate
  `select`. We use this when replicating `multi`/`exec` from `propagatePendingCommands`.
  The side effect of this change is that the `select` command now appears inside the
  `multi`/`exec` block in the replication stream (instead of outside of it), as the test above
  asserts. Tests were modified to match this new behavior (see the sketch below).
2022-08-18 03:16:32 -04:00
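The dbid convention is easy to demonstrate. Below is a tiny, illustrative C sketch (not the real
`propagateNow` signature, and the db number is just an example) showing why passing -1 for the
MULTI/EXEC wrapper pushes the SELECT inside the transaction, matching the
`{multi} {select *} {incr x} {incr x} {exec}` order asserted in the test above.
```c
#include <stdio.h>

static int last_selected_db = -1;     /* no db selected on the stream yet */

/* Emit one entry of the replication stream. A dbid of -1 means "do not
 * emit a SELECT for this entry" (used for the MULTI/EXEC wrapper). */
static void propagateNow(int dbid, const char *cmd) {
    if (dbid != -1 && dbid != last_selected_db) {
        printf("SELECT %d\n", dbid);
        last_selected_db = dbid;
    }
    printf("%s\n", cmd);
}

int main(void) {
    /* Wrapping two pending commands that targeted db 9: */
    propagateNow(-1, "MULTI");   /* no SELECT emitted for the wrapper */
    propagateNow(9,  "INCR x");  /* SELECT 9 appears here, inside MULTI */
    propagateNow(9,  "INCR x");
    propagateNow(-1, "EXEC");
    return 0;
}
```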