Commit Graph

403 Commits

Author SHA1 Message Date
Binbin
35b3fbd90c
Freeze time sampling during command execution, and scripts (#10300)
Freeze time during execution of scripts and all other commands.
This means that a key is either expired or not, and doesn't change
state during a script execution. resolves #10182

This PR try to add a new `commandTimeSnapshot` function.
The function logic is extracted from `keyIsExpired`, but the related
calls to `fixed_time_expire` and `mstime()` are removed, see below.

In commands, we will avoid calling `mstime()` multiple times
and just use the one that sampled in call. The background is,
e.g. using `PEXPIRE 1` with valgrind sometimes result in the key
being deleted rather than expired. The reason is that both `PEXPIRE`
command and `checkAlreadyExpired` call `mstime()` separately.

There are other more important changes in this PR:
1. Eliminate `fixed_time_expire`, it is no longer needed. 
   When we want to sample time we should always use a time snapshot. 
   We will use `in_nested_call` instead to update the cached time in `call`.
2. Move the call for `updateCachedTime` from `serverCron` to `afterSleep`.
    Now `commandTimeSnapshot` will always return the sample time, the
    `lookupKeyReadWithFlags` call in `getNodeByQuery` will get a outdated
    cached time (because `processCommand` is out of the `call` context).
    We put the call to `updateCachedTime` in `aftersleep`.
3. Cache the time each time the module lock Redis.
    Call `updateCachedTime` in `moduleGILAfterLock`, affecting `RM_ThreadSafeContextLock`
    and `RM_ThreadSafeContextTryLock`

Currently the commandTimeSnapshot change affects the following TTL commands:
- SET EX / SET PX
- EXPIRE / PEXPIRE
- SETEX / PSETEX
- GETEX EX / GETEX PX
- TTL / PTTL
- EXPIRETIME / PEXPIRETIME
- RESTORE key TTL

And other commands just use the cached mstime (including TIME).

This is considered to be a breaking change since it can break a script
that uses a loop to wait for a key to expire.
2022-10-09 08:18:34 +03:00
Binbin
3c02d1acc4
code, typo and comment cleanups (#11280)
- fix `the the` typo
- `LPOPRPUSH` does not exist, should be `RPOPLPUSH`
- `CLUSTER GETKEYINSLOT` 's time complexity should be O(N)
- `there bytes` should be `three bytes`, this closes #11266
- `slave` word to `replica` in log, modified the front and missed the back
- remove useless aofReadDiffFromParent in server.h
- `trackingHandlePendingKeyInvalidations` method adds a void parameter
2022-10-02 13:56:45 +03:00
Valentino Geron
e53bf65245
Replica that asks for rdb only should be closed right after the rdb part (#11296)
The bug is that the the server keeps on sending newlines to the client.
As a result, the receiver might not find the EOF marker since it searches
for it only on the end of each payload it reads from the socket.
The but only affects `CLIENT_REPL_RDBONLY`.
This affects `redis-cli --rdb` (depending on timing)

The fixed consist of two steps:
1. The `CLIENT_REPL_RDBONLY` should be closed ASAP (we cannot
   always call to `freeClient` so we use `freeClientAsync`)
2. Add new replication state `SLAVE_STATE_RDB_TRANSMITTED`
2022-09-22 11:22:05 +03:00
zhenwei pi
45617385e7 Use connection name of string
Suggested by Oran, use an array to store all the connection types
instead of a linked list, and use connection name of string. The index
of a connection is dynamically allocated.

Currently we support max 8 connection types, include:
- tcp
- unix socket
- tls

and RDMA is in the plan, then we have another 4 types to support, it
should be enough in a long time.

Introduce 3 functions to get connection type by a fast path:
- connectionTypeTcp()
- connectionTypeTls()
- connectionTypeUnix()

Note that connectionByType() is designed to use only in unlikely code path.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
2022-08-22 15:15:37 +08:00
zhenwei pi
1234e3a562 Fully abstract connection type
Abstract common interface of connection type, so Redis can hide the
implementation and uplayer only calls connection API without macro.

               uplayer
                  |
           connection layer
             /          \
          socket        TLS

Currently, for both socket and TLS, all the methods of connection type
are declared as static functions.

It's possible to build TLS(even socket) as a shared library, and Redis
loads it dynamically in the next step.

Also add helper function connTypeOfCluster() and
connTypeOfReplication() to simplify the code:
link->conn = server.tls_cluster ? connCreateTLS() : connCreateSocket();
-> link->conn = connCreate(connTypeOfCluster());

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
2022-08-22 15:11:44 +08:00
zhenwei pi
bff7ecc786 Introduce connAddr
Originally, connPeerToString is designed to get the address info from
socket only(for both TCP & TLS), and the API 'connPeerToString' is
oriented to operate a FD like:
int connPeerToString(connection *conn, char *ip, size_t ip_len, int *port) {
    return anetFdToString(conn ? conn->fd : -1, ip, ip_len, port, FD_TO_PEER_NAME);
}

Introduce connAddr and implement .addr method for socket and TLS,
thus the API 'connAddr' and 'connFormatAddr' become oriented to a
connection like:
static inline int connAddr(connection *conn, char *ip, size_t ip_len, int *port, int remote) {
    if (conn && conn->type->addr) {
        return conn->type->addr(conn, ip, ip_len, port, remote);
    }

    return -1;
}

Also remove 'FD_TO_PEER_NAME' & 'FD_TO_SOCK_NAME', use a boolean type
'remote' to get local/remote address of a connection.

With these changes, it's possible to support the other connection
types which does not use socket(Ex, RDMA).

Thanks to Oran for suggestions!

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
2022-08-22 15:01:40 +08:00
Meir Shpilraien (Spielrein)
508a138885
Fix replication inconsistency on modules that uses key space notifications (#10969)
Fix replication inconsistency on modules that uses key space notifications.

### The Problem

In general, key space notifications are invoked after the command logic was
executed (this is not always the case, we will discuss later about specific
command that do not follow this rules). For example, the `set x 1` will trigger
a `set` notification that will be invoked after the `set` logic was performed, so
if the notification logic will try to fetch `x`, it will see the new data that was written.
Consider the scenario on which the notification logic performs some write
commands. for example, the notification logic increase some counter,
`incr x{counter}`, indicating how many times `x` was changed.
The logical order by which the logic was executed is has follow:

```
set x 1
incr x{counter}
```

The issue is that the `set x 1` command is added to the replication buffer
at the end of the command invocation (specifically after the key space
notification logic was invoked and performed the `incr` command).
The replication/aof sees the commands in the wrong order:

```
incr x{counter}
set x 1
```

In this specific example the order is less important.
But if, for example, the notification would have deleted `x` then we would
end up with primary-replica inconsistency.

### The Solution

Put the command that cause the notification in its rightful place. In the
above example, the `set x 1` command logic was executed before the
notification logic, so it should be added to the replication buffer before
the commands that is invoked by the notification logic. To achieve this,
without a major code refactoring, we save a placeholder in the replication
buffer, when finishing invoking the command logic we check if the command
need to be replicated, and if it does, we use the placeholder to add it to the
replication buffer instead of appending it to the end.

To be efficient and not allocating memory on each command to save the
placeholder, the replication buffer array was modified to reuse memory
(instead of allocating it each time we want to replicate commands).
Also, to avoid saving a placeholder when not needed, we do it only for
WRITE or MAY_REPLICATE commands.

#### Additional Fixes

* Expire and Eviction notifications:
  * Expire/Eviction logical order was to first perform the Expire/Eviction
    and then the notification logic. The replication buffer got this in the
    other way around (first notification effect and then the `del` command).
    The PR fixes this issue.
  * The notification effect and the `del` command was not wrap with
    `multi-exec` (if needed). The PR also fix this issue.
* SPOP command:
  * On spop, the `spop` notification was fired before the command logic
    was executed. The change in this PR would have cause the replication
    order to be change (first `spop` command and then notification `logic`)
    although the logical order is first the notification logic and then the
    `spop` logic. The right fix would have been to move the notification to
    be fired after the command was executed (like all the other commands),
    but this can be considered a breaking change. To overcome this, the PR
    keeps the current behavior and changes the `spop` code to keep the right
    logical order when pushing commands to the replication buffer. Another PR
    will follow to fix the SPOP properly and match it to the other command (we
    split it to 2 separate PR's so it will be easy to cherry-pick this PR to 7.0 if
    we chose to).

#### Unhanded Known Limitations

* key miss event:
  * On key miss event, if a module performed some write command on the
    event (using `RM_Call`), the `dirty` counter would increase and the read
    command that cause the key miss event would be replicated to the replication
    and aof. This problem can also happened on a write command that open
    some keys but eventually decides not to perform any action. We decided
    not to handle this problem on this PR because the solution is complex
    and will cause additional risks in case we will want to cherry-pick this PR.
    We should decide if we want to handle it in future PR's. For now, modules
    writers is advice not to perform any write commands on key miss event.

#### Testing

* We already have tests to cover cases where a notification is invoking write
  commands that are also added to the replication buffer, the tests was modified
  to verify that the replica gets the command in the correct logical order.
* Test was added to verify that `spop` behavior was kept unchanged.
* Test was added to verify key miss event behave as expected.
* Test was added to verify the changes do not break lazy expiration.

#### Additional Changes

* `propagateNow` function can accept a special dbid, -1, indicating not
  to replicate `select`. We use this to replicate `multi/exec` on `propagatePendingCommands`
  function. The side effect of this change is that now the `select` command
  will appear inside the `multi/exec` block on the replication stream (instead of
  outside of the `multi/exec` block). Tests was modified to match this new behavior.
2022-08-18 10:16:32 +03:00
Binbin
4505eb1821
errno cleanup around rdbLoad (#11042)
This is an addition to #11039, which cleans up rdbLoad* related errno. Remove the
errno print from the outer message (may be invalid since errno may have been overwritten).

Our aim should be the code that detects the error and knows which system call
triggered it, is the one to print errno, and not the code way up above (in some cases
a result of a logical error and not a system one).

Remove the code to update errno in rdbLoadRioWithLoadingCtx, signature check
and the rdb version check, in these cases, we do print the error message.
The caller dose not have the specific logic for handling EINVAL.

Small fix around rdb-preamble AOF: A truncated RDB is considered a failure,
not handled the same as a truncated AOF file.
2022-08-04 10:47:37 +03:00
Binbin
00097bf4aa
Change the return value of rdbLoad function to enums (#11039)
The reason we do this is because in #11036, we added error
log message when failing to open RDB file for reading.
In loadDdataFromDisk we call rdbLoad and also check errno,
now the logging corrupts errno (reported in alpine daily).

It is not safe to rely on errno as we do today, so we change
the return value of rdbLoad function to enums, like we have
when loading an AOF.
2022-07-26 15:13:13 +03:00
Binbin
03fff10ab4
fsync the old aof file when open a new INCR AOF (#11004)
In rewriteAppendOnlyFileBackground, after flushAppendOnlyFile(1),
and before openNewIncrAofForAppend, we should call redis_fsync
to fsync the aof file.

Because we may open a new INCR AOF in openNewIncrAofForAppend,
in the case of using everysec policy, the old AOF file may not
be fsynced in time (or even at all).

When using everysec, we don't want to pay the disk latency from
the main thread, so we will do a background fsync.

Adding a argument for bioCreateCloseJob, a `need_fsync` flag to
indicate that a fsync is required before the file is closed. So we will
fsync the old AOF file before we close it.

A cleanup, we make union become a union, since the free_* args and
the fd / fsync args are never used together.

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-07-25 09:16:35 +03:00
Tian
99a425d0f3
Fsync directory while persisting AOF manifest, RDB file, and config file (#10737)
The current process to persist files is `write` the data, `fsync` and `rename` the file,
but a underlying problem is that the rename may be lost when a sudden crash like
power outage and the directory hasn't been persisted.

The article [Ensuring data reaches disk](https://lwn.net/Articles/457667/) mentions
a safe way to update file should be:

1. create a new temp file (on the same file system!)
2. write data to the temp file
3. fsync() the temp file
4. rename the temp file to the appropriate name
5. fsync() the containing directory

This commit handles CONFIG REWRITE, AOF manifest, and RDB file (both for persistence,
and the one the replica gets from the master).
It doesn't handle (yet), ACL SAVE and Cluster configs, since these don't yet follow this pattern.
2022-06-20 19:17:23 +03:00
XiongDa
d4595dd94f
Fix typo in replication.c (#10854) 2022-06-13 12:12:46 +03:00
DarrenJiang13
bb1de082ea
Adds isolated netstats for replication. (#10062)
The amount of `server.stat_net_output_bytes/server.stat_net_input_bytes`
is actually the sum of replication flow and users' data flow. 
It may cause confusions like this:
"Why does my server get such a large output_bytes while I am doing nothing? ". 

After discussions and revisions, now here is the change about what this
PR brings (final version before merge):
- 2 server variables to count the network bytes during replication,
     including fullsync and propagate bytes.
     - `server.stat_net_repl_output_bytes`/`server.stat_net_repl_input_bytes`
- 3 info fields to print the input and output of repl bytes and instantaneous
     value of total repl bytes.
     - `total_net_repl_input_bytes` / `total_net_repl_output_bytes`
     - `instantaneous_repl_total_kbps`
- 1 new API `rioCheckType()` to check the type of rio. So we can use this
     to distinguish between diskless and diskbased replication
- 2 new counting items to keep network statistics consistent between master
     and slave
    - rdb portion during diskless replica. in `rdbLoadProgressCallback()`
    - first line of the full sync payload. in `readSyncBulkPayload()`

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-05-31 08:07:33 +03:00
Qu Chen
837c063baf
Replica fail and retry the PSYNC if the master is unresponsive (#10726)
We observed from our replication testing that when the master becomes unresponsive,
or the replication connection is broken during PSYNC so the replica doesn't get a
response from the master, it was not able to recognize that condition as a failure
and instead moved into the full-sync code path. This fix makes the replica fail and
retry the PSYNC with the master in such scenarios.
2022-05-16 12:08:19 +03:00
menwen
046654a70e
Stop RDB child before flushing and parsing the RDB in Diskless replication too (#10602)
We should stop RDB child in advance before flushing to reduce COW in diskless replication too.

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-04-20 09:54:55 +03:00
Wang Yuan
a9d5cfa99b
Optimize the call of prepareReplicasToWrite (#10588)
From #9166, we call several times of prepareReplicasToWrite when propagating
one write command to replication stream (once per argument, same as we do for
normal clients), that is not necessary. Now we only call it one time per command
at the begin of feeding replication stream.

This results in reducing CPU consumption and slightly better performance,
specifically when there are many replicas.
2022-04-17 09:41:46 +03:00
zhaozhao.zz
1a7765cb7c
Durability enhancement for appendfsync=always policy (#9678)
Durability of database is a big and old topic, in this regard Redis use AOF to
support it, and `appendfsync=alwasys` policy is the most strict level, guarantee
all data is both written and synced on disk before reply success to client.

But there are some cases have been overlooked, and could lead to durability broken.

1. The most clear one is about threaded-io mode
   we should also set client's write handler with `ae_barrier` in
   `handleClientsWithPendingWritesUsingThreads`, or the write handler would be
   called after read handler in the next event loop, it means the write command result
   could be replied to client before flush to AOF.
2. About blocked client (mostly by module)
   in `beforeSleep()`, `handleClientsBlockedOnKeys()` should be called before
   `flushAppendOnlyFile()`, in case the unblocked clients modify data without persistence
   but send reply.
3. When handling `ProcessingEventsWhileBlocked`
   normally it takes place when lua/function/module timeout, and we give a chance to users
   to kill the slow operation, but we should call `flushAppendOnlyFile()` before
   `handleClientsWithPendingWrites()`, in case the other clients in the last event loop get
   acknowledge before data persistence.
   for a instance:
   ```
   in the same event loop
   client A executes set foo bar
   client B executes eval "for var=1,10000000,1 do end" 0
   ```
   after the script timeout, client A will get `OK` but lose data after restart (kill redis when
   timeout) if we don't flush the write command to AOF.
4. A more complex case about `ProcessingEventsWhileBlocked`
   it is lua timeout in transaction, for example
   `MULTI; set foo bar; eval "for var=1,10000000,1 do end" 0; EXEC`, then client will get set
   command's result before the whole transaction done, that breaks atomicity too.
   fortunately, it's already fixed by #5428 (although it's not the original purpose just a side
   effect : )), but module timeout should be fixed too.

case 1, 2, 3 are fixed in this commit, the module issue in case 4 needs a followup PR.
2022-04-11 11:08:39 +03:00
zhaozhao.zz
78bef6e1fe
optimize(remove) usage of client's pending_querybuf (#10413)
To remove `pending_querybuf`, the key point is reusing `querybuf`, it means master client's `querybuf` is not only used to parse command, but also proxy to sub-replicas.

1. add a new variable `repl_applied` for master client to record how many data applied (propagated via `replicationFeedStreamFromMasterStream()`) but not trimmed in `querybuf`.

2. don't sdsrange `querybuf` in `commandProcessed()`, we trim it to `repl_applied` after the whole replication pipeline processed to avoid fragmented `sdsrange`. And here are some scenarios we cannot trim to `qb_pos`:
    * we don't receive complete command from master
    * master client blocked because of client pause
    * IO threads operate read, master client flagged with CLIENT_PENDING_COMMAND

    In these scenarios, `qb_pos` points to the part of the current command or the beginning of next command, and the current command is not applied yet, so the `repl_applied` is not equal to `qb_pos`.

Some other notes:
* Do not do big arg optimization on master client, since we can only sdsrange `querybuf` after data sent to replicas.
* Set `qb_pos` and `repl_applied` to 0 when `freeClient` in `replicationCacheMaster`.
* Rewrite `processPendingCommandsAndResetClient` to `processPendingCommandAndInputBuffer`, let `processInputBuffer` to be called successively after `processCommandAndResetClient`.
2022-03-25 10:45:40 +08:00
Meir Shpilraien (Spielrein)
f3855a0930
Add new RM_Call flags for script mode, no writes, and error replies. (#10372)
The PR extends RM_Call with 3 new capabilities using new flags that
are given to RM_Call as part of the `fmt` argument.
It aims to assist modules that are getting a list of commands to be
executed from the user (not hard coded as part of the module logic),
think of a module that implements a new scripting language...

* `S` - Run the command in a script mode, this means that it will raise an
  error if a command which are not allowed inside a script (flaged with the
  `deny-script` flag) is invoked (like SHUTDOWN). In addition, on script mode,
  write commands are not allowed if there is not enough good replicas (as
  configured with `min-replicas-to-write`) and/or a disk error happened.

* `W` - no writes mode, Redis will reject any command that is marked with `write`
  flag. Again can be useful to modules that implement a new scripting language
  and wants to prevent any write commands.

* `E` - Return errors as RedisModuleCallReply. Today the errors that happened
  before the command was invoked (like unknown commands or acl error) return
  a NULL reply and set errno. This might be missing important information about
  the failure and it is also impossible to just pass the error to the user using
  RM_ReplyWithCallReply. This new flag allows you to get a RedisModuleCallReply
  object with the relevant error message and treat it as if it was an error that was
  raised by the command invocation.

Tests were added to verify the new code paths.

In addition small refactoring was done to share some code between modules,
scripts, and `processCommand` function:
1. `getAclErrorMessage` was added to `acl.c` to unified to log message extraction
  from the acl result
2. `checkGoodReplicasStatus` was added to `replication.c` to check the status of
  good replicas. It is used on `scriptVerifyWriteCommandAllow`, `RM_Call`, and
  `processCommand`.
3. `writeCommandsGetDiskErrorMessage` was added to `server.c` to get the error
  message on persistence failure. Again it is used on `scriptVerifyWriteCommandAllow`,
  `RM_Call`, and `processCommand`.
2022-03-22 14:13:28 +02:00
郭伟光
dc7a9d3a31
Cleanup: replicationFeedMonitors use the monitor list arg it got (#10417)
Better check the monitors list argument instead of server.monitors in the function,
although they are basically the same in the context, so this doesn't have any
impact on the current code.
2022-03-13 16:19:42 +02:00
rangerzhang
4e012daee9
Fix outdated comments on updateSlavesWaitingBgsave (#10394)
* fix-replication-comments

The described capacity
 `and to schedule a new BGSAVE if there are slaves that attached while a BGSAVE was in progress`
was moved to `checkChildrenDone()`  named by `replicationStartPendingFork`

But the comment was not changed, may misleading others.

* remove-misleading-comments

The described capacity
 `to schedule a new BGSAVE if there are slaves that attached while a BGSAVE was in progress` 
and 
`or when the replication RDB transfer strategy is modified from disk to socket or the other way around` 
were not correct now.
2022-03-10 09:51:55 +02:00
rangerzhang
9b0fd9f4d0
Fix a mistake in comments (#10312)
There is no variable named by REPL_STATE_RECEIVE_PSYNC_REPLY, it should be REPL_STATE_RECEIVE_PSYNC_REPLY according to the contexts.
2022-02-20 08:09:19 +02:00
Oran Agra
aa9beaca77
Attempt to fix a rare crash in cluster tests. (#10265)
The theory is that a replica gets disconnected from within REPLCONF ACK,
so when we go up the stack, we'll crash when attempting to access
c->cmd->flags
2022-02-08 19:10:13 +02:00
Binbin
344e41c922
Fix PSYNC crash with wrong offset (#10243)
`PSYNC replicationid str_offset` will crash the server.

The reason is in `masterTryPartialResynchronization`,
we will call `getLongLongFromObjectOrReply` check the
offset. With a wrong offset, it will add a reply and
then trigger a full SYNC and the client become a replica.

So crash in `c->bufpos == 0 && listLength(c->reply) == 0`.
In this commit, we check the psync_offset before entering
function `masterTryPartialResynchronization`, and return.

Regardless of that crash, accepting the sync, but also replying
with an error would have corrupt the replication stream.
2022-02-06 13:13:56 +02:00
Oran Agra
ae89958972
Set repl-diskless-sync to yes by default, add repl-diskless-sync-max-replicas (#10092)
1. enable diskless replication by default
2. add a new config named repl-diskless-sync-max-replicas that enables
   replication to start before the full repl-diskless-sync-delay was
   reached.
3. put replica online sooner on the master (see below)
4. test suite uses repl-diskless-sync-delay of 0 to be faster
5. a few tests that use multiple replica on a pre-populated master, are
   now using the new repl-diskless-sync-max-replicas
6. fix possible timing issues in a few cluster tests (see below)

put replica online sooner on the master 
----------------------------------------------------
there were two tests that failed because they needed for the master to
realize that the replica is online, but the test code was actually only
waiting for the replica to realize it's online, and in diskless it could
have been before the master realized it.

changes include two things:
1. the tests wait on the right thing
2. issues in the master, putting the replica online in two steps.

the master used to put the replica as online in 2 steps. the first
step was to mark it as online, and the second step was to enable the
write event (only after getting ACK), but in fact the first step didn't
contains some of the tasks to put it online (like updating good slave
count, and sending the module event). this meant that if a test was
waiting to see that the replica is online form the point of view of the
master, and then confirm that the module got an event, or that the
master has enough good replicas, it could fail due to timing issues.

so now the full effect of putting the replica online, happens at once,
and only the part about enabling the writes is delayed till the ACK.

fix cluster tests 
--------------------
I added some code to wait for the replica to sync and avoid race
conditions.
later realized the sentinel and cluster tests where using the original 5
seconds delay, so changed it to 0.

this means the other changes are probably not needed, but i suppose
they're still better (avoid race conditions)
2022-01-17 14:11:11 +02:00
Meir Shpilraien (Spielrein)
885f6b5ceb
Redis Function Libraries (#10004)
# Redis Function Libraries

This PR implements Redis Functions Libraries as describe on: https://github.com/redis/redis/issues/9906.

Libraries purpose is to provide a better code sharing between functions by allowing to create multiple
functions in a single command. Functions that were created together can safely share code between
each other without worrying about compatibility issues and versioning.

Creating a new library is done using 'FUNCTION LOAD' command (full API is described below)

This PR introduces a new struct called libraryInfo, libraryInfo holds information about a library:
* name - name of the library
* engine - engine used to create the library
* code - library code
* description - library description
* functions - the functions exposed by the library

When Redis gets the `FUNCTION LOAD` command it creates a new empty libraryInfo.
Redis passes the `CODE` to the relevant engine alongside the empty libraryInfo.
As a result, the engine will create one or more functions by calling 'libraryCreateFunction'.
The new funcion will be added to the newly created libraryInfo. So far Everything is happening
locally on the libraryInfo so it is easy to abort the operation (in case of an error) by simply
freeing the libraryInfo. After the library info is fully constructed we start the joining phase by
which we will join the new library to the other libraries currently exist on Redis.
The joining phase make sure there is no function collision and add the library to the
librariesCtx (renamed from functionCtx). LibrariesCtx is used all around the code in the exact
same way as functionCtx was used (with respect to RDB loading, replicatio, ...).
The only difference is that apart from function dictionary (maps function name to functionInfo
object), the librariesCtx contains also a libraries dictionary that maps library name to libraryInfo object.

## New API
### FUNCTION LOAD
`FUNCTION LOAD <ENGINE> <LIBRARY NAME> [REPLACE] [DESCRIPTION <DESCRIPTION>] <CODE>`
Create a new library with the given parameters:
* ENGINE - REPLACE Engine name to use to create the library.
* LIBRARY NAME - The new library name.
* REPLACE - If the library already exists, replace it.
* DESCRIPTION - Library description.
* CODE - Library code.

Return "OK" on success, or error on the following cases:
* Library name already taken and REPLACE was not used
* Name collision with another existing library (even if replace was uses)
* Library registration failed by the engine (usually compilation error)

## Changed API
### FUNCTION LIST
`FUNCTION LIST [LIBRARYNAME <LIBRARY NAME PATTERN>] [WITHCODE]`
Command was modified to also allow getting libraries code (so `FUNCTION INFO` command is no longer
needed and removed). In addition the command gets an option argument, `LIBRARYNAME` allows you to
only get libraries that match the given `LIBRARYNAME` pattern. By default, it returns all libraries.

### INFO MEMORY
Added number of libraries to `INFO MEMORY`

### Commands flags
`DENYOOM` flag was set on `FUNCTION LOAD` and `FUNCTION RESTORE`. We consider those commands
as commands that add new data to the dateset (functions are data) and so we want to disallows
to run those commands on OOM.

## Removed API
* FUNCTION CREATE - Decided on https://github.com/redis/redis/issues/9906
* FUNCTION INFO - Decided on https://github.com/redis/redis/issues/9899

## Lua engine changes
When the Lua engine gets the code given on `FUNCTION LOAD` command, it immediately runs it, we call
this run the loading run. Loading run is not a usual script run, it is not possible to invoke any
Redis command from within the load run.
Instead there is a new API provided by `library` object. The new API's: 
* `redis.log` - behave the same as `redis.log`
* `redis.register_function` - register a new function to the library

The loading run purpose is to register functions using the new `redis.register_function` API.
Any attempt to use any other API will result in an error. In addition, the load run is has a time
limit of 500ms, error is raise on timeout and the entire operation is aborted.

### `redis.register_function`
`redis.register_function(<function_name>, <callback>, [<description>])`
This new API allows users to register a new function that will be linked to the newly created library.
This API can only be called during the load run (see definition above). Any attempt to use it outside
of the load run will result in an error.
The parameters pass to the API are:
* function_name - Function name (must be a Lua string)
* callback - Lua function object that will be called when the function is invokes using fcall/fcall_ro
* description - Function description, optional (must be a Lua string).

### Example
The following example creates a library called `lib` with 2 functions, `f1` and `f1`, returns 1 and 2 respectively:
```
local function f1(keys, args)
    return 1
end

local function f2(keys, args)
    return 2
end

redis.register_function('f1', f1)
redis.register_function('f2', f2)
```

Notice: Unlike `eval`, functions inside a library get the KEYS and ARGV as arguments to the
functions and not as global.

### Technical Details

On the load run we only want the user to be able to call a white list on API's. This way, in
the future, if new API's will be added, the new API's will not be available to the load run
unless specifically added to this white list. We put the while list on the `library` object and
make sure the `library` object is only available to the load run by using [lua_setfenv](https://www.lua.org/manual/5.1/manual.html#lua_setfenv) API. This API allows us to set
the `globals` of a function (and all the function it creates). Before starting the load run we
create a new fresh Lua table (call it `g`) that only contains the `library` API (we make sure
to set global protection on this table just like the general global protection already exists
today), then we use [lua_setfenv](https://www.lua.org/manual/5.1/manual.html#lua_setfenv)
to set `g` as the global table of the load run. After the load run finished we update `g`
metatable and set `__index` and `__newindex` functions to be `_G` (Lua default globals),
we also pop out the `library` object as we do not need it anymore.
This way, any function that was created on the load run (and will be invoke using `fcall`) will
see the default globals as it expected to see them and will not have the `library` API anymore.

An important outcome of this new approach is that now we can achieve a distinct global table
for each library (it is not yet like that but it is very easy to achieve it now). In the future we can
decide to remove global protection because global on different libraries will not collide or we
can chose to give different API to different libraries base on some configuration or input.

Notice that this technique was meant to prevent errors and was not meant to prevent malicious
user from exploit it. For example, the load run can still save the `library` object on some local
variable and then using in `fcall` context. To prevent such a malicious use, the C code also make
sure it is running in the right context and if not raise an error.
2022-01-06 13:39:38 +02:00
yoav-steinberg
65a7635793
redis-cli --replica reads dummy empty rdb instead of full snapshot (#10044)
This makes redis-cli --replica much faster and reduces COW/fork risks on server side.
This commit also improves the RDB filtering via REPLCONF rdb-filter-only to support no "include" specifiers at all.
2022-01-04 17:09:22 +02:00
Matthieu MOREL
d5a3b3f5ec
Setup dependabot for github-actions and codespell (#9857)
This sets up  dependabot to check weekly updates for pip and github-actions dependencies.
If it finds an update it will create a PR to update the dependency. More information can be found here

It includes the update of:

* vmactions/freebsd-vm from 0.1.4 to 0.1.5
* codespell from 2.0.0 to 2.1.0

Also includes spelling fixes found by the latest version of codespell.
Includes a dedicated .codespell folder so dependabot can read a requirements.txt file and every files dedicated to codespell can be grouped in the same place

Co-Authored-By: Matthieu MOREL <mmorel-35@users.noreply.github.com>
Co-Authored-By: MOREL Matthieu <matthieu.morel@cnp.fr>
2022-01-04 16:19:28 +02:00
Viktor Söderqvist
45a155bd0f
Wait for replicas when shutting down (#9872)
To avoid data loss, this commit adds a grace period for lagging replicas to
catch up the replication offset.

Done:

* Wait for replicas when shutdown is triggered by SIGTERM and SIGINT.

* Wait for replicas when shutdown is triggered by the SHUTDOWN command. A new
  blocked client type BLOCKED_SHUTDOWN is introduced, allowing multiple clients
  to call SHUTDOWN in parallel.
  Note that they don't expect a response unless an error happens and shutdown is aborted.

* Log warning for each replica lagging behind when finishing shutdown.

* CLIENT_PAUSE_WRITE while waiting for replicas.

* Configurable grace period 'shutdown-timeout' in seconds (default 10).

* New flags for the SHUTDOWN command:

    - NOW disables the grace period for lagging replicas.

    - FORCE ignores errors writing the RDB or AOF files which would normally
      prevent a shutdown.

    - ABORT cancels ongoing shutdown. Can't be combined with other flags.

* New field in the output of the INFO command: 'shutdown_in_milliseconds'. The
  value is the remaining maximum time to wait for lagging replicas before
  finishing the shutdown. This field is present in the Server section **only**
  during shutdown.

Not directly related:

* When shutting down, if there is an AOF saving child, it is killed **even** if AOF
  is disabled. This can happen if BGREWRITEAOF is used when AOF is off.

* Client pause now has end time and type (WRITE or ALL) per purpose. The
  different pause purposes are *CLIENT PAUSE command*, *failover* and
  *shutdown*. If clients are unpaused for one purpose, it doesn't affect client
  pause for other purposes. For example, the CLIENT UNPAUSE command doesn't
  affect client pause initiated by the failover or shutdown procedures. A completed
  failover or a failed shutdown doesn't unpause clients paused by the CLIENT
  PAUSE command.

Notes:

* DEBUG RESTART doesn't wait for replicas.

* We already have a warning logged when a replica disconnects. This means that
  if any replica connection is lost during the shutdown, it is either logged as
  disconnected or as lagging at the time of exit.

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-01-02 09:50:15 +02:00
yoav-steinberg
1bf6d6f11e
Generate RDB with Functions only via redis-cli --functions-rdb (#9968)
This is needed in order to ease the deployment of functions for ephemeral cases, where user
needs to spin up a server with functions pre-loaded.

#### Details:

* Added `--functions-rdb` option to _redis-cli_.
* Functions only rdb via `REPLCONF rdb-filter-only functions`. This is a placeholder for a space
  separated inclusion filter for the RDB. In the future can be `REPLCONF rdb-filter-only
  "functions db:3 key-patten:user*"` and a complementing `rdb-filter-exclude` `REPLCONF`
  can also be added.
* Handle "slave requirements" specification to RDB saving code so we can use the same RDB
  when different slaves express the same requirements (like functions-only) and not share the
  RDB when their requirements differ. This is currently just a flags `int`, but can be extended to
  a more complex structure with various filter fields.
* make sure to support filters only in diskless replication mode (not to override the persistence file),
  we do that by forcing diskless (even if disabled by config)

other changes:
* some refactoring in rdb.c (extract portion of a big function to a sub-function)
* rdb_key_save_delay used in AOFRW too
* sendChildInfo takes the number of updated keys (incremental, rather than absolute)

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-01-02 09:39:01 +02:00
guybe7
7ac213079c
Sort out mess around propagation and MULTI/EXEC (#9890)
The mess:
Some parts use alsoPropagate for late propagation, others using an immediate one (propagate()),
causing edge cases, ugly/hacky code, and the tendency for bugs

The basic idea is that all commands are propagated via alsoPropagate (i.e. added to a list) and the
top-most call() is responsible for going over that list and actually propagating them (and wrapping
them in MULTI/EXEC if there's more than one command). This is done in the new function,
propagatePendingCommands.

Callers to propagatePendingCommands:
1. top-most call() (we want all nested call()s to add to the also_propagate array and just the top-most
   one to propagate them) - via `afterCommand`
2. handleClientsBlockedOnKeys: it is out of call() context and it may propagate stuff - via `afterCommand`. 
3. handleClientsBlockedOnKeys edge case: if the looked-up key is already expired, we will propagate the
   expire but will not unblock any client so `afterCommand` isn't called. in that case, we have to propagate
   the deletion explicitly.
4. cron stuff: active-expire and eviction may also propagate stuff
5. modules: the module API allows to propagate stuff from just about anywhere (timers, keyspace notifications,
   threads). I could have tried to catch all the out-of-call-context places but it seemed easier to handle it in one
   place: when we free the context. in the spirit of what was done in call(), only the top-most freeing of a module
   context may cause propagation.
6. modules: when using a thread-safe ctx it's not clear when/if the ctx will be freed. we do know that the module
   must lock the GIL before calling RM_Replicate/RM_Call so we propagate the pending commands when
   releasing the GIL.

A "known limitation", which were actually a bug, was fixed because of this commit (see propagate.tcl):
   When using a mix of RM_Call with `!` and RM_Replicate, the command would propagate out-of-order:
   first all the commands from RM_Call, and then the ones from RM_Replicate

Another thing worth mentioning is that if, in the past, a client would issue a MULTI/EXEC with just one
write command the server would blindly propagate the MULTI/EXEC too, even though it's redundant.
not anymore.

This commit renames propagate() to propagateNow() in order to cause conflicts in pending PRs.
propagatePendingCommands is the only caller of propagateNow, which is now a static, internal helper function.

Optimizations:
1. alsoPropagate will not add stuff to also_propagate if there's no AOF and replicas
2. alsoPropagate reallocs also_propagagte exponentially, to save calls to memmove

Bugfixes:
1. CONFIG SET can create evictions, sending notifications which can cause to dirty++ with modules.
   we need to prevent it from propagating to AOF/replicas
2. We need to set current_client in RM_Call. buggy scenario:
   - CONFIG SET maxmemory, eviction notifications, module hook calls RM_Call
   - assertion in lookupKey crashes, because current_client has CONFIG SET, which isn't CMD_WRITE
3. minor: in eviction, call propagateDeletion after notification, like active-expire and all commands
   (we always send a notification before propagating the command)
2021-12-23 00:03:48 +02:00
Meir Shpilraien (Spielrein)
3bcf108416
Change FUNCTION CREATE, DELETE and FLUSH to be WRITE commands instead of MAY_REPLICATE. (#9953)
The issue with MAY_REPLICATE is that all automatic mechanisms to handle
write commands will not work. This require have a special treatment for:
* Not allow those commands to be executed on RO replica.
* Allow those commands to be executed on RO replica from primary connection.
* Allow those commands to be executed on the RO replica from AOF.

By setting those commands as WRITE commands we are getting all those properties from Redis.
Test was added to verify that those properties work as expected.

In addition, rearrange when and where functions are flushed. Before this PR functions were
flushed manually on `rdbLoadRio` and cleaned manually on failure. This contradicts the
assumptions that functions are data and need to be created/deleted alongside with the
data. A side effect of this, for example, `debug reload noflush` did not flush the data but
did flush the functions, `debug loadaof` flush the data but not the functions.
This PR move functions deletion into `emptyDb`. `emptyDb` (renamed to `emptyData`) will
now accept an additional flag, `NOFUNCTIONS` which specifically indicate that we do not
want to flush the functions (on all other cases, functions will be flushed). Used the new flag
on FLUSHALL and FLUSHDB only! Tests were added to `debug reload` and `debug loadaof`
to verify that functions behave the same as the data.

Notice that because now functions will be deleted along side with the data we can not allow
`CLUSTER RESET` to be called from within a function (it will cause the function to be released
while running), this PR adds `NO_SCRIPT` flag to `CLUSTER RESET`  so it will not be possible
to be called from within a function. The other cluster commands are allowed from within a
function (there are use-cases that uses `GETKEYSINSLOT` to iterate over all the keys on a
given slot). Tests was added to verify `CLUSTER RESET` is denied from within a script.

Another small change on this PR is that `RDBFLAGS_ALLOW_DUP` is also applicable on functions.
When loading functions, if this flag is set, we will replace old functions with new ones on collisions.
2021-12-21 16:13:29 +02:00
zhugezy
1b0968df46
Remove EVAL script verbatim replication, propagation, and deterministic execution logic (#9812)
# Background

The main goal of this PR is to remove relevant logics on Lua script verbatim replication,
only keeping effects replication logic, which has been set as default since Redis 5.0.
As a result, Lua in Redis 7.0 would be acting the same as Redis 6.0 with default
configuration from users' point of view.

There are lots of reasons to remove verbatim replication.
Antirez has listed some of the benefits in Issue #5292:

>1. No longer need to explain to users side effects into scripts.
    They can do whatever they want.
>2. No need for a cache about scripts that we sent or not to the slaves.
>3. No need to sort the output of certain commands inside scripts
    (SMEMBERS and others): this both simplifies and gains speed.
>4. No need to store scripts inside the RDB file in order to startup correctly.
>5. No problems about evicting keys during the script execution.

When looking back at Redis 5.0, antirez and core team decided to set the config
`lua-replicate-commands yes` by default instead of removing verbatim replication
directly, in case some bad situations happened. 3 years later now before Redis 7.0,
it's time to remove it formally.

# Changes

- configuration for lua-replicate-commands removed
  - created config file stub for backward compatibility
- Replication script cache removed
  - this is useless under script effects replication
  - relevant statistics also removed
- script persistence in RDB files is also removed
- Propagation of SCRIPT LOAD and SCRIPT FLUSH to replica / AOF removed
- Deterministic execution logic in scripts removed (i.e. don't run write commands
  after random ones, and sorting output of commands with random order)
  - the flags indicating which commands have non-deterministic results are kept as hints to clients.
- `redis.replicate_commands()` & `redis.set_repl()` changed
  - now `redis.replicate_commands()` does nothing and return an 1
  - ...and then `redis.set_repl()` can be issued before `redis.replicate_commands()` now
- Relevant TCL cases adjusted
- DEBUG lua-always-replicate-commands removed

# Other changes
- Fix a recent bug comparing CLIENT_ID_AOF to original_client->flags instead of id. (introduced in #9780)

Co-authored-by: Oran Agra <oran@redislabs.com>
2021-12-21 08:32:42 +02:00
meir@redislabs.com
cbd463175f Redis Functions - Added redis function unit and Lua engine
Redis function unit is located inside functions.c
and contains Redis Function implementation:
1. FUNCTION commands:
  * FUNCTION CREATE
  * FCALL
  * FCALL_RO
  * FUNCTION DELETE
  * FUNCTION KILL
  * FUNCTION INFO
2. Register engine

In addition, this commit introduce the first engine
that uses the Redis Function capabilities, the
Lua engine.
2021-12-02 19:35:52 +02:00
meir@redislabs.com
fc731bc67f Redis Functions - Introduce script unit.
Script unit is a new unit located on script.c.
Its purpose is to provides an API for functions (and eval)
to interact with Redis. Interaction includes mostly
executing commands, but also functionalities like calling
Redis back on long scripts or check if the script was killed.

The interaction is done using a scriptRunCtx object that
need to be created by the user and initialized using scriptPrepareForRun.

Detailed list of functionalities expose by the unit:
1. Calling commands (including all the validation checks such as
   acl, cluster, read only run, ...)
2. Set Resp
3. Set Replication method (AOF/REPLICATION/NONE)
4. Call Redis back to on long running scripts to allow Redis reply
   to clients and perform script kill

The commit introduce the new unit and uses it on eval commands to
interact with Redis.
2021-12-01 23:54:23 +02:00
yoav-steinberg
0e5b813ef9
Multiparam config set (#9748)
We can now do: `config set maxmemory 10m repl-backlog-size 5m`

## Basic algorithm to support "transaction like" config sets:

1. Backup all relevant current values (via get).
2. Run "verify" and "set" on everything, if we fail run "restore".
3. Run "apply" on everything (optional optimization: skip functions already run). If we fail run "restore".
4. Return success.

### restore
1. Run set on everything in backup. If we fail log it and continue (this puts us in an undefined
   state but we decided it's better than the alternative of panicking). This indicates either a bug
   or some unsupported external state.
2. Run apply on everything in backup (optimization: skip functions already run). If we fail log
   it (see comment above).
3. Return error.

## Implementation/design changes:
* Apply function are idempotent (have no effect if they are run more than once for the same config).
* No indication in set functions if we're reading the config or running from the `CONFIG SET` command
   (removed `update` argument).
* Set function should set some config variable and assume an (optional) apply function will use that
   later to apply. If we know this setting can be safely applied immediately and can always be reverted
   and doesn't depend on any other configuration we can apply immediately from within the set function
   (and not store the setting anywhere). This is the case of this `dir` config, for example, which has no
   apply function. No apply function is need also in the case that setting the variable in the `server` struct
   is all that needs to be done to make the configuration take effect. Note that the original concept of `update_fn`,
   which received the old and new values was removed and replaced by the optional apply function.
* Apply functions use settings written to the `server` struct and don't receive any inputs.
* I take care that for the generic (non-special) configs if there's no change I avoid calling the setter (possible
   optimization: avoid calling the apply function as well).
* Passing the same config parameter more than once to `config set` will fail. You can't do `config set my-setting
   value1 my-setting value2`.

Note that getting `save` in the context of the conf file parsing to work here as before was a pain.
The conf file supports an aggregate `save` definition, where each `save` line is added to the server's
save params. This is unlike any other line in the config file where each line overwrites any previous
configuration. Since we now support passing multiple save params in a single line (see top comments
about `save` in https://github.com/redis/redis/pull/9644) we should deprecate the aggregate nature of
this config line and perhaps reduce this ugly code in the future.
2021-12-01 10:15:11 +02:00
Eduardo Semprebon
c22d3684ba
Fix diskless load handling on broken EOF marker (#9752)
During diskless replication, the check for broken EOF mark is misplaced
and should be earlier. Now we do not swap db, we do proper cleanup and
correctly raise module events on this kind of failure.

This issue existed prior to #9323, but before, the side effect was not restoring
backup and not raising the correct module events on this failure.
2021-11-09 11:46:10 +02:00
Eduardo Semprebon
91d0c758e5
Replica keep serving data during repl-diskless-load=swapdb for better availability (#9323)
For diskless replication in swapdb mode, considering we already spend replica memory
having a backup of current db to restore in case of failure, we can have the following benefits
by instead swapping database only in case we succeeded in transferring db from master:

- Avoid `LOADING` response during failed and successful synchronization for cases where the
  replica is already up and running with data.
- Faster total time of diskless replication, because now we're moving from Transfer + Flush + Load
  time to Transfer + Load only. Flushing the tempDb is done asynchronously after swapping.
- This could be implemented also for disk replication with similar benefits if consumers are willing
  to spend the extra memory usage.

General notes:
- The concept of `backupDb` becomes `tempDb` for clarity.
- Async loading mode will only kick in if the replica is syncing from a master that has the same
  repl-id the one it had before. i.e. the data it's getting belongs to a different time of the same timeline. 
- New property in INFO: `async_loading` to differentiate from the blocking loading
- Slot to Key mapping is now a field of `redisDb` as it's more natural to access it from both server.db
  and the tempDb that is passed around.
- Because this is affecting replicas only, we assume that if they are not readonly and write commands
  during replication, they are lost after SYNC same way as before, but we're still denying CONFIG SET
  here anyways to avoid complications.

Considerations for review:
- We have many cases where server.loading flag is used and even though I tried my best, there may
  be cases where async_loading should be checked as well and cases where it shouldn't (would require
  very good understanding of whole code)
- Several places that had different behavior depending on the loading flag where actually meant to just
  handle commands coming from the AOF client differently than ones coming from real clients, changed
  to check CLIENT_ID_AOF instead.

**Additional for Release Notes**
- Bugfix - server.dirty was not incremented for any kind of diskless replication, as effect it wouldn't
  contribute on triggering next database SAVE
- New flag for RM_GetContextFlags module API: REDISMODULE_CTX_FLAGS_ASYNC_LOADING
- Deprecated RedisModuleEvent_ReplBackup. Starting from Redis 7.0, we don't fire this event.
  Instead, we have the new RedisModuleEvent_ReplAsyncLoad holding 3 sub-events: STARTED,
  ABORTED and COMPLETED.
- New module flag REDISMODULE_OPTIONS_HANDLE_REPL_ASYNC_LOAD for RedisModule_SetModuleOptions
  to allow modules to declare they support the diskless replication with async loading (when absent, we fall
  back to disk-based loading).

Co-authored-by: Eduardo Semprebon <edus@saxobank.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2021-11-04 10:46:50 +02:00
Wang Yuan
526cbb5cff
Fix not updating backlog histlen when trimming repl backlog (#9713)
Since the loop in incrementalTrimReplicationBacklog checks the size of histlen,
we cannot afford to update it only when the loop exits, this may cause deleting
much more replication blocks, and replication backlog may be less than setting size.

introduce in #9166 

Co-authored-by: sundb <sundbcn@gmail.com>
2021-11-02 11:04:11 +02:00
zhaozhao.zz
d08f0552ee
rebuild replication backlog index when master restart (#9720)
After PR #9166 , replication backlog is not a real block of memory, just contains a
reference points to replication buffer's block and the blocks index (to accelerate
search offset when partial sync), so we need update both replication buffer's block's
offset and replication backlog blocks index's offset when master restart from RDB,
since the `server.master_repl_offset` is changed.
The implications of this bug was just a slow search, but not a replication failure.
2021-11-02 10:53:52 +02:00
Wang Yuan
c1718f9d86
Replication backlog and replicas use one global shared replication buffer (#9166)
## Background
For redis master, one replica uses one copy of replication buffer, that is a big waste of memory,
more replicas more waste, and allocate/free memory for every reply list also cost much.
If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with
replicas and can't finish synchronization with replica. If we set  client-output-buffer-limit big,
master may be OOM when there are many replicas that separately keep much memory.
Because replication buffers of different replica client are the same, one simple idea is that
all replicas only use one replication buffer, that will effectively save memory.

Since replication backlog content is the same as replicas' output buffer, now we
can discard replication backlog memory and use global shared replication buffer
to implement replication backlog mechanism.

## Implementation
I create one global "replication buffer" which contains content of replication stream.
The structure of "replication buffer" is similar to the reply list that exists in every client.
But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields.
```c
/* Replication buffer blocks is the list of replBufBlock.
 *
 * +--------------+       +--------------+       +--------------+
 * | refcount = 1 |  ...  | refcount = 0 |  ...  | refcount = 2 |
 * +--------------+       +--------------+       +--------------+
 *      |                                            /       \
 *      |                                           /         \
 *      |                                          /           \
 *  Repl Backlog                               Replia_A      Replia_B
 * 
 * Each replica or replication backlog increments only the refcount of the
 * 'ref_repl_buf_node' which it points to. So when replica walks to the next
 * node, it should first increase the next node's refcount, and when we trim
 * the replication buffer nodes, we remove node always from the head node which
 * refcount is 0. If the refcount of the head node is not 0, we must stop
 * trimming and never iterate the next node. */

/* Similar with 'clientReplyBlock', it is used for shared buffers between
 * all replica clients and replication backlog. */
typedef struct replBufBlock {
    int refcount;           /* Number of replicas or repl backlog using. */
    long long id;           /* The unique incremental number. */
    long long repl_offset;  /* Start replication offset of the block. */
    size_t size, used;
    char buf[];
} replBufBlock;
```
So now when we feed replication stream into replication backlog and all replicas, we only need
to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of
replication backlog and replicas to references of the global replication buffer blocks. And we also
need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim
replication backlog if exceeding `repl-backlog-size`.

When sending reply to replicas, we also need to iterate replication buffer blocks and send its
content, when totally sending one block for replica, we decrease current node count and
increase the next current node count, and then free the block which reference is 0 from the
head of replication buffer blocks.

Since now we use linked list to manage replication backlog, it may cost much time for iterating
all linked list nodes to find corresponding replication buffer node. So we create a rax tree to
store some nodes  for index, but to avoid rax tree occupying too much memory, i record
one per 64 nodes for index.

Currently, to make partial resynchronization as possible as much, we always let replication
backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting
if slow replicas that reference vast replication buffer blocks, and this method doesn't increase
memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced
replication buffer blocks when we need to trim backlog for exceeding backlog size setting,
we trim backlog incrementally (free 64 blocks per call now), and make it faster in
`beforeSleep` (free 640 blocks).

### Other changes
- `mem_total_replication_buffers`: we add this field in INFO command, it means the total
  memory of replication buffers used.
- `mem_clients_slaves`:  now even replica is slow to replicate, and its output buffer memory
  is not 0, but it still may be 0, since replication backlog and replicas share one global replication
  buffer, only if replication buffer memory is more than the repl backlog setting size, we consider
  the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption
  of repl backlog.
- Key eviction
  Since all replicas and replication backlog share global replication buffer, we think only the
  part of exceeding backlog size the extra separate consumption of replicas.
  Because we trim backlog incrementally in the background, backlog size may exceeds our
  setting if slow replicas that reference vast replication buffer blocks disconnect.
  To avoid massive eviction loop, we don't count the delayed freed replication backlog into
  used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory.
- `client-output-buffer-limit` check for replica clients
  It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size
  config (partial sync will succeed and then replica will get disconnected). Such a configuration is
  ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption
  implications since the replica client will share the backlog buffers memory.
- Drop replication backlog after loading data if needed
  We always create replication backlog if server is a master, we need it because we put DELs in
  it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb,
  it is not possible to support partial resynchronization, to avoid extra memory of replication backlog,
  we drop it.
- Multi IO threads
 Since all replicas and replication backlog use global replication buffer,  if I/O threads are enabled,
  to guarantee data accessing thread safe, we must let main thread handle sending the output buffer
  to all replicas. But before, other IO threads could handle sending output buffer of all replicas.

## Other optimizations
This solution resolve some other problem:
- When replicas disconnect with master since of out of output buffer limit, releasing the output
  buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now,
  it doesn't cause freezing.
- This implementation may mitigate reply list copy cost time(also freezes server) when one replication
  has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy
  reference info, it is very light.
- If we set replication backlog size big, it also may cost much time to copy replication backlog into
  replica's output buffer. But this commit eliminates this problem.
- Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 09:24:31 +03:00
Oran Agra
6b297cd646
Improve errno reporting on fork and fopen rdbLoad failures (#9649)
I moved a bunch of stats in redisFork to be executed only on successful
fork, since they seem wrong to be done when it failed.
I guess when fork fails it does that immediately, no latency spike.
2021-10-24 16:52:44 +03:00
guybe7
43e736f79b
Treat subcommands as commands (#9504)
## Intro

The purpose is to allow having different flags/ACL categories for
subcommands (Example: CONFIG GET is ok-loading but CONFIG SET isn't)

We create a small command table for every command that has subcommands
and each subcommand has its own flags, etc. (same as a "regular" command)

This commit also unites the Redis and the Sentinel command tables

## Affected commands

CONFIG
Used to have "admin ok-loading ok-stale no-script"
Changes:
1. Dropped "ok-loading" in all except GET (this doesn't change behavior since
there were checks in the code doing that)

XINFO
Used to have "read-only random"
Changes:
1. Dropped "random" in all except CONSUMERS

XGROUP
Used to have "write use-memory"
Changes:
1. Dropped "use-memory" in all except CREATE and CREATECONSUMER

COMMAND
No changes.

MEMORY
Used to have "random read-only"
Changes:
1. Dropped "random" in PURGE and USAGE

ACL
Used to have "admin no-script ok-loading ok-stale"
Changes:
1. Dropped "admin" in WHOAMI, GENPASS, and CAT

LATENCY
No changes.

MODULE
No changes.

SLOWLOG
Used to have "admin random ok-loading ok-stale"
Changes:
1. Dropped "random" in RESET

OBJECT
Used to have "read-only random"
Changes:
1. Dropped "random" in ENCODING and REFCOUNT

SCRIPT
Used to have "may-replicate no-script"
Changes:
1. Dropped "may-replicate" in all except FLUSH and LOAD

CLIENT
Used to have "admin no-script random ok-loading ok-stale"
Changes:
1. Dropped "random" in all except INFO and LIST
2. Dropped "admin" in ID, TRACKING, CACHING, GETREDIR, INFO, SETNAME, GETNAME, and REPLY

STRALGO
No changes.

PUBSUB
No changes.

CLUSTER
Changes:
1. Dropped "admin in countkeysinslots, getkeysinslot, info, nodes, keyslot, myid, and slots

SENTINEL
No changes.

(note that DEBUG also fits, but we decided not to convert it since it's for
debugging and anyway undocumented)

## New sub-command
This commit adds another element to the per-command output of COMMAND,
describing the list of subcommands, if any (in the same structure as "regular" commands)
Also, it adds a new subcommand:
```
COMMAND LIST [FILTERBY (MODULE <module-name>|ACLCAT <cat>|PATTERN <pattern>)]
```
which returns a set of all commands (unless filters), but excluding subcommands.

## Module API
A new module API, RM_CreateSubcommand, was added, in order to allow
module writer to define subcommands

## ACL changes:
1. Now, that each subcommand is actually a command, each has its own ACL id.
2. The old mechanism of allowed_subcommands is redundant
(blocking/allowing a subcommand is the same as blocking/allowing a regular command),
but we had to keep it, to support the widespread usage of allowed_subcommands
to block commands with certain args, that aren't subcommands (e.g. "-select +select|0").
3. I have renamed allowed_subcommands to allowed_firstargs to emphasize the difference.
4. Because subcommands are commands in ACL too, you can now use "-" to block subcommands
(e.g. "+client -client|kill"), which wasn't possible in the past.
5. It is also possible to use the allowed_firstargs mechanism with subcommand.
For example: `+config -config|set +config|set|loglevel` will block all CONFIG SET except
for setting the log level.
6. All of the ACL changes above required some amount of refactoring.

## Misc
1. There are two approaches: Either each subcommand has its own function or all
   subcommands use the same function, determining what to do according to argv[0].
   For now, I took the former approaches only with CONFIG and COMMAND,
   while other commands use the latter approach (for smaller blamelog diff).
2. Deleted memoryGetKeys: It is no longer needed because MEMORY USAGE now uses the "range" key spec.
4. Bugfix: GETNAME was missing from CLIENT's help message.
5. Sentinel and Redis now use the same table, with the same function pointer.
   Some commands have a different implementation in Sentinel, so we redirect
   them (these are ROLE, PUBLISH, and INFO).
6. Command stats now show the stats per subcommand (e.g. instead of stats just
   for "config" you will have stats for "config|set", "config|get", etc.)
7. It is now possible to use COMMAND directly on subcommands:
   COMMAND INFO CONFIG|GET (The pipeline syntax was inspired from ACL, and
   can be used in functions lookupCommandBySds and lookupCommandByCString)
8. STRALGO is now a container command (has "help")

## Breaking changes:
1. Command stats now show the stats per subcommand (see (5) above)
2021-10-20 11:52:57 +03:00
Binbin
dd3ac97ffe
Cleanup typos, incorrect comments, and fixed small memory leak in redis-cli (#9153)
1. Remove forward declarations from header files to functions that do not exist:
hmsetCommand and rdbSaveTime.
2. Minor phrasing fixes in #9519
3. Add missing sdsfree(title) and fix typo in redis-benchmark.
4. Modify some error comments in some zset commands.
5. Fix copy-paste bug comment in syncWithMaster about `ip-address`.
2021-10-02 22:19:33 -07:00
yoav-steinberg
2753429c99
Client eviction (#8687)
### Description
A mechanism for disconnecting clients when the sum of all connected clients is above a
configured limit. This prevents eviction or OOM caused by accumulated used memory
between all clients. It's a complimentary mechanism to the `client-output-buffer-limit`
mechanism which takes into account not only a single client and not only output buffers
but rather all memory used by all clients.

#### Design
The general design is as following:
* We track memory usage of each client, taking into account all memory used by the
  client (query buffer, output buffer, parsed arguments, etc...). This is kept up to date
  after reading from the socket, after processing commands and after writing to the socket.
* Based on the used memory we sort all clients into buckets. Each bucket contains all
  clients using up up to x2 memory of the clients in the bucket below it. For example up
  to 1m clients, up to 2m clients, up to 4m clients, ...
* Before processing a command and before sleep we check if we're over the configured
  limit. If we are we start disconnecting clients from larger buckets downwards until we're
  under the limit.

#### Config
`maxmemory-clients` max memory all clients are allowed to consume, above this threshold
we disconnect clients.
This config can either be set to 0 (meaning no limit), a size in bytes (possibly with MB/GB
suffix), or as a percentage of `maxmemory` by using the `%` suffix (e.g. setting it to `10%`
would mean 10% of `maxmemory`).

#### Important code changes
* During the development I encountered yet more situations where our io-threads access
  global vars. And needed to fix them. I also had to handle keeps the clients sorted into the
  memory buckets (which are global) while their memory usage changes in the io-thread.
  To achieve this I decided to simplify how we check if we're in an io-thread and make it
  much more explicit. I removed the `CLIENT_PENDING_READ` flag used for checking
  if the client is in an io-thread (it wasn't used for anything else) and just used the global
  `io_threads_op` variable the same way to check during writes.
* I optimized the cleanup of the client from the `clients_pending_read` list on client freeing.
  We now store a pointer in the `client` struct to this list so we don't need to search in it
  (`pending_read_list_node`).
* Added `evicted_clients` stat to `INFO` command.
* Added `CLIENT NO-EVICT ON|OFF` sub command to exclude a specific client from the
  client eviction mechanism. Added corrosponding 'e' flag in the client info string.
* Added `multi-mem` field in the client info string to show how much memory is used up
  by buffered multi commands.
* Client `tot-mem` now accounts for buffered multi-commands, pubsub patterns and
  channels (partially), tracking prefixes (partially).
* CLIENT_CLOSE_ASAP flag is now handled in a new `beforeNextClient()` function so
  clients will be disconnected between processing different clients and not only before sleep.
  This new function can be used in the future for work we want to do outside the command
  processing loop but don't want to wait for all clients to be processed before we get to it.
  Specifically I wanted to handle output-buffer-limit related closing before we process client
  eviction in case the two race with each other.
* Added a `DEBUG CLIENT-EVICTION` command to print out info about the client eviction
  buckets.
* Each client now holds a pointer to the client eviction memory usage bucket it belongs to
  and listNode to itself in that bucket for quick removal.
* Global `io_threads_op` variable now can contain a `IO_THREADS_OP_IDLE` value
  indicating no io-threading is currently being executed.
* In order to track memory used by each clients in real-time we can't rely on updating
  these stats in `clientsCron()` alone anymore. So now I call `updateClientMemUsage()`
  (used to be `clientsCronTrackClientsMemUsage()`) after command processing, after
  writing data to pubsub clients, after writing the output buffer and after reading from the
  socket (and maybe other places too). The function is written to be fast.
* Clients are evicted if needed (with appropriate log line) in `beforeSleep()` and before
  processing a command (before performing oom-checks and key-eviction).
* All clients memory usage buckets are grouped as follows:
  * All clients using less than 64k.
  * 64K..128K
  * 128K..256K
  * ...
  * 2G..4G
  * All clients using 4g and up.
* Added client-eviction.tcl with a bunch of tests for the new mechanism.
* Extended maxmemory.tcl to test the interaction between maxmemory and
  maxmemory-clients settings.
* Added an option to flag a numeric configuration variable as a "percent", this means that
  if we encounter a '%' after the number in the config file (or config set command) we
  consider it as valid. Such a number is store internally as a negative value. This way an
  integer value can be interpreted as either a percent (negative) or absolute value (positive).
  This is useful for example if some numeric configuration can optionally be set to a percentage
  of something else.

Co-authored-by: Oran Agra <oran@redislabs.com>
2021-09-23 14:02:16 +03:00
Wang Yuan
cee3d67f50
Delay to discard cached master when full synchronization (#9398)
* Delay to discard cache master when full synchronization
* Don't disconnect with replicas before loading transferred RDB when full sync

Previously, once replica need to start full synchronization with master,
it will discard cached master whatever full synchronization is failed or
not. 
Now we discard cached master only when transferring RDB is finished
and start to change data space, this make replica could start partial
resynchronization with another new master if new master is failed
during full synchronization.
2021-09-09 11:32:29 +03:00
yoav-steinberg
5e908a290c
dict struct memory optimizations (#9228)
Reduce dict struct memory overhead
on 64bit dict size goes down from jemalloc's 96 byte bin to its 56 byte bin.

summary of changes:
- Remove `privdata` from callbacks and dict creation. (this affects many files, see "Interface change" below).
- Meld `dictht` struct into the `dict` struct to eliminate struct padding. (this affects just dict.c and defrag.c)
- Eliminate the `sizemask` field, can be calculated from size when needed.
- Convert the `size` field into `size_exp` (exponent), utilizes one byte instead of 8.

Interface change: pass dict pointer to dict type call back functions.
This is instead of passing the removed privdata field. In the future if
we'd like to have private data in the callbacks we can extract it from
the dict type. We can extend dictType to include a custom dict struct
allocator and use it to allocate more data at the end of the dict
struct. This data can then be used to store private data later acccessed
by the callbacks.
2021-08-05 08:25:58 +03:00
Binbin
4000cb7d34
Modify some error logs printing level. (#9306)
1. In sendBulkToSlave, we used LL_VERBOSE in the past, changed to
LL_WARNING. (all the other places that do freeClient(slave) use LL_WARNING)
2. The old style LOG_WARNING, chang it to LL_WARNING. Introduced in an
old pr (#1690).
2021-08-02 11:18:35 +03:00
Ewg-c
a403816405
Minor refactoring for rioConnRead and adding errno (#9280)
minor refactoring for rioConnRead and adding errno
2021-07-29 17:29:23 -07:00
Binbin
3a26f61fb3
Add range check for master port in replicaof. (#9201)
So that we can avoid commands that are obviously wrong.

Also unified with loadServerConfigFromString
because we also checked the range.
2021-07-06 10:46:10 +03:00