Reduce dict struct memory overhead
on 64bit dict size goes down from jemalloc's 96 byte bin to its 56 byte bin.
summary of changes:
- Remove `privdata` from callbacks and dict creation. (this affects many files, see "Interface change" below).
- Meld `dictht` struct into the `dict` struct to eliminate struct padding. (this affects just dict.c and defrag.c)
- Eliminate the `sizemask` field, can be calculated from size when needed.
- Convert the `size` field into `size_exp` (exponent), utilizes one byte instead of 8.
Interface change: pass dict pointer to dict type call back functions.
This is instead of passing the removed privdata field. In the future if
we'd like to have private data in the callbacks we can extract it from
the dict type. We can extend dictType to include a custom dict struct
allocator and use it to allocate more data at the end of the dict
struct. This data can then be used to store private data later acccessed
by the callbacks.
## Backgroud
As we know, after `fork`, one process will copy pages when writing data to these
pages(CoW), and another process still keep old pages, they totally cost more memory.
For redis, we suffered that redis consumed much memory when the fork child is serializing
key/values, even that maybe cause OOM.
But actually we find, in redis fork child process, the child process don't need to keep some
memory and parent process may write or update that, for example, child process will never
access the key-value that is serialized but users may update it in parent process.
So we think it may reduce COW if the child process release memory that it is not needed.
## Implementation
For releasing key value in child process, we may think we call `decrRefCount` to free memory,
but i find the fork child process still use much memory when we don't write any data to redis,
and it costs much more time that slows down bgsave. Maybe because memory allocator doesn't
really release memory to OS, and it may modify some inner data for this free operation, especially
when we free small objects.
Moreover, CoW is based on pages, so it is a easy way that we only free the memory bulk that is
not less than kernel page size. madvise(MADV_DONTNEED) can quickly release specified region
pages to OS bypassing memory allocator, and allocator still consider that this memory still is used
and don't change its inner data.
There are some buffers we can release in the fork child process:
- **Serialized key-values**
the fork child process never access serialized key-values, so we try to free them.
Because we only can release big bulk memory, and it is time consumed to iterate all
items/members/fields/entries of complex data type. So we decide to iterate them and
try to release them only when their average size of item/member/field/entry is more
than page size of OS.
- **Replication backlog**
Because replication backlog is a cycle buffer, it will be changed quickly if redis has heavy
write traffic, but in fork child process, we don't need to access that.
- **Client buffers**
If clients have requests during having the fork child process, clients' buffer also be changed
frequently. The memory includes client query buffer, output buffer, and client struct used memory.
To get child process peak private dirty memory, we need to count peak memory instead
of last used memory, because the child process may continue to release memory (since
COW used to only grow till now, the last was equivalent to the peak).
Also we're adding a new `current_cow_peak` info variable (to complement the existing
`current_cow_size`)
Co-authored-by: Oran Agra <oran@redislabs.com>
## Current state
1. Lua has its own parser that handles parsing `reds.call` replies and translates them
to Lua objects that can be used by the user Lua code. The parser partially handles
resp3 (missing big number, verbatim, attribute, ...)
2. Modules have their own parser that handles parsing `RM_Call` replies and translates
them to RedisModuleCallReply objects. The parser does not support resp3.
In addition, in the future, we want to add Redis Function (#8693) that will probably
support more languages. At some point maintaining so many parsers will stop
scaling (bug fixes and protocol changes will need to be applied on all of them).
We will probably end up with different parsers that support different parts of the
resp protocol (like we already have today with Lua and modules)
## PR Changes
This PR attempt to unified the reply parsing of Lua and modules (and in the future
Redis Function) by introducing a new parser unit (`resp_parser.c`). The new parser
handles parsing the reply and calls different callbacks to allow the users (another
unit that uses the parser, i.e, Lua, modules, or Redis Function) to analyze the reply.
### Lua API Additions
The code that handles reply parsing on `scripting.c` was removed. Instead, it uses
the resp_parser to parse and create a Lua object out of the reply. As mentioned
above the Lua parser did not handle parsing big numbers, verbatim, and attribute.
The new parser can handle those and so Lua also gets it for free.
Those are translated to Lua objects in the following way:
1. Big Number - Lua table `{'big_number':'<str representation for big number>'}`
2. Verbatim - Lua table `{'verbatim_string':{'format':'<verbatim format>', 'string':'<verbatim string value>'}}`
3. Attribute - currently ignored and not expose to the Lua parser, another issue will be open to decide how to expose it.
Tests were added to check resp3 reply parsing on Lua
### Modules API Additions
The reply parsing code on `module.c` was also removed and the new resp_parser is used instead.
In addition, the RedisModuleCallReply was also extracted to a separate unit located on `call_reply.c`
(in the future, this unit will also be used by Redis Function). A nice side effect of unified parsing is
that modules now also support resp3. Resp3 can be enabled by giving `3` as a parameter to the
fmt argument of `RM_Call`. It is also possible to give `0`, which will indicate an auto mode. i.e, Redis
will automatically chose the reply protocol base on the current client set on the RedisModuleCtx
(this mode will mostly be used when the module want to pass the reply to the client as is).
In addition, the following RedisModuleAPI were added to allow analyzing resp3 replies:
* New RedisModuleCallReply types:
* `REDISMODULE_REPLY_MAP`
* `REDISMODULE_REPLY_SET`
* `REDISMODULE_REPLY_BOOL`
* `REDISMODULE_REPLY_DOUBLE`
* `REDISMODULE_REPLY_BIG_NUMBER`
* `REDISMODULE_REPLY_VERBATIM_STRING`
* `REDISMODULE_REPLY_ATTRIBUTE`
* New RedisModuleAPI:
* `RedisModule_CallReplyDouble` - getting double value from resp3 double reply
* `RedisModule_CallReplyBool` - getting boolean value from resp3 boolean reply
* `RedisModule_CallReplyBigNumber` - getting big number value from resp3 big number reply
* `RedisModule_CallReplyVerbatim` - getting format and value from resp3 verbatim reply
* `RedisModule_CallReplySetElement` - getting element from resp3 set reply
* `RedisModule_CallReplyMapElement` - getting key and value from resp3 map reply
* `RedisModule_CallReplyAttribute` - getting a reply attribute
* `RedisModule_CallReplyAttributeElement` - getting key and value from resp3 attribute reply
* New context flags:
* `REDISMODULE_CTX_FLAGS_RESP3` - indicate that the client is using resp3
Tests were added to check the new RedisModuleAPI
### Modules API Changes
* RM_ReplyWithCallReply might return REDISMODULE_ERR if the given CallReply is in resp3
but the client expects resp2. This is not a breaking change because in order to get a resp3
CallReply one needs to specifically specify `3` as a parameter to the fmt argument of
`RM_Call` (as mentioned above).
Tests were added to check this change
### More small Additions
* Added `debug set-disable-deny-scripts` that allows to turn on and off the commands no-script
flag protection. This is used by the Lua resp3 tests so it will be possible to run `debug protocol`
and check the resp3 parsing code.
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>
Add SINTERCARD and ZINTERCARD commands that are similar to
ZINTER and SINTER but only return the cardinality with minimum
processing and memory overheads.
Co-authored-by: Oran Agra <oran@redislabs.com>
Add two INFO metrics:
```
total_eviction_exceeded_time:69734
current_eviction_exceeded_time:10230
```
`current_eviction_exceeded_time` if greater than 0, means how much time current used memory is greater than `maxmemory`. And we are still over the maxmemory. If used memory is below `maxmemory`, this metric is reset to 0.
`total_eviction_exceeded_time` means total time used memory is greater than `maxmemory` since server startup.
The units of these two metrics are ms.
Co-authored-by: Oran Agra <oran@redislabs.com>
GETBIT, SETBIT may access wrong address because of wrap.
BITCOUNT and BITPOS may return wrapped results.
BITFIELD may access the wrong address but also allocate insufficient memory and segfault (see CVE-2021-32761).
This commit uses `uint64_t` or `long long` instead of `size_t`.
related https://github.com/redis/redis/pull/8096
At 32bit platform:
> setbit bit 4294967295 1
(integer) 0
> config set proto-max-bulk-len 536870913
OK
> append bit "\xFF"
(integer) 536870913
> getbit bit 4294967296
(integer) 0
When the bit index is larger than 4294967295, size_t can't hold bit index. In the past, `proto-max-bulk-len` is limit to 536870912, so there is no problem.
After this commit, bit position is stored in `uint64_t` or `long long`. So when `proto-max-bulk-len > 536870912`, 32bit platforms can still be correct.
For 64bit platform, this problem still exists. The major reason is bit pos 8 times of byte pos. When proto-max-bulk-len is very larger, bit pos may overflow.
But at 64bit platform, we don't have so long string. So this bug may never happen.
Additionally this commit add a test cost `512MB` memory which is tag as `large-memory`. Make freebsd ci and valgrind ci ignore this test.
- promote the code in DEBUG PROTOCOL to addReplyBigNum
- DEBUG PROTOCOL ATTRIB skips the attribute when client is RESP2
- networking.c addReply for push and attributes generate assertion when
called on a RESP2 client, anything else would produce a broken
protocol that clients can't handle.
There are two issues fixed in this commit:
1. we want to fail the EXEC command in case there is a watched key that's logically
expired but not yet deleted by active expire or lazy expire.
2. we saw that currently cache time is update in every `call()` (including nested calls),
this time is being also being use for the isKeyExpired comparison, we want to update
the cache time only in the first call (execCommand)
Co-authored-by: Oran Agra <oran@redislabs.com>
Modules that use background threads with thread safe contexts are likely
to use RM_BlockClient() without a timeout function, because they do not
set up a timeout.
Before this commit, `CLIENT UNBLOCK` would result with a crash as the
`NULL` timeout callback is called. Beyond just crashing, this is also
logically wrong as it may throw the module into an unexpected client
state.
This commits makes `CLIENT UNBLOCK` on such clients behave the same as
any other client that is not in a blocked state and therefore cannot be
unblocked.
In the past, the first bind address that was explicitly specified was
also used to bind outgoing connections. This could result with some
problems. For example: on some systems using `bind 127.0.0.1` would
result with outgoing connections also binding to `127.0.0.1` and failing
to connect to remote addresses.
With the recent change to the way `bind` is handled, this presented
other issues:
* The default first bind address is '*' which is not a valid address.
* We make no distinction between user-supplied config that is identical
to the default, and the default config.
This commit addresses both these issues by introducing an explicit
configuration parameter to control the bind address on outgoing
connections.
* Specifying an empty `bind ""` configuration prevents Redis from listening on any TCP port. Before this commit, such configuration was not accepted.
* Using `CONFIG GET bind` will always return an explicit configuration value. Before this commit, if a bind address was not specified the returned value was empty (which was an anomaly).
Another behavior change is that modifying the `bind` configuration to a non-default value will NO LONGER DISABLE protected-mode implicitly.
Gopher support was added mainly because it was simple (trivial to add).
But apparently even something that was trivial at the time, does cause complications
down the line when adding more features.
We recently ran into a few issues with io-threads conflicting with the gopher support.
We had to either complicate the code further in order to solve them, or drop gopher.
AFAIK it's completely unused, so we wanna chuck it, rather than keep supporting it.
Create new module type enhanced callbacks: mem_usage2, free_effort2, unlink2, copy2.
These will be given a context point from which the module can obtain the key name and database id.
In addition the digest and defrag context can now be used to obtain the key name and database id.
* Cleaning up the cluster interface by moving almost all related declarations into cluster.h
(no logic change -- just moving declarations/definitions around)
This initial effort leaves two items out of scope - the configuration parsing into the server
struct and the internals exposed by the clusterNode struct.
* Remove unneeded declarations of dictSds*
Ideally all the dictSds functionality would move from server.c into a dedicated module
so we can avoid the duplication in redis-benchmark/cli
* Move crc16 back into server.h, will be moved out once we create a seperate header file for
hashing functions
The initialize memory of `querybuf` is `PROTO_IOBUF_LEN(1024*16) * 2` (due to sdsMakeRoomFor being greedy), under `jemalloc`, the allocated memory will be 40k.
This will most likely result in the `querybuf` being resized when call `clientsCronResizeQueryBuffer` unless the client requests it fast enough.
Note that this bug existed even before #7875, since the condition for resizing includes the sds headers (32k+6).
## Changes
1. Use non-greedy sdsMakeRoomFor when allocating the initial query buffer (of 16k).
1. Also use non-greedy allocation when working with BIG_ARG (we won't use that extra space anyway)
2. in case we did use a greedy allocation, read as much as we can into the buffer we got (including internal frag), to reduce system calls.
3. introduce a dedicated constant for the shrinking (same value as before)
3. Add test for querybuf.
4. improve a maxmemory test by ignoring the effect of replica query buffers (can accumulate many ACKs on slow env)
5. improve a maxmemory by disabling slowlog (it will cause slight memory growth on slow env).
Today when we load the AOF on startup, the loadAppendOnlyFile checks if
the file is openning for reading.
This check is redundent (dead code) as we open the AOF file for writing at initServer,
and the file will always be existing for the loadAppendOnlyFile.
In this commit:
- remove all the exit(1) from loadAppendOnlyFile, as it is the caller
responsibility to decide what to do in case of failure.
- move the opening of the AOF file for writing, to be after we loading it.
- avoid return -ERR in DEBUG LOADAOF, when the AOF is existing but empty
This PR adds a spell checker CI action that will fail future PRs if they introduce typos and spelling mistakes.
This spell checker is based on blacklist of common spelling mistakes, so it will not catch everything,
but at least it is also unlikely to cause false positives.
Besides that, the PR also fixes many spelling mistakes and types, not all are a result of the spell checker we use.
Here's a summary of other changes:
1. Scanned the entire source code and fixes all sorts of typos and spelling mistakes (including missing or extra spaces).
2. Outdated function / variable / argument names in comments
3. Fix outdated keyspace masks error log when we check `config.notify-keyspace-events` in loadServerConfigFromString.
4. Trim the white space at the end of line in `module.c`. Check: https://github.com/redis/redis/pull/7751
5. Some outdated https link URLs.
6. Fix some outdated comment. Such as:
- In README: about the rdb, we used to said create a `thread`, change to `process`
- dbRandomKey function coment (about the dictGetRandomKey, change to dictGetFairRandomKey)
- notifyKeyspaceEvent fucntion comment (add type arg)
- Some others minor fix in comment (Most of them are incorrectly quoted by variable names)
7. Modified the error log so that users can easily distinguish between TCP and TLS in `changeBindAddr`
When we allocate a client struct with 16k reply buffer, the allocator we may give us 20K,
This commit makes use of that extra space.
Additionally, it tries to store whatever it can from the reply into the static 'buf' before
allocating a new node for the reply list.
Make aof rewrite buffer memory size more accurate, before, there may be 20%
deviation with its real memory usage.
The implication are both lower memory usage, and also a more accurate INFO.
Till now, on replica full-sync we used to transfer absolute time for TTL,
however when a command arrived (EXPIRE or EXPIREAT),
we used to propagate it as is to replicas (possibly with relative time),
but always translate it to EXPIREAT (absolute time) to AOF.
This commit changes that and will always use absolute time for propagation.
see discussion in #8433
Furthermore, we Introduce new commands: `EXPIRETIME/PEXPIRETIME`
that allow extracting the absolute TTL time from a key.
When client breached the output buffer soft limit but then went idle,
we didn't disconnect on soft limit timeout, now we do.
Note this also resolves some sporadic test failures in due to Linux
buffering data which caused tests to fail if during the test we went
back under the soft COB limit.
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: sundb <sundbcn@gmail.com>
Adding a new type mask for key space notification, REDISMODULE_NOTIFY_MODULE, to enable unique notifications from commands on REDISMODULE_KEYTYPE_MODULE type keys (which is currently unsupported).
Modules can subscribe to a module key keyspace notification by RM_SubscribeToKeyspaceEvents,
and clients by notify-keyspace-events of redis.conf or via the CONFIG SET, with the characters 'd' or 'A'
(REDISMODULE_NOTIFY_MODULE type mask is part of the '**A**ll' notation for key space notifications).
Refactor: move some pubsub test infra from pubsub.tcl to util.tcl to be re-used by other tests.
Starting redis 6.0 (part of the TLS feature), diskless master uses pipe from the fork
child so that the parent is the one sending data to the replicas.
This mechanism has an issue in which a hung replica will cause the master to wait
for it to read the data sent to it forever, thus preventing the fork child from terminating
and preventing the creations of any other forks.
This PR adds a timeout mechanism, much like the ACK-based timeout,
we disconnect replicas that aren't reading the RDB file fast enough.
The bio aof fsync fd may be closed by main thread (AOFRW done handler)
and even possibly reused for another socket, pipe, or file.
This can can an EBADF or EINVAL fsync error, which will lead to -MISCONF errors failing all writes.
We just ignore these errno because aof fsync did not really fail.
We handle errno when fsyncing aof in bio, so we could know the real reason
when users get -MISCONF Errors writing to the AOF file error
Issue created with #8419
Background:
Redis 6.2 added ACL control for pubsub channels (#7993), which were supposed
to be permissive by default to retain compatibility with redis 6.0 ACL.
But due to a bug, only newly created users got this `acl-pubsub-default` applied,
while overwritten (updated) users got reset to `resetchannels` (denied).
Since the "default" user exists before loading the config file,
any ACL change to it, results in an update / overwrite.
So when a "default" user is loaded from config file or include ACL
file with no channels related rules, the user will not have any
permissions to any channels. But other users will have default
permissions to any channels.
When upgraded from 6.0 with config rewrite, this will lead to
"default" user channels permissions lost.
When users are loaded from include file, then call "acl load", users
will also lost channels permissions.
Similarly, the `reset` ACL rule, would have reset the user to be denied
access to any channels, ignoring `acl-pubsub-default` and breaking
compatibility with redis 6.0.
The implication of this fix is that it regains compatibility with redis 6.0,
but breaks compatibility with redis 6.2.0 and 2.0.1. e.g. after the upgrade,
the default user will regain access to pubsub channels.
Other changes:
Additionally this commit rename server.acl_pubusub_default to
server.acl_pubsub_default and fix typo in acl tests.
Previously (and by default after commit) when master loose its last slot
(due to migration, for example), its replicas will migrate to new last slot
holder.
There are cases where this is not desired:
* Consolidation that results with removed nodes (including the replica, eventually).
* Manually configured cluster topologies, which the admin wishes to preserve.
Needlessly migrating a replica triggers a full synchronization and can have a negative impact, so
we prefer to be able to avoid it where possible.
This commit adds 'cluster-allow-replica-migration' configuration option that is
enabled by default to preserve existed behavior. When disabled, replicas will
not be auto-migrated.
Fixes#4896
Co-authored-by: Oran Agra <oran@redislabs.com>
In `aof.c`, we call fsync when stop aof, and now print a log to let user know that if fail.
In `cluster.c`, we now return error, the calling function already handles these write errors.
In `redis-cli.c`, users hope to save rdb, we now print a message if fsync failed.
In `rio.c`, we now treat fsync errors like we do for write errors.
In `server.c`, we try to fsync aof file when shutdown redis, we only can print one log if fail.
In `bio.c`, if failing to fsync aof file, we will set `aof_bio_fsync_status` to error , and reject writing just like last writing aof error, moreover also set INFO command field `aof_last_write_status` to error.
The implications of this change is just that in the past when a config file was missing,
in some cases it was exiting before printing the sever startup prints and sometimes after,
and now it'll always exit before printing them.
The 'sentinel replicas <master>' command will ignore replicas with
`replica-announced` set to no.
The goal of disabling the config setting replica-announced is to allow ghost
replicas. The replica is in the cluster, synchronize with its master, can be
promoted to master and is not exposed to sentinel clients. This way, it is
acting as a live backup or living ghost.
In addition, to prevent the replica to be promoted as master, set
replica-priority to 0.
The cluster bus is established over TLS or non-TLS depending on the configuration tls-cluster. The client ports distributed in the cluster and sent to clients are assumed to be TLS or non-TLS also depending on tls-cluster.
The cluster bus is now extended to also contain the non-TLS port of clients in a TLS cluster, when available. The non-TLS port of a cluster node, when available, is sent to clients connected without TLS in responses to CLUSTER SLOTS, CLUSTER NODES, CLUSTER SLAVES and MOVED and ASK redirects, instead of the TLS port.
The user was able to override the client port by defining cluster-announce-port. Now cluster-announce-tls-port is added, so the user can define an alternative announce port for both TLS and non-TLS clients.
Fixes#8134
Add publish channel permissions check in processCommand.
processCommand didn't check publish channel permissions, so we can
queue a publish command in a transaction. But when exec the transaction,
it will fail with -NOPERM.
We also union keys/commands/channels permissions check togegher in
ACLCheckAllPerm. Remove pubsubCheckACLPermissionsOrReply in
publishCommand/subscribeCommand/psubscribeCommand. Always
check permissions in processCommand/execCommand/
luaRedisGenericCommand.
* SLOWLOG didn't record anything for blocked commands because the client
was reset and argv was already empty. there was a fix for this issue
specifically for modules, now it works for all blocked clients.
* The original command argv (before being re-written) was also reset
before adding the slowlog on behalf of the blocked command.
* Latency monitor is now updated regardless of the slowlog flags of the
command or its execution (their purpose is to hide sensitive info from
the slowlog, not hide the fact the latency happened).
* Latency monitor now uses real_cmd rather than c->cmd (which may be
different if the command got re-written, e.g. GEOADD)
Changes:
* Unify shared code between slowlog insertion in call() and
updateStatsOnUnblock(), hopefully prevent future bugs from happening
due to the later being overlooked.
* Reset CLIENT_PREVENT_LOGGING in resetClient rather than after command
processing.
* Add a test for SLOWLOG and BLPOP
Notes:
- real_cmd == c->lastcmd, except inside MULTI and Lua.
- blocked commands never happen in these cases (MULTI / Lua)
- real_cmd == c->cmd, except for when the command is rewritten (e.g.
GEOADD)
- blocked commands (currently) are never rewritten
- other than the command's CLIENT_PREVENT_LOGGING, and the
execution flag CLIENT_PREVENT_LOGGING, other cases that we want to
avoid slowlog are on AOF loading (specifically CMD_CALL_SLOWLOG will
be off when executed from execCommand that runs from an AOF)
Reading CoW from /proc/<pid>/smaps can be slow with large processes on
some platforms.
This measures the time it takes to read CoW info and limits the duty
cycle of future updates to roughly 1/100.
As current_cow_size no longer represnets a current, fixed interval value
there is also a new current_cow_size_age field that provides information
about the age of the size value, in seconds.
Bug 1:
When a module ctx is freed moduleHandlePropagationAfterCommandCallback
is called and handles propagation. We want to prevent it from propagating
commands that were not replicated by the same context. Example:
1. module1.foo does: RM_Replicate(cmd1); RM_Call(cmd2); RM_Replicate(cmd3)
2. RM_Replicate(cmd1) propagates MULTI and adds cmd1 to also_propagagte
3. RM_Call(cmd2) create a new ctx, calls call() and destroys the ctx.
4. moduleHandlePropagationAfterCommandCallback is called, calling
alsoPropagates EXEC (Note: EXEC is still not written to socket),
setting server.in_trnsaction = 0
5. RM_Replicate(cmd3) is called, propagagting yet another MULTI (now
we have nested MULTI calls, which is no good) and then cmd3
We must prevent RM_Call(cmd2) from resetting server.in_transaction.
REDISMODULE_CTX_MULTI_EMITTED was revived for that purpose.
Bug 2:
Fix issues with nested RM_Call where some have '!' and some don't.
Example:
1. module1.foo does RM_Call of module2.bar without replication (i.e. no '!')
2. module2.bar internally calls RM_Call of INCR with '!'
3. at the end of module1.foo we call RM_ReplicateVerbatim
We want the replica/AOF to see only module1.foo and not the INCR from module2.bar
Introduced a global replication_allowed flag inside RM_Call to determine
whether we need to replicate or not (even if '!' was specified)
Other changes:
Split beforePropagateMultiOrExec to beforePropagateMulti afterPropagateExec
just for better readability