Thanks to @soloestoy for discovering this issue in #5667.
This is an alternative fix in order to avoid both cycling the clients
and also disconnecting clients just having valid read-only transactions
pending.
these metrics become negative when RSS is smaller than the used_memory.
This can easily happen when the program allocated a lot of memory and haven't
written to it yet, in which case the kernel doesn't allocate any pages to the process
Note: this breaks backward compatibility with Redis 4, since now slaves
by default are exact copies of masters and do not try to evict keys
independently.
This is an optimization for processing pipeline, we discussed a
problem in issue #5229: clients may be paused if we apply `CLIENT
PAUSE` command, and then querybuf may grow too large, the cost of
memmove in sdsrange after parsing a completed command will be
horrible. The optimization is that parsing all commands in queyrbuf
, after that we can just call sdsrange only once.
Reading the PR gave me the opportunity to better specify what the code
was doing in places where I was not immediately sure about what was
going on. Moreover I documented the structure in server.h so that people
reading the header file will immediately understand what the structure
is useful for.
A) slave buffers didn't count internal fragmentation and sds unused space,
this caused them to induce eviction although we didn't mean for it.
B) slave buffers were consuming about twice the memory of what they actually needed.
- this was mainly due to sdsMakeRoomFor growing to twice as much as needed each time
but networking.c not storing more than 16k (partially fixed recently in 237a38737).
- besides it wasn't able to store half of the new string into one buffer and the
other half into the next (so the above mentioned fix helped mainly for small items).
- lastly, the sds buffers had up to 30% internal fragmentation that was wasted,
consumed but not used.
C) inefficient performance due to starting from a small string and reallocing many times.
what i changed:
- creating dedicated buffers for reply list, counting their size with zmalloc_size
- when creating a new reply node from, preallocate it to at least 16k.
- when appending a new reply to the buffer, first fill all the unused space of the
previous node before starting a new one.
other changes:
- expose mem_not_counted_for_evict info field for the benefit of the test suite
- add a test to make sure slave buffers are counted correctly and that they don't cause eviction
With such information will be able to use a private localtime()
implementation serverLog(), which does not use any locking and is both
thread and fork() safe.
RESTORE now supports:
1. Setting LRU/LFU
2. Absolute-time TTL
Other related changes:
1. RDB loading will not override LRU bits when RDB file
does not contain the LRU opcode.
2. RDB loading will not set LRU/LFU bits if the server's
maxmemory-policy does not match.
this reduces the extra 8 bytes we save before each pointer.
but more importantly maybe, it makes the valgrind runs to be more similiar
to our normal runs.
note: the change in malloc_stats struct in server.h is to eliminate an name conflict.
structs that are not typedefed are resolved from a separate name space.
Usually blocking operations make a lot of sense with multiple keys so
that we can listen to multiple queues (or whatever the app models) with
a single connection. However in the synchronous case it is more useful
to be able to ask for N elements. This is a change that I also wanted to
perform soon or later in the blocking list variant, but here it is more
natural since there is no reply type difference.
Implementation notes: as INFO is "already broken", I didn't want to break it further. Instead of computing the server.lua_script dict size on every call, I'm keeping a running sum of the body's length and dict overheads.
This implementation is naive as it **does not** take into consideration dict rehashing, but that inaccuracy pays off in speed ;)
Demo time:
```bash
$ redis-cli info memory | grep "script"
used_memory_scripts:96
used_memory_scripts_human:96B
number_of_cached_scripts:0
$ redis-cli eval "" 0 ; redis-cli info memory | grep "script"
(nil)
used_memory_scripts:120
used_memory_scripts_human:120B
number_of_cached_scripts:1
$ redis-cli script flush ; redis-cli info memory | grep "script"
OK
used_memory_scripts:96
used_memory_scripts_human:96B
number_of_cached_scripts:0
$ redis-cli eval "return('Hello, Script Cache :)')" 0 ; redis-cli info memory | grep "script"
"Hello, Script Cache :)"
used_memory_scripts:152
used_memory_scripts_human:152B
number_of_cached_scripts:1
$ redis-cli eval "return redis.sha1hex(\"return('Hello, Script Cache :)')\")" 0 ; redis-cli info memory | grep "script"
"1be72729d43da5114929c1260a749073732dc822"
used_memory_scripts:232
used_memory_scripts_human:232B
number_of_cached_scripts:2
✔ 19:03:54 redis [lua_scripts-in-info-memory L ✚…⚑] $ redis-cli evalsha 1be72729d43da5114929c1260a749073732dc822 0
"Hello, Script Cache :)"
```
There are too many advantages in doing this, RDB is faster to persist,
more compact, much faster to load back. The main issues here are that
the code is less tested because this was not the old default (so we are
enabling it for the new 5.0 release), and that the AOF is no longer a
trivially parsable format from now on. However the non-preamble mode
will be supported in the future as well, if new data types will be
added.
This commit, in some parts derived from PR #3041 which is no longer
possible to merge (because the user deleted the original branch),
implements the ability of slaves to have a special configuration
preventing that they try to start a failover when the master is failing.
There are multiple reasons for wanting this, and the feautre was
requested in issue #3021 time ago.
The differences between this patch and the original PR are the
following:
1. The flag is saved/loaded on the nodes configuration.
2. The 'myself' node is now flag-aware, the flag is updated as needed
when the configuration is changed via CONFIG SET.
3. The flag name uses NOFAILOVER instead of NO_FAILOVER to be consistent
with existing NOADDR.
4. The redis.conf documentation was rewritten.
Thanks to @deep011 for the original patch.
other fixes / improvements:
- LUA script memory isn't taken from zmalloc (taken from libc malloc)
so it can cause high fragmentation ratio to be displayed (which is false)
- there was a problem with "fragmentation" info being calculated from
RSS and used_memory sampled at different times (now sampling them together)
other details:
- adding a few more allocator info fields to INFO and MEMORY commands
- improve defrag test to measure defrag latency of big keys
- increasing the accuracy of the defrag test (by looking at real grag info)
this way we can use an even lower threshold and still avoid false positives
- keep the old (total) "fragmentation" field unchanged, but add new ones for spcific things
- add these the MEMORY DOCTOR command
- deduct LUA memory from the rss in case of non jemalloc allocator (one for which we don't "allocator active/used")
- reduce sampling rate of the rss and allocator info
- big keys are not defragged in one go from within the dict scan
instead they are scanned in parts after the main dict hash bucket is done.
- add latency monitor sample for defrag
- change default active-defrag-cycle-min to induce lower latency
- make active defrag start a new scan right away if needed, so it's easier
(for the test suite) to detect when it's done
- make active defrag quick the current cycle after each db / big key
- defrag some non key long term global allocations
- some refactoring for smaller functions and more reusable code
- during dict rehashing, one scan iteration of the dict, can end up scanning
one bucket in the smaller dict and many many buckets in the larger dict.
so waiting for 16 scan iterations before checking the time, may be much too long.
This commit adds two new fields in the INFO output, stats section:
expired_stale_perc:0.34
expired_time_cap_reached_count:58
The first field is an estimate of the number of keys that are yet in
memory but are already logically expired. They reason why those keys are
yet not reclaimed is because the active expire cycle can't spend more
time on the process of reclaiming the keys, and at the same time nobody
is accessing such keys. However as the active expire cycle runs, while
it will eventually have to return to the caller, because of time limit
or because there are less than 25% of keys logically expired in each
given database, it collects the stats in order to populate this INFO
field.
Note that expired_stale_perc is a running average, where the current
sample accounts for 5% and the history for 95%, so you'll see it
changing smoothly over time.
The other field, expired_time_cap_reached_count, counts the number
of times the expire cycle had to stop, even if still it was finding a
sizeable number of keys yet to expire, because of the time limit.
This allows people handling operations to understand if the Redis
server, during mass-expiration events, is able to collect keys fast
enough usually. It is normal for this field to increment during mass
expires, but normally it should very rarely increment. When instead it
constantly increments, it means that the current workloads is using
a very important percentage of CPU time to expire keys.
This feature was created thanks to the hints of Rashmi Ramesh and
Bart Robinson from Twitter. In private email exchanges, they noted how
it was important to improve the observability of this parameter in the
Redis server. Actually in big deployments, the amount of keys that are
yet to expire in each server, even if they are logically expired, may
account for a very big amount of wasted memory.
We have this operation in two places: when caching the master and
when linking a new client after the client creation. By having an API
for this we avoid incurring in errors when modifying one of the two
places forgetting the other. The function is also a good place where to
document why we cache the linked list node.
Related to #4497 and #4210.
The function in its initial form, and after the fixes for the PSYNC2
bugs, required code duplication in multiple spots. This commit modifies
it in order to always compute the script name independently, and to
return the SDS of the SHA of the body: this way it can be used in all
the places, including for SCRIPT LOAD, without duplicating the code to
create the Lua function name. Note that this requires to re-compute the
body SHA1 in the case of EVAL seeing a script for the first time, but
this should not change scripting performance in any way because new
scripts definition is a rare event happening the first time a script is
seen, and the SHA1 computation is anyway not a very slow process against
the typical Redis script and compared to the actua Lua byte compiling of
the body.
Note that the function used to assert() if a duplicated script was
loaded, however actually now two times over three, we want the function
to handle duplicated scripts just fine: this happens in SCRIPT LOAD and
in RDB AUX "lua" loading. Moreover the assert was not defending against
some obvious failure mode, so now the function always tests against
already defined functions at start.
In the case of slaves loading the RDB from master, or in other similar
cases, the script is already defined, and the function registering the
script should not fail in the assert() call.
XADD was suboptimal in the first incarnation of the command, not being
able to accept an ID (very useufl for replication), nor options for
having capped streams.
The keyspace notification for streams was not implemented.
With lists we need to signal only on key creation, but streams can
provide data to clients listening at every new item added.
To make this slightly more efficient we now track different classes of
blocked clients to avoid signaling keys when there is nobody listening.
A typical case is when the stream is used as a time series DB and
accessed only by range with XRANGE.
This is currently needed in order to fix#4483, but this can be
useful in other contexts, so maybe later we may want to remove the
conditionals and always save/load scripts.
Note that we are using the "lua" AUX field here, in order to guarantee
backward compatibility of the RDB file. The unknown AUX fields must be
discarded by past versions of Redis.