jemalloc 5 doesn't immediately release memory back to the OS, instead there's a decaying
mechanism, which doesn't work when there's no traffic (no allocations).
this is most evident if there's no traffic after flushdb, the RSS will remain high.
1) enable jemalloc background purging
2) explicitly purge in flushdb
Now threads are stopped even when the connections drop immediately to
zero, not allowing the networking code to detect the condition and stop
the threads. serverCron() will handle that.
This is just an experiment for now, there are a couple of race
conditions, mostly harmless for the performance gain experiment that
this commit represents so far.
The general idea here is to take Redis single threaded and instead
fan-out on expansive kernel calls: write(2) in this case, but the same
concept could be easily implemented for read(2) and protcol parsing.
However just threading writes like in this commit, is enough to evaluate
if the approach is sounding.
Fixes#6012.
As long as "INFO is broken", this should be adequate IMO. Once we rework
`INFO`, perhaps into RESP3, this implementation should be revisited.
when redis appends the blocked client reply list to the real client, it didn't
bother to check if it is in fact the master client. so a slave executing that
module command will send replies to the master, causing the master to send the
slave error responses, which will mess up the replication offset
(slave will advance it's replication offset, and the master does not)
Adding another new filed categories at the end of
command reply, it's easy to read and distinguish
flags and categories, also compatible with old format.
In some cases processMultibulkBuffer uses sdsMakeRoomFor to
expand the querybuf, but later in some cases it uses that query
buffer as is for an argv element (see "Optimization"), which means
that the sds in argv may have a lot of wasted space, and then in case
modules keep that argv RedisString inside their data structure, this
space waste will remain for long (until restarted from rdb).
In mostly production environment, normal user's behavior should be
limited.
Now in redis ACL mechanism we can do it like that:
user default on +@all ~* -@dangerous nopass
user admin on +@all ~* >someSeriousPassword
Then the default normal user can not execute dangerous commands like
FLUSHALL/KEYS.
But some admin commands are in dangerous category too like PSYNC,
and the configurations above will forbid replica from sync with master.
Finally I think we could add a new configuration for replication,
it is masteruser option, like this:
masteruser admin
masterauth someSeriousPassword
Then replica will try AUTH admin someSeriousPassword and get privilege
to execute PSYNC. If masteruser is NULL, replica would AUTH with only
masterauth like before.
This is needed in order to model the current behavior of authenticating
the connection directly when no password is set. Now with ACLs this will
be obtained by setting the default user as "nopass" user. Moreover this
flag can be used in order to create other users that do not require any
password but will work with "AUTH username <any-password>".
Thanks to @soloestoy for discovering this issue in #5667.
This is an alternative fix in order to avoid both cycling the clients
and also disconnecting clients just having valid read-only transactions
pending.
these metrics become negative when RSS is smaller than the used_memory.
This can easily happen when the program allocated a lot of memory and haven't
written to it yet, in which case the kernel doesn't allocate any pages to the process
Note: this breaks backward compatibility with Redis 4, since now slaves
by default are exact copies of masters and do not try to evict keys
independently.
This is an optimization for processing pipeline, we discussed a
problem in issue #5229: clients may be paused if we apply `CLIENT
PAUSE` command, and then querybuf may grow too large, the cost of
memmove in sdsrange after parsing a completed command will be
horrible. The optimization is that parsing all commands in queyrbuf
, after that we can just call sdsrange only once.
Reading the PR gave me the opportunity to better specify what the code
was doing in places where I was not immediately sure about what was
going on. Moreover I documented the structure in server.h so that people
reading the header file will immediately understand what the structure
is useful for.
A) slave buffers didn't count internal fragmentation and sds unused space,
this caused them to induce eviction although we didn't mean for it.
B) slave buffers were consuming about twice the memory of what they actually needed.
- this was mainly due to sdsMakeRoomFor growing to twice as much as needed each time
but networking.c not storing more than 16k (partially fixed recently in 237a38737).
- besides it wasn't able to store half of the new string into one buffer and the
other half into the next (so the above mentioned fix helped mainly for small items).
- lastly, the sds buffers had up to 30% internal fragmentation that was wasted,
consumed but not used.
C) inefficient performance due to starting from a small string and reallocing many times.
what i changed:
- creating dedicated buffers for reply list, counting their size with zmalloc_size
- when creating a new reply node from, preallocate it to at least 16k.
- when appending a new reply to the buffer, first fill all the unused space of the
previous node before starting a new one.
other changes:
- expose mem_not_counted_for_evict info field for the benefit of the test suite
- add a test to make sure slave buffers are counted correctly and that they don't cause eviction
With such information will be able to use a private localtime()
implementation serverLog(), which does not use any locking and is both
thread and fork() safe.
RESTORE now supports:
1. Setting LRU/LFU
2. Absolute-time TTL
Other related changes:
1. RDB loading will not override LRU bits when RDB file
does not contain the LRU opcode.
2. RDB loading will not set LRU/LFU bits if the server's
maxmemory-policy does not match.
this reduces the extra 8 bytes we save before each pointer.
but more importantly maybe, it makes the valgrind runs to be more similiar
to our normal runs.
note: the change in malloc_stats struct in server.h is to eliminate an name conflict.
structs that are not typedefed are resolved from a separate name space.
Usually blocking operations make a lot of sense with multiple keys so
that we can listen to multiple queues (or whatever the app models) with
a single connection. However in the synchronous case it is more useful
to be able to ask for N elements. This is a change that I also wanted to
perform soon or later in the blocking list variant, but here it is more
natural since there is no reply type difference.
Implementation notes: as INFO is "already broken", I didn't want to break it further. Instead of computing the server.lua_script dict size on every call, I'm keeping a running sum of the body's length and dict overheads.
This implementation is naive as it **does not** take into consideration dict rehashing, but that inaccuracy pays off in speed ;)
Demo time:
```bash
$ redis-cli info memory | grep "script"
used_memory_scripts:96
used_memory_scripts_human:96B
number_of_cached_scripts:0
$ redis-cli eval "" 0 ; redis-cli info memory | grep "script"
(nil)
used_memory_scripts:120
used_memory_scripts_human:120B
number_of_cached_scripts:1
$ redis-cli script flush ; redis-cli info memory | grep "script"
OK
used_memory_scripts:96
used_memory_scripts_human:96B
number_of_cached_scripts:0
$ redis-cli eval "return('Hello, Script Cache :)')" 0 ; redis-cli info memory | grep "script"
"Hello, Script Cache :)"
used_memory_scripts:152
used_memory_scripts_human:152B
number_of_cached_scripts:1
$ redis-cli eval "return redis.sha1hex(\"return('Hello, Script Cache :)')\")" 0 ; redis-cli info memory | grep "script"
"1be72729d43da5114929c1260a749073732dc822"
used_memory_scripts:232
used_memory_scripts_human:232B
number_of_cached_scripts:2
✔ 19:03:54 redis [lua_scripts-in-info-memory L ✚…⚑] $ redis-cli evalsha 1be72729d43da5114929c1260a749073732dc822 0
"Hello, Script Cache :)"
```
There are too many advantages in doing this, RDB is faster to persist,
more compact, much faster to load back. The main issues here are that
the code is less tested because this was not the old default (so we are
enabling it for the new 5.0 release), and that the AOF is no longer a
trivially parsable format from now on. However the non-preamble mode
will be supported in the future as well, if new data types will be
added.