A) slave buffers didn't count internal fragmentation and sds unused space,
this caused them to induce eviction although we didn't mean for it.
B) slave buffers were consuming about twice the memory of what they actually needed.
- this was mainly due to sdsMakeRoomFor growing to twice as much as needed each time
but networking.c not storing more than 16k (partially fixed recently in 237a38737).
- besides it wasn't able to store half of the new string into one buffer and the
other half into the next (so the above mentioned fix helped mainly for small items).
- lastly, the sds buffers had up to 30% internal fragmentation that was wasted,
consumed but not used.
C) inefficient performance due to starting from a small string and reallocing many times.
what i changed:
- creating dedicated buffers for reply list, counting their size with zmalloc_size
- when creating a new reply node from, preallocate it to at least 16k.
- when appending a new reply to the buffer, first fill all the unused space of the
previous node before starting a new one.
other changes:
- expose mem_not_counted_for_evict info field for the benefit of the test suite
- add a test to make sure slave buffers are counted correctly and that they don't cause eviction
Basically we cannot be sure that if the key is expired while writing the
AOF, the main thread will surely find the key expired. There are
possible race conditions like the moment at which the "now" is sampled,
and the fact that time may jump backward.
Think about the following:
SET a 5
EXPIRE a 1
AOF rewrite starts after about 1 second. The child process finds the key
expired, while in the main thread instead an INCR command is called
against the key "a" immediately after a fork, and the scheduler was
faster to give execution time to the main thread, so "a" is yet not
expired.
The main thread will generate an INCR a command to the AOF log that will
be appended to the rewritten AOF file, but that INCR command will target
a non existin "a" key, so a new non volatile key "a" will be created.
Two observations:
A) In theory by computing "now" before the fork, we should be sure that
if a key is expired at that time, it will be expired later when the
main thread will try to access to such key. However this does not take
into account the fact that the computer time may jump backward.
B) Technically we may still make the process safe by using a monotonic
time source.
However there were other similar related bugs, and in general the new
"vision" is that Redis persistence files should represent the memory
state without trying to be too smart: this makes the design more
consistent, bugs less likely to arise from complex interactions, and in
the end what is to fix is the Redis expire process to have less expired
keys in RAM.
Thanks to Oran Agra and Guy Benoish for writing me an email outlining
this problem, after they conducted a Redis 5 code review.
The AOF tail of a combined RDB+AOF is based on the premise of applying
the AOF commands to the exact state that there was in the server while
the RDB was persisted. By expiring keys while loading the RDB file, we
change the state, so applying the AOF tail later may change the state.
Test case:
* Time1: SET a 10
* Time2: EXPIREAT a $time5
* Time3: INCR a
* Time4: PERSIT A. Start bgrewiteaof with RDB preamble. The value of a is 11 without expire time.
* Time5: Restart redis from the RDB+AOF: consistency violation.
Thanks to @soloestoy for providing the patch.
Thanks to @trevor211 for the original issue report and the initial fix.
Check issue #4950 for more info.
It is possible to do BGREWRITEAOF even if appendonly=no. This is by design.
stopAppendonly() didn't turn off aof_rewrite_scheduled (it can be turned on
again by BGREWRITEAOF even while appendonly is off anyway).
After configuring `appendonly yes` it will see that the state is AOF_OFF,
there's no RDB fork, so it will do rewriteAppendOnlyFileBackground() which
will fail since the aof_child_pid is set (was scheduled and started by cron).
Solution:
stopAppendonly() will turn off the schedule flag (regardless of who asked for it).
startAppendonly() will terminate any existing fork and start a new one (so it is the most recent).
The gist of the changes is that now, partial resynchronizations between
slaves and masters (without the need of a full resync with RDB transfer
and so forth), work in a number of cases when it was impossible
in the past. For instance:
1. When a slave is promoted to mastrer, the slaves of the old master can
partially resynchronize with the new master.
2. Chained slalves (slaves of slaves) can be moved to replicate to other
slaves or the master itsef, without requiring a full resync.
3. The master itself, after being turned into a slave, is able to
partially resynchronize with the new master, when it joins replication
again.
In order to obtain this, the following main changes were operated:
* Slaves also take a replication backlog, not just masters.
* Same stream replication for all the slaves and sub slaves. The
replication stream is identical from the top level master to its slaves
and is also the same from the slaves to their sub-slaves and so forth.
This means that if a slave is later promoted to master, it has the
same replication backlong, and can partially resynchronize with its
slaves (that were previously slaves of the old master).
* A given replication history is no longer identified by the `runid` of
a Redis node. There is instead a `replication ID` which changes every
time the instance has a new history no longer coherent with the past
one. So, for example, slaves publish the same replication history of
their master, however when they are turned into masters, they publish
a new replication ID, but still remember the old ID, so that they are
able to partially resynchronize with slaves of the old master (up to a
given offset).
* The replication protocol was slightly modified so that a new extended
+CONTINUE reply from the master is able to inform the slave of a
replication ID change.
* REPLCONF CAPA is used in order to notify masters that a slave is able
to understand the new +CONTINUE reply.
* The RDB file was extended with an auxiliary field that is able to
select a given DB after loading in the slave, so that the slave can
continue receiving the replication stream from the point it was
disconnected without requiring the master to insert "SELECT" statements.
This is useful in order to guarantee the "same stream" property, because
the slave must be able to accumulate an identical backlog.
* Slave pings to sub-slaves are now sent in a special form, when the
top-level master is disconnected, in order to don't interfer with the
replication stream. We just use out of band "\n" bytes as in other parts
of the Redis protocol.
An old design document is available here:
https://gist.github.com/antirez/ae068f95c0d084891305
However the implementation is not identical to the description because
during the work to implement it, different changes were needed in order
to make things working well.
It was noted by @dvirsky that it is not possible to use string functions
when writing the AOF file. This sometimes is critical since the command
rewriting may need to be built in the context of the AOF callback, and
without access to the context, and the limited types that the AOF
production functions will accept, this can be an issue.
Moreover there are other needs that we can't anticipate regarding the
ability to use Redis Modules APIs using the context in order to build
representations to emit AOF / RDB.
Because of this a new API was added that allows the user to get a
temporary context from the IO context. The context is auto released
if obtained when the RDB / AOF callback returns.
Calling multiple time the function to get the context, always returns
the same one, since it is invalid to have more than a single context.
This patch, written in collaboration with Oran Agra (@oranagra) is a companion
to 780a8b1. Together the two patches should avoid that the AOF and RDB saving
processes can be spawned at the same time. Previously conditions that
could lead to two saving processes at the same time were:
1. When AOF is enabled via CONFIG SET and an RDB saving process is
already active.
2. When the SYNC command decides to start an RDB saving process ASAP in
order to serve a new slave that cannot partially resynchronize (but
only if we have a disk target for replication, for diskless
replication there is not such a problem).
Condition "1" is not very severe but "2" can happen often and is
definitely good at degrading Redis performances in an unexpected way.
The two commits have the effect of always spawning RDB savings for
replication in replicationCron() instead of attempting to start an RDB
save synchronously. Moreover when a BGSAVE or AOF rewrite must be
performed, they are instead just postponed using flags that will try to
perform such operations ASAP.
Finally the BGSAVE command was modified in order to accept a SCHEDULE
option so that if an AOF rewrite is in progress, when this option is
given, the command no longer returns an error, but instead schedules an
RDB rewrite operation for when it will be possible to start it.
The cleanup code expects that if 'di' is not NULL, it is a valid
iterator that should be freed.
The result of this bug was a crash of the AOF rewriting process if an
error occurred after the DBs data are written and the iterator is no
longer valid.
This replaces individual ziplist vs. linkedlist representations
for Redis list operations.
Big thanks for all the reviews and feedback from everybody in
https://github.com/antirez/redis/pull/2143
- Remove trailing newlines from redis.conf
- Fix comment misspelling
- Clarifies zipEncodeLength usage and a C API mention (#1243, #1242)
- Fix cluster typos (inspired by @papanikge #1507)
- Fix rewite -> rewrite in a few places (inspired by #682)
Closes#1243, #1242, #1507
It is not clear if files open in append only mode will automatically fix
their offset after a truncate(2) operation. This commit makes sure that
we reposition the AOF file descriptor offset at the end of the file
after a truncated AOF is loaded and trimmed to the last valid command.
Recently we introduced the ability to load truncated AOFs, but
unfortuantely the support was broken since the server, after loading the
truncated AOF, continues appending to the file that is corrupted at the
end. The problem is fixed only in the next AOF rewrite.
This commit fixes the issue by truncating the AOF to the last valid
opcode, and aborting if it is not possible to truncate the file
correctly.
Because of the new ability to start with a truncated AOF, we need
to correctly release all the memory on EOF error. Otherwise there is a
small leak, that is not really a problem, but causes a false positive in
the tests that detect memory leaks.
We now wait up to 1 second for diff data to come from the parent,
however we use poll(2) to wait for more data, and use a counter of
contiguous failures to get data for N times (set to 20 experimentally
after different tests) as an early stop condition to avoid wasting 1
second when the write traffic is too low.
This commit adds peer ID caching in the client structure plus an API
change and the use of sdsMakeRoomFor() in order to improve the
reallocation pattern to generate the CLIENT LIST output.
Both the changes account for a very significant speedup.
When we are blocked and a few events a processed from time to time, it
is smarter to call the event handler a few times in order to handle the
accept, read, write, close cycle of a client in a single pass, otherwise
there is too much latency added for clients to receive a reply while the
server is busy in some way (for example during the DB loading).
Previously, the (!fp) would only catch lack of free space
under OS X. Linux waits to discover it can't write until
it actually writes contents to disk.
(fwrite() returns success even if the underlying file
has no free space to write into. All the errors
only show up at flush/sync/close time.)
Fixesantirez/redis#1604
A system similar to the RDB write error handling is used, in which when
we can't write to the AOF file, writes are no longer accepted until we
are able to write again.
For fsync == always we still abort on errors since there is currently no
easy way to avoid replying with success to the user otherwise, and this
would violate the contract with the user of only acknowledging data
already secured on disk.
Previously two string encodings were used for string objects:
1) REDIS_ENCODING_RAW: a string object with obj->ptr pointing to an sds
stirng.
2) REDIS_ENCODING_INT: a string object where the obj->ptr void pointer
is casted to a long.
This commit introduces a experimental new encoding called
REDIS_ENCODING_EMBSTR that implements an object represented by an sds
string that is not modifiable but allocated in the same memory chunk as
the robj structure itself.
The chunk looks like the following:
+--------------+-----------+------------+--------+----+
| robj data... | robj->ptr | sds header | string | \0 |
+--------------+-----+-----+------------+--------+----+
| ^
+-----------------------+
The robj->ptr points to the contiguous sds string data, so the object
can be manipulated with the same functions used to manipulate plan
string objects, however we need just on malloc and one free in order to
allocate or release this kind of objects. Moreover it has better cache
locality.
This new allocation strategy should benefit both the memory usage and
the performances. A performance gain between 60 and 70% was observed
during micro-benchmarks, however there is more work to do to evaluate
the performance impact and the memory usage behavior.
There was a race condition in the AOF rewrite code that, with bad enough
timing, could cause a volatile key just about to expire to be turned
into a non-volatile key. The bug was never reported to cause actualy
issues, but was found analytically by an user in the Redis mailing list:
https://groups.google.com/forum/?fromgroups=#!topic/redis-db/Kvh2FAGK4Uk
This commit fixes issue #1079.
This prevents the kernel from putting too much stuff in the output
buffers, doing too heavy I/O all at once. So the goal of this commit is
to split the disk pressure due to the AOF rewrite process into smaller
spikes.
Please see issue #1019 for more information.
This commit allows Redis to set a process name that includes the binding
address and the port number in order to make operations simpler.
Redis children processes doing AOF rewrites or RDB saving change the
name into redis-aof-rewrite and redis-rdb-bgsave respectively.
This in general makes harder to kill the wrong process because of an
error and makes simpler to identify saving children.
This feature was suggested by Arnaud GRANAL in the Redis Google Group,
Arnaud also pointed me to the setproctitle.c implementation includeed in
this commit.
This feature should work on all the Linux, OSX, and all the three major
BSD systems.
decrRefCount used to get its argument as a void* pointer in order to be
used as destructor where a 'void free_object(void*)' prototype is
expected. However this made simpler to introduce bugs by freeing the
wrong pointer. This commit fixes the argument type and introduces a new
wrapper called decrRefCountVoid() that can be used when the void*
argument is needed.
This commit fixes issue #875 that was caused by the following events:
1) There is an active child doing BGSAVE.
2) flushall is called (or any other condition that makes Redis killing
the saving child process).
3) An error is sensed by Redis as the child exited with an error (killed
by a singal), that stops accepting write commands until a BGSAVE happens
to be executed with success.
Whitelisting SIGUSR1 and making sure Redis always uses this signal in
order to kill its own children fixes the issue.
Sometimes it is much simpler to debug complex Redis installations if it
is possible to assign clients a name that is displayed in the CLIENT
LIST output.
This is the case, for example, for "leaked" connections. The ability to
provide a name to the client makes it quite trivial to understand what
is the part of the code implementing the client not releasing the
resources appropriately.
Behavior:
CLIENT SETNAME: set a name for the client, or remove the current
name if an empty name is set.
CLIENT GETNAME: get the current name, or a nil.
CLIENT LIST: now displays the client name if any.
Thanks to Mark Gravell for pushing this idea forward.
Finally Redis is able to report the amount of memory used by
copy-on-write while saving an RDB or writing an AOF file in background.
Note that this information is currently only logged (at NOTICE level)
and not shown in INFO because this is less trivial (but surely doable
with some minor form of interprocess communication).
The reason we can't capture this information on the parent before we
call wait3() is that the Linux kernel will release the child memory
ASAP, and only retain the minimal state for the process that is useful
to report the child termination to the parent.
The COW size is obtained by summing all the Private_Dirty fields found
in the "smap" file inside the proc filesystem for the process.
All this is Linux specific and is not available on other systems.
If Redis only manages to write out a partial buffer, the AOF file won't
load back into Redis the next time it starts up. It is better to
discard the short write than waste time running redis-check-aof.
Behaves like rdb_last_bgsave_status -- even down to reporting 'ok' when
no rewrite has been done yet. (You might want to check that
aof_last_rewrite_time_sec is not -1.)
The 'persistence' section of INFO output now contains additional four
fields related to RDB and AOF persistence:
rdb_last_bgsave_time_sec Duration of latest BGSAVE in sec.
rdb_current_bgsave_time_sec Duration of current BGSAVE in sec.
aof_last_rewrite_time_sec Duration of latest AOF rewrite in sec.
aof_current_rewrite_time_sec Duration of current AOF rewrite in sec.
The 'current' fields are set to -1 if a BGSAVE / AOF rewrite is not in
progress. The 'last' fileds are set to -1 if no previous BGSAVE / AOF
rewrites were performed.
Additionally a few fields in the persistence section were renamed for
consistency:
changes_since_last_save -> rdb_changes_since_last_save
bgsave_in_progress -> rdb_bgsave_in_progress
last_save_time -> rdb_last_save_time
last_bgsave_status -> rdb_last_bgsave_status
bgrewriteaof_in_progress -> aof_rewrite_in_progress
bgrewriteaof_scheduled -> aof_rewrite_scheduled
After the renaming, fields in the persistence section start with rdb_ or
aof_ prefix depending on the persistence method they describe.
The field 'loading' and related fields are not prefixed because they are
unique for both the persistence methods.