AE_BARRIER was implemented like this:
- Fire the readable event.
- Do not fire the writable event if the readable event fired.
However this may cause the writable event to never be called if the
readable event always fires. There is an alternative: we can simply
invert the sequence of the calls when AE_BARRIER is set. This commit
does that.
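A minimal sketch of the inverted dispatch, assuming structures similar
to ae.c's aeFileEvent (field and variable names here are
approximations, not necessarily the exact ones):

    /* Per-fd dispatch inside the event processing loop. Normally the
     * readable event fires first; with AE_BARRIER we invert the order,
     * so the readable handler can never starve the writable one. */
    int invert = fe->mask & AE_BARRIER;
    int fired = 0;

    if (!invert && fe->mask & mask & AE_READABLE) {
        fe->rfileProc(eventLoop, fd, fe->clientData, mask);
        fired++;
    }
    if (fe->mask & mask & AE_WRITABLE) {
        if (!fired || fe->wfileProc != fe->rfileProc) {
            fe->wfileProc(eventLoop, fd, fe->clientData, mask);
            fired++;
        }
    }
    /* If inverted, fire the readable event now, after the writable. */
    if (invert && fe->mask & mask & AE_READABLE) {
        if (!fired || fe->wfileProc != fe->rfileProc)
            fe->rfileProc(eventLoop, fd, fe->clientData, mask);
    }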
When the write handler is already installed, it could happen that we
serve the reply to a query in the same event loop cycle in which we
received it, preventing beforeSleep() from guaranteeing that we do the
AOF fsync before sending the reply to the client.
The AE_BARRIER mechanism, introduced in a previous commit, prevents this
problem. This commit makes actual use of this new feature to fix the
bug.
Add AE_BARRIER to the writable event so that slaves requesting votes
can't be served before we re-enter the event loop in the next
iteration, so that clusterBeforeSleep() will fsync to disk in time.
Also add a call to explicitly fsync, given that we modified the last
vote epoch variable.
AOF fsync=always, and certain Redis Cluster bus operations, require
fsyncing data on disk before replying with an acknowledgement.
In such cases, in order to implement Group Commits, we want to be sure
that queries that are read in a given cycle of the event loop are never
served to clients in the same event loop iteration. This way, by using
the event loop "before sleep" callback, we can fsync the information
just one time before returning into the event loop for the next cycle.
This is much more efficient compared to calling fsync() multiple times.
Unfortunately, because of a bug, this was not always guaranteed:
whether the guarantee held depended solely on the way the events
happened to be installed. Normally this problem is hard to trigger when
AOF is enabled with fsync=always, because we try to flush the output
buffers to the socket directly in the beforeSleep() function of Redis.
However if the output buffers are full, we actually install a write
event, and in such a case this bug could happen.
This change to ae.c modifies the event loop implementation to make this
concept explicit. Write events that are registered with:
AE_WRITABLE|AE_BARRIER
are guaranteed to never fire after the readable event was fired for the
same file descriptor. In this way we are sure that data is persisted to
disk before the client performing the operation receives an
acknowledgement.
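For example, installing the write handler with the barrier could look
like this sketch (sendReplyToClient matches the networking.c handler
name, but treat the exact call as an assumption):

    /* With AE_BARRIER, the write handler never runs after the read
     * handler in the same iteration: a reply to a query read in this
     * cycle is only sent on the next cycle, after beforeSleep() had a
     * chance to fsync. */
    if (aeCreateFileEvent(server.el, c->fd, AE_WRITABLE|AE_BARRIER,
                          sendReplyToClient, c) == AE_ERR)
    {
        freeClientAsync(c);
    }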
However note that these semantics do not provide all the guarantees
one might believe are automatically provided. Take the example of the
blocking list operations in Redis.
With AOF and fsync=always we could have:
Client A doing: BLPOP myqueue 0
Client B doing: RPUSH myqueue a b c
In this scenario, Client A will get the "a" element immediately after
Client B's RPUSH is executed, even before the operation is persisted.
However when Client B receives the acknowledgement, it can be sure that
"b,c" are already safe on disk inside the list.
What to note here is that Client A receiving the element is not a
guarantee that the operation succeeded from the point of view of
Client B.
This is due to the fact that the barrier exists within the same socket,
and not between different sockets. However in the case above, the
element "a" was not going to be persisted regardless, so it is a pretty
synthetic argument.
This commit adds two new fields in the INFO output, stats section:
expired_stale_perc:0.34
expired_time_cap_reached_count:58
The first field is an estimate of the percentage of keys that are still
in memory but are already logically expired. The reason why those keys
are not yet reclaimed is that the active expire cycle can't spend more
time on the process of reclaiming them, and at the same time nobody is
accessing such keys. However, as the active expire cycle runs, even
though it eventually has to return to the caller because of the time
limit, or because fewer than 25% of the sampled keys in a given
database are logically expired, it collects the stats needed to
populate this INFO field.
Note that expired_stale_perc is a running average, where the current
sample accounts for 5% and the history for 95%, so you'll see it
changing smoothly over time.
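A sketch of how such a running average can be maintained (the function
and field names are illustrative, not the actual server.h names):

    /* Exponential moving average: the current sample accounts for 5%,
     * the history for 95%, so the reported value changes smoothly. */
    double updateStalePerc(double avg, long stale_found, long sampled) {
        double current = sampled ? (double)stale_found / sampled : 0;
        return current * 0.05 + avg * 0.95;
    }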
The other field, expired_time_cap_reached_count, counts the number of
times the expire cycle had to stop because of the time limit, even
though it was still finding a sizeable number of keys to expire.
This allows people handling operations to understand whether the Redis
server, during mass-expiration events, is able to collect keys fast
enough. It is normal for this field to increment during mass expires,
but otherwise it should very rarely increment. When it increments
constantly, it means that the current workload is using a very
significant percentage of CPU time to expire keys.
This feature was created thanks to the hints of Rashmi Ramesh and
Bart Robinson from Twitter. In private email exchanges, they noted how
it was important to improve the observability of this parameter in the
Redis server. In big deployments, the keys that are logically expired
but not yet reclaimed in each server may account for a very large
amount of wasted memory.
It is possible to do BGREWRITEAOF even if appendonly=no. This is by design.
stopAppendOnly() didn't turn off aof_rewrite_scheduled (it can be
turned on again by BGREWRITEAOF even while appendonly is off anyway).
After configuring `appendonly yes`, the server sees that the state is
AOF_OFF and that there's no RDB fork, so it calls
rewriteAppendOnlyFileBackground(), which fails since aof_child_pid is
set (the rewrite was scheduled and started by cron).
Solution (sketched below):
stopAppendOnly() turns off the schedule flag (regardless of who set it).
startAppendOnly() terminates any existing fork and starts a new one (so it is the most recent).
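A sketch of the two fixes, assuming field and helper names close to
aof.c (killAppendOnlyChild() and the exact states are assumptions):

    void stopAppendOnly(void) {
        /* ...flush and close the current AOF... */
        server.aof_state = AOF_OFF;
        /* Clear any pending rewrite, no matter who scheduled it, so a
         * later "CONFIG SET appendonly yes" starts clean. */
        server.aof_rewrite_scheduled = 0;
        killAppendOnlyChild(); /* Assumed helper: stop any rewrite child. */
    }

    int startAppendOnly(void) {
        /* If a rewrite child already runs (e.g. scheduled earlier and
         * started by serverCron()), kill it: the new AOF must be
         * generated from the current dataset. */
        if (server.aof_child_pid != -1) killAppendOnlyChild();
        if (rewriteAppendOnlyFileBackground() == C_ERR) return C_ERR;
        server.aof_state = AOF_WAIT_REWRITE;
        return C_OK;
    }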
When feeding the master with high-rate traffic, the slave's feed is much slower.
This causes the replication buffer to grow (indefinitely), which leads to slave disconnection.
The problem is that writeToClient() decides to stop writing after NET_MAX_WRITES_PER_EVENT
writes (in order to be fair to clients).
We should ignore this limit when the client is a slave.
It's better if clients wait longer; the alternative is that the slave has no chance to stay in
sync in this situation.
See #3462 and related PRs.
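A sketch of the relaxed stop condition inside the writeToClient() loop
(the memory check mirrors the real code only approximately):

    /* Stop after NET_MAX_WRITES_PER_EVENT bytes to be fair with other
     * clients, but never stop early for a slave: its output buffer
     * must drain as fast as possible or it falls behind and eventually
     * gets disconnected. */
    if (totwritten > NET_MAX_WRITES_PER_EVENT &&
        (server.maxmemory == 0 ||
         zmalloc_used_memory() < server.maxmemory) &&
        !(c->flags & CLIENT_SLAVE)) break;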
We use a simple algorithm to calculate the level of affinity violation,
and then an optimizer that performs random swaps until things improve.
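A minimal sketch of this optimizer pattern (generic hill climbing; the
scoring callback and data layout are assumptions, not the actual code):

    #include <stdlib.h>

    /* Random-swap optimizer: 'score' returns the number of
     * anti-affinity violations for the current assignment (lower is
     * better). We try random swaps, keep the improving ones, and undo
     * the rest. */
    void optimizeAffinity(int *owner, int n, int maxiter,
                          int (*score)(int *owner, int n)) {
        int best = score(owner, n);
        while (maxiter-- > 0 && best > 0) {
            int a = rand() % n, b = rand() % n;
            int tmp = owner[a]; owner[a] = owner[b]; owner[b] = tmp;
            int s = score(owner, n);
            if (s < best) {
                best = s; /* Keep the improving swap. */
            } else {
                tmp = owner[a]; owner[a] = owner[b]; owner[b] = tmp; /* Undo. */
            }
        }
    }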
After a slave is promoted (assuming it has no slaves
and it booted over an hour ago), it will lose its replication
backlog at the next replication cron, rather than waiting for slaves
to connect to it.
So on a simple master/slave failover, if the new slave doesn't connect
immediately, it may be too late and PSYNC2 will fail.
This fixes a crash with Redis Cluster when OBJECT is mis-used, because
getKeysUsingCommandTable() will call serverPanic() upon detecting that
we are accessing an invalid argument when "OBJECT foo" is called.
This bug was introduced when OBJECT HELP was introduced, because the
key argument position is fixed at index 2 in the command table; however
OBJECT may now be called with an insufficient number of arguments to
extract the key.
The "Right Thing" would be to have a specific function to extract keys
from the OBJECT command, however this is kind of overkill, so I
preferred to make getKeysUsingCommandTable() more robust and just
return no keys when it's not possible to honor the command table:
new commands are often added, a number of them have a HELP subcommand
violating the normal form, and crashing for this trivial reason or
having many command-specific key extraction functions is not great.
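A sketch of the defensive check, simplified from db.c (the command
table field names are my reading of the code, so treat details as
assumptions):

    /* Inside getKeysUsingCommandTable(): walk the key positions
     * declared in the command table, but if an expected key position
     * lies beyond the arguments actually provided (e.g. "OBJECT foo"),
     * report zero keys instead of hitting serverPanic(). */
    for (j = cmd->firstkey; j <= last; j += cmd->keystep) {
        if (j >= argc) {
            zfree(keys);
            *numkeys = 0;
            return NULL;
        }
        keys[i++] = j;
    }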
- Protocol parsing (processMultibulkBuffer) was limited to 32-bit positions in the buffer.
- readQueryFromClient had a potential overflow.
- rioWriteBulkCount used int, although rioWriteBulkString passed it a size_t.
- Several places in sds.c used int for string length or index (a minimal illustration of this class of bug follows the list).
- Bugfix in RM_SaveAuxField (the return value was 1 or -1 rather than the length).
- RM_SaveStringBuffer was limited to a 32-bit length.
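A generic example, not the actual Redis code, showing why mixing int
and size_t lengths is dangerous for strings larger than 2GB (assuming
a 64-bit build):

    #include <stdio.h>

    int main(void) {
        /* A 5GB length stored in a 32-bit int wraps around: code that
         * mixes int and size_t for lengths silently corrupts values
         * above 2^31-1, which is why these spots were moved to wider
         * types. */
        size_t len = 5368709120ULL; /* 5GB */
        int truncated = (int)len;   /* Implementation-defined, wraps. */
        printf("size_t: %zu, int: %d\n", len, truncated);
        return 0;
    }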
The commit splits the add functions into set() and add() variants, so
that it's possible to set registers in an independent way, given just
the index and count.
Related to #3819, otherwise a fix is not possible.
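The resulting API shape, as a sketch (the signatures are my reading of
hyperloglog.c and may differ in details):

    #include <stdint.h>
    #include <stddef.h>

    /* Assumed helpers: hllPatLen() hashes the element and returns the
     * run length, storing the register index in *regp; hllDenseSet()
     * writes 'count' at register 'index' if larger than the current
     * value, and is usable directly when index and count are known. */
    int hllPatLen(unsigned char *ele, size_t elesize, long *regp);
    int hllDenseSet(uint8_t *registers, long index, uint8_t count);

    /* add(): obtain index and count from the element, then delegate to
     * the lower level set() that works from index and count alone. */
    int hllDenseAdd(uint8_t *registers, unsigned char *ele, size_t elesize) {
        long index;
        uint8_t count = hllPatLen(ele, elesize, &index);
        return hllDenseSet(registers, index, count);
    }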
The main change introduced by this commit is pretending that help
arrays are more text than code, thus indenting them at level 0. This
improves readability, and is an old practice when defining arrays of
C strings describing text.
Additionally a few useless return statements are removed, and the HELP
subcommand capitalized when printed to the user.
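For illustration, the style described above looks like this (a generic
example with an assumed reply helper, not one of the arrays actually
touched by the commit):

    void someCommand(client *c) {
        /* The help array is indented at level 0: the strings read as
         * text rather than code, which is the point of this change. */
        const char *help[] = {
    "LIST -- Return information about client connections.",
    "GETNAME -- Return the name of the current connection.",
    NULL
        };
        addReplyHelp(c, help); /* Assumed reply helper. */
    }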
We have this operation in two places: when caching the master and
when linking a new client after the client creation. By having an API
for this we avoid errors when modifying one of the two places and
forgetting the other. The function is also a good place to document
why we cache the linked list node.
Related to #4497 and #4210.
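A sketch of such an API (field names follow networking.c as I recall
them; consider them assumptions):

    /* Link the client to the global clients list and cache the list
     * node inside the client itself: unlinking the client becomes O(1)
     * instead of O(N), and the caching logic lives in a single place. */
    void linkClient(client *c) {
        listAddNodeTail(server.clients, c);
        /* We store the node, not just the client pointer, so that
         * unlinkClient() can remove it without scanning the list. */
        c->client_list_node = listLast(server.clients);
    }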
The function in its initial form, and after the fixes for the PSYNC2
bugs, required code duplication in multiple spots. This commit modifies
it in order to always compute the script name independently, and to
return the SDS of the SHA of the body: this way it can be used in all
the places, including for SCRIPT LOAD, without duplicating the code to
create the Lua function name. Note that this requires re-computing the
body SHA1 when EVAL sees a script for the first time, but this should
not change scripting performance in any way, because defining a new
script is a rare event happening only the first time a script is seen,
and the SHA1 computation is anyway not a slow process for the typical
Redis script, especially compared to the actual Lua byte-compiling of
the body.
Note that the function used to assert() if a duplicated script was
loaded; however in two out of the three use cases we now want the
function to handle duplicated scripts just fine: this happens in SCRIPT
LOAD and in RDB AUX "lua" loading. Moreover the assert was not
defending against some obvious failure modes, so now the function
always tests against already defined functions at the start.
Unfortunately, as outlined by @soloestoy in #4505, "lua" AUX RDB field
loading in case of duplicated script was still broken. This commit fixes
this problem and also a memory leak introduced by the past commit.
Note that now we have a regression test able to reproduce the issue,
so this commit was actually tested against the regression. The original
PR also had a valid fix, but I preferred to keep the details of
scripting.c hidden inside scripting.c, and later "SCRIPT LOAD" should
also be able to use luaCreateFunction() instead of redoing the work.
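A sketch of the resulting shape of luaCreateFunction() (simplified:
error handling and the Lua wrapping are omitted, and names are
assumptions wherever they differ from scripting.c):

    /* Compute the handler name from the body ("f_" + SHA1 hex), define
     * the function in the Lua context if not already defined, and
     * return the SDS SHA1 string that all callers (EVAL, SCRIPT LOAD,
     * RDB "lua" AUX loading) can use to reference the script. */
    sds luaCreateFunction(client *c, lua_State *lua, robj *body) {
        char funcname[43];
        dictEntry *de;

        funcname[0] = 'f'; funcname[1] = '_';
        sha1hex(funcname+2, body->ptr, sdslen(body->ptr));

        sds sha = sdsnewlen(funcname+2, 40);
        if ((de = dictFind(server.lua_scripts, sha)) != NULL) {
            sdsfree(sha);          /* Avoid leaking the duplicated SHA. */
            return dictGetKey(de); /* Duplicated script: reuse the key. */
        }

        /* ...wrap the body into "function f_<sha>() ... end", compile
         * it, and register it into server.lua_scripts under sha... */
        return sha;
    }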
With PSYNC2, forcing a full SYNC in tests is hard. With this new DEBUG
subcommand we just need to call it and then use CLIENT KILL TYPE master
in the slave.