This bug was introduced by a recent change in which readQueryFromClient
uses freeClientAsync, and despite the fact that freeClientsInAsyncFreeQueue
is now in beforeSleep, that's not enough, since it's not called during
loading in processEventsWhileBlocked. Furthermore, afterSleep was called in
that case, but beforeSleep wasn't.
This bug also caused slowness, since the level-triggered mode of epoll
kept signaling these connections as readable, causing us to keep doing
connRead again and again for all of these clients, which kept accumulating.
Now both beforeSleep and afterSleep are called, but not all of their
actions are performed during loading; some are reserved for the main loop.
Fixes issue #7215.
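For illustration, a minimal sketch of the resulting flow during loading,
assuming the event-loop names from ae.c/networking.c; the reduced action
set shown here is illustrative, not the exact patch:

    /* Hypothetical sketch: during loading the main event loop is never
     * reached, so the async free queue must be drained from
     * processEventsWhileBlocked(), otherwise clients freed with
     * freeClientAsync() pile up. */
    void freeClientsInAsyncFreeQueue(void);   /* drains clients_to_close */
    int aeProcessEvents(void *el, int flags); /* AE_FILE_EVENTS|AE_DONT_WAIT */

    void processEventsWhileBlocked(void *el) {
        for (int iterations = 4; iterations > 0; iterations--) {
            /* Reduced beforeSleep(): only the actions that are safe
             * during loading; the rest are reserved for the main loop. */
            freeClientsInAsyncFreeQueue();
            if (!aeProcessEvents(el, 0)) break;
            /* Reduced afterSleep() actions would run here as well. */
        }
    }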
Currently, there are several types of threads/child processes in a
Redis server. Sometimes we need to deeply optimise Redis performance,
so we would like to isolate these threads/processes onto specific CPUs.
There was some discussion about CPU affinity in this issue:
https://github.com/antirez/redis/issues/2863
This patch implements CPU affinity settings via redis.conf, so that
server_cpulist/bio_cpulist/aof_rewrite_cpulist/bgsave_cpulist can each
be configured as a CPU list.
Examples of cpulist in redis.conf:
server_cpulist 0-7:2 means cpu affinity 0,2,4,6
bio_cpulist 1,3 means cpu affinity 1,3
aof_rewrite_cpulist 8-11 means cpu affinity 8,9,10,11
bgsave_cpulist 1,10-11 means cpu affinity 1,10,11
Tested on Linux and FreeBSD; both work fine.
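As a rough sketch of what applying such a cpulist involves on Linux
(set_cpu_affinity is a hypothetical helper written for this note; the
actual patch also covers FreeBSD's cpuset API):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Parse a list like "0-7:2" or "1,10-11" and pin the calling
     * thread/process to the resulting CPU set. */
    static void set_cpu_affinity(const char *cpulist) {
        cpu_set_t set;
        CPU_ZERO(&set);
        char *dup = strdup(cpulist), *save = NULL;
        for (char *tok = strtok_r(dup, ",", &save); tok != NULL;
             tok = strtok_r(NULL, ",", &save)) {
            int lo, hi, step = 1;
            int n = sscanf(tok, "%d-%d:%d", &lo, &hi, &step);
            if (n >= 2) {                 /* "lo-hi" or "lo-hi:step" */
                for (int c = lo; c <= hi; c += step) CPU_SET(c, &set);
            } else if (n == 1) {          /* single cpu */
                CPU_SET(lo, &set);
            }
        }
        free(dup);
        sched_setaffinity(0, sizeof(set), &set); /* 0 = calling thread */
    }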
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
If a client gets blocked again in `processUnblockedClients`, Redis will not
send `REPLCONF GETACK *` to the replicas until the next event loop
iteration, so the client will be blocked for 100ms by default (10 Hz) if no
other file event fires.
Move the server.get_ack_from_slaves snippet after `processUnblockedClients`,
so that both the first WAIT command that puts the client in a blocked
context and a following WAIT command processed in processUnblockedClients
trigger redis-server to send `REPLCONF GETACK *`. This way the event loop
gets `REPLCONF ACK <reploffset>` from the replicas and unblocks the client
ASAP.
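A simplified sketch of the resulting beforeSleep() ordering
(replicationRequestAckFromSlaves is an illustrative stand-in for the
inline GETACK broadcast in the real code):

    struct { int get_ack_from_slaves; } server;
    void processUnblockedClients(void);          /* may re-block on WAIT */
    void replicationRequestAckFromSlaves(void);  /* sends REPLCONF GETACK * */

    void beforeSleep_sketch(void) {
        /* ... handle clients blocked on keys, etc. ... */
        processUnblockedClients();  /* a WAIT here sets get_ack_from_slaves */
        if (server.get_ack_from_slaves) {
            replicationRequestAckFromSlaves();
            server.get_ack_from_slaves = 0;
        }
    }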
Come to think of it, in theory (not in practice) getDecodedObject can
return the same original object with its refcount incremented, so the
pointer comparison in the previous commit was invalid.
So now, instead of checking the encoding, we explicitly check the
refcount.
Since the recent addition of OBJ_STATIC_REFCOUNT and the assertion in
incrRefCount, it is now impossible to use dictFind with a static robj,
because dictEncObjKeyCompare will call getDecodedObject, which tries to
increment the refcount just in order to decrement it later.
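In sketch form (types reduced; the refcount bookkeeping is the point, the
rest is illustrative):

    #include <limits.h>
    #define OBJ_STATIC_REFCOUNT INT_MAX        /* as in server.h */

    typedef struct robj { int refcount; void *ptr; } robj;
    robj *getDecodedObject(robj *o);  /* may return o itself, refcount++ */
    void decrRefCount(robj *o);
    int sdsKeyCompare(const void *a, const void *b);

    int encObjKeyCompare_sketch(const void *key1, const void *key2) {
        robj *o1 = (robj*)key1, *o2 = (robj*)key2;
        int decoded1 = 0, decoded2 = 0;
        /* A static robj must never reach getDecodedObject(), since
         * incrRefCount() asserts on OBJ_STATIC_REFCOUNT: */
        if (o1->refcount != OBJ_STATIC_REFCOUNT) { o1 = getDecodedObject(o1); decoded1 = 1; }
        if (o2->refcount != OBJ_STATIC_REFCOUNT) { o2 = getDecodedObject(o2); decoded2 = 1; }
        int cmp = sdsKeyCompare(o1->ptr, o2->ptr);
        /* Track "did we decode" explicitly; comparing pointers would
         * miss the case where the same object came back refcounted: */
        if (decoded1) decrRefCount(o1);
        if (decoded2) decrRefCount(o2);
        return cmp;
    }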
STRALGO should be a container for mostly read-only string
algorithms in Redis. The algorithms should have two main
characteristics:
1. They should be non-trivial to compute, and often not part of
programming language standard libraries.
2. They should be fast enough that it is a good idea to have optimized C
implementations.
Next thing I would love to see? A small strings compression algorithm.
If Redis crashes early, before Lua is set up (e.g., if file descriptor 0 is closed before exec), it will crash again trying to print memory statistics.
Related to #5145.
Design note: clients may change type when they turn into replicas or are
moved into the Pub/Sub category and so forth. Moreover the recomputation
of the bytes used is problematic for obvious reasons: it changes
continuously, so as a conservative way to avoid accumulating errors,
each client remembers the contribution it gave to the sum, and removes
it when it is freed or before updating it with the new memory usage.
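A minimal sketch of that bookkeeping (field and array names are
illustrative, not the exact server.h layout):

    #include <stddef.h>

    typedef struct client { int last_type; size_t last_mem; } client;
    size_t clients_type_memory[4]; /* per-type sums: normal/replica/pubsub/master */
    size_t getClientMemoryUsage(client *c);
    int getClientType(client *c);

    void updateClientMemUsage_sketch(client *c) {
        size_t mem = getClientMemoryUsage(c);
        int type = getClientType(c);
        /* Subtract the remembered contribution before adding the new
         * one, so type changes and continuous fluctuations never
         * accumulate errors in the sums: */
        clients_type_memory[c->last_type] -= c->last_mem;
        clients_type_memory[type] += mem;
        c->last_type = type;
        c->last_mem = mem;
    }
    /* On free: clients_type_memory[c->last_type] -= c->last_mem; */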
Example: Client uses a pipe to send the following to a
stale replica:
MULTI
.. do something ...
DISCARD
The replica will reply to the MULTI with -MASTERDOWN and
execute the rest of the commands... A client using a
pipe might not be aware that MULTI failed until it's
too late.
I can't think of a reason why MULTI/EXEC/DISCARD should
not be executed on stale replicas...
Also, enable MULTI/EXEC/DISCARD during loading
Make sure call() doesn't wrap replicated commands with
a redundant MULTI/EXEC.
Other, unrelated changes:
1. Formatting compiler warning in INFO CLIENTS
2. Use CLIENT_ID_AOF instead of UINT64_MAX
37a10cef introduced automatic wrapping of MULTI/EXEC for the
alsoPropagate API. However, this collides with the built-in mechanism
already present in module.c. To avoid complex changes near Redis 6 GA,
this commit introduces the ability to exclude call()'s MULTI/EXEC wrapping
for alsoPropagate, in order to continue using the old code paths in
module.c.
Now that this mechanism is the sole one used for blocked client
timeouts, it is wiser to clean up the table when the client unblocks
for any reason. We use a flag, CLIENT_IN_TO_TABLE, in order to avoid a
radix tree lookup when the client was already removed from the table
because we processed it by scanning the radix tree.
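In sketch form (the key encoding and field names are simplified here;
raxRemove is the real rax.h API):

    #include <stdint.h>
    #include <string.h>

    #define CLIENT_IN_TO_TABLE (1ULL<<0)   /* illustrative bit value */

    typedef struct rax rax;
    int raxRemove(rax *rax, unsigned char *s, size_t len, void **old);

    typedef struct client { uint64_t flags; uint64_t id; uint64_t timeout; } client;
    extern rax *clients_timeout_table;

    void removeClientFromTimeoutTable_sketch(client *c) {
        if (!(c->flags & CLIENT_IN_TO_TABLE)) return; /* skip the lookup */
        unsigned char buf[16];
        /* Key: absolute timeout followed by the client id. */
        memcpy(buf, &c->timeout, 8);
        memcpy(buf+8, &c->id, 8);
        raxRemove(clients_timeout_table, buf, sizeof(buf), NULL);
        c->flags &= ~CLIENT_IN_TO_TABLE;
    }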
A very commonly reported operational problem with Redis master-replica
setups is that, once the master becomes unavailable for some reason,
especially because of network problems, it often won't be able to
perform a partial resynchronization with the new master once it rejoins
the partition, for the following reason:
1. The master becomes isolated; however, it keeps sending PINGs to the
replicas. Such PINGs will never be received, since the link connection is
actually already severed.
2. On the other side, one of the replicas will turn into the new master,
setting its secondary replication ID offset to that of the last
command received from the old master: this offset will not include the
PINGs sent by the master once the link was already disconnected.
3. When the old master rejoins the partition and is turned into a replica,
its offset will be too advanced because of the PINGs, so a PSYNC will fail
and a full synchronization will be required.
Related to issue #7002 and other discussion we had in the past around
this problem.
Redis refusing to run MULTI or EXEC during script timeout may cause partial
transactions to run.
1) If the client sends MULTI+commands+EXEC in a pipeline without waiting for
the response, and these arrive at the shard partially while there's a busy
script and partially after it eventually finishes, we'll end up running only
part of the transaction (since MULTI was ignored and EXEC would fail).
2) Similarly, if EXEC arrives during a busy script, it'll be ignored and
the client state remains in a transaction.
The third test, which I added for a case where MULTI and EXEC are fine and
only the body arrives during the busy script, was already handled correctly,
since processCommand calls flagTransaction.
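The shape of the check inside processCommand() after the fix, as a reduced
fragment (the real allow-list also includes SHUTDOWN NOSAVE, SCRIPT KILL
and a few others):

    if (server.lua_timedout &&
        c->cmd->proc != multiCommand &&
        c->cmd->proc != execCommand &&
        c->cmd->proc != discardCommand)
    {
        /* flagTransaction() marks a queued MULTI as dirty, so a later
         * EXEC fails as a whole instead of running half a transaction. */
        flagTransaction(c);
        addReplyError(c, "Redis is busy running a script");
        return C_ERR;
    }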
Before this commit, when upgrading a replica, expired keys were not
loaded, causing the replica to have fewer keys in its db. Up to this point,
the master's and replica's keys are logically consistent. However, before
the keys in the master and replica become physically consistent, that is,
before they have the same dbsize, if the master has a problem and the
replica gets promoted to become the new master of that partition, and the
new master updates a key which does not exist on itself but still
physically exists on the old master (now a replica), the old master would
refuse to update the key, leaving master and replica data inconsistent.
How could this happen?
That's all because of the wrong judgement of roles while starting up
the server. We can not use server.masterhost to judge if the server
is master or replica, since it fails in cluster mode.
When we start the server, we load the RDB and do want to load expired
keys, but we do not want the instance to have the ability to actively
expire keys if it is a replica.
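A simplified fragment of the RDB-load decision with the corrected role
check (iAmMaster() replaces the server.masterhost test; the surrounding
rdb.c details are omitted):

    if (iAmMaster() && expiretime != -1 && expiretime < now) {
        /* Master: skip the expired key entirely. */
        decrRefCount(key);
        decrRefCount(val);
    } else {
        /* Replica (or key not expired): load the key, so dbsize stays
         * physically consistent with the master.  The key is removed
         * later by a DEL replicated from the master, never by active
         * expiry on the replica. */
        dbAdd(db, key, val);
        if (expiretime != -1) setExpire(NULL, db, key, expiretime);
    }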
1. server.repl_no_slaves_since can be set when a MONITOR client disconnects
2. c->repl_ack_time can be set by a newline from a MONITOR client
3. Improved comments
SELECT and HELLO are commands that may be executed by the client
as soon as it connects, so there's no reason to block them and prevent the
client from doing the rest of its sequence (which might just be INFO or
CONFIG, etc.).
MONITOR, DEBUG, SLOWLOG, TIME and LASTSAVE are all non-data-accessing
commands, which there's no reason to block either.
Checking OOM with `getMaxMemoryState` inside a script might get a different
result than `freeMemoryIfNeededAndSafe` at script start, because the Lua
stack and arguments also consume memory.
This leads to a 'borderline' condition when memory usage grows near
server.maxmemory:
- `freeMemoryIfNeededAndSafe` at script start detects no OOM, so no memory is freed
- `getMaxMemoryState` inside the script detects OOM, so the script is aborted
We solve this 'borderline' issue by saving the OOM state at script start, to
get a stable Lua OOM state.
related to issue #6565 and #5250.
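The fix boils down to something like the following fragments (server.lua_oom
is the saved state; the surrounding code is sketched, not exact):

    /* At script start (evalGenericCommand): */
    server.lua_oom = (freeMemoryIfNeededAndSafe() == C_ERR);

    /* Later, when the script issues a write command: */
    if (server.lua_oom && (cmd->flags & CMD_DENYOOM)) {
        /* Reply with -OOM.  We deliberately do not re-run
         * getMaxMemoryState() here, since its answer may differ because
         * the Lua stack and arguments themselves consume memory. */
    }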
So the error message `ERR only (P)SUBSCRIBE / (P)UNSUBSCRIBE / PING / QUIT allowed in this context` will become
`ERR 'get' command submitted, but only (P)SUBSCRIBE / (P)UNSUBSCRIBE / PING / QUIT allowed in this context`
Since the refactoring of config.c, the LRU clock was initialized from
config_hz in initServer, but apparently that's too late, since config file
loading creates objects which call LRU_CLOCK.
This message has been there for ten years, but is hardly useful.
Moreover, it is likely to fill an entire disk if log rotation
is not configured, for no good reason.
Changes in behavior:
- Change server.stream_node_max_entries from int64_t to long long, so that it can be used by the generic infra
- standard error reply instead of "repl-backlog-size must be 1 or greater" and such
- tls-port and a few TLS booleans were readable (config get) even when USE_OPENSSL was off (now they aren't)
- syslog-enabled, syslog-ident, cluster-enabled, appendfilename, and supervised didn't have a get (now they do)
- pidfile was initialized to NULL in InitServerConfig but had CONFIG_DEFAULT_PID_FILE in rewriteConfig (so the real default was "", but rewrite would cause it to be set), fixed the rewrite.
- TLS config in server.h was uninitialized (if no tls config args were provided)
Adding test for sanity and coverage
- add capability for each config to have a callback that checks if a value is valid and returns an error string
will enable converting many of the remaining custom configs into generic ones (reducing the x4 repetition for set, get, config, rewrite)
- add capability for each config to run some update code after the config is changed (only for CONFIG SET)
will also enable converting many of the remaining custom configs into generic ones
- add capability to move default values from server.h and server.c to config.c
will reduce many excess lines in server.h and server.c (plus, no need to rebuild the entire code base when a default changes 8-)); see the sketch after this list
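A sketch of the generic-config shape this enables (struct and field names
are illustrative, not the exact config.c layout):

    typedef struct numericConfig_sketch {
        const char *name;
        long long *ptr;           /* points into the server struct */
        long long default_value;  /* default lives here, not in server.h */
        long long lower, upper;   /* generic bounds checking */
        /* Optional hooks: */
        int (*is_valid_fn)(long long val, const char **err); /* validate */
        int (*update_fn)(long long val, long long prev,
                         const char **err);   /* runs on CONFIG SET only */
    } numericConfig_sketch;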
other behavior changes:
- fix bug in bool config get (always returning 'yes')
- fix a bug in modifying jemalloc-bg-thread at runtime (didn't call set_jemalloc_bg_thread, due to bad merge conflict resolution (my fault))
- side effect: when an attempt to enable activedefrag at runtime fails, we now respond with -ERR and not with -DISABLED
A random command like SPOP with a count is replicated as
several SREM operations, which are stored in the also_propagate
array to be propagated after the call, but this would break
atomicity.
To keep the command's atomicity, wrap also_propagate
array with MULTI/EXEC.
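Simplified, the propagation step in call() becomes (a fragment; the extra
module-related conditions are omitted):

    int multi_emitted = 0;
    if (server.also_propagate.numops > 1) {
        execCommandPropagateMulti(c);        /* propagate MULTI */
        multi_emitted = 1;
    }
    for (int j = 0; j < server.also_propagate.numops; j++) {
        redisOp *rop = &server.also_propagate.ops[j];
        propagate(rop->cmd, rop->dbid, rop->argv, rop->argc, rop->target);
    }
    if (multi_emitted)
        execCommandPropagateExec(c);         /* propagate EXEC */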
Is it sufficient... ? -- Yes it is. In standalone mode, we say READY=1
at the comment point; however in replicated mode, we delay sending
READY=1 until the replication sync completes.
This adds Makefile/build-system support for USE_SYSTEMD=(yes|no|*). This
variable's value determines whether or not libsystemd will be linked at
build-time.
If USE_SYSTEMD is set to "yes", make will use PKG_CONFIG to check for
libsystemd's presence, and fail the build early if it isn't
installed/detected properly.
If USE_SYSTEMD is set to "no", libsystemd will *not* be linked, even if
support for it is available on the system Redis is being built on.
For any other value that USE_SYSTEMD might assume (e.g. "auto"),
PKG_CONFIG will try to determine libsystemd's presence, and set up the
build process to link against it, if it was indicated as being
installed/available.
This approach has a number of repercussions of its own, most importantly
the following: If you build redis on a system that actually has systemd
support, but no libsystemd-dev package(s) installed, you'll end up
*without* support for systemd notification/status reporting support in
redis-server. This changes established runtime behaviour.
I'm not sure if the build system and/or the server binary should
indicate this. I'm also wondering if not actually having
systemd-notify-support, but requesting it via the server's config,
should result in a fatal error now.
Instead of replicating a subset of libsystemd's sd_notify(3) internally,
use the dynamic library provided by systemd to communicate with the
service manager.
When systemd supervision was auto-detected or configured, communicate
the actual server status (i.e. "Loading dataset", "Waiting for
master<->replica sync") to systemd, instead of declaring readiness right
after initializing the server process.
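For reference, the sd_notify(3) calls involved look like this (sd_notify is
the real libsystemd API from <systemd/sd-daemon.h>; it is a no-op when
NOTIFY_SOCKET is unset):

    #include <systemd/sd-daemon.h>  /* link with: pkg-config --libs libsystemd */

    int main(void) {
        sd_notify(0, "STATUS=Loading dataset");
        /* ... load RDB/AOF ... */
        sd_notify(0, "STATUS=Waiting for master<->replica sync");
        /* ... replication sync completes ... */
        /* Only now declare readiness to the service manager: */
        sd_notify(0, "READY=1");
        return 0;
    }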
Reduce the default minimum effort, so that when fragmentation is just
detected, the impact on latency will be minor.
Reduce the default maximum effort, mainly to prevent a case where sudden
massive deletions trigger an aggressive defrag that causes latency.
When activedefrag is disabled mid-run, reset the 'running' info field, and
clear the scan cursor, so that when it'll be re-enabled, a new fresh scan will
start.
Clearing the 'running' variable is important, since lowering the defragger
tunables mid-scan won't help: the defragger only considers the new
thresholds when a new scan starts, and during a scan it can only become
more aggressive (when more severe fragmentation is detected); it will never
become less aggressive. So by temporarily disabling activedefrag, one can
lower the tunables.
Removing the experimental warning.
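For context, the redis.conf tunables involved look like this (the values
shown are just an example of lowering the effort, not necessarily the new
defaults):
active-defrag-cycle-min 1 means minimal defrag CPU effort of 1%
active-defrag-cycle-max 25 means maximal defrag CPU effort of 25%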
One problem with the solution proposed so far in #6537 is that key
lookups outside a command execution via call(), still used a cached
time. The cached time needed to be refreshed in multiple places,
especially because of modules callbacks from timers, cluster bus, and
thread safe contexts, that may use RM_Open().
In order to avoid this problem, this commit introduces the ability to
detect if we are inside call(): this way we can use the reference fixed
time only when we are in the context of a command execution or Lua
script, but for the asynchronous lookups, we can still use mstime() to
get a fresh time reference.
After the thread in #6537 and thanks to the suggestions received, this
commit updates the original patch in order to:
1. Solve the problem of updating the time in multiple places by updating
it in call().
2. Avoid introducing a new field but use our cached time.
This required some minor refactoring to the function updating the time,
and the introduction of a new cached time in microseconds in order to
use less gettimeofday() calls.
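A sketch of the resulting time handling (field names approximate server.h;
fixed_time_expire is the counter incremented around call()):

    /* Refreshed by serverCron and at the top of call(): */
    void updateCachedTime_sketch(void) {
        server.ustime = ustime();             /* one gettimeofday() */
        server.mstime = server.ustime / 1000;
        server.unixtime = server.mstime / 1000;
    }

    mstime_t commandTimeSnapshot_sketch(void) {
        /* Nonzero fixed_time_expire means we are inside a command or
         * Lua script: use the fixed reference time.  Asynchronous
         * lookups (module timers, cluster bus, thread safe contexts)
         * still get a fresh clock. */
        return server.fixed_time_expire ? server.mstime : mstime();
    }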
Calling lookupKey*() several times to search for a key within one command
may return different results.
That's because lookupKey*() calls expireIfNeeded(), which deletes
the key when its expire time is reached. So we can get an robj before
the expire time, but NULL after it.
Worse, this may lead to a Redis crash. For example, with
`RPOPLPUSH foo foo`, the first time we get a list from `foo` and
hold the pointer, but when we look up `foo` again it's expired and
deleted, so we're now holding freed memory, and Redis crashes when
rpoplpushHandlePush() executes.
To fix this, we refactor the judgment of whether a key is expired to
use the same base time, `server.cmd_start_mstime`, instead of calling
mstime() every time.
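A minimal sketch of the refactored check (server.cmd_start_mstime as named
above; the zero-means-idle fallback is illustrative):

    int keyIsExpired_sketch(redisDb *db, robj *key) {
        mstime_t when = getExpire(db, key);
        if (when < 0) return 0;                /* key has no TTL */
        /* One base time for the whole command execution, so repeated
         * lookups of the same key within one command agree: */
        mstime_t now = server.cmd_start_mstime ? server.cmd_start_mstime
                                               : mstime();
        return now > when;
    }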