Every log contains, just after the pid, a single character that provides
information about the role of an instance:
S - Slave
M - Master
C - Writing child
X - Sentinel
The previous code handling a lost slot (by another master with an higher
configuration for the slot) was defensive, considering it an error and
putting the cluster in an odd state requiring redis-cli fix.
This was changed, because actually this only happens either in a
legitimate way, with failovers, or when the admin messed with the config
in order to reconfigure the cluster. So the new code instead will try to
make sure that the keys stored match the new slots map, by removing all
the keys in the slots we lost ownership from.
The function that deletes the keys from the lost slots is called only
if the node does not lose all its slots (resulting in a reconfiguration
as a slave of the node that got ownership). This is an optimization
since the replication code will anyway flush all the instance data in
a faster way.
The error when the target key is busy was a generic one, while it makes
sense to be able to distinguish between the target key busy error and
the others easily.
This commit adds peer ID caching in the client structure plus an API
change and the use of sdsMakeRoomFor() in order to improve the
reallocation pattern to generate the CLIENT LIST output.
Both the changes account for a very significant speedup.
When we are blocked and a few events a processed from time to time, it
is smarter to call the event handler a few times in order to handle the
accept, read, write, close cycle of a client in a single pass, otherwise
there is too much latency added for clients to receive a reply while the
server is busy in some way (for example during the DB loading).
All the Redis functions that need to modify the string value of a key in
a destructive way (APPEND, SETBIT, SETRANGE, ...) require to make the
object unshared (if refcount > 1) and encoded in raw format (if encoding
is not already REDIS_ENCODING_RAW).
This was cut & pasted many times in multiple places of the code. This
commit puts the small logic needed into a function called
dbUnshareStringValue().
To test the bitfield array of counters set/get macros from the Redis Tcl
suite is hard, so a specialized command that is able to test the
internals was developed.
Also update the original REDIS_EVENTLOOP_FDSET_INCR to
include REDIS_MIN_RESERVED_FDS. REDIS_EVENTLOOP_FDSET_INCR
exists to make sure more than (maxclients+RESERVED) entries
are allocated, but we can only guarantee that if we include
the current value of REDIS_MIN_RESERVED_FDS as a minimum
for the INCR size.
Everywhere in the Redis code base, maxclients is treated
as an int with (int)maxclients or `maxclients = atoi(source)`,
so let's make maxclients an int.
This fixes a bug where someone could specify a negative maxclients
on startup and it would work (as well as set maxclients very high)
because:
unsigned int maxclients;
char *update = "-300";
maxclients = atoi(update);
if (maxclients < 1) goto fail;
But, (maxclients < 1) can only catch the case when maxclients
is exactly 0. maxclients happily sets itself to -300, which isn't
-300, but rather 4294966996, which isn't < 1, so... everything
"worked."
maxclients config parsing checks for the case of < 1, but maxclients
CONFIG SET parsing was checking for case of < 0 (allowing
maxclients to be set to 0). CONFIG SET parsing is now updated to
match config parsing of < 1.
It's tempting to add a MINIMUM_CLIENTS define, but... I didn't.
These changes were inspired by antirez#356, but this doesn't
fix that issue.
Obtaining the RSS (Resident Set Size) info is slow in Linux and OSX.
This slowed down the generation of the INFO 'memory' section.
Since the RSS does not require to be a real-time measurement, we
now sample it with server.hz frequency (10 times per second by default)
and use this value both to show the INFO rss field and to compute the
fragmentation ratio.
Practically this does not make any difference for memory profiling of
Redis but speeds up the INFO call significantly.
This is safer as by default maxmemory should just set a memory limit
without any key to be deleted, unless the policy is set to something
more relaxed.
There were 2 spare bits inside the Redis object structure that are now
used in order to enlarge 4x the range of the LRU field.
At the same time the resolution was improved from 10 to 1 second: this
still provides 194 days before the LRU counter overflows (restarting from
zero).
This is not a problem since it only causes lack of eviction precision for
objects not touched for a very long time, and the lack of precision is
only temporary.
This is an improvement over the previous eviction algorithm where we use
an eviction pool that is persistent across evictions of keys, and gets
populated with the best candidates for evictions found so far.
It allows to approximate LRU eviction at a given number of samples
better than the previous algorithm used.
For testing purposes it is handy to have a very high resolution of the
LRU clock, so that it is possible to experiment with scripts running in
just a few seconds how the eviction algorithms works.
This commit allows Redis to use the cached LRU clock, or a value
computed on demand, depending on the resolution. So normally we have the
good performance of a precomputed value, and a clock that wraps in many
days using the normal resolution, but if needed, changing a define will
switch behavior to an high resolution LRU clock.
Previously we used zunionInterGetKeys(), however after this function was
fixed to account for the destination key (not needed when the API was
designed for "diskstore") the two set of commands can no longer be served
by an unique keys-extraction function.
This API originated from the "diskstore" experiment, not for Redis
Cluster itself, so there were legacy/useless things trying to
differentiate between keys that are going to be overwritten and keys
that need to be fetched from disk (preloaded).
All useless with Cluster, so removed with the result of code
simplification.
server.unixtime and server.mstime are cached less precise timestamps
that we use every time we don't need an accurate time representation and
a syscall would be too slow for the number of calls we require.
Such an example is the initialization and update process of the last
interaction time with the client, that is used for timeouts.
However rdbLoad() can take some time to load the DB, but at the same
time it did not updated the time during DB loading. This resulted in the
bug described in issue #1535, where in the replication process the slave
loads the DB, creates the redisClient representation of its master, but
the timestamp is so old that the master, under certain conditions, is
sensed as already "timed out".
Thanks to @yoav-steinberg and Redis Labs Inc for the bug report and
analysis.
A system similar to the RDB write error handling is used, in which when
we can't write to the AOF file, writes are no longer accepted until we
are able to write again.
For fsync == always we still abort on errors since there is currently no
easy way to avoid replying with success to the user otherwise, and this
would violate the contract with the user of only acknowledging data
already secured on disk.
The API is one of the bulding blocks of CLUSTER FAILOVER command that
executes a manual failover in Redis Cluster. However exposed as a
command that the user can call directly, it makes much simpler to
upgrade a standalone Redis instance using a slave in a safer way.
The commands works like that:
CLIENT PAUSE <milliesconds>
All the clients that are not slaves and not in MONITOR state are paused
for the specified number of milliesconds. This means that slaves are
normally served in the meantime.
At the end of the specified amount of time all the clients are unblocked
and will continue operations normally. This command has no effects on
the population of the slow log, since clients are not blocked in the
middle of operations but only when there is to process new data.
Note that while the clients are unblocked, still new commands are
accepted and queued in the client buffer, so clients will likely not
block while writing to the server while the pause is active.
server.lua_time_start is expressed in milliseconds. Use mstime_t instead
of long long, and populate it with mstime() instead of ustime()/1000.
Functionally identical but more natural.
In high RPS environments, the default listen backlog is not sufficient, so
giving users the power to configure it is the right approach, especially
since it requires only minor modifications to the code.
Return the number of slaves for the same master having a better
replication offset of the current slave, that is, the slave "rank" used
to pick a delay before the request for election.
A client can enter a special cluster read-only mode using the READONLY
command: if the client read from a slave instance after this command,
for slots that are actually served by the instance's master, the queries
will be processed without redirection, allowing clients to read from
slaves (but without any kind fo read-after-write guarantee).
The READWRITE command can be used in order to exit the readonly state.
Masters not understanding REPLCONF ACK will reply with errors to our
requests causing a number of possible issues.
This commit detects a global replication offest set to -1 at the end of
the replication, and marks the client representing the master with the
REDIS_PRE_PSYNC flag.
Note that this flag was called REDIS_PRE_PSYNC_SLAVE but now it is just
REDIS_PRE_PSYNC as it is used for both slaves and masters starting with
this commit.
This commit fixes issue #1488.