Will be configurable / adaptive at some point but let's start with a
saner value compared to 1 sec which is not a good idea for big data
structures stored into a single key.
The error when the target key is busy was a generic one, while it makes
sense to be able to distinguish between the target key busy error and
the others easily.
The same change was operated for normal client connections. This is
important for Cluster as well, since when a node rejoins the cluster,
when a partition heals or after a restart, it gets flooded with new
connection attempts by all the other nodes trying to form a full
mesh again.
When a Sentinel performs a failover (successful or not), or when a
Sentinel votes for a different Sentinel trying to start a failover, it
sets a min delay before it will try to get elected for a failover.
While not strictly needed, because if multiple Sentinels will try
to failover the same master at the same time, only one configuration
will eventually win, this serialization is practically very useful.
Normal failovers are cleaner: one Sentinel starts to failover, the
others update their config when the Sentinel performing the failover
is able to get the selected slave to move from the role of slave to the
one of master.
However currently this timeout was implicit, so users could see
Sentinels not reacting, after a failed failover, for some time, without
giving any feedback in the logs to the poor sysadmin waiting for clues.
This commit makes Sentinels more verbose about the delay: when a master
is down and a failover attempt is not performed because the delay has
still not elaped, something like that will be logged:
Next failover delay: I will not start a failover
before Thu May 8 16:48:59 2014
Reusing small objects when possible is a major speedup under certain
conditions, since it is able to avoid the malloc/free pattern that
otherwise is performed for every argument in the client command vector.
Replace the three calls to Lua API lua_tostring, lua_lua_strlen,
and lua_isstring, with a single call to lua_tolstring.
~ 5% consistent speed gain measured.
Calling lua_gc() after every script execution is too expensive, and
apparently does not make the execution smoother: the same peak latency
was measured before and after the commit.
This change accounts for scripts execution speedup in the order of 10%.
The function showed up consuming a non trivial amount of time in the
profiler output. After this change benchmarking gives a 6% speed
improvement that can be consistently measured.
When the reply is only contained in the client static output buffer, use
a fast path avoiding the dynamic allocation of an SDS string to
concatenate the client reply objects.
Initially Redis Cluster accepted that after cluster creation all the
nodes were at configEpoch 0, evolving from zero as failovers happen.
However later the semantic was made more strict in order to make sure a
cluster has always all the master nodes with a different configEpoch,
which is more robust in some corner case (especially resulting from
errors by the system administrator).
To assign different configEpochs to different nodes at startup was a
task performed naturally by the config conflicts resolution algorithm
(see the Cluster specification). However this works well only for small
clusters or when there are actually just a few collisions, since it is
designed for exceptional cases.
When a large cluster is created hundred of nodes can be at epoch 0, so
the conflict resolution code is slow to provide an unique config to each
node. For this reason this new command was introduced. It can be called
only when a node is totally fresh: no other nodes known, and configEpoch
set to zero, so it is safe even against misuses.
redis-trib will use the new command in order to start the cluster
already setting an incremental unique config to every node.
This commit adds peer ID caching in the client structure plus an API
change and the use of sdsMakeRoomFor() in order to improve the
reallocation pattern to generate the CLIENT LIST output.
Both the changes account for a very significant speedup.
This commit also fixes a bug in the implementation of sdscatfmt()
resulting from stale references to the SDS string header after
sdsMakeRoomFor() calls.
sdscatprintf() relies on printf() family libc functions and is sometimes
too slow in critical code paths. sdscatfmt() is an alternative which is:
1) Far less capable.
2) Format specifier uncompatible.
3) Faster.
It is suitable to be used in those speed critical code paths such as
CLIENT LIST output generation.
When we are blocked and a few events a processed from time to time, it
is smarter to call the event handler a few times in order to handle the
accept, read, write, close cycle of a client in a single pass, otherwise
there is too much latency added for clients to receive a reply while the
server is busy in some way (for example during the DB loading).
When the listening sockets readable event is fired, we have the chance
to accept multiple clients instead of accepting a single one. This makes
Redis more responsive when there is a mass-connect event (for example
after the server startup), and in workloads where a connect-disconnect
pattern is used often, so that multiple clients are waiting to be
accepted continuously.
As a side effect, this commit makes the LOADING, BUSY, and similar
errors much faster to deliver to the client, making Redis more
responsive when there is to return errors to inform the clients that the
server is blocked in an not interruptible operation.
We should return REDIS_ERR to signal we can't read the configuration
because there is no config file only after checking errno, othewise
we risk to rewrite an existing file that was not accessible for some
other reason.
This was a common source of problems among users.
The solution adopted is not bullet-proof as if the user deletes the
nodes.conf file manually, and starts a new instance with the same
nodes.conf file path, two instances will use the same file. However
following this reasoning the user may drop a nuclear bomb into the
datacenter as well.
I happen to be working on a system that lacks urandom. While the code does try
to handle this case and artificially create some bytes if the file pointer is
empty, it does try to close it unconditionally, leading to a segfault.
When we set a protocol error we should return with REDIS_ERR to let the
caller know it should stop processing the client.
Bug found in a code auditing related to issue #1699.
The internal HLL raw encoding used by PFCOUNT when merging multiple keys
is aligned to 8 bits (1 byte per register) so we can exploit this to
improve performances by processing multiple bytes per iteration.
In benchmarks the new code was several times faster with HLLs with many
registers set to zero, while no slowdown was observed with populated
HLLs.
When the register is set to zero, we need to add 2^-0 to E, which is 1,
but it is faster to just add 'ez' at the end, which is the number of
registers set to zero, a value we need to compute anyway.
Given that the code was written with a 2 years pause... something
strange happened in the middle. So there was no function to free a
lex range min/max objects, and in some places the range was passed by
value.
After running a few benchmarks, 3000 looks like a reasonable value to
keep HLLs with a few thousand elements small while the CPU cost is
still not huge.
This covers all the cases where the dense representation would use N
orders of magnitude more space, like in the case of many HLLs with
carinality of a few tens or hundreds.
It is not impossible that in the future this gets user configurable,
however it is easy to pick an unreasoable value just looking at savings
in the space dimension without checking what happens in the time
dimension.
Bulk length for registers was emitted too early, so if there was a bug
the reply looked like a long array with just one element, blocking the
client as result.
The function checks if all the HLL_REGISTERS were processed during the
convertion from sparse to dense encoding, returning REDIS_OK or
REDIS_ERR to signal a corruption problem.
A bug in PFDEBUG GETREG was fixed: when the object is converted to the
dense representation we need to reassign the new pointer to the header
structure pointer.
Provides a human readable description of the opcodes composing a
run-length encoded HLL (sparse encoding).
The command is only useful for debugging / development tasks.
The new API takes directly the object doing everything needed to
turn it into a dense representation, including setting the new
representation as object->ptr.
Code never tested, but the basic layout is shaped in this commit.
Also missing:
1) Sparse -> Dense conversion function.
2) New HLL object creation using the sparse representation.
3) Implementation of PFMERGE for the sparse representation.
Metadata are now placed at the start of the representation as an header.
There is a proper structure to access the representation.
Still work to do in order to truly abstract the implementation from the
representation, commands still work assuming dense representation.
After running a few simulations with different alternative encodings,
it was found that the VAL opcode performs better using 5 bits for the
value and 2 bits for the run length, at least for cardinalities in the
range of interest.
adjustOpenFilesLimit() and clusterUpdateSlotsWithConfig() that were
assuming uint64_t is the same as unsigned long long, which is true
probably for all the systems out there that we target, but still GCC
emitted a warning since technically they are two different types.
This will be a non-op most of the times since the object will be
unshared / decoded, however it is more technically correct to start this
way since the object may be decoded even in the read-only code path.
Using a seed of zero has the side effect of having the empty string
hashing to what is a very special case in the context of HyperLogLog: a
very long run of zeroes.
This did not influenced the correctness of the result with 16k registers
because of the harmonic mean, but still it is inconvenient that a so
obvious value maps to a so special hash.
The seed 0xadc83b19 is used instead, which is the first 64 bits of the
SHA1 of the empty string.
Reference: issue #1657.
We need to guarantee that the last bit is 1, otherwise an element may
hash to just zeroes with probability 1/(2^64) and trigger an infinite
loop.
See issue #1657.