read() and write() return ssize_t (signed long), not int.
For other offsets, we can use the unsigned size_t type instead
of a signed offset (since our replication offsets and buffer
positions are never negative).
It's possible large objects could be larger than 'int', so let's
upgrade all size counters to ssize_t.
This also fixes rdbSaveObject serialized bytes calculation.
Since entire serializations of data structures can be large,
so we don't want to limit their calculated size to a 32 bit signed max.
This commit increases object size calculation and
cascades the change back up to serializedlength printing.
Before:
127.0.0.1:6379> debug object hihihi
... encoding:quicklist serializedlength:-2147483559 ...
After:
127.0.0.1:6379> debug object hihihi
... encoding:quicklist serializedlength:2147483737 ...
In order to avoid that misconfigured cluster nodes at some time may
force an IP update on other nodes, it is required that nodes update
their own address only on MEET messages. However it does not make sense
to do this the first time a node is contacted and yet does not have an
IP, we just risk that myself->ip remains not assigned if there are
messages lost or cluster creation procedures that don't make sure
everybody is targeted by at least one incoming MEET message.
Also fix the logging of the IP switch avoiding the :-1 tail.
Also explicitly set version to 0, add a protocol version define, improve
comments in the gossip structure.
Note that the structure layout is the same after the change, we are just
making the padding explicit with an additional not used 16 bits field.
So this commit is still able to talk with the previous versions of
cluster nodes.
Valgrind checks that the buffers we transfer via syscalls are all
composed of bytes actually initialized. This is useful, it makes we able
to avoid leaking informations in non initialized parts fo messages
transferred to other hosts. This commit fixes one of such issues.
Can't be initialized by resetManualFailover() since it's actual state
the function uses, so we need to initialize it at startup time. Not
really a bug in practical terms, but showed up into valgrind and is not
technically correct anyway.
Adds configuration option 'supervised [no | upstart | systemd | auto]'
Also removed 'bzero' from the previous implementation because it's 2015.
(We could actually statically initialize those structs, but clang
throws an invalid warning when we try, so it looks bad even though it
isn't bad.)
Fixes#2264
Previously, Redis only wrote the pid file if
it was daemonizing, but many times it's useful to have
the pid written out even if you're in the foreground.
Some background for this is:
I usually run redis via daemontools. That entails running
redis-server on the foreground. Given that, I'd also want
redis-server to create a pidfile so other processes (e.g. nagios)
can run checks for that.
Closes#463
This commit introduces a new RDB data type called 'aux'. It is used in
order to insert inside an RDB file key-value pairs that may serve
different needs, without breaking backward compatibility when new
informations are embedded inside an RDB file. The contract between Redis
versions is to ignore unknown aux fields when encountered.
Aux fields can be used in order to:
1. Augment the RDB file with info like version of Redis that created the
RDB file, creation time, used memory while the RDB was created, and so
forth.
2. Add state about Redis inside the RDB file that we need to reload
later: replication offset, previos master run ID, in order to improve
failovers safety and allow partial resynchronization after a slave
restart.
3. Anything that we may want to add to RDB files without breaking the
ability of past versions of Redis to load the file.
The new opcode is an hint about the size of the dataset (keys and number
of expires) we are going to load for a given Redis database inside the
RDB file. Since hash tables are resized accordingly ASAP, useless
rehashing is avoided, speeding up load times significantly, in the order
of ~ 20% or more for larger data sets.
Related issue: #1719
Adds: ql_compressed (boolean, 1 if compression enabled for list, 0
otherwise)
Adds: ql_uncompressed_size (actual uncompressed size of all quicklistNodes)
Adds: ql_ziplist_max (quicklist max ziplist fill factor)
Compression ratio of the list is then ql_uncompressed_size / serializedlength
We report ql_uncompressed_size for all quicklists because serializedlength
is a _compressed_ representation anyway.
Sample output from a large list:
127.0.0.1:6379> llen abc
(integer) 38370061
127.0.0.1:6379> debug object abc
Value at:0x7ff97b51d140 refcount:1 encoding:quicklist serializedlength:19878335 lru:9718164 lru_seconds_idle:5 ql_nodes:21945 ql_avg_node:1748.46 ql_ziplist_max:-2 ql_compressed:0 ql_uncompressed_size:1643187761
(1.36s)
The 1.36s result time is because rdbSavedObjectLen() is serializing the
object, not because of any new stats reporting.
If we run DEBUG OBJECT on a compressed list, DEBUG OBJECT takes almost *zero*
time because rdbSavedObjectLen() reuses already-compressed ziplists:
127.0.0.1:6379> debug object abc
Value at:0x7fe5c5800040 refcount:1 encoding:quicklist serializedlength:19878335 lru:9718109 lru_seconds_idle:5 ql_nodes:21945 ql_avg_node:1748.46 ql_ziplist_max:-2 ql_compressed:1 ql_uncompressed_size:1643187761
This removes:
- list-max-ziplist-entries
- list-max-ziplist-value
This adds:
- list-max-ziplist-size
- list-compress-depth
Also updates config file with new sections and updates
tests to use quicklist settings instead of old list settings.
Let user set how many nodes to *not* compress.
We can specify a compression "depth" of how many nodes
to leave uncompressed on each end of the quicklist.
Depth 0 = disable compression.
Depth 1 = only leave head/tail uncompressed.
- (read as: "skip 1 node on each end of the list before compressing")
Depth 2 = leave head, head->next, tail->prev, tail uncompressed.
- ("skip 2 nodes on each end of the list before compressing")
Depth 3 = Depth 2 + head->next->next + tail->prev->prev
- ("skip 3 nodes...")
etc.
This also:
- updates RDB storage to use native quicklist compression (if node is
already compressed) instead of uncompressing, generating the RDB string,
then re-compressing the quicklist node.
- internalizes the "fill" parameter for the quicklist so we don't
need to pass it to _every_ function. Now it's just a property of
the list.
- allows a runtime-configurable compression option, so we can
expose a compresion parameter in the configuration file if people
want to trade slight request-per-second performance for up to 90%+
memory savings in some situations.
- updates the quicklist tests to do multiple passes: 200k+ tests now.
Added field 'ql_nodes' and 'ql_avg_per_node'.
ql_nodes is the number of quicklist nodes in the quicklist.
ql_avg_node is the average fill level in each quicklist node. (LLEN / QL_NODES)
Sample output:
127.0.0.1:6379> DEBUG object b
Value at:0x7fa42bf2fed0 refcount:1 encoding:quicklist serializedlength:18489 lru:8983768 lru_seconds_idle:3 ql_nodes:430 ql_avg_per_node:511.73
127.0.0.1:6379> llen b
(integer) 220044
Turns out it's a huge improvement during save/reload/migrate/restore
because, with compression enabled, we're compressing 4k or 8k
chunks of data consisting of multiple elements in one ziplist
instead of compressing series of smaller individual elements.
Use the existing memory space for an SDS to convert it to a regular
character buffer so we don't need to allocate duplicate space just
to extract a usable buffer for native operations.
Fill factor now has two options:
- negative (1-5) for size-based ziplist filling
- positive for length-based ziplist filling with implicit size cap.
Negative offsets define ziplist size limits of:
-1: 4k
-2: 8k
-3: 16k
-4: 32k
-5: 64k
Positive offsets now automatically limit their max size to 8k. Any
elements larger than 8k will be in individual nodes.
Positive ziplist fill factors will keep adding elements
to a ziplist until one of:
- ziplist has FILL number of elements
- or -
- ziplist grows above our ziplist max size (currently 8k)
When using positive fill factors, if you insert a large
element (over 8k), that element will automatically allocate
an individual quicklist node with one element and no other elements will be
in the same ziplist inside that quicklist node.
When using negative fill factors, elements up to the size
limit can be added to one quicklist node. If an element
is added larger than the max ziplist size, that element
will be allocated an individual ziplist in a new quicklist node.
Tests also updated to start testing at fill factor -5.
This started out as #2158 by sunheehnus, but I kept rewriting it
until I could understand things more easily and get a few more
correctness guarantees out of the readability flow.
The original commit created and returned a new ziplist with the contents of
both input ziplists, but I prefer to grow one of the input ziplists
and destroy the other one.
So, instead of malloc+copy as in #2158, the merge now reallocs one of
the existing ziplists and copies the other ziplist into the new space.
Also added merge test cases to ziplistTest()
This replaces individual ziplist vs. linkedlist representations
for Redis list operations.
Big thanks for all the reviews and feedback from everybody in
https://github.com/antirez/redis/pull/2143
The previous test wasn't returning the new ziplist, so the test
was invalid. Now the test works properly.
These problems were simultaenously discovered in #2154 and that
PR also had an additional fix we included here.
zipEntry was returning a struct, but that caused some
problems with tests under 32 bit builds.
The tests run better if we operate on structs allocated in the
caller without worrying about copying on return.
Previously, many files had individual main() functions for testing,
but each required being compiled with their own testing flags.
That gets difficult when you have 8 different flags you need
to set just to run all tests (plus, some test files required
other files to be compiled aaginst them, and it seems some didn't
build at all without including the rest of Redis).
Now all individual test main() funcions are renamed to a test
function for the file itself and one global REDIS_TEST define enables
testing across the entire codebase.
Tests can now be run with:
- `./redis-server test <test>`
e.g. ./redis-server test ziplist
If REDIS_TEST is not defined, then no tests get included and no
tests are included in the final redis-server binary.
1. Server unxtime may remain not updated while loading AOF, so ETA is
not updated correctly.
2. Number of processed byte was not initialized.
3. Possible division by zero condition (likely cause of issue #1932).
1. memory leak in t_set.c has been fixed
2. end-of-line spaces has been removed (from all over the place)
3. for loops have been ordered up to match existing Redis style (less weird)
4. comments format has been fixed (added * in the beggining of every comment line)
setTypeRandomElements() now returns unsigned long, and also uses unsigned long for anything related to count of members.
spopWithCountCommand() now uses unsigned long elements_returned instead of int, for values returned from setTypeRandomElements()
If we woke up to accept a connection, but we can't
accept it, inform the user of the error going on
with their networking.
(The previous message was the same for success or error!)
spopCommand() now runs spopWithCountCommand() in case the <count> param is found.
Added intsetRandomMembers() to Intset: Copies N random members from the set into inputted 'values' array. Uses either the Knuth or Floyd sample algos depending on ratio count/size.
Added setTypeRandomElements() to SET type: Returns a number of random elements from a non empty set. This is a version of setTypeRandomElement() that is modified in order to return multiple entries, using dictGetRandomKeys() and intsetRandomMembers().
Added tests for SPOP with <count>: unit/type/set, unit/scripting, integration/aof
--
Cleaned up code a bit to match with required Redis coding style
Otherwise there are security risks, especially when providing Redis as a
service, the user may "sniff" for admin commands renamed to an
unguessable string via rename-command in redis.conf.
Previously the string was created empty then re-sized
to fit the offset, but sds resize causes the sds to
over-allocate by at least 1 MB (which is a lot when
you are operating at bit-level access).
This also improves the speed of initial sets by 2% to 6%
based on quick testing.
Patch logic provided by @oranagra
Fixes#1918
Improvements:
- Return empty string if asking for non-existing section (INFO foo)
- Fix potential memory leak (caused by sdsempty() then returned if >2 args)
- Clean up argument parsing
- Allow "all" as valid section (same as "default" or zero args currently)
- Move strcasecmp to end of evaluation chain in conditionals
Also, since we're C99, I moved some variable declarations to be closer
to where they are actually used (saves us from needing to free an empty info
if detect argument errors up front).
Closes#1915Closes#1966
There is no standard cross-platform way of obtaining
system memory info, but I found a useful function
convering all common platforms. I removed support
for uncommon Redis platforms (windows, AIX) and left
others intact.
For more info, see:
http://nadeausoftware.com/articles/2012/09/c_c_tip_how_get_physical_memory_size_system
The system memory info is cached on startup, but some systems
may be able to change the amount of memory visible to Redis
at runtime if Redis is deployed in a VM or container.
Also see #1820
Slaves key expire is orchestrated by the master. Sometimes the master
will send the synthesized DEL to expire keys on the slave with a non
trivial delay (when the key is not accessed, only the incremental expiry
algorithm will expire it in background).
During that time, a key is logically expired, but slaves still return
the key if you GET (or whatever) it. This is a bad behavior.
However we can't simply trust the slave view of the key, since we need
the master to be able to send write commands to update the slave data
set, and DELs should only happen when the key is expired in the master
in order to ensure consistency.
However 99.99% of the issues with this behavior is when a client which
is not a master sends a read only command. In this case we are safe and
can consider the key as non existing.
This commit does a few changes in order to make this sane:
1. lookupKeyRead() is modified in order to return NULL if the above
conditions are met.
2. Calls to lookupKeyRead() in commands actually writing to the data set
are repliaced with calls to lookupKeyWrite().
There are redundand checks, so for example, if in "2" something was
overlooked, we should be still safe, since anyway, when the master
writes the behavior is to don't care about what expireIfneeded()
returns.
This commit is related to #1768, #1770, #2131.
I guess the initial goal of the initialization was to suppress GCC
warning, but if we have to initialize, we can do it with the base-case
value instead of NULL which is never retained.
in the case (all chars of the string s found in 'cset' ),
line[573] will no more do the same thing line[572] did.
this will be more faster especially in the case that the string s is very long and all chars of string s found in 'cset'
Track bandwidth used by clients and replication (but diskless
replication is not tracked since the actual transfer happens in the
child process).
This includes a refactoring that makes tracking new instantaneous
metrics simpler.
PFCOUNT is technically speaking a write command, since the cached value
of the HLL is exposed in the data structure (design error, mea culpa), and
can be modified by PFCOUNT.
However if we flag PFCOUNT as "w", read only slaves can't execute the
command, which is a problem since there are environments where slaves
are used to scale PFCOUNT reads.
Nor it is possible to just prevent PFCOUNT to modify the data structure
in slaves, since without the cache we lose too much efficiency.
So while this commit allows slaves to create a temporary inconsistency
(the strings representing the HLLs in the master and slave can be
different in certain moments) it is actually harmless.
In the long run this should be probably fixed by turning the HLL into a
more opaque representation, for example by storing the cached value in
the part of the string which is not exposed (this should be possible
with SDS strings).
bulk_data field size was not removed from the count. It is not possible
to declare it simply as 'char bulk_data[]' since the structure is nested
into another structure.
Because of (not so) recent Redis changes, now the LRU internally
reported unit is milliseconds, not seconds, but the DEBUG OBJECT output
was still claiming seconds while providing milliseconds.
However OBJECT IDLETIME was working as expected, which is the correct
API to use.
zmalloc(0) cauesd to actually trigger a non-zero allocation since with
standard libc malloc we have our own zmalloc header for memory tracking,
but at the same time the returned pointer is at the end of the block and
not in the middle. This triggers a false positive when testing with
valgrind.
When the inline protocol args count is 0, we now avoid reallocating
c->argv, preventing the issue to happen.
Issue: #2157
As the SET command is parsed, it remembers which options are already set
and if a duplicate option is found, raises an error because it is
essentially an invalid syntax.
It still allows mutually exclusive options like EX and PX because taking
an option over another (precedence) is not essentially a syntactic
error.
Sentinel queries the INFO from every master and from every replica of
every master.
We can cache the INFO results in Sentinel so Sentinel can be a single
place to quickly get all INFO output for an entire Sentinel monitoring
group.
This commit gives us SENTINEL INFO-CACHE in two forms:
- SENTINEL INFO-CACHE — returns all masters and all replicas
- SENTINEL INFO-CACHE master0 master1 ... masterN — vararg specify masters
Results are returned as a multibulk reply with two top-level entries
for each master. The first entry for each master is the name of the master.
The second entry is a nested multibulk reply with the contents of INFO,
first for the master, then an additional entry for each of the
replicas.
RDB EOF detection was relying on the final part of the RDB transfer to
be a magic 40 bytes EOF marker. However as the slave is put online
immediately, and because of sockets timeouts, the replication stream is
actually contiguous with the RDB file.
This means that to detect the EOF correctly we should either:
1) Scan all the stream searching for the mark. Sucks CPU-wise.
2) Start to send the replication stream only after an acknowledge.
3) Implement a proper chunked encoding.
For now solution "2" was picked, so the master does not start to send
ASAP the stream of commands in the case of diskless replication. We wait
for the first REPLCONF ACK command from the slave, that certifies us
that the slave correctly loaded the RDB file and is ready to get more
data.
Both upstart and systemd provide a way for daemons to
be supervised, as well as a mechanism for them to
signal their readyness status.
This patch provides compatibility with this functionality while
not interfering with other methods.
With this, it will be possible to use `expect stop` with upstart
and `Type=notify` with systemd.
A more detailed explanation of the mechanism can be found here:
http://spootnik.org/entries/2014/11/09_pid-tracking-in-modern-init-systems.html
if redis works in cluster-mode and redis-cli was run with argv, reconnect if needs.
example:
./redis-cli set foo bar
if return is MOVED redis-cli just do nothing.
Same as the original bind fixes (we just missed these the
first time around).
This helps Redis not automatically send
connections from the first IP on an interface if we are bound
to a specific IP address (e.g. with multiple IP aliases on one
interface, you want to send from _your_ IP, not from the first IP
on the interface).
People mostly use SORT against lists, but our prior
behavior was pretending lists were an unordered bag
requiring a forced-sort when no sort was requested.
We can just use the native list ordering to ensure
consistency across replicaion and scripting calls.
Closes#2079Closes#545 (again)
This caused BGSAVE to be triggered a second time without any need when
we switch from socket to disk target via the command
CONFIG SET repl-diskless-sync no
and there is already a slave waiting for the BGSAVE to start.
Also comments clarified about what is happening.
EWOULDBLOCK with the fdset rio target is returned when we try to write
but the send timeout socket option triggered an error. Better to
translate the error in something the user can actually recognize as a
timeout.
We need to avoid that a child -> slaves transfer can continue forever.
We use the same timeout used as global replication timeout, which is
documented to also affect I/O operations during bulk transfers.
To perform a socket write() for each RDB rio API write call was
extremely unefficient, so now rio has minimal buffering capabilities.
Writes are accumulated into a buffer and only when a given limit is
reacehd are actually wrote to the N slaves FDs.
Trivia: rio lacked support for buffering since our targets were:
1) Memory buffers.
2) C standard I/O.
Both were buffered already.
This is useful for normal replication in order to refresh the slave
when we are persisting on disk, but for diskless replication the
child is already receiving data while in WAIT_BGSAVE_END state.
If we turn from diskless to disk-based replication via CONFIG SET, we
need a way to start a BGSAVE if there are slaves alerady waiting for a
BGSAVE to start. Normally with disk-based replication we do it as soon
as the previous child exits, but when there is a configuration change
via CONFIG SET, we may have slaves in WAIT_BGSAVE_START state without
an RDB background process currently active.
Fdset target is used when we want to write an RDB file directly to
slave's sockets. In this setup as long as there is a single slave that
is still receiving our payload, we want to continue sennding instead of
aborting. However rio calls should abort of no FD is ok.
Also we want the errors reported so that we can signal the parent who is
ok and who is broken, so there is a new set integers with the state of
each fd. Zero is ok, non-zero is the errno of the failure, if avaialble,
or a generic EIO.
A few people have written custom C commands because bit
manipulation isn't exposed through Lua. Let's give
them Mike Pall's bitop.
This adds bitop 1.0.2 (2012-05-08) from http://bitop.luajit.org/
bitop is imported as "bit" into the global namespace.
New Lua commands: bit.tobit, bit.tohex, bit.bnot, bit.band, bit.bor, bit.bxor,
bit.lshift, bit.rshift, bit.arshift, bit.rol, bit.ror, bit.bswap
Verification of working (the asserts would abort on error, so (nil) is correct):
127.0.0.1:6379> eval "assert(bit.tobit(1) == 1); assert(bit.band(1) == 1); assert(bit.bxor(1,2) == 3); assert(bit.bor(1,2,4,8,16,32,64,128) == 255)" 0
(nil)
127.0.0.1:6379> eval 'assert(0x7fffffff == 2147483647, "broken hex literals"); assert(0xffffffff == -1 or 0xffffffff == 2^32-1, "broken hex literals"); assert(tostring(-1) == "-1", "broken tostring()"); assert(tostring(0xffffffff) == "-1" or tostring(0xffffffff) == "4294967295", "broken tostring()")' 0
(nil)
Tests also integrated into the scripting tests and can be run with:
./runtest --single unit/scripting
Tests are excerpted from `bittest.lua` included in the bitop distribution.
With the exception of nodes sending MEET packets: we have to trust them
since they can send us MEET packets only when the cluster is initially
created or because sysadmin manual action.
In the cluster evaluation function we are supposed to set the cluster
state as "fail" if we are among a minority, however the code was not
detecting to be into a minority partition if exactly half the masters
were reachable, which is a minority.
We need to remember what is the saving strategy of the current RDB child
process, since the configuration may be modified at runtime via CONFIG
SET and still we'll need to understand, when the child exists, what to
do and for what goal the process was initiated: to create an RDB file
on disk or to write stuff directly to slave's sockets.
However we don't try to do this if the integer is already inside a range
representable with a shared integer.
The performance gain appears to be around ~15% in micro benchmarks,
however in the long run this also helps to improve locality, so should
have more, hard to measure, benefits.
Some language in the comment was difficult
to understand, so this commit: clarifies wording, removes
unnecessary words, and relocates some dependent clauses
closer to what they actually describe.
I also tried to break up longer chains of thought
(if X, then Y, and Q, and also F, so obviously M)
into more manageable chunks for ease of understanding.
- Remove trailing newlines from redis.conf
- Fix comment misspelling
- Clarifies zipEncodeLength usage and a C API mention (#1243, #1242)
- Fix cluster typos (inspired by @papanikge #1507)
- Fix rewite -> rewrite in a few places (inspired by #682)
Closes#1243, #1242, #1507
The old DEBUG POPULATE form for automatic creation of test keys is:
DEBUG POPULATE <count>
Now an additional form is available:
DEBUG POPULATE <count> <prefix>
When prefix is not specified, it defaults to "key", so the keys are
named incrementally from key:0 to key:<count-1>. Otherwise the specified
prefix is used instead of "key".
The command is useful in order to populate different Redis instances
with key names guaranteed to don't collide. There are other debugging
uses, for example it is possible to add additional N keys using a count
of N and a random prefix at every call.
Following the CLIENT LIST output format, we prefix the unix socket
address with a "/" so that it is different than an IPv4/6 address.
This makes parsing simpler.
Related to #2010.
This fixes a potential bug that was never observed in practice since
what happens is that the asynchronous connect returns ok (to fail later,
calling the handler) every time, so a ping is queued, and sent_ping
happens to always be populated.
Howver technically connect(2) with a non blocking socket may return an
error synchronously, so before this fix the code was not correct.
It is not clear if files open in append only mode will automatically fix
their offset after a truncate(2) operation. This commit makes sure that
we reposition the AOF file descriptor offset at the end of the file
after a truncated AOF is loaded and trimmed to the last valid command.
Recently we introduced the ability to load truncated AOFs, but
unfortuantely the support was broken since the server, after loading the
truncated AOF, continues appending to the file that is corrupted at the
end. The problem is fixed only in the next AOF rewrite.
This commit fixes the issue by truncating the AOF to the last valid
opcode, and aborting if it is not possible to truncate the file
correctly.
The code to check the number of voters was never updated to follow the new
Sentinel specification, so the number of voters was computed using only
the set of Sentinels that provided a vote.
This means that there is a changing majority on partitions, even if
usually the issue is not triggered because of the configured quorum
check (what was broken was the other implicit check that requires anyway
half of the known sentinels to agree in order to start a failover).
Because of the new ability to start with a truncated AOF, we need
to correctly release all the memory on EOF error. Otherwise there is a
small leak, that is not really a problem, but causes a false positive in
the tests that detect memory leaks.
The original implementation was modified in order to allow to
selectively announce a different IP or port, and to rewrite the two
options in the config file after a rewrite.
Previously, GETRANGE of a key containing nothing ("")
would allocate a large (size_t)-1 return value causing
crashes on 32bit builds when it tried to allocate the
4 GB return string.
32 bit builds don't have a big enough long to capture
the same range as a 64 bit build. If we use "long long"
we get proper size limits everywhere.
Also updates size of unsigned comparison to fit new size of `end`.
Fixes#1981
This allows to support datasets with more than 2 billion of keys
(possible in very large memory instances, this bug was actually
reported).
Closes issue #1814.
This just deletes old code that didn't get removed when
logic changed. We were setting offsets that never
got read anywhere.
Since clients are now just cloned, we don't need to track
per-client buffer offsets anywhere because they are all
the same from the original client.
This commit adds a size check after initial config
line parsing to make sure we have *at least* 8 arguments
per line.
Also, instead of asserting for cluster->myself, we just test
and error out normally (since the error does a hard exit anyway).
Closes#1597
For some Solaris flavours, the inet_aton in in resolv library.
Not linking this library will introduce link error.
Improves compatability with older Solaris and still
works on new Solaris.
Closes#1092
The funciton was also modified in order to be more standalone and
produce an output without trailing spaces, making the reuse simpler.
The global variable was renamed in cammel case as most other Redis
globals, except the main ones we refer too many times, like 'server'.
Found by The Mayhem Team (Alexandre Rebert, Thanassis Avgerinos,
Sang Kil Cha, David Brumley, Manuel Egele) Cylab, Carnegie Mellon
University. See http://bugs.debian.org/716259 for more.
Signed-off-by: Chris Lamb <lamby@debian.org>
Fixes#1191
Previously, "MOVE key somestring" would move the key to
DB 0 which is just unexpected and wrong.
String as DB == error.
Test added too.
Modified by @antirez in order to use the getLongLongFromObject() API
instead of strtol().
Fixes#1428
Also adds test for numsub — due to tcl being tcl,
it doesn't capture the "numberness" of the fix,
but now we at least have one test case for numsub.
Closes#1561
Modified by @antirez since the original fix to genInfoString() looked
weak. Probably the clang analyzer complained about `section` being
possibly NULL, and strcasecmp() called with a NULL pointer. In the
practice this can never happen, still for the sake of correctness
the right fix is not to modify only the first call, but to set `section`
to the value of "default" if it happens to be NULL.
Closes#1660
We only want to use the last STORE key, but we have to record
we actually found a STORE key so we can increment the final return
key count.
Test added to prevent further regression.
Closes#1883, #1645, #1647
Previously the end was casted to a smaller type
which resulted in a wrong check and failed
with values larger than handled by unsigned.
Closes#1847, #1844
Cluster leaks memory while connecting due to missing freeaddrinfo()
(Commit modified by @antirez. The freeaddrinfo() call was misplaced so
in case of no address was bound, the memory leak was still there).
Closes#1801
Previously redis-cli would happily show "-1" or "99999"
as valid DB choices.
Now, if the SELECT call returned an error, we don't update
the DB number in the CLI.
Inspired by @anupshendkar in #1313Fixes#566, #1313
Previously, if you did SELECT then AUTH, redis-cli
would show your SELECT'd db even though it didn't
happen.
Note: running into this situation is a (hopefully) very limited
used case of people using multiple DBs and AUTH all at the same
time.
Fixes antirez#1639
Replica migration algorithm modified so that slaves never try to migrate
to masters that were never configured to have slaves in the past.
We want the algorithm to take care of masters that remained without
*working* slaves, but that used to have slaves according to the cluster
configuration.
Based on ideas documented in this blog post:
https://www.facebook.com/notes/facebook-engineering/three-optimization-tips-for-c/10151361643253920
The original code was modified to handle signed integers, reformetted to
fit inside the Redis code base, and was stress-tested with a program
in order to validate the implementation against snprintf().
Redis was measured to be measurably faster from the point of view of
clients in real-world operations because of this change, since sometimes
number to string conversion is used extensively (for example every time
a GET results into an integer encoded object to be returned to the
user).
This is just a quickfix, for the nature of the test the right way to fix
it is to average the error of N runs, since otherwise it is always
possible to get a false positive with a bad run, or to minimize too much
this possibility we may end testing with too much "large" error ranges.
The user @kjmph provided excellent ideas to improve speed of ZUNIONSTORE
(in certain cases by many order of magnitude), together with an
implementation of the ideas.
While the ideas were sounding, the implementation could be improved both
in terms of speed and clearness, so that's my attempt at reimplementing
the speedup proposed, trying to improve by directly using just a
dictionary with an embedded score inside, and reusing the single-pass
aggregate + order-later approach.
Note that you can't apply this commit without applying the previous
commit in this branch that adds a double in the dictEntry value union.
Issue #1786.
For non-empty masters, CLUSTER RESET is denied, and the user requires to
start to reset a node by explicitly clearing it with FLUSHALL.
However CLUSTER RESET when executed with slaves don't have this
restrictions since data is just a replica of the master, and with
read-only slaves it is also not possible to remove the data set. However
the node was turned from slave to master after a reset, without touching
the old slave data. This is 99.99% of times not appropriate and forces
full resets to follow this path to work with both slave and master
nodes:
FLUSHALL
CLUSTER RESET HARD
FLUSHALL
Since we need the first flushall for masters, and the second for slaves.
This commit changes the behavior so that CLUSTER RESET removes the data set
of a slave node during a reset, in the moment it gets turned into a master,
so the new pattern is simply:
FLUSHALL (that may fail for slaves)
CLUSTER RESET
The introduction of --from --to --slots --yes options allow to reshard
from cli in an automated way from scripts. The code is ugly and needs
refactoring as soon as we get it in RC / stable release.
In order to make sure every object has its own private LRU counter, when
maxmemory is enabled tryObjectEncoding() does not use the pool of shared
integers. However when the policy is not LRU-based, it does not make
sense to do so, and it is much better to save memory using shared
integers.
PING can now be called with an additional arugment, behaving exactly
like the ECHO command. PING can now also be called in Pub/Sub mode (with
one more more subscriptions to channels / patterns) in order to trigger
the delivery of an asynchronous pong message with the optional payload.
This fixes issue #420.
The code tested many times if a client had active Pub/Sub subscriptions
by checking the length of a list and dictionary where the patterns and
channels are stored. This was substituted with a client flag called
REDIS_PUBSUB that is simpler to test for. Moreover in order to manage
this flag some code was refactored.
This commit is believed to have no effects in the behavior of the
server.
Previously, the command definition for the OBJECT command specified
a minimum of two args (and that it was variadic), which meant that
if you sent this:
OBJECT foo
When cluster was enabled, it would result in an assertion/SEGFAULT
when Redis was attempting to extract keys.
It appears that OBJECT is not variadic, and only ever takes 3 args.
https://gist.github.com/michael-grunder/25960ce1508396d0d36a
We now wait up to 1 second for diff data to come from the parent,
however we use poll(2) to wait for more data, and use a counter of
contiguous failures to get data for N times (set to 20 experimentally
after different tests) as an early stop condition to avoid wasting 1
second when the write traffic is too low.
We introduce the distinction between slow and fast commands since those
are two different sources of latency. An O(1) or O(log N) command without
side effects (can't trigger deletion of large objects as a side effect of
its execution) if delayed is a symptom of inherent latency of the system.
A non-fast command (commands that may run large O(N) computations) if
delayed may just mean that the user is executing slow operations.
The advices LATENCY should provide in this two different cases are
different, so we log the two classes of commands in a separated way.
This fixes detection of wrong subcommand (that resulted in the default
all-commands output instead) and allows COMMAND INFO to be called
without arguments (resulting into an empty array) which is useful in
programmtically generated calls like the following (in Ruby):
redis.commands("command","info",*mycommands)
Note: mycommands may be empty.
Static was removed since it is needed in order to get symbols in stack
traces. Minor changes in the source code were operated to make it more
similar to the existing Redis code base.
COMMANDS returns a nested multibulk reply for each
command in the command table. The reply for each
command contains:
- command name
- arity
- array of command flags
- start key position
- end key position
- key offset step
- optional: if the keys are not deterministic and
Redis uses an internal key evaluation function,
the 6th field appears and is defined as a status
reply of: REQUIRES ARGUMENT PARSING
Cluster clients need to know where the keys are in each
command to implement proper routing to cluster nodes.
Redis commands can have multiple keys, keys at offset steps, or other
issues where you can't always assume the first element after
the command name is the cluster routing key.
Using the information exposed by COMMANDS, client implementations
can have live, accurate key extraction details for all commands.
Also implements COMMANDS INFO [commands...] to return only a
specific set of commands instead of all 160+ commands live in Redis.
Technically the problem is due to the client type API that does not
return a special value for the master, however fixing it locally in the
CLIENT KILL command is better currently because otherwise we would
introduce a new output buffer limit class as a side effect.
From mailing list post https://groups.google.com/forum/#!topic/redis-db/D3k7KmJmYgM
In the file “config.h”, the definition HAVE_ATOMIC is used to indicate
if an architecture on which redis is implemented supports atomic
synchronization primitives. Powerpc supports atomic synchronization
primitives, however, it is not listed as one of the architectures
supported in config.h. This patch adds the __powerpc__ to the list of
architectures supporting these primitives. The improvement of redis
due to the atomic synchronization on powerpc is significant,
around 30% to 40%, over the default implementation using pthreads.
This proposal adds __powerpc__ to the list of architectures designated
to support atomic builtins.
From mailing list post https://groups.google.com/forum/#!topic/redis-db/QLjiQe4D7LA
In zmalloc.c the following primitives are currently used
to synchronize access to single global variable:
__sync_add_and_fetch
__sync_sub_and_fetch
In some architectures such as powerpc these primitives are overhead
intensive. More efficient C11 __atomic builtins are available with
newer GCC versions, see
http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/_005f_005fatomic-Builtins.html#_005f_005fatomic-Builtins
By substituting the following __atomic… builtins:
__atomic_add_fetch
__atomic_sub_fetch
the performance improvement on certain architectures such as powerpc can be significant,
around 10% to 15%, over the implementation using __sync builtins while there is only slight uptick on
Intel architectures because it was already enforcing Intel Strongly ordered memory semantics.
The selection of __atomic built-ins can be predicated on the definition of ATOMIC_RELAXED
which Is available on in gcc 4.8.2 and later versions.
While we have to output failing masters in order to provide an accurate
map (that may be the one of a Redis Cluster in down state because not
all slots are served by a working master), to provide slaves in FAIL
state is not a good idea since those are not necesarely needed, and the
client will likely incur into a latency penalty trying to connect with a
slave which is down.
Note that this means that CLUSTER SLOTS does not provide a *complete*
map of slaves, however this would not be of any help since slaves may be
added later, and a client that needs to scale reads and requires to
stay updated with the list of slaves, need to do a refresh of the map
from time to time, anyway.
CLUSTER SLOTS returns a Redis-formatted mapping from
slot ranges to IP/Port pairs serving that slot range.
The outer return elements group return values by slot ranges.
The first two entires in each result are the min and max slots for the range.
The third entry in each result is guaranteed to be either
an IP/Port of the master for that slot range - OR - null
if that slot range, for some reason, has no master
The 4th and higher entries in each result are replica instances
for the slot range.
Output comparison:
127.0.0.1:7001> cluster nodes
f853501ec8ae1618df0e0f0e86fd7abcfca36207 127.0.0.1:7001 myself,master - 0 0 2 connected 4096-8191
5a2caa782042187277647661ffc5da739b3e0805 127.0.0.1:7005 slave f853501ec8ae1618df0e0f0e86fd7abcfca36207 0 1402622415859 6 connected
6c70b49813e2ffc9dd4b8ec1e108276566fcf59f 127.0.0.1:7007 slave 26f4729ca0a5a992822667fc16b5220b13368f32 0 1402622415357 8 connected
2bd5a0e3bb7afb2b56a2120d3fef2f2e4333de1d 127.0.0.1:7006 slave 32adf4b8474fdc938189dba00dc8ed60ce635b0f 0 1402622419373 7 connected
5a9450e8279df36ff8e6bb1c139ce4d5268d1390 127.0.0.1:7000 master - 0 1402622418872 1 connected 0-4095
32adf4b8474fdc938189dba00dc8ed60ce635b0f 127.0.0.1:7002 master - 0 1402622419874 3 connected 8192-12287
5db7d05c245267afdfe48c83e7de899348d2bdb6 127.0.0.1:7004 slave 5a9450e8279df36ff8e6bb1c139ce4d5268d1390 0 1402622417867 5 connected
26f4729ca0a5a992822667fc16b5220b13368f32 127.0.0.1:7003 master - 0 1402622420877 4 connected 12288-16383
127.0.0.1:7001> cluster slots
1) 1) (integer) 0
2) (integer) 4095
3) 1) "127.0.0.1"
2) (integer) 7000
4) 1) "127.0.0.1"
2) (integer) 7004
2) 1) (integer) 12288
2) (integer) 16383
3) 1) "127.0.0.1"
2) (integer) 7003
4) 1) "127.0.0.1"
2) (integer) 7007
3) 1) (integer) 4096
2) (integer) 8191
3) 1) "127.0.0.1"
2) (integer) 7001
4) 1) "127.0.0.1"
2) (integer) 7005
4) 1) (integer) 8192
2) (integer) 12287
3) 1) "127.0.0.1"
2) (integer) 7002
4) 1) "127.0.0.1"
2) (integer) 7006
Instead of having an hardcoded IP address in the node configuration, we
autodiscover it via MEET messages for automatic update when the node is
restarted with a different IP address.
This mechanism was discussed in the context of PR #1782.
Some deployments need traffic sent from a specific address. This
change uses the same policy as Cluster where the first listed bindaddr
becomes the source address for outgoing Sentinel communication.
Fixes#1667
Eventual configuration convergence is guaranteed by our periodic hello
messages to all the instances, however when there are important notices
to share, better make a phone call. With this commit we force an hello
message to other Sentinal and Redis instances within the next 100
milliseconds of a config update, which is practically better than
waiting a few seconds.
Lack of check of the SRI_PROMOTED flag caused Sentienl to act with the
promoted slave turned into a master during failover like if it was a
normal instance.
Normally this problem was not apparent because during real failovers the
old master is down so the bugged code path was not entered, however with
manual failovers via the SENTINEL FAILOVER command, the problem was
easily triggered.
This commit prevents promoted slaves from getting reconfigured, moreover
we now explicitly check that during a failover the slave turning into a
master is the one we selected for promotion and not a different one.
This implements the new Sentinel-Client protocol for the Sentinel part:
now instances are reconfigured using a transaction that ensures that the
config is rewritten in the target instance, and that clients lose the
connection with the instance, in order to be forced to: ask Sentinel,
reconnect to the instance, and verify the instance role with the new
ROLE command.
This will be used by CLIENT KILL and is also a good way to ensure a
given client is still the same across CLIENT LIST calls.
The output of CLIENT LIST was modified to include the new ID, but this
change is considered to be backward compatible as the API does not imply
you can do positional parsing, since each filed as a different name.
Because of output buffer limits Redis internals had this idea of type of
clients: normal, pubsub, slave. It is possible to set different output
buffer limits for the three kinds of clients.
However all the macros and API were named after output buffer limit
classes, while the idea of a client type is a generic one that can be
reused.
This commit does two things:
1) Rename the API and defines with more general names.
2) Change the class of clients executing the MONITOR command from "slave"
to "normal".
"2" is a good idea because you want to have very special settings for
slaves, that are not a good idea for MONITOR clients that are instead
normal clients even if they are conceptually slave-alike (since it is a
push protocol).
The backward-compatibility breakage resulting from "2" is considered to
be minimal to care, since MONITOR is a debugging command, and because
anyway this change is not going to break the format or the behavior, but
just when a connection is closed on big output buffer issues.
Lua scripts are executed in the context of the currently selected
database (as selected by the caller of the script).
However Lua scripts are also free to use the SELECT command in order to
affect other DBs. When SELECT is called frm Lua, the old behavior, before
this commit, was to automatically set the Lua caller selected DB to the
last DB selected by Lua. See for example the following sequence of
commands:
SELECT 0
SET x 10
EVAL "redis.call('select','1')" 0
SET x 20
Before this commit after the execution of this sequence of commands,
we'll have x=10 in DB 0, and x=20 in DB 1.
Because of the problem above, there was a bug affecting replication of
Lua scripts, because of the actual implementation of replication. It was
possible to fix the implementation of Lua scripts in order to fix the
issue, but looking closely, the bug is the consequence of the behavior
of Lua ability to set the caller's DB.
Under the old semantics, a script selecting a different DB, has no simple
ways to restore the state and select back the previously selected DB.
Moreover the script auhtor must remember that the restore is needed,
otherwise the new commands executed by the caller, will be executed in
the context of a different DB.
So this commit fixes both the replication issue, and this hard-to-use
semantics, by removing the ability of Lua, after the script execution,
to force the caller to switch to the DB selected by the Lua script.
The new behavior of the previous sequence of commadns is to just set
X=20 in DB 0. However Lua scripts are still capable of writing / reading
from different DBs if needed.
WARNING: This is a semantical change that will break programs that are
conceived to select the client selected DB via Lua scripts.
This fixes issue #1811.
It is more straightforward to just test for a numerical type avoiding
Lua's automatic conversion. The code is technically more correct now,
however Lua should automatically convert to number only if the original
type is a string that "looks like a number", and not from other types,
so practically speaking the fix is identical AFAIK.
The new check-for-number behavior of Lua arguments broke
users who use large strings of just integers.
The Lua number check would convert the string to a number, but
that breaks user data because
Lua numbers have limited precision compared to an arbitrarily
precise number wrapped in a string.
Regression fixed and new test added.
Fixes#1118 again.
The new ROLE command is designed in order to provide a client with
informations about the replication in a fast and easy to use way
compared to the INFO command where the same information is also
available.
Since there are ways to alter the configEpoch outside of the failover
procedure (for exampel CLUSTER SET-CONFIG-EPOCH and via the configEpoch
collision resolution algorithm), make always sure, before replacing our
configEpoch with a new one, that it is greater than the current one.
SET-CONFIG-EPOCH, used by redis-trib at cluster creation time, failed to
update the currentEpoch, making it possible after a failover for a
server to set its configEpoch to a value smaller than the current one
(since configEpochs are obtained using currentEpoch).
The bug totally break the Redis Cluster algorithms and protocols
allowing for permanent split brain conditions about the slots
configuration as shown in issue #1799.
I'm not sure if while the visibility is the inner block, the fact we
point to 'dbuf' is a problem or not, probably the stack var isx
guaranteed to live until the function returns. However obvious code is
better anyway.
The lua_to*string() family of functions use a non optimal format
specifier when converting integers to strings. This has both the problem
of the number being converted in exponential notation, which we don't
use as a Redis return value when floating point numbers are involed,
and, moreover, there is a loss of precision since the default format
specifier is not able to represent numbers that must be represented
exactly in the IEEE 754 number mantissa.
The new code handles it as a special case using a saner conversion.
This fixes issue #1118.
If we are in the signal handler, we don't want to handle
the signal again. In extreme cases, this can cause a stack overflow
and segfault Redis.
Fixes#1771
There is a time defined by REDIS_CLUSTER_WRITABLE_DELAY where fail -> ok
switch is not possible after startup as a master for some time, however
the contrary (ok -> fail) should always be possible.
Every log contains, just after the pid, a single character that provides
information about the role of an instance:
S - Slave
M - Master
C - Writing child
X - Sentinel
Behrad Zari discovered [1] and Josiah reported [2]: if you block
and wait for a list to exist, but the list creates from
a non-push command, the blocked client never gets notified.
This commit adds notification of blocked clients into
the DB layer and away from individual commands.
Lists can be created by [LR]PUSH, SORT..STORE, RENAME, MOVE,
and RESTORE. Previously, blocked client notifications were
only triggered by [LR]PUSH. Your client would never get
notified if a list were created by SORT..STORE or RENAME or
a RESTORE, etc.
Blocked client notification now happens in one unified place:
- dbAdd() triggers notification when adding a list to the DB
Two new tests are added that fail prior to this commit.
All test pass.
Fixes#1668
[1]: https://groups.google.com/forum/#!topic/redis-db/k4oWfMkN1NU
[2]: #1668