* fail the test (exit code) in case of timeout.
* add --wait-server to allow attaching a debugger
* add --dont-clean to keep log files when tests are done
RESTORE now supports:
1. Setting LRU/LFU
2. Absolute-time TTL
Other related changes:
1. RDB loading will not override LRU bits when RDB file
does not contain the LRU opcode.
2. RDB loading will not set LRU/LFU bits if the server's
maxmemory-policy does not match.
Removing the fix about 50% of the times the test will not be able to
pass cleanly. It's very hard to write a test that will always fail, or
actually, it is possible but then it's likely that it will consistently
pass if we change some random bit, so better to use randomization here.
problems fixed:
* failing to read fragmentation information from jemalloc
* overflow in jemalloc fragmentation hint to the defragger
* test suite not triggering eviction after population
other fixes / improvements:
- LUA script memory isn't taken from zmalloc (taken from libc malloc)
so it can cause high fragmentation ratio to be displayed (which is false)
- there was a problem with "fragmentation" info being calculated from
RSS and used_memory sampled at different times (now sampling them together)
other details:
- adding a few more allocator info fields to INFO and MEMORY commands
- improve defrag test to measure defrag latency of big keys
- increasing the accuracy of the defrag test (by looking at real grag info)
this way we can use an even lower threshold and still avoid false positives
- keep the old (total) "fragmentation" field unchanged, but add new ones for spcific things
- add these the MEMORY DOCTOR command
- deduct LUA memory from the rss in case of non jemalloc allocator (one for which we don't "allocator active/used")
- reduce sampling rate of the rss and allocator info
After checking with the community via Twitter (here:
https://twitter.com/antirez/status/915130876861788161) the verdict was to
use ":". However I later realized, after users lamented the fact that
it's hard to copy IDs just with double click, that this was the reason
why I moved to "." in the first instance. Fortunately "-", that was the
other option with most votes, also gets selected with double click on
most terminal applications on Linux and MacOS.
So my reasoning was:
1) We can't retain "." because it's actually confusing to newcomers, it
looks like a floating number, people may be tricked into thinking they
can order IDs numerically as floats.
2) Moving to a double-click-to-select format is much better. People will
work with such IDs for long time when coding / debugging. Why making now
a choice that will impact this for the next years?
The only other viable option was "-", and that's what I did. Thanks.
getLongLongFromObject calls string2ll which has this line:
/* Return if not all bytes were used. */
so if you pass an sds with 3 characters "1\01" it will fail.
but getLongDoubleFromObject calls strtold, and considers it ok if eptr[0]==`\0`
i.e. if the end of the string found by strtold ends with null terminator
127.0.0.1:6379> set a 1
OK
127.0.0.1:6379> setrange a 2 2
(integer) 3
127.0.0.1:6379> get a
"1\x002"
127.0.0.1:6379> incrbyfloat a 2
"3"
127.0.0.1:6379> get a
"3"
This commit closes issue #3698, at least for now, since the root cause
was not fixed: the bounding box function, for huge radiuses, does not
return a correct bounding box, there are points still within the radius
that are left outside.
So when using GEORADIUS queries with radiuses in the order of 5000 km or
more, it was possible to see, at the edge of the area, certain points
not correctly reported.
Because the bounding box for now was used just as an optimization, and
such huge radiuses are not common, for now the optimization is just
switched off when the radius is near such magnitude.
Three test cases found by the Continuous Integration test were added, so
that we can easily trigger the bug again, both for regression testing
and in order to properly fix it as some point in the future.
Experimentally verified that it can trigger the issue reverting the fix.
At least on my system... Being the bug time/backlog dependant, it is
very hard to tell if this test will be able to trigger the problem
consistently, however even if it triggers the problem once in a while,
we'll see it in the CI environment at http://ci.redis.io.
Apparently 1.4 is too low compared to what you get in certain setups
(including mine). I raised it to 1.55 that hopefully is still enough to
test that the fragmentation went down from 1.7 but without incurring in
issues, however the test setup may be still fragile so certain times this
may lead to false positives again, it's hard to test for these things
in a determinsitic way.
Related to #3786.
And many other related Github issues... all reporting the same problem.
There was probably just not enough backlog in certain unlucky runs.
I'll ask people that can reporduce if they see now this as fixed as
well.
Testing with Solaris C compiler (SunOS 5.11 11.2 sun4v sparc sun4v)
there were issues compiling due to atomicvar.h and running the
tests also failed because of "tail" usage not conform with Solaris
tail implementation. This commit fixes both the issues.
Slow systems like the original Raspberry PI need more time
than 5 seconds to start the script and detect writes.
After fixing the Raspberry PI can pass the unit without issues.
The test now uses more diverse radius sizes, especially sizes near or
greater the whole earth surface are used, that are known to trigger edge
cases. Moreover the PRNG seeding was probably resulting into the same
sequence tested over and over again, now seeding unsing the current unix
time in milliseconds.
Related to #3631.
This actually includes two changes:
1) No newlines to take the master-slave link up when the upstream master
is down. Doing this is dangerous because the sub-slave often is received
replication protocol for an half-command, so can't receive newlines
without desyncing the replication link, even with the code in order to
cancel out the bytes that PSYNC2 was using. Moreover this is probably
also not needed/sane, because anyway the slave can keep serving
requests, and because if it's configured to don't serve stale data, it's
a good idea, actually, to break the link.
2) When a +CONTINUE with a different ID is received, we now break
connection with the sub-slaves: they need to be notified as well. This
was part of the original specification but for some reason it was not
implemented in the code, and was alter found as a PSYNC2 bug in the
integration testing.
This is the PSYNC2 test that helped find issues in the code, and that
still can show a protocol desync from time to time. Work is in progress
in order to find the issue. For now the test is not enabled in "make
test" and must be run manually.
By grepping the continuous integration errors log a number of GEORADIUS
tests failures were detected.
Fortunately when a GEORADIUS failure happens, the test suite logs enough
information in order to reproduce the problem: the PRNG seed,
coordinates and radius of the query.
By reproducing the issues, three different bugs were discovered and
fixed in this commit. This commit also improves the already good
reporting of the fuzzer and adds the failure vectors as regression
tests.
The issues found:
1. We need larger squares around the poles in order to cover the area
requested by the user. There were already checks in order to use a
smaller step (larger squares) but the limit set (+/- 67 degrees) is not
enough in certain edge cases, so 66 is used now.
2. Even near the equator, when the search area center is very near the
edge of the square, the north, south, west or ovest square may not be
able to fully cover the specified radius. Now a test is performed at the
edge of the initial guessed search area, and larger squares are used in
case the test fails.
3. Because of rounding errors between Redis and Tcl, sometimes the test
signaled false positives. This is now addressed.
Whenever possible the original code was improved a bit in other ways. A
debugging example stanza was added in order to make the next debugging
session simpler when the next bug is found.
During the initial handshake with the master a slave will report to have
a very high disconnection time from its master (since technically it was
disconnected since forever, so the current UNIX time in seconds is
reported).
However when the slave is connected again the Sentinel may re-scan the
INFO output again only after 10 seconds, which is a long time. During
this time Sentinels will consider this instance unable to failover, so
a useless delay is introduced.
Actaully this hardly happened in the practice because when a slave's
master is down, the INFO period for slaves changes to 1 second. However
when a manual failover is attempted immediately after adding slaves
(like in the case of the Sentinel unit test), this problem may happen.
This commit changes the INFO period to 1 second even in the case the
slave's master is not down, but the slave reported to be disconnected
from the master (by publishing, last time we checked, a master
disconnection time field in INFO).
This change is required as a result of an unrelated change in the
replication code that adds a small delay in the master-slave first
synchronization.
The test works but is very slow so far, since it involves resharding
1/5 of all the cluster slots from master 0 to the other 4 masters and
back into the original master.
Now elements added to lists are incremental numbers in order to
understand, when inconsistencies are found, what is the order in which
the elements were added. Also the error now provides both the expected
and found value.
This fix, provided by Paul Kulchenko (@pkulchenko), allows the Lua
scripting engine to evaluate statements with a trailing comment like the
following one:
EVAL "print() --comment" 0
Lua can't parse the above if the string does not end with a newline, so
now a final newline is always added automatically. This does not change
the SHA1 of scripts since the SHA1 is computed on the body we pass to
EVAL, without the other code we add to register the function.
Close#2951.
It's a key invariant that when AOF is enabled, after the cluster
reshards, a crash-recovery event causes all the keys to be still fine
with the expected logical content. Now this is part of unit 04.
The old version was modeled with two failovers, however after the first
it is possible that another slave will migrate to the new master, since
for some time the new master is not backed by any slave. Probably there
should be some pause after a failover, before the migration. Anyway the
test is simpler in this way, and depends less on timing.
64 bit double math is not enough to make the test passing, and rounding
to 1.2999999 instead of 1.23 is not an error in the implementation.
Valgrind and sometimes other archs are not able to work with 80 bit
doubles.
An user raised a question about a given behavior of PFCOUNT. Added a
test to show the behavior (union) is correct when most of the items are
in common.
HINCRBY* tests later used the value "tmp" that was sometimes generated
by the random key generation function. The result was ovewriting what
Tcl expected to be inside Redis with another value, causing the next
HSTRLEN test to fail.
Georadius works by computing the center + neighbors squares covering all
the area of the specified position and radius. Then a distance filter is
used to remove elements which are actually outside the range.
When a huge radius is used, like 5000 km or more, adjacent neighbors may
collide and be the same, leading to the reporting of the same element
multiple times. This only happens in the edge case of huge radius but is
not ideal.
A robust but slow solution would involve qsorting the range to remove
all the duplicates. However since the collisions are only in adjacent
boxes, for the way they are ordered in the code, it is much faster to
just check if the current box is the same as the previous one processed.
This commit adds a regression test for the bug.
Fixes#2767.
MOVE was not able to move the TTL: when a key was moved into a different
database number, it became persistent like if PERSIST was used.
In some incredible way (I guess almost nobody uses Redis MOVE) this bug
remained unnoticed inside Redis internals for many years.
Finally Andy Grunwald discovered it and opened an issue.
This commit fixes the bug and adds a regression test.
Close#2765.
This additional info may provide more clues about the test randomly
failing from time to time. Probably the failure is due to some previous
test that overwrites the logical content in the Tcl variable, but this
will make the problem more obvious.
Rationale:
1. The commands look like internals exposed without a real strong use
case.
2. Whatever there is an use case, the client would implement the
commands client side instead of paying RTT just to use a simple to
reimplement library.
3. They add complexity to an otherwise quite straightforward API.
So for now KILLED ;-)
The GIS standard and all the major DBs implementing GIS related
functions take coordinates as x,y that is longitude,latitude.
It was a bad start for Redis to do things differently, so even if this
means that existing users of the Geo module will be required to change
their code, Redis now conforms to the standard.
Usually Redis is very backward compatible, but this is not an exception
to this rule, since this is the first Geo implementation entering the
official Redis source code. It is not wise to try to be backward
compatible with code forks... :-)
Close#2637.
We set random points in the world, pick a random position, and check if
the returned points by Redis match the ones computed by Tcl by brute
forcing all the points using the distance between two points formula.
This approach is sounding since immediately resulted in finding a bug in
the original implementation.
Current todo:
- replace functions in zset.{c,h} with a new unified Redis
zset access API.
Once we get the zset interface fixed, we can squash
relevant commits in this branch and have one nice commit
to merge into unstable.
This commit adds:
- Geo commands
- Tests; runnable with: ./runtest --single unit/geo
- Geo helpers in deps/geohash-int/
- src/geo.{c,h} and src/geojson.{c,h} implementing geo commands
- Updated build configurations to get everything working
- TEMPORARY: src/zset.{c,h} implementing zset score and zset
range reading without writing to client output buffers.
- Modified linkage of one t_zset.c function for use in zset.c
Conflicts:
src/Makefile
src/redis.c
1. HVSTRLEN -> HSTRLEN. It's unlikely one needs the length of the key,
not clear how the API would work (by value does not make sense) and
there will be better names anyway.
2. Default is to return 0 when field is missing.
3. Default is to return 0 when key is missing.
4. The implementation was slower than needed, and produced unnecessary COW.
Related issue #2415.
This test on Linux was extremely slow, since in Tcl we can't enable
easily tcp-nodelay, so the busy loop used to take *a lot* with bigger
writes. Fixed using pipelining.
This removes:
- list-max-ziplist-entries
- list-max-ziplist-value
This adds:
- list-max-ziplist-size
- list-compress-depth
Also updates config file with new sections and updates
tests to use quicklist settings instead of old list settings.
Previously, the old test ran 5,000 loops and used about 500k.
With quicklist, storing those same 5,000 loops takes up 24k, so the
"large value check" failed!
This increases the test to 20,000 loops which makes the object dump 96k.
This replaces individual ziplist vs. linkedlist representations
for Redis list operations.
Big thanks for all the reviews and feedback from everybody in
https://github.com/antirez/redis/pull/2143
spopCommand() now runs spopWithCountCommand() in case the <count> param is found.
Added intsetRandomMembers() to Intset: Copies N random members from the set into inputted 'values' array. Uses either the Knuth or Floyd sample algos depending on ratio count/size.
Added setTypeRandomElements() to SET type: Returns a number of random elements from a non empty set. This is a version of setTypeRandomElement() that is modified in order to return multiple entries, using dictGetRandomKeys() and intsetRandomMembers().
Added tests for SPOP with <count>: unit/type/set, unit/scripting, integration/aof
--
Cleaned up code a bit to match with required Redis coding style
start_server now uses return value from Tcl exec to get the server pid,
however this introduces errors that depend from timing: a lot of the
testing code base assumed the server to be actually up and running when
server_start returns.
So the old code that waits to see the pid in the log file was restored.
Basically: test to make sure we can load cmsgpack
and do some sanity checks to make sure pack/unpack works
properly. We also have a bonus test for circular encoding
and decoding because I was curious how it worked.
People mostly use SORT against lists, but our prior
behavior was pretending lists were an unordered bag
requiring a forced-sort when no sort was requested.
We can just use the native list ordering to ensure
consistency across replicaion and scripting calls.
Closes#2079Closes#545 (again)
A few people have written custom C commands because bit
manipulation isn't exposed through Lua. Let's give
them Mike Pall's bitop.
This adds bitop 1.0.2 (2012-05-08) from http://bitop.luajit.org/
bitop is imported as "bit" into the global namespace.
New Lua commands: bit.tobit, bit.tohex, bit.bnot, bit.band, bit.bor, bit.bxor,
bit.lshift, bit.rshift, bit.arshift, bit.rol, bit.ror, bit.bswap
Verification of working (the asserts would abort on error, so (nil) is correct):
127.0.0.1:6379> eval "assert(bit.tobit(1) == 1); assert(bit.band(1) == 1); assert(bit.bxor(1,2) == 3); assert(bit.bor(1,2,4,8,16,32,64,128) == 255)" 0
(nil)
127.0.0.1:6379> eval 'assert(0x7fffffff == 2147483647, "broken hex literals"); assert(0xffffffff == -1 or 0xffffffff == 2^32-1, "broken hex literals"); assert(tostring(-1) == "-1", "broken tostring()"); assert(tostring(0xffffffff) == "-1" or tostring(0xffffffff) == "4294967295", "broken tostring()")' 0
(nil)
Tests also integrated into the scripting tests and can be run with:
./runtest --single unit/scripting
Tests are excerpted from `bittest.lua` included in the bitop distribution.
When aof-load-truncated option was introduced, with a default of "yes",
the past behavior of the server to abort with trunncated AOF changed, so
we need to explicitly configure the tests to abort with truncated AOF
by setting the option to no.
Previously, "MOVE key somestring" would move the key to
DB 0 which is just unexpected and wrong.
String as DB == error.
Test added too.
Modified by @antirez in order to use the getLongLongFromObject() API
instead of strtol().
Fixes#1428
Also adds test for numsub — due to tcl being tcl,
it doesn't capture the "numberness" of the fix,
but now we at least have one test case for numsub.
Closes#1561
We only want to use the last STORE key, but we have to record
we actually found a STORE key so we can increment the final return
key count.
Test added to prevent further regression.
Closes#1883, #1645, #1647
Previously the end was casted to a smaller type
which resulted in a wrong check and failed
with values larger than handled by unsigned.
Closes#1847, #1844
In the test we use WAIT when the master and slave are up, and only later the
partition is created killing the master, so we are sure we don't incur
in failure modes that may lose writes in this test: the goal here is to
make sure that the elected slave was replicating correctly with the
master.
In the initialization test for each instance we used to unregister the
old master and register it again to clear the config.
However there is a race condition doing this: as soon as we unregister
and re-register "mymaster", another Sentinel can update the new
configuration with the old state because of gossip "hello" messages.
So the correct procedure is instead, unregister "mymaster" from all the
sentinel instances, and re-register it everywhere again.
Lua scripts are executed in the context of the currently selected
database (as selected by the caller of the script).
However Lua scripts are also free to use the SELECT command in order to
affect other DBs. When SELECT is called frm Lua, the old behavior, before
this commit, was to automatically set the Lua caller selected DB to the
last DB selected by Lua. See for example the following sequence of
commands:
SELECT 0
SET x 10
EVAL "redis.call('select','1')" 0
SET x 20
Before this commit after the execution of this sequence of commands,
we'll have x=10 in DB 0, and x=20 in DB 1.
Because of the problem above, there was a bug affecting replication of
Lua scripts, because of the actual implementation of replication. It was
possible to fix the implementation of Lua scripts in order to fix the
issue, but looking closely, the bug is the consequence of the behavior
of Lua ability to set the caller's DB.
Under the old semantics, a script selecting a different DB, has no simple
ways to restore the state and select back the previously selected DB.
Moreover the script auhtor must remember that the restore is needed,
otherwise the new commands executed by the caller, will be executed in
the context of a different DB.
So this commit fixes both the replication issue, and this hard-to-use
semantics, by removing the ability of Lua, after the script execution,
to force the caller to switch to the DB selected by the Lua script.
The new behavior of the previous sequence of commadns is to just set
X=20 in DB 0. However Lua scripts are still capable of writing / reading
from different DBs if needed.
WARNING: This is a semantical change that will break programs that are
conceived to select the client selected DB via Lua scripts.
This fixes issue #1811.
The new check-for-number behavior of Lua arguments broke
users who use large strings of just integers.
The Lua number check would convert the string to a number, but
that breaks user data because
Lua numbers have limited precision compared to an arbitrarily
precise number wrapped in a string.
Regression fixed and new test added.
Fixes#1118 again.
FLUSHALL will fail on read-only slaves, but there the command is not
needed in order to reset the instance with CLUSTER RESET so errors can
be ignored.
Previously the PID format was:
[PID] Timestamp
But it recently changed to:
PID:X Timestamp
The tcl testing framework was grabbing the PID from \[\d+\], but
that's not valid anymore.
Now we grab the pid from "PID: <PID>" in the part of Redis startup
output to the right of the ASCII logo.
The bug was triggered by running the test with Valgrind (which is a lot
slower and more sensible to timing issues) after the recent changes
that made Redis more promptly able to reply with the -LOADING error.
Behrad Zari discovered [1] and Josiah reported [2]: if you block
and wait for a list to exist, but the list creates from
a non-push command, the blocked client never gets notified.
This commit adds notification of blocked clients into
the DB layer and away from individual commands.
Lists can be created by [LR]PUSH, SORT..STORE, RENAME, MOVE,
and RESTORE. Previously, blocked client notifications were
only triggered by [LR]PUSH. Your client would never get
notified if a list were created by SORT..STORE or RENAME or
a RESTORE, etc.
Blocked client notification now happens in one unified place:
- dbAdd() triggers notification when adding a list to the DB
Two new tests are added that fail prior to this commit.
All test pass.
Fixes#1668
[1]: https://groups.google.com/forum/#!topic/redis-db/k4oWfMkN1NU
[2]: #1668
Better handling of connection errors in order to update the table and
recovery, populate the startup nodes table after fetching the list of
nodes.
More work to do about it, it is still not as reliable as
redis-rb-cluster implementation which is the minimal reference
implementation for Redis Cluster clients.
SPOP, tested in the new test, is among the commands rewritng the
client->argv argument vector (it gets rewritten as SREM) for command
replication purposes.
Because of recent optimizations to client->argv caching in the context
of the Lua internal Redis client, it is important to test for SPOP to be
callable from Lua without bad effects to the other commands.
Sometimes the process is still there but no longer in a state that can
be checked (after being killed). This used to happen after a call to
SHUTDOWN NOSAVE in the scripting unit, causing a false positive.
This makes tests a bit slower, but it is better to test things at a
decent scale instead of using just a few nodes, and for a few tests we
actually need so many nodes.
The test now runs in a self-contained directory.
The general abstractions to run the tests in an environment where
mutliple instances are executed at the same time was extrapolated into
instances.tcl, that will be reused to test Redis Cluster.
This commit sets the failover timeout to 30 seconds instead of the 180
seconds default, and allows to reconfigure multiple slaves at the same
time.
This makes tests less sensible to timing, with the result that there are
less false positives due to normal behaviors that require time to
succeed or to be retried.
However the long term solution is probably some way in order to detect
when a test failed because of timing issues (for example split brain
during leader election) and retry it.
It was verified in practice that this test is able to stress much more
the implementation by introducing errors that were only trivially to
detect with different offsets but impossible to detect starting always
at zero and counting bits the full length of the string.
An unit can abort in the middle for an error. The next unit should not
assume that the instances are in a clean state, and must restart what
was left killed.
Sentinel tests are designed to be dependent on the previous tests in the
same unit, so usually we can't continue with the next test in the same
unit if a previous test failed.
The test was previously performed by removing the master from the
Sentinel monitored masters. The test with the Sentinels crashed is
more similar to real-world partitions / failures.
The area a number of mandatory tests to craete a stable setup for
testing that is not too sensitive to timing issues. All those tests
moved to includes/init-tests, and marked as (init).
It is now possible to kill and restart sentinel or redis instances for
more real-world testing.
The 01 unit tests the capability of Sentinel to update the configuration
of Sentinels rejoining the cluster, however the test is pretty trivial
and more tests should be added.
Some inline test moved into server_is_up procedure.
Also find_available_port was moved into util since it is going
to be used for the Sentinel test as well.
The Redis test uses a server-clients model in order to parallelize the
execution of different tests. However in recent versions of osx not
setting the channel to a binary encoding caused issues even if AFAIK no
binary data is really sent via this channel. However now the channels
are deliberately set to a binary encoding and this solves the issue.
The exact issue was the test not terminating and giving the impression
of running forever, since test clients or servers were unable to
exchange the messages to continue.
When the test is executed using the root account, setting the permission
to 222 does not work as expected, as root can read files with 222
permission.
Now we skip the test if root is detected.
This fixes issue #1034 and the duplicated #1040 issue.
Thanks to Jan-Erik Rediger (@badboy on Github) for finding a way to reproduce the issue.
When keyspace events are enabled, the overhead is not sever but
noticeable, so this commit introduces the ability to select subclasses
of events in order to avoid to generate events the user is not
interested in.
The events can be selected using redis.conf or CONFIG SET / GET.
UNSUBSCRIBE and PUNSUBSCRIBE commands are designed to mass-unsubscribe
the client respectively all the channels and patters if called without
arguments.
However when these functions are called without arguments, but there are
no channels or patters we are subscribed to, the old behavior was to
don't reply at all.
This behavior is broken, as every command should always reply.
Also it is possible that we are no longer subscribed to a channels but we
are subscribed to patters or the other way around, and the client should
be notified with the correct number of subscriptions.
Also it is not pretty that sometimes we did not receive a reply at all
in a redis-cli session from these commands, blocking redis-cli trying
to read the reply.
This fixes issue #714.
The Redis Slow Log always used to log the slow commands executed inside
a MULTI/EXEC block. However also EXEC was logged at the end, which is
perfectly useless.
Now EXEC is no longer logged and a test was added to test this behavior.
This fixes issue #759.