This commit closes issue #3698, at least for now, since the root cause
was not fixed: for huge radiuses the bounding box function does not
return a correct bounding box, and there are points still within the
radius that are left outside.
So when using GEORADIUS queries with radiuses in the order of 5000 km
or more, certain points at the edge of the area were not correctly
reported.
Because the bounding box is used just as an optimization, and such huge
radiuses are not common, for now the optimization is simply switched off
when the radius is near such magnitude.
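As an illustration of the workaround (not the actual Redis code: the
helper name and the threshold constant below are hypothetical), the idea
is to keep the bounding box only as a fast pre-filter and to bypass it
entirely when the radius approaches the problematic magnitude, letting
the exact distance check decide alone:

    #include <stdbool.h>

    /* Hypothetical cut-off: near this magnitude the bbox is unreliable. */
    #define HUGE_RADIUS_METERS (4000.0*1000.0)

    bool georadius_candidate_matches(double dist_meters, double radius_meters,
                                     bool inside_bbox) {
        /* The bounding box is only an optimization: use it as a fast
         * reject, but never for huge radiuses, where it may cut off
         * points that are actually within range. */
        if (radius_meters < HUGE_RADIUS_METERS && !inside_bbox) return false;
        /* The exact distance check always takes the final decision. */
        return dist_meters <= radius_meters;
    }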
Three test cases found by the Continuous Integration test were added, so
that we can easily trigger the bug again, both for regression testing
and in order to properly fix it at some point in the future.
Experimentally verified that the test can trigger the issue when the
fix is reverted, at least on my system... Since the bug is time/backlog
dependent, it is very hard to tell if this test will be able to trigger
the problem consistently; however even if it triggers the problem only
once in a while, we'll see it in the CI environment at http://ci.redis.io.
Apparently 1.4 is too low compared to what you get in certain setups
(including mine). I raised it to 1.55, which hopefully is still enough to
test that the fragmentation went down from 1.7 without running into
issues; however the test setup may still be fragile, so at times this
may lead to false positives again. It's hard to test for these things
in a deterministic way.
Related to #3786.
And many other related GitHub issues... all reporting the same problem.
There was probably just not enough backlog in certain unlucky runs.
I'll ask the people that can reproduce the issue if they now see it as
fixed as well.
Testing with the Solaris C compiler (SunOS 5.11 11.2 sun4v sparc sun4v)
there were issues compiling due to atomicvar.h, and running the tests
also failed because of "tail" usage not conforming to the Solaris tail
implementation. This commit fixes both issues.
Slow systems like the original Raspberry Pi need more than 5 seconds
to start the script and detect writes.
After the fix, the Raspberry Pi can pass the unit without issues.
The test now uses more diverse radius sizes: especially sizes near or
greater than the whole Earth surface are used, since they are known to
trigger edge cases. Moreover the PRNG seeding was probably resulting in
the same sequence being tested over and over again; now the seed is the
current unix time in milliseconds.
Related to #3631.
This actually includes two changes:
1) No newlines are sent to keep the master-slave link up when the
upstream master is down. Doing this is dangerous because the sub-slave
often has already received half a command of the replication protocol,
so it can't receive newlines without desyncing the replication link,
even with the code meant to cancel out the bytes that PSYNC2 was using.
Moreover this is probably also not needed/sane, because anyway the slave
can keep serving requests, and because if it's configured to not serve
stale data, it's actually a good idea to break the link.
2) When a +CONTINUE with a different replication ID is received, we now
break the connection with the sub-slaves: they need to be notified as
well (see the sketch below). This was part of the original specification,
but for some reason it was not implemented in the code, and was later
found as a PSYNC2 bug during the integration testing.
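A minimal sketch of point 2 (the identifiers below only approximate the
real replication code and are assumptions, not a verbatim patch): when
the +CONTINUE reply carries a new replication ID, the old ID is kept as
the secondary ID up to the current offset, the new one is adopted, and
the sub-slave links are dropped so that they resync and learn the new ID:

    #include <stdio.h>
    #include <string.h>

    #define REPLID_LEN 41   /* 40 hex chars plus terminator */

    struct repl_state {
        char replid[REPLID_LEN];        /* current replication ID */
        char replid2[REPLID_LEN];       /* previous ID, valid up to an offset */
        long long second_replid_offset;
        long long master_repl_offset;
    };

    /* Placeholder: in the real server this closes every sub-slave link so
     * that the sub-slaves reconnect and are notified of the new ID. */
    static void disconnect_subslaves(void) {}

    void on_continue_with_new_id(struct repl_state *st, const char *new_id) {
        if (strcmp(new_id, st->replid) == 0) return; /* same ID: nothing to do */
        /* Remember the old ID as still valid up to the current offset. */
        memcpy(st->replid2, st->replid, REPLID_LEN);
        st->second_replid_offset = st->master_repl_offset + 1;
        /* Switch to the ID announced by the master. */
        snprintf(st->replid, REPLID_LEN, "%s", new_id);
        /* Break the connection with the sub-slaves: they must be
         * notified of the ID change as well. */
        disconnect_subslaves();
    }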
This is the PSYNC2 test that helped find issues in the code, and that
still can show a protocol desync from time to time. Work is in progress
in order to find the issue. For now the test is not enabled in "make
test" and must be run manually.
By grepping the continuous integration errors log, a number of GEORADIUS
test failures were detected.
Fortunately when a GEORADIUS failure happens, the test suite logs enough
information in order to reproduce the problem: the PRNG seed,
coordinates and radius of the query.
By reproducing the issues, three different bugs were discovered and
fixed in this commit. This commit also improves the already good
reporting of the fuzzer and adds the failure vectors as regression
tests.
The issues found:
1. We need larger squares around the poles in order to cover the area
requested by the user. There were already checks in order to use a
smaller step (larger squares) but the limit set (+/- 67 degrees) is not
enough in certain edge cases, so 66 is used now.
2. Even near the equator, when the search area center is very near the
edge of the square, the north, south, east or west square may not be
able to fully cover the specified radius. Now a test is performed at the
edge of the initial guessed search area, and larger squares are used in
case the test fails (see the sketch below).
3. Because of rounding errors between Redis and Tcl, sometimes the test
signaled false positives. This is now addressed.
Whenever possible the original code was improved a bit in other ways. A
debugging example stanza was added in order to make the next debugging
session simpler when the next bug is found.
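A minimal sketch of fixes 1 and 2 (the function below is illustrative,
with an assumed haversine helper, and is not the actual geohash code):
the step is first lowered beyond +/- 66 degrees of latitude, and lowered
again if the distance from the center to any edge of the guessed search
area turns out to be smaller than the requested radius:

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define EARTH_RADIUS_M 6372797.560856 /* Earth radius in meters */

    static double deg2rad(double d) { return d * M_PI / 180.0; }

    /* Great-circle (haversine) distance in meters between two points. */
    static double haversine_m(double lon1, double lat1,
                              double lon2, double lat2) {
        double p1 = deg2rad(lat1), p2 = deg2rad(lat2);
        double u = sin((p2 - p1) / 2.0);
        double v = sin(deg2rad(lon2 - lon1) / 2.0);
        return 2.0 * EARTH_RADIUS_M *
               asin(sqrt(u * u + cos(p1) * cos(p2) * v * v));
    }

    /* lon/lat: search center; the *_edge arguments are the borders of
     * the initially guessed search area. Returns the corrected step. */
    int adjust_step(int step, double lon, double lat, double radius_m,
                    double north_edge_lat, double south_edge_lat,
                    double east_edge_lon, double west_edge_lon) {
        /* Fix 1: beyond +/- 66 degrees use a smaller step, i.e. larger
         * squares, otherwise the area near the poles is not covered. */
        if (lat > 66.0 || lat < -66.0) step--;

        /* Fix 2: if any edge of the guessed area is closer to the center
         * than the radius, the area does not cover the query: enlarge
         * the squares by decreasing the step. */
        if (haversine_m(lon, lat, lon, north_edge_lat) < radius_m ||
            haversine_m(lon, lat, lon, south_edge_lat) < radius_m ||
            haversine_m(lon, lat, east_edge_lon, lat) < radius_m ||
            haversine_m(lon, lat, west_edge_lon, lat) < radius_m) step--;

        return step;
    }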
During the initial handshake with the master a slave will report a very
high disconnection time from its master (since technically it has been
disconnected forever, so the current UNIX time in seconds is reported).
However once the slave is connected again, the Sentinel may re-scan the
INFO output only after 10 seconds, which is a long time. During this
time Sentinels will consider this instance unable to fail over, so a
useless delay is introduced.
Actually this hardly ever happened in practice, because when a slave's
master is down the INFO period for slaves changes to 1 second. However
when a manual failover is attempted immediately after adding slaves
(like in the case of the Sentinel unit test), this problem may happen.
This commit changes the INFO period to 1 second also in the case where
the slave's master is not down, but the slave reported being disconnected
from the master (by publishing, the last time we checked, a master
disconnection time field in INFO).
This change is required as a result of an unrelated change in the
replication code that adds a small delay in the master-slave first
synchronization.
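A sketch of the resulting period selection (the structure is simplified
and the field names only approximate the Sentinel code; the default
period is the 10 seconds mentioned above): INFO is now sent every second
not only when the slave's master is down or failing over, but also when
the slave itself last reported its master link as disconnected:

    #define SENTINEL_INFO_PERIOD 10000  /* default: INFO every 10 seconds */

    typedef struct sentinel_instance {
        int is_slave;
        int master_is_down;               /* master down / failover in progress */
        long long master_link_down_time;  /* from INFO; 0 means link is up */
    } sentinel_instance;

    long long info_period_ms(const sentinel_instance *ri) {
        /* Poll every second while the slave looks disconnected, so the
         * reconnection is noticed quickly and failovers are not delayed. */
        if (ri->is_slave &&
            (ri->master_is_down || ri->master_link_down_time != 0))
            return 1000;
        return SENTINEL_INFO_PERIOD;
    }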
The test works but is very slow so far, since it involves resharding
1/5 of all the cluster slots from master 0 to the other 4 masters and
back into the original master.
Now elements added to lists are incremental numbers in order to
understand, when inconsistencies are found, what is the order in which
the elements were added. Also the error now provides both the expected
and found value.
This fix, provided by Paul Kulchenko (@pkulchenko), allows the Lua
scripting engine to evaluate statements with a trailing comment like the
following one:
EVAL "print() --comment" 0
Lua can't parse the above if the string does not end with a newline, so
now a final newline is always added automatically. This does not change
the SHA1 of scripts since the SHA1 is computed on the body we pass to
EVAL, without the other code we add to register the function.
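A minimal sketch of the idea (buffer handling simplified, not the actual
scripting.c code): the script body is wrapped into a Lua function, and
ending the wrapper with a newline before the closing keyword prevents a
trailing "--" comment in the body from swallowing it:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Wrap the user script into a named Lua function. The "\n" before
     * "end" is the fix: a trailing comment in body now stops there. */
    char *wrap_script(const char *funcname, const char *body) {
        const char *fmt = "function %s() %s\nend";
        size_t len = strlen(fmt) + strlen(funcname) + strlen(body) + 1;
        char *def = malloc(len);
        if (def) snprintf(def, len, fmt, funcname, body);
        return def;
    }

With this, a body like print() --comment produces a wrapper where the
comment ends at the added newline, so Lua can parse the closing end.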
Close #2951.
It's a key invariant that, when AOF is enabled, a crash-recovery event
after the cluster reshards leaves all the keys with the expected logical
content. Now this is part of unit 04.
The old version was modeled with two failovers; however after the first
one it is possible that another slave will migrate to the new master,
since for some time the new master is not backed by any slave. Probably
there should be some pause after a failover, before the migration. Anyway
the test is simpler this way, and depends less on timing.
64 bit double math is not enough to make the test pass, and rounding
to 1.2999999 instead of 1.23 is not an error in the implementation.
Valgrind and sometimes other archs are not able to work with 80 bit
doubles.
A user raised a question about a given behavior of PFCOUNT. Added a
test to show that the behavior (union) is correct when most of the items
are in common.
The HINCRBY* tests later used the value "tmp", which was sometimes
generated by the random key generation function. The result was
overwriting what Tcl expected to be inside Redis with another value,
causing the next HSTRLEN test to fail.
Georadius works by computing the center plus the neighbor squares
covering all the area of the specified position and radius. Then a
distance filter is used to remove the elements which are actually
outside the range.
When a huge radius is used, like 5000 km or more, adjacent neighbors may
collide and be the same square, leading to the same element being
reported multiple times. This only happens in the edge case of huge
radiuses, but is not ideal.
A robust but slow solution would involve qsorting the range to remove
all the duplicates. However, since the collisions only happen in adjacent
boxes, because of the way they are ordered in the code, it is much faster
to just check if the current box is the same as the previous one
processed.
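A sketch of the check (structures simplified with respect to the real
geohash types, and the member-collection callback is assumed): the nine
boxes are scanned in their fixed order, and a box identical to the last
one processed is simply skipped:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint64_t bits;  /* interleaved geohash bits of the box */
        uint8_t  step;  /* precision of the box */
    } geobox;

    /* Assumed callback: emits the members of a single box and returns
     * how many were added to the result. */
    size_t members_of_box(const geobox *box);

    size_t members_of_all_boxes(const geobox boxes[9]) {
        size_t count = 0;
        int last_processed = -1;
        for (int i = 0; i < 9; i++) {
            /* With huge radiuses adjacent neighbors can collide into the
             * very same square: skipping a box equal to the previous one
             * processed avoids reporting duplicated elements, without
             * sorting the whole result. */
            if (last_processed >= 0 &&
                boxes[i].bits == boxes[last_processed].bits &&
                boxes[i].step == boxes[last_processed].step) continue;
            count += members_of_box(&boxes[i]);
            last_processed = i;
        }
        return count;
    }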
This commit adds a regression test for the bug.
Fixes #2767.
MOVE was not able to move the TTL: when a key was moved into a different
database number, it became persistent as if PERSIST was used.
In some incredible way (I guess almost nobody uses Redis MOVE) this bug
remained unnoticed inside Redis internals for many years.
Finally Andy Grunwald discovered it and opened an issue.
This commit fixes the bug and adds a regression test.
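A sketch of the fix (the helper prototypes below only approximate the
database API and are assumptions, not a verbatim patch): the TTL is read
from the source database before the move and re-installed in the
destination one, instead of being silently dropped:

    typedef struct redisDb redisDb;
    typedef struct robj robj;

    /* Assumed helpers, modeled after the database API. */
    long long getExpire(redisDb *db, robj *key);              /* -1 = no TTL */
    void dbAdd(redisDb *db, robj *key, robj *val);
    void dbDelete(redisDb *db, robj *key);
    void setExpire(redisDb *db, robj *key, long long when);

    void move_key(redisDb *src, redisDb *dst, robj *key, robj *val) {
        long long expire = getExpire(src, key); /* read TTL before deleting */
        dbAdd(dst, key, val);
        if (expire != -1) setExpire(dst, key, expire); /* carry the TTL over */
        dbDelete(src, key);
    }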
Close #2765.
This additional info may provide more clues about the test randomly
failing from time to time. Probably the failure is due to some previous
test that overwrites the logical content in the Tcl variable, but this
will make the problem more obvious.
Rationale:
1. The commands look like internals exposed without a real strong use
case.
2. Even when there is a use case, the client would implement the
commands client-side instead of paying the RTT just to use a
simple-to-reimplement library.
3. They add complexity to an otherwise quite straightforward API.
So for now KILLED ;-)
The GIS standard and all the major DBs implementing GIS-related
functions take coordinates as x,y, that is, longitude,latitude.
It was a bad start for Redis to do things differently, so even if this
means that existing users of the Geo module will be required to change
their code, Redis now conforms to the standard.
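For instance, GEOADD now takes its arguments as key, longitude, latitude,
member: the x coordinate (longitude) always comes first.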
Usually Redis is very backward compatible, but this is not an exception
to this rule, since this is the first Geo implementation entering the
official Redis source code. It is not wise to try to be backward
compatible with code forks... :-)
Close #2637.