This is especially needed in diskless loading, were a short read could have
caused redis to exit. now the module can handle the error and return to the
caller gracefully.
this fixes#5326
* create module API for forking child processes.
* refactor duplicate code around creating and tracking forks by AOF and RDB.
* child processes listen to SIGUSR1 and dies exitFromChild in order to
eliminate a valgrind warning of unhandled signal.
* note that BGSAVE error reply has changed.
valgrind error is:
Process terminating with default action of signal 10 (SIGUSR1)
Fixes#6012.
As long as "INFO is broken", this should be adequate IMO. Once we rework
`INFO`, perhaps into RESP3, this implementation should be revisited.
when redis appends the blocked client reply list to the real client, it didn't
bother to check if it is in fact the master client. so a slave executing that
module command will send replies to the master, causing the master to send the
slave error responses, which will mess up the replication offset
(slave will advance it's replication offset, and the master does not)
A filter handle is returned and can be used to unregister a filter. In
the future it can also be used to further configure or manipulate the
filter.
Filters are now automatically unregistered when a module unloads.
In some cases processMultibulkBuffer uses sdsMakeRoomFor to
expand the querybuf, but later in some cases it uses that query
buffer as is for an argv element (see "Optimization"), which means
that the sds in argv may have a lot of wasted space, and then in case
modules keep that argv RedisString inside their data structure, this
space waste will remain for long (until restarted from rdb).
It does not make much sense to limit what modules can do: the admin
should instead limit what module commnads an user may call. So
RedisModule_Call() and other module operations should be able to execute
everything they want: the limitation should be posed by the API exported
by the module itself.
Storing the context is useless, because we can't really reuse that
later. For instance in the API RM_DictNext() that returns a
RedisModuleString for the next key iterated, the user should pass the
new context, because we may run the keys of the dictionary in a
different context of the one where the dictionary was created. Also the
dictionary may be created without a context, but we may still demand
automatic memory management for the returned strings while iterating.
By using the "C" suffix for functions getting pointer/len, we can do the
same in the future for other modules APIs that need a variant with
pointer/len and that are now accepting a RedisModuleString.
The burden of having to always create RedisModuleString objects within a
module context was too much, especially now that we have threaded
operations and modules are doing more interesting things. The context in
the string API is currently only used for automatic memory managemnet,
so now the API was modified so that the user can opt-out this feature by
passing a NULL context.
A) slave buffers didn't count internal fragmentation and sds unused space,
this caused them to induce eviction although we didn't mean for it.
B) slave buffers were consuming about twice the memory of what they actually needed.
- this was mainly due to sdsMakeRoomFor growing to twice as much as needed each time
but networking.c not storing more than 16k (partially fixed recently in 237a38737).
- besides it wasn't able to store half of the new string into one buffer and the
other half into the next (so the above mentioned fix helped mainly for small items).
- lastly, the sds buffers had up to 30% internal fragmentation that was wasted,
consumed but not used.
C) inefficient performance due to starting from a small string and reallocing many times.
what i changed:
- creating dedicated buffers for reply list, counting their size with zmalloc_size
- when creating a new reply node from, preallocate it to at least 16k.
- when appending a new reply to the buffer, first fill all the unused space of the
previous node before starting a new one.
other changes:
- expose mem_not_counted_for_evict info field for the benefit of the test suite
- add a test to make sure slave buffers are counted correctly and that they don't cause eviction
This is useful in the reply and timeout callback, if the module wants to
do some cleanup of the blocked client handle that may be stored around
in the module-private data structures.
In some modules it may be useful to have an idea about being near to
OOM. Anyway additionally an explicit call to get the fill ratio will be
added in the future.
Note that this was an experimental API that can only be enabled with
REIDSMODULE_EXPERIMENTAL_API, so it is subject to change until its
promoted to stable API. Sorry for the breakage, it is trivial to
resolve btw. This change will not be back ported to Redis 4.0.
- protocol parsing (processMultibulkBuffer) was limitted to 32big positions in the buffer
readQueryFromClient potential overflow
- rioWriteBulkCount used int, although rioWriteBulkString gave it size_t
- several places in sds.c that used int for string length or index.
- bugfix in RM_SaveAuxField (return was 1 or -1 and not length)
- RM_SaveStringBuffer was limitted to 32bit length
For example:
1. A module command called within a MULTI section.
2. A Lua script with replicate_commands() called within a MULTI section.
3. A module command called from a Lua script in the above context.
Lua scripting does not support calling blocking commands, however all
the native Redis commands are flagged as "s" (no scripting flag), so
this is not possible at all. With modules there is no such mechanism in
order to flag a command as non callable by the Lua scripting engine,
moreover we cannot trust the modules users from complying all the times:
it is likely that modules will be released to have blocking commands
without such commands being flagged correctly, even if we provide a way to
signal this fact.
This commit attempts to address the problem in a short term way, by
detecting that a module is trying to block in the context of the Lua
scripting engine client, and preventing to do this. The module will
actually believe to block as usually, but what happens is that the Lua
script receives an error immediately, and the background call is ignored
by the Redis engine (if not for the cleanup callbacks, once it
unblocks).
Long term, the more likely solution, is to introduce a new call called
RedisModule_GetClientFlags(), so that a command can detect if the caller
is a Lua script, and return an error, or avoid blocking at all.
Being the blocking API experimental right now, more work is needed in
this regard in order to reach a level well blocking module commands and
all the other Redis subsystems interact peacefully.
Now the effect is like the following:
127.0.0.1:6379> eval "redis.call('hello.block',1,5000)" 0
(error) ERR Error running script (call to
f_b5ba35ff97bc1ef23debc4d6e9fd802da187ed53): @user_script:1: ERR
Blocking module command called from Lua script
This commit fixes issue #4127 in the short term.
The function cache was not working at all, and the function returned
wrong values if there where two or more modules exporting native data
types.
See issue #4131 for more details.
The original RDB serialization format was not parsable without the
module loaded, becuase the structure was managed only by the module
itself. Moreover RDB is a streaming protocol in the sense that it is
both produce di an append-only fashion, and is also sometimes directly
sent to the socket (in the case of diskless replication).
The fact that modules values cannot be parsed without the relevant
module loaded is a problem in many ways: RDB checking tools must have
loaded modules even for doing things not involving the value at all,
like splitting an RDB into N RDBs by key or alike, or just checking the
RDB for sanity.
In theory module values could be just a blob of data with a prefixed
length in order for us to be able to skip it. However prefixing the values
with a length would mean one of the following:
1. To be able to write some data at a previous offset. This breaks
stremaing.
2. To bufferize values before outputting them. This breaks performances.
3. To have some chunked RDB output format. This breaks simplicity.
Moreover, the above solution, still makes module values a totally opaque
matter, with the fowllowing problems:
1. The RDB check tool can just skip the value without being able to at
least check the general structure. For datasets composed mostly of
modules values this means to just check the outer level of the RDB not
actually doing any checko on most of the data itself.
2. It is not possible to do any recovering or processing of data for which a
module no longer exists in the future, or is unknown.
So this commit implements a different solution. The modules RDB
serialization API is composed if well defined calls to store integers,
floats, doubles or strings. After this commit, the parts generated by
the module API have a one-byte prefix for each of the above emitted
parts, and there is a final EOF byte as well. So even if we don't know
exactly how to interpret a module value, we can always parse it at an
high level, check the overall structure, understand the types used to
store the information, and easily skip the whole value.
The change is backward compatible: older RDB files can be still loaded
since the new encoding has a new RDB type: MODULE_2 (of value 7).
The commit also implements the ability to check RDB files for sanity
taking advantage of the new feature.
Instead of giving the module background operations just a small time to
run in the beforeSleep() function, we can have the lock released for all
the time we are blocked in the multiplexing syscall.
If a thread unblocks a client blocked in a module command, by using the
RedisMdoule_UnblockClient() API, the event loop may not be awaken until
the next timeout of the multiplexing API or the next unrelated I/O
operation on other clients. We actually want the client to be served
ASAP, so a mechanism is needed in order for the unblocking API to inform
Redis that there is a client to serve ASAP.
This commit fixes the issue using the old trick of the pipe: when a
client needs to be unblocked, a byte is written in a pipe. When we run
the list of clients blocked in modules, we consume all the bytes
written in the pipe. Writes and reads are performed inside the context
of the mutex, so no race is possible in which we consume the bytes that
are actually related to an awake request for a client that should still
be put into the list of clients to unblock.
It was verified that after the fix the server handles the blocked
clients with the expected short delay.
Thanks to @dvirsky for understanding there was such a problem and
reporting it.
This change attempts to switch to an hash function which mitigates
the effects of the HashDoS attack (denial of service attack trying
to force data structures to worst case behavior) while at the same time
providing Redis with an hash function that does not expect the input
data to be word aligned, a condition no longer true now that sds.c
strings have a varialbe length header.
Note that it is possible sometimes that even using an hash function
for which collisions cannot be generated without knowing the seed,
special implementation details or the exposure of the seed in an
indirect way (for example the ability to add elements to a Set and
check the return in which Redis returns them with SMEMBERS) may
make the attacker's life simpler in the process of trying to guess
the correct seed, however the next step would be to switch to a
log(N) data structure when too many items in a single bucket are
detected: this seems like an overkill in the case of Redis.
SPEED REGRESION TESTS:
In order to verify that switching from MurmurHash to SipHash had
no impact on speed, a set of benchmarks involving fast insertion
of 5 million of keys were performed.
The result shows Redis with SipHash in high pipelining conditions
to be about 4% slower compared to using the previous hash function.
However this could partially be related to the fact that the current
implementation does not attempt to hash whole words at a time but
reads single bytes, in order to have an output which is endian-netural
and at the same time working on systems where unaligned memory accesses
are a problem.
Further X86 specific optimizations should be tested, the function
may easily get at the same level of MurMurHash2 if a few optimizations
are performed.
BACKGROUND AND USE CASEj
Redis slaves are normally write only, however the supprot a "writable"
mode which is very handy when scaling reads on slaves, that actually
need write operations in order to access data. For instance imagine
having slaves replicating certain Sets keys from the master. When
accessing the data on the slave, we want to peform intersections between
such Sets values. However we don't want to intersect each time: to cache
the intersection for some time often is a good idea.
To do so, it is possible to setup a slave as a writable slave, and
perform the intersection on the slave side, perhaps setting a TTL on the
resulting key so that it will expire after some time.
THE BUG
Problem: in order to have a consistent replication, expiring of keys in
Redis replication is up to the master, that synthesize DEL operations to
send in the replication stream. However slaves logically expire keys
by hiding them from read attempts from clients so that if the master did
not promptly sent a DEL, the client still see logically expired keys
as non existing.
Because slaves don't actively expire keys by actually evicting them but
just masking from the POV of read operations, if a key is created in a
writable slave, and an expire is set, the key will be leaked forever:
1. No DEL will be received from the master, which does not know about
such a key at all.
2. No eviction will be performed by the slave, since it needs to disable
eviction because it's up to masters, otherwise consistency of data is
lost.
THE FIX
In order to fix the problem, the slave should be able to tag keys that
were created in the slave side and have an expire set in some way.
My solution involved using an unique additional dictionary created by
the writable slave only if needed. The dictionary is obviously keyed by
the key name that we need to track: all the keys that are set with an
expire directly by a client writing to the slave are tracked.
The value in the dictionary is a bitmap of all the DBs where such a key
name need to be tracked, so that we can use a single dictionary to track
keys in all the DBs used by the slave (actually this limits the solution
to the first 64 DBs, but the default with Redis is to use 16 DBs).
This solution allows to pay both a small complexity and CPU penalty,
which is zero when the feature is not used, actually. The slave-side
eviction is encapsulated in code which is not coupled with the rest of
the Redis core, if not for the hook to track the keys.
TODO
I'm doing the first smoke tests to see if the feature works as expected:
so far so good. Unit tests should be added before merging into the
4.0 branch.
It was noted by @dvirsky that it is not possible to use string functions
when writing the AOF file. This sometimes is critical since the command
rewriting may need to be built in the context of the AOF callback, and
without access to the context, and the limited types that the AOF
production functions will accept, this can be an issue.
Moreover there are other needs that we can't anticipate regarding the
ability to use Redis Modules APIs using the context in order to build
representations to emit AOF / RDB.
Because of this a new API was added that allows the user to get a
temporary context from the IO context. The context is auto released
if obtained when the RDB / AOF callback returns.
Calling multiple time the function to get the context, always returns
the same one, since it is invalid to have more than a single context.
Notes by @antirez:
This patch was picked from a larger commit by Oran and adapted to change
the API a bit. The basic idea is to avoid double lookups when there is
to use the value of the deleted entry.
BEFORE:
entry = dictFind( ... ); /* 1st lookup. */
/* Do somethjing with the entry. */
dictDelete(...); /* 2nd lookup. */
AFTER:
entry = dictUnlink( ... ); /* 1st lookup. */
/* Do somethjing with the entry. */
dictFreeUnlinkedEntry(entry); /* No lookups!. */
RedisModule_StringRetain() allows, when automatic memory management is
on, to keep string objects living after the callback returns. Can also
be used in order to use Redis reference counting of objects inside
modules.
The reason why this is useful is that sometimes when implementing new
data types we want to reference RedisModuleString objects inside the
module private data structures, so those string objects must be valid
after the callback returns even if not referenced inside the Redis key
space.
This commit changes what provided by PR #3315 (merged) in order to
let the user specify the log level as a string.
The define could be also used, but when this happens, they must be
decoupled from the defines in the Redis core, like in the other part of
the Redis modules implementations, so that a switch statement (or a
function) remaps between the two, otherwise we are no longer free to
change the internal Redis defines.
Most of the time to check the last element is the way to go, however
there are patterns where the contrary is the best choice. Zig-zag
scanning implemented in this commmit always checks the obvious element
first (the last added -- think at a loop where the last element
allocated gets freed again and again), and continues checking one
element in the head and one in the tail.
Thanks to @dvisrky that fixed the original implementation of the
function and proposed zig zag scanning.
Now that modules receive RedisModuleString objects on loading, they are
allowed to call the String API, so the context must be released
correctly.
Related to #3293.