During an auditing effort, the Apple Vulnerability Research team discovered
a critical Redis security issue affecting the Lua scripting part of Redis.
-- Description of the problem
Several years ago I merged a pull request including many small changes at
the Lua MsgPack library (that originally I authored myself). The Pull
Request entered Redis in commit 90b6337c1, in 2014.
Unfortunately one of the changes included a variadic Lua function that
lacked the check for the available Lua C stack. As a result, calling the
"pack" MsgPack library function with a large number of arguments, results
into pushing into the Lua C stack a number of new values proportional to
the number of arguments the function was called with. The pushed values,
moreover, are controlled by untrusted user input.
This in turn causes stack smashing which we believe to be exploitable,
while not very deterministic, but it is likely that an exploit could be
created targeting specific versions of Redis executables. However at its
minimum the issue results in a DoS, crashing the Redis server.
-- Versions affected
Versions greater or equal to Redis 2.8.18 are affected.
-- Reproducing
Reproduce with this (based on the original reproduction script by
Apple security team):
https://gist.github.com/antirez/82445fcbea6d9b19f97014cc6cc79f8a
-- Verification of the fix
The fix was tested in the following way:
1) I checked that the problem is no longer observable running the trigger.
2) The Lua code was analyzed to understand the stack semantics, and that
actually enough stack is allocated in all the cases of mp_pack() calls.
3) The mp_pack() function was modified in order to show exactly what items
in the stack were being set, to make sure that there is no silent overflow
even after the fix.
-- Credits
Thank you to the Apple team and to the other persons that helped me
checking the patch and coordinating this communication.
This way we let big endian systems to still load old RDB versions.
However newver versions will be saved and loaded in a way that make RDB
expires cross-endian again. Thanks to @oranagra for the reporting and
the discussion about this problem, leading to this fix.
Currently it does not look it's sensible to generate events for streams
consumer groups modification, being them metadata, however at least for
key-level events, like the creation or removal of a consumer group, I
added a few events here and there. Later we can evaluate if it makes
sense to add more. From the POV instead of WAIT (in Redis transaciton)
and signaling the key as modified, it looks like that the transaction
should not fail when a stream is modified, so no calls are made in
consumer groups related functions to signalModifiedKey().
Again thanks to @oranagra. The object idle time does not fit into an int
sometimes: use the native type that the serialization function will get
as argument, which is uint64_t.
A user with many connections (10 thousand) on a single Redis server
reports in issue #4983 that sometimes Redis is idle becuase at the same
time many clients need to resize their query buffer according to the old
policy.
It looks like this was created by the fact that we allow the query
buffer to grow without problems to a size up to PROTO_MBULK_BIG_ARG
normally, but when the client is idle we immediately are more strict,
and a query buffer greater than 1024 bytes is already enough to trigger
the resize. So for instance if most of the clients stop at the same time
this issue should be easily triggered.
This behavior actually looks odd, and there should be only a clear limit
after we say, let's look at this query buffer to check if it's time to
resize it. This commit puts the limit at PROTO_MBULK_BIG_ARG, and the
check is performed both if compared to the peak usage the current usage
is too big, or if the client is idle.
Then when the check is performed, to waste just a few kbytes is
considered enough to proceed with the resize. This should fix the issue.
We unblocked the client too early, when the group name object was no
longer valid in client->bpop, so propagating XCLAIM later in
streamPropagateXCLAIM() deferenced a field already set to NULL.
Removing the fix about 50% of the times the test will not be able to
pass cleanly. It's very hard to write a test that will always fail, or
actually, it is possible but then it's likely that it will consistently
pass if we change some random bit, so better to use randomization here.