Now you can use:
addReplyError("-MYERRORCODE some message");
If the error code is omitted, the behavior is like in the past,
the generic -ERR will be used.
In case the write handler is already installed, it could happen that we
serve the reply of a query in the same event loop cycle we received it,
preventing beforeSleep() from guaranteeing that we do the AOF fsync
before sending the reply to the client.
The AE_BARRIER mechanism, introduced in a previous commit, prevents this
problem. This commit makes actual use of this new feature to fix the
bug.
When feeding the master with a high rate traffic the the slave's feed is much slower.
This causes the replication buffer to grow (indefinitely) which leads to slave disconnection.
The problem is that writeToClient() decides to stop writing after NET_MAX_WRITES_PER_EVENT
writes (In order to be fair to clients).
We should ignore this when the client is a slave.
It's better if clients wait longer, the alternative is that the slave has no chance to stay in
sync in this situation.
- protocol parsing (processMultibulkBuffer) was limitted to 32big positions in the buffer
readQueryFromClient potential overflow
- rioWriteBulkCount used int, although rioWriteBulkString gave it size_t
- several places in sds.c that used int for string length or index.
- bugfix in RM_SaveAuxField (return was 1 or -1 and not length)
- RM_SaveStringBuffer was limitted to 32bit length
The main change introduced by this commit is pretending that help
arrays are more text than code, thus indenting them at level 0. This
improves readability, and is an old practice when defining arrays of
C strings describing text.
Additionally a few useless return statements are removed, and the HELP
subcommand capitalized when printed to the user.
We have this operation in two places: when caching the master and
when linking a new client after the client creation. By having an API
for this we avoid incurring in errors when modifying one of the two
places forgetting the other. The function is also a good place where to
document why we cache the linked list node.
Related to #4497 and #4210.
This adds a new `addReplyHelp` helper that's used by commands
when returning a help text. The following commands have been
touched: DEBUG, OBJECT, COMMAND, PUBSUB, SCRIPT and SLOWLOG.
WIP
Fix entry command table entry for OBJECT for HELP option.
After #4472 the command may have just 2 arguments.
Improve OBJECT HELP descriptions.
See #4472.
WIP 2
WIP 3
Redis clients need to have an instantaneous idea of the amount of memory
they are consuming (if the number is not exact should at least be
proportional to the actual memory usage). We do that adding and
subtracting the SDS length when pushing / popping from the client->reply
list. However it is quite simple to add bugs in such a setup, by not
taking the objects in the list and the count in sync. For such reason,
Redis has an assertion to track counts near 2^64: those are always the
result of the counter wrapping around because we subtract more than we
add. This commit adds the symmetrical assertion: when the list is empty
since we sent everything, the reply_bytes count should be zero. Thanks
to the new assertion it should be simple to also detect the other
problem, where the count slowly increases because of over-counting.
The assertion adds a conditional in the code that sends the buffer to
the socket but should not create any measurable performance slowdown,
listLength() just accesses a structure field, and this code path is
totally dominated by write(2).
Related to #4100.
This bug was discovered by @kevinmcgehee and constituted a major hidden
bug in the PSYNC2 implementation, caused by the propagation from the
master of incomplete commands to slaves.
The bug had several results:
1. Borrowing from Kevin text in the issue: "Given that slaves blindly
copy over their master's input into their own replication backlog over
successive read syscalls, it's possible that with large commands or
small TCP buffers, partial commands are present in this buffer. If the
master were to fail before successfully propagating the entire command
to a slave, the slaves will never execute the partial command (since the
client is invalidated) but will copy it to replication backlog which may
relay those invalid bytes to its slaves on PSYNC2, corrupting the
backlog and possibly other valid commands that follow the failover.
Simple command boundaries aren't sufficient to capture this, either,
because in the case of a MULTI/EXEC block, if the master successfully
propagates a subset of the commands but not the EXEC, then the
transaction in the backlog becomes corrupt and could corrupt other
slaves that consume this data."
2. As identified by @yangsiran later, there is another effect of the
bug. For the same mechanism of the first problem, a slave having another
slave, could receive a full resynchronization request with an already
half-applied command in the backlog. Once the RDB is ready, it will be
sent to the slave, and the replication will continue sending to the
sub-slave the other half of the command, which is not valid.
The fix, designed by @yangsiran and @antirez, and implemented by
@antirez, uses a secondary buffer in order to feed the sub-masters and
update the replication backlog and offsets, only when a given part of
the query buffer is actually *applied* to the state of the instance,
that is, when the command gets processed and the command is not pending
in the Redis transaction buffer because of CLIENT_MULTI state.
Given that now the backlog and offsets representation are in agreement
with the actual processed commands, both issue 1 and 2 should no longer
be possible.
Thanks to @kevinmcgehee, @yangsiran and @oranagra for their work in
identifying and designing a fix for this problem.
since slave isn't replying to it's master, these errors go unnoticed.
since we don't expect the master to send garbadge to the slave, this should be safe.
(as long as we don't log OOM errors there)
This actually includes two changes:
1) No newlines to take the master-slave link up when the upstream master
is down. Doing this is dangerous because the sub-slave often is received
replication protocol for an half-command, so can't receive newlines
without desyncing the replication link, even with the code in order to
cancel out the bytes that PSYNC2 was using. Moreover this is probably
also not needed/sane, because anyway the slave can keep serving
requests, and because if it's configured to don't serve stale data, it's
a good idea, actually, to break the link.
2) When a +CONTINUE with a different ID is received, we now break
connection with the sub-slaves: they need to be notified as well. This
was part of the original specification but for some reason it was not
implemented in the code, and was alter found as a PSYNC2 bug in the
integration testing.
The gist of the changes is that now, partial resynchronizations between
slaves and masters (without the need of a full resync with RDB transfer
and so forth), work in a number of cases when it was impossible
in the past. For instance:
1. When a slave is promoted to mastrer, the slaves of the old master can
partially resynchronize with the new master.
2. Chained slalves (slaves of slaves) can be moved to replicate to other
slaves or the master itsef, without requiring a full resync.
3. The master itself, after being turned into a slave, is able to
partially resynchronize with the new master, when it joins replication
again.
In order to obtain this, the following main changes were operated:
* Slaves also take a replication backlog, not just masters.
* Same stream replication for all the slaves and sub slaves. The
replication stream is identical from the top level master to its slaves
and is also the same from the slaves to their sub-slaves and so forth.
This means that if a slave is later promoted to master, it has the
same replication backlong, and can partially resynchronize with its
slaves (that were previously slaves of the old master).
* A given replication history is no longer identified by the `runid` of
a Redis node. There is instead a `replication ID` which changes every
time the instance has a new history no longer coherent with the past
one. So, for example, slaves publish the same replication history of
their master, however when they are turned into masters, they publish
a new replication ID, but still remember the old ID, so that they are
able to partially resynchronize with slaves of the old master (up to a
given offset).
* The replication protocol was slightly modified so that a new extended
+CONTINUE reply from the master is able to inform the slave of a
replication ID change.
* REPLCONF CAPA is used in order to notify masters that a slave is able
to understand the new +CONTINUE reply.
* The RDB file was extended with an auxiliary field that is able to
select a given DB after loading in the slave, so that the slave can
continue receiving the replication stream from the point it was
disconnected without requiring the master to insert "SELECT" statements.
This is useful in order to guarantee the "same stream" property, because
the slave must be able to accumulate an identical backlog.
* Slave pings to sub-slaves are now sent in a special form, when the
top-level master is disconnected, in order to don't interfer with the
replication stream. We just use out of band "\n" bytes as in other parts
of the Redis protocol.
An old design document is available here:
https://gist.github.com/antirez/ae068f95c0d084891305
However the implementation is not identical to the description because
during the work to implement it, different changes were needed in order
to make things working well.
This is an attempt at mitigating problems due to cross protocol
scripting, an attack targeting services using line oriented protocols
like Redis that can accept HTTP requests as valid protocol, by
discarding the invalid parts and accepting the payloads sent, for
example, via a POST request.
For this to be effective, when we detect POST and Host: and terminate
the connection asynchronously, the networking code was modified in order
to never process further input. It was later verified that in a
pipelined request containing a POST command, the successive commands are
not executed.
This feature is useful, especially in deployments using Sentinel in
order to setup Redis HA, where the slave is executed with NAT or port
forwarding, so that the auto-detected port/ip addresses, as listed in
the "INFO replication" output of the master, or as provided by the
"ROLE" command, don't match the real addresses at which the slave is
reachable for connections.
An exposed Redis instance on the internet can be cause of serious
issues. Since Redis, by default, binds to all the interfaces, it is easy
to forget an instance without any protection layer, for error.
Protected mode try to address this feature in a soft way, providing a
layer of protection, but giving clues to Redis users about why the
server is not accepting connections.
When protected mode is enabeld (the default), and if there are no
minumum hints about the fact the server is properly configured (no
"bind" directive is used in order to restrict the server to certain
interfaces, nor a password is set), clients connecting from external
intefaces are refused with an error explaining what to do in order to
fix the issue.
Clients connecting from the IPv4 and IPv6 lookback interfaces are still
accepted normally, similarly Unix domain socket connections are not
restricted in any way.
We need to process replies after errors in order to delete keys
successfully transferred. Also argument rewriting was fixed since
it was broken in several ways. Now a fresh argument vector is created
and set if we are acknowledged of at least one key.
The old version only flushed data to slaves if there were strings
pending in the client->reply list. Now also static buffers are flushed.
Does not help to free memory (which is the only use we have right now in
the fuction), but is more correct conceptually, and may be used in
other contexts.
Sometimes it can be useful for clients to completely disable replies
from the Redis server. For example when the client sends fire and forget
commands or performs a mass loading of data, or in caching contexts
where new data is streamed constantly. In such contexts to use server
time and bandwidth in order to send back replies to clients, which are
going to be ignored, is a shame.
Multiple mechanisms are possible to implement such a feature. For
example it could be a feature of MULTI/EXEC, or a command prefix
such as "NOREPLY SADD myset foo", or a different mechanism that allows
to switch on/off requests using the CLIENT command.
The MULTI/EXEC approach has the problem that transactions are not
strictly part of the no-reply semantics, and if we want to insert a lot
of data in a bulk way, creating a huge MULTI/EXEC transaction in the
server memory is bad.
The prefix is the best in this specific use case since it does not allow
desynchronizations, and is pretty clear semantically. However Redis
internals and client libraries are not prepared to handle this
currently.
So the implementation uses the CLIENT command, providing a new REPLY
subcommand with three options:
CLIENT REPLY OFF disables the replies, and does not reply itself.
CLIENT REPLY ON re-enables the replies, replying +OK.
CLIENT REPLY SKIP only discards the reply of the next command, and
like OFF does not reply anything itself.
The reason to add the SKIP command is that it allows to have an easy
way to send conceptually "single" commands that don't need a reply
as the sum of two pipelined commands:
CLIENT REPLY SKIP
SET key value
Note that CLIENT REPLY ON replies with +OK so it should be used when
sending multiple commands that don't need a reply. However since it
replies with +OK the client can check that the connection is still
active and all the previous commands were received.
This is currently just into Redis "unstable" so the proposal can be
modified or abandoned based on users inputs.
After the introduction of the list with clients with pending writes, to
process clients incrementally outside of the event loop we also need to
process the pending writes list.
Talking with @oranagra we had to reason a little bit to understand if
this function could ever flush the output buffers of the wrong slaves,
having online state but actually not being ready to receive writes
before the first ACK is received from them (this happens with diskless
replication).
Next time we'll just read this comment.
Add the concept of slaves capabilities to Redis, the slave now presents
to the Redis master with a set of capabilities in the form:
REPLCONF capa SOMECAPA capa OTHERCAPA ...
This has the effect of setting slave->slave_capa with the corresponding
SLAVE_CAPA macros that the master can test later to understand if it
the slave will understand certain formats and protocols of the
replication process. This makes it much simpler to introduce new
replication capabilities in the future in a way that don't break old
slaves or masters.
This patch was designed and implemented together with Oran Agra
(@oranagra).
1. We no longer use a fake client but just rewriting.
2. We group all the inserts into a single ZADD dispatch (big speed win).
3. As a side effect of the correct implementation, replication works.
4. The return value of the command is now correct.
When we fail to setup the write handler it does not make sense to take
the client around, it is missing writes: whatever is a client or a slave
anyway the connection should terminated ASAP.
Moreover what the function does exactly with its return value, and in
which case the write handler is installed on the socket, was not clear,
so the functions comment are improved to make the goals of the function
more obvious.
Also related to #2485.
master was closing the connection if the RDB transfer took long time.
and also sent PINGs to the slave before it got the initial ACK, in which case the slave wouldn't be able to find the EOF marker.
1. No need to set btype in processUnblockedClients(), since clients
flagged REDIS_UNBLOCKED should have it already cleared.
2. When putting clients in the unblocked clients list, clientsArePaused()
should flag them with REDIS_UNBLOCKED. Not strictly needed with the
current code but is more coherent.
When the list of unblocked clients were processed, btype was set to
blocking type none, but the client remained flagged with REDIS_BLOCKED.
When timeout is reached (or when the client disconnects), unblocking it
will trigger an assertion.
There is no need to process pending requests from blocked clients, so
now clientsArePaused() just avoid touching blocked clients.
Close#2467.
read() and write() return ssize_t (signed long), not int.
For other offsets, we can use the unsigned size_t type instead
of a signed offset (since our replication offsets and buffer
positions are never negative).
Track bandwidth used by clients and replication (but diskless
replication is not tracked since the actual transfer happens in the
child process).
This includes a refactoring that makes tracking new instantaneous
metrics simpler.
zmalloc(0) cauesd to actually trigger a non-zero allocation since with
standard libc malloc we have our own zmalloc header for memory tracking,
but at the same time the returned pointer is at the end of the block and
not in the middle. This triggers a false positive when testing with
valgrind.
When the inline protocol args count is 0, we now avoid reallocating
c->argv, preventing the issue to happen.
RDB EOF detection was relying on the final part of the RDB transfer to
be a magic 40 bytes EOF marker. However as the slave is put online
immediately, and because of sockets timeouts, the replication stream is
actually contiguous with the RDB file.
This means that to detect the EOF correctly we should either:
1) Scan all the stream searching for the mark. Sucks CPU-wise.
2) Start to send the replication stream only after an acknowledge.
3) Implement a proper chunked encoding.
For now solution "2" was picked, so the master does not start to send
ASAP the stream of commands in the case of diskless replication. We wait
for the first REPLCONF ACK command from the slave, that certifies us
that the slave correctly loaded the RDB file and is ready to get more
data.
The code tested many times if a client had active Pub/Sub subscriptions
by checking the length of a list and dictionary where the patterns and
channels are stored. This was substituted with a client flag called
REDIS_PUBSUB that is simpler to test for. Moreover in order to manage
this flag some code was refactored.
This commit is believed to have no effects in the behavior of the
server.
Technically the problem is due to the client type API that does not
return a special value for the master, however fixing it locally in the
CLIENT KILL command is better currently because otherwise we would
introduce a new output buffer limit class as a side effect.
This will be used by CLIENT KILL and is also a good way to ensure a
given client is still the same across CLIENT LIST calls.
The output of CLIENT LIST was modified to include the new ID, but this
change is considered to be backward compatible as the API does not imply
you can do positional parsing, since each filed as a different name.
Because of output buffer limits Redis internals had this idea of type of
clients: normal, pubsub, slave. It is possible to set different output
buffer limits for the three kinds of clients.
However all the macros and API were named after output buffer limit
classes, while the idea of a client type is a generic one that can be
reused.
This commit does two things:
1) Rename the API and defines with more general names.
2) Change the class of clients executing the MONITOR command from "slave"
to "normal".
"2" is a good idea because you want to have very special settings for
slaves, that are not a good idea for MONITOR clients that are instead
normal clients even if they are conceptually slave-alike (since it is a
push protocol).
The backward-compatibility breakage resulting from "2" is considered to
be minimal to care, since MONITOR is a debugging command, and because
anyway this change is not going to break the format or the behavior, but
just when a connection is closed on big output buffer issues.
This commit adds peer ID caching in the client structure plus an API
change and the use of sdsMakeRoomFor() in order to improve the
reallocation pattern to generate the CLIENT LIST output.
Both the changes account for a very significant speedup.
When we are blocked and a few events a processed from time to time, it
is smarter to call the event handler a few times in order to handle the
accept, read, write, close cycle of a client in a single pass, otherwise
there is too much latency added for clients to receive a reply while the
server is busy in some way (for example during the DB loading).
When the listening sockets readable event is fired, we have the chance
to accept multiple clients instead of accepting a single one. This makes
Redis more responsive when there is a mass-connect event (for example
after the server startup), and in workloads where a connect-disconnect
pattern is used often, so that multiple clients are waiting to be
accepted continuously.
As a side effect, this commit makes the LOADING, BUSY, and similar
errors much faster to deliver to the client, making Redis more
responsive when there is to return errors to inform the clients that the
server is blocked in an not interruptible operation.
When we set a protocol error we should return with REDIS_ERR to let the
caller know it should stop processing the client.
Bug found in a code auditing related to issue #1699.
The API is one of the bulding blocks of CLUSTER FAILOVER command that
executes a manual failover in Redis Cluster. However exposed as a
command that the user can call directly, it makes much simpler to
upgrade a standalone Redis instance using a slave in a safer way.
The commands works like that:
CLIENT PAUSE <milliesconds>
All the clients that are not slaves and not in MONITOR state are paused
for the specified number of milliesconds. This means that slaves are
normally served in the meantime.
At the end of the specified amount of time all the clients are unblocked
and will continue operations normally. This command has no effects on
the population of the slow log, since clients are not blocked in the
middle of operations but only when there is to process new data.
Note that while the clients are unblocked, still new commands are
accepted and queued in the client buffer, so clients will likely not
block while writing to the server while the pause is active.
A client can enter a special cluster read-only mode using the READONLY
command: if the client read from a slave instance after this command,
for slots that are actually served by the instance's master, the queries
will be processed without redirection, allowing clients to read from
slaves (but without any kind fo read-after-write guarantee).
The READWRITE command can be used in order to exit the readonly state.
Starting with Redis 2.8 masters are able to detect timed out slaves,
while before 2.8 only slaves were able to detect a timed out master.
Now that timeout detection is bi-directional the following problem
happens as described "in the field" by issue #1449:
1) Master and slave setup with big dataset.
2) Slave performs the first synchronization, or a full sync
after a failed partial resync.
3) Master sends the RDB payload to the slave.
4) Slave loads this payload.
5) Master detects the slave as timed out since does not receive back the
REPLCONF ACK acknowledges.
Here the problem is that the master has no way to know how much the
slave will take to load the RDB file in memory. The obvious solution is
to use a greater replication timeout setting, but this is a shame since
for the 0.1% of operation time we are forced to use a timeout that is
not what is suited for 99.9% of operation time.
This commit tries to fix this problem with a solution that is a bit of
an hack, but that modifies little of the replication internals, in order
to be back ported to 2.8 safely.
During the RDB loading time, we send the master newlines to avoid
being sensed as timed out. This is the same that the master already does
while saving the RDB file to still signal its presence to the slave.
The single newline is used because:
1) It can't desync the protocol, as it is only transmitted all or
nothing.
2) It can be safely sent while we don't have a client structure for the
master or in similar situations just with write(2).
Since we started sending REPLCONF ACK from slaves to masters, the
lastinteraction field of the client structure is always refreshed as
soon as there is room in the socket output buffer, so masters in timeout
are detected with too much delay (the socket buffer takes a lot of time
to be filled by small REPLCONF ACK <number> entries).
This commit only counts data received as interactions with a master,
solving the issue.
During the replication full resynchronization process, the RDB file is
transfered from the master to the slave. However there is a short
preamble to send, that is currently just the bulk payload length of the
file in the usual Redis form $..length..<CR><LF>.
This preamble used to be sent with a direct write call, assuming that
there was alway room in the socket output buffer to hold the few bytes
needed, however this does not scale in case we'll need to send more
stuff, and is not very robust code in general.
This commit introduces a more general mechanism to send a preamble up to
2GB in size (the max length of an sds string) in a non blocking way.
Actaully the string is modified in-place and a reallocation is never
needed, so there is no need to return the new sds string pointer as
return value of the function, that is now just "void".
Now that EMBSTR encoding exists we calculate the amount of memory used
by the SDS part of a Redis String object in two different ways:
1) For raw string object, the size of the allocation is considered.
2) For embstr objects, the length of the string itself is used.
The new function takes care of this logic.
This function missed proper handling of reply_bytes when gluing to the
previous object was used. The issue was introduced with the EMBSTR new
string object encoding.
This fixes issue #1208.
Previously two string encodings were used for string objects:
1) REDIS_ENCODING_RAW: a string object with obj->ptr pointing to an sds
stirng.
2) REDIS_ENCODING_INT: a string object where the obj->ptr void pointer
is casted to a long.
This commit introduces a experimental new encoding called
REDIS_ENCODING_EMBSTR that implements an object represented by an sds
string that is not modifiable but allocated in the same memory chunk as
the robj structure itself.
The chunk looks like the following:
+--------------+-----------+------------+--------+----+
| robj data... | robj->ptr | sds header | string | \0 |
+--------------+-----+-----+------------+--------+----+
| ^
+-----------------------+
The robj->ptr points to the contiguous sds string data, so the object
can be manipulated with the same functions used to manipulate plan
string objects, however we need just on malloc and one free in order to
allocate or release this kind of objects. Moreover it has better cache
locality.
This new allocation strategy should benefit both the memory usage and
the performances. A performance gain between 60 and 70% was observed
during micro-benchmarks, however there is more work to do to evaluate
the performance impact and the memory usage behavior.
There are systems that when printing +/- infinte with printf-family
functions will not use the usual "inf" "-inf", but different strings.
Handle that explicitly.
Fixes issue #930.
The function returns an unique identifier for the client, as ip:port for
IPv4 and IPv6 clients, or as path:0 for Unix socket clients.
See the top comment in the function for more info.
Any places which I feel might want to be updated to work differently
with IPv6 have been marked with a comment starting "IPV6:".
Currently the only comments address places where an IP address is
combined with a port using the standard : separated form. These may want
to be changed when printing IPv6 addresses to wrap the address in []
such as
[2001:db8::c0:ffee]:6379
instead of
2001:db8::c0:ffee:6379
as the latter format is a technically valid IPv6 address and it is hard
to distinguish the IPv6 address component from the port unless you know
the port is supposed to be there.
In two places buffers have been created with a size of 128 bytes which
could be reduced to INET6_ADDRSTRLEN to still hold a full IP address.
These places have been marked as they are presently big enough to handle
the needs of storing a printable IPv6 address.