The old version only flushed data to slaves if there were strings
pending in the client->reply list. Now also static buffers are flushed.
Does not help to free memory (which is the only use we have right now in
the fuction), but is more correct conceptually, and may be used in
other contexts.
Sometimes it can be useful for clients to completely disable replies
from the Redis server. For example when the client sends fire and forget
commands or performs a mass loading of data, or in caching contexts
where new data is streamed constantly. In such contexts to use server
time and bandwidth in order to send back replies to clients, which are
going to be ignored, is a shame.
Multiple mechanisms are possible to implement such a feature. For
example it could be a feature of MULTI/EXEC, or a command prefix
such as "NOREPLY SADD myset foo", or a different mechanism that allows
to switch on/off requests using the CLIENT command.
The MULTI/EXEC approach has the problem that transactions are not
strictly part of the no-reply semantics, and if we want to insert a lot
of data in a bulk way, creating a huge MULTI/EXEC transaction in the
server memory is bad.
The prefix is the best in this specific use case since it does not allow
desynchronizations, and is pretty clear semantically. However Redis
internals and client libraries are not prepared to handle this
currently.
So the implementation uses the CLIENT command, providing a new REPLY
subcommand with three options:
CLIENT REPLY OFF disables the replies, and does not reply itself.
CLIENT REPLY ON re-enables the replies, replying +OK.
CLIENT REPLY SKIP only discards the reply of the next command, and
like OFF does not reply anything itself.
The reason to add the SKIP command is that it allows to have an easy
way to send conceptually "single" commands that don't need a reply
as the sum of two pipelined commands:
CLIENT REPLY SKIP
SET key value
Note that CLIENT REPLY ON replies with +OK so it should be used when
sending multiple commands that don't need a reply. However since it
replies with +OK the client can check that the connection is still
active and all the previous commands were received.
This is currently just into Redis "unstable" so the proposal can be
modified or abandoned based on users inputs.
After the introduction of the list with clients with pending writes, to
process clients incrementally outside of the event loop we also need to
process the pending writes list.
Talking with @oranagra we had to reason a little bit to understand if
this function could ever flush the output buffers of the wrong slaves,
having online state but actually not being ready to receive writes
before the first ACK is received from them (this happens with diskless
replication).
Next time we'll just read this comment.
Add the concept of slaves capabilities to Redis, the slave now presents
to the Redis master with a set of capabilities in the form:
REPLCONF capa SOMECAPA capa OTHERCAPA ...
This has the effect of setting slave->slave_capa with the corresponding
SLAVE_CAPA macros that the master can test later to understand if it
the slave will understand certain formats and protocols of the
replication process. This makes it much simpler to introduce new
replication capabilities in the future in a way that don't break old
slaves or masters.
This patch was designed and implemented together with Oran Agra
(@oranagra).
1. We no longer use a fake client but just rewriting.
2. We group all the inserts into a single ZADD dispatch (big speed win).
3. As a side effect of the correct implementation, replication works.
4. The return value of the command is now correct.
When we fail to setup the write handler it does not make sense to take
the client around, it is missing writes: whatever is a client or a slave
anyway the connection should terminated ASAP.
Moreover what the function does exactly with its return value, and in
which case the write handler is installed on the socket, was not clear,
so the functions comment are improved to make the goals of the function
more obvious.
Also related to #2485.
master was closing the connection if the RDB transfer took long time.
and also sent PINGs to the slave before it got the initial ACK, in which case the slave wouldn't be able to find the EOF marker.
1. No need to set btype in processUnblockedClients(), since clients
flagged REDIS_UNBLOCKED should have it already cleared.
2. When putting clients in the unblocked clients list, clientsArePaused()
should flag them with REDIS_UNBLOCKED. Not strictly needed with the
current code but is more coherent.
When the list of unblocked clients were processed, btype was set to
blocking type none, but the client remained flagged with REDIS_BLOCKED.
When timeout is reached (or when the client disconnects), unblocking it
will trigger an assertion.
There is no need to process pending requests from blocked clients, so
now clientsArePaused() just avoid touching blocked clients.
Close#2467.
read() and write() return ssize_t (signed long), not int.
For other offsets, we can use the unsigned size_t type instead
of a signed offset (since our replication offsets and buffer
positions are never negative).
Track bandwidth used by clients and replication (but diskless
replication is not tracked since the actual transfer happens in the
child process).
This includes a refactoring that makes tracking new instantaneous
metrics simpler.
zmalloc(0) cauesd to actually trigger a non-zero allocation since with
standard libc malloc we have our own zmalloc header for memory tracking,
but at the same time the returned pointer is at the end of the block and
not in the middle. This triggers a false positive when testing with
valgrind.
When the inline protocol args count is 0, we now avoid reallocating
c->argv, preventing the issue to happen.
RDB EOF detection was relying on the final part of the RDB transfer to
be a magic 40 bytes EOF marker. However as the slave is put online
immediately, and because of sockets timeouts, the replication stream is
actually contiguous with the RDB file.
This means that to detect the EOF correctly we should either:
1) Scan all the stream searching for the mark. Sucks CPU-wise.
2) Start to send the replication stream only after an acknowledge.
3) Implement a proper chunked encoding.
For now solution "2" was picked, so the master does not start to send
ASAP the stream of commands in the case of diskless replication. We wait
for the first REPLCONF ACK command from the slave, that certifies us
that the slave correctly loaded the RDB file and is ready to get more
data.
The code tested many times if a client had active Pub/Sub subscriptions
by checking the length of a list and dictionary where the patterns and
channels are stored. This was substituted with a client flag called
REDIS_PUBSUB that is simpler to test for. Moreover in order to manage
this flag some code was refactored.
This commit is believed to have no effects in the behavior of the
server.
Technically the problem is due to the client type API that does not
return a special value for the master, however fixing it locally in the
CLIENT KILL command is better currently because otherwise we would
introduce a new output buffer limit class as a side effect.
This will be used by CLIENT KILL and is also a good way to ensure a
given client is still the same across CLIENT LIST calls.
The output of CLIENT LIST was modified to include the new ID, but this
change is considered to be backward compatible as the API does not imply
you can do positional parsing, since each filed as a different name.
Because of output buffer limits Redis internals had this idea of type of
clients: normal, pubsub, slave. It is possible to set different output
buffer limits for the three kinds of clients.
However all the macros and API were named after output buffer limit
classes, while the idea of a client type is a generic one that can be
reused.
This commit does two things:
1) Rename the API and defines with more general names.
2) Change the class of clients executing the MONITOR command from "slave"
to "normal".
"2" is a good idea because you want to have very special settings for
slaves, that are not a good idea for MONITOR clients that are instead
normal clients even if they are conceptually slave-alike (since it is a
push protocol).
The backward-compatibility breakage resulting from "2" is considered to
be minimal to care, since MONITOR is a debugging command, and because
anyway this change is not going to break the format or the behavior, but
just when a connection is closed on big output buffer issues.
This commit adds peer ID caching in the client structure plus an API
change and the use of sdsMakeRoomFor() in order to improve the
reallocation pattern to generate the CLIENT LIST output.
Both the changes account for a very significant speedup.
When we are blocked and a few events a processed from time to time, it
is smarter to call the event handler a few times in order to handle the
accept, read, write, close cycle of a client in a single pass, otherwise
there is too much latency added for clients to receive a reply while the
server is busy in some way (for example during the DB loading).
When the listening sockets readable event is fired, we have the chance
to accept multiple clients instead of accepting a single one. This makes
Redis more responsive when there is a mass-connect event (for example
after the server startup), and in workloads where a connect-disconnect
pattern is used often, so that multiple clients are waiting to be
accepted continuously.
As a side effect, this commit makes the LOADING, BUSY, and similar
errors much faster to deliver to the client, making Redis more
responsive when there is to return errors to inform the clients that the
server is blocked in an not interruptible operation.
When we set a protocol error we should return with REDIS_ERR to let the
caller know it should stop processing the client.
Bug found in a code auditing related to issue #1699.
The API is one of the bulding blocks of CLUSTER FAILOVER command that
executes a manual failover in Redis Cluster. However exposed as a
command that the user can call directly, it makes much simpler to
upgrade a standalone Redis instance using a slave in a safer way.
The commands works like that:
CLIENT PAUSE <milliesconds>
All the clients that are not slaves and not in MONITOR state are paused
for the specified number of milliesconds. This means that slaves are
normally served in the meantime.
At the end of the specified amount of time all the clients are unblocked
and will continue operations normally. This command has no effects on
the population of the slow log, since clients are not blocked in the
middle of operations but only when there is to process new data.
Note that while the clients are unblocked, still new commands are
accepted and queued in the client buffer, so clients will likely not
block while writing to the server while the pause is active.
A client can enter a special cluster read-only mode using the READONLY
command: if the client read from a slave instance after this command,
for slots that are actually served by the instance's master, the queries
will be processed without redirection, allowing clients to read from
slaves (but without any kind fo read-after-write guarantee).
The READWRITE command can be used in order to exit the readonly state.
Starting with Redis 2.8 masters are able to detect timed out slaves,
while before 2.8 only slaves were able to detect a timed out master.
Now that timeout detection is bi-directional the following problem
happens as described "in the field" by issue #1449:
1) Master and slave setup with big dataset.
2) Slave performs the first synchronization, or a full sync
after a failed partial resync.
3) Master sends the RDB payload to the slave.
4) Slave loads this payload.
5) Master detects the slave as timed out since does not receive back the
REPLCONF ACK acknowledges.
Here the problem is that the master has no way to know how much the
slave will take to load the RDB file in memory. The obvious solution is
to use a greater replication timeout setting, but this is a shame since
for the 0.1% of operation time we are forced to use a timeout that is
not what is suited for 99.9% of operation time.
This commit tries to fix this problem with a solution that is a bit of
an hack, but that modifies little of the replication internals, in order
to be back ported to 2.8 safely.
During the RDB loading time, we send the master newlines to avoid
being sensed as timed out. This is the same that the master already does
while saving the RDB file to still signal its presence to the slave.
The single newline is used because:
1) It can't desync the protocol, as it is only transmitted all or
nothing.
2) It can be safely sent while we don't have a client structure for the
master or in similar situations just with write(2).
Since we started sending REPLCONF ACK from slaves to masters, the
lastinteraction field of the client structure is always refreshed as
soon as there is room in the socket output buffer, so masters in timeout
are detected with too much delay (the socket buffer takes a lot of time
to be filled by small REPLCONF ACK <number> entries).
This commit only counts data received as interactions with a master,
solving the issue.
During the replication full resynchronization process, the RDB file is
transfered from the master to the slave. However there is a short
preamble to send, that is currently just the bulk payload length of the
file in the usual Redis form $..length..<CR><LF>.
This preamble used to be sent with a direct write call, assuming that
there was alway room in the socket output buffer to hold the few bytes
needed, however this does not scale in case we'll need to send more
stuff, and is not very robust code in general.
This commit introduces a more general mechanism to send a preamble up to
2GB in size (the max length of an sds string) in a non blocking way.
Actaully the string is modified in-place and a reallocation is never
needed, so there is no need to return the new sds string pointer as
return value of the function, that is now just "void".
Now that EMBSTR encoding exists we calculate the amount of memory used
by the SDS part of a Redis String object in two different ways:
1) For raw string object, the size of the allocation is considered.
2) For embstr objects, the length of the string itself is used.
The new function takes care of this logic.
This function missed proper handling of reply_bytes when gluing to the
previous object was used. The issue was introduced with the EMBSTR new
string object encoding.
This fixes issue #1208.
Previously two string encodings were used for string objects:
1) REDIS_ENCODING_RAW: a string object with obj->ptr pointing to an sds
stirng.
2) REDIS_ENCODING_INT: a string object where the obj->ptr void pointer
is casted to a long.
This commit introduces a experimental new encoding called
REDIS_ENCODING_EMBSTR that implements an object represented by an sds
string that is not modifiable but allocated in the same memory chunk as
the robj structure itself.
The chunk looks like the following:
+--------------+-----------+------------+--------+----+
| robj data... | robj->ptr | sds header | string | \0 |
+--------------+-----+-----+------------+--------+----+
| ^
+-----------------------+
The robj->ptr points to the contiguous sds string data, so the object
can be manipulated with the same functions used to manipulate plan
string objects, however we need just on malloc and one free in order to
allocate or release this kind of objects. Moreover it has better cache
locality.
This new allocation strategy should benefit both the memory usage and
the performances. A performance gain between 60 and 70% was observed
during micro-benchmarks, however there is more work to do to evaluate
the performance impact and the memory usage behavior.
There are systems that when printing +/- infinte with printf-family
functions will not use the usual "inf" "-inf", but different strings.
Handle that explicitly.
Fixes issue #930.
The function returns an unique identifier for the client, as ip:port for
IPv4 and IPv6 clients, or as path:0 for Unix socket clients.
See the top comment in the function for more info.
Any places which I feel might want to be updated to work differently
with IPv6 have been marked with a comment starting "IPV6:".
Currently the only comments address places where an IP address is
combined with a port using the standard : separated form. These may want
to be changed when printing IPv6 addresses to wrap the address in []
such as
[2001:db8::c0:ffee]:6379
instead of
2001:db8::c0:ffee:6379
as the latter format is a technically valid IPv6 address and it is hard
to distinguish the IPv6 address component from the port unless you know
the port is supposed to be there.
In two places buffers have been created with a size of 128 bytes which
could be reduced to INET6_ADDRSTRLEN to still hold a full IP address.
These places have been marked as they are presently big enough to handle
the needs of storing a printable IPv6 address.
This feature allows the user to specify the minimum number of
connected replicas having a lag less or equal than the specified
amount of seconds for writes to be accepted.