redict/tests/unit/latency-monitor.tcl
filipe oliveira 5dd15443ac
Added INFO LATENCYSTATS section: latency by percentile distribution/latency by cumulative distribution of latencies (#9462)
# Short description

The Redis extended latency stats track per-command latencies and enable:
- exporting the per-command percentile distribution via the `INFO LATENCYSTATS` command.
  **(The percentile distribution is not mergeable between cluster nodes.)**
- exporting the per-command cumulative latency distributions via the `LATENCY HISTOGRAM` command.
  Using the cumulative distribution of latencies, we can merge stats from different cluster nodes
  to calculate aggregate metrics.
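
As a rough illustration of why cumulative distributions can be merged, the sketch below sums cumulative counts bucket by bucket. The proc name `merge_cumulative_histograms` and the dict layout (bucket upper bound mapped to cumulative call count) are assumptions made for this example, not part of the server or the test suite.

```tcl
# Hypothetical helper: merge cumulative latency histograms from several nodes.
# Each histogram is a Tcl dict: bucket upper bound -> cumulative call count.
proc merge_cumulative_histograms {histograms} {
    # Collect every bucket bound reported by any node.
    set bounds {}
    foreach h $histograms {
        foreach b [dict keys $h] {
            if {$b ni $bounds} {lappend bounds $b}
        }
    }
    set bounds [lsort -integer $bounds]

    set merged [dict create]
    foreach b $bounds {
        set total 0
        foreach h $histograms {
            # A node that did not print bucket b contributes the cumulative
            # count of its largest bucket at or below b (0 if there is none).
            set count 0
            foreach hb [lsort -integer [dict keys $h]] {
                if {$hb <= $b} {set count [dict get $h $hb]} else {break}
            }
            incr total $count
        }
        dict set merged $b $total
    }
    return $merged
}

# Made-up sample data for two nodes.
puts [merge_cumulative_histograms [list {8 3 16 5} {16 1 32 4}]]
# prints: 8 3 16 6 32 9
```

Percentiles cannot be combined this way, because a percentile from one node says nothing about how many calls fell into each latency range.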

By default, the extended latency monitoring is enabled since the overhead of keeping track of the
command latency is very small.
 
If you don't want to track extended latency metrics, you can disable tracking at runtime using the command:
- `CONFIG SET latency-tracking no`

By default, the exported latency percentiles are the p50, p99, and p999.
You can alter them at runtime using the command:
- `CONFIG SET latency-tracking-info-percentiles "0.0 50.0 100.0"`


## Some details:
- The total size per histogram should be around 40 KiB. Those 40 KiB are allocated only when a command
  is called for the first time.
- Regarding write overhead: no measurable overhead was observed on the achievable ops/sec or on the
  full latency spectrum on the client, including in redis-benchmark runs comparing unstable against
  this branch.
- Latencies are tracked from 1 nanosecond to 1 second (everything above 1 second is considered +Inf).

## `INFO LATENCYSTATS` exposition format

   - Format: `latency_percentiles_usec_<CMDNAME>:p0=XX,p50....` 
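
A line in this format can be split into a command name and a percentile map with plain string operations. This is a minimal sketch: the proc name `parse_latencystats_line` is hypothetical, and the sample line uses made-up values.

```tcl
# Hypothetical helper: parse one INFO LATENCYSTATS line into
# {command-name percentile-dict}.
proc parse_latencystats_line {line} {
    set prefix "latency_percentiles_usec_"
    set line [string trim $line]
    set sep [string first ":" $line]
    if {[string first $prefix $line] != 0 || $sep < 0} {
        error "not a latency percentiles line: $line"
    }
    # The command name sits between the prefix and the colon.
    set cmd [string range $line [string length $prefix] [expr {$sep - 1}]]
    set percentiles [dict create]
    # The rest is a comma-separated list of name=value pairs.
    foreach pair [split [string range $line [expr {$sep + 1}] end] ","] {
        lassign [split $pair "="] name value
        dict set percentiles $name $value
    }
    return [list $cmd $percentiles]
}

# Illustrative input only; the values are invented.
lassign [parse_latencystats_line \
    "latency_percentiles_usec_set:p50=1.003,p99=4.015,p99.9=4.015"] cmd pct
puts "$cmd p99=[dict get $pct p99]"
# prints: set p99=4.015
```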

## `LATENCY HISTOGRAM [command ...]` exposition format

Return a cumulative distribution of latencies in the format of a histogram for the specified command names.

The histogram is composed of a map of time buckets:
- Each representing a latency range, between 1 nanosecond and roughly 1 second.
- Each bucket covers twice the previous bucket's range.
- Empty buckets are not printed.
- Everything above 1 sec is considered +Inf.
- At most there will be ceil(log2(1000000000)) = 30 buckets.
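
The bucket count can be checked with a few lines of Tcl. This sketch only reproduces the doubling arithmetic described in the list; the proc name `latency_bucket_bounds` is hypothetical, and the server's actual bucket boundaries may differ.

```tcl
# Hypothetical helper: list bucket upper bounds that double from min_ns
# until the bound reaches or exceeds max_ns (both in nanoseconds).
proc latency_bucket_bounds {{min_ns 1} {max_ns 1000000000}} {
    set bounds {}
    set bound $min_ns
    while {$bound < $max_ns} {
        set bound [expr {$bound * 2}]
        lappend bounds $bound
    }
    return $bounds
}

set bounds [latency_bucket_bounds]
puts "[llength $bounds] buckets, last upper bound [lindex $bounds end] ns"
# prints: 30 buckets, last upper bound 1073741824 ns
```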

The reply is a map for each command, in the format:
`<command name> : { calls : <total command calls> , histogram : { <bucket 1> : latency , <bucket 2> : latency, ... } }`
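
In Tcl, a flattened reply of this shape can be read as a nested dict. The sketch below assumes the histogram key is `histogram_usec`, which is the name the assertions in this test file match against; the proc name `latency_histogram_summary` and the sample reply are made up for illustration.

```tcl
# Hypothetical helper: summarize a LATENCY HISTOGRAM reply as
# command -> {calls N buckets M}.
proc latency_histogram_summary {reply} {
    set summary [dict create]
    dict for {cmd stats} $reply {
        set calls [dict get $stats calls]
        # Assumed key name, taken from the assertions in this test file.
        set buckets [dict get $stats histogram_usec]
        dict set summary $cmd [list calls $calls buckets [dict size $buckets]]
    }
    return $summary
}

# Invented sample reply: two commands with a few non-empty buckets each.
set reply {set {calls 2 histogram_usec {1 1 2 2}} get {calls 1 histogram_usec {1 1}}}
puts [dict get [latency_histogram_summary $reply] set]
# prints: calls 2 buckets 2
```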

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-01-05 14:01:05 +02:00


proc latency_histogram {cmd} {
    return [lindex [r latency histogram $cmd] 1]
}

start_server {tags {"latency-monitor needs:latency"}} {
    # Set a threshold high enough to avoid spurious latency events.
    r config set latency-monitor-threshold 200
    r latency reset

    test {LATENCY HISTOGRAM with empty histogram} {
        r config resetstat
        assert_match {} [latency_histogram set]
        assert {[llength [r latency histogram]] == 0}
    }

    test {LATENCY HISTOGRAM all commands} {
        r config resetstat
        r set a b
        r set c d
        assert_match {calls 2 histogram_usec *} [latency_histogram set]
    }

    test {LATENCY HISTOGRAM with a subset of commands} {
        r config resetstat
        r set a b
        r set c d
        r get a
        r hset f k v
        r hgetall f
        assert_match {calls 2 histogram_usec *} [latency_histogram set]
        assert_match {calls 1 histogram_usec *} [latency_histogram hset]
        assert_match {calls 1 histogram_usec *} [latency_histogram hgetall]
        assert_match {calls 1 histogram_usec *} [latency_histogram get]
        assert {[llength [r latency histogram]] == 8}
        assert {[llength [r latency histogram set get]] == 4}
    }

    test {LATENCY HISTOGRAM command} {
        r config resetstat
        r set a b
        r get a
        assert {[llength [r latency histogram]] == 4}
        assert {[llength [r latency histogram set get]] == 4}
    }

    test {LATENCY HISTOGRAM with wrong command name skips the invalid one} {
        r config resetstat
        assert {[llength [r latency histogram blabla]] == 0}
        assert {[llength [r latency histogram blabla blabla2 set get]] == 0}
        r set a b
        r get a
        assert_match {calls 1 histogram_usec *} [lindex [r latency histogram blabla blabla2 set get] 1]
        assert_match {calls 1 histogram_usec *} [lindex [r latency histogram blabla blabla2 set get] 3]
        assert {[string length [r latency histogram blabla set get]] > 0}
    }

    test {Test latency events logging} {
        r debug sleep 0.3
        after 1100
        r debug sleep 0.4
        after 1100
        r debug sleep 0.5
        # Compare the number of history entries, not the list itself.
        assert {[llength [r latency history command]] >= 3}
    } {} {needs:debug}

    test {LATENCY HISTORY output is ok} {
        set min 250
        set max 450
        foreach event [r latency history command] {
            lassign $event time latency
            if {!$::no_latency} {
                assert {$latency >= $min && $latency <= $max}
            }
            incr min 100
            incr max 100
            set last_time $time ; # Used in the next test
        }
    }

    test {LATENCY LATEST output is ok} {
        foreach event [r latency latest] {
            lassign $event eventname time latency max
            assert {$eventname eq "command"}
            if {!$::no_latency} {
                # Logical AND; the original bitwise & gave the same result on 0/1 operands.
                assert {$max >= 450 && $max <= 650}
                assert {$time == $last_time}
            }
            break
        }
    }

    test {LATENCY of expire events are correctly collected} {
        r config set latency-monitor-threshold 20
        r flushdb
        if {$::valgrind} {set count 100000} else {set count 1000000}
        r eval {
            local i = 0
            while (i < tonumber(ARGV[1])) do
                redis.call('sadd',KEYS[1],i)
                i = i+1
            end
        } 1 mybigkey $count
        r pexpire mybigkey 50
        wait_for_condition 5 100 {
            [r dbsize] == 0
        } else {
            fail "key wasn't expired"
        }
        assert_match {*expire-cycle*} [r latency latest]
    }

    test {LATENCY HISTORY / RESET with wrong event name is fine} {
        assert {[llength [r latency history blabla]] == 0}
        assert {[r latency reset blabla] == 0}
    }

    test {LATENCY DOCTOR produces some output} {
        assert {[string length [r latency doctor]] > 0}
    }

    test {LATENCY RESET is able to reset events} {
        assert {[r latency reset] > 0}
        assert {[r latency latest] eq {}}
    }

    test {LATENCY HELP should not have unexpected options} {
        catch {r LATENCY help xxx} e
        assert_match "*wrong number of arguments*" $e
    }
}