Commit f9a1171

Summaries: Allow 0.0 and 1.0 quantiles and update documentation
Signed-off-by: Fabian Stäber <[email protected]>
1 parent fd9da3e commit f9a1171

4 files changed, +296 -88 lines changed

README.md

Lines changed: 57 additions & 28 deletions
Original file line number | Diff line number | Diff line change
@@ -130,57 +130,86 @@ when using this approach ensure the value you are reporting accounts for concurr
130130

131131
### Summary
132132

133-
Summaries track the size and number of events.
133+
Summaries and Histograms can both be used to monitor latencies (or other things like request sizes).
134+
135+
An overview of when to use Summaries and when to use Histograms can be found on [https://prometheus.io/docs/practices/histograms](https://prometheus.io/docs/practices/histograms).
136+
137+
The following example shows how to measure latencies and request sizes:
134138

135139
```java
136140
class YourClass {
137-
static final Summary receivedBytes = Summary.build()
138-
.name("requests_size_bytes").help("Request size in bytes.").register();
139-
static final Summary requestLatency = Summary.build()
140-
.name("requests_latency_seconds").help("Request latency in seconds.").register();
141141

142-
void processRequest(Request req) {
142+
private static final Summary requestLatency = Summary.build()
143+
.name("requests_latency_seconds")
144+
.help("request latency in seconds")
145+
.register();
146+
147+
private static final Summary receivedBytes = Summary.build()
148+
.name("requests_size_bytes")
149+
.help("request size in bytes")
150+
.register();
151+
152+
public void processRequest(Request req) {
143153
Summary.Timer requestTimer = requestLatency.startTimer();
144154
try {
145155
// Your code here.
146156
} finally {
147-
receivedBytes.observe(req.size());
148157
requestTimer.observeDuration();
158+
receivedBytes.observe(req.size());
149159
}
150160
}
151161
}
152162
```
153163

154-
There are utilities for timing code and support for [quantiles](https://prometheus.io/docs/practices/histograms/#quantiles).
155-
Essentially quantiles aren't aggregatable and add some client overhead for the calculation.
164+
The `Summary` class provides different utility methods for observing values, like `observe(double)`, `startTimer(); timer.observeDuration()`, `time(Callable)`, etc.
165+
166+
By default, `Summary` metrics provide the `count` and the `sum`. For example, if you measure latencies of a REST service, the `count` will tell you how often the REST service was called, and the `sum` will tell you the total aggregated response time. You can calculate the average response time using a Prometheus query dividing `sum / count`.
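As a minimal sketch of that arithmetic (the `AverageLatency` class and its `average` method are hypothetical helpers for illustration, not part of the client library), the average is the accumulated sum divided by the number of observations. In practice the division would normally be done in a Prometheus query rather than in application code.

```java
// Hypothetical illustration only: not part of io.prometheus.client.
class AverageLatency {

    /** Returns sum / count, or NaN if nothing has been observed yet. */
    static double average(double sum, long count) {
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        // Three calls taking 0.25 s, 0.5 s and 0.75 s.
        double[] latenciesSeconds = {0.25, 0.5, 0.75};
        double sum = 0.0;
        long count = 0;
        for (double latency : latenciesSeconds) {
            sum += latency; // corresponds to what the Summary reports as "sum"
            count++;        // corresponds to what the Summary reports as "count"
        }
        System.out.println("average latency in seconds: " + average(sum, count));
    }
}
```

Note that `sum` and `count` alone cannot tell you anything about the distribution of the values, which is where quantiles come in.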
167+
168+
In addition to `count` and `sum`, you can configure a Summary to provide quantiles:
156169

157170
```java
158-
class YourClass {
159-
static final Summary requestLatency = Summary.build()
160-
.quantile(0.5, 0.05) // Add 50th percentile (= median) with 5% tolerated error
161-
.quantile(0.9, 0.01) // Add 90th percentile with 1% tolerated error
162-
.name("requests_latency_seconds").help("Request latency in seconds.").register();
171+
Summary requestLatency = Summary.build()
172+
.name("requests_latency_seconds")
173+
.help("Request latency in seconds.")
174+
.quantile(0.5, 0.01) // 0.5 quantile (median) with 0.01 allowed error
175+
.quantile(0.95, 0.005) // 0.95 quantile with 0.005 allowed error
176+
// ...
177+
.register();
178+
```
163179

164-
void processRequest(Request req) {
165-
requestLatency.time(new Runnable() {
166-
public abstract void run() {
167-
// Your code here.
168-
}
169-
});
180+
As an example, a `0.95` quantile of `120ms` tells you that `95%` of the calls were faster than `120ms`, and `5%` of the calls were slower than `120ms`.
170181

182+
Tracking exact quantiles requires a large amount of memory, because all observations need to be stored in a sorted list. Therefore, we allow an error to significantly reduce memory usage.
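To make the memory cost concrete, here is a naive sketch of exact quantile tracking. The `ExactQuantiles` class is hypothetical and is not the algorithm used by the client library; it only illustrates that an exact answer requires retaining every observation in sorted order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, naive illustration: not the algorithm used by the client library.
class ExactQuantiles {

    // Every observation is retained, so memory grows linearly with the number of observations.
    private final List<Double> sorted = new ArrayList<>();

    void observe(double value) {
        int index = Collections.binarySearch(sorted, value);
        if (index < 0) {
            index = -index - 1; // binarySearch encodes the insertion point as -(point) - 1
        }
        sorted.add(index, value);
    }

    /** Returns the exact q-quantile (0 <= q <= 1), or NaN if there are no observations. */
    double get(double q) {
        if (sorted.isEmpty()) {
            return Double.NaN;
        }
        int rank = (int) Math.ceil(q * sorted.size()); // 1-based rank in the sorted list
        return sorted.get(Math.max(rank, 1) - 1);
    }

    int size() {
        return sorted.size();
    }

    public static void main(String[] args) {
        ExactQuantiles latencies = new ExactQuantiles();
        for (int i = 100; i >= 1; i--) {
            latencies.observe(i);
        }
        System.out.println("0.95 quantile: " + latencies.get(0.95));
        System.out.println("stored observations: " + latencies.size());
    }
}
```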
171183

172-
// Or the Java 8 lambda equivalent
173-
requestLatency.time(() -> {
174-
// Your code here.
175-
});
176-
}
177-
}
184+
In the example, the allowed error of `0.005` means that you will not get the exact `0.95` quantile, but anything between the `0.945` quantile and the `0.955` quantile.
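One way to read this is in terms of ranks in the sorted list of observations. The following sketch (the `AllowedRankRange` class is a hypothetical helper, not part of the library) computes the range of ranks that an answer may come from. With 10,000 observations, a `0.95` quantile with an allowed error of `0.005` may be any value ranked between about 9,450 and 9,550.

```java
// Hypothetical illustration only: not part of io.prometheus.client.
class AllowedRankRange {

    /** Lowest acceptable rank among n sorted observations, rounded to the nearest rank. */
    static long lowestRank(double quantile, double error, long n) {
        return Math.round(Math.max(0.0, quantile - error) * n);
    }

    /** Highest acceptable rank among n sorted observations, rounded to the nearest rank. */
    static long highestRank(double quantile, double error, long n) {
        return Math.round(Math.min(1.0, quantile + error) * n);
    }

    public static void main(String[] args) {
        long n = 10_000;
        System.out.println("acceptable ranks: "
                + lowestRank(0.95, 0.005, n) + " to " + highestRank(0.95, 0.005, n));
    }
}
```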
185+
186+
Experiments show that the `Summary` typically needs to keep fewer than 100 samples to provide that precision, even if you have hundreds of millions of observations.
187+
188+
There are a few special cases:
189+
190+
* You can set an allowed error of `0`, but then the `Summary` will keep all observations in memory.
191+
* You can track the minimum value with `.quantile(0, 0)`. This special case will not use additional memory even though the allowed error is `0`.
192+
* You can track the maximum value with `.quantile(1, 0)`. This special case will not use additional memory even though the allowed error is `0`.
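The reason the minimum and maximum need no extra memory can be sketched as follows. The `MinMaxTracker` class is hypothetical and only illustrates the idea that two running values are enough. It is not how the library implements these cases: in this commit, `CKMSQuantiles.get` answers them by returning the first and last stored sample. Unlike a `Summary`, this sketch also does not apply a sliding time window.

```java
// Hypothetical illustration only: not part of io.prometheus.client.
class MinMaxTracker {

    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private long count = 0;

    void observe(double value) {
        // Constant memory: only the running extremes are kept, never the observations themselves.
        min = Math.min(min, value);
        max = Math.max(max, value);
        count++;
    }

    /** Returns the smallest observed value, or NaN if nothing has been observed. */
    double getMin() {
        return count == 0 ? Double.NaN : min;
    }

    /** Returns the largest observed value, or NaN if nothing has been observed. */
    double getMax() {
        return count == 0 ? Double.NaN : max;
    }

    public static void main(String[] args) {
        MinMaxTracker tracker = new MinMaxTracker();
        for (double value : new double[] {3.0, 1.0, 2.0}) {
            tracker.observe(value);
        }
        System.out.println("min=" + tracker.getMin() + " max=" + tracker.getMax());
    }
}
```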
193+
194+
Typically, you don't want to have a `Summary` representing the entire runtime of the application, but you want to look at a reasonable time interval. `Summary` metrics implement a configurable sliding time window:
195+
196+
```java
197+
Summary requestLatency = Summary.build()
198+
.name("requests_latency_seconds")
199+
.help("Request latency in seconds.")
200+
.maxAgeSeconds(10 * 60)
201+
.ageBuckets(5)
202+
// ...
203+
.register();
178204
```
179205

206+
The default is a time window of 10 minutes and 5 age buckets, i.e. the time window is 10 minutes wide, and we slide it forward every 2 minutes.
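The 2 minute figure is simply the window width divided by the number of age buckets. The sketch below shows that arithmetic. The `SlidingWindowMath` class is a hypothetical helper, and the round-robin bucket selection in `activeBucket` is an assumption made for illustration, not a description of the library's internals.

```java
// Hypothetical illustration only: not part of io.prometheus.client.
class SlidingWindowMath {

    /** How often the window slides forward: window width divided by number of buckets. */
    static long rotationIntervalSeconds(long maxAgeSeconds, int ageBuckets) {
        return maxAgeSeconds / ageBuckets;
    }

    /** Assumed round-robin index of the bucket that is current after elapsedSeconds. */
    static int activeBucket(long elapsedSeconds, long maxAgeSeconds, int ageBuckets) {
        long interval = rotationIntervalSeconds(maxAgeSeconds, ageBuckets);
        return (int) ((elapsedSeconds / interval) % ageBuckets);
    }

    public static void main(String[] args) {
        long maxAgeSeconds = 10 * 60; // default window width from the text above
        int ageBuckets = 5;           // default number of age buckets from the text above
        System.out.println("rotation interval in seconds: "
                + rotationIntervalSeconds(maxAgeSeconds, ageBuckets));
    }
}
```

More age buckets make the window slide in smaller steps at the cost of maintaining more buckets.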
207+
180208
### Histogram
181209

182-
Histograms track the size and number of events in buckets.
183-
This allows for aggregatable calculation of quantiles.
210+
Like Summaries, Histograms can be used to monitor latencies (or other things like request sizes).
211+
212+
An overview of when to use Summaries and when to use Histograms can be found on [https://prometheus.io/docs/practices/histograms](https://prometheus.io/docs/practices/histograms).
184213

185214
```java
186215
class YourClass {

simpleclient/src/main/java/io/prometheus/client/CKMSQuantiles.java

Lines changed: 15 additions & 4 deletions
@@ -142,6 +142,14 @@ public double get(double q) {
142142
return Double.NaN;
143143
}
144144

145+
if (q == 0.0) {
146+
return samples.getFirst().value;
147+
}
148+
149+
if (q == 1.0) {
150+
return samples.getLast().value;
151+
}
152+
145153
int r = 0; // sum of g's left of the current sample
146154
int desiredRank = (int) Math.ceil(q * n);
147155

@@ -168,6 +176,9 @@ public double get(double q) {
168176
int f(int r) {
169177
int minResult = Integer.MAX_VALUE;
170178
for (Quantile q : quantiles) {
179+
if (q.quantile == 0 || q.quantile == 1) {
180+
continue;
181+
}
171182
int result;
172183
// We had a numerical error here with the following example:
173184
// quantile = 0.95, epsilon = 0.01, (n-r) = 30.
@@ -267,13 +278,13 @@ static class Quantile {
267278
final double v;
268279

269280
Quantile(double quantile, double epsilon) {
270-
if (quantile <= 0 || quantile >= 1.0) throw new IllegalArgumentException("Quantile must be between 0 and 1");
271-
if (epsilon <= 0 || epsilon >= 1.0) throw new IllegalArgumentException("Epsilon must be between 0 and 1");
281+
if (quantile < 0.0 || quantile > 1.0) throw new IllegalArgumentException("Quantile must be between 0 and 1");
282+
if (epsilon < 0.0 || epsilon > 1.0) throw new IllegalArgumentException("Epsilon must be between 0 and 1");
272283

273284
this.quantile = quantile;
274285
this.epsilon = epsilon;
275-
u = 2.0 * epsilon / (1.0 - quantile);
276-
v = 2.0 * epsilon / quantile;
286+
u = 2.0 * epsilon / (1.0 - quantile); // if quantile == 1 this will be Double.NaN (epsilon == 0) or Infinity (epsilon > 0)
287+
v = 2.0 * epsilon / quantile; // if quantile == 0 this will be Double.NaN (epsilon == 0) or Infinity (epsilon > 0)
277288
}
278289

279290
@Override

simpleclient/src/main/java/io/prometheus/client/Summary.java

Lines changed: 82 additions & 49 deletions
@@ -13,70 +13,91 @@
1313
import java.util.concurrent.TimeUnit;
1414

1515
/**
16-
* Summary metric, to track the size of events.
16+
* {@link Summary} metrics and {@link Histogram} metrics can both be used to monitor latencies (or other things like request sizes).
1717
* <p>
18-
* Example of uses for Summaries include:
19-
* <ul>
20-
* <li>Response latency</li>
21-
* <li>Request size</li>
22-
* </ul>
23-
*
18+
* An overview of when to use Summaries and when to use Histograms can be found on <a href="https://prometheus.io/docs/practices/histograms">https://prometheus.io/docs/practices/histograms</a>.
2419
* <p>
25-
* Example Summaries:
20+
* The following example shows how to measure latencies and request sizes:
21+
*
2622
* <pre>
27-
* {@code
28-
* class YourClass {
29-
* static final Summary receivedBytes = Summary.build()
30-
* .name("requests_size_bytes").help("Request size in bytes.").register();
31-
* static final Summary requestLatency = Summary.build()
32-
* .name("requests_latency_seconds").help("Request latency in seconds.").register();
23+
* class YourClass {
3324
*
34-
* void processRequest(Request req) {
35-
* Summary.Timer requestTimer = requestLatency.startTimer();
36-
* try {
37-
* // Your code here.
38-
* } finally {
39-
* receivedBytes.observe(req.size());
40-
* requestTimer.observeDuration();
41-
* }
42-
* }
25+
* private static final Summary requestLatency = Summary.build()
26+
* .name("requests_latency_seconds")
27+
* .help("request latency in seconds")
28+
* .register();
29+
*
30+
* private static final Summary receivedBytes = Summary.build()
31+
* .name("requests_size_bytes")
32+
* .help("request size in bytes")
33+
* .register();
4334
*
44-
* // Or if using Java 8 and lambdas.
45-
* void processRequestLambda(Request req) {
35+
* public void processRequest(Request req) {
36+
* Summary.Timer requestTimer = requestLatency.startTimer();
37+
* try {
38+
* // Your code here.
39+
* } finally {
40+
* requestTimer.observeDuration();
4641
* receivedBytes.observe(req.size());
47-
* requestLatency.time(() -> {
48-
* // Your code here.
49-
* });
5042
* }
51-
* }
43+
* }
5244
* }
5345
* </pre>
54-
* This would allow you to track request rate, average latency and average request size.
5546
*
47+
* The {@link Summary} class provides different utility methods for observing values, like {@link #observe(double)},
48+
* {@link #startTimer()} and {@link Timer#observeDuration()}, {@link #time(Callable)}, etc.
5649
* <p>
57-
* How to add custom quantiles:
50+
* By default, {@link Summary} metrics provide the <tt>count</tt> and the <tt>sum</tt>. For example, if you measure
51+
* latencies of a REST service, the <tt>count</tt> will tell you how often the REST service was called,
52+
* and the <tt>sum</tt> will tell you the total aggregated response time.
53+
* You can calculate the average response time using a Prometheus query dividing <tt>sum / count</tt>.
54+
* <p>
55+
* In addition to <tt>count</tt> and <tt>sum</tt>, you can configure a Summary to provide quantiles:
56+
*
5857
* <pre>
59-
* {@code
60-
* static final Summary myMetric = Summary.build()
61-
* .quantile(0.5, 0.05) // Add 50th percentile (= median) with 5% tolerated error
62-
* .quantile(0.9, 0.01) // Add 90th percentile with 1% tolerated error
63-
* .quantile(0.99, 0.001) // Add 99th percentile with 0.1% tolerated error
64-
* .name("requests_size_bytes")
65-
* .help("Request size in bytes.")
66-
* .register();
67-
* }
58+
* Summary requestLatency = Summary.build()
59+
* .name("requests_latency_seconds")
60+
* .help("Request latency in seconds.")
61+
* .quantile(0.5, 0.01) // 0.5 quantile (median) with 0.01 allowed error
62+
* .quantile(0.95, 0.005) // 0.95 quantile with 0.005 allowed error
63+
* // ...
64+
* .register();
6865
* </pre>
6966
*
70-
* The quantiles are calculated over a sliding window of time. There are two options to configure this time window:
67+
* As an example, a 0.95 quantile of 120ms tells you that 95% of the calls were faster than 120ms, and 5% of the calls were slower than 120ms.
68+
* <p>
69+
* Tracking exact quantiles requires a large amount of memory, because all observations need to be stored in a sorted list. Therefore, we allow an error to significantly reduce memory usage.
70+
* <p>
71+
* In the example, the allowed error of 0.005 means that you will not get the exact 0.95 quantile, but anything between the 0.945 quantile and the 0.955 quantile.
72+
* <p>
73+
* Experiments show that the {@link Summary} typically needs to keep fewer than 100 samples to provide that precision, even if you have hundreds of millions of observations.
74+
* <p>
75+
* There are a few special cases:
76+
*
7177
* <ul>
72-
* <li>maxAgeSeconds(long): Set the duration of the time window is, i.e. how long observations are kept before they are discarded.
73-
* Default is 10 minutes.
74-
* <li>ageBuckets(int): Set the number of buckets used to implement the sliding time window. If your time window is 10 minutes, and you have ageBuckets=5,
75-
* buckets will be switched every 2 minutes. The value is a trade-off between resources (memory and cpu for maintaining the bucket)
76-
* and how smooth the time window is moved. Default value is 5.
78+
* <li>You can set an allowed error of 0, but then the {@link Summary} will keep all observations in memory.</li>
79+
* <li>You can track the minimum value with <tt>.quantile(0.0, 0.0)</tt>.
80+
* This special case will not use additional memory even though the allowed error is 0.</li>
81+
* <li>You can track the maximum value with <tt>.quantile(1.0, 0.0)</tt>.
82+
* This special case will not use additional memory even though the allowed error is 0.</li>
7783
* </ul>
7884
*
79-
* See https://prometheus.io/docs/practices/histograms/ for more info on quantiles.
85+
* Typically, you don't want to have a {@link Summary} representing the entire runtime of the application,
86+
* but you want to look at a reasonable time interval. {@link Summary} metrics implement a configurable sliding
87+
* time window:
88+
*
89+
* <pre>
90+
* Summary requestLatency = Summary.build()
91+
* .name("requests_latency_seconds")
92+
* .help("Request latency in seconds.")
93+
* .maxAgeSeconds(10 * 60)
94+
* .ageBuckets(5)
95+
* // ...
96+
* .register();
97+
* </pre>
98+
*
99+
* The default is a time window of 10 minutes and 5 age buckets, i.e. the time window is 10 minutes wide, and
100+
* we slide it forward every 2 minutes.
80101
*/
81102
public class Summary extends SimpleCollector<Summary.Child> implements Counter.Describable {
82103

@@ -98,17 +119,25 @@ public static class Builder extends SimpleCollector.Builder<Builder, Summary> {
98119
private long maxAgeSeconds = TimeUnit.MINUTES.toSeconds(10);
99120
private int ageBuckets = 5;
100121

122+
/**
123+
* The class JavaDoc for {@link Summary} has more information on {@link #quantile(double, double)}.
124+
* @see Summary
125+
*/
101126
public Builder quantile(double quantile, double error) {
102-
if (quantile <= 0.0 || quantile >= 1.0) {
127+
if (quantile < 0.0 || quantile > 1.0) {
103128
throw new IllegalArgumentException("Quantile " + quantile + " invalid: Expected number between 0.0 and 1.0.");
104129
}
105-
if (error <= 0.0 || error >= 1.0) {
130+
if (error < 0.0 || error > 1.0) {
106131
throw new IllegalArgumentException("Error " + error + " invalid: Expected number between 0.0 and 1.0.");
107132
}
108133
quantiles.add(new Quantile(quantile, error));
109134
return this;
110135
}
111136

137+
/**
138+
* The class JavaDoc for {@link Summary} has more information on {@link #maxAgeSeconds(long)}.
139+
* @see Summary
140+
*/
112141
public Builder maxAgeSeconds(long maxAgeSeconds) {
113142
if (maxAgeSeconds <= 0) {
114143
throw new IllegalArgumentException("maxAgeSeconds cannot be " + maxAgeSeconds);
@@ -117,6 +146,10 @@ public Builder maxAgeSeconds(long maxAgeSeconds) {
117146
return this;
118147
}
119148

149+
/**
150+
* The class JavaDoc for {@link Summary} has more information on {@link #ageBuckets(int)}.
151+
* @see Summary
152+
*/
120153
public Builder ageBuckets(int ageBuckets) {
121154
if (ageBuckets <= 0) {
122155
throw new IllegalArgumentException("ageBuckets cannot be " + ageBuckets);
