Commit 45c3745

stats: metric expiry (#40395)
Change-Id: If3a45283b13cfda7d4f9a7bb661a1573f552ed7e

Commit Message: Introduce mark and sweep eviction of stale metrics in a stats scope.

Additional Description: The intended use case is the high cardinality metrics generated from the request data (e.g. [Istio standard metrics](https://istio.io/latest/docs/reference/config/metrics/)). This, in combination with the cardinality bounds (future PR), would ensure bounded metric resource usage.

The algorithm works as follows:

1. An "evictable" scope is allocated by a filter.
2. A delta stats sink is configured, e.g. OTLP.
3. At every flush interval, a scope metric that is used (e.g. has observed a data point) is marked as unused. A metric that has not been used is deleted from the central caches.
4. A notification is sent to all workers to purge the scope's stale metrics from their thread-local caches.
5. Once all workers complete, the unused metrics are purged from the allocator.

There are several edge conditions that need to be explained to validate the correctness of this algorithm:

1. A worker attempting to use a stale metric after (3) but before (4) might have its data lost. It will not be lost if 1) the same metric is recreated in the central cache by another worker, since all metrics are uniquely indexed in the allocators; or 2) we implement deferred allocator deletions that wait for the flush operation.
2. A worker should not use a stored stale metric after (4). This requires workers not to store the metrics by reference (hence, this solution will not work for most xDS metrics). Thread-local cache references are always deleted before the storage is deleted.
3. Histograms are handled slightly differently because the parent histogram needs to be "merged" to observe usage, and clearing the usage requires updating all "children" histograms. Because we do this during flush, merging is always done first.
4. A metric that is re-created after eviction would continue to have its start time set to that of the original metric. This is a limitation of Envoy since it does not store the metric start times, but it is not an issue with delta aggregation in OTLP. Delta is the recommended protocol for handling high cardinality or sparse metric data. We could add start_time in a follow-up.

Risk Level: low, requires explicit usage
Testing: unit and a load test with Istio Proxy
Docs Changes: none
Release Notes: none

---------

Signed-off-by: Kuat Yessenov <[email protected]>
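The mark and sweep pass in step 3 can be illustrated with a minimal single-threaded sketch. `ToyCounter` and `ToyScope` are hypothetical stand-ins invented for this illustration, not Envoy classes, and the sketch leaves out the thread-local caches and allocator purge of steps 4 and 5.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a counter; not Envoy's Stats::Counter.
struct ToyCounter {
  uint64_t value = 0;
  bool used = false;
  void inc() {
    ++value;
    used = true; // Observing a data point marks the metric as used.
  }
};

// Hypothetical stand-in for the central cache of an evictable scope.
class ToyScope {
public:
  // Metrics are looked up by name on each use and created on demand.
  ToyCounter& counter(const std::string& name) {
    auto& slot = counters_[name];
    if (!slot) {
      slot = std::make_unique<ToyCounter>();
    }
    return *slot;
  }

  // One eviction pass: delete metrics not used since the previous pass (sweep)
  // and clear the used bit on the rest (mark). Returns the evicted names.
  std::vector<std::string> evictUnused() {
    std::vector<std::string> evicted;
    for (auto it = counters_.begin(); it != counters_.end();) {
      if (!it->second->used) {
        evicted.push_back(it->first);
        it = counters_.erase(it);
      } else {
        it->second->used = false;
        ++it;
      }
    }
    return evicted;
  }

  std::size_t size() const { return counters_.size(); }
  bool contains(const std::string& name) const { return counters_.count(name) > 0; }

private:
  std::unordered_map<std::string, std::unique_ptr<ToyCounter>> counters_;
};
```

In this sketch a metric survives exactly one pass without updates: the first pass clears its used bit and the second pass deletes it unless it was incremented in between.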
1 parent ca2cc3d commit 45c3745

26 files changed, +380 -23 lines changed

api/envoy/config/bootstrap/v3/bootstrap.proto

Lines changed: 9 additions & 1 deletion
@@ -41,7 +41,7 @@ option (udpa.annotations.file_status).package_version_status = ACTIVE;
 // <config_overview_bootstrap>` for more detail.

 // Bootstrap :ref:`configuration overview <config_overview_bootstrap>`.
-// [#next-free-field: 42]
+// [#next-free-field: 43]
 message Bootstrap {
   option (udpa.annotations.versioning).previous_message_type =
       "envoy.config.bootstrap.v2.Bootstrap";
@@ -230,6 +230,14 @@ message Bootstrap {
     bool stats_flush_on_admin = 29 [(validate.rules).bool = {const: true}];
   }

+  oneof stats_eviction {
+    // Optional duration to perform metric eviction. At every interval, during the stats flush
+    // the unused metrics are removed from the worker caches and the used metrics
+    // are marked as unused. Must be a multiple of the ``stats_flush_interval``.
+    google.protobuf.Duration stats_eviction_interval = 42
+        [(validate.rules).duration = {gte {nanos: 1000000}}];
+  }
+
   // Optional watchdog configuration.
   // This is for a single watchdog configuration for the entire system.
   // Deprecated in favor of ``watchdogs`` which has finer granularity.
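
The field comment requires ``stats_eviction_interval`` to be a multiple of ``stats_flush_interval``. The sketch below shows one way such a check could be written. `evictionMultiple` is a hypothetical helper named for this illustration; how the commit itself validates the two durations is not shown in this excerpt.

```cpp
#include <chrono>
#include <cstdint>
#include <optional>

// Hypothetical helper: returns N such that eviction runs on every Nth flush,
// or std::nullopt when the eviction interval is not a positive multiple of
// the flush interval.
std::optional<uint32_t> evictionMultiple(std::chrono::milliseconds flush_interval,
                                         std::chrono::milliseconds eviction_interval) {
  if (flush_interval.count() <= 0 || eviction_interval.count() <= 0) {
    return std::nullopt;
  }
  if (eviction_interval.count() % flush_interval.count() != 0) {
    return std::nullopt;
  }
  return static_cast<uint32_t>(eviction_interval.count() / flush_interval.count());
}
```

Under these assumptions, a 5 s flush interval with a 15 s eviction interval yields a multiple of 3, while a 7 s eviction interval is rejected.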

changelogs/current.yaml

Lines changed: 6 additions & 0 deletions
@@ -145,6 +145,12 @@ removed_config_or_runtime:
     Removed runtime guard ``envoy.reloadable_features.proxy_104`` and legacy code paths.

 new_features:
+- area: stats
+  change: |
+    Added support to remove unused metrics from memory for extensions that
+    support evictable metrics. This is done :ref:`periodically
+    <envoy_v3_api_field_config.bootstrap.v3.Bootstrap.stats_eviction_interval>`
+    during the metric flush.
 - area: quic
   change: |
     Added new option to support :ref:`base64 encoded server ID

envoy/server/configuration.h

Lines changed: 5 additions & 0 deletions
@@ -89,6 +89,11 @@ class StatsConfig {
    * @return true if deferred creation of stats is enabled.
    */
   virtual bool enableDeferredCreationStats() const PURE;
+
+  /**
+   * @return uint32_t a multiple of the flush interval to perform stats eviction, or 0 if disabled.
+   */
+  virtual uint32_t evictOnFlush() const PURE;
 };

 /**
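
A caller could consume the value returned by `evictOnFlush()` by counting flushes, as in the sketch below. `ToyFlusher` is a hypothetical class made up for this illustration; the flush code that actually reads `evictOnFlush()` is not part of this excerpt.

```cpp
#include <cstdint>

// Hypothetical flush driver: decides on each flush whether eviction should
// also run, given the multiple returned by a StatsConfig-like object.
class ToyFlusher {
public:
  explicit ToyFlusher(uint32_t evict_on_flush) : evict_on_flush_(evict_on_flush) {}

  // Called once per stats flush. Returns true on every Nth call, where N is
  // the configured multiple, and never when the multiple is 0 (disabled).
  bool onFlush() {
    if (evict_on_flush_ == 0) {
      return false;
    }
    if (++flushes_since_eviction_ < evict_on_flush_) {
      return false;
    }
    flushes_since_eviction_ = 0;
    return true;
  }

private:
  const uint32_t evict_on_flush_;
  uint32_t flushes_since_eviction_ = 0;
};
```

With a multiple of 3, the third, sixth, and ninth flushes trigger eviction in this sketch.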

envoy/stats/scope.h

Lines changed: 6 additions & 2 deletions
@@ -71,17 +71,21 @@ class Scope : public std::enable_shared_from_this<Scope> {
    * See also scopeFromStatName, which is preferred.
    *
    * @param name supplies the scope's namespace prefix.
+   * @param evictable whether unused metrics can be deleted from the scope caches. This requires
+   * that the metrics are not stored by reference.
    */
-  virtual ScopeSharedPtr createScope(const std::string& name) PURE;
+  virtual ScopeSharedPtr createScope(const std::string& name, bool evictable = false) PURE;

   /**
    * Allocate a new scope. NOTE: The implementation should correctly handle overlapping scopes
    * that point to the same reference counted backing stats. This allows a new scope to be
    * gracefully swapped in while an old scope with the same name is being destroyed.
    *
    * @param name supplies the scope's namespace prefix.
+   * @param evictable whether unused metrics can be deleted from the scope caches. This requires
+   * that the metrics are not stored by reference.
    */
-  virtual ScopeSharedPtr scopeFromStatName(StatName name) PURE;
+  virtual ScopeSharedPtr scopeFromStatName(StatName name, bool evictable = false) PURE;

   /**
    * Creates a Counter from the stat name. Tag extraction will be performed on the name.
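
The `evictable` parameter requires that callers do not store metrics by reference. A minimal sketch of a usage pattern that satisfies this follows; `ToyCounter`, `ToyEvictableScope`, and `ToyFilter` are hypothetical types written for this illustration and are not Envoy's API.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical counter; not Envoy's Stats::Counter.
struct ToyCounter {
  uint64_t value = 0;
  void inc() { ++value; }
};

// Hypothetical evictable scope that may delete metrics between uses.
class ToyEvictableScope {
public:
  ToyCounter& counter(const std::string& name) {
    auto& slot = counters_[name];
    if (!slot) {
      slot = std::make_unique<ToyCounter>();
    }
    return *slot;
  }
  // Stands in for an eviction pass that removes every metric.
  void evictAll() { counters_.clear(); }
  bool contains(const std::string& name) const { return counters_.count(name) > 0; }

private:
  std::map<std::string, std::unique_ptr<ToyCounter>> counters_;
};

// The filter keeps only the scope and the metric name. It resolves the counter
// on every request, so it never holds a reference across an eviction.
class ToyFilter {
public:
  ToyFilter(ToyEvictableScope& scope, std::string name)
      : scope_(scope), name_(std::move(name)) {}
  void onRequest() { scope_.counter(name_).inc(); }

private:
  ToyEvictableScope& scope_;
  const std::string name_;
};
```

Had `ToyFilter` cached a `ToyCounter&` at construction, that reference would dangle after `evictAll()`. The per-request lookup instead recreates the counter, which starts again from zero.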

envoy/stats/stats.h

Lines changed: 5 additions & 0 deletions
@@ -93,6 +93,11 @@ class Metric : public RefcountInterface {
    */
   virtual bool used() const PURE;

+  /**
+   * Clear any indicator on whether this metric has been updated.
+   */
+  virtual void markUnused() PURE;
+
   /**
    * Indicates whether this metric is hidden.
    */

envoy/stats/store.h

Lines changed: 8 additions & 1 deletion
@@ -117,6 +117,11 @@ class Store {
   virtual void forEachHistogram(SizeFn f_size, StatFn<ParentHistogram> f_stat) const PURE;
   virtual void forEachScope(SizeFn f_size, StatFn<const Scope> f_stat) const PURE;

+  /**
+   * Delete unused metrics from all the evictable scope caches, and mark the rest as unused.
+   */
+  virtual void evictUnused() PURE;
+
   /**
    * @return a null counter that will ignore increments and always return 0.
    */
@@ -172,7 +177,9 @@ class Store {
   /**
    * @return a scope of the given name.
    */
-  ScopeSharedPtr createScope(const std::string& name) { return rootScope()->createScope(name); }
+  ScopeSharedPtr createScope(const std::string& name, bool evictable = false) {
+    return rootScope()->createScope(name, evictable);
+  }

   /**
    * Extracts tags from the name and appends them to the provided StatNameTagVector.

source/common/stats/allocator_impl.cc

Lines changed: 1 addition & 0 deletions
@@ -85,6 +85,7 @@ template <class BaseClass> class StatsSharedImpl : public MetricImpl<BaseClass>
   // Metric
   SymbolTable& symbolTable() final { return alloc_.symbolTable(); }
   bool used() const override { return flags_ & Metric::Flags::Used; }
+  void markUnused() override { flags_ &= ~Metric::Flags::Used; }
   bool hidden() const override { return flags_ & Metric::Flags::Hidden; }

   // RefcountInterface
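
The `markUnused()` override clears a single bit and leaves the other flags alone. The sketch below shows the same bit manipulation in isolation. `ToyFlags` and `ToyMetric` are hypothetical, the bit values are invented for the example, and the use of `std::atomic<uint16_t>` for the flag storage is an assumption, since the declaration of `flags_` does not appear in this excerpt.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical flag layout; the real values of Metric::Flags are not shown here.
struct ToyFlags {
  static constexpr uint16_t Used = 0x01;
  static constexpr uint16_t Hidden = 0x02;
};

class ToyMetric {
public:
  // Recording a data point sets the Used bit.
  void inc() { flags_ |= ToyFlags::Used; }
  void hide() { flags_ |= ToyFlags::Hidden; }

  bool used() const { return flags_ & ToyFlags::Used; }
  bool hidden() const { return flags_ & ToyFlags::Hidden; }

  // Clears only the Used bit; every other flag keeps its value.
  void markUnused() { flags_ &= ~ToyFlags::Used; }

private:
  // Assumption: atomic storage, so that workers and the flush can touch the
  // flags concurrently.
  std::atomic<uint16_t> flags_{0};
};
```

After `inc()`, `hide()`, and `markUnused()`, the metric in this sketch reports `used() == false` and `hidden() == true`.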

source/common/stats/histogram_impl.h

Lines changed: 2 additions & 0 deletions
@@ -111,6 +111,7 @@ class HistogramImpl : public HistogramImplHelper {
   void recordValue(uint64_t value) override { parent_.deliverHistogramToSinks(*this, value); }

   bool used() const override { return true; }
+  void markUnused() override {}
   bool hidden() const override { return false; }
   SymbolTable& symbolTable() final { return parent_.symbolTable(); }

@@ -132,6 +133,7 @@ class NullHistogramImpl : public HistogramImplHelper {
   ~NullHistogramImpl() override { MetricImpl::clear(symbol_table_); }

   bool used() const override { return false; }
+  void markUnused() override {}
   bool hidden() const override { return false; }
   SymbolTable& symbolTable() override { return symbol_table_; }


source/common/stats/isolated_store_impl.cc

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -63,12 +63,12 @@ ConstScopeSharedPtr IsolatedStoreImpl::constRootScope() const {
6363

6464
IsolatedStoreImpl::~IsolatedStoreImpl() = default;
6565

66-
ScopeSharedPtr IsolatedScopeImpl::createScope(const std::string& name) {
66+
ScopeSharedPtr IsolatedScopeImpl::createScope(const std::string& name, bool) {
6767
StatNameManagedStorage stat_name_storage(Utility::sanitizeStatsName(name), symbolTable());
68-
return scopeFromStatName(stat_name_storage.statName());
68+
return scopeFromStatName(stat_name_storage.statName(), false);
6969
}
7070

71-
ScopeSharedPtr IsolatedScopeImpl::scopeFromStatName(StatName name) {
71+
ScopeSharedPtr IsolatedScopeImpl::scopeFromStatName(StatName name, bool) {
7272
SymbolTable::StoragePtr prefix_name_storage = symbolTable().join({prefix(), name});
7373
ScopeSharedPtr scope = store_.makeScope(StatName(prefix_name_storage.get()));
7474
addScopeToStore(scope);

source/common/stats/isolated_store_impl.h

Lines changed: 6 additions & 2 deletions
@@ -203,6 +203,10 @@ class IsolatedStoreImpl : public Store {
     }
   }

+  void evictUnused() override {
+    // Do nothing. Eviction is only supported on the thread local stores.
+  }
+
   void forEachSinkedCounter(SizeFn f_size, StatFn<Counter> f_stat) const override {
     forEachCounter(f_size, f_stat);
   }
@@ -295,8 +299,8 @@ class IsolatedScopeImpl : public Scope {
                                   StatNameTagVectorOptConstRef tags) override {
     return store_.counters_.get(prefix(), name, tags, symbolTable());
   }
-  ScopeSharedPtr createScope(const std::string& name) override;
-  ScopeSharedPtr scopeFromStatName(StatName name) override;
+  ScopeSharedPtr createScope(const std::string& name, bool evictable) override;
+  ScopeSharedPtr scopeFromStatName(StatName name, bool evictable) override;
   Gauge& gaugeFromStatNameWithTags(const StatName& name, StatNameTagVectorOptConstRef tags,
                                    Gauge::ImportMode import_mode) override {
     Gauge& gauge = store_.gauges_.get(prefix(), name, tags, symbolTable(), import_mode);
