document metrics temporal aggregation #63

Open
wants to merge 10 commits into
base: main
4 changes: 2 additions & 2 deletions README.md
@@ -20,8 +20,8 @@ These pages detail the components and how to configure the EDOT Collector.

- [Components](docs/collector-components.md): Get details on the components used to receive, process, and export telemetry data.
- [Guided onboarding](docs/guided-onboarding.md): Use the guided onboarding in Elasticsearch Service or a serverless Observability project to send data using the EDOT Collector.
- [Manual configurations](docs/manual-configuration.md): Manually configure the EDOT Collector to send data to Elastic Observability.
- [Limitations](docs/collector-limitations.md): Understand the current limitations of the EDOT Collector.
- [Manual configuration](docs/manual-configuration.md): Manually configure the EDOT Collector to send data to Elastic Observability.
- [Limitations](docs/limitations.md): Understand the current limitations of using EDOT Collector, data storage and querying.

## Unified Kubernetes Observability with Elastic Distributions of OpenTelemetry

11 changes: 0 additions & 11 deletions docs/collector-limitations.md

This file was deleted.

57 changes: 57 additions & 0 deletions docs/limitations.md
@@ -0,0 +1,57 @@
# Elastic Distribution of OpenTelemetry limitations

## Collector limitations

The Elastic Distribution of the OpenTelemetry Collector has the following limitations:

- Because of an upstream limitation, `host.network.*` metrics aren't present from the OpenTelemetry side.
- `process.state` isn't present in the OpenTelemetry host metric. It's set to a dummy value of **Unknown** in the **State** column of the host processes table.
- The Elasticsearch exporter handles the resource attributes, but **Host OS version** and **Operating system** may show as "N/A".
- The CPU scraper needs to be enabled to collect the `system.load.cores` metric, which affects the **Normalized Load** column in the **Hosts** table and the **Normalized Load** visualization on the host detailed view.
- The [`hostmetrics receiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/hostmetricsreceiver) doesn't support CPU and disk metrics on macOS. These values will stay empty for collectors running on macOS.
- The console shows error log messages when the [`hostmetrics receiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/hostmetricsreceiver) can't access some of the process information due to permission issues.
- The console shows mapping errors initially until mapping occurs.

## Metrics temporal aggregation
Member

This section is looking at the issue from an SDK point of view. For example, the OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE environment variable is only relevant to SDKs. Maybe point that out a little more clearly.


The OpenTelemetry metrics data model provides multiple ways to report metric temporality:
- cumulative (default)
- delta preferred
- low memory

A complete description and examples are provided in the [aggregation temporality documentation](https://opentelemetry.io/docs/specs/otel/metrics/supplementary-guidelines/#aggregation-temporality).

The effect of temporal aggregation depends on the OpenTelemetry metric type:

Gauges and up down counters always provide the "last value": the producers of those metrics only read
the last value; they don't keep track of the previous value or compute a delta.

| metric type / temporal aggregation | cumulative | delta preferred | low memory |
|------------------------------------|------------|-----------------|----------------------------------------------|
| gauge | last value | last value | last value |
| up down counter | last value | last value | last value |
| counter | cumulative | delta | synchronous: delta, asynchronous: cumulative |
| histogram | cumulative | delta | delta |
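
The table above can be read as a lookup from metric type and temporality preference to what gets reported. The following Python sketch encodes exactly that table. The function name `reported_temporality` and its string arguments are illustrative assumptions and aren't part of any OpenTelemetry API.

```python
# Illustrative lookup only: it mirrors the table above and is not an OpenTelemetry API.
PREFERENCES = ("cumulative", "delta preferred", "low memory")


def reported_temporality(metric_type: str, preference: str, asynchronous: bool = False) -> str:
    """Return what a producer reports for a metric type under a given preference."""
    if preference not in PREFERENCES:
        raise ValueError(f"unknown preference: {preference!r}")

    # Gauges and up down counters only ever expose the last observed value.
    if metric_type in ("gauge", "up down counter"):
        return "last value"

    if metric_type == "counter":
        if preference == "cumulative":
            return "cumulative"
        if preference == "delta preferred":
            return "delta"
        # "low memory": synchronous counters use delta, asynchronous ones cumulative.
        return "cumulative" if asynchronous else "delta"

    if metric_type == "histogram":
        return "cumulative" if preference == "cumulative" else "delta"

    raise ValueError(f"unknown metric type: {metric_type!r}")
```

For example, `reported_temporality("histogram", "delta preferred")` returns `"delta"`, while `reported_temporality("counter", "low memory", asynchronous=True)` returns `"cumulative"`.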

When metrics are stored in Elasticsearch with the `otel` mode,
OpenTelemetry metrics are written to a Time Series Data Stream (TSDS), which currently only supports delta histograms.

As a consequence, metrics sent to Elasticsearch currently need to use the "delta preferred" temporality for histograms to be stored properly;
otherwise the collector discards the histograms.

Setting `OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta` should configure SDKs to use the "delta preferred" temporality instead of the default
(see the [reference](https://github.com/open-telemetry/opentelemetry-specification/blob/main/spec-compliance-matrix.md#environment-variables) for supported SDKs).
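
As a minimal sketch of one way to apply this setting, the Python snippet below launches an instrumented application as a child process with the environment variable set. The helper names `delta_env` and `run_with_delta` are hypothetical, and whether the variable is honored still depends on the SDK used by the launched application.

```python
import os
import subprocess
from typing import Dict, List, Optional

TEMPORALITY_VAR = "OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE"


def delta_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the delta preference set."""
    env = dict(os.environ if base is None else base)
    env[TEMPORALITY_VAR] = "delta"
    return env


def run_with_delta(command: List[str]) -> subprocess.CompletedProcess:
    """Run a command (hypothetically, an instrumented app) with the delta preference."""
    return subprocess.run(command, env=delta_env(), check=True)
```

The variable must be set before the SDK in the child process initializes its metrics exporter, which is why the sketch sets it at launch time rather than later.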

If the producer of `counter` or `histogram` metrics can't be configured with the `delta preferred` behavior to report them with `delta`
Member

I'm not sure we should recommend converting counter metrics to delta temporality. There are challenges with visualizing counter metrics today, but I'm not sure it's always worth doing a cumulative-to-delta conversion to avoid them. Plus, ES|QL will be enhanced with better support for counter rates. What remains is that queries need to be aware of the temporality of the metrics.
Also, some of our default dashboards for k8s do expect counters to be sent in the default cumulative temporality.

Member Author

Histograms are the only reason we are asking for "delta preferred", because we can't store them otherwise.

This change has a side effect on all the counter visualizations because we don't have a nice way to handle that with a separate setting. ES|QL support might be enhanced to allow that, but so far we don't have any ETA.

If we already have dashboards that rely on cumulative counters, then we need to not apply this conversion, which means we need to instruct users to apply this conversion to some metrics and not others, which brings another layer of complexity to the end-user.

In a sense, what we need is to apply cumulativetodeltaprocessor only for cumulative histograms at the collector level, ideally close to the edge, or even better allow SDKs to apply this only to histograms without changing anything for counters because it would break the visualizations.

I wonder if contributing the ability to set the time aggregation per metric type would be less painful than the path we are trying to take here.

temporal aggregation, the collector [`cumulativetodelta`](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/cumulativetodeltaprocessor)
processor can be used to convert from `cumulative` to `delta`.

Using [`cumulativetodelta`](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/cumulativetodeltaprocessor)
does, however, involve some challenges because it makes the processor stateful:

- metrics from a given producer must be sent to the same collector instance
- memory usage increases to keep track of per-metric state
- metrics need to be configured at the collector level to opt in to or out of this processing
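
To illustrate why the conversion is stateful, here is a toy Python sketch of a cumulative-to-delta converter. It is not the `cumulativetodelta` processor's implementation: the class name, the handling of the first data point (no delta emitted), and the handling of counter resets (the new value is treated as the delta) are all assumptions made for this example.

```python
from typing import Dict, Hashable, Optional


class CumulativeToDelta:
    """Toy converter that remembers the last cumulative value of each series."""

    def __init__(self) -> None:
        # Per-series state: this is what makes the conversion stateful
        # and what grows memory usage with the number of series.
        self._last: Dict[Hashable, float] = {}

    def convert(self, series: Hashable, cumulative: float) -> Optional[float]:
        previous = self._last.get(series)
        self._last[series] = cumulative
        if previous is None:
            return None  # assumption: first point has no baseline, emit nothing
        if cumulative < previous:
            return cumulative  # assumption: producer restarted, counter reset to 0
        return cumulative - previous

    def tracked_series(self) -> int:
        return len(self._last)


# One instance sees every point: the deltas are correct.
single = CumulativeToDelta()
print([single.convert("requests", v) for v in (10, 15, 23, 30)])
# [None, 5, 8, 7]

# Points alternate between two instances: the deltas overlap and overcount.
a, b = CumulativeToDelta(), CumulativeToDelta()
print([a.convert("requests", 10), b.convert("requests", 15),
       a.convert("requests", 23), b.convert("requests", 30)])
# [None, None, 13, 15]
```

In the second case the two instances report a total increase of 28 where the counter only grew by 20, which is why all metrics from a given producer need to reach the same collector instance.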

As a consequence, using the `cumulativetodelta` processor is recommended close to the edge (where metrics are produced),
and less recommended late in the data pipeline due to scalability challenges.