
Conversation

dricross
Contributor

@dricross dricross commented May 8, 2025

Description of the issue

The CloudWatch/PMD exporter currently drops all exponential histogram metrics.

Description of changes

  • Add support for exponential histograms to the ec2tagger processor
  • Add support for exponential histograms to the CloudWatch/PMD exporter

Note

See companion PR for updating cumulativetodelta processor: amazon-contributing/opentelemetry-collector-contrib#331

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

The current histogram test fails for cumulative histograms because the current agent behavior for cumulative histograms is incorrect and is being fixed in this PR. Specifically, the agent does not convert cumulative histograms to delta before pushing to CloudWatch. With these changes, it will. So the existing test fails, but the updated tests pass.
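At a high level, the cumulative-to-delta conversion amounts to subtracting the previous snapshot's bucket counts from the current ones. A minimal sketch (illustrative only; the real cumulativetodelta processor also handles counter resets, bucket offset changes, and scale changes):

```go
package main

import "fmt"

// deltaCounts converts a cumulative bucket snapshot into a delta by
// subtracting the previous snapshot element-wise. Assumes both slices have
// the same length and that cur[i] >= prev[i] (no reset occurred).
func deltaCounts(prev, cur []uint64) []uint64 {
	out := make([]uint64, len(cur))
	for i := range cur {
		out[i] = cur[i] - prev[i]
	}
	return out
}

func main() {
	// Matches the test data: the second bucket's count grows by 2 per interval.
	prev := []uint64{0, 4}
	cur := []uint64{0, 6}
	fmt.Println(deltaCounts(prev, cur)) // [0 2]
}
```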

Note

See companion PR for integration test: aws/amazon-cloudwatch-agent-test#558

Integration test run: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/16839298907
Histogram test: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/16839298907/job/47706510747

The data we are sending as part of the test is defined here: https://github.com/aws/amazon-cloudwatch-agent-test/blob/dricross/exphistograms/test/histograms/resources/otlp_metrics.json#L188, where CUMULATIVE_HIST_COUNT and CUMULATIVE_HIST_SUM are increasing by 2 every 10 seconds. Important things to note:

  • Scale=0
  • IndexOffset = 1
  • Buckets = [0, 2] (where 2 is growing by 2 every 10 seconds)
  • With Scale=0, Base = 2 (Base = 2^(2^(-scale)) = 2^(2^0) = 2^1)

Formula for mapping bucket index to value range: (Base^(index+offset), Base^(index+offset+1)]

The first bucket (with a count of 0) maps to range (2, 4] ((2^(0+1), 2^(0+1+1)] = (2, 4])
The second bucket (with a count of 2) maps to range (4, 8] ((2^(1+1), 2^(1+1+1)] = (4, 8])
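The boundary formula above can be sketched as follows (illustrative, not agent code):

```go
package main

import (
	"fmt"
	"math"
)

// bucketRange returns the (lower, upper] boundary of an OTLP exponential
// histogram bucket at the given absolute index (bucket index + offset).
// With scale s, base = 2^(2^-s); bucket i covers (base^i, base^(i+1)].
func bucketRange(scale, index int) (float64, float64) {
	base := math.Pow(2, math.Pow(2, float64(-scale)))
	return math.Pow(base, float64(index)), math.Pow(base, float64(index+1))
}

func main() {
	// Scale=0, IndexOffset=1: bucket 0 -> (2, 4], bucket 1 -> (4, 8]
	lo0, hi0 := bucketRange(0, 0+1)
	lo1, hi1 := bucketRange(0, 1+1)
	fmt.Println(lo0, hi0) // 2 4
	fmt.Println(lo1, hi1) // 4 8
}
```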

Since we take the midpoints of the buckets when sending exponential histograms to CloudWatch (consistent with the EMF exporter), the values/counts we send to CloudWatch are:

Values=[6, 3]
Counts=[2, 0]

That list of values and counts is sent 6 times per minute.
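The midpoint calculation can be sketched as follows (the order in which the values/counts arrays are emitted is an implementation detail of the exporter):

```go
package main

import (
	"fmt"
	"math"
)

// midpoint returns the arithmetic midpoint of the exponential histogram
// bucket (base^index, base^(index+1)] for the given scale, which is the
// representative value sent to CloudWatch for that bucket.
func midpoint(scale, index int) float64 {
	base := math.Pow(2, math.Pow(2, float64(-scale)))
	lower := math.Pow(base, float64(index))
	upper := math.Pow(base, float64(index+1))
	return (lower + upper) / 2
}

func main() {
	// Scale=0, IndexOffset=1, Buckets=[0, 2]:
	// midpoint of (2, 4] is 3 (count 0), midpoint of (4, 8] is 6 (count 2)
	fmt.Println(midpoint(0, 0+1)) // 3
	fmt.Println(midpoint(0, 1+1)) // 6
}
```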

CloudWatch metrics:
(screenshot of the CloudWatch metrics console omitted)

EMF Logs from test as verification (this is existing behavior but demonstrates the intended output)

{
    "AWS.ServiceNameSource": "Instrumentation",
    "CloudWatchMetrics": [
        {
            "Namespace": "CWAgent",
            "Dimensions": [
                [
                    "service.name",
                    "my.exponential.histogram.attr"
                ]
            ],
            "Metrics": [
                {
                    "Name": "my.delta.exponential.histogram",
                    "Unit": "",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "EC2.InstanceId": "i-04282f146a9432a17",
    "Environment": "ec2:default",
    "PlatformType": "AWS::EC2",
    "Service": "my.service",
    "Timestamp": "1753710063060",
    "Version": "0",
    "my.exponential.histogram.attr": "some value",
    "service.name": "my.service",
    "my.delta.exponential.histogram": {
        "Values": [
            6,
            0
        ],
        "Counts": [
            2,
            1
        ],
        "Max": 5,
        "Min": 0,
        "Count": 3,
        "Sum": 10
    }
}
{
    "AWS.ServiceNameSource": "Instrumentation",
    "CloudWatchMetrics": [
        {
            "Namespace": "CWAgent",
            "Dimensions": [
                [
                    "my.cumulative.exponential.histogram.attr",
                    "service.name"
                ]
            ],
            "Metrics": [
                {
                    "Name": "my.cumulative.exponential.histogram",
                    "Unit": "",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "EC2.InstanceId": "i-04282f146a9432a17",
    "Environment": "ec2:default",
    "PlatformType": "AWS::EC2",
    "Service": "my.service",
    "Timestamp": "1753710063060",
    "Version": "0",
    "my.cumulative.exponential.histogram.attr": "some value",
    "service.name": "my.service",
    "my.cumulative.exponential.histogram": {
        "Values": [
            6
        ],
        "Counts": [
            2
        ],
        "Max": 0,
        "Min": 0,
        "Count": 2,
        "Sum": 2
    }
}

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@dricross dricross requested a review from a team as a code owner May 8, 2025 20:55
@dricross dricross force-pushed the dricross/exponentialhistogram branch 2 times, most recently from 97c4271 to 995902e Compare May 16, 2025 21:08
Contributor

This PR was marked stale due to lack of activity.

@github-actions github-actions bot added the Stale label May 28, 2025
@github-actions github-actions bot removed the Stale label Jun 28, 2025
Contributor

github-actions bot commented Jul 7, 2025

This PR was marked stale due to lack of activity.

@@ -4,6 +4,10 @@ go 1.24.4

replace github.com/influxdata/telegraf => github.com/aws/telegraf v0.10.2-0.20250113150713-a2dfaa4cdf6d

replace collectd.org v0.4.0 => github.com/collectd/go-collectd v0.4.0
Contributor

are these dependencies of cumulativetodeltaprocessor? I don't see go-collectd or clock used in the new code

Contributor Author

I was hitting some issues downloading the collectd dependency after clearing my local Go cache, and I needed to redirect to GitHub to pick it up. I'm not exactly sure what happened, but it looks like collectd.org stopped vending the package via its vanity import path. The issue may have been resolved by now, though, so I can try again.

../../../.gvm/pkgsets/go1.22.7/global/pkg/mod/github.com/aws/[email protected]/plugins/parsers/collectd/parser.go:8:2: unrecognized import path "collectd.org": https fetch: Get "https://collectd.org/?go-get=1": dial tcp: lookup collectd.org on 10.4.4.10:53: read udp 10.169.109.191:52627->10.4.4.10:53: i/o timeout

Contributor

I think it's been resolved. I was able to pull vendor files from collectd.org directly the other day. Maybe their GitHub page (https://github.com/collectd/collectd.github.io) was down for a bit. Not opposed to this though.


func (d *ExpHistogramDistribution) Size() int {
size := len(d.negativeBuckets) + len(d.positiveBuckets)
if d.zeroCount > 0 {
Contributor

what is zeroCount? the number of datapoints with 0?

Contributor Author

Yes, pretty much. OTLP exponential histograms split the data into three sections: negative values, zero values, and positive values. Positive and negative buckets are defined separately by a series of buckets and counts. The zero values don't have any buckets, so it's just a counter stored in the histogram structure. Side note: the definition of "0" in OTLP exponential histograms is loose. Datapoints with a magnitude less than the configurable "zero threshold" are treated as 0.
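The three-way split described above can be sketched as follows (illustrative sketch, not the agent's code; the inclusive comparison against the zero threshold follows the OTLP data model's [-threshold, threshold] zero region):

```go
package main

import "fmt"

// section reports which part of an OTLP exponential histogram a value
// contributes to: values with |v| <= zeroThreshold count toward zeroCount,
// and all other values go to the positive or negative bucket sets.
func section(v, zeroThreshold float64) string {
	switch {
	case v > zeroThreshold:
		return "positive"
	case v < -zeroThreshold:
		return "negative"
	default:
		return "zero"
	}
}

func main() {
	fmt.Println(section(5, 0))         // positive
	fmt.Println(section(-3, 0))        // negative
	fmt.Println(section(0.001, 0.01))  // zero (within the zero threshold)
}
```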

}

func (d *ExpHistogramDistribution) Resize(_ int) []*ExpHistogramDistribution {
// TODO: split data points into separate PMD requests if the number of buckets exceeds the API limit
Contributor

what happens if we exceed the API limit?

Contributor Author

Based on the API documentation, I believe the PMD request will be rejected with a 400 InvalidParameterValue error, though I haven't tried it.
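If the splitting from the TODO were implemented, it could look roughly like the following (a hypothetical helper, not part of this PR; maxSize stands in for whatever per-datum value limit the API enforces):

```go
package main

import "fmt"

// datum holds one chunk of parallel values/counts slices, intended to
// become a single MetricDatum in a PutMetricData request.
type datum struct {
	Values []float64
	Counts []float64
}

// chunk splits parallel values/counts slices into groups of at most
// maxSize entries each, so no single datum exceeds the API limit.
// Assumes len(values) == len(counts).
func chunk(values, counts []float64, maxSize int) []datum {
	var out []datum
	for start := 0; start < len(values); start += maxSize {
		end := start + maxSize
		if end > len(values) {
			end = len(values)
		}
		out = append(out, datum{Values: values[start:end], Counts: counts[start:end]})
	}
	return out
}

func main() {
	v := []float64{1, 2, 3, 4, 5}
	c := []float64{1, 1, 2, 3, 5}
	fmt.Println(len(chunk(v, c, 2))) // 3
}
```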


// ValuesAndCounts outputs two arrays representing the midpoints of each exponential histogram bucket and the
// counter of datapoints within the corresponding exponential histogram buckets
func (d *ExpHistogramDistribution) ValuesAndCounts() ([]float64, []float64) {
Contributor

nit: we can make the name of the function more descriptive. Maybe GetMidpointsAndCounts?

Contributor Author

This naming scheme was based on the existing function in the Distribution interface. I think ValuesAndCounts is a fairly descriptive name, as that's what is actually pushed to CloudWatch in the PMD request (an array of values and an array of counts).

@dricross dricross force-pushed the dricross/exponentialhistogram branch from 5362dc8 to d71cfd4 Compare July 23, 2025 12:24
@@ -78,34 +77,6 @@ func setNewDistributionFunc(maxValuesPerDatumLimit int) {
}
}

func resize(dist distribution.Distribution, listMaxSize int) (distList []distribution.Distribution) {
Contributor Author

refactored as functions on each distribution

@dricross dricross force-pushed the dricross/exponentialhistogram branch 2 times, most recently from eb11f44 to f7f02fe Compare July 24, 2025 18:18
@dricross dricross added the ready for testing Indicates this PR is ready for integration tests to run label Jul 24, 2025
jefchien
jefchien previously approved these changes Aug 8, 2025

jefchien
jefchien previously approved these changes Aug 12, 2025
lisguo
lisguo previously approved these changes Aug 19, 2025
return values, counts
}

func (d *ExpHistogramDistribution) AddDistribution(other *ExpHistogramDistribution) {
Contributor

can probably remove this since the weights are all 1

expected: []int{0, 1, 2, 3, 4},
},
{
// for negative values, histogram buckets use lower-inclusive boundaries
Contributor

confusing considering the test case uses positive values
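For reference, the lower-inclusive convention the comment refers to mirrors the positive bucket boundaries across zero (illustrative sketch, not agent code):

```go
package main

import (
	"fmt"
	"math"
)

// negativeBucketRange returns the [lower, upper) boundary covered by a
// negative-value bucket: the mirror image of the positive bucket
// (base^i, base^(i+1)], so the boundaries become lower-inclusive.
func negativeBucketRange(scale, index int) (float64, float64) {
	base := math.Pow(2, math.Pow(2, float64(-scale)))
	return -math.Pow(base, float64(index+1)), -math.Pow(base, float64(index))
}

func main() {
	// Scale=0: positive bucket 1 is (2, 4], so negative bucket 1 is [-4, -2)
	lo, hi := negativeBucketRange(0, 1)
	fmt.Println(lo, hi) // -4 -2
}
```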

@dricross dricross dismissed stale reviews from lisguo and jefchien via b6013bc August 20, 2025 15:09
@dricross dricross force-pushed the dricross/exponentialhistogram branch from d2fc758 to b6013bc Compare August 20, 2025 15:09
@dricross dricross merged commit 02dc7b0 into main Aug 20, 2025
185 of 187 checks passed
@dricross dricross deleted the dricross/exponentialhistogram branch August 20, 2025 16:26