Skip to content

Conversation

jade-guiton-dd
Copy link
Contributor

@jade-guiton-dd jade-guiton-dd commented Aug 11, 2025

Description

This PR adds a new MapBuilder struct to the xpdata package, which can be used to more efficiently create a pcommon.Map in receivers.

The simplest way to insert data in a Map is to repeatedly call Map.PutEmpty, Map.PutStr, or other methods of the Put family. These methods all handle duplicate keys on a "last write wins" basis by calling Map.Get to find a potential existing value for the inserted key. Unfortunately, Get is rather slow, as it performs a linear scan of the Map's key/value pairs. This means that the repeated Put method has quadratic complexity, and has significant overhead even in cases with a realistic number of keys.

MapBuilder allows components to build up a list of key/value pairs, then turn them into a Map with very little overhead using the UnsafeIntoMap method. However, this has the caveat that, if used inappropriately, the final Map may contain duplicate keys, which breaks a basic invariant of the data structure and requirement of the OTLP protocol, with unpredictable consequences.

I created a benchmark to test this method, as well as a number of alternatives to optimize insertions. The benchmark was performed on realistic data containing duplicate keys, realistic data without duplicate keys, and worst-case data (200 keys, treated as if they may contain duplicate keys). Here are the basic results:

  • Using MapBuilder.UnsafeIntoMap is the fastest option in all three scenarios. In scenarios with duplicate keys, we perform an upfront sorting + neighbor-comparison deduplication step before calling the method.
    • Realistic data without duplicates: This results in -49% runtime compared to the Put version.
    • Realistic data with duplicates: Despite the extra work of the deduplication step, this still results in -11% runtime. This is the use case that I am interested in using.
  • Adding the data into a Go map and using the existing Map.FromRaw method, with code to reuse the map across iterations:
    • Worst-case data: Using FromRaw succeeds at reducing the worst-case complexity (-92% runtime compared to using Put), but manual sort-based deduplication + MapBuilder.UnsafeIntoMap still results in a further -56% runtime reduction.
    • Realistic data: Somewhat slower than Put (+3% runtime with duplicates / +29% runtime with no duplicates), and performs more memory allocations.
    • These numbers are significantly worse in the simpler (and more parallelizable) version where a new map is allocated each time.
  • The safer alternative XIntoMap methods, which check that the input keys are distinct in various ways, are all too slow to be useful in the "realistic, no duplicates" case, and are worse than UnsafeIntoMap in other cases. I would consider adding them only if UnsafeIntoMap is deemed too dangerous to even be in an experimental API like xpdata.
  • Changing pcommon.Map internals to use a sorted array or Go map representation in order to reduce the cost of Put/Get succeeds at reducing the worst-case time complexity as well, but is invariably slower than the Put version in the realistic scenarios.

Testing

I added a basic unit test for MapBuilder's functionality.

Copy link

codecov bot commented Aug 11, 2025

Codecov Report

❌ Patch coverage is 81.25000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.53%. Comparing base (3d3d087) to head (95e5ed4).

Files with missing lines Patch % Lines
pdata/xpdata/map_builder.go 81.25% 2 Missing and 1 partial ⚠️

❌ Your patch check has failed because the patch coverage (81.25%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13617      +/-   ##
==========================================
- Coverage   91.53%   91.53%   -0.01%     
==========================================
  Files         640      641       +1     
  Lines       42508    42524      +16     
==========================================
+ Hits        38911    38925      +14     
- Misses       2779     2780       +1     
- Partials      818      819       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bogdandrutu
Copy link
Member

Can you please also post the Benchmark results, easier to read (at least for me)

@jade-guiton-dd jade-guiton-dd marked this pull request as ready for review August 12, 2025 09:48
@jade-guiton-dd
Copy link
Contributor Author

jade-guiton-dd commented Aug 12, 2025

Can you please also post the Benchmark results, easier to read (at least for me)

Right, here is a spreadsheet with the results I got while benchmarking locally:

https://docs.google.com/spreadsheets/d/1iIy05mA17Jlm2B9JHDE7DvWXM6M7Rq-lnjcArs5XNls

@bogdandrutu
Copy link
Member

@jade-guiton-dd can you re-run the tests after #13634?

@jade-guiton-dd
Copy link
Contributor Author

jade-guiton-dd commented Aug 14, 2025

@bogdandrutu I added a second tab to the spreadsheet with the new benchmark results on my machine, for easy comparison.

The exact runtime numbers are slightly different of course, but the rankings are basically the same. Interestingly, the memory allocation numbers didn't change at all.

I suspect this might be because there actually wasn't any allocation to eliminate in the first place:

func newKeyValueString(k, v string) otlpcommon.KeyValue {
	orig := otlpcommon.KeyValue{Key: k}
	state := internal.StateMutable
	akv := newValue(&orig.Value, &state)
	akv.SetStr(v)
	return orig
}

The compiler's escape analysis likely determined that akv and state never escape the function (we only use them as a way to modify orig), so the use of &state doesn't trigger an allocation for state.

func (mb *MapBuilder) UnsafeIntoMap(m pcommon.Map) {
mb.state.AssertMutable()
internal.GetMapState(internal.Map(m)).AssertMutable()
mb.state = internal.StateReadOnly // to avoid modifying a Map later marked as ReadOnly through builder Values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this map if it is ReadOnly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative is to "move" the values into the returned Map, so then the builder becomes empty and the Map will be mutable.

Copy link
Contributor Author

@jade-guiton-dd jade-guiton-dd Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Values returned by AppendEmpty use the state of the MapBuilder. But the Map we return has its own state, which may later be changed to ReadOnly. So to avoid indirectly modifying a ReadOnly Map through a Value returned by the MapBuilder, we mark the MapBuilder itself as ReadOnly once we're done with it. But we don't modify the state of the Map, it stays mutable through this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about it some more, I'm pretty sure that even with these precautions it's still possible to bypass read-only protections.

"Exploit" details
  1. create a MapBuilder
  2. use AppendEmpty to get a mutable Value
  3. use UnsafeIntoMap to transfer into a Map; this makes your Value read-only
  4. resulting Map is later marked as read-only as well using ptrace.Traces.MarkReadOnly() or equivalents for other signals
  5. overwrite the MapBuilder in-place with a zero-initialized one to reset its state; your Value is now mutable again, and points into the read-only Map.

But I noticed that the Map.MoveTo method doesn't handle that edge case either: even if the target Map is later marked as read-only, you can still modify its entries using Values previously obtained from the source Map.

So I won't bother trying to bullet-proof it and just do what you suggested, ie. UnsafeIntoMap will make the builder empty, but it won't turn it ReadOnly.

Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not opposed to this experimenting API, I would love to understand where do you plan to use it.

@jade-guiton-dd
Copy link
Contributor Author

I would love to understand where do you plan to use it.

It's an internal project so I can't share it, but the use case is basically exactly what's in the benchmark, ie. a custom receiver where we need to be able to handle duplicate attribute keys on the input efficiently.

@mx-psi mx-psi added this pull request to the merge queue Aug 26, 2025
github-merge-queue bot pushed a commit that referenced this pull request Aug 26, 2025
#### Description

This PR adds a new `MapBuilder` struct to the `xpdata` package, which
can be used to more efficiently create a `pcommon.Map` in receivers.

The simplest way to insert data in a `Map` is to repeatedly call
`Map.PutEmpty`, `Map.PutStr`, or other methods of the `Put` family.
These methods all handle duplicate keys on a "last write wins" basis by
calling `Map.Get` to find a potential existing value for the inserted
key. Unfortunately, `Get` is rather slow, as it performs a linear scan
of the `Map`'s key/value pairs. This means that the repeated `Put`
method has quadratic complexity, and has significant overhead even in
cases with a realistic number of keys.

`MapBuilder` allows components to build up a list of key/value pairs,
then turn them into a `Map` with very little overhead using the
`UnsafeIntoMap` method. However, this has the caveat that, if used
inappropriately, the final `Map` may contain duplicate keys, which
breaks a basic invariant of the data structure and requirement of the
OTLP protocol, with unpredictable consequences.

I created [a
benchmark](https://github.com/jade-guiton-dd/opentelemetry-collector/blob/549e107e852d16f299f4199816a6c20eb5a5ea32/pdata/pcommon/map_experiments_test.go#L75)
to test this method, as well as a number of alternatives to optimize
insertions. The benchmark was performed on realistic data containing
duplicate keys, realistic data without duplicate keys, and worst-case
data (200 keys, treated as if they may contain duplicate keys). Here are
the basic results:
- Using `MapBuilder.UnsafeIntoMap` is the fastest option in all three
scenarios. In scenarios with duplicate keys, we perform an upfront
sorting + neighbor-comparison deduplication step before calling the
method.
- Realistic data without duplicates: This results in **-49% runtime**
compared to the `Put` version.
- Realistic data with duplicates: Despite the extra work of the
deduplication step, this still results in -11% runtime. This is the use
case that I am interested in using.
- Adding the data into a Go `map` and using the existing `Map.FromRaw`
method, with code to reuse the `map` across iterations:
- Worst-case data: Using `FromRaw` succeeds at reducing the worst-case
complexity (-92% runtime compared to using `Put`), but manual sort-based
deduplication + `MapBuilder.UnsafeIntoMap` still results in a further
-56% runtime reduction.
- Realistic data: Somewhat slower than `Put` (+3% runtime with
duplicates / +29% runtime with no duplicates), and performs more memory
allocations.
- These numbers are significantly worse in the simpler (and more
parallelizable) version where a new `map` is allocated each time.
- The safer alternative `XIntoMap` methods, which check that the input
keys are distinct in various ways, are all too slow to be useful in the
"realistic, no duplicates" case, and are worse than `UnsafeIntoMap` in
other cases. I would consider adding them only if `UnsafeIntoMap` is
deemed too dangerous to even be in an experimental API like `xpdata`.
- Changing `pcommon.Map` internals to use a sorted array or Go `map`
representation in order to reduce the cost of `Put`/`Get` succeeds at
reducing the worst-case time complexity as well, but is invariably
slower than the `Put` version in the realistic scenarios.

#### Testing

I added a basic unit test for `MapBuilder`'s functionality.
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 26, 2025
@mx-psi mx-psi added this pull request to the merge queue Aug 26, 2025
github-merge-queue bot pushed a commit that referenced this pull request Aug 26, 2025
#### Description

This PR adds a new `MapBuilder` struct to the `xpdata` package, which
can be used to more efficiently create a `pcommon.Map` in receivers.

The simplest way to insert data in a `Map` is to repeatedly call
`Map.PutEmpty`, `Map.PutStr`, or other methods of the `Put` family.
These methods all handle duplicate keys on a "last write wins" basis by
calling `Map.Get` to find a potential existing value for the inserted
key. Unfortunately, `Get` is rather slow, as it performs a linear scan
of the `Map`'s key/value pairs. This means that the repeated `Put`
method has quadratic complexity, and has significant overhead even in
cases with a realistic number of keys.

`MapBuilder` allows components to build up a list of key/value pairs,
then turn them into a `Map` with very little overhead using the
`UnsafeIntoMap` method. However, this has the caveat that, if used
inappropriately, the final `Map` may contain duplicate keys, which
breaks a basic invariant of the data structure and requirement of the
OTLP protocol, with unpredictable consequences.

I created [a
benchmark](https://github.com/jade-guiton-dd/opentelemetry-collector/blob/549e107e852d16f299f4199816a6c20eb5a5ea32/pdata/pcommon/map_experiments_test.go#L75)
to test this method, as well as a number of alternatives to optimize
insertions. The benchmark was performed on realistic data containing
duplicate keys, realistic data without duplicate keys, and worst-case
data (200 keys, treated as if they may contain duplicate keys). Here are
the basic results:
- Using `MapBuilder.UnsafeIntoMap` is the fastest option in all three
scenarios. In scenarios with duplicate keys, we perform an upfront
sorting + neighbor-comparison deduplication step before calling the
method.
- Realistic data without duplicates: This results in **-49% runtime**
compared to the `Put` version.
- Realistic data with duplicates: Despite the extra work of the
deduplication step, this still results in -11% runtime. This is the use
case that I am interested in using.
- Adding the data into a Go `map` and using the existing `Map.FromRaw`
method, with code to reuse the `map` across iterations:
- Worst-case data: Using `FromRaw` succeeds at reducing the worst-case
complexity (-92% runtime compared to using `Put`), but manual sort-based
deduplication + `MapBuilder.UnsafeIntoMap` still results in a further
-56% runtime reduction.
- Realistic data: Somewhat slower than `Put` (+3% runtime with
duplicates / +29% runtime with no duplicates), and performs more memory
allocations.
- These numbers are significantly worse in the simpler (and more
parallelizable) version where a new `map` is allocated each time.
- The safer alternative `XIntoMap` methods, which check that the input
keys are distinct in various ways, are all too slow to be useful in the
"realistic, no duplicates" case, and are worse than `UnsafeIntoMap` in
other cases. I would consider adding them only if `UnsafeIntoMap` is
deemed too dangerous to even be in an experimental API like `xpdata`.
- Changing `pcommon.Map` internals to use a sorted array or Go `map`
representation in order to reduce the cost of `Put`/`Get` succeeds at
reducing the worst-case time complexity as well, but is invariably
slower than the `Put` version in the realistic scenarios.

#### Testing

I added a basic unit test for `MapBuilder`'s functionality.
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 26, 2025
@github-actions github-actions bot requested a review from bogdandrutu August 26, 2025 10:21
@jade-guiton-dd
Copy link
Contributor Author

@bogdandrutu Could you take a second look at the PR? I'm not sure if my handling of internal.State in MapBuilder works well with what you intend to do with reference-counting / pipeline ownership.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants