-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Add xpdata.MapBuilder struct #13617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add xpdata.MapBuilder struct #13617
Conversation
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (81.25%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #13617 +/- ##
==========================================
- Coverage 91.53% 91.53% -0.01%
==========================================
Files 640 641 +1
Lines 42508 42524 +16
==========================================
+ Hits 38911 38925 +14
- Misses 2779 2780 +1
- Partials 818 819 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Can you please also post the Benchmark results, easier to read (at least for me) |
Right, here is a spreadsheet with the results I got while benchmarking locally: https://docs.google.com/spreadsheets/d/1iIy05mA17Jlm2B9JHDE7DvWXM6M7Rq-lnjcArs5XNls |
@jade-guiton-dd can you re-run the tests after #13634? |
@bogdandrutu I added a second tab to the spreadsheet with the new benchmark results on my machine, for easy comparison. The exact runtime numbers are slightly different of course, but the rankings are basically the same. Interestingly, the memory allocation numbers didn't change at all. I suspect this might be because there actually wasn't any allocation to eliminate in the first place: func newKeyValueString(k, v string) otlpcommon.KeyValue {
orig := otlpcommon.KeyValue{Key: k}
state := internal.StateMutable
akv := newValue(&orig.Value, &state)
akv.SetStr(v)
return orig
} The compiler's escape analysis likely determined that |
pdata/xpdata/map_builder.go
Outdated
func (mb *MapBuilder) UnsafeIntoMap(m pcommon.Map) { | ||
mb.state.AssertMutable() | ||
internal.GetMapState(internal.Map(m)).AssertMutable() | ||
mb.state = internal.StateReadOnly // to avoid modifying a Map later marked as ReadOnly through builder Values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this map if it is ReadOnly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternative is to "move" the values into the returned Map, so then the builder becomes empty and the Map will be mutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Value
s returned by AppendEmpty
use the state
of the MapBuilder
. But the Map
we return has its own state
, which may later be changed to ReadOnly
. So to avoid indirectly modifying a ReadOnly
Map
through a Value
returned by the MapBuilder
, we mark the MapBuilder
itself as ReadOnly
once we're done with it. But we don't modify the state of the Map
, it stays mutable through this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about it some more, I'm pretty sure that even with these precautions it's still possible to bypass read-only protections.
"Exploit" details
- create a MapBuilder
- use AppendEmpty to get a mutable Value
- use UnsafeIntoMap to transfer into a Map; this makes your Value read-only
- resulting Map is later marked as read-only as well using
ptrace.Traces.MarkReadOnly()
or equivalents for other signals - overwrite the MapBuilder in-place with a zero-initialized one to reset its state; your Value is now mutable again, and points into the read-only Map.
But I noticed that the Map.MoveTo
method doesn't handle that edge case either: even if the target Map
is later marked as read-only, you can still modify its entries using Value
s previously obtained from the source Map
.
So I won't bother trying to bullet-proof it and just do what you suggested, ie. UnsafeIntoMap
will make the builder empty, but it won't turn it ReadOnly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not opposed to this experimenting API, I would love to understand where do you plan to use it.
It's an internal project so I can't share it, but the use case is basically exactly what's in the benchmark, ie. a custom receiver where we need to be able to handle duplicate attribute keys on the input efficiently. |
#### Description This PR adds a new `MapBuilder` struct to the `xpdata` package, which can be used to more efficiently create a `pcommon.Map` in receivers. The simplest way to insert data in a `Map` is to repeatedly call `Map.PutEmpty`, `Map.PutStr`, or other methods of the `Put` family. These methods all handle duplicate keys on a "last write wins" basis by calling `Map.Get` to find a potential existing value for the inserted key. Unfortunately, `Get` is rather slow, as it performs a linear scan of the `Map`'s key/value pairs. This means that the repeated `Put` method has quadratic complexity, and has significant overhead even in cases with a realistic number of keys. `MapBuilder` allows components to build up a list of key/value pairs, then turn them into a `Map` with very little overhead using the `UnsafeIntoMap` method. However, this has the caveat that, if used inappropriately, the final `Map` may contain duplicate keys, which breaks a basic invariant of the data structure and requirement of the OTLP protocol, with unpredictable consequences. I created [a benchmark](https://github.com/jade-guiton-dd/opentelemetry-collector/blob/549e107e852d16f299f4199816a6c20eb5a5ea32/pdata/pcommon/map_experiments_test.go#L75) to test this method, as well as a number of alternatives to optimize insertions. The benchmark was performed on realistic data containing duplicate keys, realistic data without duplicate keys, and worst-case data (200 keys, treated as if they may contain duplicate keys). Here are the basic results: - Using `MapBuilder.UnsafeIntoMap` is the fastest option in all three scenarios. In scenarios with duplicate keys, we perform an upfront sorting + neighbor-comparison deduplication step before calling the method. - Realistic data without duplicates: This results in **-49% runtime** compared to the `Put` version. - Realistic data with duplicates: Despite the extra work of the deduplication step, this still results in -11% runtime. This is the use case that I am interested in using. - Adding the data into a Go `map` and using the existing `Map.FromRaw` method, with code to reuse the `map` across iterations: - Worst-case data: Using `FromRaw` succeeds at reducing the worst-case complexity (-92% runtime compared to using `Put`), but manual sort-based deduplication + `MapBuilder.UnsafeIntoMap` still results in a further -56% runtime reduction. - Realistic data: Somewhat slower than `Put` (+3% runtime with duplicates / +29% runtime with no duplicates), and performs more memory allocations. - These numbers are significantly worse in the simpler (and more parallelizable) version where a new `map` is allocated each time. - The safer alternative `XIntoMap` methods, which check that the input keys are distinct in various ways, are all too slow to be useful in the "realistic, no duplicates" case, and are worse than `UnsafeIntoMap` in other cases. I would consider adding them only if `UnsafeIntoMap` is deemed too dangerous to even be in an experimental API like `xpdata`. - Changing `pcommon.Map` internals to use a sorted array or Go `map` representation in order to reduce the cost of `Put`/`Get` succeeds at reducing the worst-case time complexity as well, but is invariably slower than the `Put` version in the realistic scenarios. #### Testing I added a basic unit test for `MapBuilder`'s functionality.
#### Description This PR adds a new `MapBuilder` struct to the `xpdata` package, which can be used to more efficiently create a `pcommon.Map` in receivers. The simplest way to insert data in a `Map` is to repeatedly call `Map.PutEmpty`, `Map.PutStr`, or other methods of the `Put` family. These methods all handle duplicate keys on a "last write wins" basis by calling `Map.Get` to find a potential existing value for the inserted key. Unfortunately, `Get` is rather slow, as it performs a linear scan of the `Map`'s key/value pairs. This means that the repeated `Put` method has quadratic complexity, and has significant overhead even in cases with a realistic number of keys. `MapBuilder` allows components to build up a list of key/value pairs, then turn them into a `Map` with very little overhead using the `UnsafeIntoMap` method. However, this has the caveat that, if used inappropriately, the final `Map` may contain duplicate keys, which breaks a basic invariant of the data structure and requirement of the OTLP protocol, with unpredictable consequences. I created [a benchmark](https://github.com/jade-guiton-dd/opentelemetry-collector/blob/549e107e852d16f299f4199816a6c20eb5a5ea32/pdata/pcommon/map_experiments_test.go#L75) to test this method, as well as a number of alternatives to optimize insertions. The benchmark was performed on realistic data containing duplicate keys, realistic data without duplicate keys, and worst-case data (200 keys, treated as if they may contain duplicate keys). Here are the basic results: - Using `MapBuilder.UnsafeIntoMap` is the fastest option in all three scenarios. In scenarios with duplicate keys, we perform an upfront sorting + neighbor-comparison deduplication step before calling the method. - Realistic data without duplicates: This results in **-49% runtime** compared to the `Put` version. - Realistic data with duplicates: Despite the extra work of the deduplication step, this still results in -11% runtime. This is the use case that I am interested in using. - Adding the data into a Go `map` and using the existing `Map.FromRaw` method, with code to reuse the `map` across iterations: - Worst-case data: Using `FromRaw` succeeds at reducing the worst-case complexity (-92% runtime compared to using `Put`), but manual sort-based deduplication + `MapBuilder.UnsafeIntoMap` still results in a further -56% runtime reduction. - Realistic data: Somewhat slower than `Put` (+3% runtime with duplicates / +29% runtime with no duplicates), and performs more memory allocations. - These numbers are significantly worse in the simpler (and more parallelizable) version where a new `map` is allocated each time. - The safer alternative `XIntoMap` methods, which check that the input keys are distinct in various ways, are all too slow to be useful in the "realistic, no duplicates" case, and are worse than `UnsafeIntoMap` in other cases. I would consider adding them only if `UnsafeIntoMap` is deemed too dangerous to even be in an experimental API like `xpdata`. - Changing `pcommon.Map` internals to use a sorted array or Go `map` representation in order to reduce the cost of `Put`/`Get` succeeds at reducing the worst-case time complexity as well, but is invariably slower than the `Put` version in the realistic scenarios. #### Testing I added a basic unit test for `MapBuilder`'s functionality.
@bogdandrutu Could you take a second look at the PR? I'm not sure if my handling of |
Description
This PR adds a new
MapBuilder
struct to thexpdata
package, which can be used to more efficiently create apcommon.Map
in receivers.The simplest way to insert data in a
Map
is to repeatedly callMap.PutEmpty
,Map.PutStr
, or other methods of thePut
family. These methods all handle duplicate keys on a "last write wins" basis by callingMap.Get
to find a potential existing value for the inserted key. Unfortunately,Get
is rather slow, as it performs a linear scan of theMap
's key/value pairs. This means that the repeatedPut
method has quadratic complexity, and has significant overhead even in cases with a realistic number of keys.MapBuilder
allows components to build up a list of key/value pairs, then turn them into aMap
with very little overhead using theUnsafeIntoMap
method. However, this has the caveat that, if used inappropriately, the finalMap
may contain duplicate keys, which breaks a basic invariant of the data structure and requirement of the OTLP protocol, with unpredictable consequences.I created a benchmark to test this method, as well as a number of alternatives to optimize insertions. The benchmark was performed on realistic data containing duplicate keys, realistic data without duplicate keys, and worst-case data (200 keys, treated as if they may contain duplicate keys). Here are the basic results:
MapBuilder.UnsafeIntoMap
is the fastest option in all three scenarios. In scenarios with duplicate keys, we perform an upfront sorting + neighbor-comparison deduplication step before calling the method.Put
version.map
and using the existingMap.FromRaw
method, with code to reuse themap
across iterations:FromRaw
succeeds at reducing the worst-case complexity (-92% runtime compared to usingPut
), but manual sort-based deduplication +MapBuilder.UnsafeIntoMap
still results in a further -56% runtime reduction.Put
(+3% runtime with duplicates / +29% runtime with no duplicates), and performs more memory allocations.map
is allocated each time.XIntoMap
methods, which check that the input keys are distinct in various ways, are all too slow to be useful in the "realistic, no duplicates" case, and are worse thanUnsafeIntoMap
in other cases. I would consider adding them only ifUnsafeIntoMap
is deemed too dangerous to even be in an experimental API likexpdata
.pcommon.Map
internals to use a sorted array or Gomap
representation in order to reduce the cost ofPut
/Get
succeeds at reducing the worst-case time complexity as well, but is invariably slower than thePut
version in the realistic scenarios.Testing
I added a basic unit test for
MapBuilder
's functionality.