-
Notifications
You must be signed in to change notification settings - Fork 3.5k
pq: activate stringref extension for more-compact PQ representation #17849
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
When serializing a non-primitive value, CBOR encodes a two-element tuple containing the class name and the class-specific serialized value, which results in a significant amount of overhead in the form of frequently- repeated strings. Jackson CBOR supports the stringref extension, which allows it to avoid repeating the actual bytes of a string, and instead keeps track of the strings it has encountered and _referencing_ those strings by the index in which they occur. For example, the first `org.jruby.RubyString` looks like: ~~~ 74 # text(20) 6f72672e6a727562792e52756279537472696e67 # "org.jruby.RubyString" ~~~ While each subsequent string looks like: ~~~ d8 19 # tag(25) 05 # unsigned(5) ~~~ Enabling this extension allows us to save: - ~18 bytes from each `org.jruby.RubyString` - ~23 bytes from each `org.logstash.ConvertedMap` - ~24 bytes from each `org.logstash.ConvertedList` - ...etc. The CBOR implementation in Jackson _appears_ to support reading stringrefs regardless of whether this feature is enabled for serializing, which means that this change is not a rollback-barrier.
🤖 GitHub commentsExpand to view the GitHub comments
Just comment with:
|
This pull request does not have a backport label. Could you fix it @yaauie? 🙏
|
…nsion With the CBOR stringref extension enabled, we add a 3-byte overhead to each event to activate the extension, and eliminate 24 bytes of overhead for each event's secondary instances of `org.logstash.ConvertedMap`. Since the events under test have exactly two instances of `org.logstash.ConvertedMap`, this is a net reduction of 21 bytes of overhead. This changes the specifically-constructed events to have the intended lengths to test their specific edge-cases.
7647be4
to
44692e9
Compare
I have manually validated that a stringref-enabled event serialized with this patch can be deserialized in unpatched logstash. I plan to add tests with fixture data for the releasable branches that we are not backporting to, to ensure that this change is not a rollback barrier. |
|
💛 Build succeeded, but was flaky
Failed CI StepsHistory
|
Release notes
What does this PR do?
When serializing a non-primitive value, CBOR encodes a two-element tuple containing the class name and the class-specific serialized value, which results in a significant amount of overhead in the form of frequently-repeated strings.
Jackson CBOR supports the stringref extension, which allows it to avoid repeating the actual bytes of a string, and instead keeps track of the strings it has encountered and referencing those strings by the index in which they occur.
For example, the first
org.jruby.RubyString
looks like:While each subsequent string looks like:
Enabling this extension allows us to save:
org.jruby.RubyString
org.logstash.ConvertedMap
org.logstash.ConvertedList
Practical example: a 9183-byte complex JSON that contains an
event.original
, consumed through the stdin input with json codec (adding fields like@timestamp
,@version
per normal) resulted in a 22% reduction in serialized size:The CBOR implementation in Jackson appears to support reading stringrefs regardless of whether this feature is enabled for serializing, which means that this change is not a rollback-barrier.
Why is it important/What is the impact to the user?
Reduces size-on-disk for PQ, enabling more events to fit on each page and reducing disk IO
Checklist
[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files (and/or docker env variables)[ ] I have added tests that prove my fix is effective or that my feature worksAuthor's Checklist
How to test this PR locally
Related issues
Use cases
Screenshots
Logs