pq: activate stringref extension for more-compact PQ representation #17849

Draft: yaauie wants to merge 3 commits into main

@yaauie yaauie commented Jul 22, 2025

Release notes

  • Persisted Queue: improved serialization to be more compact by default

What does this PR do?

When serializing a non-primitive value, the CBOR serializer encodes a two-element tuple containing the class name and the class-specific serialized value. Because the same class names recur throughout an event, this adds significant overhead in the form of frequently repeated strings.

Jackson CBOR supports the stringref extension, which avoids repeating the actual bytes of a string: the encoder keeps track of the strings it has emitted and references each repeat by the index at which the string first occurred.

For example, the first org.jruby.RubyString looks like:

74                                            # text(20)
   6f72672e6a727562792e52756279537472696e67   #   "org.jruby.RubyString"

Each subsequent occurrence is replaced by a reference to the string's index (5 in this example):

d8 19                                         # tag(25)
   05                                         #   unsigned(5)
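
The two encodings above can be reproduced with standard-library Python. This is a minimal sketch of the CBOR head encoding only, not Jackson's implementation; the helper names `cbor_head`, `text`, and `stringref` are illustrative and do not come from this PR.

```python
import struct


def cbor_head(major: int, value: int) -> bytes:
    """Encode a CBOR initial byte plus its argument (major type 0..7)."""
    if value < 24:
        # Small arguments are packed directly into the initial byte.
        return bytes([(major << 5) | value])
    for extra, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        if value < (1 << (8 * struct.calcsize(fmt))):
            return bytes([(major << 5) | extra]) + struct.pack(fmt, value)
    raise ValueError("value too large for a CBOR head")


def text(s: str) -> bytes:
    """Major type 3: a UTF-8 text string written out in full."""
    data = s.encode("utf-8")
    return cbor_head(3, len(data)) + data


def stringref(index: int) -> bytes:
    """Major type 6, tag 25, wrapping the unsigned index of an earlier string."""
    return cbor_head(6, 25) + cbor_head(0, index)


first = text("org.jruby.RubyString")
later = stringref(5)
print(first.hex())  # starts with 74, followed by the 20 bytes of the name
print(later.hex())  # d81905
```

The first form costs 21 bytes, the reference costs 3.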

Enabling this extension allows us to save:

  • ~18 bytes from each secondary org.jruby.RubyString
  • ~23 bytes from each secondary org.logstash.ConvertedMap
  • ~24 bytes from each secondary org.logstash.ConvertedList
  • ...etc.
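
As a rough check of these figures, the saving per repeat is the length of the inline text string minus the length of the reference. The sketch below assumes plain CBOR head sizes and hypothetical helper names; the exact saving depends on how large the referenced index is, which is why the numbers above are approximate.

```python
def head_len(value: int) -> int:
    """Number of bytes a CBOR head needs for the given argument."""
    if value < 24:
        return 1
    if value < 2**8:
        return 2
    if value < 2**16:
        return 3
    if value < 2**32:
        return 5
    return 9


def inline_len(name: str) -> int:
    """Bytes used when the class name is written out as a text string."""
    n = len(name.encode("utf-8"))
    return head_len(n) + n


def ref_len(index: int) -> int:
    """Bytes used by tag(25) plus the unsigned index it wraps."""
    return head_len(25) + head_len(index)


def saving(name: str, index: int) -> int:
    return inline_len(name) - ref_len(index)


for name in ("org.jruby.RubyString",
             "org.logstash.ConvertedMap",
             "org.logstash.ConvertedList"):
    # Index 5 fits in the tag's follow-up byte; index 24 needs one more byte.
    print(name, saving(name, 5), saving(name, 24))
```

With an index below 24 this gives 18, 24, and 25 bytes respectively; with an index of 24 or more, each saving drops by one byte.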

Practical example: a complex 9183-byte JSON document containing an event.original field, consumed through the stdin input with the json codec (which adds fields like @timestamp and @version as usual), serialized 22% smaller with this patch:

| CBOR unpatched | CBOR patched |
|----------------|--------------|
| 11218 bytes    | 8728 bytes   |
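
The 22% figure follows directly from the two measured sizes; a quick check, with variable names chosen here for illustration:

```python
unpatched = 11218    # serialized bytes without stringref
patched = 8728       # serialized bytes with stringref
source_json = 9183   # size of the original JSON input

reduction = (unpatched - patched) / unpatched
print(f"{reduction:.1%}")  # 22.2%

# Size of each serialized form relative to the source JSON.
print(unpatched / source_json, patched / source_json)
```

Note that the unpatched form is larger than the source JSON, while the patched form is smaller.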

The CBOR implementation in Jackson appears to support reading stringrefs regardless of whether this feature is enabled for serializing, which means that this change is not a rollback-barrier.
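
To illustrate what a reader has to do to resolve references, here is a minimal stdlib-only decoder sketch. It is an assumption-laden illustration of the wire format (tag 256 opens a string table, tag 25 looks up an index), not Jackson's parser; it handles only unsigned integers, strings, arrays, and tags, and all names are hypothetical.

```python
def min_tabled_len(next_index: int) -> int:
    """Shortest string that gets a table slot, given the index it would take."""
    if next_index < 24:
        return 3
    if next_index < 256:
        return 4
    if next_index < 65536:
        return 5
    if next_index < 2**32:
        return 7
    return 11


def decode(buf: bytes):
    value, pos = _item(buf, 0, None)
    if pos != len(buf):
        raise ValueError("trailing bytes")
    return value


def _item(buf: bytes, pos: int, table):
    major, info = buf[pos] >> 5, buf[pos] & 0x1F
    pos += 1
    if info < 24:
        arg = info
    elif info <= 27:
        size = 1 << (info - 24)  # 1, 2, 4, or 8 argument bytes
        arg = int.from_bytes(buf[pos:pos + size], "big")
        pos += size
    else:
        raise ValueError("indefinite lengths are not handled in this sketch")

    if major == 0:  # unsigned integer
        return arg, pos
    if major in (2, 3):  # byte string or text string
        raw = buf[pos:pos + arg]
        pos += arg
        value = raw if major == 2 else raw.decode("utf-8")
        # Only strings long enough to benefit from a reference are tabled.
        if table is not None and arg >= min_tabled_len(len(table)):
            table.append(value)
        return value, pos
    if major == 4:  # array
        items = []
        for _ in range(arg):
            value, pos = _item(buf, pos, table)
            items.append(value)
        return items, pos
    if major == 6:  # tag
        if arg == 256:  # stringref namespace: fresh table for the tagged item
            return _item(buf, pos, [])
        if arg == 25:  # stringref: the tagged item is an index into the table
            if table is None:
                raise ValueError("stringref outside a namespace")
            index, pos = _item(buf, pos, table)
            return table[index], pos
        return _item(buf, pos, table)  # other tags are passed through
    raise ValueError("major type not handled in this sketch")


payload = (bytes([0xD9, 0x01, 0x00])            # tag(256)
           + bytes([0x83])                      # array(3)
           + bytes([0x74]) + b"org.jruby.RubyString"
           + bytes([0x62]) + b"ab"              # too short to be tabled
           + bytes([0xD8, 0x19, 0x00]))         # tag(25) -> index 0
print(decode(payload))
```

The decoder needs no configuration to honour the tags, which matches the observation that reading works whether or not the writer-side feature is enabled.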

Why is it important/What is the impact to the user?

Reduces size-on-disk for PQ, enabling more events to fit on each page and reducing disk IO

Checklist

  • My code follows the style guidelines of this project
  • ~~[ ] I have commented my code, particularly in hard-to-understand areas~~
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files (and/or docker env variables)
  • [ ] I have added tests that prove my fix is effective or that my feature works

Author's Checklist

  • ensure stringref events can be read by older Logstash versions that do not have the feature enabled (if not, this feature will need to be opt-in for at least two minor releases of Logstash before it becomes on-by-default)

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

mergify bot commented Jul 22, 2025

This pull request does not have a backport label. Could you fix it @yaauie? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • If no backport is necessary, please add the backport-skip label

…nsion

With the CBOR stringref extension enabled, we add a 3-byte overhead to each
event to activate the extension, and eliminate 24 bytes of overhead for each
event's secondary instances of `org.logstash.ConvertedMap`. Since the events
under test have exactly two instances of `org.logstash.ConvertedMap`, this
is a net reduction of 21 bytes of overhead.

This changes the specifically-constructed events to have the intended lengths
to test their specific edge-cases.
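
The net change described in this commit message can be written as a small calculation. This is a sketch under the figures stated above (3 bytes to activate, 24 bytes saved per secondary `org.logstash.ConvertedMap`); it ignores any other repeated strings, and the names are illustrative.

```python
ACTIVATION_BYTES = 3     # per-event cost; tag(256) encodes as d9 01 00
SAVING_PER_REPEAT = 24   # per secondary org.logstash.ConvertedMap


def net_change(converted_map_instances: int) -> int:
    """Change in serialized size (negative means smaller)."""
    repeats = max(converted_map_instances - 1, 0)
    return ACTIVATION_BYTES - SAVING_PER_REPEAT * repeats


print(net_change(2))  # -21, the case covered by the adjusted tests
print(net_change(1))  # 3, activation cost with nothing to reference
```

An event with a single `org.logstash.ConvertedMap` and no other repeated strings therefore grows by 3 bytes, while each additional instance more than pays for the activation.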
yaauie force-pushed the pq-activate-cbor-stringref-extension branch from 7647be4 to 44692e9 on July 22, 2025 23:37
yaauie commented Jul 24, 2025

I have manually validated that a stringref-enabled event serialized with this patch can be deserialized in unpatched logstash. I plan to add tests with fixture data for the releasable branches that we are not backporting to, to ensure that this change is not a rollback barrier.

@elasticmachine

💛 Build succeeded, but was flaky
