Fix binary data serialization in dynamic instrumentation by p-datadog · Pull Request #5434 · DataDog/dd-trace-rb

p-datadog · 2026-03-10T18:33:48Z

What does this PR do?

Fixes binary data serialization in dynamic instrumentation snapshots. Binary strings (ASCII-8BIT encoding) and invalid UTF-8 strings are now escaped to a JSON-safe format, preventing JSON encoding failures when snapshots contain binary data. Also fixes exception handling for custom serializer conditions.

Motivation:

Binary data in DI snapshots causes JSON serialization to fail. When dynamic instrumentation captures variables containing binary strings (database records, protocol buffers, network payloads, etc.), the snapshot serialization fails during JSON encoding, preventing the snapshot from being sent to the backend. This implementation escapes binary data to b'\xHH' format (matches dd-trace-py output), handles invalid UTF-8 strings, applies truncation efficiently (limit original bytes, not escaped output), and preserves valid text in other encodings (Latin1, Windows-1252).

Additionally, strings with invalid UTF-8 encoding were causing exceptions during serialization, and custom serializer conditions using regex could fail on invalid UTF-8 strings.

Change log entry

Yes. Dynamic Instrumentation: Fix JSON serialization failures when snapshots contain binary data and invalid UTF-8 strings.

Additional Notes:

Implementation escapes binary bytes to b'\x80\xff' format with printable ASCII preserved, special characters (\n, \t, \r, \\, \') escaped, and other bytes as \xHH hex escapes. Truncation is applied to original binary data before escaping (e.g., 300-byte binary with limit=255 → truncate to 255 bytes, escape → 1023 char output, size=300). Output format matches dd-trace-py for consistency across language tracers. Escaping overhead is 2-6x (vs 40x for Java's approach).

Fixes:

Invalid UTF-8 strings now use .bytesize and .byteslice instead of .length and [0...max] to avoid encoding exceptions
Custom serializer conditions that raise exceptions (e.g., regex match against invalid UTF-8) are logged and skipped to prevent breaking the entire serialization process
Added documentation in docs/DynamicInstrumentation.md explaining custom serializer exception handling behavior

Added comprehensive tests covering all byte values, truncation behavior, invalid UTF-8, valid non-UTF8 encodings, edge cases, custom serializer exception handling, and integration through transport layer.

How to test the change?

Run all binary data related tests:

bundle exec rspec spec/datadog/di/serializer_spec.rb spec/datadog/di/transport/input_spec.rb spec/datadog/di/integration/instrumentation_spec.rb

All tests pass (125 serializer tests including 2 new custom serializer exception tests).

Tested manually:

Binary data (ASCII-8BIT encoded strings) captured in dynamic instrumentation snapshots cannot be serialized to JSON, causing JSON::GeneratorError when snapshots are transmitted to the backend. This adds comprehensive test coverage across three layers: - Serializer: Verifies binary encoding is preserved in serialized output - Transport: Demonstrates JSON encoding failures at transport boundary - Integration: End-to-end tests showing complete DI pipeline failure Tests cover binary data in method parameters, return values, and nested structures, preparing for Python-style escaping implementation. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Binary strings (ASCII-8BIT encoding) are now converted to Python-style repr format (e.g., b'\x80\xff') before JSON serialization. This makes binary data JSON-safe and prevents JSON::GeneratorError. Implementation details: - Added binary_to_python_repr() method to convert binary strings - Handles printable ASCII, special escapes (\n, \t, \r, \\, \'), and hex escapes (\xHH) - Truncation happens before conversion to avoid cutting mid-escape sequence - Binary data is truncated to max_length/4 bytes to account for repr expansion Test updates: - All tests now expect Python repr format for binary data - Transport tests verify JSON encoding succeeds with repr format - Integration tests confirm end-to-end success This matches Python dd-trace-py's behavior where bytes are serialized as repr strings, providing consistent cross-language handling. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The previous implementation truncated binary data before conversion to Python repr, which was overly conservative and didn't match the reference tracer behavior. Changes: - Convert FULL binary string to Python repr first - Truncate the repr string if it exceeds max_capture_string_length - Store the full repr length in the size field (not original binary length) - max_capture_string_length is measured in output characters, not input bytes This matches dd-trace-py which does: value_repr = repr(value) if len(value_repr) > maxlen: return {"value": value_repr[:maxlen], "truncated": True, "size": len(value_repr)} Reference tracer behavior documented in code comments: - Python: Convert entire bytes to repr, then truncate repr string - Java: Truncate byte array at maxCollectionSize, convert each element Added tests verifying: - 100-byte binary string → 403-char repr, truncated at character limit - Printable ASCII captures more data (less expansion) - Size field contains repr length, matching Python behavior Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Removed references to "Python repr" and replaced with "escaped binary string" or "JSON-safe format" terminology. The implementation produces the same serialized contents as dd-trace-py, but should be documented in language-agnostic terms. Changes: - Renamed binary_to_python_repr() -> escape_binary_string() - Updated comments to describe the escaping format (b'...') rather than referencing Python's repr() function - Test descriptions now say "escapes binary data" instead of "converts to Python repr" - Documented that we produce the same serialized contents as dd-trace-py Added truncation behavior documentation and test: - Documented that truncation may occur mid-escape-sequence (e.g., "b'\\x80\\x" cutting after partial hex escape) - Added comment explaining this is intentional and matches dd-trace-py - Added test demonstrating mid-escape truncation with detailed character-by-character breakdown All 18 binary data tests passing. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Critical fix: Extended binary escaping to handle invalid UTF-8 strings (not just ASCII-8BIT encoding). This catches strings marked as UTF-8 but containing invalid byte sequences, which commonly happens when binary data is incorrectly tagged. Changes to lib/datadog/di/serializer.rb: - Check for `!value.valid_encoding?` in addition to BINARY encoding - Updated documentation to mention invalid UTF-8 handling - Added comment documenting that size field represents escaped string length, not original binary length (matches dd-trace-py) New test cases added (addressing senior engineer review): 1. Invalid UTF-8 strings - Strings marked as UTF-8 with invalid byte sequences - Mixed valid/invalid UTF-8 content 2. Empty binary string (edge case) 3. Very large binary string (100K bytes) - Verifies truncation behavior - Confirms size field semantics (escaped length) - Tests JSON serializability All 23 binary data tests passing (up from 18). All 113 serializer tests passing. This addresses the critical gap identified in code review where binary data incorrectly tagged as UTF-8 would fail JSON encoding. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Apply max_capture_string_length to original binary data before escaping, not to the escaped representation. This is more efficient (no wasted CPU/memory) and produces more intuitive size field reporting. Changes: - Truncate binary data at byte limit before escaping (not after) - Size field reports original binary byte count - Regular UTF-8 strings continue to use character-based truncation - Add comprehensive tests for all truncation scenarios Benefits: - Efficient: escape only what's needed (e.g., 255 bytes not 1MB) - Intuitive: size=1000 means 1000 bytes of binary, not 4000 chars of escaped output - Consistent: limit applies to captured data size, not serialization overhead Example with 300-byte binary, limit=255: - Before: escape 300 bytes → 1203 chars, truncate to 255 chars, size=1203 - After: truncate to 255 bytes, escape → 1023 chars, size=300 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add tests for valid text encodings with high-bit characters (Latin1, Windows-1252) to document that these are NOT escaped as binary data. Behavior: - Only escape: Binary (ASCII-8BIT) and invalid encodings - Don't escape: Valid text in other encodings (Latin1, Windows-1252, etc.) - Rationale: Valid text encodings are legitimate data, not binary blobs - JSON.dump handles transcoding automatically (Latin1 → UTF-8) Test coverage: - Latin1 with high-bit characters (é = 0xE9) - Latin1 with all high bytes (128-255) - Windows-1252 encoding (€ = 0x80) - Truncation preserves encoding and uses character length This ensures we distinguish between: 1. Binary data (needs escaping for JSON safety) 2. Invalid encodings (needs escaping for JSON safety) 3. Valid text in non-UTF8 encodings (transcoded by JSON, not escaped) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Use dynamic string creation instead of string literals to avoid potential frozen literal caching issues that can cause intermittent test failures in RSpec's execute_in_fork mode. Changes: - Use [0x80, 0xFF].pack('C*') instead of "\x80\xFF" literal - Use String.new + << to build strings dynamically These tests pass consistently when run individually (20/20) but fail intermittently (~50%) when run with full suite due to test order dependencies. Using dynamic string creation improves stability. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

github-actions · 2026-03-10T18:34:00Z

Thank you for updating Change log entry section 👏

^{Visited at: 2026-03-11 20:09:57 UTC}

pr-commenter · 2026-03-10T19:08:01Z

Benchmarks

Benchmark execution time: 2026-03-12 02:37:52

Comparing candidate commit dfd176a in PR branch di-binary-data-serialization-tests with baseline commit 7a9046d in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 46 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

🟩 = significantly better candidate vs. baseline
🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

@return

Resolve StandardRB violations: - Use symbol literal syntax (:@return) instead of string-to-symbol conversion - Use unary plus for unfrozen string literals instead of .dup and String.new Fix Steep type checker error: - Simplify block syntax in URLEncoded.parse to avoid inline type assertion issue Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The HTTP transport client calls logger.debug during request execution, which was causing RSpec::Mocks::MockExpectationError in the transport tests. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Replace "Python repr" with "escaped binary" throughout tests and comments for language-agnostic clarity. The format matches other Datadog tracers but isn't Python-specific. Changes: - Renamed python_repr variable to escaped_binary - Updated test descriptions and comments - Revised implementation comment to focus on cross-tracer consistency Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Replace unnecessarily complex pack('C*') and String.new patterns with simple string literals. The original patterns were added to work around a supposed frozen literal caching issue, but normal string literals are the standard Ruby idiom and work correctly. Changes: - Replace [0x80, 0xFF].pack('C*') with "\x80\xFF" - Replace (+Hello).b << 0x80 << "World".b << 0xFF with "Hello\x80World\xFF"

lib/datadog/di/serializer.rb

lib/datadog/appsec/utils/http/url_encoded.rb

lib/datadog/di/serializer.rb

- Revert url_encoded.rb formatting to preserve Steep type annotations - Fix truncation justification: correctness not performance - Improve comments for clarity - Add important note to escape_binary_string about valid inputs - Clarify force_encoding call is ASCII-compatible Addresses review feedback from p-datadog.

The result string is already UTF-8 (Ruby's default) and we only append ASCII characters, so force_encoding is a pointless no-op.

Strings with invalid UTF-8 encoding were causing exceptions during serialization due to two issues: 1. Using `.length` and `[0...max]` on invalid UTF-8 strings raises "invalid byte sequence in UTF-8" errors. Fixed by using `.bytesize` and `.byteslice` for binary and invalid encoding strings. 2. Custom serializer conditions (e.g., regex matches) can raise exceptions when called with invalid UTF-8 strings. Fixed by wrapping condition checks in a rescue block to skip failing serializers. Added tests for both scenarios and documented the exception handling behavior for custom serializer conditions. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Document that custom serializer condition lambdas that raise exceptions (e.g., regex match against invalid UTF-8 strings) will be silently skipped to prevent breaking the entire serialization process. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add RBS type signature for escape_binary_string method - Use unary plus (+) for mutable string initialization per standard - Refactor case statement to use return value per standard - Remove useless assignment to need_dup variable - Add steep:ignore for byteslice nil handling (safe at runtime) - Fix redundant assignments in test files Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

github-actions · 2026-03-11T00:47:09Z

Typing analysis

Note: Ignored files are excluded from the next sections.

`steep:ignore` comments

This PR introduces 1 steep:ignore comment.

steep:ignore comments (+1-0)

❌ Introduced:

lib/datadog/di/serializer.rb:228

- Remove unnecessary sleep after flush in integration test ProbeNotifierWorker#flush already blocks until completion - Add WARN logging and telemetry to custom serializer exception handling Helps surface errors whether serializers are customer-defined or internal - Change bare rescue to rescue StandardError Avoids catching system exceptions - Add test coverage for logging and telemetry in error path Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The two failing tests expected send_diagnostics to be called but didn't flush the probe_notifier_worker, so diagnostics were never sent before test completion. Fixes: - Add flush after method execution in "escapes binary data in parameters" test - Add flush after method execution in "escapes binary return value" test This matches the pattern in the first test which already had flush and was passing. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The serializer registered in "with condition" context was leaking to "when condition raises an exception" tests, causing double logging. Moved the around block to the describe '.register' level so it applies to all tests, preventing test pollution between contexts. This fixes the test failures where logger.warn was called 2 times instead of 1 time - the extra call was from the serializer registered at line 546. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

StandardRB style guide prefers bare rescue (which catches StandardError and descendants) over explicitly naming StandardError. Fixes: Style/RescueStandardError: Omit the error class when rescuing `StandardError` by itself. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

datadog-official · 2026-03-11T20:01:09Z

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
• Patch Coverage: 96.06%
• Overall Coverage: 95.14% (-0.02%)

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: dfd176a | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback!}

docs/DynamicInstrumentation.md

lib/datadog/di/serializer.rb

…tly skipped Address review comments: Update documentation to reflect that custom serializer condition exceptions are logged at WARN level and reported via telemetry, not silently skipped. - Fixed in docs/DynamicInstrumentation.md:286 - Fixed in lib/datadog/di/serializer.rb:73 The implementation (lines 165-176) logs with Datadog.logger.warn() and reports via telemetry&.report(), so the documentation now accurately reflects this behavior. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Telemetry is internal and not accessible to customers. Updated documentation to only mention the WARN level logging which is visible to customers in their application logs. - docs/DynamicInstrumentation.md: Removed telemetry mention - lib/datadog/di/serializer.rb: Removed telemetry from public comment The implementation still reports to telemetry (line 175) but this is not mentioned in customer-facing documentation. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Added documentation guideline that telemetry should not be mentioned in customer-facing Dynamic Instrumentation documentation, as it is internal and not accessible to customers. Only observable behavior (logging, customer-visible metrics) should be documented for customers. Internal code comments may mention telemetry when describing implementation details. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

PR descriptions must format changelog entries with explicit "Yes." or "None." prefix, not just the summary text. Examples: - Correct: "Yes. Brief customer-facing summary." - Correct: "None." - Wrong: "Brief customer-facing summary." (missing Yes. prefix) This ensures consistency with the automated changelog bot requirements. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Strech

LGTM with few suggestions

The comment about working with strings and encodings gets repeated multiple times with a big multiline comment, maybe it could be trimmed or just referenced to a single place
A small suggestion for test

spec/datadog/di/transport/input_spec.rb

Co-authored-by: Sergey Fedorov <oni.strech@gmail.com>

Address review feedback from strech: trim repeated multiline comments about string/encoding handling. Keep comprehensive documentation at the escape_binary_string method definition and reference it from the call site. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Unicorn Enterprises and others added 8 commits March 10, 2026 10:33

p-datadog requested a review from a team as a code owner March 10, 2026 18:33

p-datadog added the AI Generated Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos label Mar 10, 2026

p-datadog requested a review from a team as a code owner March 10, 2026 20:34

github-actions bot added the appsec Application Security monitoring product label Mar 10, 2026

Unicorn Enterprises and others added 3 commits March 10, 2026 16:46

Fix transport test failures by allowing logger debug calls

2b2fb7c

The HTTP transport client calls logger.debug during request execution, which was causing RSpec::Mocks::MockExpectationError in the transport tests. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

p-datadog commented Mar 10, 2026

View reviewed changes

Unicorn Enterprises added 2 commits March 10, 2026 17:04

Remove unnecessary force_encoding call

bb4506b

The result string is already UTF-8 (Ruby's default) and we only append ASCII characters, so force_encoding is a pointless no-op.

p-datadog force-pushed the di-binary-data-serialization-tests branch from 5393ad4 to bb4506b Compare March 11, 2026 00:15

Unicorn Enterprises and others added 3 commits March 10, 2026 20:32

Unicorn Enterprises and others added 4 commits March 11, 2026 15:14

Merge branch 'master' into di-binary-data-serialization-tests

4b4d40c

p-datadog commented Mar 11, 2026

View reviewed changes

docs/DynamicInstrumentation.md Outdated Show resolved Hide resolved

lib/datadog/di/serializer.rb Outdated Show resolved Hide resolved

Unicorn Enterprises and others added 4 commits March 11, 2026 16:05

Strech approved these changes Mar 11, 2026

View reviewed changes

spec/datadog/di/transport/input_spec.rb Outdated Show resolved Hide resolved

p-datadog and others added 2 commits March 11, 2026 21:56

Update spec/datadog/di/transport/input_spec.rb

267191f

Co-authored-by: Sergey Fedorov <oni.strech@gmail.com>

p-datadog merged commit 6b340a0 into master Mar 12, 2026
631 checks passed

p-datadog deleted the di-binary-data-serialization-tests branch March 12, 2026 02:50

github-actions bot added this to the 2.30.0 milestone Mar 12, 2026

Strech mentioned this pull request Mar 19, 2026

Bump to version 2.30.0 #5474

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix binary data serialization in dynamic instrumentation#5434

Fix binary data serialization in dynamic instrumentation#5434
p-datadog merged 28 commits intomasterfrom
di-binary-data-serialization-tests

p-datadog commented Mar 10, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 10, 2026 •

edited

Loading

Uh oh!

pr-commenter bot commented Mar 10, 2026 •

edited

Loading

Explanation

More details about the CI and significant changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Mar 11, 2026 •

edited

Loading

Uh oh!

datadog-official bot commented Mar 11, 2026 •

edited by datadog-prod-us1-6 bot

Loading

Uh oh!

Uh oh!

Uh oh!

Strech left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

p-datadog commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pr-commenter bot commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmarks

Explanation

More details about the CI and significant changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Typing analysis

steep:ignore comments

Uh oh!

datadog-official bot commented Mar 11, 2026 • edited by datadog-prod-us1-6 bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Strech left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

p-datadog commented Mar 10, 2026 •

edited

Loading

github-actions bot commented Mar 10, 2026 •

edited

Loading

pr-commenter bot commented Mar 10, 2026 •

edited

Loading

github-actions bot commented Mar 11, 2026 •

edited

Loading

`steep:ignore` comments

datadog-official bot commented Mar 11, 2026 •

edited by datadog-prod-us1-6 bot

Loading