Fix binary data serialization in dynamic instrumentation#5434
Fix binary data serialization in dynamic instrumentation#5434
Conversation
Binary data (ASCII-8BIT encoded strings) captured in dynamic instrumentation snapshots cannot be serialized to JSON, causing JSON::GeneratorError when snapshots are transmitted to the backend. This adds comprehensive test coverage across three layers: - Serializer: Verifies binary encoding is preserved in serialized output - Transport: Demonstrates JSON encoding failures at transport boundary - Integration: End-to-end tests showing complete DI pipeline failure Tests cover binary data in method parameters, return values, and nested structures, preparing for Python-style escaping implementation. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Binary strings (ASCII-8BIT encoding) are now converted to Python-style repr format (e.g., b'\x80\xff') before JSON serialization. This makes binary data JSON-safe and prevents JSON::GeneratorError. Implementation details: - Added binary_to_python_repr() method to convert binary strings - Handles printable ASCII, special escapes (\n, \t, \r, \\, \'), and hex escapes (\xHH) - Truncation happens before conversion to avoid cutting mid-escape sequence - Binary data is truncated to max_length/4 bytes to account for repr expansion Test updates: - All tests now expect Python repr format for binary data - Transport tests verify JSON encoding succeeds with repr format - Integration tests confirm end-to-end success This matches Python dd-trace-py's behavior where bytes are serialized as repr strings, providing consistent cross-language handling. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The previous implementation truncated binary data before conversion to
Python repr, which was overly conservative and didn't match the reference
tracer behavior.
Changes:
- Convert FULL binary string to Python repr first
- Truncate the repr string if it exceeds max_capture_string_length
- Store the full repr length in the size field (not original binary length)
- max_capture_string_length is measured in output characters, not input bytes
This matches dd-trace-py which does:
value_repr = repr(value)
if len(value_repr) > maxlen:
return {"value": value_repr[:maxlen], "truncated": True, "size": len(value_repr)}
Reference tracer behavior documented in code comments:
- Python: Convert entire bytes to repr, then truncate repr string
- Java: Truncate byte array at maxCollectionSize, convert each element
Added tests verifying:
- 100-byte binary string → 403-char repr, truncated at character limit
- Printable ASCII captures more data (less expansion)
- Size field contains repr length, matching Python behavior
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Removed references to "Python repr" and replaced with "escaped binary string" or "JSON-safe format" terminology. The implementation produces the same serialized contents as dd-trace-py, but should be documented in language-agnostic terms. Changes: - Renamed binary_to_python_repr() -> escape_binary_string() - Updated comments to describe the escaping format (b'...') rather than referencing Python's repr() function - Test descriptions now say "escapes binary data" instead of "converts to Python repr" - Documented that we produce the same serialized contents as dd-trace-py Added truncation behavior documentation and test: - Documented that truncation may occur mid-escape-sequence (e.g., "b'\\x80\\x" cutting after partial hex escape) - Added comment explaining this is intentional and matches dd-trace-py - Added test demonstrating mid-escape truncation with detailed character-by-character breakdown All 18 binary data tests passing. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Critical fix: Extended binary escaping to handle invalid UTF-8 strings (not just ASCII-8BIT encoding). This catches strings marked as UTF-8 but containing invalid byte sequences, which commonly happens when binary data is incorrectly tagged. Changes to lib/datadog/di/serializer.rb: - Check for `!value.valid_encoding?` in addition to BINARY encoding - Updated documentation to mention invalid UTF-8 handling - Added comment documenting that size field represents escaped string length, not original binary length (matches dd-trace-py) New test cases added (addressing senior engineer review): 1. Invalid UTF-8 strings - Strings marked as UTF-8 with invalid byte sequences - Mixed valid/invalid UTF-8 content 2. Empty binary string (edge case) 3. Very large binary string (100K bytes) - Verifies truncation behavior - Confirms size field semantics (escaped length) - Tests JSON serializability All 23 binary data tests passing (up from 18). All 113 serializer tests passing. This addresses the critical gap identified in code review where binary data incorrectly tagged as UTF-8 would fail JSON encoding. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Apply max_capture_string_length to original binary data before escaping, not to the escaped representation. This is more efficient (no wasted CPU/memory) and produces more intuitive size field reporting. Changes: - Truncate binary data at byte limit before escaping (not after) - Size field reports original binary byte count - Regular UTF-8 strings continue to use character-based truncation - Add comprehensive tests for all truncation scenarios Benefits: - Efficient: escape only what's needed (e.g., 255 bytes not 1MB) - Intuitive: size=1000 means 1000 bytes of binary, not 4000 chars of escaped output - Consistent: limit applies to captured data size, not serialization overhead Example with 300-byte binary, limit=255: - Before: escape 300 bytes → 1203 chars, truncate to 255 chars, size=1203 - After: truncate to 255 bytes, escape → 1023 chars, size=300 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add tests for valid text encodings with high-bit characters (Latin1, Windows-1252) to document that these are NOT escaped as binary data. Behavior: - Only escape: Binary (ASCII-8BIT) and invalid encodings - Don't escape: Valid text in other encodings (Latin1, Windows-1252, etc.) - Rationale: Valid text encodings are legitimate data, not binary blobs - JSON.dump handles transcoding automatically (Latin1 → UTF-8) Test coverage: - Latin1 with high-bit characters (é = 0xE9) - Latin1 with all high bytes (128-255) - Windows-1252 encoding (€ = 0x80) - Truncation preserves encoding and uses character length This ensures we distinguish between: 1. Binary data (needs escaping for JSON safety) 2. Invalid encodings (needs escaping for JSON safety) 3. Valid text in non-UTF8 encodings (transcoded by JSON, not escaped) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Use dynamic string creation instead of string literals to avoid potential
frozen literal caching issues that can cause intermittent test failures
in RSpec's execute_in_fork mode.
Changes:
- Use [0x80, 0xFF].pack('C*') instead of "\x80\xFF" literal
- Use String.new + << to build strings dynamically
These tests pass consistently when run individually (20/20) but fail
intermittently (~50%) when run with full suite due to test order
dependencies. Using dynamic string creation improves stability.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
Thank you for updating Change log entry section 👏 Visited at: 2026-03-11 20:09:57 UTC |
BenchmarksBenchmark execution time: 2026-03-12 02:37:52 Comparing candidate commit dfd176a in PR branch Found 0 performance improvements and 0 performance regressions! Performance is the same for 46 metrics, 0 unstable metrics.
|
Resolve StandardRB violations: - Use symbol literal syntax (:@return) instead of string-to-symbol conversion - Use unary plus for unfrozen string literals instead of .dup and String.new Fix Steep type checker error: - Simplify block syntax in URLEncoded.parse to avoid inline type assertion issue Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The HTTP transport client calls logger.debug during request execution, which was causing RSpec::Mocks::MockExpectationError in the transport tests. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace "Python repr" with "escaped binary" throughout tests and comments for language-agnostic clarity. The format matches other Datadog tracers but isn't Python-specific. Changes: - Renamed python_repr variable to escaped_binary - Updated test descriptions and comments - Revised implementation comment to focus on cross-tracer consistency Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace unnecessarily complex pack('C*') and String.new patterns with
simple string literals. The original patterns were added to work around
a supposed frozen literal caching issue, but normal string literals are
the standard Ruby idiom and work correctly.
Changes:
- Replace [0x80, 0xFF].pack('C*') with "\x80\xFF"
- Replace (+Hello).b << 0x80 << "World".b << 0xFF with "Hello\x80World\xFF"
- Revert url_encoded.rb formatting to preserve Steep type annotations - Fix truncation justification: correctness not performance - Improve comments for clarity - Add important note to escape_binary_string about valid inputs - Clarify force_encoding call is ASCII-compatible Addresses review feedback from p-datadog.
The result string is already UTF-8 (Ruby's default) and we only append ASCII characters, so force_encoding is a pointless no-op.
5393ad4 to
bb4506b
Compare
Strings with invalid UTF-8 encoding were causing exceptions during serialization due to two issues: 1. Using `.length` and `[0...max]` on invalid UTF-8 strings raises "invalid byte sequence in UTF-8" errors. Fixed by using `.bytesize` and `.byteslice` for binary and invalid encoding strings. 2. Custom serializer conditions (e.g., regex matches) can raise exceptions when called with invalid UTF-8 strings. Fixed by wrapping condition checks in a rescue block to skip failing serializers. Added tests for both scenarios and documented the exception handling behavior for custom serializer conditions. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Document that custom serializer condition lambdas that raise exceptions (e.g., regex match against invalid UTF-8 strings) will be silently skipped to prevent breaking the entire serialization process. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add RBS type signature for escape_binary_string method - Use unary plus (+) for mutable string initialization per standard - Refactor case statement to use return value per standard - Remove useless assignment to need_dup variable - Add steep:ignore for byteslice nil handling (safe at runtime) - Fix redundant assignments in test files Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Typing analysisNote: Ignored files are excluded from the next sections.
|
- Remove unnecessary sleep after flush in integration test ProbeNotifierWorker#flush already blocks until completion - Add WARN logging and telemetry to custom serializer exception handling Helps surface errors whether serializers are customer-defined or internal - Change bare rescue to rescue StandardError Avoids catching system exceptions - Add test coverage for logging and telemetry in error path Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The two failing tests expected send_diagnostics to be called but didn't flush the probe_notifier_worker, so diagnostics were never sent before test completion. Fixes: - Add flush after method execution in "escapes binary data in parameters" test - Add flush after method execution in "escapes binary return value" test This matches the pattern in the first test which already had flush and was passing. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The serializer registered in "with condition" context was leaking to "when condition raises an exception" tests, causing double logging. Moved the around block to the describe '.register' level so it applies to all tests, preventing test pollution between contexts. This fixes the test failures where logger.warn was called 2 times instead of 1 time - the extra call was from the serializer registered at line 546. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
StandardRB style guide prefers bare rescue (which catches StandardError and descendants) over explicitly naming StandardError. Fixes: Style/RescueStandardError: Omit the error class when rescuing `StandardError` by itself. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🎯 Code Coverage (details) 🔗 Commit SHA: dfd176a | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback! |
…tly skipped Address review comments: Update documentation to reflect that custom serializer condition exceptions are logged at WARN level and reported via telemetry, not silently skipped. - Fixed in docs/DynamicInstrumentation.md:286 - Fixed in lib/datadog/di/serializer.rb:73 The implementation (lines 165-176) logs with Datadog.logger.warn() and reports via telemetry&.report(), so the documentation now accurately reflects this behavior. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Telemetry is internal and not accessible to customers. Updated documentation to only mention the WARN level logging which is visible to customers in their application logs. - docs/DynamicInstrumentation.md: Removed telemetry mention - lib/datadog/di/serializer.rb: Removed telemetry from public comment The implementation still reports to telemetry (line 175) but this is not mentioned in customer-facing documentation. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added documentation guideline that telemetry should not be mentioned in customer-facing Dynamic Instrumentation documentation, as it is internal and not accessible to customers. Only observable behavior (logging, customer-visible metrics) should be documented for customers. Internal code comments may mention telemetry when describing implementation details. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
PR descriptions must format changelog entries with explicit "Yes." or "None." prefix, not just the summary text. Examples: - Correct: "Yes. Brief customer-facing summary." - Correct: "None." - Wrong: "Brief customer-facing summary." (missing Yes. prefix) This ensures consistency with the automated changelog bot requirements. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Strech
left a comment
There was a problem hiding this comment.
LGTM with few suggestions
- The comment about working with strings and encodings gets repeated multiple times with a big multiline comment, maybe it could be trimmed or just referenced to a single place
- A small suggestion for test
Co-authored-by: Sergey Fedorov <oni.strech@gmail.com>
Address review feedback from strech: trim repeated multiline comments about string/encoding handling. Keep comprehensive documentation at the escape_binary_string method definition and reference it from the call site. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
What does this PR do?
Fixes binary data serialization in dynamic instrumentation snapshots. Binary strings (ASCII-8BIT encoding) and invalid UTF-8 strings are now escaped to a JSON-safe format, preventing JSON encoding failures when snapshots contain binary data. Also fixes exception handling for custom serializer conditions.
Motivation:
Binary data in DI snapshots causes JSON serialization to fail. When dynamic instrumentation captures variables containing binary strings (database records, protocol buffers, network payloads, etc.), the snapshot serialization fails during JSON encoding, preventing the snapshot from being sent to the backend. This implementation escapes binary data to
b'\xHH'format (matches dd-trace-py output), handles invalid UTF-8 strings, applies truncation efficiently (limit original bytes, not escaped output), and preserves valid text in other encodings (Latin1, Windows-1252).Additionally, strings with invalid UTF-8 encoding were causing exceptions during serialization, and custom serializer conditions using regex could fail on invalid UTF-8 strings.
Change log entry
Yes. Dynamic Instrumentation: Fix JSON serialization failures when snapshots contain binary data and invalid UTF-8 strings.
Additional Notes:
Implementation escapes binary bytes to
b'\x80\xff'format with printable ASCII preserved, special characters (\n,\t,\r,\\,\') escaped, and other bytes as\xHHhex escapes. Truncation is applied to original binary data before escaping (e.g., 300-byte binary with limit=255 → truncate to 255 bytes, escape → 1023 char output, size=300). Output format matches dd-trace-py for consistency across language tracers. Escaping overhead is 2-6x (vs 40x for Java's approach).Fixes:
.bytesizeand.bytesliceinstead of.lengthand[0...max]to avoid encoding exceptionsdocs/DynamicInstrumentation.mdexplaining custom serializer exception handling behaviorAdded comprehensive tests covering all byte values, truncation behavior, invalid UTF-8, valid non-UTF8 encodings, edge cases, custom serializer exception handling, and integration through transport layer.
How to test the change?
Run all binary data related tests:
bundle exec rspec spec/datadog/di/serializer_spec.rb spec/datadog/di/transport/input_spec.rb spec/datadog/di/integration/instrumentation_spec.rbAll tests pass (125 serializer tests including 2 new custom serializer exception tests).
Tested manually:
