diff --git a/CLAUDE.md b/CLAUDE.md
index 689f906dd2b..c789962571b 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -9,6 +9,10 @@
 - Use one sentence per relevant point in summary/motivation sections
 - Changelog entries are written for customers only; consider changes from user/customer POV
 - Internal changes (telemetry, CI, tooling) = "None" for changelog
+- Changelog entry format: MUST start with "Yes." or "None."
+  - If changes need CHANGELOG: `Yes. Brief customer-facing summary.`
+  - If no CHANGELOG needed: `None.`
+  - Never write just the summary without "Yes." prefix
 - Add `--label "AI Generated"` when creating PRs (do not mention AI in description; label is sufficient)
 
 ## Never
@@ -20,6 +24,7 @@
 - Change versioning (`lib/datadog/version.rb`, `CHANGELOG.md`)
 - Leave resources open (terminate threads, close files)
 - Make breaking public API changes
+- Use `sleep` in tests for synchronization (use deterministic waits: Queue, ConditionVariable, flush methods that block, or mock time)
 
 ## Ask First
 
@@ -69,6 +74,13 @@ actionlint .github/workflows/your-workflow.yml
 - If a requested change contradicts code evidence, alert user before proceeding
 - If unable to access a requested web page, explicitly state this and explain basis for any suggestions
 
+## Documentation
+
+- **Dynamic Instrumentation docs**: Never mention telemetry in customer-facing documentation (e.g., `docs/DynamicInstrumentation.md`)
+  - Telemetry is internal and not accessible to customers
+  - Only mention observable behavior (logging, metrics visible to customers)
+  - Internal code comments may mention telemetry when describing implementation
+
 ## Environment Variables
 
 - Use `DATADOG_ENV`, never `ENV` directly (see `docs/AccessEnvironmentVariables.md`)
@@ -102,6 +114,7 @@ bundle exec rspec spec/path/file_spec.rb:123 # Run specific test
 - Pipe rspec output: `2>&1 | tee /tmp/rspec.log | grep -E 'Pending:|Failures:|Finished' -A 99`
 - Transport noise (`Internal error during Datadog::Tracing::Transport::HTTP::Client request`) is expected
 - Profiling specs fail on macOS without additional setup
+- `ProbeNotifierWorker#flush` blocks until queues are empty - never add `sleep` after it
 
 # Style
 
diff --git a/docs/DynamicInstrumentation.md b/docs/DynamicInstrumentation.md
index 63ba87c1c7d..a2002e77c3f 100644
--- a/docs/DynamicInstrumentation.md
+++ b/docs/DynamicInstrumentation.md
@@ -275,6 +275,19 @@ per-probe in the probe definition.
 - **Workaround:** Increase the capture depth for probes targeting code that
   works with complex objects
 
+#### Custom Serializers
+
+Custom serializers allow you to define how specific objects are serialized
+in Dynamic Instrumentation snapshots. The API is currently internal and
+subject to change.
+
+**Exception Handling:** If a custom serializer's condition lambda raises
+an exception (for example, a regex match against a string with invalid
+UTF-8 encoding), the exception will be logged at WARN level, then the
+serializer will be skipped and the next serializer will be tried. This
+prevents custom serializers from breaking the entire serialization process.
+The value will fall back to default serialization.
+
 ## Application Data Sent to Datadog
 
 Dynamic instrumentation sends some of the application data to Datadog.
diff --git a/lib/datadog/di/serializer.rb b/lib/datadog/di/serializer.rb
index 1fe10735de4..eddaa48c692 100644
--- a/lib/datadog/di/serializer.rb
+++ b/lib/datadog/di/serializer.rb
@@ -67,6 +67,12 @@ class Serializer
       #
       # Important: these serializers are NOT used in log messages.
       # They are only used for variables that are captured in the snapshots.
+      #
+      # Exception handling: If a custom serializer's condition lambda raises
+      # an exception (e.g., regex match against invalid UTF-8 strings), the
+      # exception will be logged at WARN level, then the serializer will be
+      # skipped and the next serializer will be tried. This prevents custom
+      # serializers from breaking the entire serialization process.
       @@flat_registry = []
       def self.register(condition: nil, &block)
         @@flat_registry << {condition: condition, proc: block}
@@ -152,9 +158,28 @@ def serialize_value(value, name: nil,
         end
 
         @@flat_registry.each do |entry|
-          if (condition = entry[:condition]) && condition.call(value)
-            serializer_proc = entry.fetch(:proc)
-            return serializer_proc.call(self, value, name: nil, depth: depth)
+          condition = entry[:condition]
+          if condition
+            begin
+              condition_result = condition.call(value)
+            rescue => e
+              # If a custom serializer condition raises an exception (e.g., regex match
+              # against invalid UTF-8), skip it and continue with the next serializer.
+              # We don't want custom serializer conditions to break the entire serialization.
+              #
+              # Custom serializers may be defined by customers (in which case we should
+              # surface errors so they can fix their serializers) or they may be defined
+              # internally by dd-trace-rb (in which case we need to fix them). We use
+              # WARN level to surface these errors in either case.
+              Datadog.logger.warn("DI: Custom serializer condition failed: #{e.class}: #{e.message}")
+              telemetry&.report(e, description: "Custom serializer condition failed")
+              next
+            end
+
+            if condition_result
+              serializer_proc = entry.fetch(:proc)
+              return serializer_proc.call(self, value, name: nil, depth: depth)
+            end
           end
         end
 
@@ -184,13 +209,35 @@ def serialize_value(value, name: nil,
           else
             value.to_s
           end
+
+          # Handle binary strings and invalid UTF-8 by escaping to JSON-safe format.
+          # See escape_binary_string for details on the escaping format.
+          #
+          # Truncate binary data BEFORE escaping to avoid cutting mid-escape-sequence.
+          # For regular strings, the limit is applied to string length in characters.
           max = settings.dynamic_instrumentation.max_capture_string_length
-          if value.length > max
-            serialized.update(truncated: true, size: value.length)
-            value = value[0...max]
-            need_dup = false
+
+          if value.encoding == Encoding::BINARY || !value.valid_encoding?
+            # Truncate binary data BEFORE escaping to avoid cutting mid-escape-sequence
+            # For invalid encodings, use bytesize instead of length to avoid encoding errors
+            original_size = value.bytesize
+            if original_size > max
+              serialized.update(truncated: true, size: original_size)
+              value = value.byteslice(0...max)
+            end
+            value = escape_binary_string(value) # steep:ignore ArgumentTypeMismatch
+            false # Already converted to a new string
+          else
+            # Truncate non-binary strings
+            if value.length > max
+              serialized.update(truncated: true, size: value.length)
+              value = value[0...max]
+              need_dup = false
+            end
+
+            value = value.dup if need_dup
           end
-          value = value.dup if need_dup
+
           serialized.update(value: value)
         when Array
           if depth < 0
@@ -417,6 +464,53 @@ def serialize_string_or_symbol_for_message(value)
           value
         end
       end
+
+      # Escapes a binary string or invalid UTF-8 string to a JSON-safe format.
+      #
+      # IMPORTANT: This method should ONLY be called with either:
+      # 1. True binary strings (encoding == Encoding::BINARY / ASCII-8BIT)
+      # 2. Strings with invalid encoding (!value.valid_encoding?)
+      #
+      # Calling this method with valid UTF-8 strings will produce incorrect output.
+      #
+      # Binary data (ASCII-8BIT encoding) or strings with invalid encoding are
+      # converted to an escaped string in the format: b'...' with hex escapes
+      # for non-printable bytes.
+      #
+      # The output format matches other Datadog tracer libraries for consistency
+      # across language implementations. The output is JSON-serializable.
+      #
+      # Examples:
+      # "Hello".b -> "b'Hello'"
+      # "\x80\xFF".b -> "b'\\x80\\xff'"
+      # "\x80".force_encoding('UTF-8') -> "b'\\x80'" (invalid UTF-8)
+      #
+      # @param binary_string [String] A string with ASCII-8BIT encoding or invalid encoding
+      # @return [String] Escaped string with UTF-8 encoding
+      def escape_binary_string(binary_string)
+        result = +"b'"
+        binary_string.each_byte do |byte|
+          result << case byte
+          when 0x09 # \t
+            '\\t'
+          when 0x0A # \n
+            '\\n'
+          when 0x0D # \r
+            '\\r'
+          when 0x27 # '
+            "\\'"
+          when 0x5C # \
+            '\\\\'
+          when 0x20..0x7E # Printable ASCII (space through ~)
+            byte.chr
+          else
+            # Non-printable: use \xHH format
+            format('\\x%02x', byte)
+          end
+        end
+        result << "'"
+        result
+      end
     end
   end
 end
diff --git a/sig/datadog/di/serializer.rbs b/sig/datadog/di/serializer.rbs
index 7459fc4c2f9..9fbf1db0736 100644
--- a/sig/datadog/di/serializer.rbs
+++ b/sig/datadog/di/serializer.rbs
@@ -36,6 +36,8 @@ module Datadog
       def class_name: (untyped cls) -> String
 
       def serialize_string_or_symbol_for_message: (untyped value) -> untyped
+
+      def escape_binary_string: (String binary_string) -> String
     end
   end
 end
diff --git a/spec/datadog/di/integration/instrumentation_spec.rb b/spec/datadog/di/integration/instrumentation_spec.rb
index 9016f7798e9..c160eef865d 100644
--- a/spec/datadog/di/integration/instrumentation_spec.rb
+++ b/spec/datadog/di/integration/instrumentation_spec.rb
@@ -38,6 +38,17 @@ def ivar_mutating_method
   def exception_method
     raise TestException, 'Test exception'
   end
+
+  def binary_data_method
+    # Return a string with high bytes that will fail JSON encoding
+    # 300 bytes to exceed default max_capture_string_length of 255
+    ((128..255).to_a * 3)[0...300].map { |i| i.chr(Encoding::BINARY) }.join.force_encoding(Encoding::BINARY)
+  end
+
+  def binary_data_param_method(binary_param, normal_param)
+    # Method with binary data in parameters
+    binary_param.length + normal_param.length
+  end
 end
 
 RSpec.describe 'Instrumentation integration' do
@@ -1339,6 +1350,151 @@ def run_test
       end
     end
   end
+
+  context 'binary data in snapshots' do
+    context 'with binary data in parameters' do
+      let(:probe) do
+        Datadog::DI::Probe.new(
+          id: "binary-test",
+          type: :log,
+          type_name: 'InstrumentationSpecTestClass',
+          method_name: 'binary_data_param_method',
+          capture_snapshot: true
+        )
+      end
+
+      let(:binary_string) { "\x80\x81\x82\xFF\xFE".b }
+
+      it 'successfully sends snapshot with binary data through transport' do
+        expect(diagnostics_transport).to receive(:send_diagnostics)
+
+        # Capture the snapshot that goes through transport
+        captured_snapshot = nil
+        json_encoded = nil
+
+        allow(component.probe_notifier_worker).to receive(:add_snapshot).and_wrap_original do |m, *args|
+          captured_snapshot = args[0]
+          m.call(*args)
+        end
+
+        allow(input_transport).to receive(:send_input) do |snapshots, tags|
+          # This mimics what the transport does - encode to JSON
+          json_encoded = JSON.dump(snapshots)
+        end
+
+        probe_manager.add_probe(probe)
+
+        # Execute the method with binary data
+        result = InstrumentationSpecTestClass.new.binary_data_param_method(binary_string, "hello")
+        expect(result).to eq(10) # 5 + 5
+
+        # Wait for flush to complete
+        component.probe_notifier_worker.flush
+
+        # Verify the snapshot was captured
+        expect(captured_snapshot).not_to be_nil
+
+        # JSON encoding should now succeed with escaped binary data
+        expect {
+          JSON.dump(captured_snapshot)
+        }.not_to raise_error
+
+        # Transport should have successfully encoded it
+        expect(json_encoded).to be_a(String)
+        expect(json_encoded.encoding).to eq(Encoding::UTF_8)
+      end
+
+      it 'escapes binary data in parameters' do
+        expect(diagnostics_transport).to receive(:send_diagnostics)
+
+        # Capture the snapshot before it gets to transport
+        captured_snapshot = nil
+        allow(component.probe_notifier_worker).to receive(:add_snapshot) do |snapshot|
+          captured_snapshot = snapshot
+        end
+
+        probe_manager.add_probe(probe)
+
+        # Execute the method
+        InstrumentationSpecTestClass.new.binary_data_param_method(binary_string, "hello")
+
+        # Wait for flush to complete
+        component.probe_notifier_worker.flush
+
+        # Verify snapshot was captured with binary data escaped
+        expect(captured_snapshot).not_to be_nil
+        expect(captured_snapshot[:debugger][:snapshot][:captures]).to have_key(:entry)
+
+        entry_capture = captured_snapshot[:debugger][:snapshot][:captures][:entry]
+        expect(entry_capture[:arguments]).to have_key(:arg1)
+
+        # The binary string is escaped to b'...' format
+        binary_param_value = entry_capture[:arguments][:arg1][:value]
+        expect(binary_param_value).to be_a(String)
+        expect(binary_param_value).to eq("b'\\x80\\x81\\x82\\xff\\xfe'")
+        expect(binary_param_value.encoding).to eq(Encoding::UTF_8)
+
+        # JSON encoding the snapshot should now succeed
+        expect {
+          JSON.dump(captured_snapshot)
+        }.not_to raise_error
+      end
+    end
+
+    context 'with binary return value' do
+      let(:probe) do
+        Datadog::DI::Probe.new(
+          id: "binary-return-test",
+          type: :log,
+          type_name: 'InstrumentationSpecTestClass',
+          method_name: 'binary_data_method',
+          capture_snapshot: true
+        )
+      end
+
+      it 'escapes binary return value' do
+        expect(diagnostics_transport).to receive(:send_diagnostics)
+
+        # Capture the snapshot before transport
+        captured_snapshot = nil
+        allow(component.probe_notifier_worker).to receive(:add_snapshot) do |snapshot|
+          captured_snapshot = snapshot
+        end
+
+        probe_manager.add_probe(probe)
+
+        # Execute the method that returns binary data
+        result = InstrumentationSpecTestClass.new.binary_data_method
+        expect(result.encoding).to eq(Encoding::BINARY)
+        expect(result.length).to eq(300)
+        expect(result.bytes.min).to eq(128)
+        expect(result.bytes.max).to eq(255)
+
+        # Wait for flush to complete
+        component.probe_notifier_worker.flush
+
+        # Verify snapshot captured the return value as escaped string
+        expect(captured_snapshot).not_to be_nil
+        return_capture = captured_snapshot[:debugger][:snapshot][:captures][:return]
+        expect(return_capture[:arguments]).to have_key(:@return)
+
+        return_value = return_capture[:arguments][:@return][:value]
+        expect(return_value).to start_with("b'")
+        expect(return_value.encoding).to eq(Encoding::UTF_8)
+        expect(return_value).to include('\\x80') # First high byte
+
+        # The 300-byte binary string exceeds max_capture_string_length (255)
+        # Truncated to first 255 bytes, then escaped to 1023 chars (b' + 255*4 + ')
+        expect(return_capture[:arguments][:@return][:truncated]).to be true
+        expect(return_capture[:arguments][:@return][:size]).to eq(300) # Original byte count
+
+        # JSON encoding should now succeed
+        expect {
+          JSON.dump(captured_snapshot)
+        }.not_to raise_error
+      end
+    end
+  end
 end
 
 # rubocop:enable Style/RescueModifier
diff --git a/spec/datadog/di/serializer_spec.rb b/spec/datadog/di/serializer_spec.rb
index c0d7e6b7d59..8cccb427d6c 100644
--- a/spec/datadog/di/serializer_spec.rb
+++ b/spec/datadog/di/serializer_spec.rb
@@ -541,6 +541,13 @@ def self.define_cases(cases)
   end
 
   describe '.register' do
+    # Save and restore the custom serializer registry to prevent test pollution
+    around do |example|
+      original_registry = described_class.class_variable_get(:@@flat_registry).dup
+      example.run
+      described_class.class_variable_set(:@@flat_registry, original_registry)
+    end
+
     context 'with condition' do
       before do
         described_class.register(condition: lambda { |value| String === value && value =~ /serializer spec hello/ }) do |serializer, value, name:, depth:|
@@ -557,6 +564,65 @@ def self.define_cases(cases)
         expect(serialized).to eq(expected)
       end
     end
+
+    context 'when condition raises an exception' do
+      let(:telemetry) { double('telemetry') }
+      let(:serializer) do
+        described_class.new(settings, redactor, telemetry: telemetry)
+      end
+
+      it 'skips the custom serializer and uses default serialization' do
+        # Register a custom serializer with a condition that raises an exception
+        # This simulates a regex match against invalid UTF-8 strings
+        described_class.register(condition: lambda { |value| value =~ /test/ }) do |serializer, value, name:, depth:|
+          serializer.serialize_value('should not be called')
+        end
+
+        # Invalid UTF-8 string that will cause regex match to raise
+        invalid_utf8 = "\x80\xFF".force_encoding(Encoding::UTF_8)
+        expect(invalid_utf8.valid_encoding?).to be false
+
+        # Expect logging and telemetry
+        expect(Datadog.logger).to receive(:warn).with(/Custom serializer condition failed: ArgumentError/)
+        expect(telemetry).to receive(:report).with(
+          an_instance_of(ArgumentError),
+          description: "Custom serializer condition failed"
+        )
+
+        serialized = serializer.serialize_value(invalid_utf8)
+
+        # Should fall back to default serialization (binary escaping)
+        expect(serialized[:type]).to eq('String')
+        expect(serialized[:value]).to eq("b'\\x80\\xff'")
+      end
+
+      it 'continues checking other custom serializers after exception' do
+        # Register a custom serializer with a condition that raises an exception
+        described_class.register(condition: lambda { |value| value =~ /first/ }) do |serializer, value, name:, depth:|
+          serializer.serialize_value('first serializer')
+        end
+
+        # Register another custom serializer that should work
+        described_class.register(condition: lambda { |value| String === value && value.encoding == Encoding::UTF_8 && !value.valid_encoding? }) do |serializer, value, name:, depth:|
+          {type: 'String', value: 'second serializer'}
+        end
+
+        invalid_utf8 = "\x80\xFF".force_encoding(Encoding::UTF_8)
+
+        # Expect logging and telemetry for the first (failing) serializer
+        expect(Datadog.logger).to receive(:warn).with(/Custom serializer condition failed: ArgumentError/)
+        expect(telemetry).to receive(:report).with(
+          an_instance_of(ArgumentError),
+          description: "Custom serializer condition failed"
+        )
+
+        serialized = serializer.serialize_value(invalid_utf8)
+
+        # Should skip the first (failing) serializer and use the second one
+        expect(serialized[:type]).to eq('String')
+        expect(serialized[:value]).to eq('second serializer')
+      end
+    end
   end
 
   context 'when serialization raises an exception' do
@@ -583,4 +649,424 @@ def self.define_cases(cases)
       define_serialize_value_cases(cases)
     end
   end
+
+  describe 'binary data serialization' do
+    context 'with high bytes' do
+      # Create a shorter string with high bytes to avoid truncation
+      let(:binary_string) do
+        "\x80\x90\xa0\xb0\xc0\xd0\xe0\xf0\xff".b
+      end
+
+      it 'escapes binary data to JSON-safe format' do
+        # Serialize the binary string
+        serialized = serializer.serialize_value(binary_string)
+
+        # The serializer produces an escaped string in b'...' format
+        expect(serialized[:type]).to eq('String')
+        expect(serialized[:value]).to eq("b'\\x80\\x90\\xa0\\xb0\\xc0\\xd0\\xe0\\xf0\\xff'")
+        expect(serialized[:value].encoding).to eq(Encoding::UTF_8)
+      end
+    end
+
+    context 'in nested structures' do
+      let(:binary_string) { "\x80\x81\x82\xFF\xFE".b }
+
+      it 'escapes binary strings in vars' do
+        # Simulate a more realistic snapshot with binary data in locals
+        vars = {binary_data: binary_string, normal_string: "hello"}
+        serialized = serializer.serialize_vars(vars)
+
+        # Binary data is escaped
+        expect(serialized[:binary_data][:type]).to eq('String')
+        expect(serialized[:binary_data][:value]).to eq("b'\\x80\\x81\\x82\\xff\\xfe'")
+        expect(serialized[:binary_data][:value].encoding).to eq(Encoding::UTF_8)
+
+        # Normal string is unchanged
+        expect(serialized[:normal_string][:type]).to eq('String')
+        expect(serialized[:normal_string][:value]).to eq('hello')
+        expect(serialized[:normal_string][:value].encoding).to eq(Encoding::UTF_8)
+      end
+    end
+
+    context 'in method arguments' do
+      let(:binary_string) { "\x00\x01\x02\xFF".b }
+
+      it 'escapes binary strings in args' do
+        # Simulate method arguments containing binary data
+        args = [binary_string, "normal arg"]
+        kwargs = {data: binary_string}
+        target_self = Object.new
+
+        serialized = serializer.serialize_args(args, kwargs, target_self)
+
+        # Binary data is escaped
+        expect(serialized[:arg1][:type]).to eq('String')
+        expect(serialized[:arg1][:value]).to eq("b'\\x00\\x01\\x02\\xff'")
+        expect(serialized[:arg1][:value].encoding).to eq(Encoding::UTF_8)
+
+        # Normal arg is unchanged
+        expect(serialized[:arg2][:type]).to eq('String')
+        expect(serialized[:arg2][:value]).to eq('normal arg')
+        expect(serialized[:arg2][:value].encoding).to eq(Encoding::UTF_8)
+      end
+    end
+
+    context 'with mixed printable and binary data' do
+      let(:binary_string) { "Hello\x00World\xFF".b }
+
+      it 'escapes non-printable bytes while preserving printable ASCII' do
+        serialized = serializer.serialize_value(binary_string)
+
+        expect(serialized[:value]).to eq("b'Hello\\x00World\\xff'")
+        expect(serialized[:value].encoding).to eq(Encoding::UTF_8)
+      end
+    end
+
+    context 'with special escape sequences' do
+      let(:binary_string) { "Line1\nLine2\tTab\rReturn".b }
+
+      it 'uses standard escape sequences' do
+        serialized = serializer.serialize_value(binary_string)
+
+        expect(serialized[:value]).to eq("b'Line1\\nLine2\\tTab\\rReturn'")
+        expect(serialized[:value].encoding).to eq(Encoding::UTF_8)
+      end
+    end
+
+    context 'with quotes and backslashes' do
+      let(:binary_string) { "It's a\\test".b }
+
+      it 'escapes quotes and backslashes' do
+        serialized = serializer.serialize_value(binary_string)
+
+        expect(serialized[:value]).to eq("b'It\\'s a\\\\test'")
+        expect(serialized[:value].encoding).to eq(Encoding::UTF_8)
+      end
+    end
+
+    context 'truncation behavior' do
+      # Truncation is applied to the ORIGINAL binary data (in bytes) before escaping.
+      # This is efficient - we only escape what we need rather than escaping a large
+      # binary string and then throwing away most of the work.
+      #
+      # The size field reports the original binary data length in bytes.
+
+      context 'when binary data is under the limit' do
+        let(:binary_string) { "\xFF".b * 5 }
+
+        before do
+          allow(di_settings).to receive(:max_capture_string_length).and_return(10)
+        end
+
+        it 'does not truncate and escapes all bytes' do
+          serialized = serializer.serialize_value(binary_string)
+
+          # 5 bytes < 10 limit, no truncation
+          # Escaped: b'\xff\xff\xff\xff\xff' = 2 + 5*4 + 1 = 23 chars
+          expect(serialized[:value]).to eq("b'\\xff\\xff\\xff\\xff\\xff'")
+          expect(serialized[:truncated]).to be_falsey
+          expect(serialized[:size]).to be_nil
+        end
+      end
+
+      context 'when binary data is at the exact limit' do
+        let(:binary_string) { "\x00".b * 10 }
+
+        before do
+          allow(di_settings).to receive(:max_capture_string_length).and_return(10)
+        end
+
+        it 'does not truncate and escapes all bytes' do
+          serialized = serializer.serialize_value(binary_string)
+
+          # 10 bytes == 10 limit, no truncation
+          # Escaped: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' = 2 + 10*4 + 1 = 43 chars
+          expect(serialized[:value]).to eq("b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'")
+          expect(serialized[:truncated]).to be_falsey
+          expect(serialized[:size]).to be_nil
+        end
+      end
+
+      context 'when binary data exceeds the limit' do
+        let(:binary_string) { "\xFF".b * 20 }
+
+        before do
+          allow(di_settings).to receive(:max_capture_string_length).and_return(10)
+        end
+
+        it 'truncates original binary to limit then escapes' do
+          serialized = serializer.serialize_value(binary_string)
+
+          # 20 bytes > 10 limit, truncate to first 10 bytes
+          # Escaped: b'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff' = 2 + 10*4 + 1 = 43 chars
+          expect(serialized[:value]).to eq("b'\\xff\\xff\\xff\\xff\\xff\\xff\\xff\\xff\\xff\\xff'")
+          expect(serialized[:truncated]).to be true
+          expect(serialized[:size]).to eq(20) # Original size, not escaped size
+        end
+      end
+
+      context 'with very large binary data' do
+        let(:binary_string) { "\x80".b * 1000 }
+
+        before do
+          allow(di_settings).to receive(:max_capture_string_length).and_return(10)
+        end
+
+        it 'efficiently truncates before escaping' do
+          serialized = serializer.serialize_value(binary_string)
+
+          # 1000 bytes > 10 limit, truncate to first 10 bytes then escape
+          # This is efficient: we escape 10 bytes, not 1000 bytes
+          expect(serialized[:value]).to eq("b'\\x80\\x80\\x80\\x80\\x80\\x80\\x80\\x80\\x80\\x80'")
+          expect(serialized[:truncated]).to be true
+          expect(serialized[:size]).to eq(1000)
+        end
+      end
+
+      context 'with mixed printable and non-printable bytes' do
+        let(:binary_string) { "Hello\x00\x01\x02World\xFF".b }
+
+        before do
+          # 14 bytes total: Hello(5) + \x00\x01\x02(3) + World(5) + \xFF(1)
+          allow(di_settings).to receive(:max_capture_string_length).and_return(8)
+        end
+
+        it 'truncates to byte limit before escaping' do
+          serialized = serializer.serialize_value(binary_string)
+
+          # 14 bytes > 8 limit, truncate to first 8 bytes: "Hello\x00\x01\x02"
+          # Escaped: b'Hello\x00\x01\x02' = 2 + 5 + 4 + 4 + 4 + 1 = 20 chars
+          expect(serialized[:value]).to eq("b'Hello\\x00\\x01\\x02'")
+          expect(serialized[:truncated]).to be true
+          expect(serialized[:size]).to eq(14)
+        end
+      end
+
+      context 'size field reporting' do
+        it 'reports original binary byte count, not escaped string length' do
+          binary_string = "\xFF".b * 50
+          allow(di_settings).to receive(:max_capture_string_length).and_return(10)
+
+          serialized = serializer.serialize_value(binary_string)
+
+          # Original: 50 bytes
+          # Truncated to: 10 bytes
+          # Escaped result would be 43 characters, but size reports original bytes
+          expect(serialized[:size]).to eq(50) # Not 43
+          expect(serialized[:truncated]).to be true
+        end
+      end
+    end
+
+    context 'with printable ASCII in binary string' do
+      # Printable ASCII in binary strings is preserved during escaping
+      let(:binary_string) { "Hello World! This is a test.".b }
+
+      before do
+        # 28 bytes total, limit to 20 bytes
+        allow(di_settings).to receive(:max_capture_string_length).and_return(20)
+      end
+
+      it 'truncates to byte limit before escaping' do
+        serialized = serializer.serialize_value(binary_string)
+
+        # Original: 28 bytes
+        # Truncate to first 20 bytes: "Hello World! This is"
+        # Escape: b'Hello World! This is' = 23 chars
+        expect(serialized[:value]).to eq("b'Hello World! This is'")
+        expect(serialized[:truncated]).to be true
+        expect(serialized[:size]).to eq(28) # Original byte count
+      end
+    end
+
+    context 'regular UTF-8 string truncation' do
+      # Verify that regular (non-binary) strings use character-based truncation
+      it 'truncates based on character count for UTF-8 strings' do
+        # 15 character string (no escaping needed)
+        utf8_string = "Hello, World!!!"
+        allow(di_settings).to receive(:max_capture_string_length).and_return(10)
+
+        serialized = serializer.serialize_value(utf8_string)
+
+        # Should truncate at 10 characters (not bytes)
+        expect(serialized[:value]).to eq("Hello, Wor")
+        expect(serialized[:truncated]).to be true
+        expect(serialized[:size]).to eq(15)
+      end
+
+      it 'handles multi-byte UTF-8 characters correctly' do
+        # String with emoji: "Hello 👋 World" = 13 characters (emoji is 1 char)
+        utf8_string = "Hello 👋 World"
+        allow(di_settings).to receive(:max_capture_string_length).and_return(8)
+
+        serialized = serializer.serialize_value(utf8_string)
+
+        # Should truncate at 8 characters: "Hello 👋 " (includes the space)
+        expect(serialized[:value]).to eq("Hello 👋 ")
+        expect(serialized[:truncated]).to be true
+        expect(serialized[:size]).to eq(13)
+      end
+
+      it 'does not escape valid UTF-8 strings' do
+        utf8_string = "Hello"
+        allow(di_settings).to receive(:max_capture_string_length).and_return(100)
+
+        serialized = serializer.serialize_value(utf8_string)
+
+        # Should not have b'...' wrapping
+        expect(serialized[:value]).to eq("Hello")
+        expect(serialized[:truncated]).to be_falsey
+      end
+    end
+
+    context 'with invalid UTF-8 strings' do
+      it 'escapes strings marked as UTF-8 but with invalid byte sequences' do
+        # String marked as UTF-8 but containing invalid bytes
+        # This commonly happens when binary data is incorrectly tagged
+        invalid_utf8 = "\x80\xFF".force_encoding(Encoding::UTF_8)
+        expect(invalid_utf8.valid_encoding?).to be false
+
+        result = serializer.serialize_value(invalid_utf8)
+
+        # Should escape like binary data
+        expect(result[:type]).to eq('String')
+        expect(result[:value]).to eq("b'\\x80\\xff'")
+        expect(result[:value].encoding).to eq(Encoding::UTF_8)
+
+        # Should be JSON-serializable
+        expect {
+          JSON.dump(result)
+        }.not_to raise_error
+      end
+
+      it 'escapes strings with mixed valid and invalid UTF-8 sequences' do
+        # Valid UTF-8 text followed by invalid bytes
+        invalid_utf8 = "Hello\x80World\xFF".force_encoding(Encoding::UTF_8)
+        expect(invalid_utf8.valid_encoding?).to be false
+
+        result = serializer.serialize_value(invalid_utf8)
+
+        expect(result[:type]).to eq('String')
+        expect(result[:value]).to eq("b'Hello\\x80World\\xff'")
+        expect(result[:value].encoding).to eq(Encoding::UTF_8)
+      end
+    end
+
+    context 'with non-UTF8, non-Binary encodings' do
+      it 'handles Latin1 (ISO-8859-1) strings with high-bit characters' do
+        # Latin1 "é" (0xE9) - valid Latin1, not valid UTF-8 byte sequence
+        latin1 = "\xE9".force_encoding(Encoding::ISO_8859_1)
+        expect(latin1.encoding).to eq(Encoding::ISO_8859_1)
+        expect(latin1.valid_encoding?).to be true # Valid Latin1
+
+        result = serializer.serialize_value(latin1)
+
+        # Should NOT escape (it's a valid encoding, not binary)
+        # JSON.dump will transcode Latin1 to UTF-8 automatically
+        expect(result[:type]).to eq('String')
+        expect(result[:value]).not_to start_with("b'") # Not escaped
+        expect(result[:value].encoding).to eq(Encoding::ISO_8859_1) # Preserved
+
+        # Should be JSON-serializable (Ruby will transcode)
+        expect {
+          JSON.dump(result)
+        }.not_to raise_error
+      end
+
+      it 'handles Latin1 strings with all high-bit bytes' do
+        # All Latin1 high-bit characters (128-255)
+        latin1 = (128..255).map { |i| i.chr(Encoding::ISO_8859_1) }.join
+        expect(latin1.encoding).to eq(Encoding::ISO_8859_1)
+        expect(latin1.valid_encoding?).to be true
+
+        result = serializer.serialize_value(latin1)
+
+        # Should NOT escape - it's a valid encoding
+        expect(result[:type]).to eq('String')
+        expect(result[:value]).not_to start_with("b'")
+        expect(result[:value].encoding).to eq(Encoding::ISO_8859_1)
+
+        # Should be JSON-serializable
+        expect {
+          json = JSON.dump(result)
+          parsed = JSON.parse(json)
+          # JSON transcodes to UTF-8
+          expect(parsed['value'].encoding).to eq(Encoding::UTF_8)
+        }.not_to raise_error
+      end
+
+      it 'handles Windows-1252 encoding' do
+        # Windows-1252 specific character: € (0x80)
+        windows1252 = "\x80".force_encoding(Encoding::Windows_1252)
+        expect(windows1252.valid_encoding?).to be true
+
+        result = serializer.serialize_value(windows1252)
+
+        # Should NOT escape - it's a valid encoding
+        expect(result[:type]).to eq('String')
+        expect(result[:value]).not_to start_with("b'")
+        expect(result[:value].encoding).to eq(Encoding::Windows_1252)
+
+        # Should be JSON-serializable
+        expect {
+          JSON.dump(result)
+        }.not_to raise_error
+      end
+
+      it 'truncates Latin1 strings based on character length' do
+        # 20 character Latin1 string with high bits
+        latin1 = "\xE9" * 20 # "é" repeated 20 times
+        latin1 = latin1.force_encoding(Encoding::ISO_8859_1)
+        allow(di_settings).to receive(:max_capture_string_length).and_return(10)
+
+        result = serializer.serialize_value(latin1)
+
+        # Should truncate at 10 characters (not bytes, not escape)
+        expect(result[:value].length).to eq(10)
+        expect(result[:value].encoding).to eq(Encoding::ISO_8859_1)
+        expect(result[:truncated]).to be true
+        expect(result[:size]).to eq(20)
+      end
+    end
+
+    context 'with empty binary string' do
+      let(:binary_string) { "".b }
+
+      it 'produces empty escaped string' do
+        serialized = serializer.serialize_value(binary_string)
+
+        expect(serialized[:type]).to eq('String')
+        expect(serialized[:value]).to eq("b''")
+        expect(serialized[:value].encoding).to eq(Encoding::UTF_8)
+        expect(serialized).not_to have_key(:truncated)
+      end
+    end
+
+    context 'with very large binary string' do
+      let(:binary_string) { ("\xFF" * 100_000).b }
+
+      it 'truncates to max_capture_string_length' do
+        # Default max is 100 in the test helper
+        serialized = serializer.serialize_value(binary_string)
+
+        # Truncate to first 100 bytes of binary, then escape
+        # Escaped result: b' + 100*\xff + ' = 2 + 400 + 1 = 403 chars
+        expect(serialized[:value].length).to eq(403)
+        expect(serialized[:value]).to start_with("b'\\xff")
+        expect(serialized[:value]).to end_with("'")
+        expect(serialized[:truncated]).to be true
+
+        # Size field reports original binary byte count
+        expect(serialized[:size]).to eq(100_000)
+      end
+
+      it 'is JSON-serializable despite large size' do
+        serialized = serializer.serialize_value(binary_string)
+
+        expect {
+          JSON.dump(serialized)
+        }.not_to raise_error
+      end
+    end
+  end
 end
diff --git a/spec/datadog/di/transport/input_spec.rb b/spec/datadog/di/transport/input_spec.rb
index 1fa0054ac22..e03f8a63f6d 100644
--- a/spec/datadog/di/transport/input_spec.rb
+++ b/spec/datadog/di/transport/input_spec.rb
@@ -17,11 +17,109 @@
   end
 
   let(:logger) do
-    instance_double(Logger)
+    instance_double(Logger, debug: nil)
   end
 
   let(:tags) { [] }
 
+  context 'when snapshot contains escaped binary data' do
+    context 'with all 256 byte values' do
+      # Create a string containing all possible byte values (0x00-0xFF)
+      # This simulates capturing a binary buffer in dynamic instrumentation
+      let(:binary_string) do
+        (0..255).map { |i| i.chr(Encoding::BINARY) }.join.force_encoding(Encoding::BINARY)
+      end
+
+      # Simulate what the serializer produces after escaping binary data
+      let(:escaped_binary) do
+        result = +"b'"
+        binary_string.each_byte do |byte|
+          result << case byte
+          when 0x09 then '\\t'
+          when 0x0A then '\\n'
+          when 0x0D then '\\r'
+          when 0x27 then "\\'"
+          when 0x5C then '\\\\'
+          when 0x20..0x7E then byte.chr
+          else format('\\x%02x', byte)
+          end
+        end
+        result << "'"
+        result.force_encoding(Encoding::UTF_8)
+      end
+
+      let(:snapshot) do
+        {
+          'id' => 'test-snapshot',
+          'timestamp' => Time.now.to_i,
+          'captures' => {
+            'locals' => {
+              'binary_data' => escaped_binary
+            }
+          }
+        }
+      end
+
+      it 'has all 256 unique bytes in original' do
+        expect(binary_string.bytes.uniq.sort).to eq((0..255).to_a)
+        expect(binary_string.encoding).to eq(Encoding::BINARY)
+      end
+
+      it 'successfully serializes escaped binary through transport layer' do
+        # Escaped binary format is JSON-safe
+        expect {
+          transport.send_input([snapshot], tags)
+        }.not_to raise_error
+      end
+
+      it 'produces valid JSON' do
+        json_output = JSON.dump(snapshot)
+        expect(json_output).to be_a(String)
+        expect(json_output.encoding).to eq(Encoding::UTF_8)
+
+        # Can round-trip through JSON
+        parsed = JSON.parse(json_output)
+        expect(parsed['captures']['locals']['binary_data']).to eq(escaped_binary)
+      end
+    end
+
+    context 'with binary string that is invalid UTF-8' do
+      # Create a string with bytes that are invalid UTF-8 sequences
+      let(:binary_string) { "\x80\x81\x82\xFF\xFE".b }
+
+      # After escaping binary data
+      let(:escaped_binary) { "b'\\x80\\x81\\x82\\xff\\xfe'" }
+
+      let(:snapshot) do
+        {
+          'id' => 'test-snapshot',
+          'captures' => {
+            'locals' => {
+              'binary_data' => escaped_binary
+            }
+          }
+        }
+      end
+
+      before do
+        # Assert the original is indeed invalid UTF-8
+        utf8_attempt = binary_string.dup.force_encoding(Encoding::UTF_8)
+        expect(utf8_attempt.valid_encoding?).to be false
+      end
+
+      it 'successfully serializes escaped binary string' do
+        expect {
+          transport.send_input([snapshot], tags)
+        }.not_to raise_error
+      end
+
+      it 'escaped binary is valid UTF-8' do
+        expect(escaped_binary.encoding).to eq(Encoding::UTF_8)
+        expect(escaped_binary.valid_encoding?).to be true
+      end
+    end
+  end
+
   context 'when the combined size of snapshots serialized exceeds intake max' do
     before do
       # Reduce limits to make the test run faster and not require a lot of memory