Skip to content

Conversation

eitamring
Copy link

@eitamring eitamring commented Sep 1, 2025

Fix MySQL Binlog Parser Stream Corruption

Problem

The MySQL binlog parser was crashing with "index out of range" errors when processing international text containing smart quotes, causing data loss from batch failures.

Root Cause

The replaceUnsupportedCharacters function was modifying data length during binary parsing, breaking stream synchronization:

Original MySQL binary stream:
[10]['Hello'world'][4][next_field]...
 ↑                    ↑
 "read 10 bytes"     next field at position 11

After replaceUnsupportedCharacters:
[8]['Hello'world'][xt_field]...
 ↑                  ↑
 "read 8 bytes"    corrupted read at position 9 (should be 11)

Smart quote ' (3 bytes UTF-8) → ' (1 byte ASCII)
Result: Stream offset wrong by 2 bytes, all subsequent reads fail

Fix

Moved character normalization from binary parsing to post-decoding:

Before (WRONG):

Binary data → Replace quotes → Decode → String output
         ↑ Changes byte count!

After (CORRECT):

Binary data → Decode → String output → Replace quotes
         ↑ Preserves exact bytes

Changes Made

  1. Updated decodeStringWithEncoder - Apply replaceUnsupportedCharacters after decoding, not on raw bytes
  2. Fixed charset detection query - Added VARCHAR/TEXT types that inherit charset from table collation

Result

  • Processed millions of messages without crashes
  • Binary stream stays synchronized
  • Smart quotes handled correctly for all charsets

Key Lesson

Binary protocols require exact byte positioning. Never modify data length during parsing - handle character transformations after extraction.

Copy link

snyk-io bot commented Sep 1, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@eitamring eitamring force-pushed the fix/eitam/safe_data_handling branch from a8be9f6 to 106c03c Compare September 1, 2025 07:43
@eitamring
Copy link
Author

Here's the explanation for the PR about why normalizeSmartQuotes is sufficient:
Why normalizeSmartQuotes is sufficient
The replaceUnsupportedCharacters function was originally designed to:

Extract content from length-prefixed binary data
Call normalizeSmartQuotes on the content
Rebuild the data with a new length prefix

However, when called during binary parsing, step 3 was catastrophic - it changed the byte count, corrupting the stream position.
In our fix, we're already handling the length prefix extraction in decodeStringWithEncoder:

We read the length prefix (1 or 2 bytes)
We extract exactly that many content bytes
We decode those bytes with the appropriate charset decoder

At this point, we have the decoded bytes and just need to handle smart quotes. Since replaceUnsupportedCharacters only adds value through its call to normalizeSmartQuotes, we can call normalizeSmartQuotes directly on the decoded bytes.
The key insight: replaceUnsupportedCharacters was a wrapper function designed for modifying binary data in-place. Once we moved the smart quote handling to post-decoding, the wrapper became unnecessary - we only need its core logic (normalizeSmartQuotes) without the dangerous length prefix manipulation.
This makes the code:

Simpler (no unnecessary wrapper)
Safer (no risk of accidentally calling the length-modifying version)
Clearer (the function name directly describes what it does)

@eitamring
Copy link
Author

How to recreate
-- Force UTF-8 smart quotes into the table
INSERT INTO test.test_table VALUES
(10, UNHEX('4A6F686EE2809973'), UNHEX('48652073616964202248656C6C6F22'));
-- This is "John's" with actual UTF-8 smart quote

-- Check it:
SELECT id, HEX(name) FROM test.test_table WHERE id = 10;
-- Should show E28099 (3 bytes) not 92 (1 byte)

1,John’s café,He said “Hello”
2,Mary’s place,She replied “Hi”
10,John’s,"He said ""Hello"""
25,Mary's résumé,"Quote: ""Testing"""
35,Bob's naïve,"It's ""working"""
41,John's café,"He said ""Hello"""
44,John’s café,He said “Hello”

Before the fix it will fail to parse this kind of data

@eitamring eitamring merged commit 4fcdc09 into main Sep 2, 2025
2 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants