Fix/eitam/safe data handling #26

eitamring · 2025-09-01T07:17:19Z

Fix MySQL Binlog Parser Stream Corruption

Problem

The MySQL binlog parser was crashing with "index out of range" errors when processing international text containing smart quotes, causing data loss from batch failures.

Root Cause

The replaceUnsupportedCharacters function was modifying data length during binary parsing, breaking stream synchronization:

Original MySQL binary stream:
[10]['Hello'world'][4][next_field]...
 ↑                    ↑
 "read 10 bytes"     next field at position 11

After replaceUnsupportedCharacters:
[8]['Hello'world'][xt_field]...
 ↑                  ↑
 "read 8 bytes"    corrupted read at position 9 (should be 11)

Smart quote ' (3 bytes UTF-8) → ' (1 byte ASCII)
Result: Stream offset wrong by 2 bytes, all subsequent reads fail

Fix

Moved character normalization from binary parsing to post-decoding:

Before (WRONG):

Binary data → Replace quotes → Decode → String output
         ↑ Changes byte count!

After (CORRECT):

Binary data → Decode → String output → Replace quotes
         ↑ Preserves exact bytes

Changes Made

Updated decodeStringWithEncoder - Apply replaceUnsupportedCharacters after decoding, not on raw bytes
Fixed charset detection query - Added VARCHAR/TEXT types that inherit charset from table collation

Result

Processed millions of messages without crashes
Binary stream stays synchronized
Smart quotes handled correctly for all charsets

Key Lesson

Binary protocols require exact byte positioning. Never modify data length during parsing - handle character transformations after extraction.

snyk-io · 2025-09-01T07:17:30Z

🎉 Snyk checks have passed. No issues have been found so far.

✅ security/snyk check is complete. No issues have been found. (View Details)

✅ license/snyk check is complete. No issues have been found. (View Details)

eitamring · 2025-09-01T07:45:08Z

Here's the explanation for the PR about why normalizeSmartQuotes is sufficient:
Why normalizeSmartQuotes is sufficient
The replaceUnsupportedCharacters function was originally designed to:

Extract content from length-prefixed binary data
Call normalizeSmartQuotes on the content
Rebuild the data with a new length prefix

However, when called during binary parsing, step 3 was catastrophic - it changed the byte count, corrupting the stream position.
In our fix, we're already handling the length prefix extraction in decodeStringWithEncoder:

We read the length prefix (1 or 2 bytes)
We extract exactly that many content bytes
We decode those bytes with the appropriate charset decoder

At this point, we have the decoded bytes and just need to handle smart quotes. Since replaceUnsupportedCharacters only adds value through its call to normalizeSmartQuotes, we can call normalizeSmartQuotes directly on the decoded bytes.
The key insight: replaceUnsupportedCharacters was a wrapper function designed for modifying binary data in-place. Once we moved the smart quote handling to post-decoding, the wrapper became unnecessary - we only need its core logic (normalizeSmartQuotes) without the dangerous length prefix manipulation.
This makes the code:

Simpler (no unnecessary wrapper)
Safer (no risk of accidentally calling the length-modifying version)
Clearer (the function name directly describes what it does)

eitamring · 2025-09-01T08:46:13Z

How to recreate
-- Force UTF-8 smart quotes into the table
INSERT INTO test.test_table VALUES
(10, UNHEX('4A6F686EE2809973'), UNHEX('48652073616964202248656C6C6F22'));
-- This is "John's" with actual UTF-8 smart quote

-- Check it:
SELECT id, HEX(name) FROM test.test_table WHERE id = 10;
-- Should show E28099 (3 bytes) not 92 (1 byte)

1,John’s café,He said “Hello”
2,Mary’s place,She replied “Hi”
10,Johnâ€™s,"He said ""Hello"""
25,Mary's résumé,"Quote: ""Testing"""
35,Bob's naïve,"It's ""working"""
41,John's café,"He said ""Hello"""
44,John’s café,He said “Hello”

Before the fix it will fail to parse this kind of data

added safe data location

5de9928

eitamring requested review from aaronabv and sigalikanevsky September 1, 2025 07:19

fix

106c03c

eitamring force-pushed the fix/eitam/safe_data_handling branch from a8be9f6 to 106c03c Compare September 1, 2025 07:43

fix

0e20928

eitam added 3 commits September 1, 2025 13:32

fix

a4c0d3f

fix

6cc38fb

fix

051d6b6

aaronabv approved these changes Sep 1, 2025

View reviewed changes

eitamring merged commit 4fcdc09 into main Sep 2, 2025
2 of 8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix/eitam/safe data handling #26

Fix/eitam/safe data handling #26

Uh oh!

eitamring commented Sep 1, 2025 •

edited

Loading

Uh oh!

snyk-io bot commented Sep 1, 2025 •

edited

Loading

Uh oh!

eitamring commented Sep 1, 2025

Uh oh!

eitamring commented Sep 1, 2025

Uh oh!

Uh oh!

Uh oh!

Fix/eitam/safe data handling #26

Fix/eitam/safe data handling #26

Uh oh!

Conversation

eitamring commented Sep 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix MySQL Binlog Parser Stream Corruption

Problem

Root Cause

Fix

Changes Made

Result

Key Lesson

Uh oh!

snyk-io bot commented Sep 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🎉 Snyk checks have passed. No issues have been found so far.

Uh oh!

eitamring commented Sep 1, 2025

Uh oh!

eitamring commented Sep 1, 2025

Uh oh!

Uh oh!

Uh oh!

eitamring commented Sep 1, 2025 •

edited

Loading

snyk-io bot commented Sep 1, 2025 •

edited

Loading