vreplication: fix OOM on tables with large JSON columns (#19878) #869

Draft
pedroalb wants to merge 4 commits into slack-22.0 from slack-22.0-backport-19878

pedroalb commented Jun 8, 2026

Description

Backport of vitessio#19878 to slack-22.0.

Fixes two compounding issues in VReplication when processing tables with large JSON columns:

  1. Unbounded SQL buffer in copy phase: applyBulkInsert built a single INSERT statement containing every row in the batch, with no size limit. With large JSON columns, that statement could exceed max_allowed_packet or exhaust memory.

  2. 56x memory amplification in JSON encoding: vjson.MarshalSQLValue() converted binary JSON into SQL via an intermediate Go object tree. The fix replaces it with vjson.AppendMarshalSQL(), which streams token by token (~8x memory instead of 56x).
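
To illustrate the second issue, here is a minimal sketch of the two encoding styles. It uses the standard library's encoding/json on text JSON rather than MySQL binary JSON, and it does not call vjson; viaTree and appendCompact are hypothetical names chosen for this example. It shows the shape of the change (decode into a tree and re-encode, versus append directly into a caller-supplied buffer), not the memory figures quoted above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// viaTree decodes the document into an intermediate Go object tree and then
// re-encodes it. Every value in the document is allocated as a Go value first.
// Hypothetical helper for illustration only.
func viaTree(raw []byte) ([]byte, error) {
	var v interface{}
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil, err
	}
	return json.Marshal(v)
}

// appendCompact scans raw once and appends the compacted JSON to dst without
// building a tree, in the style of an Append* API.
// Hypothetical helper for illustration only.
func appendCompact(dst []byte, raw []byte) ([]byte, error) {
	buf := bytes.NewBuffer(dst)
	if err := json.Compact(buf, raw); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	raw := []byte(`{"a": [1, 2, 3], "b": "x"}`)

	tree, err := viaTree(raw)
	if err != nil {
		panic(err)
	}
	streamed, err := appendCompact([]byte("value="), raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(tree))
	fmt.Println(string(streamed))
}
```

The append style lets the caller reuse one buffer for a whole statement, which is what makes it possible to track the statement size while rows are being written.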

What changes in VReplication behavior

No semantic change: the same rows are inserted and the same data arrives at the target.

  • Copy phase (vcopier.go): Each copy worker now queries max_allowed_packet and splits bulk INSERTs when the statement buffer exceeds that size. The same batch of rows is sent as multiple smaller INSERT statements instead of one unbounded statement.
  • Bulk insert logic (replicator_plan.go): applyBulkInsert tracks the buffer size as it appends rows. When the buffer exceeds maxQuerySize, it flushes the current statement and starts a new INSERT, accumulating RowsAffected across flushes.
  • Replay phase fix: applyBulkInsertChanges adds a !newStmt guard so that it never flushes on the first row of a new statement.
  • JSON encoding (marshal.go): vjson.AppendMarshalSQL(buf, raw) streams JSON directly into the buffer instead of building an intermediate object tree.
  • vplayer.go: The inline maxAllowedPacket logic is replaced with the shared vr.maxQuerySize(), so the copy and replay paths use one implementation.
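
The splitting behavior described in the list above can be sketched roughly as follows. This is not the code from replicator_plan.go: splitBulkInsert is a hypothetical helper, rows are assumed to be pre-rendered value tuples, and the size check is done before appending a row (an assumption made to keep the example small). The newStmt flag plays the role of the !newStmt guard: a statement is never flushed before it holds at least one row, so a single oversized row is still emitted on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// splitBulkInsert renders rows into one or more INSERT statements, starting a
// new statement whenever adding the next row would push the buffer past
// maxQuerySize bytes. Hypothetical helper for illustration only.
func splitBulkInsert(prefix string, rows []string, maxQuerySize int) []string {
	var stmts []string
	var buf strings.Builder
	buf.WriteString(prefix)
	newStmt := true // true while the current statement has no rows yet

	for _, row := range rows {
		// The +1 accounts for the comma that separates rows.
		if !newStmt && buf.Len()+1+len(row) > maxQuerySize {
			stmts = append(stmts, buf.String())
			buf.Reset()
			buf.WriteString(prefix)
			newStmt = true
		}
		if !newStmt {
			buf.WriteByte(',')
		}
		buf.WriteString(row)
		newStmt = false
	}
	// Flush the final statement if it holds at least one row.
	if !newStmt {
		stmts = append(stmts, buf.String())
	}
	return stmts
}

func main() {
	rows := []string{"(1,'a')", "(2,'b')", "(3,'c')"}
	for _, s := range splitBulkInsert("insert into t(id,v) values ", rows, 45) {
		fmt.Println(len(s), s)
	}
}
```

A caller would execute each returned statement in turn and sum the affected-row counts, which corresponds to accumulating RowsAffected across flushes.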

Why we need this

We need this backport to safely perform reshards on keyspaces with tables containing large JSON columns. Without this fix, VDiff and VReplication can cause OOM or severe memory/CPU pressure on target primaries during the comparison and copy phases.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Test Plan

  1. Deploy this branch to staging vttablets
  2. Upgrade vttest3 keyspace tablets to use this branch
  3. Create a test table with multiple JSON columns, similar to the hermes schema:
    CREATE TABLE json_reshard_test (
      id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      steps JSON NOT NULL,
      execution_data JSON NOT NULL,
      ancestor_ids JSON NOT NULL DEFAULT ('[]'),
      PRIMARY KEY (id)
    );
  4. Populate it with large JSON values of similar size to hermes (steps and execution_data columns holding 1-5 MB JSON documents)
  5. Start a reshard on vttest3 and verify:
    • VReplication copy phase completes without OOM
    • vttablet memory usage stays bounded during copy
    • No ERS triggered on target primaries
    • Data arrives correctly on target shards
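
For step 4, one possible way to produce documents of the required size is sketched below. makeJSONDoc is a hypothetical helper, and the field names step and payload are placeholders, not the hermes schema. The sketch only builds the JSON bytes; inserting them into json_reshard_test is left to whatever client the test uses.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// makeJSONDoc returns a JSON array whose encoded size is at least minBytes.
// Each element carries a 1 KiB padding string, so the element count alone
// guarantees the lower bound. Hypothetical helper for illustration only.
func makeJSONDoc(minBytes int) ([]byte, error) {
	const chunk = 1024
	pad := strings.Repeat("x", chunk)
	n := minBytes/chunk + 1

	items := make([]map[string]interface{}, 0, n)
	for i := 0; i < n; i++ {
		items = append(items, map[string]interface{}{
			"step":    i,
			"payload": pad,
		})
	}
	return json.Marshal(items)
}

func main() {
	// Roughly 1 MB, the low end of the range named in the test plan.
	doc, err := makeJSONDoc(1 << 20)
	if err != nil {
		panic(err)
	}
	fmt.Println("bytes:", len(doc), "valid:", json.Valid(doc))
}
```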

Deployment Notes

Deploy with the next vttablet build from slack-22.0. No migrations required.

AI Disclosure

Cherry-pick and conflict resolution performed with Claude Code assistance.

pedroalb requested a review from a team as a code owner June 8, 2026 15:13

salesforce-cla Bot commented Jun 8, 2026

Thanks for the contribution! Before we can merge this, we need @nickvanw to sign the Salesforce Inc. Contributor License Agreement.

github-actions Bot added this to the v22.0.4 milestone Jun 8, 2026
pedroalb closed this Jun 8, 2026
pedroalb reopened this Jun 8, 2026
pedroalb closed this Jun 8, 2026
pedroalb reopened this Jun 8, 2026
pedroalb and others added 2 commits June 8, 2026 18:07
Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Co-authored-by: Arthur Schreiber <arthur@planetscale.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pedroalb force-pushed the slack-22.0-backport-19878 branch from 9191813 to c0689d2 June 8, 2026 16:07
ejortegau previously approved these changes Jun 9, 2026
The cherry-pick of vitessio#19878 added maxQuerySize() calls in the copy phase.
The tabletmanager test framework uses a strict mock DB client that didn't
expect this query, causing test failures.

Take the upstream refactor instead of keeping our inline version.
One implementation in maxQuerySize(), used by both vplayer and vcopier.
pedroalb marked this pull request as draft June 11, 2026 07:21

2 participants