Bug Report: EmergencyReparentShard silently promotes one side in a split-brain, leaving errant GTIDs #20199

@timvaillancourt

Overview of the Issue

When EmergencyReparentShard (ERS) runs on a GTID-based shard where two or more leading candidates have incomparable Combined (received relay log + executed) GTID positions, no upfront check detects the split-brain. Positions are incomparable when, for example, two tablets each hold GTIDs under their own server UUIDs from independent writes, so that neither set is a superset of the other.

The existing secondary check in findMostAdvanced() (go/vt/vtctl/reparentutil/emergency_reparenter.go) runs only after errant-GTID detection has already had a chance to filter tablets, and in this specific case it doesn't fire before promotion. ERS simply picks one of the diverged candidates as the new primary and proceeds. 😱

From outside this looks like a successful ERS — vtctldclient returns success, the new primary takes traffic, no warning is logged. But:

  1. The losing side's unique GTIDs are now errant relative to the new primary
  2. Any transactions only on the losing side are silently lost
  3. The now-errant tablets can't replicate from the new primary without manual operator intervention: either vtctldclient ChangeTabletType ... drained and a re-clone, or a manual RESET MASTER (RESET BINARY LOGS AND GTIDS on MySQL 8.4) after careful inspection

This is a direct conflict with the safety contracts encoded in CLAUDE.md / AGENTS.md (ERS section):

  • "ERS must prioritize certainty that we picked the most-advanced candidate"
  • "ERS must error when the most-advanced candidate is not clear, and/or a split-brain is suspected"
  • "ERS must avoid introducing errant GTIDs on replicas"

On the surface the ERS looks like a successful recovery from the outage, but it is a data-integrity incident that surfaces later via lag alerts, downstream consistency checks, or (worst case) end-user-visible inconsistency.
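To make "incomparable" concrete, here is a minimal sketch of the kind of upfront check this report is asking for. It is not Vitess's implementation: `gtidSet`, `atLeast`, `incomparablePairs`, `checkNoSplitBrain`, and the tablet aliases are all hypothetical names, and the GTID model is deliberately simplified (one contiguous range starting at 1 per server UUID, no gaps).

```go
package main

import (
	"fmt"
	"sort"
)

// gtidSet is a simplified stand-in for a MySQL GTID set (an assumption for
// this sketch): it maps a server UUID to the highest transaction number
// seen for it, assuming each UUID's range is contiguous from 1.
type gtidSet map[string]int64

// atLeast reports whether a contains every GTID in b, i.e. a is a superset of b.
func atLeast(a, b gtidSet) bool {
	for uuid, n := range b {
		if a[uuid] < n { // a missing UUID reads as 0
			return false
		}
	}
	return true
}

// incomparablePairs returns every pair of tablets whose positions are
// mutually incomparable: neither one is a superset of the other.
func incomparablePairs(positions map[string]gtidSet) [][2]string {
	aliases := make([]string, 0, len(positions))
	for alias := range positions {
		aliases = append(aliases, alias)
	}
	sort.Strings(aliases) // deterministic ordering of the result

	var pairs [][2]string
	for i := 0; i < len(aliases); i++ {
		for j := i + 1; j < len(aliases); j++ {
			a, b := positions[aliases[i]], positions[aliases[j]]
			if !atLeast(a, b) && !atLeast(b, a) {
				pairs = append(pairs, [2]string{aliases[i], aliases[j]})
			}
		}
	}
	return pairs
}

// checkNoSplitBrain returns an error naming the diverged tablets, if any.
func checkNoSplitBrain(positions map[string]gtidSet) error {
	if pairs := incomparablePairs(positions); len(pairs) > 0 {
		return fmt.Errorf("split-brain suspected, incomparable positions: %v", pairs)
	}
	return nil
}

func main() {
	// Hypothetical positions: both replicas caught up to the old primary,
	// then each accepted one independent local write.
	positions := map[string]gtidSet{
		"zone1-101": {"uuid-primary": 10, "uuid-a": 1},
		"zone1-102": {"uuid-primary": 10, "uuid-b": 1},
	}
	fmt.Println(checkNoSplitBrain(positions))
}
```

A replica that is merely lagging is a strict subset of the others and stays comparable, so a check along these lines would only reject genuinely diverged candidates. A real check would have to operate on full interval-based GTID sets rather than this simplified model.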

What this looks like from outside

While the bug bites:

  1. vtctldclient EmergencyReparentShard returns 0 — looks healthy
  2. The new primary serves writes immediately — looks healthy
  3. Replica_IO_Running / Replica_SQL_Running (Slave_IO_Running / Slave_SQL_Running on older MySQL versions) on the now-errant tablets stay Yes at first, until replication tries to apply something the new primary can't ship. The divergence eventually surfaces as errant GTIDs in SHOW REPLICA STATUS output or VTOrc analysis, hours or days later
  4. There is no log line at ERS time naming the diverged tablets, so post-incident triage has nothing to grep for

Reproduction Steps

  1. Set up a 3-tablet shard (1 primary + 2 replicas) with --durability_policy=semi_sync and GTID-based replication (default on MySQL 5.6+ / 8.0 / 8.4)

  2. Detach both replicas from the primary and make them writable (simulates a partition that lets each side accept writes independently):

    STOP REPLICA;
    RESET REPLICA ALL;
    SET GLOBAL read_only = OFF;

  3. Write to each detached tablet independently. Each INSERT generates a GTID under that tablet's own server UUID, producing two-sided GTID divergence:

    -- on replica A:
    INSERT INTO vt_insert_test(id, msg) VALUES (90002, 'side A');
    
    -- on replica B:
    INSERT INTO vt_insert_test(id, msg) VALUES (90003, 'side B');

  4. Kill the original primary tablet

  5. Run ERS:

    vtctldclient EmergencyReparentShard ks/0 --wait-replicas-timeout=30s

  6. Observe: ERS exits with 0 and picks one of the divergent replicas as the new primary. The losing side's INSERT survives only on its own tablet, where the GTID is now errant relative to the new primary. There is no error, no warning, and no log line naming the diverged sides
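To confirm the divergence after step 6, the gtid_executed values of the new primary and the losing replica can be compared. The sketch below does the set subtraction client-side, the same idea as MySQL's GTID_SUBTRACT() function. The helper names (`parseInterval`, `parseGTIDSet`, `errantGTIDs`) and the shortened UUIDs are hypothetical, and the parser is an assumption-laden simplification: it expands every interval into individual transaction numbers and does not handle tagged GTIDs, so it suits small test sets only.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseInterval parses "5" or "3-7" into inclusive bounds.
func parseInterval(iv string) (lo, hi int64, err error) {
	start, end, isRange := strings.Cut(iv, "-")
	if lo, err = strconv.ParseInt(start, 10, 64); err != nil {
		return 0, 0, err
	}
	hi = lo
	if isRange {
		if hi, err = strconv.ParseInt(end, 10, 64); err != nil {
			return 0, 0, err
		}
	}
	return lo, hi, nil
}

// parseGTIDSet parses a GTID set such as "uuid1:1-10,uuid2:1:3-4" into a map
// from server UUID to the transaction numbers it covers.
func parseGTIDSet(s string) (map[string]map[int64]bool, error) {
	out := map[string]map[int64]bool{}
	for _, part := range strings.Split(s, ",") {
		part = strings.TrimSpace(part) // gtid_executed may contain newlines
		if part == "" {
			continue
		}
		fields := strings.Split(part, ":")
		if len(fields) < 2 {
			return nil, fmt.Errorf("malformed GTID entry %q", part)
		}
		uuid := fields[0]
		if out[uuid] == nil {
			out[uuid] = map[int64]bool{}
		}
		for _, iv := range fields[1:] {
			lo, hi, err := parseInterval(iv)
			if err != nil {
				return nil, fmt.Errorf("malformed interval %q: %w", iv, err)
			}
			for n := lo; n <= hi; n++ {
				out[uuid][n] = true
			}
		}
	}
	return out, nil
}

// errantGTIDs returns the GTIDs present on the replica but absent from the
// primary, as sorted "uuid:n" strings.
func errantGTIDs(replica, primary string) ([]string, error) {
	r, err := parseGTIDSet(replica)
	if err != nil {
		return nil, err
	}
	p, err := parseGTIDSet(primary)
	if err != nil {
		return nil, err
	}
	var errant []string
	for uuid, txns := range r {
		for n := range txns {
			if !p[uuid][n] { // indexing a missing UUID yields false
				errant = append(errant, fmt.Sprintf("%s:%d", uuid, n))
			}
		}
	}
	sort.Strings(errant)
	return errant, nil
}

func main() {
	// Hypothetical gtid_executed values after the reproduction (UUIDs shortened).
	newPrimary := "aaaa:1-10,bbbb:1"
	losingSide := "aaaa:1-10,cccc:1"
	errant, err := errantGTIDs(losingSide, newPrimary)
	if err != nil {
		panic(err)
	}
	fmt.Println("errant on losing side:", errant)
}
```

In a split-brain the subtraction is non-empty in both directions, which is what distinguishes it from a replica that is simply behind.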

Binary Version

Affects all Vitess versions on `main` as of 2026-05-27.
The risk has existed for as long as `findMostAdvanced()` has run its `AtLeast`
check only after errant-GTID filtering, i.e. with no upfront uniformity
check on the leading-Combined group.

Operating System and Environment details

Not environment-specific.
GTID-based MySQL replication (MySQL 5.6+ / 8.0 / 8.4 / Percona 8.0+).

Log Fragments

The defining symptom is the absence of any log line at ERS time naming the divergence. A reproduction in a 4-tablet local cluster produces vtctld output that is indistinguishable from a healthy ERS: Validate, ShardReplicationPositions, StopReplicationAndGetStatus, PromoteReplica, PopulateReparentJournal, and SetReplicationSource all succeed. The errant-GTID signal only surfaces later, via the now-poisoned tablets' replication status.

N/A — the bug is the absence of an upfront log/error when split-brain is present.

Related

Your thoughts are appreciated 🙏
