Bug Report: EmergencyReparentShard silently promotes one side in a split-brain, leaving errant GTIDs

### Overview of the Issue

When `EmergencyReparentShard` _(ERS)_ runs on a GTID-based shard with two or more leading candidates whose `Combined` _(received relay log + executed)_ GTID positions are **incomparable** — two tablets each holding GTIDs under their own server UUIDs from independent writes, where neither set is a superset of the other — there is no upfront check that detects the split-brain.

The existing secondary check in `findMostAdvanced()` _(`go/vt/vtctl/reparentutil/emergency_reparenter.go`)_ runs **after** errant-GTID detection has already had a chance to filter tablets, and in this specific case it doesn't fire before promotion. ERS simply picks one of the diverged candidates as the new primary and proceeds 😱
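To make "incomparable" concrete, here is a minimal sketch in Go. It is not the Vitess implementation or the check proposed in the linked PR. `GTIDSet`, `Contains`, and `Incomparable` are hypothetical names, and the model assumes each server UUID's transactions form a gap-free range `uuid:1-N`, which real GTID sets do not guarantee.

```go
package main

import "fmt"

// GTIDSet is a deliberately simplified, hypothetical stand-in for a MySQL
// GTID set: it maps a server UUID to the highest transaction number seen
// from that server and ASSUMES no gaps (i.e. "uuid:1-N").
type GTIDSet map[string]int64

// Contains reports whether every transaction in other is also in s.
// A UUID missing from s reads as 0, so any entry in other exceeds it.
func (s GTIDSet) Contains(other GTIDSet) bool {
	for uuid, n := range other {
		if s[uuid] < n {
			return false
		}
	}
	return true
}

// Incomparable reports whether neither set is a superset of the other,
// which is the split-brain shape described above.
func Incomparable(a, b GTIDSet) bool {
	return !a.Contains(b) && !b.Contains(a)
}

func main() {
	// Both sides share the old primary's history, then each accepted an
	// independent write under its own server UUID.
	sideA := GTIDSet{"uuid-primary": 100, "uuid-a": 1}
	sideB := GTIDSet{"uuid-primary": 100, "uuid-b": 1}
	// A replica that is merely behind is still comparable.
	lagging := GTIDSet{"uuid-primary": 90}

	fmt.Println(Incomparable(sideA, sideB))   // true
	fmt.Println(Incomparable(sideA, lagging)) // false
}
```

A replica that merely lags is a strict subset of the leader and is therefore comparable. Only the two-sided case, where each candidate holds something the other lacks, leaves no candidate that is safely "most advanced".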

From outside this looks like a successful ERS — `vtctldclient` returns success, the new primary takes traffic, no warning is logged. But:

1. The losing side's unique GTIDs are now **errant** relative to the new primary
2. Any transactions only on the losing side are **silently lost**
3. The now-errant tablets can't replicate from the new primary without manual operator intervention _(`vtctldclient ChangeTabletType ... drained` AND a re-clone, OR a manual `RESET MASTER`, which is `RESET BINARY LOGS AND GTIDS` on MySQL 8.4, after careful inspection)_
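The consequences above follow from a set difference: whatever the losing side executed that the new primary never saw becomes errant. A rough sketch under the same simplifying assumption (gap-free `uuid:1-N` ranges) is shown here; `GTIDSet`, `Range`, and `Errant` are hypothetical names for illustration, not Vitess APIs.

```go
package main

import "fmt"

// GTIDSet is a simplified, hypothetical model: server UUID -> highest
// transaction number, ASSUMING a gap-free range "uuid:1-N".
type GTIDSet map[string]int64

// Range is a closed interval of transaction numbers for one server UUID.
type Range struct{ First, Last int64 }

// Errant returns the transactions present on replica but absent from
// primary, i.e. the GTIDs that are errant relative to that primary.
func Errant(replica, primary GTIDSet) map[string]Range {
	errant := make(map[string]Range)
	for uuid, n := range replica {
		if have := primary[uuid]; n > have {
			errant[uuid] = Range{First: have + 1, Last: n}
		}
	}
	return errant
}

func main() {
	newPrimary := GTIDSet{"uuid-primary": 100, "uuid-a": 1} // promoted side
	loser := GTIDSet{"uuid-primary": 100, "uuid-b": 1}      // losing side

	for uuid, r := range Errant(loser, newPrimary) {
		fmt.Printf("errant: %s:%d-%d\n", uuid, r.First, r.Last)
	}
	// prints "errant: uuid-b:1-1"
}
```

On a live MySQL server the equivalent comparison is usually done with the built-in `GTID_SUBTRACT()` function over the two tablets' `gtid_executed` values.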

This is a direct conflict with the safety contracts encoded in `CLAUDE.md` / `AGENTS.md` _(ERS section)_:

- _"ERS must prioritize **certainty** that we picked the most-advanced candidate"_
- _"ERS must error when the most-advanced candidate is not clear, and/or a split-brain is suspected"_
- _"ERS must avoid introducing errant GTIDs on replicas"_

On the surface the "outage" looks like a recovery, but it is a data-integrity incident that surfaces later via lag alerts, downstream consistency checks, or _(worst case)_ end-user-visible inconsistency.

#### What this looks like from outside

While the bug bites:

1. `vtctldclient EmergencyReparentShard` returns `0` — looks healthy
2. The new primary serves writes immediately — looks healthy
3. `Replica_IO_Running` / `Replica_SQL_Running` _(`Slave_IO_Running` / `Slave_SQL_Running` in legacy `SHOW SLAVE STATUS` output)_ on the now-errant tablets stay `Yes` _(at first)_ until replication tries to apply something the new primary can't ship. The problem eventually surfaces as `Errant GTIDs` in `SHOW REPLICA STATUS` output or VTOrc analysis, hours or days later
4. There is **no log line** at ERS time naming the diverged tablets, so post-incident triage has nothing to grep for

### Reproduction Steps

1. Set up a 3-tablet shard _(1 primary + 2 replicas)_ with `--durability_policy=semi_sync` and GTID-based replication _(default on MySQL 5.6+ / 8.0 / 8.4)_

2. Detach both replicas from the primary and make them writable _(simulates a partition that lets each side accept writes independently)_:
   ```sql
   STOP REPLICA;
   RESET REPLICA ALL;
   SET GLOBAL read_only = OFF;
   ```

3. Write to each detached tablet independently — each `INSERT` generates a GTID under that tablet's own server UUID, producing two-sided GTID divergence:
   ```sql
   -- on replica A:
   INSERT INTO vt_insert_test(id, msg) VALUES (90002, 'side A');

   -- on replica B:
   INSERT INTO vt_insert_test(id, msg) VALUES (90003, 'side B');
   ```

4. Kill the original primary tablet

5. Run ERS:
   ```sh
   vtctldclient EmergencyReparentShard ks/0 --wait-replicas-timeout=30s
   ```

6. Observe — ERS exits with `0`, picks one of the divergent replicas as the new primary. The losing side's `INSERT` survives only on its own tablet, where the GTID is now errant relative to the new primary. There is no error, no warning, and no log line naming the diverged sides

### Binary Version

```sh
Affects all Vitess versions on `main` as of 2026-05-27.
The risk has existed for as long as `findMostAdvanced()` has run its `AtLeast`
check only AFTER errant-GTID filtering, with no upfront uniformity check on
the leading-Combined group.
```

### Operating System and Environment details

```sh
Not environment-specific.
GTID-based MySQL replication (MySQL 5.6+ / 8.0 / 8.4 / Percona 8.0+).
```

### Log Fragments

The defining symptom is the **absence** of any log line at ERS time naming the divergence. A reproduction in a 4-tablet local cluster produces vtctld output that is indistinguishable from a healthy ERS — `Validate`, `ShardReplicationPositions`, `StopReplicationAndGetStatus`, `PromoteReplica`, `PopulateReparentJournal`, `SetReplicationSource` — all succeed. The errant-GTID signal only surfaces later, via the now-poisoned tablets' replication status

```sh
N/A — the bug is the absence of an upfront log/error when split-brain is present.
```

---

#### Related

- https://github.com/vitessio/vitess/pull/18707 _(addresses this — adds an upfront `uniformCombined` check on the filtered leading group, an explicit `FAILED_PRECONDITION` abort, and an opt-out `--allow-split-brain-promotion` flag for operators who deliberately need to force ERS through)_
- https://github.com/vitessio/vitess/issues/18528 _(related — broader ERS hardening discussion)_
- https://github.com/vitessio/vitess/issues/18529 _(related — lagging-minority intolerance, addressed in the same PR)_

Your thoughts are appreciated 🙏

