3 changes: 3 additions & 0 deletions .gitignore
@@ -78,3 +78,6 @@ report
/errors/
go/flags/endtoend/count_flags.sh
/coverage.out

# .claude
/.claude/worktrees
8 changes: 8 additions & 0 deletions CLAUDE.md
@@ -237,7 +237,15 @@ return user.NeedsMigration() && migrate(user) || user

### EmergencyReparentShard (ERS)
- ERS must prioritize **certainty** that we picked the most-advanced candidate
- ERS must error when the most-advanced candidate is not clear, and/or a split-brain is suspected
- ERS must avoid introducing errant GTIDs on replicas. This includes writes that are considered unacknowledged to the client as MySQL cannot rewind GTIDs of any kind
- Changes should prioritize reducing points of failure - avoid new RPCs or work that may delay or make ERS more brittle
- ERS must error if a shard contains a mix of GTID-based and non-GTID-based replication. Their position semantics differ (`Combined` = retrieved+executed for GTID vs. executed-only for non-GTID), so a unified split-brain / most-advanced check across both is unsafe
- For non-GTID flavors, ERS must wait on every candidate and fail on any error. The "filter to leading group + short-circuit on first success" optimization is only safe for GTID-based flavors, where `Combined` is distinct from the executed position
- During the stop-replication phase, a single error from the known PRIMARY tablet is expected and tolerated (we are abandoning a dead primary). On any other partial failure, `haveRevoked` must return true before ERS proceeds, guaranteeing no further writes can be accepted by any reachable tablet
- `NewPrimaryAlias` is not a bypass for safety checks. An explicitly-requested primary must still pass every guard: no errant GTIDs, at least as advanced as the winning position, no `MustNot` promotion rule, in-cell if `PreventCrossCellPromotion`, and able to establish forward progress with reachable tablets
- The promoted primary must have completed a relay-log apply wait. If errant-GTID detection eliminates every tablet that completed the first wait, ERS must re-wait on the surviving candidates before promotion; otherwise we risk promoting a tablet with received-but-unapplied transactions
- Any new pipeline step that stops replication on a tablet must add that tablet to `replicasToRestart`, so the deferred cleanup can recover it if ERS aborts. The code can't enforce this, so review carefully
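
The last rule can be sketched as follows. This is a minimal illustration, not the ERS implementation: `tabletClient`, `fakeClient`, and `runStep` are hypothetical names, and only the `replicasToRestart` bookkeeping mirrors the rule above. The tablet is recorded before the stop call, on the assumption that a failed stop may still have partially taken effect.

```go
package main

import (
	"errors"
	"fmt"
)

// tabletClient is a hypothetical stand-in for the tablet RPCs a step would use.
type tabletClient interface {
	StopReplication(alias string) error
	StartReplication(alias string) error
}

// fakeClient records which tablets were restarted so the sketch can be run offline.
type fakeClient struct{ restarted []string }

func (f *fakeClient) StopReplication(alias string) error { return nil }

func (f *fakeClient) StartReplication(alias string) error {
	f.restarted = append(f.restarted, alias)
	return nil
}

// runStep stops replication on each tablet and records it in replicasToRestart
// first, so the deferred cleanup can recover every touched tablet on abort.
func runStep(c tabletClient, aliases []string, abort bool) (err error) {
	var replicasToRestart []string
	defer func() {
		if err == nil {
			return
		}
		for _, alias := range replicasToRestart {
			// Best effort: a restart failure must not mask the original error.
			_ = c.StartReplication(alias)
		}
	}()

	for _, alias := range aliases {
		replicasToRestart = append(replicasToRestart, alias)
		if err := c.StopReplication(alias); err != nil {
			return err
		}
	}
	if abort {
		return errors.New("aborting ERS")
	}
	return nil
}

func main() {
	c := &fakeClient{}
	err := runStep(c, []string{"zone1-101", "zone1-102"}, true)
	fmt.Println(err, c.restarted)
}
```

A step that forgets the `append` still compiles and passes the happy path, which is why the rule asks for careful review.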

## :mag: Debugging & Troubleshooting

25 changes: 25 additions & 0 deletions changelog/25.0/25.0.0/summary.md
@@ -16,6 +16,8 @@
- [Consolidator Reject on Waiter Cap](#vttablet-consolidator-reject-on-cap)
- **[VTTablet](#minor-changes-vttablet)**
- [Schema engine table-count limit is now configurable](#vttablet-schema-max-table-count)
- **[VTCtld](#minor-changes-vtctld)**
- [EmergencyReparentShard tolerates partial relay-log-apply failures](#vtctld-ers-partial-relay-log)

## <a id="major-changes"/>Major Changes</a>

@@ -91,3 +93,26 @@ Two changes:
Tablets that already have more tracked schema objects than the configured limit will reload fine; only new creations are gated. Operators who need to support more tables and views should increase the flag and ensure both vttablet and mysqld have enough memory to comfortably hold the larger schema.

See [#19978](https://github.com/vitessio/vitess/issues/19978) for details.

### <a id="minor-changes-vtctld"/>VTCtld</a>

#### <a id="vtctld-ers-partial-relay-log"/>EmergencyReparentShard tolerates partial relay-log-apply failures</a>

`EmergencyReparentShard` (ERS) on GTID-based shards no longer fails when only some replicas can apply their relay logs. As long as at least one tablet at the leading `Combined` GTID position applies successfully, ERS proceeds; lagging or stuck-SQL-thread replicas are no longer blockers. The previous behavior is preserved for non-GTID flavors (FilePos, MariaDB), where ERS still requires every candidate to apply.

When the leading GTID-based candidates have incomparable `Combined` positions (suspected split-brain), ERS now aborts upfront with a clear `FAILED_PRECONDITION` error naming the diverged tablets, rather than silently picking one side. Previously, ERS would pick blindly and let the losing side's unique GTIDs become errant on those tablets: a silent data-integrity incident that surfaced later via lag alerts or downstream consistency checks. See [#20199](https://github.com/vitessio/vitess/issues/20199) for the bug this addresses.
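
"Incomparable" here means that neither position is a superset of the other. A minimal sketch of that check, assuming a deliberately simplified GTID model (one contiguous interval per source UUID); `gtidSet`, `contains`, and `incomparable` are illustrative names, not Vitess APIs:

```go
package main

import "fmt"

// gtidSet is a simplified stand-in for a MySQL GTID set: it maps a source
// server UUID to the highest transaction number seen from that source, assuming
// each source's interval is contiguous from 1.
type gtidSet map[string]uint64

// contains reports whether a includes every transaction in b.
func contains(a, b gtidSet) bool {
	for uuid, n := range b {
		if a[uuid] < n {
			return false
		}
	}
	return true
}

// incomparable reports whether neither set is a superset of the other, which is
// the condition treated above as a suspected split-brain.
func incomparable(a, b gtidSet) bool {
	return !contains(a, b) && !contains(b, a)
}

func main() {
	sideA := gtidSet{"primary": 100, "replica-a": 1}
	sideB := gtidSet{"primary": 100, "replica-b": 1}
	lagger := gtidSet{"primary": 90}

	fmt.Println(incomparable(sideA, sideB))  // true: each side has a write the other lacks
	fmt.Println(incomparable(sideA, lagger)) // false: the lagger is strictly behind
}
```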

A new `--allow-split-brain-promotion` flag is added to `vtctldclient EmergencyReparentShard` (and `--allow_split_brain_promotion` on the legacy `vtctl`). It is **off by default**. Operators who deliberately need to force ERS through a detected split-brain (typically because they already know which side to keep and plan to re-clone the losing side) can set it to convert the abort into a `WARN` log and proceed. The non-promoted side's unique GTIDs will become errant after promotion, so this is an explicit operator override, not a default-on safety knob.

Two limitations to be aware of when using the override:

1. **Specify `--new-primary` for deterministic side selection.** In a symmetric split-brain (incomparable `Combined` positions, neither side dominating the other), the sort comparator returns false in both directions and Go's sort cannot establish a total order, so the winning side ends up depending on map iteration order. Pair `--allow-split-brain-promotion` with `--new-primary` if you have a preferred side; the flag teaches `findMostAdvanced` to honour `--new-primary` even when `AtLeast` would otherwise reject it.
2. **A lagging tablet in the candidate pool can still be picked over the diverged leaders.** If the leading group is mutually-errant AND a non-leading (lagged) tablet exists in `validCandidates`, errant-GTID detection removes both leaders, leaves the lagger, and ERS promotes the lagger, losing both sides' unique writes. The flag's restoration logic only fires when *every* candidate is pruned; the mixed-survivor case is unchanged from the previous behavior. Ensure the candidate pool is constrained to the diverged sides (via `--ignore-replicas`) when forcing through.
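
The first limitation follows from sorting with a comparator that is only a partial order. The sketch below is an illustration under assumptions, not the ERS code: `position`, `candidate`, `dominates`, and `winner` are hypothetical names, and `sort.SliceStable` is used only to make the demonstration repeatable (the sort call ERS actually uses is not shown here). When neither side dominates, the comparator says "not less" both ways, so the result simply echoes the input order, and an input order derived from map iteration is random.

```go
package main

import (
	"fmt"
	"sort"
)

// position is a simplified GTID set: source UUID -> highest transaction number.
type position map[string]uint64

type candidate struct {
	alias string
	pos   position
}

// contains reports whether a includes every transaction in b.
func contains(a, b position) bool {
	for uuid, n := range b {
		if a[uuid] < n {
			return false
		}
	}
	return true
}

// dominates reports whether a has everything in b plus at least one more transaction.
func dominates(a, b position) bool {
	return contains(a, b) && !contains(b, a)
}

// winner sorts a copy of cands most-advanced-first and returns the front alias.
// It assumes cands is non-empty.
func winner(cands []candidate) string {
	sorted := append([]candidate(nil), cands...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return dominates(sorted[i].pos, sorted[j].pos)
	})
	return sorted[0].alias
}

func main() {
	a := candidate{"side-a", position{"primary": 100, "uuid-a": 1}}
	b := candidate{"side-b", position{"primary": 100, "uuid-b": 1}}

	// Same candidates, different input order, different winner.
	fmt.Println(winner([]candidate{a, b})) // side-a
	fmt.Println(winner([]candidate{b, a})) // side-b
}
```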

Three new stats are exported for observability:

- `EmergencyReparentFilteredCandidates`: counts replicas excluded from the relay-log wait because their `Combined` position is strictly behind the leading group.
- `EmergencyReparentRelayLogFailedCandidates`: counts replicas that genuinely failed to apply relay logs (cancellations after a peer succeeded are not counted).
- `EmergencyReparentSplitBrainOverrides`: counts split-brain detections bypassed by `--allow-split-brain-promotion`, once per detection (a single ERS run may increment it twice if the errant-GTID re-wait pass also hits incomparable positions). Stays at zero unless an operator has deliberately invoked the escape hatch.
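
A minimal sketch of reading one of these counters from a vtctld `/debug/vars` JSON payload. The `keyspace.shard` key format is taken from the end-to-end test in this PR, which checks `EmergencyReparentFilteredCandidates` that way; whether the other two stats use the same key format is an assumption. `counterFor` is a hypothetical helper and the sample payload is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// counterFor extracts the per-shard value of a map-valued stat from a
// /debug/vars JSON body. The "keyspace.shard" key format is an assumption
// based on the e2e test for EmergencyReparentFilteredCandidates.
func counterFor(body []byte, stat, keyspace, shard string) (float64, error) {
	var vars map[string]json.RawMessage
	if err := json.Unmarshal(body, &vars); err != nil {
		return 0, fmt.Errorf("decoding debug vars: %w", err)
	}
	raw, ok := vars[stat]
	if !ok {
		return 0, fmt.Errorf("stat %q not found", stat)
	}
	var perShard map[string]float64
	if err := json.Unmarshal(raw, &perShard); err != nil {
		return 0, fmt.Errorf("stat %q is not a map of counters: %w", stat, err)
	}
	key := keyspace + "." + shard
	count, ok := perShard[key]
	if !ok {
		return 0, fmt.Errorf("stat %q has no entry for %q", stat, key)
	}
	return count, nil
}

func main() {
	// Illustrative payload only; a real vtctld exports many more variables.
	sample := []byte(`{"EmergencyReparentFilteredCandidates": {"ks.0": 1}}`)
	n, err := counterFor(sample, "EmergencyReparentFilteredCandidates", "ks", "0")
	fmt.Println(n, err)
}
```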

See [#18707](https://github.com/vitessio/vitess/pull/18707) for details.
3 changes: 3 additions & 0 deletions go/cmd/vtctldclient/command/reparents.go
@@ -96,6 +96,7 @@ var emergencyReparentShardOptions = struct {
IgnoreReplicaAliasStrList []string
PreventCrossCellPromotion bool
WaitForAllTablets bool
AllowSplitBrainPromotion bool
}{}

func commandEmergencyReparentShard(cmd *cobra.Command, args []string) error {
@@ -144,6 +145,7 @@ func commandEmergencyReparentShard(cmd *cobra.Command, args []string) error {
WaitReplicasTimeout: protoutil.DurationToProto(emergencyReparentShardOptions.WaitReplicasTimeout),
PreventCrossCellPromotion: emergencyReparentShardOptions.PreventCrossCellPromotion,
WaitForAllTablets: emergencyReparentShardOptions.WaitForAllTablets,
AllowSplitBrainPromotion: emergencyReparentShardOptions.AllowSplitBrainPromotion,
})
if err != nil {
return err
@@ -309,6 +311,7 @@ func init() {
EmergencyReparentShard.Flags().StringVar(&emergencyReparentShardOptions.ExpectedPrimaryAliasStr, "expected-primary", "", "Alias of a tablet that must be the current primary in order for the reparent to be processed.")
EmergencyReparentShard.Flags().BoolVar(&emergencyReparentShardOptions.PreventCrossCellPromotion, "prevent-cross-cell-promotion", false, "Only promotes a new primary from the same cell as the previous primary.")
EmergencyReparentShard.Flags().BoolVar(&emergencyReparentShardOptions.WaitForAllTablets, "wait-for-all-tablets", false, "Should ERS wait for all the tablets to respond. Useful when all the tablets are reachable.")
EmergencyReparentShard.Flags().BoolVar(&emergencyReparentShardOptions.AllowSplitBrainPromotion, "allow-split-brain-promotion", false, "Allow ERS to proceed when two leading candidates have incomparable Combined GTID positions (suspected split-brain). Off by default. Operator escape hatch: accepts that the losing side's unique GTIDs will become errant.")
EmergencyReparentShard.Flags().StringSliceVarP(&emergencyReparentShardOptions.IgnoreReplicaAliasStrList, "ignore-replicas", "i", nil, "Comma-separated, repeated list of replica tablet aliases to ignore during the emergency reparent.")
Root.AddCommand(EmergencyReparentShard)

128 changes: 120 additions & 8 deletions go/test/endtoend/reparent/emergencyreparent/ers_test.go
@@ -17,7 +17,10 @@ limitations under the License.
package emergencyreparent

import (
"encoding/json"
"fmt"
"io"
"net/http"
"os/exec"
"sync"
"testing"
@@ -29,6 +32,7 @@ import (
"vitess.io/vitess/go/mysql"
"vitess.io/vitess/go/test/endtoend/cluster"
"vitess.io/vitess/go/test/endtoend/reparent/utils"
e2eutils "vitess.io/vitess/go/test/endtoend/utils"
"vitess.io/vitess/go/vt/log"
"vitess.io/vitess/go/vt/vtctl/reparentutil/policy"

@@ -629,9 +633,123 @@ func TestERSFailFast(t *testing.T) {
}
}

// TestERSFiltersNonMostAdvancedCandidates verifies that ERS filters out tablets whose
// Combined (relay log) position is behind the most advanced group. This is done by stopping
// the IO thread on one replica, writing data through the primary, then triggering ERS.
// The replica with the stopped IO thread will have a lower Combined position and should be
// filtered out. We verify this by checking the EmergencyReparentFilteredCandidates stat.
func TestERSFiltersNonMostAdvancedCandidates(t *testing.T) {
// The EmergencyReparentFilteredCandidates stat was added in v25 along with the
// partial relay-log-apply filter. Skip on older vtctld where neither exists.
e2eutils.SkipIfBinaryIsBelowVersion(t, 25, "vtctld")

clusterInstance := utils.SetupReparentCluster(t, policy.DurabilitySemiSync)
defer utils.TeardownCluster(clusterInstance)
tablets := clusterInstance.Keyspaces[0].Shards[0].Vttablets

ctx := t.Context()

utils.ConfirmReplication(t, tablets[0], tablets[1:])

// Stop the IO thread on tablets[3] so it stops receiving binlogs.
utils.RunSQL(ctx, t, "STOP REPLICA IO_THREAD", tablets[3])

// Write data that tablets[3] won't receive (IO thread is stopped).
utils.ConfirmReplication(t, tablets[0], []*cluster.Vttablet{tablets[1], tablets[2]})

// Kill the primary so ERS is needed.
utils.StopTablet(t, tablets[0], true)

// Run ERS; tablets[3] should be filtered out as non-most-advanced.
out, err := utils.Ers(clusterInstance, nil, "60s", "30s")
require.NoError(t, err, out)

// Verify the EmergencyReparentFilteredCandidates stat was incremented.
resp, err := http.Get(clusterInstance.VtctldProcess.VerifyURL)
require.NoError(t, err)
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
require.NoError(t, err)
var vars map[string]any
require.NoError(t, json.Unmarshal(body, &vars))
filteredCandidates, ok := vars["EmergencyReparentFilteredCandidates"]
require.True(t, ok, "EmergencyReparentFilteredCandidates stat not found in vtctld debug/vars")
filteredMap, ok := filteredCandidates.(map[string]any)
require.True(t, ok, "EmergencyReparentFilteredCandidates is not a map")
key := fmt.Sprintf("%s.%s", utils.KeyspaceName, utils.ShardName)
count, ok := filteredMap[key]
require.True(t, ok, "EmergencyReparentFilteredCandidates does not contain key %s", key)
require.Greater(t, count, float64(0), "expected at least 1 filtered candidate")

newPrimary := utils.GetNewPrimary(t, clusterInstance)
require.NotEqual(t, newPrimary.Alias, tablets[0].Alias, "old primary should not be the new primary")
require.NotEqual(t, newPrimary.Alias, tablets[3].Alias, "lagging tablet should not be the new primary")
}

// TestERSSplitBrainDetection verifies that ERS aborts upfront with a clear
// "suspected split-brain" error when two leading candidates have incomparable
// Combined GTID positions. We simulate a real split-brain by detaching two
// replicas from the primary and writing to each independently. Each INSERT
// generates a GTID under that replica's own server UUID, so neither tablet's
// Combined set is a superset of the other's. The pairwise dominance filter
// keeps both tablets, the uniformCombined check fires, and ERS aborts before
// risking promotion of either diverged side.
func TestERSSplitBrainDetection(t *testing.T) {
// The upfront "suspected split-brain" abort was added in v25. Older vtctld either
// promotes one side silently or surfaces a different error message, so this test
// is only meaningful against v25+.
e2eutils.SkipIfBinaryIsBelowVersion(t, 25, "vtctld")

clusterInstance := utils.SetupReparentCluster(t, policy.DurabilitySemiSync)
defer utils.TeardownCluster(clusterInstance)
tablets := clusterInstance.Keyspaces[0].Shards[0].Vttablets

ctx := t.Context()

// Baseline: confirm replication is healthy before we break it.
utils.ConfirmReplication(t, tablets[0], tablets[1:])

// Detach tablets[2] and tablets[3] from replication and make them writable.
// Each subsequent INSERT will generate a GTID under that tablet's own
// server UUID, producing two-sided GTID divergence (split-brain).
detachAndMakeWritable := []string{
"STOP REPLICA",
"RESET REPLICA ALL",
"SET GLOBAL read_only = OFF",
}
utils.RunSQLs(ctx, t, detachAndMakeWritable, tablets[2])
utils.RunSQL(ctx, t,
"INSERT INTO vt_insert_test(id, msg) VALUES (90002, 'split-brain side A')",
tablets[2])

utils.RunSQLs(ctx, t, detachAndMakeWritable, tablets[3])
utils.RunSQL(ctx, t,
"INSERT INTO vt_insert_test(id, msg) VALUES (90003, 'split-brain side B')",
tablets[3])

// Kill the primary so ERS is needed.
utils.StopTablet(t, tablets[0], true)

// ERS must abort with the upfront split-brain error, not silently promote
// one of the diverged sides. The vtctldclient surfaces the RPC error message
// in stdout/stderr, so assert against `out` rather than err (which is just
// "exit status 1" from the command process).
out, err := utils.Ers(clusterInstance, nil, "60s", "30s")
require.Error(t, err, out)
require.Contains(t, out, "suspected split-brain", "ERS output: %s", out)
// describeCombinedPositions names the offending tablets in the error.
require.Contains(t, out, tablets[2].Alias)
require.Contains(t, out, tablets[3].Alias)
}

// TestReplicationStopped checks that ERS tolerates tablets whose SQL thread is stopped,
// as long as at least one leading candidate can still apply its relay logs.
func TestReplicationStopped(t *testing.T) {
// In v25 ERS tolerates partial relay-log-apply failures and succeeds as long as
// one leading candidate applies; this test asserts that new behavior. Older vtctld
// still fails on any single replica error.
e2eutils.SkipIfBinaryIsBelowVersion(t, 25, "vtctld")

clusterInstance := utils.SetupReparentCluster(t, policy.DurabilitySemiSync)
defer utils.TeardownCluster(clusterInstance)
tablets := clusterInstance.Keyspaces[0].Shards[0].Vttablets
@@ -643,14 +761,8 @@ func TestReplicationStopped(t *testing.T) {
require.NoError(t, err)
// Run an additional command in the current primary which will only be acked by tablets[3] and be in its relay log.
insertedVal := utils.ConfirmReplication(t, tablets[0], nil)
- // Failover to tablets[3]
- _, err = utils.Ers(clusterInstance, tablets[3], "60s", "30s")
- require.Error(t, err, "ERS should fail with 2 replicas having replication stopped")
-
- // Start replication back on tablet[1]
- err = clusterInstance.VtctldClientProcess.ExecuteCommand("ExecuteFetchAsDBA", tablets[1].Alias, `START REPLICA;`)
- require.NoError(t, err)
- // Failover to tablets[3] again. This time it should succeed
+ // Failover to tablets[3]. ERS tolerates partial relay log failures (tablets[1] and tablets[2]
+ // will fail to apply relay logs), so this should succeed with tablets[3] as the surviving candidate.
out, err := utils.Ers(clusterInstance, tablets[3], "60s", "30s")
require.NoError(t, err, out)
// Verify that the tablet has the inserted value
21 changes: 17 additions & 4 deletions go/vt/proto/vtctldata/vtctldata.pb.go

Some generated files are not rendered by default.