
docs: Document ERS split-brain detection and partial relay-log tolerance #2128

Draft
promptless[bot] wants to merge 1 commit into prod from promptless/ers-split-brain-detection



@promptless promptless Bot commented May 27, 2026


Documents v25 EmergencyReparentShard changes: the new --allow-split-brain-promotion flag, split-brain detection fail-fast behavior, partial relay-log-apply tolerance for GTID-based shards, and three new observability metrics.


Add documentation for v25 EmergencyReparentShard changes:
- Document new --allow-split-brain-promotion flag
- Add split-brain detection section explaining fail-fast behavior
- Add partial relay-log-apply tolerance section for GTID-based shards
- Document three new metrics for ERS observability
- On the primary-elect tablet, insert a row in the `reparent_journal` table and then update the `PrimaryAlias` property of the global shard object.
- In parallel on each replica, excluding the old primary, set the new primary as the replication source and wait for the inserted row to replicate to the replica tablets.
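The parallel step above can be sketched as a simple fan-out. This is a minimal illustration, not the Vitess implementation: `repointReplicas` and the `setSource` callback are hypothetical names, tablet aliases are plain strings, and the sketch assumes the primary-elect is skipped along with the old primary because it cannot replicate from itself.

```go
package main

import (
	"fmt"
	"sync"
)

// repointReplicas calls setSource concurrently for every tablet except the
// old primary and the primary-elect, then returns the per-replica result.
// Both the function and the callback are hypothetical stand-ins for the
// real tablet manager RPCs.
func repointReplicas(tablets []string, oldPrimary, newPrimary string,
	setSource func(replica, source string) error) map[string]error {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]error)
	)
	for _, t := range tablets {
		if t == oldPrimary || t == newPrimary {
			continue // not a replica of the new primary
		}
		wg.Add(1)
		go func(alias string) {
			defer wg.Done()
			err := setSource(alias, newPrimary)
			mu.Lock()
			results[alias] = err
			mu.Unlock()
		}(t)
	}
	wg.Wait() // block until every replica has been attempted
	return results
}

func main() {
	tablets := []string{"zone1-100", "zone1-101", "zone1-102", "zone1-103"}
	results := repointReplicas(tablets, "zone1-100", "zone1-101",
		func(replica, source string) error { return nil })
	fmt.Println("replicas repointed:", len(results))
}
```

Collecting one result per replica, rather than stopping at the first error, lets the caller decide how many replica failures are tolerable.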

#### Split-brain detection

Added documentation for split-brain detection and the `--allow-split-brain-promotion` flag, based on PR #18707, which introduces upfront split-brain detection in `filterAndCheckUniform()` and the flag as an operator escape hatch.

Source: vitessio/vitess#18707

- Consider using `--ignore-replicas` to exclude tablets on the side you want to discard from the candidate pool.
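To make the idea of a split brain concrete, here is a toy sketch. It assumes that "split brain" means two candidates whose positions have diverged, so that neither contains all of the other's transactions. That reading is an assumption, and the code is not how `filterAndCheckUniform()` is written. `txSet`, `contains`, `diverged`, and `findSplitBrain` are hypothetical names, and a set of integers stands in for a real GTID set.

```go
package main

import (
	"fmt"
	"sort"
)

// txSet is a toy stand-in for a GTID set: the transaction IDs a tablet holds.
type txSet map[int]bool

// contains reports whether a holds every transaction in b.
func contains(a, b txSet) bool {
	for tx := range b {
		if !a[tx] {
			return false
		}
	}
	return true
}

// diverged reports whether each set has a transaction the other lacks.
func diverged(a, b txSet) bool {
	return !contains(a, b) && !contains(b, a)
}

// findSplitBrain returns the first pair of aliases (in sorted order) whose
// positions have diverged, or found=false when all positions are comparable.
func findSplitBrain(positions map[string]txSet) (a, b string, found bool) {
	aliases := make([]string, 0, len(positions))
	for alias := range positions {
		aliases = append(aliases, alias)
	}
	sort.Strings(aliases) // deterministic pair selection
	for i := 0; i < len(aliases); i++ {
		for j := i + 1; j < len(aliases); j++ {
			if diverged(positions[aliases[i]], positions[aliases[j]]) {
				return aliases[i], aliases[j], true
			}
		}
	}
	return "", "", false
}

func main() {
	positions := map[string]txSet{
		"zone1-101": {1: true, 2: true, 3: true},
		"zone1-102": {1: true, 2: true, 4: true},
		"zone1-103": {1: true, 2: true},
	}
	a, b, found := findSplitBrain(positions)
	fmt.Println(a, b, found)
}
```

In this model, passing `--ignore-replicas` for the tablets on one side corresponds to dropping their entries from `positions` before the check runs.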

#### Partial relay-log-apply tolerance


Added documentation for partial relay-log-apply tolerance, based on PR #18707, which implements the `waitForAllRelayLogsToApply()` short-circuit behavior for GTID-based shards.

Source: vitessio/vitess#18707

| `planned_reparent_counts` | Number of times PlannedReparentShard has been run. Available dimensions are keyspace, shard and the result of the operation. |
| `emergency_reparent_counts` | Number of times EmergencyReparentShard has been run. Available dimensions are keyspace, shard and the result of the operation. |
| `reparent_shard_operation_timings` | Timings of reparent shard operations indexed by the type of operation. |
| `EmergencyReparentFilteredCandidates` | Number of candidates excluded from the relay-log wait during ERS because their `Combined` position was behind the leading group. Keyed by keyspace and shard. |
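As a rough illustration of what `EmergencyReparentFilteredCandidates` counts, the sketch below splits candidates into a leading group and the rest. It is a deliberate simplification: positions are modeled as a single integer rather than a GTID set, and `splitLeadingGroup` is a hypothetical helper, not a function from the Vitess codebase.

```go
package main

import (
	"fmt"
	"sort"
)

// splitLeadingGroup returns the aliases sharing the most advanced position
// and the number of candidates that are behind it. Positions are simplified
// to integers here; real positions are GTID sets compared by containment.
func splitLeadingGroup(positions map[string]int64) (leaders []string, filtered int) {
	var lead int64
	first := true
	for _, p := range positions {
		if first || p > lead {
			lead = p
			first = false
		}
	}
	for alias, p := range positions {
		if p == lead {
			leaders = append(leaders, alias)
		} else {
			filtered++ // would be excluded from the relay-log wait
		}
	}
	sort.Strings(leaders) // stable output regardless of map order
	return leaders, filtered
}

func main() {
	leaders, filtered := splitLeadingGroup(map[string]int64{
		"zone1-101": 10,
		"zone1-102": 10,
		"zone1-103": 7,
	})
	fmt.Println(leaders, filtered)
}
```

Under this model, the `filtered` value is what the metric would be incremented by for one ERS run on a given keyspace and shard.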

Added three new metrics (`EmergencyReparentFilteredCandidates`, `EmergencyReparentRelayLogFailedCandidates`, `EmergencyReparentSplitBrainOverrides`), based on PR #18707, where they are defined in `go/vt/vtctl/reparentutil/emergency_reparenter.go`.

Source: vitessio/vitess#18707


netlify Bot commented May 27, 2026

Deploy Preview for vitess ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | d54a7fe |
| 🔍 Latest deploy log | https://app.netlify.com/projects/vitess/deploys/6a1717cef9a4ef000829b3b7 |
| 😎 Deploy Preview | https://deploy-preview-2128--vitess.netlify.app |
