reparentutil: ERS panic recovery, PRS buffer-pool tiebreaking, and cleanup#20205
Conversation
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #20205 +/- ##
===========================================
+ Coverage 69.67% 91.26% +21.59%
===========================================
Files 1614 9 -1605
Lines 216793 1339 -215454
===========================================
- Hits 151044 1222 -149822
+ Misses 65749 117 -65632
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Promptless prepared a documentation update related to this change. Triggered by PR #20205 Added a changelog entry documenting the PRS buffer-pool tiebreaking behavior change. Tablets missing Review: Add changelog entry for PRS buffer-pool tiebreaking behavior change |
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
mattlord
left a comment
There was a problem hiding this comment.
LGTM. There's a residual risk here in that the buffer-pool fallback behavior is subtle and currently has limited direct regression coverage for missing/empty/non-numeric status values and the “skip warmth tiebreaking unless every valid tablet has data” path. I’d prefer a targeted unit test there, but I don’t see a code issue that should block the PR.
- verifyAllTabletsReachable: assert missing, empty, and non-numeric Innodb_buffer_pool_pages_data values are all omitted from the returned map - ElectNewPrimary: partial buffer-pool data skips tiebreaking and falls through to alias; an ineligible candidate without data does NOT disable tiebreaking for the rest (gate is on validTablets, not candidates) Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Description
Implements the four fixes flagged in #20204. Each is small but they share a theme — error handling and resource management around ERS/PRS — so I bundled them rather than splitting into 4 x trivial PRs
1. ERS panic recovery (
emergency_reparenter.go)reparentReplicas()wraps both the per-replicahandleReplicaand thereplWg.Wait()waiter goroutine inrecover(). A panic inhandleReplicais now recorded into the existingAllErrorRecorderinstead of taking down the surroundingvtctld/vtctldserverprocess. A panic in the waiter (e.g. areplWg.Done()over-count) no longer skipsallReplicasDoneCancel()/replCancel()— those are nowdefer-ed before the recoverThe pre-existing
defer replWg.Done()already guaranteedWait()would return during ahandleReplicapanic, but the process was still going down before this change2. PRS buffer-pool tiebreaking (
planned_reparenter.go+util.go)verifyAllTabletsReachable()no longer swallows thestrconv.Atoierror. A tablet whoseInnodb_buffer_pool_pages_datais missing (e.g. MariaDB, which doesn't expose this status variable) is omitted from the buffer-pool tiebreaking map — the tablet itself still participates in PRS normally, just without a buffer-pool warmth score. A present-but-empty value is treated the same as a missing key (to avoid log spam on variants that return the row with no value); a present-but-non-numeric value is omitted with a warning logElectNewPrimary()only uses buffer-pool data for tiebreaking when every valid tablet has a value — gated onvalidTablets(post-RPC filtering) rather thancandidates, so a backup-taking or excessively-lagged candidate doesn't disable tiebreaking for the rest. If any valid tablet is missing data,innodbBufferPoolis left nil and the sorter's existinglen(...) != 0check falls back to durability ordering — same all-or-nothing semantics it has always had3. Slice preallocation
3 x hot paths with known upper bounds:
filterValidCandidates:restrictedValidTablets/notPreferredValidTabletscapped atlen(validTablets)ElectNewPrimary:validTablets/tabletPositionscapped atlen(candidates)and moved out of thevar()block to where the bound is knownstopReplicationAndBuildStatusMaps:reachableTabletscapped atlen(tabletMap)One test case in
emergency_reparenter_test.goupdated to match the resultingnil→ empty-slice change for an empty filter result4.
waitForCatchUpcleanupCollapses the trailing
if err != nil { return err }; return nilto a singlereturn tmc.WaitForPosition(...)Why not backport
These are incremental improvements, not regressions. The exposure for #1 is real but bounded —
handleReplicapanics aren't observed in practice and the existingdefer replWg.Done()already handledreplWgaccountingRelated Issue(s)
Resolves: #20204
Related: #19849
Related: #19888
Checklist
Deployment Notes
No new flags or configuration. Behaviour change in #2 is limited to PRS tiebreaking semantics — PRS still runs against all tablet variants. In a mixed cluster where some tablets expose
Innodb_buffer_pool_pages_dataand some don't, PRS now falls back to durability tiebreaking instead of using0as a real value for the tablets that don't expose it. In practice this only flips when there's a tie on GTID position (rare in a steady-state shard). In a pure-MariaDB cluster the outcome is unchanged — pre-fix, all tablets tied at0and fell through; post-fix, buffer-pool tiebreaking is skipped and falls through to the same next criterionAI Disclosure
Claude Code assisted with the implementation, testing, and PR summary. I committed the change manually after reviewing each step