
[release-23.0] VTOrc: fix ReplicationStopped + PrimarySemiSyncBlocked recovery deadlock (#19925) #19982

Merged
timvaillancourt merged 3 commits into release-23.0 from backport-19925-to-release-23.0 on Apr 30, 2026


@vitess-bot vitess-bot Bot commented Apr 29, 2026

Description

This is a backport of #19925

Copilot AI review requested due to automatic review settings April 29, 2026 09:13
@vitess-bot vitess-bot Bot added Type: Bug Backport This is a backport Component: VTOrc Vitess Orchestrator integration Skip CI Skip CI actions from running Merge Conflict labels Apr 29, 2026
@vitess-bot vitess-bot Bot review requested due to automatic review settings April 29, 2026 09:13

vitess-bot Bot commented Apr 29, 2026

Hello @timvaillancourt, there are conflicts in this backport.

Please address them in order to merge this Pull Request. You can execute the snippet below to reset your branch and resolve the conflict manually.

Make sure you replace origin with the name of your vitessio/vitess remote.

git fetch --all
gh pr checkout 19982
git reset --hard origin/release-23.0
git cherry-pick -m 1 9ba3f8e9f3b7858ad81e223664604609ee1d6866

@github-actions github-actions Bot added this to the v23.0.4 milestone Apr 29, 2026
Adapt the cherry-pick to release-23.0 where AnalyzedInstanceAlias is
still a string (not *topodatapb.TabletAlias):

- analysis_dao.go: keep primaryAlias as string while taking the new
  shardWideAnalysisCode/shardWideProblem fields.
- topology_recovery.go: adapt shardWideRecoveryIgnoredTablets to return
  []string; take the renamed alreadyFixed flow in recheckPrimaryHealth.
- topology_recovery_test.go: drop unused testutil import, use string
  aliases in TestShardWideRecoveryIgnoredTablets, and use "zon1" cell
  in the new TestRecheckPrimaryHealth case to match the existing
  hardcoded alias in the test loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings April 29, 2026 14:06
@timvaillancourt timvaillancourt removed Skip CI Skip CI actions from running Merge Conflict labels Apr 29, 2026
@timvaillancourt timvaillancourt marked this pull request as ready for review April 29, 2026 14:06
@timvaillancourt timvaillancourt enabled auto-merge (squash) April 29, 2026 14:07
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Copilot AI left a comment


Pull request overview

Backport to release-23.0 of the VTOrc fix for a recovery-ordering deadlock between ReplicationStopped replica recovery and shard-wide PrimarySemiSyncBlocked recovery. The fix allows dependency-driven ordering and prevents over-suppression of analyses.

Changes:

  • Make ReplicationStopped declare an explicit ordering dependency (BeforeAnalyses: [PrimarySemiSyncBlocked]) so replicas are fixed before the shard-wide semi-sync unblock path.
  • Update GetDetectionAnalysis to continue matching after a shard-wide action is detected and to preserve (or promote) dependent analyses via BeforeAnalyses/AfterAnalyses.
  • Adjust shard-wide pre-recovery refresh ignore logic and add unit + e2e regression tests to cover the deadlock scenario and ordering behavior.
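
As a rough illustration of the dependency-driven ordering described in the changes above, the Go sketch below orders detected problems so that a problem naming another in BeforeAnalyses (or named in another's AfterAnalyses) is recovered first. The `problem` type and the `contains`, `mustRunBefore`, and `orderProblems` helpers are hypothetical names invented for this example; only the BeforeAnalyses/AfterAnalyses field names and the two analysis codes come from the PR, and this is not the VTOrc implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// problem is a hypothetical stand-in for a VTOrc analysis definition.
type problem struct {
	Code           string
	BeforeAnalyses []string // codes this problem must be recovered before
	AfterAnalyses  []string // codes this problem must be recovered after
}

// contains reports whether s is present in list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// mustRunBefore reports whether a has to be recovered before b, either
// because a lists b in BeforeAnalyses or because b lists a in AfterAnalyses.
func mustRunBefore(a, b problem) bool {
	return contains(a.BeforeAnalyses, b.Code) || contains(b.AfterAnalyses, a.Code)
}

// orderProblems returns the problems in an order that honors every declared
// dependency, keeping input order among problems with no dependency between
// them. It returns an error if the declarations form a cycle.
func orderProblems(in []problem) ([]problem, error) {
	n := len(in)
	blockers := make([]int, n)     // count of unresolved problems that must run first
	dependents := make([][]int, n) // indexes of problems waiting on index i
	for i := range in {
		for j := range in {
			if i != j && mustRunBefore(in[i], in[j]) {
				dependents[i] = append(dependents[i], j)
				blockers[j]++
			}
		}
	}
	out := make([]problem, 0, n)
	done := make([]bool, n)
	for len(out) < n {
		picked := false
		for i := 0; i < n; i++ {
			if done[i] || blockers[i] > 0 {
				continue
			}
			done[i], picked = true, true
			out = append(out, in[i])
			for _, j := range dependents[i] {
				blockers[j]--
			}
			break // rescan from the start so input order is preserved
		}
		if !picked {
			return nil, errors.New("cyclic ordering dependency")
		}
	}
	return out, nil
}

func main() {
	detected := []problem{
		{Code: "PrimarySemiSyncBlocked"},
		{Code: "ReplicationStopped", BeforeAnalyses: []string{"PrimarySemiSyncBlocked"}},
	}
	ordered, err := orderProblems(detected)
	if err != nil {
		panic(err)
	}
	for _, p := range ordered {
		fmt.Println(p.Code)
	}
}
```

In this sketch the replica problem is emitted ahead of the shard-wide one even though it was detected second, which is the ordering the PR relies on to avoid the deadlock.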

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| go/vt/vtorc/logic/topology_recovery.go | Adds shard-wide refresh ignore helper; clarifies recheckPrimaryHealth; documents suppression interaction with checkIfAlreadyFixed. |
| go/vt/vtorc/logic/topology_recovery_test.go | Extends recheckPrimaryHealth coverage; adds tests for shard-wide refresh ignore behavior; updates ordering expectations. |
| go/vt/vtorc/inst/analysis_problem.go | Adds BeforeAnalyses dependency for ReplicationStopped vs PrimarySemiSyncBlocked. |
| go/vt/vtorc/inst/analysis_problem_test.go | Adds unit coverage for ordered-execution requirement + dependency comparison involving ReplicationStopped. |
| go/vt/vtorc/inst/analysis_dao.go | Implements dependency-aware suppression bypass when a shard-wide action is present (including promotion of dependent non-chosen problems). |
| go/vt/vtorc/inst/analysis_dao_test.go | Adds tests for declaresBefore/declaresAfter used by the new suppression logic. |
| go/test/endtoend/vtorc/utils/utils.go | Adds GetSuccessfulRecoveryCount helper for e2e tests needing “delta” assertions. |
| go/test/endtoend/vtorc/general/vtorc_test.go | Adds end-to-end regression test ensuring ReplicationStopped on a semi-sync acker is fixed even when semi-sync blocked conditions are possible. |

Comment on lines +745 to +755
// are skipped because they are unreachable; reachable-but-unhealthy primaries
// (PrimarySemiSyncBlocked, PrimaryDiskStalled) are NOT skipped so that
// checkIfAlreadyFixed evaluates fresh state.
func shardWideRecoveryIgnoredTablets(recoveryFunctionCode recoveryFunction, analysisEntry *inst.DetectionAnalysis) []string {
	var tabletsToIgnore []string
	if recoveryFunctionCode == recoverDeadPrimaryFunc {
		switch analysisEntry.Analysis {
		case inst.PrimarySemiSyncBlocked, inst.PrimaryDiskStalled:
			// Reachable primary — refresh it so checkIfAlreadyFixed
			// evaluates current state. The problem may have been
			// resolved by a prior dependency recovery.

Copilot AI Apr 29, 2026


The doc comment describes PrimaryDiskStalled as a “reachable-but-unhealthy” primary, but the PrimaryDiskStalled analysis is matched when LastCheckValid is false (i.e., VTOrc couldn’t successfully check the primary recently). Consider rewording to avoid implying reachability for PrimaryDiskStalled (or clarify what “reachable” means here) so the ignore/refresh behavior is less confusing.

Suggested change
- // are skipped because they are unreachable; reachable-but-unhealthy primaries
- // (PrimarySemiSyncBlocked, PrimaryDiskStalled) are NOT skipped so that
- // checkIfAlreadyFixed evaluates fresh state.
- func shardWideRecoveryIgnoredTablets(recoveryFunctionCode recoveryFunction, analysisEntry *inst.DetectionAnalysis) []string {
- 	var tabletsToIgnore []string
- 	if recoveryFunctionCode == recoverDeadPrimaryFunc {
- 		switch analysisEntry.Analysis {
- 		case inst.PrimarySemiSyncBlocked, inst.PrimaryDiskStalled:
- 			// Reachable primary — refresh it so checkIfAlreadyFixed
- 			// evaluates current state. The problem may have been
- 			// resolved by a prior dependency recovery.
+ // are skipped because they are known to be unreachable; primaries flagged as
+ // PrimarySemiSyncBlocked or PrimaryDiskStalled are NOT skipped so that
+ // checkIfAlreadyFixed evaluates fresh state. In particular, PrimaryDiskStalled
+ // does not imply VTOrc was able to reach the primary successfully on the last
+ // check; it is still refreshed here to re-evaluate the current state.
+ func shardWideRecoveryIgnoredTablets(recoveryFunctionCode recoveryFunction, analysisEntry *inst.DetectionAnalysis) []string {
+ 	var tabletsToIgnore []string
+ 	if recoveryFunctionCode == recoverDeadPrimaryFunc {
+ 		switch analysisEntry.Analysis {
+ 		case inst.PrimarySemiSyncBlocked, inst.PrimaryDiskStalled:
+ 			// Do not skip this primary during refresh: re-evaluate current
+ 			// state so checkIfAlreadyFixed can see whether a prior dependency
+ 			// recovery already resolved the problem.
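
For readers following the thread, here is a minimal, self-contained sketch of the ignore/refresh decision the snippet above is truncated from. It assumes that the branch not shown in the diff adds the analyzed primary's alias to the ignore list; that default case, the lowercase type and constant names, and the sample alias are all assumptions made for illustration and are not taken from the Vitess source.

```go
package main

import "fmt"

// Hypothetical stand-ins for the VTOrc types referenced in the diff.
type recoveryFunction int

const (
	recoverGenericProblemFunc recoveryFunction = iota
	recoverDeadPrimaryFunc
)

type analysisCode string

const (
	deadPrimary            analysisCode = "DeadPrimary"
	primarySemiSyncBlocked analysisCode = "PrimarySemiSyncBlocked"
	primaryDiskStalled     analysisCode = "PrimaryDiskStalled"
)

type detectionAnalysis struct {
	Analysis              analysisCode
	AnalyzedInstanceAlias string // a string alias, as on release-23.0
}

// ignoredTablets returns the aliases to skip during the pre-recovery refresh.
func ignoredTablets(fn recoveryFunction, entry *detectionAnalysis) []string {
	var tabletsToIgnore []string
	if fn == recoverDeadPrimaryFunc {
		switch entry.Analysis {
		case primarySemiSyncBlocked, primaryDiskStalled:
			// Not skipped: the primary is refreshed so a later
			// already-fixed check sees current state.
		default:
			// Assumption: any other analysis handled by this recovery
			// function means the primary is unreachable, so skip it.
			tabletsToIgnore = append(tabletsToIgnore, entry.AnalyzedInstanceAlias)
		}
	}
	return tabletsToIgnore
}

func main() {
	alias := "zone1-0000000100" // hypothetical tablet alias
	fmt.Println(ignoredTablets(recoverDeadPrimaryFunc, &detectionAnalysis{deadPrimary, alias}))
	fmt.Println(ignoredTablets(recoverDeadPrimaryFunc, &detectionAnalysis{primarySemiSyncBlocked, alias}))
}
```

The point of the sketch is only the shape of the decision: the refresh skips a primary that cannot answer, but not one whose problem may already have been cleared by an earlier dependency recovery.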

Comment on lines +990 to +991
// VTOrc instance has already performing the mitigation.
// In either case, the original analysis is stale which can be safely aborted.

Copilot AI Apr 29, 2026


Minor grammar in the comment: “VTOrc instance has already performing the mitigation” should be “is already performing” or “has already performed” (and similarly in the earlier line “has already performing”). This comment is new/updated and shows up in logs/grep context when debugging.

Suggested change
- // VTOrc instance has already performing the mitigation.
- // In either case, the original analysis is stale which can be safely aborted.
+ // VTOrc instance is already performing the mitigation.
+ // In either case, the original analysis is stale and can be safely aborted.

}
// Note: when ca.hasShardWideAction is true, we still run matching
// below to check if this tablet's problem declares it must run
// before the shard-wide action (via BeforeAnalyses).

Copilot AI Apr 29, 2026


The comment says we run matching to check if a tablet’s problem declares it must run before the shard-wide action “(via BeforeAnalyses)”, but the new suppression bypass also considers dependencies declared via the shard-wide problem’s AfterAnalyses. Consider updating the comment to mention both BeforeAnalyses and AfterAnalyses so future readers don’t miss the symmetric case implemented below.

Suggested change
- // before the shard-wide action (via BeforeAnalyses).
+ // before the shard-wide action (via BeforeAnalyses), or if the
+ // shard-wide problem declares it must run after this tablet problem
+ // (via AfterAnalyses).
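
The symmetric check this comment asks to document can be sketched as follows. The PR mentions helpers named declaresBefore and declaresAfter, but their real signatures are not shown in this thread, so the versions below, along with `problemMeta`, `contains`, `bypassesSuppression`, and the "HypotheticalReplicaProblem" code, are assumptions made purely for illustration.

```go
package main

import "fmt"

// problemMeta is a hypothetical stand-in for an analysis problem definition.
type problemMeta struct {
	Code           string
	BeforeAnalyses []string
	AfterAnalyses  []string
}

// contains reports whether s is present in list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// declaresBefore reports whether p declares it must run before other.
func declaresBefore(p problemMeta, other string) bool {
	return contains(p.BeforeAnalyses, other)
}

// declaresAfter reports whether p declares it must run after other.
func declaresAfter(p problemMeta, other string) bool {
	return contains(p.AfterAnalyses, other)
}

// bypassesSuppression reports whether a per-tablet problem should still be
// surfaced when a shard-wide action has been detected: either the tablet
// problem declares itself before the shard-wide one, or the shard-wide
// problem declares itself after the tablet problem.
func bypassesSuppression(tablet, shardWide problemMeta) bool {
	return declaresBefore(tablet, shardWide.Code) || declaresAfter(shardWide, tablet.Code)
}

func main() {
	shardWide := problemMeta{Code: "PrimarySemiSyncBlocked"}
	stopped := problemMeta{Code: "ReplicationStopped", BeforeAnalyses: []string{"PrimarySemiSyncBlocked"}}
	unrelated := problemMeta{Code: "HypotheticalReplicaProblem"}

	fmt.Println(bypassesSuppression(stopped, shardWide))
	fmt.Println(bypassesSuppression(unrelated, shardWide))
}
```

Checking both directions means a dependency only needs to be declared on one side of the pair, which is why the comment is worth widening to mention AfterAnalyses as well.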

@timvaillancourt timvaillancourt merged commit 02943d0 into release-23.0 Apr 30, 2026
106 of 113 checks passed
@timvaillancourt timvaillancourt deleted the backport-19925-to-release-23.0 branch April 30, 2026 13:24

Labels

Backport This is a backport Component: VTOrc Vitess Orchestrator integration Type: Bug

3 participants