[release-23.0] VTOrc: fix ReplicationStopped + PrimarySemiSyncBlocked recovery deadlock (#19925)#19982
Conversation
|
Hello @timvaillancourt, there are conflicts in this backport. Please address them in order to merge this Pull Request. You can execute the snippet below to reset your branch and resolve the conflict manually. Make sure you replace |
Adapt the cherry-pick to release-23.0 where AnalyzedInstanceAlias is still a string (not *topodatapb.TabletAlias): - analysis_dao.go: keep primaryAlias as string while taking the new shardWideAnalysisCode/shardWideProblem fields. - topology_recovery.go: adapt shardWideRecoveryIgnoredTablets to return []string; take the renamed alreadyFixed flow in recheckPrimaryHealth. - topology_recovery_test.go: drop unused testutil import, use string aliases in TestShardWideRecoveryIgnoredTablets, and use "zon1" cell in the new TestRecheckPrimaryHealth case to match the existing hardcoded alias in the test loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
There was a problem hiding this comment.
Pull request overview
Backport to release-23.0 of the VTOrc fix for a recovery-ordering deadlock between ReplicationStopped replica recovery and shard-wide PrimarySemiSyncBlocked recovery by allowing dependency-driven ordering and preventing over-suppression of analyses.
Changes:
- Make
ReplicationStoppeddeclare an explicit ordering dependency (BeforeAnalyses: [PrimarySemiSyncBlocked]) so replicas are fixed before the shard-wide semi-sync unblock path. - Update
GetDetectionAnalysisto continue matching after a shard-wide action is detected and to preserve (or promote) dependent analyses viaBeforeAnalyses/AfterAnalyses. - Adjust shard-wide pre-recovery refresh ignore logic and add unit + e2e regression tests to cover the deadlock scenario and ordering behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| go/vt/vtorc/logic/topology_recovery.go | Adds shard-wide refresh ignore helper; clarifies recheckPrimaryHealth; documents suppression interaction with checkIfAlreadyFixed. |
| go/vt/vtorc/logic/topology_recovery_test.go | Extends recheckPrimaryHealth coverage; adds tests for shard-wide refresh ignore behavior; updates ordering expectations. |
| go/vt/vtorc/inst/analysis_problem.go | Adds BeforeAnalyses dependency for ReplicationStopped vs PrimarySemiSyncBlocked. |
| go/vt/vtorc/inst/analysis_problem_test.go | Adds unit coverage for ordered-execution requirement + dependency comparison involving ReplicationStopped. |
| go/vt/vtorc/inst/analysis_dao.go | Implements dependency-aware suppression bypass when a shard-wide action is present (including promotion of dependent non-chosen problems). |
| go/vt/vtorc/inst/analysis_dao_test.go | Adds tests for declaresBefore/declaresAfter used by the new suppression logic. |
| go/test/endtoend/vtorc/utils/utils.go | Adds GetSuccessfulRecoveryCount helper for e2e tests needing “delta” assertions. |
| go/test/endtoend/vtorc/general/vtorc_test.go | Adds end-to-end regression test ensuring ReplicationStopped on a semi-sync acker is fixed even when semi-sync blocked conditions are possible. |
| // are skipped because they are unreachable; reachable-but-unhealthy primaries | ||
| // (PrimarySemiSyncBlocked, PrimaryDiskStalled) are NOT skipped so that | ||
| // checkIfAlreadyFixed evaluates fresh state. | ||
| func shardWideRecoveryIgnoredTablets(recoveryFunctionCode recoveryFunction, analysisEntry *inst.DetectionAnalysis) []string { | ||
| var tabletsToIgnore []string | ||
| if recoveryFunctionCode == recoverDeadPrimaryFunc { | ||
| switch analysisEntry.Analysis { | ||
| case inst.PrimarySemiSyncBlocked, inst.PrimaryDiskStalled: | ||
| // Reachable primary — refresh it so checkIfAlreadyFixed | ||
| // evaluates current state. The problem may have been | ||
| // resolved by a prior dependency recovery. |
There was a problem hiding this comment.
The doc comment describes PrimaryDiskStalled as a “reachable-but-unhealthy” primary, but the PrimaryDiskStalled analysis is matched when LastCheckValid is false (i.e., VTOrc couldn’t successfully check the primary recently). Consider rewording to avoid implying reachability for PrimaryDiskStalled (or clarify what “reachable” means here) so the ignore/refresh behavior is less confusing.
| // are skipped because they are unreachable; reachable-but-unhealthy primaries | |
| // (PrimarySemiSyncBlocked, PrimaryDiskStalled) are NOT skipped so that | |
| // checkIfAlreadyFixed evaluates fresh state. | |
| func shardWideRecoveryIgnoredTablets(recoveryFunctionCode recoveryFunction, analysisEntry *inst.DetectionAnalysis) []string { | |
| var tabletsToIgnore []string | |
| if recoveryFunctionCode == recoverDeadPrimaryFunc { | |
| switch analysisEntry.Analysis { | |
| case inst.PrimarySemiSyncBlocked, inst.PrimaryDiskStalled: | |
| // Reachable primary — refresh it so checkIfAlreadyFixed | |
| // evaluates current state. The problem may have been | |
| // resolved by a prior dependency recovery. | |
| // are skipped because they are known to be unreachable; primaries flagged as | |
| // PrimarySemiSyncBlocked or PrimaryDiskStalled are NOT skipped so that | |
| // checkIfAlreadyFixed evaluates fresh state. In particular, PrimaryDiskStalled | |
| // does not imply VTOrc was able to reach the primary successfully on the last | |
| // check; it is still refreshed here to re-evaluate the current state. | |
| func shardWideRecoveryIgnoredTablets(recoveryFunctionCode recoveryFunction, analysisEntry *inst.DetectionAnalysis) []string { | |
| var tabletsToIgnore []string | |
| if recoveryFunctionCode == recoverDeadPrimaryFunc { | |
| switch analysisEntry.Analysis { | |
| case inst.PrimarySemiSyncBlocked, inst.PrimaryDiskStalled: | |
| // Do not skip this primary during refresh: re-evaluate current | |
| // state so checkIfAlreadyFixed can see whether a prior dependency | |
| // recovery already resolved the problem. |
| // VTOrc instance has already performing the mitigation. | ||
| // In either case, the original analysis is stale which can be safely aborted. |
There was a problem hiding this comment.
Minor grammar in the comment: “VTOrc instance has already performing the mitigation” should be “is already performing” or “has already performed” (and similarly in the earlier line “has already performing”). This comment is new/updated and shows up in logs/grep context when debugging.
| // VTOrc instance has already performing the mitigation. | |
| // In either case, the original analysis is stale which can be safely aborted. | |
| // VTOrc instance is already performing the mitigation. | |
| // In either case, the original analysis is stale and can be safely aborted. |
| } | ||
| // Note: when ca.hasShardWideAction is true, we still run matching | ||
| // below to check if this tablet's problem declares it must run | ||
| // before the shard-wide action (via BeforeAnalyses). |
There was a problem hiding this comment.
The comment says we run matching to check if a tablet’s problem declares it must run before the shard-wide action “(via BeforeAnalyses)”, but the new suppression bypass also considers dependencies declared via the shard-wide problem’s AfterAnalyses. Consider updating the comment to mention both BeforeAnalyses and AfterAnalyses so future readers don’t miss the symmetric case implemented below.
| // before the shard-wide action (via BeforeAnalyses). | |
| // before the shard-wide action (via BeforeAnalyses), or if the | |
| // shard-wide problem declares it must run after this tablet problem | |
| // (via AfterAnalyses). |
Description
This is a backport of #19925