VTOrc: support disk-full detection and recovery of the PRIMARY #20318
timvaillancourt wants to merge 4 commits
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Pull request overview
Copilot reviewed 35 out of 38 changed files in this pull request and generated no new comments.
Files not reviewed (2)
- go/vt/proto/replicationdata/replicationdata.pb.go: Generated file
- go/vt/proto/replicationdata/replicationdata_vtproto.pb.go: Generated file
Promptless prepared a documentation update related to this change. Triggered by PR #20318. Added documentation for the new VTOrc disk-full detection and recovery feature.
Description
Similar to the stalled-disk feature of the `tabletmanager` disk health monitor, this PR adds optional detection and recovery of full disks (`syscall.ENOSPC`/`syscall.EDQUOT`) when the disk health monitor is enabled. This helps address problems seen when a `PRIMARY` fills its disk to 100% (see #20056).

This signal is plumbed all the way to VTOrc via the `FullStatusResponse`, again like the stalled-disk signal. When VTOrc notices a `PRIMARY` has a full disk, it will trigger an `EmergencyReparentShard` to a candidate whose disk is (importantly) NOT full, if the feature is enabled. When space is still available on other nodes, this prevents a serious incident. When it is not, the operator of Vitess still gains important insight. The support makes a best effort to improve the situation.

Importantly, the filtering of full-disk replicas does not change the "must be most advanced" nature of ERS: we will still pick the most advanced intermediate candidate to receive the most-advanced GTID sets, we just won't promote a full `PRIMARY`, if possible.

Details:
- `syscall.ENOSPC`/`syscall.EDQUOT` is used to determine a disk is full, using the same disk health ticker
- `IsDiskFull bool` is added to `FullStatusResponse` (called by VTOrc for tablet probes). Always `false` if the disk health monitor is disabled
- `--enable-primary-disk-full-recovery` added to VTOrc, to gate full-disk recoveries. This allows a smooth rollout like the stalled disk recovery
- `PrimaryDiskFull`: Primary disk is full, will trigger failover to non-full-disk replicas (if disk-full recoveries are enabled)
- `ReplicaDiskFull`: A replica disk is full. This problem is informational/no-op
- `go/test/endtoend/tabletmanager/disk_health_monitor/fuse_helper/` (introduced in VTTablet: add CI-only e2e test for disk health monitor #20212) renamed to `testfs` to reflect what it "provides" (a test filesystem) and not "how" it is provided
- `changelog/` entry explains the new signal

Related Issue(s)
Resolves: #20056 (this new signal allows a full but InnoDB-handler-commit-stuck `PRIMARY` to be actioned)

Related:
- #17470 (original `StalledDiskPrimary` analysis + recovery)
- #17624 (refactor disk-stall implementation, mark `NOT_SERVING` on stall)
- #20212 (e2e test infrastructure for the disk health monitor)
Checklist
Deployment Notes
This new signal and recovery are disabled by default. To enable:
- Enable the disk health monitor with the `--disk-write-dir` flag
- Add `--enable-primary-disk-full-recovery` to VTOrc

This two-phase enabling approach allows a user to first validate the disk health monitor signal before letting VTOrc action those signals
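
As a rough illustration of the two-phase gating, the sketch below shows how a full-disk signal and the VTOrc flag might combine, together with the candidate filtering from the Description. Every name here (`replica`, `shouldRecoverPrimary`, `notFull`, the tablet aliases) is hypothetical and not taken from the PR. What VTOrc does when every candidate is full is not stated above, so the sketch simply returns an empty list in that case.

```go
package main

import "fmt"

// replica is a hypothetical, simplified view of an ERS candidate.
type replica struct {
	Alias      string
	IsDiskFull bool // mirrors the IsDiskFull signal carried by FullStatusResponse
}

// shouldRecoverPrimary models the gate: the signal alone (phase one) only
// surfaces an analysis; a recovery runs only once the operator has also
// enabled the VTOrc flag (phase two).
func shouldRecoverPrimary(primaryDiskFull, recoveryEnabled bool) bool {
	return primaryDiskFull && recoveryEnabled
}

// notFull keeps only candidates whose disk is not full, preserving the
// input order so an upstream "most advanced first" ordering is untouched.
func notFull(candidates []replica) []replica {
	var out []replica
	for _, c := range candidates {
		if !c.IsDiskFull {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	candidates := []replica{
		{Alias: "zone1-0000000101", IsDiskFull: true},
		{Alias: "zone1-0000000102", IsDiskFull: false},
	}

	// Phase one: signal visible, recovery flag off, so no action is taken.
	fmt.Println(shouldRecoverPrimary(true, false))

	// Phase two: flag on, so recovery proceeds with non-full candidates only.
	if shouldRecoverPrimary(true, true) {
		for _, c := range notFull(candidates) {
			fmt.Println("promotable:", c.Alias)
		}
	}
}
```

Keeping the filter separate from the gate reflects the rollout idea: the signal can be observed and validated on its own before any recovery is allowed to act on it.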
AI Disclosure
Core changes by a human being. Claude (Opus 4.7) and Codex (gpt-5.5) assisted with adding testing, early reviews, bouncing ideas