VTOrc: support disk-full detection and recovery of the PRIMARY#20318

Open
timvaillancourt wants to merge 4 commits into
vitessio:mainfrom
timvaillancourt:disk-full-health-monitor

@timvaillancourt timvaillancourt commented Jun 13, 2026

Description

Similar to the stalled-disk feature of the tabletmanager disk health monitor, this PR adds optional detection and recovery of full disks (syscall.ENOSPC/syscall.EDQUOT) when the disk health monitor is enabled. This helps address problems seen when a PRIMARY fills its disk to 100% (see #20056).

This signal is plumbed all the way to VTOrc via the FullStatusResponse, again like the stalled-disk signal. When VTOrc notices that a PRIMARY has a full disk and the feature is enabled, it triggers an EmergencyReparentShard to a candidate whose disk is (importantly) NOT full. When space is still available on other nodes, this prevents a serious incident. When it is not, the operator of Vitess still gains important insight into the problem; the feature makes a best effort to improve the situation.

Importantly, filtering out full-disk replicas does not change the "must be most advanced" nature of ERS: we still pick the most advanced intermediate candidate to receive the most advanced GTID sets; we just won't promote a candidate with a full disk to PRIMARY, if possible.
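To illustrate the distinction between the two selection steps, here is a minimal sketch. It is not the ERS code from this PR: the `candidate` type, the integer `Position` (a stand-in for a GTID set), the aliases, and both helper functions are hypothetical. The fallback to full-disk candidates when nothing else exists is only one possible reading of "if possible".

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical, simplified view of a replica during ERS.
// Position stands in for a GTID set: a higher value means more advanced.
type candidate struct {
	Alias      string
	Position   int
	IsDiskFull bool
}

// mostAdvanced picks the intermediate source purely by position,
// ignoring disk state, so no transactions are lost.
func mostAdvanced(cands []candidate) (candidate, bool) {
	if len(cands) == 0 {
		return candidate{}, false
	}
	best := cands[0]
	for _, c := range cands[1:] {
		if c.Position > best.Position {
			best = c
		}
	}
	return best, true
}

// promotionOrder returns promotion targets, most advanced first,
// preferring candidates whose disk is not full. Assumption: if every
// candidate has a full disk, fall back to the unfiltered list.
func promotionOrder(cands []candidate) []candidate {
	eligible := make([]candidate, 0, len(cands))
	for _, c := range cands {
		if !c.IsDiskFull {
			eligible = append(eligible, c)
		}
	}
	if len(eligible) == 0 {
		eligible = append(eligible, cands...)
	}
	sort.SliceStable(eligible, func(i, j int) bool {
		return eligible[i].Position > eligible[j].Position
	})
	return eligible
}

func main() {
	cands := []candidate{
		{Alias: "zone1-101", Position: 120, IsDiskFull: true},
		{Alias: "zone1-102", Position: 118},
		{Alias: "zone1-103", Position: 110},
	}
	source, _ := mostAdvanced(cands)
	order := promotionOrder(cands)
	fmt.Println("intermediate source:", source.Alias) // zone1-101
	fmt.Println("promotion target:", order[0].Alias)  // zone1-102
}
```

In this sketch the most advanced replica still acts as the source of the latest GTIDs even though its disk is full, while the promotion target is the most advanced replica with free space.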

Details:

  • The cross-platform syscall.ENOSPC/syscall.EDQUOT errors are used to determine that a disk is full, using the same disk health ticker
    • The signals work consistently on Linux, BSD, and macOS - Go abstracts the platform details 🎉
    • Does NOT work on Windows (acceptable) 🤷
  • IsDiskFull bool is added to FullStatusResponse (called by VTOrc for tablet probes). Always false if the disk health monitor is disabled
  • --enable-primary-disk-full-recovery added to VTOrc, to gate full-disk recoveries. This allows a smooth rollout like the stalled disk recovery
  • 2 x new self-explanatory VTOrc analysis codes:
    • PrimaryDiskFull - Primary disk is full, will trigger failover to non-full-disk replicas (if disk-full recoveries are enabled)
    • ReplicaDiskFull - A replica disk is full. This problem is informational/no-op
  • go/test/endtoend/tabletmanager/disk_health_monitor/fuse_helper/ (introduced in VTTablet: add CI-only e2e test for disk health monitor #20212) renamed to testfs to reflect what it "provides" (a test filesystem) and not "how" it is provided
    • Many PR file-updates are really this rename 👎
  • Testing:
    • End-to-end tests produce a real full-disk signal and validate state transitions
    • Typical end-to-end testing and unit-tests updated for the new states/problem codes
  • changelog/ entry explains the new signal

Related Issue(s)

Resolves: #20056 (this new signal allows a full but InnoDB-handler-commit-stuck PRIMARY to be actioned)

Related:
- #17470 (original StalledDiskPrimary analysis + recovery)
- #17624 (refactor disk-stall implementation, mark NOT_SERVING on stall)
- #20212 (e2e test infrastructure for the disk health monitor)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

The new signal and recovery are disabled by default. To enable them:

  1. Ensure the disk health monitor is enabled, via the existing --disk-write-dir flag
    • After this PR, this will cause both stalled and full disk signals to be returned to VTOrc
  2. To enable VTOrc recoveries based on full-disks, add --enable-primary-disk-full-recovery to VTOrc

This two-phase enabling approach allows a user to first validate the disk health monitor signal before letting VTOrc action those signals
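The two-phase gating can be sketched as below. This is a simplified model, not VTOrc's implementation: the `analysisCode` type, the `analyze` and `shouldRecover` functions, and the `NoProblem` value are hypothetical names used only for illustration. Only the PrimaryDiskFull and ReplicaDiskFull code names and the two flags come from the description above.

```go
package main

import "fmt"

// analysisCode is a hypothetical stand-in for a VTOrc analysis code.
type analysisCode string

const (
	NoProblem       analysisCode = "NoProblem"
	PrimaryDiskFull analysisCode = "PrimaryDiskFull"
	ReplicaDiskFull analysisCode = "ReplicaDiskFull"
)

// analyze maps a tablet probe to an analysis code. isDiskFull is always
// false when the disk health monitor (--disk-write-dir) is not enabled.
func analyze(isPrimary, isDiskFull bool) analysisCode {
	switch {
	case !isDiskFull:
		return NoProblem
	case isPrimary:
		return PrimaryDiskFull
	default:
		return ReplicaDiskFull
	}
}

// shouldRecover models the second phase: a recovery is attempted only for
// PrimaryDiskFull and only when --enable-primary-disk-full-recovery is set.
// ReplicaDiskFull stays informational.
func shouldRecover(code analysisCode, recoveryEnabled bool) bool {
	return code == PrimaryDiskFull && recoveryEnabled
}

func main() {
	code := analyze(true, true)
	fmt.Println(code, shouldRecover(code, false)) // phase 1: signal only
	fmt.Println(code, shouldRecover(code, true))  // phase 2: recovery allowed
}
```

With the recovery flag off, the analysis code is still reported, which is what lets an operator validate the signal before allowing VTOrc to act on it.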

AI Disclosure

Core changes by a human being. Claude (Opus 4.7) and Codex (gpt-5.5) assisted with adding testing, early reviews, bouncing ideas

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings June 13, 2026 23:45
@github-actions github-actions Bot added this to the v25.0.0 milestone Jun 13, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jun 13, 2026
vitess-bot Bot commented Jun 13, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

This comment was marked as outdated.

codecov Bot commented Jun 13, 2026

Codecov Report

❌ Patch coverage is 83.76963% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.82%. Comparing base (70c7a72) to head (428a272).
⚠️ Report is 323 commits behind head on main.

Files with missing lines | Patch % | Lines
go/vt/vtorc/inst/instance_dao.go | 78.87% | 15 Missing ⚠️
go/vt/vtorc/logic/topology_recovery.go | 78.26% | 10 Missing ⚠️
go/vt/vtorc/config/config.go | 75.00% | 2 Missing ⚠️
go/vt/vttablet/tabletserver/disk_health_monitor.go | 92.85% | 2 Missing ⚠️
go/vt/vttablet/tabletserver/tabletserver.go | 0.00% | 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #20318       +/-   ##
===========================================
+ Coverage   69.67%   73.82%    +4.15%     
===========================================
  Files        1614      197     -1417     
  Lines      216793    32000   -184793     
===========================================
- Hits       151044    23624   -127420     
+ Misses      65749     8376    -57373     
Flag Coverage Δ
partial 73.82% <83.76%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

@timvaillancourt timvaillancourt removed Component: VTAdmin VTadmin interface NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jun 14, 2026
@timvaillancourt timvaillancourt self-assigned this Jun 14, 2026
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings June 14, 2026 10:43

Copilot AI left a comment

Pull request overview

Copilot reviewed 35 out of 38 changed files in this pull request and generated no new comments.

Files not reviewed (2)
  • go/vt/proto/replicationdata/replicationdata.pb.go: Generated file
  • go/vt/proto/replicationdata/replicationdata_vtproto.pb.go: Generated file

@timvaillancourt timvaillancourt added Type: Enhancement Logical improvement (somewhere between a bug and feature) and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request labels Jun 14, 2026
@timvaillancourt timvaillancourt marked this pull request as ready for review June 14, 2026 22:01
promptless Bot commented Jun 14, 2026

Promptless prepared a documentation update related to this change.

Triggered by PR #20318

Added documentation for the new VTOrc disk-full detection and recovery feature, including the PrimaryDiskFull and ReplicaDiskFull analysis codes in the recovery table, and a new "Disk Health Monitoring" section explaining how to enable and configure the feature.

Review: Document VTOrc disk-full detection and recovery

@timvaillancourt timvaillancourt changed the title VTOrc: disk-full detection and recovery on the primary VTOrc: support disk-full detection and recovery of the PRIMARY Jun 14, 2026
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Labels

Component: Documentation docs related issues/PRs Component: TabletManager Component: VTOrc Vitess Orchestrator integration Component: VTTablet Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: VTOrc to detect "stalled" replicas where replication threads remain running

2 participants