Skip to content

VTTablet: add CI-only e2e test for disk health monitor#20212

Merged
timvaillancourt merged 14 commits into
vitessio:mainfrom
timvaillancourt:vttablet-disk-health-monitor-e2e
Jun 11, 2026
Merged

VTTablet: add CI-only e2e test for disk health monitor#20212
timvaillancourt merged 14 commits into
vitessio:mainfrom
timvaillancourt:vttablet-disk-health-monitor-e2e

Conversation

@timvaillancourt

@timvaillancourt timvaillancourt commented May 29, 2026

Copy link
Copy Markdown
Contributor

Description

Implements the proposal in #20091 — a Linux/CI-only e2e test for VTTablet's stalled disk monitor, running against a real mysqld with the monitor's --disk-write-dir pointed at a passthrough FUSE filesystem we control from the test process. SIGUSR1 to the FUSE helper engages an in-process gate that blocks every mutating op against the mount (including the monitor's probe writes); we assert the primary tablet flips to NOT_SERVING, then SIGHUP to clear the gate and assert it returns to SERVING

Today only unit tests cover the monitor (disk_health_monitor_test.go), and they stub out the writeFunction — so we've never actually proven the full chain (monitor probe writes → IsDiskStalledstate_manager → tablet state transition) reacts to a real wedged filesystem

(from this PR CI):
Screenshot 2026-05-29 at 16 10 20

Test layout (go/test/endtoend/tabletmanager/disk_health_monitor/)

All test files carry //go:build linux, so on macOS / BSD the package is invisible to the compiler. TestMain additionally skips unless CI or GITHUB_ACTIONS is set — so it never runs on a dev machine, even a Linux one

  • fuse_helper/main.go — small Go binary using github.com/hanwen/go-fuse/v2 that mirrors a backing dir through the mount point. The helper gates its mutating FUSE ops (Create, Open, Setattr, Write, Fsync) on a signal-driven in-process latch: SIGUSR1 stalls, SIGHUP resumes. We explicitly avoid SIGSTOP/SIGCONT here because freezing the helper's Go runtime mid-stall would also block delivery of SIGTERM, leaving cluster teardown wedged on a hung FUSE mount if a test panicked
  • main_test.go — builds + starts the helper, mounts FUSE under a temp dir, brings up the minimum cluster (topo + vtctld + 1 primary vttablet) with --disk-write-dir pointed at the FUSE mount. mysqld's datadir and vttablet's logs stay on real disk, so cluster I/O (including failure-path log reads in the harness) is outside the gate
  • stall_test.go — single test (TestDiskHealthMonitor_StallAndRecover) covering both the stall (SERVINGNOT_SERVING) and recovery (NOT_SERVINGSERVING) transitions, with a 30s budget per transition (generous to keep CI quiet under resource pressure — the actual flip happens in ~3s)

CI wiring

New shard tabletmanager_disk_health_monitor in test/config.json. The shard runs in a dedicated workflow (.github/workflows/cluster_endtoend_disk_health_monitor.yml) rather than the shared matrix — the matrix in cluster_endtoend.yml uses a single go/**/*.go paths filter for every shard, and we wanted a narrower trigger surface here, scoped to go/vt/vttablet/tabletserver/**, go/test/endtoend/cluster/**, and the test directory. To keep both workflows from running it, the shard is added to EXCLUDE_SHARDS in cluster_endtoend.yml. The dedicated workflow also installs fuse3 (the only shard that needs it today)

Why Linux/CI only

macOS FUSE (macFUSE) is a kext / system extension that users have to install + trust, and an orphaned FUSE mount on a Mac dev box can wedge Finder / Spotlight 😅. It needs separate code and privilege escalations I don't think we should play with on developer's 💻s. The CI proves the monitor works on the platform Vitess is deployed to in reality. Full reasoning is in #20091

Related Issue(s)

Resolves: #20091

Adjacent: #20056 (ReplicationStalledDiskFull — a different but related disk-wedge class)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

No runtime impact — test-only addition. Adds one new module dep (github.com/hanwen/go-fuse/v2) and one new CI shard (tabletmanager_disk_health_monitor). Not for backport

AI Disclosure

Claude Code assisted with the implementation, testing, and PR summary

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings May 29, 2026 14:01
@github-actions github-actions Bot added this to the v25.0.0 milestone May 29, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 29, 2026
@vitess-bot

vitess-bot Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@timvaillancourt timvaillancourt self-assigned this May 29, 2026
@timvaillancourt timvaillancourt removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 29, 2026

This comment was marked as outdated.

@timvaillancourt timvaillancourt marked this pull request as ready for review May 29, 2026 14:11
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

This comment was marked as outdated.

Comment thread test/config.json
Comment on lines +793 to +806
"tabletmanager_disk_health_monitor": {
"File": "unused.go",
"Packages": [
"vitess.io/vitess/go/test/endtoend/tabletmanager/disk_health_monitor"
],
"Args": [],
"Command": [],
"Manual": false,
"Shard": "tabletmanager_disk_health_monitor",
"Tags": [],
"Needs": [
"fuse"
]
},

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this does not need to run on every push to every PR or on every push to main. I'd say that this could be a manual test. Adding a new CI workflow for fixes just isn't sustainable. I'm not saying that we can't do this here, that's just my personal preference/opinion that this isn't one of those super critical behaviors that would warrant it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattlord fair point. I agree we should not allow a "slippery slope" with e2e tests, but the disk health monitor is a feature that is used to ensure the availability of Vitess - it is an optional signal to VTOrc for ERS operations, for example

Based on a recent incident investigation (the disk filling caused a significant outage), in an upcoming PR(s) I plan to add new capabilities to the disk monitor. These capabilities will also be used to ensure the availability of shards, and thus must be proven to work - if not, I think this would be inconsistent with how we ship important functionality. In the longer term, I think the disk monitor should be a candidate for becoming a default feature, due to the preventable issues it can address

To address the CI concern, I have made this CI workflow isolated, and it now only runs when there are changes to the disk health monitor code, tabletserver, the e2e harness and go modules. I think this is the right tradeoff that still ensures this important logic works continuously, while avoiding unnecessary CI work. Let me know what you think

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
…monitor-e2e

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings June 4, 2026 15:48
…monitor-e2e

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

# Conflicts:
#	go.sum

This comment was marked as outdated.

Trigger the shard on changes to the wider tabletserver package and
shared e2e cluster harness, not just disk_health_monitor*.go, so
integration regressions in state-manager or cluster-startup code can
no longer slip past this workflow.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Per CLAUDE.md, tests should use testify's require/assert helpers
rather than t.Fatal/t.Error so failures are reported consistently
with the rest of the test suite.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings June 4, 2026 18:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.

Comment thread .github/workflows/cluster_endtoend.yml

@mattlord mattlord left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In go/test/endtoend/tabletmanager/disk_health_monitor/main_test.go:110-115 the test puts the entire cluster VTDATAROOT on the gated FUSE mount before cluster.NewCluster, so the vttablet tmp/log directory also lives on the filesystem that is intentionally stalled. If vttablet exits while TestDiskHealthMonitor_StallAndRecover is waiting for NOT_SERVING, the existing harness tries to read vttablet.ErrorLog before returning (go/test/endtoend/cluster/vttablet_process.go:318), and that read can block behind the same FUSE gate. In that failure mode the SIGHUP cleanup defer never runs and CI burns the workflow timeout instead of failing quickly with diagnostics. Please keep the cluster logs/tmp outside the stalled mount, or otherwise ensure the gate is cleared before any failure-path log reads.

In go/test/endtoend/tabletmanager/disk_health_monitor/stall_test.go:52-58 the test only asserts the tablet becomes NOT_SERVING, but that status is the combined result of several state-manager predicates, not proof that IsDiskStalled() was the reason. Since this setup stalls the real MySQL datadir too, a future green run could be caused by another health path while the disk monitor signal is broken. Please assert the direct signal as well, e.g. after SIGUSR1 call FullStatus with a bounded context and require DiskStalled == true, then after SIGHUP require it clears before accepting the final SERVING transition.

Two issues raised in review:

1. The cluster's VTDATAROOT was set to the gated FUSE mount, so
   vttablet's log files lived behind the same gate as the monitor's
   probe writes. If vttablet exited while the harness was polling for
   NOT_SERVING, the next line (os.ReadFile(ErrorLog) in vttablet_process.go)
   would block on the stalled FUSE filesystem, wedging the test goroutine
   past its 30s timeout and burning the workflow budget. Only point
   --disk-write-dir at the FUSE mount now; mysqld's datadir and vttablet's
   logs stay on real disk.

2. NOT_SERVING is the AND of six state-manager predicates, not proof
   that IsDiskStalled() was the reason. Assert FullStatus.DiskStalled
   directly via vtctldclient at each transition (true after SIGUSR1,
   false after SIGHUP) so a future regression in IsDiskStalled can't
   hide behind another health-path flipping the tablet for the wrong
   reason. FullStatus short-circuits on the same signal with no MySQL
   I/O involved, so the RPC returns instantly even mid-stall.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
The shard runs in a dedicated workflow that installs fuse3
unconditionally, so the Needs marker has no consumer.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings June 8, 2026 10:48
@timvaillancourt

Copy link
Copy Markdown
Contributor Author

In go/test/endtoend/tabletmanager/disk_health_monitor/main_test.go:110-115 the test puts the entire cluster VTDATAROOT on the gated FUSE mount before cluster.NewCluster, so the vttablet tmp/log directory also lives on the filesystem that is intentionally stalled. If vttablet exits while TestDiskHealthMonitor_StallAndRecover is waiting for NOT_SERVING, the existing harness tries to read vttablet.ErrorLog before returning (go/test/endtoend/cluster/vttablet_process.go:318), and that read can block behind the same FUSE gate. In that failure mode the SIGHUP cleanup defer never runs and CI burns the workflow timeout instead of failing quickly with diagnostics. Please keep the cluster logs/tmp outside the stalled mount, or otherwise ensure the gate is cleared before any failure-path log reads.

@mattlord good point, all we really need is the disk monitor to stall, so I've moved to that in 1d87813 👍

In go/test/endtoend/tabletmanager/disk_health_monitor/stall_test.go:52-58 the test only asserts the tablet becomes NOT_SERVING, but that status is the combined result of several state-manager predicates, not proof that IsDiskStalled() was the reason. Since this setup stalls the real MySQL datadir too, a future green run could be caused by another health path while the disk monitor signal is broken. Please assert the direct signal as well, e.g. after SIGUSR1 call FullStatus with a bounded context and require DiskStalled == true, then after SIGHUP require it clears before accepting the final SERVING transition.

Good point, FullStatus is the final interface for this state. I've updated this in 1d87813 👍

@timvaillancourt timvaillancourt requested a review from mattlord June 8, 2026 10:50

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.

Comment thread go/test/endtoend/tabletmanager/disk_health_monitor/stall_test.go
Comment thread go/test/endtoend/tabletmanager/disk_health_monitor/main_test.go
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings June 8, 2026 11:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 11 changed files in this pull request and generated 1 comment.

Comment thread test/config.json
@timvaillancourt timvaillancourt merged commit 0d44dd1 into vitessio:main Jun 11, 2026
115 of 116 checks passed
@timvaillancourt timvaillancourt deleted the vttablet-disk-health-monitor-e2e branch June 11, 2026 12:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: e2e testing for VTTablet's stalled-disk monitor (Linux/CI only)

4 participants