VTTablet: add CI-only e2e test for disk health monitor by timvaillancourt · Pull Request #20212 · vitessio/vitess

timvaillancourt · 2026-05-29T14:01:25Z

Description

Implements the proposal in #20091 — a Linux/CI-only e2e test for VTTablet's stalled disk monitor, running against a real mysqld with the monitor's --disk-write-dir pointed at a passthrough FUSE filesystem we control from the test process. SIGUSR1 to the FUSE helper engages an in-process gate that blocks every mutating op against the mount (including the monitor's probe writes); we assert the primary tablet flips to NOT_SERVING, then SIGHUP to clear the gate and assert it returns to SERVING

Today only unit tests cover the monitor (disk_health_monitor_test.go), and they stub out the writeFunction — so we've never actually proven the full chain (monitor probe writes → IsDiskStalled → state_manager → tablet state transition) reacts to a real wedged filesystem

✅ (from this PR CI):

Test layout (`go/test/endtoend/tabletmanager/disk_health_monitor/`)

All test files carry //go:build linux, so on macOS / BSD the package is invisible to the compiler. TestMain additionally skips unless CI or GITHUB_ACTIONS is set — so it never runs on a dev machine, even a Linux one

fuse_helper/main.go — small Go binary using github.com/hanwen/go-fuse/v2 that mirrors a backing dir through the mount point. The helper gates its mutating FUSE ops (Create, Open, Setattr, Write, Fsync) on a signal-driven in-process latch: SIGUSR1 stalls, SIGHUP resumes. We explicitly avoid SIGSTOP/SIGCONT here because freezing the helper's Go runtime mid-stall would also block delivery of SIGTERM, leaving cluster teardown wedged on a hung FUSE mount if a test panicked
main_test.go — builds + starts the helper, mounts FUSE under a temp dir, brings up the minimum cluster (topo + vtctld + 1 primary vttablet) with --disk-write-dir pointed at the FUSE mount. mysqld's datadir and vttablet's logs stay on real disk, so cluster I/O (including failure-path log reads in the harness) is outside the gate
stall_test.go — single test (TestDiskHealthMonitor_StallAndRecover) covering both the stall (SERVING → NOT_SERVING) and recovery (NOT_SERVING → SERVING) transitions, with a 30s budget per transition (generous to keep CI quiet under resource pressure — the actual flip happens in ~3s)

CI wiring

New shard tabletmanager_disk_health_monitor in test/config.json. The shard runs in a dedicated workflow (.github/workflows/cluster_endtoend_disk_health_monitor.yml) rather than the shared matrix — the matrix in cluster_endtoend.yml uses a single go/**/*.go paths filter for every shard, and we wanted a narrower trigger surface here, scoped to go/vt/vttablet/tabletserver/**, go/test/endtoend/cluster/**, and the test directory. To keep both workflows from running it, the shard is added to EXCLUDE_SHARDS in cluster_endtoend.yml. The dedicated workflow also installs fuse3 (the only shard that needs it today)

Why Linux/CI only

macOS FUSE (macFUSE) is a kext / system extension that users have to install + trust, and an orphaned FUSE mount on a Mac dev box can wedge Finder / Spotlight 😅. It needs separate code and privilege escalations I don't think we should play with on developer's 💻s. The CI proves the monitor works on the platform Vitess is deployed to in reality. Full reasoning is in #20091

Related Issue(s)

Resolves: #20091

Adjacent: #20056 (ReplicationStalledDiskFull — a different but related disk-wedge class)

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

Deployment Notes

No runtime impact — test-only addition. Adds one new module dep (github.com/hanwen/go-fuse/v2) and one new CI shard (tabletmanager_disk_health_monitor). Not for backport

AI Disclosure

Claude Code assisted with the implementation, testing, and PR summary

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

vitess-bot · 2026-05-29T14:02:04Z

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

mattlord · 2026-05-29T15:37:30Z

+    "tabletmanager_disk_health_monitor": {
+      "File": "unused.go",
+      "Packages": [
+        "vitess.io/vitess/go/test/endtoend/tabletmanager/disk_health_monitor"
+      ],
+      "Args": [],
+      "Command": [],
+      "Manual": false,
+      "Shard": "tabletmanager_disk_health_monitor",
+      "Tags": [],
+      "Needs": [
+        "fuse"
+      ]
+    },


IMO this does not need to run on every push to every PR or on every push to main. I'd say that this could be a manual test. Adding a new CI workflow for fixes just isn't sustainable. I'm not saying that we can't do this here, that's just my personal preference/opinion that this isn't one of those super critical behaviors that would warrant it.

@mattlord fair point. I agree we should not allow a "slippery slope" with e2e tests, but the disk health monitor is a feature that is used to ensure the availability of Vitess - it is an optional signal to VTOrc for ERS operations, for example

Based on a recent incident investigation (the disk filling caused a significant outage), in an upcoming PR(s) I plan to add new capabilities to the disk monitor. These capabilities will also be used to ensure the availability of shards, and thus must be proven to work - if not, I think this would be inconsistent with how we ship important functionality. In the longer term, I think the disk monitor should be a candidate for becoming a default feature, due to the preventable issues it can address

To address the CI concern, I have made this CI workflow isolated, and it now only runs when there are changes to the disk health monitor code, tabletserver, the e2e harness and go modules. I think this is the right tradeoff that still ensures this important logic works continuously, while avoiding unnecessary CI work. Let me know what you think

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

…monitor-e2e Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

…monitor-e2e Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> # Conflicts: # go.sum

Trigger the shard on changes to the wider tabletserver package and shared e2e cluster harness, not just disk_health_monitor*.go, so integration regressions in state-manager or cluster-startup code can no longer slip past this workflow. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Per CLAUDE.md, tests should use testify's require/assert helpers rather than t.Fatal/t.Error so failures are reported consistently with the rest of the test suite. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Copilot

Pull request overview

Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.

mattlord

In go/test/endtoend/tabletmanager/disk_health_monitor/main_test.go:110-115 the test puts the entire cluster VTDATAROOT on the gated FUSE mount before cluster.NewCluster, so the vttablet tmp/log directory also lives on the filesystem that is intentionally stalled. If vttablet exits while TestDiskHealthMonitor_StallAndRecover is waiting for NOT_SERVING, the existing harness tries to read vttablet.ErrorLog before returning (go/test/endtoend/cluster/vttablet_process.go:318), and that read can block behind the same FUSE gate. In that failure mode the SIGHUP cleanup defer never runs and CI burns the workflow timeout instead of failing quickly with diagnostics. Please keep the cluster logs/tmp outside the stalled mount, or otherwise ensure the gate is cleared before any failure-path log reads.

In go/test/endtoend/tabletmanager/disk_health_monitor/stall_test.go:52-58 the test only asserts the tablet becomes NOT_SERVING, but that status is the combined result of several state-manager predicates, not proof that IsDiskStalled() was the reason. Since this setup stalls the real MySQL datadir too, a future green run could be caused by another health path while the disk monitor signal is broken. Please assert the direct signal as well, e.g. after SIGUSR1 call FullStatus with a bounded context and require DiskStalled == true, then after SIGHUP require it clears before accepting the final SERVING transition.

Two issues raised in review: 1. The cluster's VTDATAROOT was set to the gated FUSE mount, so vttablet's log files lived behind the same gate as the monitor's probe writes. If vttablet exited while the harness was polling for NOT_SERVING, the next line (os.ReadFile(ErrorLog) in vttablet_process.go) would block on the stalled FUSE filesystem, wedging the test goroutine past its 30s timeout and burning the workflow budget. Only point --disk-write-dir at the FUSE mount now; mysqld's datadir and vttablet's logs stay on real disk. 2. NOT_SERVING is the AND of six state-manager predicates, not proof that IsDiskStalled() was the reason. Assert FullStatus.DiskStalled directly via vtctldclient at each transition (true after SIGUSR1, false after SIGHUP) so a future regression in IsDiskStalled can't hide behind another health-path flipping the tablet for the wrong reason. FullStatus short-circuits on the same signal with no MySQL I/O involved, so the RPC returns instantly even mid-stall. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

The shard runs in a dedicated workflow that installs fuse3 unconditionally, so the Needs marker has no consumer. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

timvaillancourt · 2026-06-08T10:49:55Z

In go/test/endtoend/tabletmanager/disk_health_monitor/main_test.go:110-115 the test puts the entire cluster VTDATAROOT on the gated FUSE mount before cluster.NewCluster, so the vttablet tmp/log directory also lives on the filesystem that is intentionally stalled. If vttablet exits while TestDiskHealthMonitor_StallAndRecover is waiting for NOT_SERVING, the existing harness tries to read vttablet.ErrorLog before returning (go/test/endtoend/cluster/vttablet_process.go:318), and that read can block behind the same FUSE gate. In that failure mode the SIGHUP cleanup defer never runs and CI burns the workflow timeout instead of failing quickly with diagnostics. Please keep the cluster logs/tmp outside the stalled mount, or otherwise ensure the gate is cleared before any failure-path log reads.

@mattlord good point, all we really need is the disk monitor to stall, so I've moved to that in 1d87813 👍

In go/test/endtoend/tabletmanager/disk_health_monitor/stall_test.go:52-58 the test only asserts the tablet becomes NOT_SERVING, but that status is the combined result of several state-manager predicates, not proof that IsDiskStalled() was the reason. Since this setup stalls the real MySQL datadir too, a future green run could be caused by another health path while the disk monitor signal is broken. Please assert the direct signal as well, e.g. after SIGUSR1 call FullStatus with a bounded context and require DiskStalled == true, then after SIGHUP require it clears before accepting the final SERVING transition.

Good point, FullStatus is the final interface for this state. I've updated this in 1d87813 👍

Copilot

Pull request overview

Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Copilot

Pull request overview

Copilot reviewed 9 out of 11 changed files in this pull request and generated 1 comment.

timvaillancourt added 2 commits May 29, 2026 15:29

VTTablet: add CI-only e2e test for disk health monitor

40911e1

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

pr suggestion

7df80e7

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Copilot AI review requested due to automatic review settings May 29, 2026 14:01

Copilot started reviewing on behalf of timvaillancourt May 29, 2026 14:01 View session

github-actions Bot added this to the v25.0.0 milestone May 29, 2026

github-actions Bot added Component: Build/CI Component: TabletManager Type: Dependencies Dependency updates labels May 29, 2026

timvaillancourt added Type: CI/Build Component: VTTablet and removed Component: TabletManager Type: Dependencies Dependency updates labels May 29, 2026

timvaillancourt self-assigned this May 29, 2026

This comment was marked as outdated.

Sign in to view

timvaillancourt marked this pull request as ready for review May 29, 2026 14:11

timvaillancourt requested review from frouioui, harshit-gangal, mattlord, rohit-nayak-ps and systay as code owners May 29, 2026 14:11

linter

d1e6bbf

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

This comment was marked as outdated.

Sign in to view

mattlord reviewed May 29, 2026

View reviewed changes

timvaillancourt added 2 commits June 4, 2026 17:42

separate the workflow, add .gitignore

1817fe2

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Merge remote-tracking branch 'origin/main' into vttablet-disk-health-…

eb95cf5

…monitor-e2e Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Copilot AI review requested due to automatic review settings June 4, 2026 15:48

Copilot started reviewing on behalf of timvaillancourt June 4, 2026 15:48 View session

Merge remote-tracking branch 'origin/main' into vttablet-disk-health-…

498f32c

…monitor-e2e Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> # Conflicts: # go.sum

This comment was marked as outdated.

Sign in to view

timvaillancourt added 2 commits June 4, 2026 19:45

test: use require.Failf instead of t.Fatalf

0f3c822

Per CLAUDE.md, tests should use testify's require/assert helpers rather than t.Fatal/t.Error so failures are reported consistently with the rest of the test suite. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Copilot AI review requested due to automatic review settings June 4, 2026 18:48

Copilot started reviewing on behalf of timvaillancourt June 4, 2026 18:48 View session

Copilot AI reviewed Jun 4, 2026

View reviewed changes

Comment thread go/test/endtoend/tabletmanager/disk_health_monitor/fuse_helper/main.go

Comment thread .github/workflows/cluster_endtoend.yml

mattlord reviewed Jun 4, 2026

View reviewed changes

timvaillancourt added 2 commits June 8, 2026 12:30

test/config: drop unused fuse Needs marker from disk-health shard

2d0d1b1

The shard runs in a dedicated workflow that installs fuse3 unconditionally, so the Needs marker has no consumer. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Copilot AI review requested due to automatic review settings June 8, 2026 10:48

Copilot started reviewing on behalf of timvaillancourt June 8, 2026 10:48 View session

timvaillancourt requested a review from mattlord June 8, 2026 10:50

Copilot AI reviewed Jun 8, 2026

View reviewed changes

Comment thread go/test/endtoend/tabletmanager/disk_health_monitor/stall_test.go

Comment thread go/test/endtoend/tabletmanager/disk_health_monitor/main_test.go

Potential fix for pull request finding

d74cd7d

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Copilot AI review requested due to automatic review settings June 8, 2026 11:12

Copilot started reviewing on behalf of timvaillancourt June 8, 2026 11:12 View session

Copilot AI reviewed Jun 8, 2026

View reviewed changes

Comment thread test/config.json

mattlord approved these changes Jun 9, 2026

View reviewed changes

mhamza15 approved these changes Jun 11, 2026

View reviewed changes

timvaillancourt merged commit 0d44dd1 into vitessio:main Jun 11, 2026
115 of 116 checks passed

timvaillancourt deleted the vttablet-disk-health-monitor-e2e branch June 11, 2026 12:54

timvaillancourt mentioned this pull request Jun 14, 2026

VTOrc: support disk-full detection and recovery of the PRIMARY #20318

Open

5 tasks

Conversation

timvaillancourt commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Test layout (go/test/endtoend/tabletmanager/disk_health_monitor/)

CI wiring

Why Linux/CI only

Related Issue(s)

Checklist

Deployment Notes

AI Disclosure

Uh oh!

vitess-bot Bot commented May 29, 2026

Review Checklist

General

Tests

Documentation

New flags

If a workflow is added or modified:

Backward compatibility

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

mattlord May 29, 2026

Choose a reason for hiding this comment

Uh oh!

timvaillancourt Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

mattlord left a comment

Choose a reason for hiding this comment

Uh oh!

timvaillancourt commented Jun 8, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

timvaillancourt commented May 29, 2026 •

edited

Loading

Test layout (`go/test/endtoend/tabletmanager/disk_health_monitor/`)