VTTablet: add CI-only e2e test for disk health monitor #20212
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
    "tabletmanager_disk_health_monitor": {
        "File": "unused.go",
        "Packages": [
            "vitess.io/vitess/go/test/endtoend/tabletmanager/disk_health_monitor"
        ],
        "Args": [],
        "Command": [],
        "Manual": false,
        "Shard": "tabletmanager_disk_health_monitor",
        "Tags": [],
        "Needs": [
            "fuse"
        ]
    },
IMO this does not need to run on every push to every PR or on every push to main. I'd say that this could be a manual test. Adding a new CI workflow for fixes just isn't sustainable. I'm not saying that we can't do this here, that's just my personal preference/opinion that this isn't one of those super critical behaviors that would warrant it.
@mattlord fair point. I agree we should not allow a "slippery slope" with e2e tests, but the disk health monitor is a feature used to ensure the availability of Vitess: for example, it is an optional signal to VTOrc for ERS operations.
Based on a recent incident investigation (a disk filling up caused a significant outage), I plan to add new capabilities to the disk monitor in one or more upcoming PRs. These capabilities will also be used to ensure the availability of shards and thus must be proven to work; shipping them unproven would, I think, be inconsistent with how we ship important functionality. In the longer term, I think the disk monitor should be a candidate for becoming a default feature, given the preventable issues it can address.
To address the CI concern, I have isolated this CI workflow, and it now only runs when there are changes to the disk health monitor code, tabletserver, the e2e harness, and Go modules. I think this is the right tradeoff: it still ensures this important logic works continuously while avoiding unnecessary CI work. Let me know what you think.
…monitor-e2e Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
…monitor-e2e Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> # Conflicts: # go.sum
Trigger the shard on changes to the wider tabletserver package and shared e2e cluster harness, not just disk_health_monitor*.go, so integration regressions in state-manager or cluster-startup code can no longer slip past this workflow. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Per CLAUDE.md, tests should use testify's require/assert helpers rather than t.Fatal/t.Error so failures are reported consistently with the rest of the test suite. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
mattlord left a comment
In go/test/endtoend/tabletmanager/disk_health_monitor/main_test.go:110-115 the test puts the entire cluster VTDATAROOT on the gated FUSE mount before cluster.NewCluster, so the vttablet tmp/log directory also lives on the filesystem that is intentionally stalled. If vttablet exits while TestDiskHealthMonitor_StallAndRecover is waiting for NOT_SERVING, the existing harness tries to read vttablet.ErrorLog before returning (go/test/endtoend/cluster/vttablet_process.go:318), and that read can block behind the same FUSE gate. In that failure mode the SIGHUP cleanup defer never runs and CI burns the workflow timeout instead of failing quickly with diagnostics. Please keep the cluster logs/tmp outside the stalled mount, or otherwise ensure the gate is cleared before any failure-path log reads.
In go/test/endtoend/tabletmanager/disk_health_monitor/stall_test.go:52-58 the test only asserts the tablet becomes NOT_SERVING, but that status is the combined result of several state-manager predicates, not proof that IsDiskStalled() was the reason. Since this setup stalls the real MySQL datadir too, a future green run could be caused by another health path while the disk monitor signal is broken. Please assert the direct signal as well, e.g. after SIGUSR1 call FullStatus with a bounded context and require DiskStalled == true, then after SIGHUP require it clears before accepting the final SERVING transition.
Two issues raised in review:
1. The cluster's VTDATAROOT was set to the gated FUSE mount, so vttablet's log files lived behind the same gate as the monitor's probe writes. If vttablet exited while the harness was polling for NOT_SERVING, the next line (os.ReadFile(ErrorLog) in vttablet_process.go) would block on the stalled FUSE filesystem, wedging the test goroutine past its 30s timeout and burning the workflow budget. Only point --disk-write-dir at the FUSE mount now; mysqld's datadir and vttablet's logs stay on real disk.
2. NOT_SERVING is the AND of six state-manager predicates, not proof that IsDiskStalled() was the reason. Assert FullStatus.DiskStalled directly via vtctldclient at each transition (true after SIGUSR1, false after SIGHUP) so a future regression in IsDiskStalled can't hide behind another health-path flipping the tablet for the wrong reason. FullStatus short-circuits on the same signal with no MySQL I/O involved, so the RPC returns instantly even mid-stall.
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
The shard runs in a dedicated workflow that installs fuse3 unconditionally, so the Needs marker has no consumer. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@mattlord good point, all we really need is the disk monitor to stall, so I've moved to that in 1d87813 👍
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Description
Implements the proposal in #20091: a Linux/CI-only e2e test for VTTablet's stalled disk monitor, running against a real mysqld with the monitor's --disk-write-dir pointed at a passthrough FUSE filesystem we control from the test process. SIGUSR1 to the FUSE helper engages an in-process gate that blocks every mutating op against the mount (including the monitor's probe writes); we assert the primary tablet flips to NOT_SERVING, then SIGHUP to clear the gate and assert it returns to SERVING.
Today only unit tests cover the monitor (disk_health_monitor_test.go), and they stub out the writeFunction, so we've never actually proven the full chain (monitor probe writes → IsDiskStalled → state_manager → tablet state transition) reacts to a real wedged filesystem.
✅ (from this PR CI):

Test layout (go/test/endtoend/tabletmanager/disk_health_monitor/)
All test files carry //go:build linux, so on macOS / BSD the package is invisible to the compiler. TestMain additionally skips unless CI or GITHUB_ACTIONS is set, so it never runs on a dev machine, even a Linux one.
fuse_helper/main.go: small Go binary using github.com/hanwen/go-fuse/v2 that mirrors a backing dir through the mount point. The helper gates its mutating FUSE ops (Create, Open, Setattr, Write, Fsync) on a signal-driven in-process latch: SIGUSR1 stalls, SIGHUP resumes. We explicitly avoid SIGSTOP/SIGCONT here because freezing the helper's Go runtime mid-stall would also block delivery of SIGTERM, leaving cluster teardown wedged on a hung FUSE mount if a test panicked.
main_test.go: builds + starts the helper, mounts FUSE under a temp dir, brings up the minimum cluster (topo + vtctld + 1 primary vttablet) with --disk-write-dir pointed at the FUSE mount. mysqld's datadir and vttablet's logs stay on real disk, so cluster I/O (including failure-path log reads in the harness) is outside the gate.
stall_test.go: single test (TestDiskHealthMonitor_StallAndRecover) covering both the stall (SERVING → NOT_SERVING) and recovery (NOT_SERVING → SERVING) transitions, with a 30s budget per transition (generous to keep CI quiet under resource pressure; the actual flip happens in ~3s).
CI wiring
New shard tabletmanager_disk_health_monitor in test/config.json. The shard runs in a dedicated workflow (.github/workflows/cluster_endtoend_disk_health_monitor.yml) rather than the shared matrix: the matrix in cluster_endtoend.yml uses a single go/**/*.go paths filter for every shard, and we wanted a narrower trigger surface here, scoped to go/vt/vttablet/tabletserver/**, go/test/endtoend/cluster/**, and the test directory. To keep both workflows from running it, the shard is added to EXCLUDE_SHARDS in cluster_endtoend.yml. The dedicated workflow also installs fuse3 (the only shard that needs it today).
Why Linux/CI only
macOS FUSE (macFUSE) is a kext / system extension that users have to install + trust, and an orphaned FUSE mount on a Mac dev box can wedge Finder / Spotlight 😅. It needs separate code and privilege escalations I don't think we should play with on developers' 💻s. The CI proves the monitor works on the platform Vitess is deployed to in reality. Full reasoning is in #20091.
Related Issue(s)
Resolves: #20091
Adjacent: #20056 (ReplicationStalledDiskFull, a different but related disk-wedge class)
Checklist
Deployment Notes
No runtime impact: test-only addition. Adds one new module dep (github.com/hanwen/go-fuse/v2) and one new CI shard (tabletmanager_disk_health_monitor). Not for backport.
AI Disclosure
Claude Code assisted with the implementation, testing, and PR summary