Feature Description
The stalled-disk monitor in vttablet (go/vt/vttablet/tabletserver/disk_health_monitor.go) flips IsDiskStalled() true when a periodic write to --disk-write-dir exceeds --disk-write-timeout. That signal short-circuits FullStatus() (rpc_replication.go:68) and forces the tablet out of SERVING (state_manager.go:788), so it's a load-bearing piece of the cluster's reaction to a hung disk.
Today only unit tests exist (disk_health_monitor_test.go) — they stub writeFunction, so they never prove the monitor actually detects a wedged filesystem under mysqld. We've never end-to-end exercised:
- mysqld and the monitor sharing a real datadir that goes unresponsive
- The tablet's serving state flipping
- VTOrc / the topo health stream observing the tablet as unhealthy
Related: #20056 (a different but adjacent disk-wedge class — ENOSPC on InnoDB filesystems).
Why this is hard to test portably
A faithful stall requires a filesystem where writes hang indefinitely. The two portable options — FUSE and loopback volumes — diverge sharply between Linux and macOS:
- Linux: /dev/fuse + fusermount3; loopback via losetup
- macOS: requires macFUSE (kext / system extension, user-installed), no losetup equivalent
- macFUSE also requires users to manually trust a third-party package that Apple doesn't like, plus some privilege escalation. I'd prefer not to be the reason this bites someone.
- Blast radius on a dev machine: perhaps most crucially, an orphaned FUSE mount on a Mac dev box can wedge the user's Finder / Spotlight; an orphan on a throwaway CI runner is harmless. We shouldn't break developers' laptops 😅
For these reasons the initial scope is intentionally narrow.
Initial scope (explicit)
- Linux only: //go:build linux build tag on all test files; on other OSes the package is invisible to the compiler
- CI only: TestMain skips with a clear message unless CI or GITHUB_ACTIONS is set (i.e. running on a GitHub Actions worker); standard go test ./... and make test never touch it on dev machines, including Linux dev machines
- One scenario: spawn a helper FUSE process, mount it under the tablet's VTDATAROOT, SIGSTOP the helper, assert the tablet leaves SERVING; SIGCONT the helper, assert it returns to SERVING
- Out of scope (deferred to follow-ups):
  - macOS support
  - Loopback-volume variant
  - ERS / fail-over behavior under disk stall
  - VTOrc-side detection of PrimaryDiskStalled (worth its own e2e once we settle on the topo-health-stream assertion pattern)
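The CI-only gate described in the scope above could look roughly like the following. This is a minimal sketch, not the final harness: the names ciMarkers, runningInCI, and gate are hypothetical, and the exact skip message is an assumption. Only the two environment variables come from the proposal.

```go
// Sketch of the CI-only gate. In the real test file a "go:build linux"
// constraint would sit above the package clause, and the package would be
// the e2e test package rather than main.
package main

import (
	"fmt"
	"os"
)

// ciMarkers are the environment variables named in the scope above.
var ciMarkers = []string{"CI", "GITHUB_ACTIONS"}

// runningInCI reports whether any marker is set to a non-empty value.
// getenv is injected so the check can be exercised without mutating the
// process environment; production callers pass os.Getenv.
func runningInCI(getenv func(string) string) bool {
	for _, name := range ciMarkers {
		if getenv(name) != "" {
			return true
		}
	}
	return false
}

// gate (hypothetical) runs the suite only in CI. Elsewhere it prints a skip
// notice and returns 0, so go test reports success without mounting anything.
func gate(run func() int) int {
	if !runningInCI(os.Getenv) {
		fmt.Fprintln(os.Stderr, "skipping disk stall e2e: requires CI or GITHUB_ACTIONS to be set")
		return 0
	}
	return run()
}
```

In a test file this would be wired as TestMain calling os.Exit(gate(m.Run)), so that cluster setup and the FUSE mount only happen inside the run callback.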
Proposed approach
1. FUSE helper binary
- New helper at go/test/endtoend/tabletmanager/disk_health_monitor/fuse_helper/, built as a separate Go binary using github.com/hanwen/go-fuse/v2
- Passthrough/loopback FUSE that forwards reads/writes to an underlying real directory. SIGSTOP to the helper PID is what stalls the volume; SIGCONT resumes it. No custom signal handling for the stall itself; the kernel does the work
- Adds one new module dep (go-fuse/v2); pinned and added to go.mod/go.sum
2. Test harness (go/test/endtoend/tabletmanager/disk_health_monitor/)
- Builds the FUSE helper in TestMain via go build ./fuse_helper, starts it, waits for it to print READY, then sets VTDATAROOT to live under the FUSE mount so the cluster's tablet datadir is FUSE-backed
- TestMain teardown: cluster teardown first, then SIGCONT (defensive, in case a test panicked mid-stall) + SIGTERM the helper, then fusermount -u / fusermount3 -u as belt-and-suspenders
- Brings up the minimum: topo + vtctld + one primary vttablet; no vtgate, no vtorc
- vttablet flags: --disk-write-dir=<VTDATAROOT under FUSE>, --disk-write-interval=500ms, --disk-write-timeout=2s to keep wall-clock test time low
3. Assertions
- Pre-stall: primary tablet reaches SERVING
- SIGSTOP the helper → WaitForTabletStatusesForTimeout([]string{"NOT_SERVING"}, 30s) (expected transition ~3s; 30s budget for resource-starved CI per Vitess test guidance)
- SIGCONT the helper → WaitForTabletStatusesForTimeout([]string{"SERVING"}, 30s), proving the monitor clears the stalled flag and the state manager returns the tablet to SERVING
4. CI wiring
- New shard tabletmanager_disk_health_monitor registered in test/config.json with "Needs": ["fuse"]
- Existing .github/workflows/cluster_endtoend.yml matrix picks it up automatically; the only workflow edit is appending ${{ contains(matrix.needs, 'fuse') && 'fuse3' || '' }} to the existing apt-get install line so fuse3 is installed only for this shard (same conditional-include pattern Java already uses in unit_test.yml)
- No dedicated workflow file; reuses all existing setup (Go, MySQL, etcd, etc.)
Risks / open questions
- GH Actions FUSE access: ubuntu-24.04 runners allow sudo apt-get install -y fuse3 non-interactively, and fusermount3 mounts work without --privileged. First CI run on the PR will confirm.
- mysqld init on FUSE: in practice the passthrough is fast enough that mysqld bootstraps under the FUSE mount without needing a tmpfs-then-mv hop. Will reconfirm on first PR run; the fallback (bootstrap on tmpfs, move datadir before tablet start) is still available if init time blows up.
- Helper-binary build: settled. TestMain invokes go build ./fuse_helper so contributors don't need a separate make step.
- VTOrc detection of disk stall: deferred. Once we settle on a clean topo-health-stream assertion shape for "tablet is unhealthy", we can add a follow-up test that brings up VTOrc and asserts that the PrimaryDiskStalled recovery analysis fires.
Use Case(s)
Operators of Vitess clusters where the underlying block device or filesystem can hang indefinitely without returning EIO — common with networked storage (EBS, PD, NFS) and shared-tenant disks. The stalled-disk monitor is the in-process detector for this class of failure, and we currently have no test that proves the full chain (mysqld + vttablet on a real wedged filesystem → tablet leaves SERVING → cluster reacts) actually works end-to-end.