
Feature Request: e2e testing for VTTablet's stalled-disk monitor (Linux/CI only) #20091

@timvaillancourt

Feature Description

The stalled-disk monitor in vttablet (go/vt/vttablet/tabletserver/disk_health_monitor.go) flips IsDiskStalled() true when a periodic write to --disk-write-dir exceeds --disk-write-timeout. That signal short-circuits FullStatus() (rpc_replication.go:68) and forces the tablet out of SERVING (state_manager.go:788), so it's a load-bearing piece of the cluster's reaction to a hung disk.

Today only unit tests exist (disk_health_monitor_test.go) — they stub writeFunction, so they never prove the monitor actually detects a wedged filesystem under mysqld. We've never end-to-end exercised:

  1. mysqld and the monitor sharing a real datadir that goes unresponsive
  2. The tablet's serving state flipping
  3. VTOrc / the topo health stream observing the tablet as unhealthy

Related: #20056 (a different but adjacent disk-wedge class — ENOSPC on InnoDB filesystems).

Why this is hard to test portably

A faithful stall requires a filesystem where writes hang indefinitely. The two portable options — FUSE and loopback volumes — diverge sharply between Linux and macOS:

  • Linux: /dev/fuse + fusermount3; loopback via losetup
  • macOS: requires macFUSE (kext / system extension, user-installed), no losetup equivalent
    • macFUSE also requires users to manually trust a third-party package that Apple discourages, plus some privilege escalation. I'd prefer not to be the reason this bites someone
  • Blast radius on a dev machine: perhaps most crucially, an orphaned FUSE mount on a Mac dev box can wedge the user's Finder / Spotlight; an orphan on a throwaway CI runner is harmless. We shouldn't break developers' laptops 😅

For these reasons the initial scope is intentionally narrow.

Initial scope (explicit)

  • Linux only: //go:build linux build tag on all test files; on other OSes the package is invisible to the compiler
  • CI only: TestMain skips with a clear message unless CI or GITHUB_ACTIONS is set (i.e. running on a GitHub Actions worker); standard go test ./... and make test never touch it on dev machines, including Linux dev machines
  • One scenario: spawn a helper FUSE process, mount it under the tablet's VTDATAROOT, SIGSTOP the helper, assert the tablet leaves SERVING; SIGCONT the helper, assert it returns to SERVING
  • Out of scope (deferred to follow-ups):
    • macOS support
    • Loopback-volume variant
    • ERS / fail-over behavior under disk stall
    • VTOrc-side detection of PrimaryDiskStalled (worth its own e2e once we settle on the topo-health-stream assertion pattern)
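
The CI-only gate in the scope above could look roughly like the sketch below. `runningInCI` is a hypothetical name, and treating "set" as "non-empty" is an assumption; in the proposal the check would sit in TestMain, in files that also carry the //go:build linux constraint.

```go
package main

import (
	"fmt"
	"os"
)

// runningInCI reports whether CI or GITHUB_ACTIONS is set to a non-empty
// value (assumption: an empty value counts as unset). getenv is injected so
// the check can be exercised without touching the real environment.
func runningInCI(getenv func(string) string) bool {
	return getenv("CI") != "" || getenv("GITHUB_ACTIONS") != ""
}

func main() {
	// The `//go:build linux` constraint is omitted here so the sketch
	// builds on any OS; the real test files would carry it.
	if !runningInCI(os.Getenv) {
		fmt.Println("skipping: stalled-disk e2e runs only when CI or GITHUB_ACTIONS is set")
		return
	}
	fmt.Println("CI detected: running stalled-disk e2e")
}
```

Exiting early from TestMain with a printed reason keeps the skip visible in local test output.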

Proposed approach

1. FUSE helper binary

  • New helper at go/test/endtoend/tabletmanager/disk_health_monitor/fuse_helper/, built as a separate Go binary using github.com/hanwen/go-fuse/v2
  • Passthrough/loopback FUSE that forwards reads/writes to an underlying real directory. SIGSTOP to the helper PID is what stalls the volume; SIGCONT resumes it. No custom signal handling is needed for the stall itself: the kernel stops the helper process, so requests to the mount go unanswered until it resumes
  • Adds one new module dep (go-fuse/v2); pinned and added to go.mod/go.sum

2. Test harness (go/test/endtoend/tabletmanager/disk_health_monitor/)

  • Builds the FUSE helper in TestMain via go build ./fuse_helper, starts it, waits for it to print READY, then sets VTDATAROOT to live under the FUSE mount so the cluster's tablet datadir is FUSE-backed
  • TestMain teardown: cluster teardown first, then SIGCONT (defensive — in case a test panicked mid-stall) + SIGTERM the helper, then fusermount -u / fusermount3 -u as belt-and-suspenders
  • Brings up the minimum: topo + vtctld + one primary vttablet — no vtgate, no vtorc
  • vttablet flags: --disk-write-dir=<VTDATAROOT under FUSE>, --disk-write-interval=500ms, --disk-write-timeout=2s to keep wall-clock test time low

3. Assertions

  • Pre-stall: primary tablet reaches SERVING
  • SIGSTOP the helper → WaitForTabletStatusesForTimeout([]string{"NOT_SERVING"}, 30s) (expected transition ~3s; 30s budget for resource-starved CI per Vitess test guidance)
  • SIGCONT the helper → WaitForTabletStatusesForTimeout([]string{"SERVING"}, 30s), proving the monitor clears the stalled flag and the state manager re-promotes the tablet

4. CI wiring

  • New shard tabletmanager_disk_health_monitor registered in test/config.json with "Needs": ["fuse"]
  • Existing .github/workflows/cluster_endtoend.yml matrix picks it up automatically — the only workflow edit is appending ${{ contains(matrix.needs, 'fuse') && 'fuse3' || '' }} to the existing apt-get install line so fuse3 is installed only for this shard (same conditional-include pattern Java already uses in unit_test.yml)
  • No dedicated workflow file; reuses all existing setup (Go, MySQL, etcd, etc.)

Risks / open questions

  • GH Actions FUSE access: ubuntu-24.04 runners allow sudo apt-get install -y fuse3 non-interactively, and fusermount3 mounts work without --privileged. First CI run on the PR will confirm.
  • mysqld init on FUSE: in practice the passthrough is fast enough that mysqld bootstraps under the FUSE mount without needing a tmpfs-then-mv hop. Will reconfirm on first PR run; the fallback (bootstrap on tmpfs, move datadir before tablet start) is still available if init time blows up.
  • Helper-binary build: settled — TestMain invokes go build ./fuse_helper so contributors don't need a separate make step.
  • VTOrc detection of disk stall: deferred. Once we settle on a clean topo-health-stream assertion shape for "tablet is unhealthy", we can add a follow-up test that brings up VTOrc and asserts PrimaryDiskStalled recovery analysis fires.

Use Case(s)

Operators of Vitess clusters where the underlying block device or filesystem can hang indefinitely without returning EIO — common with networked storage (EBS, PD, NFS) and shared-tenant disks. The stalled-disk monitor is the in-process detector for this class of failure, and we currently have no test that proves the full chain (mysqld + vttablet on a real wedged filesystem → tablet leaves SERVING → cluster reacts) actually works end-to-end.
