Feature Request: e2e testing for VTTablet's stalled-disk monitor (Linux/CI only) #20091

### Feature Description

The stalled-disk monitor in vttablet (`go/vt/vttablet/tabletserver/disk_health_monitor.go`) flips `IsDiskStalled()` to true when a periodic write to `--disk-write-dir` exceeds `--disk-write-timeout`. That signal short-circuits `FullStatus()` (`rpc_replication.go:68`) and forces the tablet out of `SERVING` (`state_manager.go:788`), so it's a load-bearing piece of the cluster's reaction to a hung disk.

Today only unit tests exist (`disk_health_monitor_test.go`). They stub `writeFunction`, so they never prove that the monitor actually detects a wedged filesystem under mysqld. We have never exercised the following end to end:

1. `mysqld` and the monitor sharing a real `datadir` that goes unresponsive
2. The tablet's serving state flipping
3. VTOrc / the topo health stream observing the tablet as unhealthy

Related: #20056 (a different but adjacent disk-wedge class — `ENOSPC` on InnoDB filesystems).

### Why this is hard to test portably

A faithful stall requires a filesystem where writes hang indefinitely. The two portable options — FUSE and loopback volumes — diverge sharply between Linux and macOS:

- **Linux**: `/dev/fuse` + `fusermount3`; loopback via `losetup`
- **macOS**: requires `macFUSE` (kext / system extension, user-installed), no `losetup` equivalent
    - `macFUSE` also requires users to manually trust a third-party package that Apple discourages, plus some privilege escalation. I'd prefer not to be the reason this bites someone
- **Blast radius on a dev machine**: perhaps most crucially, an orphaned FUSE mount on a Mac dev box can wedge the user's Finder / Spotlight; an orphan on a throwaway CI runner is harmless. We shouldn't break developers' laptops 😅

For these reasons the initial scope is intentionally narrow.

### Initial scope (explicit)

- **Linux only** — `//go:build linux` build tag on all test files; on other OSes the package is invisible to the compiler
- **CI only** — `TestMain` skips with a clear message unless `CI` or `GITHUB_ACTIONS` is set (i.e. running on a GitHub Actions worker); standard `go test ./...` and `make test` never touch it on dev machines, including Linux dev machines
- **One scenario**: spawn a helper FUSE process, mount it under the tablet's `VTDATAROOT`, `SIGSTOP` the helper, assert the tablet leaves `SERVING`; `SIGCONT` the helper, assert it returns to `SERVING`
- **Out of scope** (deferred to follow-ups):
  - macOS support
  - Loopback-volume variant
  - ERS / fail-over behavior under disk stall
  - VTOrc-side detection of `PrimaryDiskStalled` _(worth its own e2e once we settle on the topo-health-stream assertion pattern)_
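The CI-only gate could be sketched roughly as follows. `runningInCI` is a hypothetical name. In the real test file this check would sit in `TestMain(m *testing.M)` behind the `//go:build linux` tag and call `m.Run()` only when it passes; the sketch uses `main` so it can run standalone.

```go
package main

import (
	"fmt"
	"os"
)

// runningInCI reports whether either CI marker variable is set.
// getenv is injected so the check can be tested without touching the
// real environment. (Hypothetical helper for illustration.)
func runningInCI(getenv func(string) string) bool {
	return getenv("CI") != "" || getenv("GITHUB_ACTIONS") != ""
}

func main() {
	if !runningInCI(os.Getenv) {
		// In TestMain this would print the message and exit 0 without m.Run().
		fmt.Println("skipping disk-stall e2e: set CI or GITHUB_ACTIONS to run")
		return
	}
	fmt.Println("CI detected: the test suite would run here")
}
```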

### Proposed approach

**1. FUSE helper binary**

- New helper at `go/test/endtoend/tabletmanager/disk_health_monitor/fuse_helper/`, built as a separate Go binary using `github.com/hanwen/go-fuse/v2`
- Passthrough/loopback FUSE that forwards reads/writes to an underlying real directory. `SIGSTOP` to the helper PID is what stalls the volume; `SIGCONT` resumes it. No custom signal handling is needed for the stall itself: a stopped daemon never answers the kernel's FUSE requests, so callers block and the kernel does the work
- Adds one new module dep (`go-fuse/v2`); pinned and added to `go.mod`/`go.sum`

**2. Test harness** (`go/test/endtoend/tabletmanager/disk_health_monitor/`)

- Builds the FUSE helper in `TestMain` via `go build ./fuse_helper`, starts it, waits for it to print `READY`, then sets `VTDATAROOT` to live under the FUSE mount so the cluster's tablet datadir is FUSE-backed
- `TestMain` teardown: cluster teardown first, then `SIGCONT` _(defensive — in case a test panicked mid-stall)_ + `SIGTERM` the helper, then `fusermount -u` / `fusermount3 -u` as belt-and-suspenders
- Brings up the minimum: topo + vtctld + one primary vttablet — no vtgate, no vtorc
- vttablet flags: `--disk-write-dir=<VTDATAROOT under FUSE>`, `--disk-write-interval=500ms`, `--disk-write-timeout=2s` to keep wall-clock test time low

**3. Assertions**

- Pre-stall: primary tablet reaches `SERVING`
- `SIGSTOP` the helper → `WaitForTabletStatusesForTimeout([]string{"NOT_SERVING"}, 30s)` _(expected transition ~3s; 30s budget for resource-starved CI per Vitess test guidance)_
- `SIGCONT` the helper → `WaitForTabletStatusesForTimeout([]string{"SERVING"}, 30s)`, proving the monitor clears the stalled flag and the state manager re-promotes the tablet

**4. CI wiring**

- New shard `tabletmanager_disk_health_monitor` registered in `test/config.json` with `"Needs": ["fuse"]`
- Existing `.github/workflows/cluster_endtoend.yml` matrix picks it up automatically — the only workflow edit is appending `${{ contains(matrix.needs, 'fuse') && 'fuse3' || '' }}` to the existing `apt-get install` line so `fuse3` is installed *only* for this shard _(same conditional-include pattern Java already uses in `unit_test.yml`)_
- No dedicated workflow file; reuses all existing setup _(Go, MySQL, etcd, etc.)_

### Risks / open questions

- **GH Actions FUSE access**: `ubuntu-24.04` runners should allow `sudo apt-get install -y fuse3` non-interactively, and `fusermount3` mounts should work without `--privileged`. The first CI run on the PR will confirm.
- **mysqld init on FUSE**: in practice the passthrough is fast enough that mysqld bootstraps under the FUSE mount without needing a tmpfs-then-`mv` hop. Will reconfirm on first PR run; the fallback _(bootstrap on tmpfs, move datadir before tablet start)_ is still available if init time blows up.
- **Helper-binary build**: settled — `TestMain` invokes `go build ./fuse_helper` so contributors don't need a separate make step.
- **VTOrc detection of disk stall**: deferred. Once we settle on a clean topo-health-stream assertion shape for "tablet is unhealthy", we can add a follow-up test that brings up VTOrc and asserts `PrimaryDiskStalled` recovery analysis fires.

### Use Case(s)

Operators of Vitess clusters where the underlying block device or filesystem can hang indefinitely without returning `EIO` — common with networked storage (EBS, PD, NFS) and shared-tenant disks. The stalled-disk monitor is the in-process detector for this class of failure, and we currently have no test that proves the full chain (mysqld + vttablet on a real wedged filesystem → tablet leaves SERVING → cluster reacts) actually works end-to-end.

