VTOrc: detect stalled replication on a full disk #20058
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Pull request overview
This PR adds a new VTOrc analysis (ReplicationStalledDiskFull) to detect replicas that appear healthy (IO/SQL threads running) but are actually wedged due to ENOSPC on the InnoDB filesystem, and wires the signal from tablet FullStatus through VTOrc instance discovery and analysis. It also introduces a Linux-only, opt-in end-to-end test plus a dedicated GitHub Actions workflow to validate the behavior under a real loopback ext4 disk-full scenario.
Changes:
- Extend `replicationdata.FullStatus` with `replication_stalled_disk_full` and plumb it through vttablet `FullStatus` and VTOrc instance discovery/analysis.
- Add MySQL capability gating + best-effort MySQL-side detection query, including soft-fail behavior for missing table support / missing grants.
- Add a Linux-only disk-full E2E test and CI workflow; extend the E2E cluster framework to support per-mysqlctl `EXTRA_MY_CNF` overrides.
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| web/vtadmin/src/proto/vtadmin.js | Regenerated JS proto bindings to include replication_stalled_disk_full on FullStatus. |
| web/vtadmin/src/proto/vtadmin.d.ts | Regenerated TS typings to include replication_stalled_disk_full on FullStatus. |
| test/config.json | Adds a manual test target/shard entry for the new disk-full vtorc E2E. |
| proto/replicationdata.proto | Adds replication_stalled_disk_full field (tag 26) to FullStatus. |
| go/vt/vttablet/tabletmanager/rpc_replication.go | Populates the new FullStatus.ReplicationStalledDiskFull via MysqlDaemon. |
| go/vt/vtorc/test/recovery_analysis.go | Extends test row-map plumbing with replication_stalled_disk_full. |
| go/vt/vtorc/inst/instance.go | Adds ReplicationStalledDiskFull to the VTOrc Instance model. |
| go/vt/vtorc/inst/instance_dao.go | Persists replication_stalled_disk_full into database_instance. |
| go/vt/vtorc/inst/instance_dao_test.go | Updates instance insert tests to account for the new column/value. |
| go/vt/vtorc/inst/analysis.go | Adds ReplicationStalledDiskFull analysis code + detection struct field. |
| go/vt/vtorc/inst/analysis_problem.go | Adds the new analysis matcher and problem metadata. |
| go/vt/vtorc/inst/analysis_dao.go | Reads the new boolean from database_instance into DetectionAnalysis. |
| go/vt/vtorc/inst/analysis_dao_test.go | Extends VTOrc analysis decision tests to cover ReplicationStalledDiskFull. |
| go/vt/vtorc/db/generate_base.go | Adds replication_stalled_disk_full column to the SQLite schema. |
| go/vt/proto/replicationdata/replicationdata.pb.go | Regenerated Go protobuf bindings with the new field/accessor. |
| go/vt/proto/replicationdata/replicationdata_vtproto.pb.go | Regenerated vtproto fast-path code for the new field. |
| go/vt/mysqlctl/replication.go | Implements Mysqld.IsReplicationStalledDiskFull() and query/capability gating. |
| go/vt/mysqlctl/mysql_daemon.go | Extends MysqlDaemon interface with IsReplicationStalledDiskFull. |
| go/vt/mysqlctl/fakemysqldaemon.go | Implements the new interface method for tests. |
| go/test/endtoend/vtorc/replicationstalleddiskfull/replication_stalled_test.go | Adds the disk-full replication-stalled VTOrc E2E assertion. |
| go/test/endtoend/vtorc/replicationstalleddiskfull/mount_linux.go | Adds loopback ext4 mount helper + cleanup/free-space helpers (Linux). |
| go/test/endtoend/vtorc/replicationstalleddiskfull/main_test.go | TestMain wiring: loopback mount, per-replica my.cnf overrides, fast-poll VTOrc. |
| go/test/endtoend/cluster/mysqlctl_process.go | Adds ExtraMyCnfPath and appends it to EXTRA_MY_CNF per mysqlctl process. |
| go/test/endtoend/cluster/cluster_process.go | Allows func(*MysqlctlProcess) customizers and applies them pre-mysqlctl start. |
| go/mysql/capabilities/capability.go | Introduces PerformanceSchemaErrorLogTableCapability (>= 8.0.22). |
| .github/workflows/vtorc_disk_full_e2e.yml | New dedicated workflow to run the disk-full E2E under sudo on ubuntu-24.04. |
| .github/workflows/cluster_endtoend.yml | Excludes the new vtorc_disk_full shard from the regular cluster e2e matrix. |

    if mysqlctl.ExtraMyCnfPath != "" {
    	extraCnfPaths = append(extraCnfPaths, mysqlctl.ExtraMyCnfPath)
    }
    if len(extraCnfPaths) > 0 {
    	tmpProcess.Env = append(tmpProcess.Env, "EXTRA_MY_CNF="+strings.Join(extraCnfPaths, ":"))
    }
    tmpProcess.Env = append(tmpProcess.Env, os.Environ()...)
    tmpProcess.Env = append(tmpProcess.Env, DefaultVttestEnv)

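The colon-joining of per-process cnf paths shown above can be illustrated in isolation. This is a minimal sketch, not the PR's code: `buildExtraMyCnfEnv` is a hypothetical helper and the paths are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// buildExtraMyCnfEnv is a hypothetical helper (not part of the PR). It drops
// empty paths, joins the rest with ":" and returns the EXTRA_MY_CNF
// assignment, or "" when there is nothing to set.
func buildExtraMyCnfEnv(paths ...string) string {
	var nonEmpty []string
	for _, p := range paths {
		if p != "" {
			nonEmpty = append(nonEmpty, p)
		}
	}
	if len(nonEmpty) == 0 {
		return ""
	}
	return "EXTRA_MY_CNF=" + strings.Join(nonEmpty, ":")
}

func main() {
	// Illustrative paths only: an SSL cnf plus a per-replica override.
	fmt.Println(buildExtraMyCnfEnv("/tmp/ssl.cnf", "/tmp/replica-disk.cnf"))
	// Only the per-replica override is set.
	fmt.Println(buildExtraMyCnfEnv("", "/tmp/replica-disk.cnf"))
}
```

One thing worth checking when using this pattern: `os/exec` keeps only the last value for a duplicated environment key, so the order in which `EXTRA_MY_CNF` and `os.Environ()` are appended matters if the parent environment already defines `EXTRA_MY_CNF`.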
Codecov Report
❌ Patch coverage is

Additional details and impacted files

| Coverage Diff | main | #20058 | +/- |
|---|---|---|---|
| Coverage | 69.67% | 63.10% | -6.57% |
| Files | 1614 | 122 | -1492 |
| Lines | 216793 | 20178 | -196615 |
| Hits | 151044 | 12733 | -138311 |
| Misses | 65749 | 7445 | -58304 |

    // Detect a replica wedged by a full disk (IO/SQL threads "Yes" but applier
    // is silently retrying inside ha_commit_trans). Tolerate failures: the check
    // is best-effort and must not break FullStatus on older MySQL versions or
    // when the dba user lacks SELECT on performance_schema.error_log.
    replicationStalledDiskFull, err := tm.MysqlDaemon.IsReplicationStalledDiskFull(ctx)
    if err != nil {
    	log.Warn(fmt.Sprintf("IsReplicationStalledDiskFull failed: %v", err))
    	replicationStalledDiskFull = false
    }

    // replicationStalledDiskFullPermWarned is set the first time
    // IsReplicationStalledDiskFull observes a permission-denied error so the
    // warning is logged once per process instead of on every poll.
    var replicationStalledDiskFullPermWarned atomic.Bool

    // IsReplicationStalledDiskFull returns true when the replica's applier appears
    // to be wedged by a full disk (see replicationStalledDiskFullQuery). On older
    // MySQL versions or other flavors that lack performance_schema.error_log it
    // returns (false, nil). A SELECT permission denied error is logged once and
    // also reported as (false, nil) so it cannot fail FullStatus discovery.
    func (mysqld *Mysqld) IsReplicationStalledDiskFull(ctx context.Context) (bool, error) {
    	versionStr, err := mysqld.GetVersionString(ctx)
    	if err != nil {
    		return false, err
    	}
    	if _, v, perr := ParseVersionString(versionStr); perr == nil {
    		versionStr = fmt.Sprintf("%d.%d.%d", v.Major, v.Minor, v.Patch)
    	}
    	capableOf := mysql.ServerVersionCapableOf(versionStr)
    	if capableOf == nil {
    		return false, nil
    	}
    	ok, err := capableOf(capabilities.PerformanceSchemaErrorLogTableCapability)
    	if err != nil || !ok {
    		return false, nil
    	}

    	conn, err := getPoolReconnect(ctx, mysqld.dbaPool)
    	if err != nil {
    		return false, err
    	}
    	defer conn.Recycle()

    	res, err := conn.Conn.ExecuteFetch(replicationStalledDiskFullQuery, 1, false)
    	if err != nil {
    		if sqlErr, ok := sqlerror.NewSQLErrorFromError(err).(*sqlerror.SQLError); ok && sqlErr.Num == sqlerror.ERTableAccessDenied {
    			if replicationStalledDiskFullPermWarned.CompareAndSwap(false, true) {
    				log.Warn(fmt.Sprintf("IsReplicationStalledDiskFull: SELECT denied on performance_schema.error_log; check is disabled until grants are fixed (%v)", err))
    			}
    			return false, nil
    		}

While pretty elegant, this won't work 😢 The

This is pretty sad, because the alternatives to solve this aren't nearly as pretty.

In diagnosing this, I realized that the stalled disk monitor, which I had planned to make more robust with the aim that it become a default feature, duplicates much of the effort required to check whether a disk is healthy. I will return with an updated plan for how this can be achieved using that monitor.

This is a nice idea, but once the

The original issue is updated. Closing.

Description
Implements the proposal in #20056: a new `ReplicationStalledDiskFull` analysis code that detects MySQL replicas wedged by `ENOSPC` on the InnoDB filesystem (where `Slave_IO_Running` and `Slave_SQL_Running` stay `Yes` but the applier is parked inside `ha_commit_trans` retrying writes). Today VTOrc treats these replicas as healthy and they sit silent until lag-based alerts fire.

Detection is the single-poll, stateless query from the issue, run on each replica via the existing `FullStatus` RPC:

A row means: a `Disk is full` was logged after the applier's last successful commit, so the replica is still wedged. The signal is self-healing: it clears the moment the applier resumes.

What changed
- `PerformanceSchemaErrorLogTableCapability` (MySQL/Percona 8.0.22+): gates the query so MariaDB and older MySQL skip cleanly without log spam. `ER_TABLEACCESS_DENIED_ERROR` (missing `SELECT` on `performance_schema.error_log`) also soft-fails with a one-shot warning so grants issues don't break `FullStatus` discovery.
- `IsReplicationStalledDiskFull(ctx)` on `MysqlDaemon`: runs the query and returns the bool. Wired through `FullStatus` (new `replication_stalled_disk_full = 26` field) into VTOrc's existing instance-discovery path. Errors from the check are logged but don't fail `FullStatus`.
- `ReplicationStalledDiskFull` analysis code + matcher: fires when `topo.IsReplicaType(...) && !a.IsPrimary && a.ReplicationStalledDiskFull`. Priority `detectionAnalysisPriorityMedium`. No explicit recovery case: it falls through to the default arm with `RecoverySkipNoRecoveryAction`, the same path as `InvalidReplica` and `InvalidPrimary`. Disk-full recovery is operator-driven (free disk space), so VTOrc only surfaces the analysis.
- Linux-only e2e test: points `mysqld` `datadir` / `innodb_log_group_home_dir` / `relay-log` / `log-bin` at a loopback ext4 mount via `EXTRA_MY_CNF`, and asserts the analysis surfaces when 1 MB BLOB inserts from the primary fill the replica's disk. Three-layered gate: `//go:build linux` (compile-time), `os.Geteuid() == 0` (runtime root check), and `VT_TEST_DISK_FULL=1` (explicit opt-in). Dedicated workflow `vtorc_disk_full_e2e.yml` runs it on `ubuntu-24.04` under `sudo`; the regular `cluster_endtoend.yml` excludes the new `vtorc_disk_full` shard.

Why Linux + CI-only
The e2e creates a real ext4 loopback filesystem (`dd` + `mkfs.ext4` + `mount -o loop`), a Linux-only setup that also needs root. Supporting macOS would mean a second set of test logic (`hdiutil create` / `hdiutil attach`, APFS or HFS+ instead of ext4, different unmount semantics) for marginal additional signal. Running on a developer's laptop is also a poor experience: it prompts for `sudo`, and a half-cleaned-up mount (after `Ctrl-C` or a panic) leaves the machine in a messy state that's hard to diagnose. CI workers (GitHub Actions `ubuntu-24.04`) are ephemeral, so a leftover mount doesn't matter: the runner is destroyed at the end of the job.

E2E framework hook
The e2e needed per-tablet `mysqld` config without polluting other tablets' env. Two small additions to `go/test/endtoend/cluster/`:
- `ExtraMyCnfPath string` field on `MysqlctlProcess`: appended to `EXTRA_MY_CNF` on the per-process exec only (colon-joined with the existing SSL cnf if both are set).
- `StartShards` / `StartKeyspaceLegacy` now also handles `func(*MysqlctlProcess)`, applied before `StartProcess` so the override takes effect at `mysqld --initialize` time.

Reusable for any future test needing per-tablet `mysqld` tuning. Existing tests that pass `func(*VttabletProcess)` are unaffected.

Surfaced metrics
When the analysis fires for a replica, the standard surfacing metrics increment, in the same shape as the existing non-actionable codes:
- `DetectedProblems{Analysis="ReplicationStalledDiskFull", ...}` gauge → `1` while active, resets to `0` on the next pass once the row clears
- `AnalysisChangeWrite` counter on each transition
- `SkippedRecoveries{RecoveryType="", Reason="NoRecoveryAction"}` counter per recovery-dispatcher cycle

Related Issue(s)
Resolves: #20056
Checklist
Deployment Notes
No new flags. The new analysis surfaces automatically on MySQL/Percona 8.0.22+ replicas; older versions and MariaDB skip the check via the capability gate. The `dba` user needs `SELECT` on `performance_schema.error_log` for the check to run; denied access is logged once and the check is disabled for the process.

AI Disclosure
Claude Code assisted with development and testing; I committed the change manually after reviewing each step. Claude prepared this PR summary