Skip to content

VTOrc: detect stalled replication on a full disk#20058

Closed
timvaillancourt wants to merge 6 commits into
vitessio:mainfrom
timvaillancourt:vtorc-replication-stalled
Closed

VTOrc: detect stalled replication on a full disk#20058
timvaillancourt wants to merge 6 commits into
vitessio:mainfrom
timvaillancourt:vtorc-replication-stalled

Conversation

@timvaillancourt
Copy link
Copy Markdown
Contributor

Description

Implements the proposal in #20056 — a new ReplicationStalledDiskFull analysis code that detects MySQL replicas wedged by ENOSPC on the InnoDB filesystem (where Slave_IO_Running and Slave_SQL_Running stay Yes but the applier is parked inside ha_commit_trans retrying writes). Today VTOrc treats these replicas as healthy and they sit silent until lag-based alerts fire

Detection is the single-poll, stateless query from the issue, run on each replica via the existing FullStatus RPC:

SELECT 1 FROM performance_schema.error_log el
 WHERE el.logged    >= NOW(6) - INTERVAL 1 HOUR
   AND el.prio       = 'Error'
   AND el.subsystem  = 'InnoDB'
   AND el.error_code IN ('MY-012814', 'MY-012820')
   AND el.logged > COALESCE(
         (SELECT MAX(LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP)
            FROM performance_schema.replication_applier_status_by_worker),
         '1970-01-01')
 LIMIT 1

A row means: a Disk is full was logged after the applier's last successful commit — still wedged. Self-healing — clears the moment the applier resumes

What changed

  1. New PerformanceSchemaErrorLogTableCapability (MySQL/Percona 8.0.22+) — gates the query so MariaDB and older MySQL skip cleanly without log spam. ER_TABLEACCESS_DENIED_ERROR (missing SELECT on performance_schema.error_log) also soft-fails with a one-shot warning so grants issues don't break FullStatus discovery
  2. IsReplicationStalledDiskFull(ctx) on MysqlDaemon — runs the query and returns the bool. Wired through FullStatus (new replication_stalled_disk_full = 26 field) into VTOrc's existing instance-discovery path. Errors from the check are logged but don't fail FullStatus
  3. ReplicationStalledDiskFull analysis code + matcher — fires when topo.IsReplicaType(...) && !a.IsPrimary && a.ReplicationStalledDiskFull. Priority detectionAnalysisPriorityMedium. No explicit recovery case — falls through to the default arm with RecoverySkipNoRecoveryAction, the same path as InvalidReplica and InvalidPrimary. Disk-full recovery is operator-driven (free disk space), so VTOrc only surfaces the analysis
  4. End-to-end test (Linux only) — mounts a 256 MB ext4 loopback filesystem, points the replica's mysqld datadir / innodb_log_group_home_dir / relay-log / log-bin at it via EXTRA_MY_CNF, and asserts the analysis surfaces when 1 MB BLOB inserts from the primary fill the replica's disk. Three-layered gate: //go:build linux (compile-time), os.Geteuid() == 0 (runtime root check), and VT_TEST_DISK_FULL=1 (explicit opt-in). Dedicated workflow vtorc_disk_full_e2e.yml runs it on ubuntu-24.04 under sudo; the regular cluster_endtoend.yml excludes the new vtorc_disk_full shard

Why Linux + CI-only

The e2e creates a real ext4 loopback filesystem (dd + mkfs.ext4 + mount -o loop) — that's a Linux-only setup that also needs root. Supporting macOS would mean a second set of test logic (hdiutil create / hdiutil attach, APFS or HFS+ instead of ext4, different unmount semantics) for marginal additional signal. Running on a developer's laptop is also a poor experience: it prompts for sudo, and a half-cleaned-up mount (after Ctrl-C or a panic) leaves the machine in a messy state that's hard to diagnose. CI workers (GitHub Actions ubuntu-24.04) are ephemeral, so a leftover mount doesn't matter — the runner is destroyed at the end of the job

E2E framework hook

The e2e needed per-tablet mysqld config without polluting other tablets' env. Two small additions to go/test/endtoend/cluster/:

  • New ExtraMyCnfPath string field on MysqlctlProcess — appended to EXTRA_MY_CNF on the per-process exec only (colon-joined with the existing SSL cnf if both are set)
  • The customizer type-switch in StartShards / StartKeyspaceLegacy now also handles func(*MysqlctlProcess), applied before StartProcess so the override takes effect at mysqld --initialize time

Reusable for any future test needing per-tablet mysqld tuning. Existing tests that pass func(*VttabletProcess) are unaffected

Surfaced metrics

When the analysis fires for a replica, the standard surfacing metrics increment — same shape as the existing non-actionable codes:

  • DetectedProblems{Analysis="ReplicationStalledDiskFull", ...} gauge → 1 while active, resets to 0 on the next pass once the row clears
  • AnalysisChangeWrite counter on each transition
  • SkippedRecoveries{RecoveryType="", Reason="NoRecoveryAction"} counter per recovery-dispatcher cycle

Related Issue(s)

Resolves: #20056

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

No new flags. The new analysis surfaces automatically on MySQL/Percona 8.0.22+ replicas; older versions and MariaDB skip the check via the capability gate. The dba user needs SELECT on performance_schema.error_log for the check to run — denied access is logged once and the check is disabled for the process

AI Disclosure

Claude Code assisted with development and testing; I committed the change manually after reviewing each step. Claude prepared this PR summary

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings May 7, 2026 18:24
@github-actions github-actions Bot added this to the v25.0.0 milestone May 7, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 7, 2026
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot Bot commented May 7, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@timvaillancourt timvaillancourt added Type: Enhancement Logical improvement (somewhere between a bug and feature) and removed Component: Build/CI Component: VTAdmin VTadmin interface Component: TabletManager NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required Component: VTTablet labels May 7, 2026
@timvaillancourt timvaillancourt self-assigned this May 7, 2026
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a new VTOrc analysis (ReplicationStalledDiskFull) to detect replicas that appear healthy (IO/SQL threads running) but are actually wedged due to ENOSPC on the InnoDB filesystem, and wires the signal from tablet FullStatus through VTOrc instance discovery and analysis. It also introduces a Linux-only, opt-in end-to-end test plus a dedicated GitHub Actions workflow to validate the behavior under a real loopback ext4 disk-full scenario.

Changes:

  • Extend replicationdata.FullStatus with replication_stalled_disk_full and plumb it through vttablet FullStatus and VTOrc instance discovery/analysis.
  • Add MySQL capability gating + best-effort MySQL-side detection query, including soft-fail behavior for missing table support / missing grants.
  • Add a Linux-only disk-full E2E test and CI workflow; extend the E2E cluster framework to support per-mysqlctl EXTRA_MY_CNF overrides.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
web/vtadmin/src/proto/vtadmin.js Regenerated JS proto bindings to include replication_stalled_disk_full on FullStatus.
web/vtadmin/src/proto/vtadmin.d.ts Regenerated TS typings to include replication_stalled_disk_full on FullStatus.
test/config.json Adds a manual test target/shard entry for the new disk-full vtorc E2E.
proto/replicationdata.proto Adds replication_stalled_disk_full field (tag 26) to FullStatus.
go/vt/vttablet/tabletmanager/rpc_replication.go Populates the new FullStatus.ReplicationStalledDiskFull via MysqlDaemon.
go/vt/vtorc/test/recovery_analysis.go Extends test row-map plumbing with replication_stalled_disk_full.
go/vt/vtorc/inst/instance.go Adds ReplicationStalledDiskFull to the VTOrc Instance model.
go/vt/vtorc/inst/instance_dao.go Persists replication_stalled_disk_full into database_instance.
go/vt/vtorc/inst/instance_dao_test.go Updates instance insert tests to account for the new column/value.
go/vt/vtorc/inst/analysis.go Adds ReplicationStalledDiskFull analysis code + detection struct field.
go/vt/vtorc/inst/analysis_problem.go Adds the new analysis matcher and problem metadata.
go/vt/vtorc/inst/analysis_dao.go Reads the new boolean from database_instance into DetectionAnalysis.
go/vt/vtorc/inst/analysis_dao_test.go Extends VTOrc analysis decision tests to cover ReplicationStalledDiskFull.
go/vt/vtorc/db/generate_base.go Adds replication_stalled_disk_full column to the SQLite schema.
go/vt/proto/replicationdata/replicationdata.pb.go Regenerated Go protobuf bindings with the new field/accessor.
go/vt/proto/replicationdata/replicationdata_vtproto.pb.go Regenerated vtproto fast-path code for the new field.
go/vt/mysqlctl/replication.go Implements Mysqld.IsReplicationStalledDiskFull() and query/capability gating.
go/vt/mysqlctl/mysql_daemon.go Extends MysqlDaemon interface with IsReplicationStalledDiskFull.
go/vt/mysqlctl/fakemysqldaemon.go Implements the new interface method for tests.
go/test/endtoend/vtorc/replicationstalleddiskfull/replication_stalled_test.go Adds the disk-full replication-stalled VTOrc E2E assertion.
go/test/endtoend/vtorc/replicationstalleddiskfull/mount_linux.go Adds loopback ext4 mount helper + cleanup/free-space helpers (Linux).
go/test/endtoend/vtorc/replicationstalleddiskfull/main_test.go TestMain wiring: loopback mount, per-replica my.cnf overrides, fast-poll VTOrc.
go/test/endtoend/cluster/mysqlctl_process.go Adds ExtraMyCnfPath and appends it to EXTRA_MY_CNF per mysqlctl process.
go/test/endtoend/cluster/cluster_process.go Allows func(*MysqlctlProcess) customizers and applies them pre-mysqlctl start.
go/mysql/capabilities/capability.go Introduces PerformanceSchemaErrorLogTableCapability (>= 8.0.22).
.github/workflows/vtorc_disk_full_e2e.yml New dedicated workflow to run the disk-full E2E under sudo on ubuntu-24.04.
.github/workflows/cluster_endtoend.yml Excludes the new vtorc_disk_full shard from the regular cluster e2e matrix.

Comment on lines +166 to 173
if mysqlctl.ExtraMyCnfPath != "" {
extraCnfPaths = append(extraCnfPaths, mysqlctl.ExtraMyCnfPath)
}
if len(extraCnfPaths) > 0 {
tmpProcess.Env = append(tmpProcess.Env, "EXTRA_MY_CNF="+strings.Join(extraCnfPaths, ":"))
}
tmpProcess.Env = append(tmpProcess.Env, os.Environ()...)
tmpProcess.Env = append(tmpProcess.Env, DefaultVttestEnv)
Comment on lines +174 to +177
replicationStalledDiskFull, err := tm.MysqlDaemon.IsReplicationStalledDiskFull(ctx)
if err != nil {
log.Warn(fmt.Sprintf("IsReplicationStalledDiskFull failed: %v", err))
replicationStalledDiskFull = false
Comment on lines +884 to +905
versionStr, err := mysqld.GetVersionString(ctx)
if err != nil {
return false, err
}
if _, v, perr := ParseVersionString(versionStr); perr == nil {
versionStr = fmt.Sprintf("%d.%d.%d", v.Major, v.Minor, v.Patch)
}
capableOf := mysql.ServerVersionCapableOf(versionStr)
if capableOf == nil {
return false, nil
}
ok, err := capableOf(capabilities.PerformanceSchemaErrorLogTableCapability)
if err != nil || !ok {
return false, nil
}

conn, err := getPoolReconnect(ctx, mysqld.dbaPool)
if err != nil {
return false, err
}
defer conn.Recycle()

Comment on lines +906 to +913
res, err := conn.Conn.ExecuteFetch(replicationStalledDiskFullQuery, 1, false)
if err != nil {
if sqlErr, ok := sqlerror.NewSQLErrorFromError(err).(*sqlerror.SQLError); ok && sqlErr.Num == sqlerror.ERTableAccessDenied {
if replicationStalledDiskFullPermWarned.CompareAndSwap(false, true) {
log.Warn(fmt.Sprintf("IsReplicationStalledDiskFull: SELECT denied on performance_schema.error_log; check is disabled until grants are fixed (%v)", err))
}
return false, nil
}
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings May 7, 2026 18:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 26 out of 27 changed files in this pull request and generated 1 comment.

Comment on lines +174 to +177
replicationStalledDiskFull, err := tm.MysqlDaemon.IsReplicationStalledDiskFull(ctx)
if err != nil {
log.Warn(fmt.Sprintf("IsReplicationStalledDiskFull failed: %v", err))
replicationStalledDiskFull = false
@codecov
Copy link
Copy Markdown

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 20.93023% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.10%. Comparing base (70c7a72) to head (4b74648).
⚠️ Report is 238 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/mysqlctl/replication.go 0.00% 24 Missing ⚠️
go/vt/vttablet/tabletmanager/rpc_replication.go 0.00% 5 Missing ⚠️
go/mysql/capabilities/capability.go 0.00% 2 Missing ⚠️
go/vt/mysqlctl/fakemysqldaemon.go 0.00% 2 Missing ⚠️
go/vt/vtorc/inst/instance_dao.go 66.66% 1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (70c7a72) and HEAD (4b74648). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (70c7a72) HEAD (4b74648)
1 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #20058       +/-   ##
===========================================
- Coverage   69.67%   63.10%    -6.57%     
===========================================
  Files        1614      122     -1492     
  Lines      216793    20178   -196615     
===========================================
- Hits       151044    12733   -138311     
+ Misses      65749     7445    -58304     
Flag Coverage Δ
partial 63.10% <20.93%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings May 7, 2026 19:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.

Comment on lines +170 to +178
// Detect a replica wedged by a full disk (IO/SQL threads "Yes" but applier
// is silently retrying inside ha_commit_trans). Tolerate failures: the check
// is best-effort and must not break FullStatus on older MySQL versions or
// when the dba user lacks SELECT on performance_schema.error_log.
replicationStalledDiskFull, err := tm.MysqlDaemon.IsReplicationStalledDiskFull(ctx)
if err != nil {
log.Warn(fmt.Sprintf("IsReplicationStalledDiskFull failed: %v", err))
replicationStalledDiskFull = false
}
Comment on lines +873 to +913
// replicationStalledDiskFullPermWarned is set the first time
// IsReplicationStalledDiskFull observes a permission-denied error so the
// warning is logged once per process instead of on every poll.
var replicationStalledDiskFullPermWarned atomic.Bool

// IsReplicationStalledDiskFull returns true when the replica's applier appears
// to be wedged by a full disk (see replicationStalledDiskFullQuery). On older
// MySQL versions or other flavors that lack performance_schema.error_log it
// returns (false, nil). A SELECT permission denied error is logged once and
// also reported as (false, nil) so it cannot fail FullStatus discovery.
func (mysqld *Mysqld) IsReplicationStalledDiskFull(ctx context.Context) (bool, error) {
versionStr, err := mysqld.GetVersionString(ctx)
if err != nil {
return false, err
}
if _, v, perr := ParseVersionString(versionStr); perr == nil {
versionStr = fmt.Sprintf("%d.%d.%d", v.Major, v.Minor, v.Patch)
}
capableOf := mysql.ServerVersionCapableOf(versionStr)
if capableOf == nil {
return false, nil
}
ok, err := capableOf(capabilities.PerformanceSchemaErrorLogTableCapability)
if err != nil || !ok {
return false, nil
}

conn, err := getPoolReconnect(ctx, mysqld.dbaPool)
if err != nil {
return false, err
}
defer conn.Recycle()

res, err := conn.Conn.ExecuteFetch(replicationStalledDiskFullQuery, 1, false)
if err != nil {
if sqlErr, ok := sqlerror.NewSQLErrorFromError(err).(*sqlerror.SQLError); ok && sqlErr.Num == sqlerror.ERTableAccessDenied {
if replicationStalledDiskFullPermWarned.CompareAndSwap(false, true) {
log.Warn(fmt.Sprintf("IsReplicationStalledDiskFull: SELECT denied on performance_schema.error_log; check is disabled until grants are fixed (%v)", err))
}
return false, nil
}
Comment on lines +169 to 173
if len(extraCnfPaths) > 0 {
tmpProcess.Env = append(tmpProcess.Env, "EXTRA_MY_CNF="+strings.Join(extraCnfPaths, ":"))
}
tmpProcess.Env = append(tmpProcess.Env, os.Environ()...)
tmpProcess.Env = append(tmpProcess.Env, DefaultVttestEnv)
@timvaillancourt
Copy link
Copy Markdown
Contributor Author

SELECT 1 FROM performance_schema.error_log el
WHERE el.logged >= NOW(6) - INTERVAL 1 HOUR
AND el.prio = 'Error'
AND el.subsystem = 'InnoDB'
AND el.error_code IN ('MY-012814', 'MY-012820')
AND el.logged > COALESCE(
(SELECT MAX(LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP)
FROM performance_schema.replication_applier_status_by_worker),
'1970-01-01')
LIMIT 1

While pretty elegant, this won't work 😢

The performance_schema.error_log table is a fixed 5MB in-memory ring buffer. This query WILL trigger a problem while the disk is full error is in the ring buffer, but once it falls off, the problem will appear to be silently resolved

This is pretty sad, because the alternatives to solve this aren't nearly as pretty

In diagnosing this, I realized the stalled disk monitor that I had planned to make more-robust, with the aim that it become a default feature, duplicates much of the effort required to check if a disk is healthy. I will return with an updated plan on how this can be achieved using that monitor

@timvaillancourt
Copy link
Copy Markdown
Contributor Author

This is a nice idea, but once the error_log ring buffer wraps the problem is silently dropped. This won't work

The original issue is updated. Closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: VTOrc Vitess Orchestrator integration Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: VTOrc to detect "stalled" replicas where replication threads remain running

2 participants