VTOrc: add `InnoDBStalledPrimary` to detect stalled InnoDB semaphores by timvaillancourt · Pull Request #20169 · vitessio/vitess

timvaillancourt · 2026-05-23T00:18:11Z

Description

Implements the proposal in #20168 — a new InnoDBStalledPrimary analysis code that detects MySQL primaries wedged on an InnoDB latch (where mysqld is alive, accepting connections, and replicas still report Slave_IO_Running: Yes / Slave_SQL_Running: Yes, but no writes are committing). Today VTOrc has no analyser that fires before mysqld's eventual self-kill at innodb_fatal_semaphore_wait_threshold, so the shard sits in zero-write outage for the full kill threshold plus restart time — easily 10+ minutes 😱

Detection is the single-poll, stateless query from the issue, run on each tablet via the existing FullStatus RPC:

SELECT 1 FROM performance_schema.error_log
 WHERE logged    >= NOW(6) - INTERVAL 60 SECOND
   AND prio       = 'Warning'
   AND subsystem  = 'InnoDB'
   AND error_code = 'MY-012985'
 LIMIT 1

A row means mysqld itself has classified an InnoDB latch wait as pathological (it only emits MY-012985 once a thread has already waited ≥ innodb_fatal_semaphore_wait_threshold / 2). The signal is self-clearing — when the wait resolves and mysqld stops re-emitting, the row falls out of the 60s window on the next poll

What changed

HasRecentInnoDBLongSemaphoreWait(ctx, lookback) on MysqlDaemon — runs the query and returns the bool. Wired through FullStatus (new innodb_long_semaphore_wait_seen = 26 field) into VTOrc's instance-discovery path. ER_NO_SUCH_TABLE (missing performance_schema.error_log on MariaDB / pre-8.0) soft-fails to (false, nil) so older flavours skip cleanly without log spam
InnoDBStalledPrimary analysis code + matcher — fires when a.IsClusterPrimary && a.LastCheckValid && a.InnoDBLongSemaphoreWaitSeen && a.CountValidReplicas > 0. Priority detectionAnalysisPriorityShardWideAction. Unlike PrimaryDiskStalled, LastCheckValid must be true — the whole point is that mysqld is still answering RPCs while internally wedged. The CountValidReplicas > 0 gate prevents ERS firing on a stuck primary with nowhere to go. Declares BeforeAnalyses: [DeadPrimary*] so it wins same-tablet selection if the primary later becomes unreachable mid-stall
Recovery wiring — InnoDBStalledPrimary is added to getCheckAndRecoverFunctionCode's ERS case alongside PrimaryDiskStalled and PrimarySemiSyncBlocked. Reuses recoverDeadPrimaryFunc unchanged — there is no other shape of recovery for an InnoDB-wedged primary
innodb_fatal_semaphore_wait_threshold = 240 — lowered from mysqld's default of 600 in mysql80.cnf / mysql8026.cnf / mysql84.cnf / mysql90.cnf. With the stock default, MY-012985 only fires at T=300s, giving the analyser nothing to act on for the first 5 minutes of a real stall. At 240: warning at T=120s, self-kill at T=240s — analyser detects in ~2 minutes and ERS finishes well before mysqld would self-kill. Not touched: default.cnf (MariaDB doesn't have this variable), mariadb10.cnf, mysql57.cnf (MY-012985 didn't exist before 8.0). Operators override via EXTRA_MY_CNF

Why no consecutive-poll hysteresis

An earlier draft of this (and the issue's original proposal) gated ERS behind "the analysis has held across two consecutive polls" to avoid acting on a single transient warning. I removed that after thinking about it more carefully:

MY-012985 isn't a transient signal. By the time mysqld emits a single row, a thread has already waited ≥ innodb_fatal_semaphore_wait_threshold / 2 (120s with the new default, 300s on stock) — mysqld's own monitor has formally declared the wait pathological. Adding another full poll-cycle of write outage to "confirm" what mysqld has already endorsed buys little safety on top of the existing safety gates (60s lookback window self-clears the signal when the stall resolves; CountValidReplicas > 0 prevents reparenting nowhere; ERS itself won't promote an unviable replica)

Why no end-to-end test

Same shape as #20058's note for the disk-stall e2e: we can't write to performance_schema.error_log (it's a read-only view over mysqld's in-process structured log buffer), and we can't reliably synthesize a real InnoDB latch stall in CI. Even if we could, the warning only fires at innodb_fatal_semaphore_wait_threshold / 2 — minimum 60s of test wait with our new default

The existing TestRecoveryDeadlocks comment (lines 989–1001 of go/test/endtoend/vtorc/general/vtorc_test.go) already calls out the same gap for PrimaryDiskStalled — synthetic fault injection flips the matcher field but not LastCheckValid. Coverage comes from the unit tests below

Testing

Unit tests:

TestHasRecentInnoDBLongSemaphoreWait in go/vt/mysqlctl/replication_test.go — 4 sub-tests: returns true on a matching row, false on zero rows, (false, nil) on ER_NO_SUCH_TABLE, propagates any other DB error
TestGetDetectionAnalysisDecision/InnoDBStalledPrimary_* in analysis_dao_test.go — 3 sub-tests: fires when all gates are satisfied, does not fire with CountValidReplicas: 0, DeadPrimary wins via BeforeAnalyses when the primary becomes unreachable while still showing the warning

Related Issue(s)

Resolves: #20168
Related: #20058

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

Deployment Notes

Two user-visible behavioural changes for any Vitess-managed MySQL 8.0+ instance:

New analyser surfaces automatically — InnoDBStalledPrimary fires when mysqld logs MY-012985, gated on CountValidReplicas > 0. ERS reuses the existing recoverDeadPrimaryFunc path. No new flags. The dba user needs SELECT on performance_schema.error_log for the check to run — denied access surfaces as a soft error and the check returns (false, nil) for that poll
innodb_fatal_semaphore_wait_threshold default lowered to 240 — MySQL warns at T=120s (was T=300s) and self-kills at T=240s (was T=600s). Halves the no-ERS-available runway too, so operators on shards without a valid replica have less time to react before mysqld self-kills. Override via EXTRA_MY_CNF to restore the stock value or pick a different one

No-op on MariaDB and MySQL/Percona < 8.0 — the capability gate skips the query and the my.cnf change is scoped to MySQL 8.0+ flavour files only

AI Disclosure

Claude Code assisted with development and testing. Claude prepared this PR summary

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

vitess-bot · 2026-05-23T00:18:42Z

Treat InnoDBStalledPrimary the same as PrimarySemiSyncBlocked and PrimaryDiskStalled in shardWideRecoveryIgnoredTablets — the primary is reachable (LastCheckValid is true), so a stale MY-012985 row in SQLite could otherwise drive an unnecessary ERS once the warning has aged out of the 60s lookback window or the latch has cleared. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

…atus `HasRecentInnoDBLongSemaphoreWait` now soft-fails to `(false, nil)` for the common config-state errors — missing table (ERNoSuchTable), missing performance_schema database (ERBadDb), or insufficient grants (ERTableAccessDenied / ERAccessDeniedError / ERDBAccessDenied) — so older deployments without the SELECT grant on `performance_schema.error_log` don't break VTOrc discovery on upgrade. Truly unexpected errors still propagate. `FullStatus` no longer fails on any error from this auxiliary check. A non-soft-failed error is logged once at warning level (slog-style) and `InnodbLongSemaphoreWaitSeen` is set to `false`, so the rest of the status fields still report. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

stock Oracle MySQL 8.x/9.x doesn't have this variable — it's Percona Server only. CI runs against stock MySQL and was failing mysqld startup with [ERROR] [MY-000067] unknown variable 'innodb_fatal_semaphore_wait_threshold=240'. With loose_, stock MySQL warns and continues; Percona Server applies the tuning. On stock MySQL the InnoDBStalledPrimary analyser still works, just at the compiled-in (600s) detection floor instead of 240s. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

promptless · 2026-05-23T00:56:27Z

Promptless prepared documentation updates related to this change.

Triggered by vitessio/vitess#20169

Added documentation for the new InnoDBStalledPrimary VTOrc analyzer that detects MySQL primaries stalled on InnoDB semaphore waits and triggers Emergency Reparent Shard recovery.

Updates:

VTOrc user guide: Added new entry to the VTOrc recovery table documenting the analysis code, detection criteria, and requirements (MySQL 8.0+, SELECT on performance_schema.error_log)
Changelog: Added 25.0.0 changelog entry describing the new analyzer and config changes

codecov · 2026-05-23T00:57:00Z

Codecov Report

❌ Patch coverage is 72.46377% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.31%. Comparing base (70c7a72) to head (42c6c64).
⚠️ Report is 273 commits behind head on main.

Files with missing lines	Patch %	Lines
go/vt/vtorc/inst/instance_dao.go	80.00%	5 Missing ⚠️
go/vt/vttablet/tabletmanager/rpc_replication.go	0.00%	5 Missing ⚠️
go/vt/vtorc/logic/topology_recovery.go	50.00%	4 Missing ⚠️
go/vt/mysqlctl/replication.go	86.36%	3 Missing ⚠️
go/vt/mysqlctl/fakemysqldaemon.go	0.00%	2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (70c7a72) and HEAD (42c6c64). Click for more details.

HEAD has 1 upload less than BASE

Flag BASE (70c7a72) HEAD (42c6c64)

1 0

Additional details and impacted files

@@             Coverage Diff             @@
##             main   #20169       +/-   ##
===========================================
- Coverage   69.67%   62.31%    -7.36%     
===========================================
  Files        1614      129     -1485     
  Lines      216793    21800   -194993     
===========================================
- Hits       151044    13585   -137459     
+ Misses      65749     8215    -57534

Flag	Coverage Δ
partial	`62.31% <72.46%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Copilot

Pull request overview

Copilot reviewed 18 out of 21 changed files in this pull request and generated 2 comments.

Files not reviewed (2)

go/vt/proto/replicationdata/replicationdata.pb.go: Language not supported
go/vt/proto/replicationdata/replicationdata_vtproto.pb.go: Language not supported

+	// mysqld self-kills at the 600s threshold hardcoded in Oracle/Percona
+	// MySQL [1]. MariaDB exposes innodb_fatal_semaphore_wait_threshold to
+	// lower it without recompiling, but MariaDB is not supported by Vitess.
+	//
+	// [1] https://github.com/planetscale/mysql-server-private/blob/8.0/storage/innobase/srv/srv0srv.cc#L120


+// innodbLogTableOverride redirects HasRecentInnoDBLongSemaphoreWait's source
+// table to a writable test table. E2E-only — a typo in production silently
+// disables InnoDB-stall detection via the ERNoSuchTable soft-fail below.
+var innodbLogTableOverride = os.Getenv("VTTABLET_E2E_INNODB_LOG_TABLE")
+
+func init() {
+	if innodbLogTableOverride != "" {
+		log.Warn(fmt.Sprintf("VTTABLET_E2E_INNODB_LOG_TABLE=%q is set — HasRecentInnoDBLongSemaphoreWait will read from this table instead of performance_schema.error_log. This is for e2e tests only.", innodbLogTableOverride))
+	}
+}
+
+// HasRecentInnoDBLongSemaphoreWait reports whether MY-012985 was logged to
+// performance_schema.error_log in the lookback window. Best-effort: if the
+// table is unavailable (missing on MariaDB / pre-8.0, dba lacks SELECT,
+// performance_schema disabled), returns (false, nil). Only unexpected errors
+// propagate.
+func (mysqld *Mysqld) HasRecentInnoDBLongSemaphoreWait(ctx context.Context, lookback time.Duration) (bool, error) {
+	conn, err := getPoolReconnect(ctx, mysqld.dbaPool)
+	if err != nil {
+		return false, err
+	}
+	defer conn.Recycle()
+
+	table := "performance_schema.error_log"
+	if innodbLogTableOverride != "" {
+		table = innodbLogTableOverride
+	}
+	query := fmt.Sprintf(
+		"SELECT 1 FROM %s "+
+			"WHERE logged >= NOW(6) - INTERVAL %d SECOND "+
+			"AND prio = 'Warning' AND subsystem = 'InnoDB' "+
+			"AND error_code = 'MY-012985' LIMIT 1",
+		table, int64(lookback.Seconds()),
+	)


mattlord

The MySQL-side signal does occur before the fatal crash path in stock MySQL 8.0/8.4, but not quite for the reason described here. The source logs the long semaphore warning after a fixed 4-minute wait, while the fatal path only starts after srv_fatal_semaphore_wait_threshold is exceeded, default 600s, and then requires repeated observations of the same waiter/semaphore. So MY-012985 is the right code, but a single row is a warning/diagnostic signal, not proof that mysqld is definitely about to crash.

IMO we would do better to try and detect underlying causes of the InnoDB stall and try to correct it by e.g. killing a transaction that is open, idle, and holding on to a lot of locks — rather than waiting 4 minutes while the process may be in a degraded state with things piling up.

IMO we should not ship the e2e table override in production code in go/vt/mysqlctl/replication.go:826-858. VTTABLET_E2E_INNODB_LOG_TABLE is test-only behavior in the production mysqlctl path, and its value is interpolated directly into FROM %s. If this env var is accidentally present or malformed in a real deployment, it can silently disable detection, query the wrong table, or report InnodbLongSemaphoreWaitSeen=true from arbitrary data and trigger ERS on an otherwise healthy primary. Since this signal can drive a destructive shard reparent, I don’t think we should carry this kind of test hook in production code. Please remove it, or at minimum make it fail-closed, validate/quote the identifier strictly, and only honor it in an explicitly test-only environment.

A single recent MY-012985 row is not enough to prove an active stall in go/vt/mysqlctl/replication.go:853-858
and go/vt/vtorc/inst/analysis_problem.go:163-164. I checked the MySQL/InnoDB side. The long semaphore warning is emitted before the fatal path, but it is not “fatal threshold / 2” and it does not guarantee mysqld will crash. In stock MySQL 8.0/8.4, the warning is emitted after a fixed 4-minute wait; the fatal path is separate and only fires after the fatal threshold is exceeded and the monitor repeatedly observes the same waiter/semaphore. With the current 60s lookback, a resolved or non-write-blocking long wait can remain visible long enough for VTOrc to ERS a reachable primary. I think this needs an additional confirmation gate, e.g. persistent/re-emitted warnings across polls and/or evidence that writes are actually blocked, before running ERS.

Nit, but we should update the MySQL comments / PR description based on go/vt/vtorc/inst/analysis_problem.go:149-155. The comment still references a private source URL and says the behavior is hardcoded in Oracle/Percona MySQL. The PR description is also stale: it still talks about lowering innodb_fatal_semaphore_wait_threshold and says there is no e2e test, but the final diff removed the my.cnf changes and added an e2e test. Please update this to describe the current behavior accurately.

VTOrc: add InnoDBStalledPrimary to detect stalled InnoDB semaphores

69d0840

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

Copilot AI review requested due to automatic review settings May 23, 2026 00:18

github-actions Bot added this to the v25.0.0 milestone May 23, 2026

github-actions Bot added Component: VTAdmin VTadmin interface Component: VTOrc Vitess Orchestrator integration Component: TabletManager Component: VTTablet labels May 23, 2026

Copilot started reviewing on behalf of timvaillancourt May 23, 2026 00:18 View session

timvaillancourt self-assigned this May 23, 2026

This comment was marked as outdated.

Sign in to view

timvaillancourt requested a review from Copilot May 23, 2026 00:28

Copilot started reviewing on behalf of timvaillancourt May 23, 2026 00:29 View session

This comment was marked as outdated.

Sign in to view

timvaillancourt added 2 commits May 23, 2026 02:38

Copilot AI review requested due to automatic review settings May 23, 2026 00:42

Copilot started reviewing on behalf of timvaillancourt May 23, 2026 00:42 View session

This comment was marked as outdated.

Sign in to view

timvaillancourt marked this pull request as ready for review May 23, 2026 00:50

timvaillancourt requested review from beingnoble03, frouioui, harshit-gangal, mattlord, rohit-nayak-ps and shlomi-noach as code owners May 23, 2026 00:50

This was referenced May 23, 2026

docs: Document InnoDBStalledPrimary VTOrc analysis code vitessio/website#2124

Draft

docs: Add changelog entry for VTOrc InnoDBStalledPrimary analyzer #20170

Draft

e2e test

28055d5

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

timvaillancourt requested a review from Copilot May 26, 2026 11:51

Copilot started reviewing on behalf of timvaillancourt May 26, 2026 12:16 View session

Copilot AI reviewed May 26, 2026

View reviewed changes

mattlord reviewed May 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VTOrc: add `InnoDBStalledPrimary` to detect stalled InnoDB semaphores#20169

VTOrc: add `InnoDBStalledPrimary` to detect stalled InnoDB semaphores#20169
timvaillancourt wants to merge 5 commits into
vitessio:mainfrom
timvaillancourt:innodb-stalled-analyzer

timvaillancourt commented May 23, 2026 •

edited

Loading

Uh oh!

vitess-bot Bot commented May 23, 2026

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

promptless Bot commented May 23, 2026

Uh oh!

codecov Bot commented May 23, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

mattlord left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

timvaillancourt commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

What changed

Why no consecutive-poll hysteresis

Why no end-to-end test

Testing

Related Issue(s)

Checklist

Deployment Notes

AI Disclosure

Uh oh!

vitess-bot Bot commented May 23, 2026

Review Checklist

General

Tests

Documentation

New flags

If a workflow is added or modified:

Backward compatibility

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

promptless Bot commented May 23, 2026

Uh oh!

codecov Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

mattlord left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

timvaillancourt commented May 23, 2026 •

edited

Loading

codecov Bot commented May 23, 2026 •

edited

Loading