Skip to content

VTOrc: add InnoDBStalledPrimary to detect stalled InnoDB semaphores#20169

Open
timvaillancourt wants to merge 5 commits into
vitessio:mainfrom
timvaillancourt:innodb-stalled-analyzer
Open

VTOrc: add InnoDBStalledPrimary to detect stalled InnoDB semaphores#20169
timvaillancourt wants to merge 5 commits into
vitessio:mainfrom
timvaillancourt:innodb-stalled-analyzer

Conversation

@timvaillancourt
Copy link
Copy Markdown
Contributor

@timvaillancourt timvaillancourt commented May 23, 2026

Description

Implements the proposal in #20168 — a new InnoDBStalledPrimary analysis code that detects MySQL primaries wedged on an InnoDB latch (where mysqld is alive, accepting connections, and replicas still report Slave_IO_Running: Yes / Slave_SQL_Running: Yes, but no writes are committing). Today VTOrc has no analyser that fires before mysqld's eventual self-kill at innodb_fatal_semaphore_wait_threshold, so the shard sits in zero-write outage for the full kill threshold plus restart time — easily 10+ minutes 😱

Detection is the single-poll, stateless query from the issue, run on each tablet via the existing FullStatus RPC:

SELECT 1 FROM performance_schema.error_log
 WHERE logged    >= NOW(6) - INTERVAL 60 SECOND
   AND prio       = 'Warning'
   AND subsystem  = 'InnoDB'
   AND error_code = 'MY-012985'
 LIMIT 1

A row means mysqld itself has classified an InnoDB latch wait as pathological (it only emits MY-012985 once a thread has already waited ≥ innodb_fatal_semaphore_wait_threshold / 2). The signal is self-clearing — when the wait resolves and mysqld stops re-emitting, the row falls out of the 60s window on the next poll

What changed

  1. HasRecentInnoDBLongSemaphoreWait(ctx, lookback) on MysqlDaemon — runs the query and returns the bool. Wired through FullStatus (new innodb_long_semaphore_wait_seen = 26 field) into VTOrc's instance-discovery path. ER_NO_SUCH_TABLE (missing performance_schema.error_log on MariaDB / pre-8.0) soft-fails to (false, nil) so older flavours skip cleanly without log spam

  2. InnoDBStalledPrimary analysis code + matcher — fires when a.IsClusterPrimary && a.LastCheckValid && a.InnoDBLongSemaphoreWaitSeen && a.CountValidReplicas > 0. Priority detectionAnalysisPriorityShardWideAction. Unlike PrimaryDiskStalled, LastCheckValid must be true — the whole point is that mysqld is still answering RPCs while internally wedged. The CountValidReplicas > 0 gate prevents ERS firing on a stuck primary with nowhere to go. Declares BeforeAnalyses: [DeadPrimary*] so it wins same-tablet selection if the primary later becomes unreachable mid-stall

  3. Recovery wiringInnoDBStalledPrimary is added to getCheckAndRecoverFunctionCode's ERS case alongside PrimaryDiskStalled and PrimarySemiSyncBlocked. Reuses recoverDeadPrimaryFunc unchanged — there is no other shape of recovery for an InnoDB-wedged primary

  4. innodb_fatal_semaphore_wait_threshold = 240 — lowered from mysqld's default of 600 in mysql80.cnf / mysql8026.cnf / mysql84.cnf / mysql90.cnf. With the stock default, MY-012985 only fires at T=300s, giving the analyser nothing to act on for the first 5 minutes of a real stall. At 240: warning at T=120s, self-kill at T=240s — analyser detects in ~2 minutes and ERS finishes well before mysqld would self-kill. Not touched: default.cnf (MariaDB doesn't have this variable), mariadb10.cnf, mysql57.cnf (MY-012985 didn't exist before 8.0). Operators override via EXTRA_MY_CNF

Why no consecutive-poll hysteresis

An earlier draft of this (and the issue's original proposal) gated ERS behind "the analysis has held across two consecutive polls" to avoid acting on a single transient warning. I removed that after thinking about it more carefully:

MY-012985 isn't a transient signal. By the time mysqld emits a single row, a thread has already waited ≥ innodb_fatal_semaphore_wait_threshold / 2 (120s with the new default, 300s on stock)mysqld's own monitor has formally declared the wait pathological. Adding another full poll-cycle of write outage to "confirm" what mysqld has already endorsed buys little safety on top of the existing safety gates (60s lookback window self-clears the signal when the stall resolves; CountValidReplicas > 0 prevents reparenting nowhere; ERS itself won't promote an unviable replica)

Why no end-to-end test

Same shape as #20058's note for the disk-stall e2e: we can't write to performance_schema.error_log (it's a read-only view over mysqld's in-process structured log buffer), and we can't reliably synthesize a real InnoDB latch stall in CI. Even if we could, the warning only fires at innodb_fatal_semaphore_wait_threshold / 2 — minimum 60s of test wait with our new default

The existing TestRecoveryDeadlocks comment (lines 989–1001 of go/test/endtoend/vtorc/general/vtorc_test.go) already calls out the same gap for PrimaryDiskStalled — synthetic fault injection flips the matcher field but not LastCheckValid. Coverage comes from the unit tests below

Testing

Unit tests:

  • TestHasRecentInnoDBLongSemaphoreWait in go/vt/mysqlctl/replication_test.go — 4 sub-tests: returns true on a matching row, false on zero rows, (false, nil) on ER_NO_SUCH_TABLE, propagates any other DB error
  • TestGetDetectionAnalysisDecision/InnoDBStalledPrimary_* in analysis_dao_test.go — 3 sub-tests: fires when all gates are satisfied, does not fire with CountValidReplicas: 0, DeadPrimary wins via BeforeAnalyses when the primary becomes unreachable while still showing the warning

Related Issue(s)

Resolves: #20168
Related: #20058

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Two user-visible behavioural changes for any Vitess-managed MySQL 8.0+ instance:

  1. New analyser surfaces automaticallyInnoDBStalledPrimary fires when mysqld logs MY-012985, gated on CountValidReplicas > 0. ERS reuses the existing recoverDeadPrimaryFunc path. No new flags. The dba user needs SELECT on performance_schema.error_log for the check to run — denied access surfaces as a soft error and the check returns (false, nil) for that poll

  2. innodb_fatal_semaphore_wait_threshold default lowered to 240 — MySQL warns at T=120s (was T=300s) and self-kills at T=240s (was T=600s). Halves the no-ERS-available runway too, so operators on shards without a valid replica have less time to react before mysqld self-kills. Override via EXTRA_MY_CNF to restore the stock value or pick a different one

No-op on MariaDB and MySQL/Percona < 8.0 — the capability gate skips the query and the my.cnf change is scoped to MySQL 8.0+ flavour files only

AI Disclosure

Claude Code assisted with development and testing. Claude prepared this PR summary

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings May 23, 2026 00:18
@github-actions github-actions Bot added this to the v25.0.0 milestone May 23, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 23, 2026
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot Bot commented May 23, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@timvaillancourt timvaillancourt added Type: Enhancement Logical improvement (somewhere between a bug and feature) and removed Component: VTAdmin VTadmin interface Component: TabletManager Component: VTTablet NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 23, 2026
@timvaillancourt timvaillancourt self-assigned this May 23, 2026

This comment was marked as outdated.

Treat InnoDBStalledPrimary the same as PrimarySemiSyncBlocked and
PrimaryDiskStalled in shardWideRecoveryIgnoredTablets — the primary
is reachable (LastCheckValid is true), so a stale MY-012985 row in
SQLite could otherwise drive an unnecessary ERS once the warning
has aged out of the 60s lookback window or the latch has cleared.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

This comment was marked as outdated.

…atus

`HasRecentInnoDBLongSemaphoreWait` now soft-fails to `(false, nil)` for
the common config-state errors — missing table (ERNoSuchTable), missing
performance_schema database (ERBadDb), or insufficient grants
(ERTableAccessDenied / ERAccessDeniedError / ERDBAccessDenied) — so older
deployments without the SELECT grant on `performance_schema.error_log`
don't break VTOrc discovery on upgrade. Truly unexpected errors still
propagate.

`FullStatus` no longer fails on any error from this auxiliary check.
A non-soft-failed error is logged once at warning level (slog-style)
and `InnodbLongSemaphoreWaitSeen` is set to `false`, so the rest of
the status fields still report.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
stock Oracle MySQL 8.x/9.x doesn't have this variable — it's Percona
Server only. CI runs against stock MySQL and was failing mysqld
startup with [ERROR] [MY-000067] unknown variable
'innodb_fatal_semaphore_wait_threshold=240'.

With loose_, stock MySQL warns and continues; Percona Server applies
the tuning. On stock MySQL the InnoDBStalledPrimary analyser still
works, just at the compiled-in (600s) detection floor instead of 240s.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot AI review requested due to automatic review settings May 23, 2026 00:42

This comment was marked as outdated.

@promptless
Copy link
Copy Markdown
Contributor

promptless Bot commented May 23, 2026

Promptless prepared documentation updates related to this change.

Triggered by vitessio/vitess#20169

Added documentation for the new InnoDBStalledPrimary VTOrc analyzer that detects MySQL primaries stalled on InnoDB semaphore waits and triggers Emergency Reparent Shard recovery.

Updates:

  • VTOrc user guide: Added new entry to the VTOrc recovery table documenting the analysis code, detection criteria, and requirements (MySQL 8.0+, SELECT on performance_schema.error_log)
  • Changelog: Added 25.0.0 changelog entry describing the new analyzer and config changes

@codecov
Copy link
Copy Markdown

codecov Bot commented May 23, 2026

Codecov Report

❌ Patch coverage is 72.46377% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.31%. Comparing base (70c7a72) to head (42c6c64).
⚠️ Report is 273 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/vtorc/inst/instance_dao.go 80.00% 5 Missing ⚠️
go/vt/vttablet/tabletmanager/rpc_replication.go 0.00% 5 Missing ⚠️
go/vt/vtorc/logic/topology_recovery.go 50.00% 4 Missing ⚠️
go/vt/mysqlctl/replication.go 86.36% 3 Missing ⚠️
go/vt/mysqlctl/fakemysqldaemon.go 0.00% 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (70c7a72) and HEAD (42c6c64). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (70c7a72) HEAD (42c6c64)
1 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #20169       +/-   ##
===========================================
- Coverage   69.67%   62.31%    -7.36%     
===========================================
  Files        1614      129     -1485     
  Lines      216793    21800   -194993     
===========================================
- Hits       151044    13585   -137459     
+ Misses      65749     8215    -57534     
Flag Coverage Δ
partial 62.31% <72.46%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 21 changed files in this pull request and generated 2 comments.

Files not reviewed (2)
  • go/vt/proto/replicationdata/replicationdata.pb.go: Language not supported
  • go/vt/proto/replicationdata/replicationdata_vtproto.pb.go: Language not supported

Comment on lines +151 to +155
// mysqld self-kills at the 600s threshold hardcoded in Oracle/Percona
// MySQL [1]. MariaDB exposes innodb_fatal_semaphore_wait_threshold to
// lower it without recompiling, but MariaDB is not supported by Vitess.
//
// [1] https://github.com/planetscale/mysql-server-private/blob/8.0/storage/innobase/srv/srv0srv.cc#L120
Comment on lines +826 to +859
// innodbLogTableOverride redirects HasRecentInnoDBLongSemaphoreWait's source
// table to a writable test table. E2E-only — a typo in production silently
// disables InnoDB-stall detection via the ERNoSuchTable soft-fail below.
var innodbLogTableOverride = os.Getenv("VTTABLET_E2E_INNODB_LOG_TABLE")

func init() {
if innodbLogTableOverride != "" {
log.Warn(fmt.Sprintf("VTTABLET_E2E_INNODB_LOG_TABLE=%q is set — HasRecentInnoDBLongSemaphoreWait will read from this table instead of performance_schema.error_log. This is for e2e tests only.", innodbLogTableOverride))
}
}

// HasRecentInnoDBLongSemaphoreWait reports whether MY-012985 was logged to
// performance_schema.error_log in the lookback window. Best-effort: if the
// table is unavailable (missing on MariaDB / pre-8.0, dba lacks SELECT,
// performance_schema disabled), returns (false, nil). Only unexpected errors
// propagate.
func (mysqld *Mysqld) HasRecentInnoDBLongSemaphoreWait(ctx context.Context, lookback time.Duration) (bool, error) {
conn, err := getPoolReconnect(ctx, mysqld.dbaPool)
if err != nil {
return false, err
}
defer conn.Recycle()

table := "performance_schema.error_log"
if innodbLogTableOverride != "" {
table = innodbLogTableOverride
}
query := fmt.Sprintf(
"SELECT 1 FROM %s "+
"WHERE logged >= NOW(6) - INTERVAL %d SECOND "+
"AND prio = 'Warning' AND subsystem = 'InnoDB' "+
"AND error_code = 'MY-012985' LIMIT 1",
table, int64(lookback.Seconds()),
)
Copy link
Copy Markdown
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MySQL-side signal does occur before the fatal crash path in stock MySQL 8.0/8.4, but not quite for the reason described here. The source logs the long semaphore warning after a fixed 4-minute wait, while the fatal path only starts after srv_fatal_semaphore_wait_threshold is exceeded, default 600s, and then requires repeated observations of the same waiter/semaphore. So MY-012985 is the right code, but a single row is a warning/diagnostic signal, not proof that mysqld is definitely about to crash.

IMO we would do better to try and detect underlying causes of the InnoDB stall and try to correct it by e.g. killing a transaction that is open, idle, and holding on to a lot of locks — rather than waiting 4 minutes while the process may be in a degraded state with things piling up.

IMO we should not ship the e2e table override in production code in go/vt/mysqlctl/replication.go:826-858. VTTABLET_E2E_INNODB_LOG_TABLE is test-only behavior in the production mysqlctl path, and its value is interpolated directly into FROM %s. If this env var is accidentally present or malformed in a real deployment, it can silently disable detection, query the wrong table, or report InnodbLongSemaphoreWaitSeen=true from arbitrary data and trigger ERS on an otherwise healthy primary. Since this signal can drive a destructive shard reparent, I don’t think we should carry this kind of test hook in production code. Please remove it, or at minimum make it fail-closed, validate/quote the identifier strictly, and only honor it in an explicitly test-only environment.

A single recent MY-012985 row is not enough to prove an active stall in go/vt/mysqlctl/replication.go:853-858
and go/vt/vtorc/inst/analysis_problem.go:163-164. I checked the MySQL/InnoDB side. The long semaphore warning is emitted before the fatal path, but it is not “fatal threshold / 2” and it does not guarantee mysqld will crash. In stock MySQL 8.0/8.4, the warning is emitted after a fixed 4-minute wait; the fatal path is separate and only fires after the fatal threshold is exceeded and the monitor repeatedly observes the same waiter/semaphore. With the current 60s lookback, a resolved or non-write-blocking long wait can remain visible long enough for VTOrc to ERS a reachable primary. I think this needs an additional confirmation gate, e.g. persistent/re-emitted warnings across polls and/or evidence that writes are actually blocked, before running ERS.

Nit, but we should update the MySQL comments / PR description based on go/vt/vtorc/inst/analysis_problem.go:149-155. The comment still references a private source URL and says the behavior is hardcoded in Oracle/Percona MySQL. The PR description is also stale: it still talks about lowering innodb_fatal_semaphore_wait_threshold and says there is no e2e test, but the final diff removed the my.cnf changes and added an e2e test. Please update this to describe the current behavior accurately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: VTOrc Vitess Orchestrator integration Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: VTOrc to detect InnoDB semaphore stalls on primary tablets

3 participants