VTOrc: add InnoDBStalledPrimary to detect stalled InnoDB semaphores#20169
VTOrc: add InnoDBStalledPrimary to detect stalled InnoDB semaphores#20169timvaillancourt wants to merge 5 commits into
InnoDBStalledPrimary to detect stalled InnoDB semaphores#20169Conversation
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Treat InnoDBStalledPrimary the same as PrimarySemiSyncBlocked and PrimaryDiskStalled in shardWideRecoveryIgnoredTablets — the primary is reachable (LastCheckValid is true), so a stale MY-012985 row in SQLite could otherwise drive an unnecessary ERS once the warning has aged out of the 60s lookback window or the latch has cleared. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
…atus `HasRecentInnoDBLongSemaphoreWait` now soft-fails to `(false, nil)` for the common config-state errors — missing table (ERNoSuchTable), missing performance_schema database (ERBadDb), or insufficient grants (ERTableAccessDenied / ERAccessDeniedError / ERDBAccessDenied) — so older deployments without the SELECT grant on `performance_schema.error_log` don't break VTOrc discovery on upgrade. Truly unexpected errors still propagate. `FullStatus` no longer fails on any error from this auxiliary check. A non-soft-failed error is logged once at warning level (slog-style) and `InnodbLongSemaphoreWaitSeen` is set to `false`, so the rest of the status fields still report. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
stock Oracle MySQL 8.x/9.x doesn't have this variable — it's Percona Server only. CI runs against stock MySQL and was failing mysqld startup with [ERROR] [MY-000067] unknown variable 'innodb_fatal_semaphore_wait_threshold=240'. With loose_, stock MySQL warns and continues; Percona Server applies the tuning. On stock MySQL the InnoDBStalledPrimary analyser still works, just at the compiled-in (600s) detection floor instead of 240s. Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
|
Promptless prepared documentation updates related to this change. Triggered by vitessio/vitess#20169 Added documentation for the new Updates:
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #20169 +/- ##
===========================================
- Coverage 69.67% 62.31% -7.36%
===========================================
Files 1614 129 -1485
Lines 216793 21800 -194993
===========================================
- Hits 151044 13585 -137459
+ Misses 65749 8215 -57534
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 21 changed files in this pull request and generated 2 comments.
Files not reviewed (2)
- go/vt/proto/replicationdata/replicationdata.pb.go: Language not supported
- go/vt/proto/replicationdata/replicationdata_vtproto.pb.go: Language not supported
| // mysqld self-kills at the 600s threshold hardcoded in Oracle/Percona | ||
| // MySQL [1]. MariaDB exposes innodb_fatal_semaphore_wait_threshold to | ||
| // lower it without recompiling, but MariaDB is not supported by Vitess. | ||
| // | ||
| // [1] https://github.com/planetscale/mysql-server-private/blob/8.0/storage/innobase/srv/srv0srv.cc#L120 |
| // innodbLogTableOverride redirects HasRecentInnoDBLongSemaphoreWait's source | ||
| // table to a writable test table. E2E-only — a typo in production silently | ||
| // disables InnoDB-stall detection via the ERNoSuchTable soft-fail below. | ||
| var innodbLogTableOverride = os.Getenv("VTTABLET_E2E_INNODB_LOG_TABLE") | ||
|
|
||
| func init() { | ||
| if innodbLogTableOverride != "" { | ||
| log.Warn(fmt.Sprintf("VTTABLET_E2E_INNODB_LOG_TABLE=%q is set — HasRecentInnoDBLongSemaphoreWait will read from this table instead of performance_schema.error_log. This is for e2e tests only.", innodbLogTableOverride)) | ||
| } | ||
| } | ||
|
|
||
| // HasRecentInnoDBLongSemaphoreWait reports whether MY-012985 was logged to | ||
| // performance_schema.error_log in the lookback window. Best-effort: if the | ||
| // table is unavailable (missing on MariaDB / pre-8.0, dba lacks SELECT, | ||
| // performance_schema disabled), returns (false, nil). Only unexpected errors | ||
| // propagate. | ||
| func (mysqld *Mysqld) HasRecentInnoDBLongSemaphoreWait(ctx context.Context, lookback time.Duration) (bool, error) { | ||
| conn, err := getPoolReconnect(ctx, mysqld.dbaPool) | ||
| if err != nil { | ||
| return false, err | ||
| } | ||
| defer conn.Recycle() | ||
|
|
||
| table := "performance_schema.error_log" | ||
| if innodbLogTableOverride != "" { | ||
| table = innodbLogTableOverride | ||
| } | ||
| query := fmt.Sprintf( | ||
| "SELECT 1 FROM %s "+ | ||
| "WHERE logged >= NOW(6) - INTERVAL %d SECOND "+ | ||
| "AND prio = 'Warning' AND subsystem = 'InnoDB' "+ | ||
| "AND error_code = 'MY-012985' LIMIT 1", | ||
| table, int64(lookback.Seconds()), | ||
| ) |
mattlord
left a comment
There was a problem hiding this comment.
The MySQL-side signal does occur before the fatal crash path in stock MySQL 8.0/8.4, but not quite for the reason described here. The source logs the long semaphore warning after a fixed 4-minute wait, while the fatal path only starts after srv_fatal_semaphore_wait_threshold is exceeded, default 600s, and then requires repeated observations of the same waiter/semaphore. So MY-012985 is the right code, but a single row is a warning/diagnostic signal, not proof that mysqld is definitely about to crash.
IMO we would do better to try and detect underlying causes of the InnoDB stall and try to correct it by e.g. killing a transaction that is open, idle, and holding on to a lot of locks — rather than waiting 4 minutes while the process may be in a degraded state with things piling up.
IMO we should not ship the e2e table override in production code in go/vt/mysqlctl/replication.go:826-858. VTTABLET_E2E_INNODB_LOG_TABLE is test-only behavior in the production mysqlctl path, and its value is interpolated directly into FROM %s. If this env var is accidentally present or malformed in a real deployment, it can silently disable detection, query the wrong table, or report InnodbLongSemaphoreWaitSeen=true from arbitrary data and trigger ERS on an otherwise healthy primary. Since this signal can drive a destructive shard reparent, I don’t think we should carry this kind of test hook in production code. Please remove it, or at minimum make it fail-closed, validate/quote the identifier strictly, and only honor it in an explicitly test-only environment.
A single recent MY-012985 row is not enough to prove an active stall in go/vt/mysqlctl/replication.go:853-858
and go/vt/vtorc/inst/analysis_problem.go:163-164. I checked the MySQL/InnoDB side. The long semaphore warning is emitted before the fatal path, but it is not “fatal threshold / 2” and it does not guarantee mysqld will crash. In stock MySQL 8.0/8.4, the warning is emitted after a fixed 4-minute wait; the fatal path is separate and only fires after the fatal threshold is exceeded and the monitor repeatedly observes the same waiter/semaphore. With the current 60s lookback, a resolved or non-write-blocking long wait can remain visible long enough for VTOrc to ERS a reachable primary. I think this needs an additional confirmation gate, e.g. persistent/re-emitted warnings across polls and/or evidence that writes are actually blocked, before running ERS.
Nit, but we should update the MySQL comments / PR description based on go/vt/vtorc/inst/analysis_problem.go:149-155. The comment still references a private source URL and says the behavior is hardcoded in Oracle/Percona MySQL. The PR description is also stale: it still talks about lowering innodb_fatal_semaphore_wait_threshold and says there is no e2e test, but the final diff removed the my.cnf changes and added an e2e test. Please update this to describe the current behavior accurately.
Description
Implements the proposal in #20168 — a new
InnoDBStalledPrimaryanalysis code that detects MySQL primaries wedged on an InnoDB latch (wheremysqldis alive, accepting connections, and replicas still reportSlave_IO_Running: Yes/Slave_SQL_Running: Yes, but no writes are committing). Today VTOrc has no analyser that fires beforemysqld's eventual self-kill atinnodb_fatal_semaphore_wait_threshold, so the shard sits in zero-write outage for the full kill threshold plus restart time — easily 10+ minutes 😱Detection is the single-poll, stateless query from the issue, run on each tablet via the existing
FullStatusRPC:A row means
mysqlditself has classified an InnoDB latch wait as pathological (it only emitsMY-012985once a thread has already waited ≥innodb_fatal_semaphore_wait_threshold / 2). The signal is self-clearing — when the wait resolves andmysqldstops re-emitting, the row falls out of the 60s window on the next pollWhat changed
HasRecentInnoDBLongSemaphoreWait(ctx, lookback)onMysqlDaemon— runs the query and returns the bool. Wired throughFullStatus(newinnodb_long_semaphore_wait_seen = 26field) into VTOrc's instance-discovery path.ER_NO_SUCH_TABLE(missingperformance_schema.error_logon MariaDB / pre-8.0) soft-fails to(false, nil)so older flavours skip cleanly without log spamInnoDBStalledPrimaryanalysis code + matcher — fires whena.IsClusterPrimary && a.LastCheckValid && a.InnoDBLongSemaphoreWaitSeen && a.CountValidReplicas > 0. PrioritydetectionAnalysisPriorityShardWideAction. UnlikePrimaryDiskStalled,LastCheckValidmust betrue— the whole point is thatmysqldis still answering RPCs while internally wedged. TheCountValidReplicas > 0gate prevents ERS firing on a stuck primary with nowhere to go. DeclaresBeforeAnalyses: [DeadPrimary*]so it wins same-tablet selection if the primary later becomes unreachable mid-stallRecovery wiring —
InnoDBStalledPrimaryis added togetCheckAndRecoverFunctionCode's ERS case alongsidePrimaryDiskStalledandPrimarySemiSyncBlocked. ReusesrecoverDeadPrimaryFuncunchanged — there is no other shape of recovery for an InnoDB-wedged primaryinnodb_fatal_semaphore_wait_threshold = 240— lowered frommysqld's default of600inmysql80.cnf/mysql8026.cnf/mysql84.cnf/mysql90.cnf. With the stock default,MY-012985only fires at T=300s, giving the analyser nothing to act on for the first 5 minutes of a real stall. At240: warning at T=120s, self-kill at T=240s — analyser detects in ~2 minutes and ERS finishes well beforemysqldwould self-kill. Not touched:default.cnf(MariaDB doesn't have this variable),mariadb10.cnf,mysql57.cnf(MY-012985 didn't exist before 8.0). Operators override viaEXTRA_MY_CNFWhy no consecutive-poll hysteresis
An earlier draft of this (and the issue's original proposal) gated ERS behind "the analysis has held across two consecutive polls" to avoid acting on a single transient warning. I removed that after thinking about it more carefully:
MY-012985isn't a transient signal. By the timemysqldemits a single row, a thread has already waited ≥innodb_fatal_semaphore_wait_threshold / 2(120s with the new default, 300s on stock) —mysqld's own monitor has formally declared the wait pathological. Adding another full poll-cycle of write outage to "confirm" whatmysqldhas already endorsed buys little safety on top of the existing safety gates (60s lookback window self-clears the signal when the stall resolves;CountValidReplicas > 0prevents reparenting nowhere; ERS itself won't promote an unviable replica)Why no end-to-end test
Same shape as #20058's note for the disk-stall e2e: we can't write to
performance_schema.error_log(it's a read-only view overmysqld's in-process structured log buffer), and we can't reliably synthesize a real InnoDB latch stall in CI. Even if we could, the warning only fires atinnodb_fatal_semaphore_wait_threshold / 2— minimum 60s of test wait with our new defaultThe existing
TestRecoveryDeadlockscomment (lines 989–1001 ofgo/test/endtoend/vtorc/general/vtorc_test.go) already calls out the same gap forPrimaryDiskStalled— synthetic fault injection flips the matcher field but notLastCheckValid. Coverage comes from the unit tests belowTesting
Unit tests:
TestHasRecentInnoDBLongSemaphoreWaitingo/vt/mysqlctl/replication_test.go— 4 sub-tests: returnstrueon a matching row,falseon zero rows,(false, nil)onER_NO_SUCH_TABLE, propagates any other DB errorTestGetDetectionAnalysisDecision/InnoDBStalledPrimary_*inanalysis_dao_test.go— 3 sub-tests: fires when all gates are satisfied, does not fire withCountValidReplicas: 0,DeadPrimarywins viaBeforeAnalyseswhen the primary becomes unreachable while still showing the warningRelated Issue(s)
Resolves: #20168
Related: #20058
Checklist
Deployment Notes
Two user-visible behavioural changes for any Vitess-managed MySQL 8.0+ instance:
New analyser surfaces automatically —
InnoDBStalledPrimaryfires whenmysqldlogsMY-012985, gated onCountValidReplicas > 0. ERS reuses the existingrecoverDeadPrimaryFuncpath. No new flags. Thedbauser needsSELECTonperformance_schema.error_logfor the check to run — denied access surfaces as a soft error and the check returns(false, nil)for that pollinnodb_fatal_semaphore_wait_thresholddefault lowered to240— MySQL warns at T=120s (was T=300s) and self-kills at T=240s (was T=600s). Halves the no-ERS-available runway too, so operators on shards without a valid replica have less time to react beforemysqldself-kills. Override viaEXTRA_MY_CNFto restore the stock value or pick a different oneNo-op on MariaDB and MySQL/Percona < 8.0 — the capability gate skips the query and the
my.cnfchange is scoped to MySQL 8.0+ flavour files onlyAI Disclosure
Claude Code assisted with development and testing. Claude prepared this PR summary