Feature Description
VTOrc currently relies on DeadPrimary (primary unreachable and no replica replicating) to fire an Emergency Reparent Shard (ERS). This works well for crashed or partitioned primaries, but completely misses a class of failure where mysqld is alive, accepting connections, and replicating to its replicas, but has stopped making forward progress because InnoDB is internally wedged.
The most common trigger we've observed: an InnoDB latch stall during an ALGORITHM=INSTANT ALTER. mysqld's own srv_monitor_thread eventually detects this and self-kills at innodb_fatal_semaphore_wait_threshold (600s default), but VTOrc has no analyser that fires before the self-kill. The resulting write outage equals the full innodb_fatal_semaphore_wait_threshold plus mysqld restart time plus the VTOrc poll interval: easily 10+ minutes of zero successful writes on an otherwise-healthy shard, even when healthy semi-sync replicas are available to reparent to 😱
What this hang looks like from outside
While InnoDB is stuck:
- The primary's TCP listener still accepts connections and vttablet RPCs still return, so DeadPrimary does not fire
- SHOW REPLICA STATUS on replicas still reports both threads Yes with Seconds_Behind_Source = 0 (there is nothing new in the binlog, so the replicas are "caught up"), so ReplicationStopped does not fire
- Rpl_semi_sync_master_wait_sessions is 0 because transactions can't reach the commit/semi-sync stage, so LockedSemiSyncPrimary does not fire
- Problems:[] in every VTOrc discovery poll, with sub-10ms LastDiscoveryLatency
Meanwhile on the primary, vttablet logs PoolFull: Code: RESOURCE_EXHAUSTED as writes back up against the wedged latch and become completely unavailable to clients.
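To make the blind spot concrete, here is a minimal sketch of how the three checks listed above evaluate against a stalled primary. The `ShardSnapshot` type and the predicates are hypothetical simplifications written for this issue, not VTOrc's actual analysis code; only the DeadPrimary condition is taken from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShardSnapshot:
    """Hypothetical, simplified view of what one discovery poll observes."""
    primary_reachable: bool
    replicas_replicating: int     # replicas reporting both threads "Yes"
    replicas_total: int
    semi_sync_wait_sessions: int  # Rpl_semi_sync_master_wait_sessions


def detect_problems(s: ShardSnapshot) -> List[str]:
    problems = []
    # DeadPrimary as described above: primary unreachable and no replica replicating.
    if not s.primary_reachable and s.replicas_replicating == 0:
        problems.append("DeadPrimary")
    # Simplified stand-in: at least one replica has a stopped replication thread.
    if s.replicas_replicating < s.replicas_total:
        problems.append("ReplicationStopped")
    # Simplified stand-in: sessions are parked waiting for a semi-sync ack.
    if s.semi_sync_wait_sessions > 0:
        problems.append("LockedSemiSyncPrimary")
    return problems


# The stalled primary: reachable, replicas "caught up", nothing waiting on semi-sync.
stalled = ShardSnapshot(primary_reachable=True, replicas_replicating=2,
                        replicas_total=2, semi_sync_wait_sessions=0)
print(detect_problems(stalled))  # []
```

Every input these checks look at is healthy during the stall, which is why the poll reports no problems.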
Why not _vt.heartbeat staleness?
It's the obvious "primary not committing" proxy, but the signal is too noisy for a destructive action like ERS: heartbeat writes also fail for vttablet pool exhaustion, brief network blips, restarts, idle-conn timeouts, etc. The MY-012985 warning is the better trigger because mysqld itself raises it and independently escalates the same condition to [FATAL] once the wait reaches the full innodb_fatal_semaphore_wait_threshold, i.e. twice the wait at which the warning first appears. We're acting on a signal mysqld has already endorsed. Heartbeat staleness is fine as a supporting gate, not as the primary trigger.
Proposal: a new InnoDBStalled analysis code
Detection should be grounded in InnoDB's own signal, the same one that drives mysqld's eventual self-kill, rather than indirect proxies like heartbeat staleness or write-throughput rate (hard to threshold across workloads).
srv_monitor_thread emits a structured warning to the error log when a semaphore wait crosses innodb_fatal_semaphore_wait_threshold/2 (~300s default):
[Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 129588379367104 has waited at ha_innodb.cc line 7233 for 922 seconds the semaphore:
This warning repeats while the wait persists (emitted on each InnoDB monitor pass, typically every 5–15 s) and lands in performance_schema.error_log, which is independent of InnoDB latches and remains readable while the data plane is wedged:
SELECT 1
FROM performance_schema.error_log el
WHERE el.logged >= NOW(6) - INTERVAL 60 SECOND
AND el.prio = 'Warning'
AND el.subsystem = 'InnoDB'
AND el.error_code = 'MY-012985' -- ER_IB_MSG_785, "A long semaphore wait"
LIMIT 1;
A returned row means: InnoDB has formally declared a thread stuck on a latch for ≥ half the fatal threshold and the warning has been re-emitted within the 60-second lookback window. The signal self-clears as soon as the wait resolves or mysqld self-kills (at which point DeadPrimary takes over).
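The filter in the query above can be mirrored in memory, which is handy for unit-testing the detection logic without a live mysqld. This is only a sketch: `ErrorLogRow` and `innodb_stalled` are hypothetical names, and the rows are assumed to have been fetched from performance_schema.error_log already.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class ErrorLogRow:
    """Hypothetical subset of the performance_schema.error_log columns used by the query."""
    logged: datetime
    prio: str
    subsystem: str
    error_code: str


LOOKBACK = timedelta(seconds=60)  # same window as the query


def innodb_stalled(rows: Iterable[ErrorLogRow], now: datetime) -> bool:
    """Return True if any row matches the same predicates as the SQL filter."""
    cutoff = now - LOOKBACK
    return any(
        r.logged >= cutoff
        and r.prio == "Warning"
        and r.subsystem == "InnoDB"
        and r.error_code == "MY-012985"
        for r in rows
    )


now = datetime(2024, 1, 1, 12, 0, 0)
fresh = ErrorLogRow(now - timedelta(seconds=10), "Warning", "InnoDB", "MY-012985")
stale = ErrorLogRow(now - timedelta(seconds=90), "Warning", "InnoDB", "MY-012985")
print(innodb_stalled([fresh], now))  # True
print(innodb_stalled([stale], now))  # False
```

The second call shows the self-clearing behaviour: once mysqld stops re-emitting the warning, the newest row ages past the cutoff and the check returns False.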
Recovery
The recovery is exactly what ERS already does — promote a replica. Suggested gating before issuing ERS:
- CountValidReplicas >= MinReplicasForReparent, i.e., there is somewhere safe to reparent to
- (optional, more conservative) At least one of: Rpl_semi_sync_master_wait_sessions > 0, or the vttablet PoolFull counter is non-zero and climbing, i.e., we have evidence that real writes are queued behind the stuck latch. This rules out InnoDB stalls during pure-read workloads where ERS is unhelpful
A single row matching the query above is sufficient to fire; no consecutive-poll hysteresis is needed. MY-012985 is not a transient signal: mysqld only emits it after a thread has already waited ≥ innodb_fatal_semaphore_wait_threshold / 2 (300s default), so the warning already carries the full pre-filter mysqld considers necessary to declare the wait pathological. The 60-second lookback window in the query gives the natural clearing buffer: when the stall resolves and mysqld stops re-emitting, the last row ages out of the window within 60 seconds.
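The gating described in this section could be sketched roughly as follows. The function and parameter names are hypothetical, and the optional gate is read as "wait sessions > 0, or (PoolFull counter non-zero and climbing)", which is an interpretation of the wording above rather than a settled design.

```python
def should_fire_ers(
    stall_row_seen: bool,
    valid_replicas: int,
    min_replicas_for_reparent: int,
    semi_sync_wait_sessions: int = 0,
    pool_full_prev: int = 0,
    pool_full_now: int = 0,
    require_write_evidence: bool = False,
) -> bool:
    """Decide whether an InnoDBStalled detection should trigger ERS (sketch only)."""
    # A single matching error_log row is enough; no consecutive-poll hysteresis.
    if not stall_row_seen:
        return False
    # There must be somewhere safe to reparent to.
    if valid_replicas < min_replicas_for_reparent:
        return False
    if require_write_evidence:
        # Optional conservative gate: evidence that real writes are queued.
        pool_full_climbing = pool_full_now > 0 and pool_full_now > pool_full_prev
        return semi_sync_wait_sessions > 0 or pool_full_climbing
    return True


print(should_fire_ers(True, valid_replicas=2, min_replicas_for_reparent=1))  # True
print(should_fire_ers(True, 2, 1, require_write_evidence=True))              # False
print(should_fire_ers(True, 2, 1, pool_full_prev=4, pool_full_now=9,
                      require_write_evidence=True))                          # True
```

With the conservative gate enabled, a stall on a shard with no queued writes does not trigger a reparent, which matches the pure-read case called out above.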
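The warning code MY-012985 (ER_IB_MSG_785) exists in MySQL 8.0+ and Percona Server 8.0+. The performance_schema.error_log table that the query reads was added in MySQL 8.0.22, so detection needs at least that release.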
Related: #20056 (similar shape — new VTOrc analyser reading from performance_schema.error_log for a failure mode existing analysers miss)
Your thoughts are appreciated 🙏
Use Case(s)
Any Vitess cluster running on MySQL 8.0+ where mysqld can hit an InnoDB latch stall. Triggers we've seen or are aware of:
- ALGORITHM=INSTANT ALTERs on tables with concurrent DML, exhausting an internal dict_sys or lock_sys latch (this is what motivated the request)
- Long-running information_schema queries combined with concurrent schema changes
- Known MySQL bugs that hold lock_sys or log_sys latches across operations
In all of these, the failure mode is identical: mysqld is reachable, replication is "healthy", but writes are at zero. Without an InnoDBStalled analyser, the cluster waits the full innodb_fatal_semaphore_wait_threshold (default 600s) for mysqld to self-kill before any reparent can happen. With one, ERS can fire within ~30s of the warning starting. Because the warning first appears at half the threshold, that is roughly 330s into the stall at defaults rather than 600s plus mysqld restart time, and both figures shrink on clusters that lower innodb_fatal_semaphore_wait_threshold.
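Measured from the moment the stall begins (not from the first warning), the two timelines can be compared with simple arithmetic. The sketch below uses the figures quoted in this issue (600s threshold, warning at half the threshold, ~30s to fire ERS); the 60s restart time and 15s poll interval are illustrative assumptions, not measurements.

```python
def outage_without_analyser(fatal_threshold_s: float, restart_s: float,
                            poll_interval_s: float) -> float:
    """Stall start until a reparent is possible: wait for self-kill, restart, next poll."""
    return fatal_threshold_s + restart_s + poll_interval_s


def outage_with_analyser(fatal_threshold_s: float, detect_delay_s: float) -> float:
    """Stall start until ERS fires: the warning appears at threshold / 2, then detection."""
    warning_at_s = fatal_threshold_s / 2
    return warning_at_s + detect_delay_s


# Defaults from this issue; restart_s and poll_interval_s are assumed values.
print(outage_without_analyser(600, restart_s=60, poll_interval_s=15))  # 675
print(outage_with_analyser(600, detect_delay_s=30))                    # 330.0
```

Lowering innodb_fatal_semaphore_wait_threshold scales both numbers down, since the warning point is tied to the threshold.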