Feature Request: VTOrc to detect InnoDB semaphore stalls on primary tablets #20168

@timvaillancourt

Feature Description

VTOrc currently relies on DeadPrimary (primary unreachable and no replica replicating) to fire an Emergency Reparent Shard (ERS). This works well for crashed or partitioned primaries, but it completely misses a class of failure where mysqld is alive, accepting connections, and replicating to its replicas, yet has stopped making forward progress because InnoDB is internally wedged

The most common trigger we've observed: an InnoDB latch stall during an ALGORITHM=INSTANT ALTER. mysqld's own srv_monitor_thread eventually detects this and self-kills at innodb_fatal_semaphore_wait_threshold (600s default), but VTOrc has no analyser that fires before the self-kill. The resulting write outage equals the full innodb_fatal_semaphore_wait_threshold plus mysqld restart time plus the VTOrc poll interval: easily 10+ minutes of zero successful writes on an otherwise-healthy shard, even when healthy semi-sync replicas are available to reparent to 😱
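To make the arithmetic concrete, here is a minimal Go sketch of that outage budget. The function names are hypothetical, and the restart time and poll interval are inputs because this issue gives no fixed value for either. The half-threshold warning point is the one described under the proposal further down.

```go
package main

import "time"

// outageWithoutAnalyser models the write outage when nothing acts before
// mysqld's self-kill: the full fatal threshold, plus the mysqld restart,
// plus one VTOrc poll interval. restart and pollInterval are assumptions
// supplied by the caller.
func outageWithoutAnalyser(fatalThreshold, restart, pollInterval time.Duration) time.Duration {
	return fatalThreshold + restart + pollInterval
}

// firstWarningAt returns how far into the stall InnoDB first logs the
// long-semaphore-wait warning, taken here as half the fatal threshold.
func firstWarningAt(fatalThreshold time.Duration) time.Duration {
	return fatalThreshold / 2
}
```

For example, with the 600s default, an assumed 45s restart and an assumed 15s poll interval, outageWithoutAnalyser returns 11m0s, while firstWarningAt returns 5m0s.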

What this hang looks like from outside

While InnoDB is stuck:

  1. The primary's TCP listener still accepts connections and vttablet RPCs still return, so DeadPrimary does not fire
  2. SHOW REPLICA STATUS on replicas still reports both threads Yes with Seconds_Behind_Source = 0 (there is nothing new in the binlog, so the replicas are "caught up"), so ReplicationStopped does not fire
  3. Rpl_semi_sync_master_wait_sessions is 0 because transactions can't reach the commit/semi-sync stage, so LockedSemiSyncPrimary does not fire
  4. Every VTOrc discovery poll reports Problems:[] with sub-10ms LastDiscoveryLatency
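
The four observations above can be restated as a small Go sketch. The struct, its field names, and the three conditions are simplified paraphrases of this issue's descriptions, not VTOrc's actual analysis code. The point is only that a wedged-but-reachable primary satisfies none of them.

```go
package main

// primaryObservation holds the externally visible signals listed above.
// All field names are hypothetical, chosen for this illustration.
type primaryObservation struct {
	PrimaryReachable     bool  // TCP listener and vttablet RPCs respond
	ReplicasReplicating  int   // replicas with both replication threads running
	ReplicasTotal        int   // replicas in the shard
	SemiSyncWaitSessions int64 // Rpl_semi_sync_master_wait_sessions
}

// firingAnalyses returns the names of the existing analyses that would fire,
// using simplified conditions paraphrased from this issue.
func firingAnalyses(o primaryObservation) []string {
	var fired []string
	if !o.PrimaryReachable && o.ReplicasReplicating == 0 {
		fired = append(fired, "DeadPrimary")
	}
	if o.ReplicasReplicating < o.ReplicasTotal {
		fired = append(fired, "ReplicationStopped")
	}
	if o.SemiSyncWaitSessions > 0 {
		fired = append(fired, "LockedSemiSyncPrimary")
	}
	return fired
}
```

With a reachable primary, all replicas replicating and zero semi-sync wait sessions, firingAnalyses returns an empty slice, which matches the Problems:[] seen during the hang.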

Meanwhile on the primary, vttablet logs PoolFull: Code: RESOURCE_EXHAUSTED as writes back up against the wedged latch and become completely unavailable to clients

Why not _vt.heartbeat staleness?

It's the obvious "primary not committing" proxy, but the signal is too noisy for a destructive action like ERS: heartbeat writes also fail for vttablet pool exhaustion, brief network blips, restarts, idle-conn timeouts, etc. The MY-012985 warning is the better trigger because mysqld itself raises it and independently escalates the same condition to [FATAL] once the wait reaches the full innodb_fatal_semaphore_wait_threshold (2x the wait at which the warning starts), so we're acting on a signal mysqld has already endorsed. Heartbeat staleness is fine as a supporting gate, not as the primary trigger

Proposal: a new InnoDBStalled analysis code

Detection should be grounded in InnoDB's own signal (the same one that drives mysqld's eventual self-kill) rather than indirect proxies like heartbeat staleness or write-throughput rate, which is hard to threshold across workloads

srv_monitor_thread emits a structured warning to the error log when a semaphore wait crosses innodb_fatal_semaphore_wait_threshold/2 (~300s default):

[Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 129588379367104 has waited at ha_innodb.cc line 7233 for 922 seconds the semaphore:

This warning repeats while the wait persists (emitted on each InnoDB monitor pass, typically every 5–15 s) and lands in performance_schema.error_log, which is independent of InnoDB latches and remains readable while the data plane is wedged:

SELECT 1
  FROM performance_schema.error_log el
 WHERE el.logged    >= NOW(6) - INTERVAL 60 SECOND
   AND el.prio       = 'Warning'
   AND el.subsystem  = 'InnoDB'
   AND el.error_code = 'MY-012985'   -- ER_IB_MSG_785, "A long semaphore wait"
 LIMIT 1;

A returned row means: InnoDB has formally declared a thread stuck on a latch for ≥ half the fatal threshold and the warning has been re-emitted within the last 60 seconds. The signal self-clears as soon as the wait resolves or mysqld self-kills (at which point DeadPrimary takes over)
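
A minimal Go sketch of how a caller might run this check with the standard database/sql package. Assumptions: the function names are hypothetical, the lookback is passed as a bind parameter instead of the literal 60, a MySQL driver is registered elsewhere, and direct SQL access is used purely for illustration (this issue does not specify how VTOrc would collect the signal).

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"time"
)

// stallQuery is the detection query from this issue, with the lookback
// window (in seconds) turned into a bind parameter.
// MY-012985 is ER_IB_MSG_785, "A long semaphore wait".
const stallQuery = `SELECT 1
  FROM performance_schema.error_log el
 WHERE el.logged    >= NOW(6) - INTERVAL ? SECOND
   AND el.prio       = 'Warning'
   AND el.subsystem  = 'InnoDB'
   AND el.error_code = 'MY-012985'
 LIMIT 1`

// lookbackSeconds converts the lookback window to whole seconds and rejects
// windows shorter than one second, which would never match a row.
func lookbackSeconds(d time.Duration) (int64, error) {
	if d < time.Second {
		return 0, errors.New("lookback must be at least one second")
	}
	return int64(d / time.Second), nil
}

// hasLongSemaphoreWait reports whether the warning was logged within the
// lookback window. No matching row is a normal "not stalled" result.
func hasLongSemaphoreWait(ctx context.Context, db *sql.DB, lookback time.Duration) (bool, error) {
	secs, err := lookbackSeconds(lookback)
	if err != nil {
		return false, err
	}
	var one int
	err = db.QueryRowContext(ctx, stallQuery, secs).Scan(&one)
	if errors.Is(err, sql.ErrNoRows) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```

Treating sql.ErrNoRows as a clean "false" keeps the healthy case out of the error path, so a real query failure stays distinguishable from the absence of a stall.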

Recovery

The recovery is exactly what ERS already does — promote a replica. Suggested gating before issuing ERS:

  1. CountValidReplicas >= MinReplicasForReparent, i.e., there is somewhere safe to reparent to
  2. (optional, more conservative) At least one of: Rpl_semi_sync_master_wait_sessions > 0, or the vttablet PoolFull counter is non-zero and climbing, i.e., we have evidence that real writes are queued behind the stuck latch. This rules out InnoDB stalls during pure-read workloads where ERS is unhelpful

A single row matching the query above is sufficient to fire; no consecutive-poll hysteresis is needed. MY-012985 is not a transient signal: mysqld only emits it after a thread has already waited ≥ innodb_fatal_semaphore_wait_threshold / 2 (300s default), so the warning already carries the full pre-filter mysqld considers necessary to declare the wait pathological. The 60-second lookback window in the query gives the natural clearing buffer: when the stall resolves and mysqld stops re-emitting, the last row ages out of the window within 60 seconds
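
The gating and single-row rule above could be sketched in Go roughly as follows. The type and field names are hypothetical. Only CountValidReplicas, MinReplicasForReparent, Rpl_semi_sync_master_wait_sessions and the PoolFull counter come from this issue, and "climbing" is interpreted here as strictly greater than the previous poll's value.

```go
package main

// stallSignals are the inputs to the suggested gating for one poll.
type stallSignals struct {
	WarningRowFound      bool  // the detection query returned a row
	ValidReplicas        int   // CountValidReplicas
	SemiSyncWaitSessions int64 // Rpl_semi_sync_master_wait_sessions
	PoolFullPrev         int64 // vttablet PoolFull counter at the previous poll
	PoolFullNow          int64 // vttablet PoolFull counter at this poll
}

// gateConfig holds the thresholds; RequireWriteEvidence enables the
// optional, more conservative second gate.
type gateConfig struct {
	MinReplicasForReparent int
	RequireWriteEvidence   bool
}

// shouldReparent applies the gates in order. A single matching row is
// enough: no consecutive-poll counter is kept.
func shouldReparent(s stallSignals, c gateConfig) bool {
	if !s.WarningRowFound {
		return false
	}
	if s.ValidReplicas < c.MinReplicasForReparent {
		return false
	}
	if !c.RequireWriteEvidence {
		return true
	}
	poolFullClimbing := s.PoolFullNow > 0 && s.PoolFullNow > s.PoolFullPrev
	return s.SemiSyncWaitSessions > 0 || poolFullClimbing
}
```

Keeping the function free of state across polls mirrors the argument that mysqld's own half-threshold wait already acts as the hysteresis.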

The warning code MY-012985 (ER_IB_MSG_785) exists in MySQL 8.0+ and Percona Server 8.0+

Related: #20056 (similar shape — new VTOrc analyser reading from performance_schema.error_log for a failure mode existing analysers miss)

Your thoughts are appreciated 🙏

Use Case(s)

Any Vitess cluster running on MySQL 8.0+ where mysqld can hit an InnoDB latch stall. Triggers we've seen or are aware of:

  • ALGORITHM=INSTANT ALTERs on tables with concurrent DML, exhausting an internal dict_sys or lock_sys latch (this is what motivated the request)
  • Long-running information_schema queries combined with concurrent schema changes
  • Known MySQL bugs that hold lock_sys or log_sys latches across operations

In all of these, the failure mode is identical: mysqld is reachable, replication is "healthy", but writes are at zero. Without an InnoDBStalled analyser, the cluster waits the full innodb_fatal_semaphore_wait_threshold (default 600s) for mysqld to self-kill before any reparent can happen. With one, ERS can fire within ~30s of the warning starting, which at the default threshold is roughly 330s into the stall rather than 600s plus mysqld restart time, cutting the write outage roughly in half
