Feature Request: VTOrc to detect InnoDB semaphore stalls on primary tablets #20168

### Feature Description

VTOrc currently relies on `DeadPrimary` — primary unreachable AND no replica replicating — to fire an Emergency Reparent Shard. This works well for crashed or partitioned primaries, but completely misses a class of failure where mysqld is alive, accepting connections, replicating to its replicas, but has stopped making forward progress because InnoDB is internally wedged

The most common trigger we've observed: an InnoDB latch stall during an `ALGORITHM=INSTANT` ALTER. mysqld's own `srv_monitor_thread` eventually detects this AND self-kills at `innodb_fatal_semaphore_wait_threshold` _(600s default)_, but VTOrc has no analyser that fires _before_ the self-kill. The resulting write outage equals the full `innodb_fatal_semaphore_wait_threshold` plus mysqld restart time plus the VTOrc poll interval — easily 10+ minutes of zero successful writes on an otherwise-healthy shard, even when healthy semi-sync replicas are available to reparent to 😱
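The outage arithmetic above can be sketched as a rough calculation. Only the 600s threshold comes from this issue; the restart time and poll interval below are illustrative assumptions, not measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StallTimeline:
    """Rough timeline of a wedged primary, in seconds since stall onset."""

    fatal_threshold_s: float = 600.0  # default cited in this issue
    restart_s: float = 60.0           # assumption: mysqld restart + crash recovery
    poll_interval_s: float = 15.0     # assumption: VTOrc discovery poll interval

    def outage_without_analyser_s(self) -> float:
        # mysqld self-kills at the threshold, restarts, and VTOrc needs
        # up to one more poll before it can act on the new state.
        return self.fatal_threshold_s + self.restart_s + self.poll_interval_s


timeline = StallTimeline()
minutes = timeline.outage_without_analyser_s() / 60
print(f"write outage without analyser: {minutes:.2f} min")  # 11.25 min
```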

#### What this hang looks like from outside

While InnoDB is stuck:

1. The primary's TCP listener still accepts connections AND vttablet RPCs still return — `DeadPrimary` does not fire
2. `SHOW REPLICA STATUS` on replicas still reports both threads `Yes` with `Seconds_Behind_Source = 0` _(there is nothing new in the binlog, so the replicas are "caught up")_ — `ReplicationStopped` does not fire
3. `Rpl_semi_sync_master_wait_sessions` is `0` because transactions can't reach the commit/semi-sync stage — `LockedSemiSyncPrimary` does not fire
4. `Problems:[]` in every VTOrc discovery poll, with sub-10ms `LastDiscoveryLatency`

Meanwhile on the primary, vttablet logs `PoolFull: Code: RESOURCE_EXHAUSTED` as writes back up against the wedged latch AND become completely unavailable to clients
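To make the blind spot concrete, here is a minimal sketch of the three checks as described above. The predicates are heavily simplified stand-ins written for illustration (the `ShardSnapshot` type and `problems` function are hypothetical), not VTOrc's actual analysis conditions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ShardSnapshot:
    primary_reachable: bool
    replicas_replicating: int     # replicas with both replication threads "Yes"
    replicas_total: int
    semi_sync_wait_sessions: int  # Rpl_semi_sync_master_wait_sessions on the primary


def problems(s: ShardSnapshot) -> List[str]:
    """Simplified versions of the existing checks, for illustration only."""
    found = []
    if not s.primary_reachable and s.replicas_replicating == 0:
        found.append("DeadPrimary")
    if s.replicas_replicating < s.replicas_total:
        found.append("ReplicationStopped")
    if s.semi_sync_wait_sessions > 0:
        found.append("LockedSemiSyncPrimary")
    return found


# A wedged primary: reachable, replicas "caught up", nothing waiting on semi-sync.
wedged = ShardSnapshot(
    primary_reachable=True,
    replicas_replicating=2,
    replicas_total=2,
    semi_sync_wait_sessions=0,
)
print(problems(wedged))  # []
```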

#### Why not `_vt.heartbeat` staleness?

It's the obvious "primary not committing" proxy, but the signal is too noisy for a destructive action like ERS: heartbeat writes also fail for vttablet pool exhaustion, brief network blips, restarts, idle-conn timeouts, etc. The `MY-012985` warning is the better trigger because mysqld _itself_ raises it AND independently escalates the same condition to `[FATAL]` once the wait reaches the full `innodb_fatal_semaphore_wait_threshold` _(2x the point at which the warning starts)_, so we're acting on a signal mysqld has already endorsed. Heartbeat staleness is fine as a supporting gate, not as the primary trigger

#### Proposal: a new `InnoDBStalled` analysis code

Detection should be **grounded in InnoDB's own signal** — the same one that drives mysqld's eventual self-kill — rather than indirect proxies like heartbeat staleness OR write-throughput rate _(hard to threshold across workloads)_

`srv_monitor_thread` emits a structured warning to the error log when a semaphore wait crosses `innodb_fatal_semaphore_wait_threshold/2` _(~300s default)_:

```
[Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 129588379367104 has waited at ha_innodb.cc line 7233 for 922 seconds the semaphore:
```

This warning **repeats** while the wait persists _(emitted on each InnoDB monitor pass, typically every 5–15 s)_ AND lands in `performance_schema.error_log`, which is independent of InnoDB latches AND remains readable while the data plane is wedged:

```sql
SELECT 1
  FROM performance_schema.error_log el
 WHERE el.logged    >= NOW(6) - INTERVAL 60 SECOND
   AND el.prio       = 'Warning'
   AND el.subsystem  = 'InnoDB'
   AND el.error_code = 'MY-012985'   -- ER_IB_MSG_785, "A long semaphore wait"
 LIMIT 1;
```

A returned row means: InnoDB has formally declared a thread stuck on a latch for ≥ half the fatal threshold AND the warning has been re-emitted within the 60-second lookback window. The signal self-clears once the wait resolves AND the last warning ages out of that window, OR when mysqld self-kills _(at which point `DeadPrimary` takes over)_
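For anyone who wants to reason about the predicate outside of MySQL, the same filter can be sketched over rows already fetched from `performance_schema.error_log`. The `ErrorLogRow` type and `innodb_stalled` function are hypothetical names for illustration; this mirrors the query's conditions AND is not a proposed VTOrc implementation.

```python
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple


class ErrorLogRow(NamedTuple):
    logged: datetime
    prio: str
    subsystem: str
    error_code: str


def innodb_stalled(
    rows: Iterable[ErrorLogRow],
    now: datetime,
    lookback: timedelta = timedelta(seconds=60),
) -> bool:
    """True if a long-semaphore-wait warning was logged within the lookback window."""
    cutoff = now - lookback
    return any(
        r.logged >= cutoff
        and r.prio == "Warning"
        and r.subsystem == "InnoDB"
        and r.error_code == "MY-012985"
        for r in rows
    )


now = datetime(2025, 1, 1, 12, 0, 0)
recent = ErrorLogRow(now - timedelta(seconds=12), "Warning", "InnoDB", "MY-012985")
stale = ErrorLogRow(now - timedelta(seconds=90), "Warning", "InnoDB", "MY-012985")
print(innodb_stalled([recent], now))  # True
print(innodb_stalled([stale], now))   # False
```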

#### Recovery

The recovery is exactly what ERS already does — promote a replica. Suggested gating before issuing ERS:

1. `CountValidReplicas >= MinReplicasForReparent` — i.e., there is somewhere safe to reparent to
2. _(optional, more conservative)_ At least one of the following holds: `Rpl_semi_sync_master_wait_sessions > 0`, OR the vttablet `PoolFull` counter is non-zero AND climbing, i.e., we have evidence that real writes are queued behind the stuck latch. This rules out InnoDB stalls during pure-read workloads where ERS is unhelpful
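The two gates above could be combined roughly as follows. This is a sketch under assumptions: `GateInputs`, `writes_are_queued` AND `should_reparent` are hypothetical names, AND "climbing" is interpreted here as the `PoolFull` counter having increased since the previous poll.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    count_valid_replicas: int
    min_replicas_for_reparent: int
    semi_sync_wait_sessions: int
    pool_full_previous: int  # PoolFull counter at the previous poll
    pool_full_current: int   # PoolFull counter at this poll


def writes_are_queued(g: GateInputs) -> bool:
    # Assumption: "non-zero AND climbing" means the counter grew since the last poll.
    pool_full_climbing = (
        g.pool_full_current > 0 and g.pool_full_current > g.pool_full_previous
    )
    return g.semi_sync_wait_sessions > 0 or pool_full_climbing


def should_reparent(
    stalled: bool, g: GateInputs, require_write_evidence: bool = True
) -> bool:
    if not stalled:
        return False
    # Gate 1: there must be somewhere safe to reparent to.
    if g.count_valid_replicas < g.min_replicas_for_reparent:
        return False
    # Gate 2 (optional): skip stalls where no writes are backing up.
    if require_write_evidence and not writes_are_queued(g):
        return False
    return True


backed_up = GateInputs(2, 1, 0, pool_full_previous=40, pool_full_current=95)
read_only = GateInputs(2, 1, 0, pool_full_previous=0, pool_full_current=0)
print(should_reparent(True, backed_up))  # True
print(should_reparent(True, read_only))  # False
```

Passing `require_write_evidence=False` reduces this to the first gate only, which matches the less conservative variant.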

A single row matching the query above is sufficient to fire; no consecutive-poll hysteresis is needed. `MY-012985` is not a transient signal: mysqld only emits it after a thread has already waited ≥ `innodb_fatal_semaphore_wait_threshold / 2` _(300s default)_, so the warning already carries the full pre-filter mysqld considers necessary to declare the wait pathological. The 60-second lookback window in the query gives the natural clearing buffer: when the stall resolves AND mysqld stops re-emitting, the last matching row ages out of the window within 60 seconds, so the next poll after that returns nothing
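The clearing behaviour of the lookback window can be illustrated with a small timeline. The emission cadence (every 10s) AND the point at which the stall resolves (400s) are made-up values for the example; `signal_active` is a hypothetical helper, AND times are seconds since stall onset.

```python
from typing import Iterable


def signal_active(
    emitted_at: Iterable[float], poll_at: float, lookback_s: float = 60.0
) -> bool:
    """True if any warning was emitted in the lookback window ending at poll_at."""
    cutoff = poll_at - lookback_s
    return any(cutoff <= t <= poll_at for t in emitted_at)


# Warning first appears at 300s and repeats every 10s until the stall resolves at 400s.
emissions = [300.0 + 10.0 * i for i in range(11)]  # 300, 310, ..., 400

for poll_at in (290.0, 305.0, 455.0, 461.0):
    print(poll_at, signal_active(emissions, poll_at))
# 290.0 False  (before the first warning)
# 305.0 True   (warning just started)
# 455.0 True   (last warning at 400s is still inside the window)
# 461.0 False  (last warning has aged out)
```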

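The warning code `MY-012985` _(`ER_IB_MSG_785`)_ exists in MySQL 8.0+ AND Percona Server 8.0+. Note that the `performance_schema.error_log` table the query reads from was added in MySQL 8.0.22, so earlier 8.0 releases would need a different way to read the error log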

Related: #20056 _(similar shape — new VTOrc analyser reading from `performance_schema.error_log` for a failure mode existing analysers miss)_

Your thoughts are appreciated 🙏

### Use Case(s)

Any Vitess cluster running on MySQL 8.0+ where mysqld can hit an InnoDB latch stall. Triggers we've seen OR are aware of:

- `ALGORITHM=INSTANT` ALTERs on tables with concurrent DML, exhausting an internal `dict_sys` OR `lock_sys` latch _(this is what motivated the request)_
- Long-running `information_schema` queries combined with concurrent schema changes
- Known MySQL bugs that hold `lock_sys` OR `log_sys` latches across operations

In all of these, the failure mode is identical: mysqld is reachable, replication is "healthy", but writes are at zero. Without an `InnoDBStalled` analyser, the cluster waits the full `innodb_fatal_semaphore_wait_threshold` _(default 600s)_ for mysqld to self-kill before any reparent can happen. With one, ERS can fire within ~30s of the warning starting, i.e. shortly after the half-threshold mark _(~300s default)_ instead of after the full threshold plus a mysqld restart, which roughly halves the write outage at default settings

