Feature Request: VTOrc to detect "stalled" replicas where replication threads remain running #20056

### Feature Description

VTOrc decides if a replica's replication is healthy primarily by checking `Slave_IO_Running` and `Slave_SQL_Running`. If both are `Yes`, the replica is considered healthy

There is a class of failure where both threads remain `Yes` while the replica makes zero forward progress. The most common trigger is **a full disk**

#### Why this happens

When a MySQL replica's disk fills, behaviour depends on which filesystem hits `ENOSPC`:

1. **Relay log filesystem** — the IO thread reports `ER_REPLICA_RELAY_LOG_WRITE_FAILURE` and stops. `Slave_IO_Running` flips to `No` _(this is the case everyone expects)_
2. **InnoDB filesystem** _(redo log, tablespace extension, undo, replica's own binlog group commit, applier metadata table, etc.)_ — InnoDB does **not** fail-fast. It logs `Disk is full. Try to clean the disk to free space.` exactly once, sets the global `os_has_said_disk_full` flag, and silently retries every subsequent failed write. The applier thread parks inside `ha_commit_trans()` in the `Waiting for handler commit` stage. The IO thread is unaffected and keeps queueing into the relay log

In case 2, `SHOW REPLICA STATUS` shows both threads `Yes`, empty `Last_*_Error`, and `Seconds_Behind_Source` climbing forever 😱. To VTOrc, this is a healthy replica
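To make the blind spot concrete, here is a minimal Go sketch of a check that looks only at the two thread flags and the error fields. The `replicaStatus` type and `threadsLookHealthy` function are hypothetical names used for illustration; this is not VTOrc's actual code, only an approximation of the thread-flag logic described above.

```go
package main

import "fmt"

// replicaStatus holds the few SHOW REPLICA STATUS fields discussed above.
// Hypothetical type for illustration only; not a VTOrc structure.
type replicaStatus struct {
	IORunning           bool   // Slave_IO_Running is "Yes"
	SQLRunning          bool   // Slave_SQL_Running is "Yes"
	LastError           string // any non-empty Last_*_Error
	SecondsBehindSource int64  // Seconds_Behind_Source
}

// threadsLookHealthy approximates a thread-flag health check:
// both threads running and no error reported. Lag is not consulted.
func threadsLookHealthy(s replicaStatus) bool {
	return s.IORunning && s.SQLRunning && s.LastError == ""
}

func main() {
	// Case 1: relay log filesystem full, IO thread has stopped with an error.
	relayLogFull := replicaStatus{IORunning: false, SQLRunning: true, LastError: "relay log write failure"}
	// Case 2: InnoDB filesystem full, applier parked in commit, no error surfaced.
	innodbFull := replicaStatus{IORunning: true, SQLRunning: true, SecondsBehindSource: 86400}

	fmt.Println("case 1 looks healthy:", threadsLookHealthy(relayLogFull))
	fmt.Println("case 2 looks healthy:", threadsLookHealthy(innodbFull))
}
```

Under these assumptions, case 1 is flagged and case 2 passes the check, no matter how large `SecondsBehindSource` grows.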

#### Proposal: a new `ReplicationStalled` analysis code

Detection should be **stateless**, **single-poll**, and **self-healing** _(automatically clears the moment the applier resumes)_. The signal is a comparison of the `Disk is full` log entry timestamp with the applier's last successful commit timestamp:

```sql
SELECT 1
  FROM performance_schema.error_log el
 WHERE el.logged    >= NOW(6) - INTERVAL 1 HOUR
   AND el.prio       = 'Error'
   AND el.subsystem  = 'InnoDB'
   AND el.error_code IN ('MY-012814', 'MY-012820')   -- ER_IB_MSG_814 / ER_IB_MSG_820
   AND el.logged > COALESCE(
         (SELECT MAX(LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP)
            FROM performance_schema.replication_applier_status_by_worker),
         '1970-01-01')
 LIMIT 1;
```

A returned row means that a `Disk is full` entry was logged within the last hour and the applier has not committed since, so it is still wedged. Every predicate is index-served. The query works for both parallel and single-threaded replication _(in single-threaded mode the worker table has one synthetic row for the SQL thread, see [`table_replication_applier_status_by_worker.cc:411-413`](https://github.com/percona/percona-server/blob/release-8.4.7-7/storage/perfschema/table_replication_applier_status_by_worker.cc))_

#### Recovery

There is not much VTOrc-shaped action available — freeing disk space is an operator concern. A "ballast file" deleted via RPC is overkill 👎. For this initial proposal, surfacing the problem is enough — an operator paged with `ReplicationStalled (disk full)` is a meaningful improvement over `Slave_IO_Running: Yes` and silence

Surfaced during testing of #20015 and #19925. Confirmed against Percona Server 8.4.7-7

Your thoughts are appreciated 🙏

### Use Case(s)

Operators of Vitess clusters running on hosts where a disk-full event is possible _(i.e., everyone)_. Today, a replica wedged by `ENOSPC` sits silently in a healthy-looking state until lag-based alerts fire — but lag-based alerts are noisy and don't pinpoint root cause

