Feature Request: VTOrc to detect "stalled" replicas where replication threads remain running #20056

@timvaillancourt

Feature Description

VTOrc decides if a replica's replication is healthy primarily by checking Slave_IO_Running and Slave_SQL_Running. If both are Yes, the replica is considered healthy

There is a class of failure where both threads remain Yes while the replica makes zero forward progress. The most common trigger is a full disk

Why this happens

When a MySQL replica's disk fills, behaviour depends on which filesystem hits ENOSPC:

  1. Relay log filesystem: the IO thread reports ER_REPLICA_RELAY_LOG_WRITE_FAILURE and stops. Slave_IO_Running flips to No (this is the case everyone expects)
  2. InnoDB filesystem (redo log, tablespace extension, undo, replica's own binlog group commit, applier metadata table, etc.): InnoDB does not fail fast. It logs "Disk is full. Try to clean the disk to free space." exactly once, sets the global os_has_said_disk_full flag, and silently retries every subsequent failed write. The applier thread parks inside ha_commit_trans() in the "Waiting for handler commit" stage. The IO thread is unaffected and keeps queueing into the relay log

In case 2, SHOW REPLICA STATUS shows both threads Yes, empty Last_*_Error, and Seconds_Behind_Source climbing forever 😱. To VTOrc, this is a healthy replica
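To make the gap concrete, here is a minimal sketch of a health check that looks only at thread state, applied to both cases above. The `ReplicaStatus` container, the function name, the error text, and the lag value are all hypothetical and chosen for illustration; this is not VTOrc's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicaStatus:
    """Subset of SHOW REPLICA STATUS fields (hypothetical container, not a Vitess type)."""
    io_running: str                # Slave_IO_Running
    sql_running: str               # Slave_SQL_Running
    last_error: str                # Last_*_Error, empty when nothing was recorded
    seconds_behind: Optional[int]  # Seconds_Behind_Source


def threads_look_healthy(status: ReplicaStatus) -> bool:
    """Thread-state-only check: healthy when both threads report Yes."""
    return status.io_running == "Yes" and status.sql_running == "Yes"


# Case 1: relay log filesystem is full, so the IO thread stops.
# The error text is a placeholder, not a real server message.
relay_log_full = ReplicaStatus("No", "Yes", "relay log write failure", None)

# Case 2: InnoDB filesystem is full; both threads stay Yes and no error is recorded.
# The lag value is made up.
innodb_full = ReplicaStatus("Yes", "Yes", "", 5400)

print(threads_look_healthy(relay_log_full))  # False
print(threads_look_healthy(innodb_full))     # True
```

The second result is the problem this issue describes: nothing in the thread flags or error fields separates the wedged replica from a healthy one.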

Proposal: a new ReplicationStalled analysis code

Detection should be stateless, single-poll, and self-healing (it clears automatically the moment the applier resumes). The signal is a comparison of the timestamp of the "Disk is full" log entry against the applier's last successful commit timestamp:

SELECT 1
  FROM performance_schema.error_log el
 WHERE el.logged    >= NOW(6) - INTERVAL 1 HOUR
   AND el.prio       = 'Error'
   AND el.subsystem  = 'InnoDB'
   AND el.error_code IN ('MY-012814', 'MY-012820')   -- ER_IB_MSG_814 / ER_IB_MSG_820
   AND el.logged > COALESCE(
         (SELECT MAX(LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP)
            FROM performance_schema.replication_applier_status_by_worker),
         '1970-01-01')
 LIMIT 1;

A returned row means a "Disk is full" message was logged and the applier has not committed since, so it is still wedged. Every predicate is index-served. Works for both parallel and single-threaded replication (in single-threaded mode the worker table has one synthetic row for the SQL thread, see table_replication_applier_status_by_worker.cc:411-413)
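The comparison the query performs can be sketched as a pure function over two timestamps, which may help when reasoning about the stateless and self-healing properties. The function and parameter names are hypothetical; the inputs stand for the most recent matching error_log entry and the MAX(LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP) value, and the one-hour window mirrors the query as written.

```python
from datetime import datetime, timedelta
from typing import Optional


def replication_stalled(
    disk_full_logged_at: Optional[datetime],
    last_applied_end_at: Optional[datetime],
    now: datetime,
    window: timedelta = timedelta(hours=1),
) -> bool:
    """Mirror of the query's predicates, evaluated from a single poll."""
    if disk_full_logged_at is None:
        return False  # no matching error_log entry
    if disk_full_logged_at < now - window:
        return False  # entry is older than the lookback window
    # COALESCE(..., '1970-01-01'): treat "never applied" as the epoch.
    baseline = last_applied_end_at if last_applied_end_at is not None else datetime(1970, 1, 1)
    return disk_full_logged_at > baseline


now = datetime(2025, 1, 1, 12, 0, 0)
disk_full = datetime(2025, 1, 1, 11, 30, 0)

# Applier last committed before the disk-full entry: still wedged.
print(replication_stalled(disk_full, datetime(2025, 1, 1, 11, 29, 0), now))  # True
# Applier committed after the entry: the condition clears with no stored state.
print(replication_stalled(disk_full, datetime(2025, 1, 1, 11, 45, 0), now))  # False
# Entry older than the window: no longer matched, even without a newer commit.
print(replication_stalled(datetime(2025, 1, 1, 10, 0, 0), datetime(2025, 1, 1, 9, 59, 0), now))  # False
```

The third call shows a consequence of the one-hour window combined with the message being logged only once: under this reading, a stall that lasts longer than the window would stop matching, so the window length is worth choosing deliberately.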

Recovery

There is not much VTOrc-shaped action available — freeing disk space is an operator concern. A "ballast file" deleted via RPC is overkill 👎. For this initial proposal, surfacing the problem is enough — an operator paged with ReplicationStalled (disk full) is a meaningful improvement over Slave_IO_Running: Yes and silence
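As a rough illustration of "surfacing only", the sketch below folds the stalled signal into an analysis result without attaching any recovery action. Only the ReplicationStalled name comes from this proposal; the other code strings, the function names, and the ordering of checks are assumptions made for the example and are not taken from VTOrc's source.

```python
def analyze_replica(io_running: bool, sql_running: bool, stalled: bool) -> str:
    """Return an analysis code string for one replica (hypothetical helper)."""
    if not (io_running and sql_running):
        # Thread-based detection keeps precedence; the name is illustrative.
        return "ReplicationStopped"
    if stalled:
        # Both threads are running but the applier is wedged.
        return "ReplicationStalled"
    return "NoProblem"  # illustrative name


def describe(code: str) -> str:
    """Operator-facing text; no recovery is attempted for ReplicationStalled."""
    if code == "ReplicationStalled":
        return "ReplicationStalled (disk full)"
    return code


print(describe(analyze_replica(True, True, stalled=True)))    # ReplicationStalled (disk full)
print(describe(analyze_replica(True, True, stalled=False)))   # NoProblem
print(describe(analyze_replica(False, True, stalled=False)))  # ReplicationStopped
```

Checking thread state first means the new code would only ever fire in the case that is invisible today, where both threads report Yes.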

Surfaced during testing of #20015 and #19925. Confirmed against Percona Server 8.4.7-7

Your thoughts are appreciated 🙏

Use Case(s)

Operators of Vitess clusters running on hosts where a disk-full event is possible (i.e., everyone). Today, a replica wedged by ENOSPC sits silently in a healthy-looking state until lag-based alerts fire — but lag-based alerts are noisy and don't pinpoint root cause
