Feature Description
VTOrc decides whether a replica's replication is healthy primarily by checking Slave_IO_Running and Slave_SQL_Running. If both are Yes, the replica is considered healthy.
There is a class of failure where both threads remain Yes while the replica makes zero forward progress. The most common trigger is a full disk.
Why this happens
When a MySQL replica's disk fills, behaviour depends on which filesystem hits ENOSPC:
- Relay log filesystem: the IO thread reports ER_REPLICA_RELAY_LOG_WRITE_FAILURE and stops. Slave_IO_Running flips to No (this is the case everyone expects).
- InnoDB filesystem (redo log, tablespace extension, undo, the replica's own binlog group commit, applier metadata table, etc.): InnoDB does not fail fast. It logs "Disk is full. Try to clean the disk to free space." exactly once, sets the global os_has_said_disk_full flag, and silently retries every subsequent failed write. The applier thread parks inside ha_commit_trans() in the Waiting for handler commit stage. The IO thread is unaffected and keeps queueing into the relay log.
In the second case (InnoDB filesystem), SHOW REPLICA STATUS shows both threads Yes, empty Last_*_Error, and Seconds_Behind_Source climbing forever 😱. To VTOrc, this is a healthy replica.
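To make the gap concrete, here is a minimal Go sketch of a thread-only health check applied to both disk-full cases. The ReplicaStatus type and the healthyByThreads function are hypothetical names used for illustration only; they are not VTOrc's actual types or logic, and the field values are made up.

```go
package main

import "fmt"

// ReplicaStatus holds the subset of SHOW REPLICA STATUS fields discussed above.
// Hypothetical type for illustration; not VTOrc's actual struct.
type ReplicaStatus struct {
	IOThreadRunning  bool   // Slave_IO_Running == "Yes"
	SQLThreadRunning bool   // Slave_SQL_Running == "Yes"
	LastError        string // any Last_*_Error field, empty if none
	SecondsBehind    int64  // Seconds_Behind_Source
}

// healthyByThreads is a simplified thread-only check: it looks at nothing
// but the two thread states.
func healthyByThreads(s ReplicaStatus) bool {
	return s.IOThreadRunning && s.SQLThreadRunning
}

func main() {
	// Relay log filesystem full: the IO thread stops, so the check catches it.
	relayLogFull := ReplicaStatus{
		IOThreadRunning:  false,
		SQLThreadRunning: true,
		LastError:        "relay log write failure",
	}
	// InnoDB filesystem full: both threads stay up while lag grows.
	innodbFull := ReplicaStatus{
		IOThreadRunning:  true,
		SQLThreadRunning: true,
		SecondsBehind:    7200,
	}

	fmt.Println(healthyByThreads(relayLogFull)) // false
	fmt.Println(healthyByThreads(innodbFull))   // true
}
```

The second replica passes the check even though its applier is wedged, which is the blind spot this proposal targets.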
Proposal: a new ReplicationStalled analysis code
Detection should be stateless, single-poll, and self-healing (it clears automatically the moment the applier resumes). The signal is a comparison of the timestamp of the Disk is full log entry against the applier's last successful commit timestamp:
    SELECT 1
    FROM performance_schema.error_log el
    WHERE el.logged >= NOW(6) - INTERVAL 1 HOUR
      AND el.prio = 'Error'
      AND el.subsystem = 'InnoDB'
      AND el.error_code IN ('MY-012814', 'MY-012820') -- ER_IB_MSG_814 / ER_IB_MSG_820
      AND el.logged > COALESCE(
            (SELECT MAX(LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP)
             FROM performance_schema.replication_applier_status_by_worker),
            '1970-01-01')
    LIMIT 1;
A returned row means that a Disk is full message was logged and the applier has not committed since, so it is still wedged. Every predicate is index-served. This works for both parallel and single-threaded replication (in single-threaded mode the worker table has one synthetic row for the SQL thread, see table_replication_applier_status_by_worker.cc:411-413).
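The same comparison can be sketched as a pure Go function, which shows why the check is stateless and self-healing: the result depends only on the timestamps read in the current poll. The names replicationStalled and check are hypothetical and not part of VTOrc; the sketch assumes the caller has already fetched the Disk is full log timestamps and the latest applier commit timestamp.

```go
package main

import (
	"fmt"
	"time"
)

// replicationStalled reports whether any "Disk is full" entry inside the
// lookback window was logged after the applier's last successful commit.
// A zero lastApplied (no commit seen yet) sorts before every log entry,
// playing the role of the COALESCE fallback in the query.
func replicationStalled(now time.Time, diskFullLogged []time.Time, lastApplied time.Time, window time.Duration) bool {
	cutoff := now.Add(-window)
	for _, logged := range diskFullLogged {
		// Entries older than the window are ignored, mirroring the
		// query's 1-hour bound on el.logged.
		if !logged.Before(cutoff) && logged.After(lastApplied) {
			return true
		}
	}
	return false
}

// check is a helper for the examples below: both arguments are minutes
// before a fixed "now", and the window is one hour.
func check(diskFullMinutesAgo, lastCommitMinutesAgo int) bool {
	now := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	logged := now.Add(-time.Duration(diskFullMinutesAgo) * time.Minute)
	applied := now.Add(-time.Duration(lastCommitMinutesAgo) * time.Minute)
	return replicationStalled(now, []time.Time{logged}, applied, time.Hour)
}

func main() {
	fmt.Println(check(10, 30))  // true: disk full logged after the last commit
	fmt.Println(check(30, 10))  // false: applier committed since, so it cleared
	fmt.Println(check(90, 120)) // false: log entry is outside the 1-hour window
}
```

No state is carried between polls: once the applier commits again, the next evaluation returns false without any explicit reset.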
Recovery
There is not much VTOrc-shaped action available: freeing disk space is an operator concern. A "ballast file" deleted via RPC is overkill 👎. For this initial proposal, surfacing the problem is enough. An operator paged with ReplicationStalled (disk full) is a meaningful improvement over Slave_IO_Running: Yes and silence.
Surfaced during testing of #20015 and #19925. Confirmed against Percona Server 8.4.7-7
Your thoughts are appreciated 🙏
Use Case(s)
Operators of Vitess clusters running on hosts where a disk-full event is possible (i.e., everyone). Today, a replica wedged by ENOSPC sits silently in a healthy-looking state until lag-based alerts fire — but lag-based alerts are noisy and don't pinpoint root cause