Overview of the Issue
Somewhat related to #18528, this issue is to discuss the problem of a single lagging tablet causing ERS to either take an unnecessary amount of time, or to timeout if the lag does not recover in X period of time
The ERS code today attempts to, for EVERY tablet:
- Stop Replication and get GTID positions
- Wait for relay logs to apply
- Pick a most advanced candidate
Let's imagine we have a 4 x tablet shard:
PRIMARY ✅
REPLICA with negligible lag ✅
REPLICA with negligible lag ✅
REPLICA with 180 seconds of IO-thread lag
- Example scenario: tablet that just finished restore, tablet that "can't keep up", etc
Today using the example shard above, baring unrelated failures, the ERS code will:
- Succeed to execute the
StopReplicationAndGetStatus RPC on all tablets
- The "wait for relay logs" phase will wait for all tablets. Tablet number 4 (with 180 seconds of lag) will take a long time to execute all relay logs
- The ERS will potentially/likely fail after exceeding
--wait-replicas-timeout (default 15s), due to the single lagging replica being so far behind
Example errors
When a lagging tablet causes the "wait for relay logs" phase to exceed --wait-replicas-timeout, the ERS fails with:
could not apply all relay logs within the provided waitReplicasTimeout (15s): context deadline exceeded
Even if relay logs do apply in time, the lagging tablet can also cause the later reparent journal read to timeout:
could not read reparent journal information within the provided waitReplicasTimeout (15s): context deadline exceeded
Both errors originate from the same root cause — ERS is blocked waiting on a tablet that is too far behind to finish within the timeout
Solution
The key insight is: after StopReplicationAndGetStatus completes, replication is stopped on all tablets. The Combined (relay log) positions are frozen — they cannot advance. This means we already know which tablets are the most advanced and which ones can never catch up, before we even start the "wait for relay logs" phase
Today ERS waits for ALL tablets to apply relay logs, then picks a winner. But a tablet whose Combined position is strictly behind the max can never win — waiting for it is pointless and can only hurt us (by timing out the entire ERS)
Proposed solution:
StopReplicationAndGetStatus runs on all tablets (unchanged) — replication is now stopped everywhere, positions are frozen
- Filter candidates to only those at the max Combined (relay log) position — these are the only tablets that could ever "win"
- Tablets behind this group are excluded from the wait. They'll still apply relaylogs async, we just won't block on them
- Within the most-advanced group, sort by Executed position descending (prefer the tablet that has already applied the most relay logs, as it's closest to "done")
- Optionally cap how many tablets from that group to wait for (since they all share the same Combined position, they'll converge to the same final state after applying relay logs — waiting for 2-3 is sufficient)
- Wait for relay logs to apply on the selected tablets only
- Pick a new Primary (unchanged)
- Reparent all tablets (unchanged)
Why this is strictly safer than today
This isn't just an optimisation — it actually reduces the failure surface of ERS:
- Today: ERS waits for ALL tablets. A single hopelessly-lagging tablet that times out during relay log application → ERS fails. Even though that tablet could never have been the winner
- With this change: non-competitive tablets are excluded from the wait entirely. Fewer tablets in the wait path = fewer things that can fail and block the reparent
The filtering is provably safe because Combined positions are frozen after StopReplicationAndGetStatus — a tablet behind the max will stay behind the max, no matter how long you wait
Your thoughts are appreciated, especially blind-spots in this approach!
Reproduction Steps
- Create a shard with many possible candidates for reparent
- Introduce long-lived IO-thread lag on a replica that is much-larger than
--wait-replicas-timeout (default 15s)
- Run an
EmergencyReparentShard on the test shard
- Notice the ERS times out, or is significantly more likely to timeout
Binary Version
Operating System and Environment details
Log Fragments
Overview of the Issue
Somewhat related to #18528, this issue is to discuss the problem of a single lagging tablet causing ERS to either take an unnecessary amount of time, or to timeout if the lag does not recover in X period of time
The ERS code today attempts to, for EVERY tablet:
Let's imagine we have a 4 x tablet shard:
PRIMARY✅REPLICAwith negligible lag ✅REPLICAwith negligible lag ✅REPLICAwith 180 seconds of IO-thread lagToday using the example shard above, baring unrelated failures, the ERS code will:
StopReplicationAndGetStatusRPC on all tablets--wait-replicas-timeout(default 15s), due to the single lagging replica being so far behindExample errors
When a lagging tablet causes the "wait for relay logs" phase to exceed
--wait-replicas-timeout, the ERS fails with:Even if relay logs do apply in time, the lagging tablet can also cause the later reparent journal read to timeout:
Both errors originate from the same root cause — ERS is blocked waiting on a tablet that is too far behind to finish within the timeout
Solution
The key insight is: after
StopReplicationAndGetStatuscompletes, replication is stopped on all tablets. The Combined (relay log) positions are frozen — they cannot advance. This means we already know which tablets are the most advanced and which ones can never catch up, before we even start the "wait for relay logs" phaseToday ERS waits for ALL tablets to apply relay logs, then picks a winner. But a tablet whose Combined position is strictly behind the max can never win — waiting for it is pointless and can only hurt us (by timing out the entire ERS)
Proposed solution:
StopReplicationAndGetStatusruns on all tablets (unchanged) — replication is now stopped everywhere, positions are frozenWhy this is strictly safer than today
This isn't just an optimisation — it actually reduces the failure surface of ERS:
The filtering is provably safe because Combined positions are frozen after
StopReplicationAndGetStatus— a tablet behind the max will stay behind the max, no matter how long you waitYour thoughts are appreciated, especially blind-spots in this approach!
Reproduction Steps
--wait-replicas-timeout(default15s)EmergencyReparentShardon the test shardBinary Version
Operating System and Environment details
Log Fragments