Skip to content

Bug Report/RFC: lagging tablet(s) can cause EmergencyReparentShard to fail #18529

@timvaillancourt

Description

@timvaillancourt

Overview of the Issue

Somewhat related to #18528, this issue is to discuss the problem of a single lagging tablet causing ERS to either take an unnecessary amount of time, or to timeout if the lag does not recover in X period of time

The ERS code today attempts to, for EVERY tablet:

  1. Stop Replication and get GTID positions
  2. Wait for relay logs to apply
  3. Pick a most advanced candidate

Let's imagine we have a 4 x tablet shard:

  1. PRIMARY
  2. REPLICA with negligible lag ✅
  3. REPLICA with negligible lag ✅
  4. REPLICA with 180 seconds of IO-thread lag
    • Example scenario: tablet that just finished restore, tablet that "can't keep up", etc

Today using the example shard above, baring unrelated failures, the ERS code will:

  1. Succeed to execute the StopReplicationAndGetStatus RPC on all tablets
  2. The "wait for relay logs" phase will wait for all tablets. Tablet number 4 (with 180 seconds of lag) will take a long time to execute all relay logs
  3. The ERS will potentially/likely fail after exceeding --wait-replicas-timeout (default 15s), due to the single lagging replica being so far behind

Example errors

When a lagging tablet causes the "wait for relay logs" phase to exceed --wait-replicas-timeout, the ERS fails with:

could not apply all relay logs within the provided waitReplicasTimeout (15s): context deadline exceeded

Even if relay logs do apply in time, the lagging tablet can also cause the later reparent journal read to timeout:

could not read reparent journal information within the provided waitReplicasTimeout (15s): context deadline exceeded

Both errors originate from the same root cause — ERS is blocked waiting on a tablet that is too far behind to finish within the timeout

Solution

The key insight is: after StopReplicationAndGetStatus completes, replication is stopped on all tablets. The Combined (relay log) positions are frozen — they cannot advance. This means we already know which tablets are the most advanced and which ones can never catch up, before we even start the "wait for relay logs" phase

Today ERS waits for ALL tablets to apply relay logs, then picks a winner. But a tablet whose Combined position is strictly behind the max can never win — waiting for it is pointless and can only hurt us (by timing out the entire ERS)

Proposed solution:

  1. StopReplicationAndGetStatus runs on all tablets (unchanged) — replication is now stopped everywhere, positions are frozen
  2. Filter candidates to only those at the max Combined (relay log) position — these are the only tablets that could ever "win"
    • Tablets behind this group are excluded from the wait. They'll still apply relaylogs async, we just won't block on them
  3. Within the most-advanced group, sort by Executed position descending (prefer the tablet that has already applied the most relay logs, as it's closest to "done")
  4. Optionally cap how many tablets from that group to wait for (since they all share the same Combined position, they'll converge to the same final state after applying relay logs — waiting for 2-3 is sufficient)
  5. Wait for relay logs to apply on the selected tablets only
  6. Pick a new Primary (unchanged)
  7. Reparent all tablets (unchanged)

Why this is strictly safer than today

This isn't just an optimisation — it actually reduces the failure surface of ERS:

  • Today: ERS waits for ALL tablets. A single hopelessly-lagging tablet that times out during relay log application → ERS fails. Even though that tablet could never have been the winner
  • With this change: non-competitive tablets are excluded from the wait entirely. Fewer tablets in the wait path = fewer things that can fail and block the reparent

The filtering is provably safe because Combined positions are frozen after StopReplicationAndGetStatus — a tablet behind the max will stay behind the max, no matter how long you wait

Your thoughts are appreciated, especially blind-spots in this approach!

Reproduction Steps

  1. Create a shard with many possible candidates for reparent
  2. Introduce long-lived IO-thread lag on a replica that is much-larger than --wait-replicas-timeout (default 15s)
  3. Run an EmergencyReparentShard on the test shard
  4. Notice the ERS times out, or is significantly more likely to timeout

Binary Version

v19+

Operating System and Environment details

Linux

Log Fragments

Metadata

Metadata

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions