Bug Report/RFC: lagging tablet(s) can cause `EmergencyReparentShard` to fail

### Overview of the Issue

Somewhat related to https://github.com/vitessio/vitess/issues/18528, this issue is to discuss the problem of a single lagging tablet causing ERS to either take an unnecessary amount of time, or to timeout if the lag does not recover in X period of time

The ERS code today attempts to, for EVERY tablet:
1. Stop Replication and get GTID positions
2. Wait for relay logs to apply
3. Pick a most advanced candidate

Let's imagine we have a 4 x tablet shard:
1. `PRIMARY` ✅ 
2. `REPLICA` with negligible lag ✅ 
3. `REPLICA` with negligible lag ✅ 
4. `REPLICA` with 180 seconds of IO-thread lag
    - Example scenario: tablet that just finished restore, tablet that "can't keep up", etc

Today using the example shard above, baring unrelated failures, the ERS code will:
1. Succeed to execute the `StopReplicationAndGetStatus` RPC on all tablets
2. The "wait for relay logs" phase will wait for all tablets. Tablet number 4 _(with 180 seconds of lag)_ will take a long time to execute all relay logs
3. The ERS will potentially/likely fail after exceeding `--wait-replicas-timeout` _(default 15s)_, due to the single lagging replica being so far behind

#### Example errors

When a lagging tablet causes the "wait for relay logs" phase to exceed `--wait-replicas-timeout`, the ERS fails with:
```
could not apply all relay logs within the provided waitReplicasTimeout (15s): context deadline exceeded
```

Even if relay logs do apply in time, the lagging tablet can also cause the later reparent journal read to timeout:
```
could not read reparent journal information within the provided waitReplicasTimeout (15s): context deadline exceeded
```

Both errors originate from the same root cause — ERS is blocked waiting on a tablet that is too far behind to finish within the timeout

#### Solution

The key insight is: after `StopReplicationAndGetStatus` completes, replication is stopped on all tablets. The Combined _(relay log)_ positions are frozen — they cannot advance. This means we already know which tablets are the most advanced and which ones can never catch up, before we even start the "wait for relay logs" phase

Today ERS waits for ALL tablets to apply relay logs, then picks a winner. But a tablet whose Combined position is strictly behind the max can never win — waiting for it is pointless and can only hurt us _(by timing out the entire ERS)_

Proposed solution:
1. `StopReplicationAndGetStatus` runs on all tablets _(unchanged)_ — replication is now stopped everywhere, positions are frozen
2. Filter candidates to only those at the **max Combined _(relay log)_ position** — these are the only tablets that could ever "win"
    - Tablets behind this group are excluded from the wait. They'll still apply relaylogs async, we just won't block on them
3. Within the most-advanced group, sort by Executed position descending _(prefer the tablet that has already applied the most relay logs, as it's closest to "done")_
4. Optionally cap how many tablets from that group to wait for _(since they all share the same Combined position, they'll converge to the same final state after applying relay logs — waiting for 2-3 is sufficient)_
5. Wait for relay logs to apply on the selected tablets only
6. Pick a new Primary _(unchanged)_
7. Reparent all tablets _(unchanged)_

#### Why this is strictly safer than today

This isn't just an optimisation — it actually reduces the failure surface of ERS:
- **Today**: ERS waits for ALL tablets. A single hopelessly-lagging tablet that times out during relay log application → ERS fails. Even though that tablet could never have been the winner
- **With this change**: non-competitive tablets are excluded from the wait entirely. Fewer tablets in the wait path = fewer things that can fail and block the reparent

The filtering is provably safe because Combined positions are frozen after `StopReplicationAndGetStatus` — a tablet behind the max will stay behind the max, no matter how long you wait

Your thoughts are appreciated, especially blind-spots in this approach!

### Reproduction Steps

1. Create a shard with many possible candidates for reparent
2. Introduce long-lived IO-thread lag on a replica that is much-larger than `--wait-replicas-timeout` _(default `15s`)_
3. Run an `EmergencyReparentShard` on the test shard
4. Notice the ERS times out, or is significantly more likely to timeout

### Binary Version

```sh
v19+
```

### Operating System and Environment details

```sh
Linux
```

### Log Fragments

```sh

```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug Report/RFC: lagging tablet(s) can cause `EmergencyReparentShard` to fail #18529

Overview of the Issue

Example errors

Solution

Why this is strictly safer than today

Reproduction Steps

Binary Version

Operating System and Environment details

Log Fragments

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Bug Report/RFC: lagging tablet(s) can cause EmergencyReparentShard to fail #18529

Description

Overview of the Issue

Example errors

Solution

Why this is strictly safer than today

Reproduction Steps

Binary Version

Operating System and Environment details

Log Fragments

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Bug Report/RFC: lagging tablet(s) can cause `EmergencyReparentShard` to fail #18529