Bug Report/RFC: `EmergencyReparentShard` fails when `mysqld` is down on any tablet in a shard

### Overview of the Issue

#### Problem

On at least v19, and probably later versions, `EmergencyReparentShard` _(which relies on calling the `StopReplicationAndGetStatus` tabletmanager RPC on all tablets)_ fails when any tablet in a shard has MySQL down:
```bash
rpc error: code = Unknown desc = TabletManager.StopReplicationAndGetStatus on REDACTED-0171543854: before status failed: net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: connection refused (errno 2002) (sqlstate HY000)
```

The `net.Dial` error is `vttablet` attempting to connect to a downed MySQL server on the tablet. In our experience this scenario can happen when:
1. MySQL/InnoDB has crashed/coredumped and cannot come back up
    - In our production environment we intentionally don't let MySQL restart if it crashes
    - There are also cases where MySQL can crash and cannot start back up
2. MySQL is stopped manually for whatever reason

The ERS code today attempts to, for EVERY tablet _(no matter what)_:
1. Stop Replication and get GTID positions
2. Wait for relay logs to apply
3. Pick the most advanced candidate

This approach carefully establishes which tablet has the most relay log changes, so that the reparent does not cause data loss or errant GTIDs. However, a failure of the `StopReplicationAndGetStatus` RPC on any single tablet halts the ERS at step 1 above, which is quite dangerous for availability.

#### Solution

Before I propose a solution here, I'll start with a known limitation: what I'm about to propose won't work for tablets with remote MySQL. There, it's much harder to be certain that a network dial error means MySQL is "down".

Now, back to the error we receive: `net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: connection refused (errno 2002) (sqlstate HY000)`. On a tablet with local MySQL this is a "pretty-good" signal that MySQL is down, but we could be even more certain by checking the PID _(which we know via the pidfile)_ and perhaps other details.

So let's imagine we have a strong signal MySQL is down on a tablet that still responds to tabletmanager RPCs. With the caveat that we can't always be 100% certain, if MySQL is down on the tablet, I feel we can infer _(perhaps optionally)_:
1. The MySQL on this tablet is no longer a semi-sync ack'er, because it's down
2. The MySQL on this tablet cannot be the most advanced, because it's down _(can't query its positions)_
    - There are really odd edge cases where this might not be true, but I'd argue it should cover most regular Vitess users

So TL;DR: let's give the `EmergencyReparentShard` logic the context of which tablets have MySQL up or down, so we don't wait for the down ones fruitlessly and fail the reparent. We'll assume a down `mysqld` means the tablet cannot be the most advanced.

This could be implemented in a few ways:
1. An in-`vttablet` "MySQL Monitor" + modify the RPC response for `StopReplicationAndGetStatus` to include this state
    - Today we get a `nil` response because the RPC errored, so we'd need to adjust that to still return a response, and perhaps include the error in the response
2. The ERS code has a function that reliably matches `mysqld`-down errors, e.g. `IsMySQLDown(err error) bool`
    - This could become a game of whack-a-mole, because we're matching on error strings (I believe)
    - This logic would need to match on dial errors _(`mysqld` is down)_ and potentially ignore "timeout" errors, because a timeout doesn't mean `mysqld` is really down

This definitely relies on VTOrc running in the cluster to fix tablets that end up with the wrong primary if anything fails here.

Your thoughts are appreciated!

### Reproduction Steps

1. Set up a shard with many tablets
2. `kill -9 $(pidof mysqld)` on one tablet, but ensure the host/pod + `vttablet` remains alive
3. Run an `EmergencyReparentShard` on the given shard
4. Notice the ERS fails on the mysql-down error for a single tablet

### Binary Version

```sh
v19, probably versions above v19
```

### Operating System and Environment details

```sh
Linux
```

### Log Fragments

```sh
rpc error: code = Unknown desc = TabletManager.StopReplicationAndGetStatus on REDACTED-0171543854: before status failed: net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: connection refused (errno 2002) (sqlstate HY000)
```

Bug Report/RFC: `EmergencyReparentShard` fails when `mysqld` is down on any tablet in a shard #18528
