Overview of the Issue
Problem
On at least v19, and probably later versions, EmergencyReparentShard (which relies on calling the StopReplicationAndGetStatus tabletmanager RPC on all tablets) fails when any tablet in a shard has MySQL down:
rpc error: code = Unknown desc = TabletManager.StopReplicationAndGetStatus on REDACTED-0171543854: before status failed: net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: connection refused (errno 2002) (sqlstate HY000)
The net.Dial error comes from vttablet attempting to connect to a MySQL server that is down on the tablet. In our experience this scenario can happen when:
- MySQL/InnoDB has crashed/coredumped and cannot come back up
  - In our production we intentionally don't let MySQL restart if it crashes
  - There are also cases where MySQL crashes and cannot start back up
- MySQL is stopped manually for whatever reason
Today the ERS code attempts the following for EVERY tablet (no matter what):
- Stop replication and get GTID positions
- Wait for relay logs to apply
- Pick the most advanced candidate
This approach is very careful to understand which tablet has the most relay log changes, so that data loss and errant GTIDs are not created. However, a failure in the StopReplicationAndGetStatus RPC for any tablet halts the ERS at the first step above, which is quite dangerous for availability.
Solution
Before I propose a solution here, I'll start with some known limitations: what I'm about to propose won't work for tablets with remote MySQL. It's much harder to be certain a network dial error means MySQL is "down"
Now, back to the error we receive: net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: connection refused (errno 2002) (sqlstate HY000). On a tablet with local MySQL this is a "pretty-good" signal that MySQL is down, but we could be even more certain by checking the PID (we know this via the pidfile) and perhaps other details
So let's imagine we have a strong signal MySQL is down on a tablet that still responds to tabletmanager RPCs. With the caveat that we can't always be 100% certain, if MySQL is down on the tablet, I feel we can infer (perhaps optionally):
- The MySQL on this tablet is no longer a semi-sync ack'er, because it's down
- The MySQL on this tablet cannot be the most advanced, because it's down (we can't query its positions)
  - There are really odd edge cases where this might not be true, but I'd argue it should cover most regular Vitess users
So TL;DR: let's give the EmergencyReparentShard logic the context of which tablets have MySQL down or up, so we don't wait on the down ones fruitlessly and fail the reparent. We'll assume a down mysqld means the tablet cannot be the most advanced.
This could be implemented in a few ways:
- An in-vttablet "MySQL Monitor" + modify the RPC response for StopReplicationAndGetStatus to include this state
  - Today we get a nil response because the RPC errored, so we'd need to adjust that to still return a response, and perhaps include the error in the response
- The ERS code has a function that reliably matches mysqld-down errors, eg: IsMySQLDown(err error) bool
  - This could become a game of whack-a-mole, because we're matching on error strings (I believe)
  - This logic would need to match on dial errors (mysqld is down) and potentially ignore "timeout" errors, because this doesn't mean mysqld is really down
This definitely relies on VTOrc existing in a cluster, to fix tablets with the wrong primary if anything fails here
Your thoughts are appreciated!
Reproduction Steps
- Set up a shard with many tablets
- kill -9 $(pidof mysqld) on one tablet, but ensure the host/pod + vttablet remains alive
- Run an EmergencyReparentShard on the given shard
- Notice the ERS fails on the mysql-down error for a single tablet
Binary Version
v19, probably versions above v19
Operating System and Environment details
Log Fragments
rpc error: code = Unknown desc = TabletManager.StopReplicationAndGetStatus on REDACTED-0171543854: before status failed: net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: connection refused (errno 2002) (sqlstate HY000)