Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
721e63e
vttablet: handle applier metadata init failures in relay-log recovery
mhamza15 Mar 4, 2026
2357fb8
tabletmanager: use sqlerror for replication init failures
mhamza15 Mar 11, 2026
f5209b4
tabletmanager: clarify replication init recovery helper
mhamza15 Mar 11, 2026
871bdfd
no abbreviation
mhamza15 Mar 11, 2026
554e615
change helper name
mhamza15 Mar 11, 2026
f213844
Add missing PRS relay error test cases
mhamza15 Mar 11, 2026
b151ec2
rerun ci
mhamza15 Mar 13, 2026
86eafef
handleRecoverableReplicationInitError
mhamza15 Mar 16, 2026
2691e6e
add further handling
mhamza15 Mar 16, 2026
ce99c0f
Document reused MySQL replication init errnos
mhamza15 Mar 18, 2026
7f335e2
Add mysqlctl-wrapped replication init error test
mhamza15 Mar 18, 2026
9ee9980
sqlerror: parse native MySQL error strings
mhamza15 Mar 18, 2026
6e33a9a
Update sql_error.go
mhamza15 Mar 18, 2026
875517d
Clarify MySQL 1871/1872 reassignment
mhamza15 Mar 18, 2026
7f5c352
tabletmanager: recover startup replication init errors
mhamza15 Mar 19, 2026
9742cf1
Use `t.Context()` in relay recovery tests
mhamza15 Mar 19, 2026
2124145
Merge branch 'main' into mhamza15/vtorc-handle-applier-metadata-failure
mhamza15 Mar 31, 2026
d23a352
tabletmanager: make `SetReplicationSource` recover safely
mhamza15 Apr 1, 2026
0a9d9fb
sqlerror: add aliases for versioned replication errnos
mhamza15 Apr 1, 2026
2350f61
wrangler: clarify relay-log recovery tests
mhamza15 Apr 1, 2026
dac71ba
Update constants.go
mhamza15 Apr 1, 2026
22ff699
Update rpc_replication.go
mhamza15 Apr 1, 2026
3ee8dc7
tabletmanager: use `startReplicationRecoverable` in reparent restart …
mhamza15 Apr 1, 2026
4762624
tabletmanager: add `stopReplicationRecoverable`
mhamza15 Apr 1, 2026
b925a4b
tabletmanager: recover `SetReplicationSource` with `RESET REPLICA ALL`
mhamza15 Apr 1, 2026
a34cbbd
tabletmanager: stop before `RESET REPLICA ALL` recovery
mhamza15 Apr 1, 2026
1f8320a
sqlerror: fix `gofumpt` formatting
mhamza15 Apr 1, 2026
37ce03c
wrangler: update PRS relay-log stop error expectations
mhamza15 Apr 1, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 12 additions & 1 deletion go/mysql/sqlerror/constants.go
Original file line number Diff line number Diff line change
Expand Up @@ -105,6 +105,7 @@ const (
ERDupUnique = ErrorCode(1169)
ERRequiresPrimaryKey = ErrorCode(1173)
ERCantDoThisDuringAnTransaction = ErrorCode(1179)
ERMasterInfo = ErrorCode(1201)
ERReadOnlyTransaction = ErrorCode(1207)
ERCannotAddForeign = ErrorCode(1215)
ERNoReferencedRow = ErrorCode(1216)
Expand All @@ -128,7 +129,17 @@ const (
ERSourceHasPurgedRequiredGtids = ErrorCode(1789)
ERInnodbIndexCorrupt = ErrorCode(1817)
ERDupIndex = ErrorCode(1831)
ERInnodbReadOnly = ErrorCode(1874)

// MySQL used 1871/1872 for master-info and relay-log-info initialization
// errors through 8.0.32, and reassigned those numbers in 8.0.33 to
// connection-metadata and applier-metadata initialization errors. These
// errnos therefore map to different metadata types depending on version.
ERReplicaMasterInfoInitRepository = ErrorCode(1871)
ERReplicaRelayLogInfoInitRepository = ErrorCode(1872)
ERReplicaConnectionMetadataInitRepository = ErrorCode(1871)
ERReplicaApplierMetadataInitRepository = ErrorCode(1872)
Comment thread
mhamza15 marked this conversation as resolved.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 1873, but maybe we should use 1871 and 1872?


ERInnodbReadOnly = ErrorCode(1874)

ERVectorConversion = ErrorCode(6138)

Expand Down
22 changes: 20 additions & 2 deletions go/mysql/sqlerror/sql_error.go
Original file line number Diff line number Diff line change
Expand Up @@ -154,10 +154,23 @@ func (se *SQLError) VtRpcErrorCode() vtrpcpb.Code {
}
}

var errExtract = regexp.MustCompile(`\(errno ([0-9]*)\) \(sqlstate ([0-9a-zA-Z]{5})\)`)
var (
// errExtract matches Vitess-wrapped SQL errors ending with `(errno <num>) (sqlstate <state>)`.
errExtract = regexp.MustCompile(`\(errno ([0-9]*)\) \(sqlstate ([0-9a-zA-Z]{5})\)`)

// nativeErrExtract matches native MySQL server errors using `ERROR <num> (<state>): <message>`.
nativeErrExtract = regexp.MustCompile(`ERROR ([0-9]*) \(([0-9a-zA-Z]{5})\):`)
)

// NewSQLErrorFromError returns a *SQLError from the provided error.
// If it's not the right type, it still tries to get it from a regexp.
//
// - If err already is a *SQLError, it returns err unchanged.
// - If err is a Vitess error with a mapped MySQL code, it returns the converted SQLError.
// - If err contains a Vitess-wrapped `(errno <num>) (sqlstate <state>)` suffix, it extracts that code and state.
// - If err contains a native MySQL `ERROR <num> (<state>): <message>` string, it extracts that code and state.
// - Otherwise, it maps the Vitess error code to a MySQL error code when one is defined and returns
// a generic SQLError with that code and err.Error() as the message.
//
// Notes about the `error` return type:
// The function really returns *SQLError or `nil`. Seemingly, the function could just return
// `*SQLError` type. However, it really must return `error`. The reason is the way `golang`
Expand Down Expand Up @@ -189,6 +202,11 @@ func NewSQLErrorFromError(err error) error {
return extractSQLErrorFromMessage(match, msg)
}

match = nativeErrExtract.FindStringSubmatch(msg)
if len(match) >= 2 {
return extractSQLErrorFromMessage(match, msg)
}

return mapToSQLErrorFromErrorCode(err, msg)
}

Expand Down
10 changes: 10 additions & 0 deletions go/mysql/sqlerror/sql_error_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -155,6 +155,16 @@ func TestNewSQLErrorFromError(t *testing.T) {
num: ERNoDb,
ss: SSNoDB,
},
{
err: errors.New("ERROR 1201 (HY000): Could not initialize master info structure; more error messages can be found in the MySQL error log"),
num: ERMasterInfo,
ss: SSUnknownSQLState,
},
{
err: errors.New("ERROR 1872 (HY000): Replica failed to initialize applier metadata structure from the repository"),
num: ERReplicaApplierMetadataInitRepository,
ss: SSUnknownSQLState,
},
{
err: errors.New("just some random text here"),
num: ERUnknownError,
Expand Down
7 changes: 7 additions & 0 deletions go/vt/mysqlctl/fakemysqldaemon.go
Original file line number Diff line number Diff line change
Expand Up @@ -148,6 +148,9 @@ type FakeMysqlDaemon struct {
// SetReplicationSourceError is used by SetReplicationSource.
SetReplicationSourceError error

// SetReplicationSourceFunc overrides SetReplicationSource when it is set.
SetReplicationSourceFunc func(ctx context.Context, host string, port int32, heartbeatInterval float64, stopReplicationBefore bool, startReplicationAfter bool) error

// StopReplicationError error is used by StopReplication.
StopReplicationError error

Expand Down Expand Up @@ -549,6 +552,10 @@ func (fmd *FakeMysqlDaemon) SetReplicationPosition(ctx context.Context, pos repl

// SetReplicationSource is part of the MysqlDaemon interface.
func (fmd *FakeMysqlDaemon) SetReplicationSource(ctx context.Context, host string, port int32, heartbeatInterval float64, stopReplicationBefore bool, startReplicationAfter bool) error {
if fmd.SetReplicationSourceFunc != nil {
return fmd.SetReplicationSourceFunc(ctx, host, port, heartbeatInterval, stopReplicationBefore, startReplicationAfter)
}

input := fmt.Sprintf("%v:%v", host, port)
found := false
for _, sourceInput := range fmd.SetReplicationSourceInputs {
Expand Down
2 changes: 1 addition & 1 deletion go/vt/vttablet/tabletmanager/restore.go
Original file line number Diff line number Diff line change
Expand Up @@ -352,7 +352,7 @@ func (tm *TabletManager) disableReplication(ctx context.Context) error {
return vterrors.Wrap(err, "failed to reset replication")
}

if err := tm.MysqlDaemon.SetReplicationSource(ctx, "//", 0, 0, false, true); err != nil {
if err := tm.setReplicationSourceRecoverable(ctx, "//", 0, 0, false, true); err != nil {
return vterrors.Wrap(err, "failed to disable replication")
}

Expand Down
146 changes: 118 additions & 28 deletions go/vt/vttablet/tabletmanager/rpc_replication.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,8 +19,8 @@ package tabletmanager
import (
"context"
"fmt"
"log/slog"
"runtime"
"strings"
"time"

"vitess.io/vitess/go/mysql"
Expand Down Expand Up @@ -306,7 +306,7 @@ func (tm *TabletManager) StartReplication(ctx context.Context, semiSync bool) er
if err := tm.fixSemiSync(ctx, tm.Tablet().Type, semiSyncAction); err != nil {
return err
}
return tm.MysqlDaemon.StartReplication(ctx, tm.hookExtraEnv())
return tm.startReplicationRecoverable(ctx)
}

// RestartReplication will stop replication and then start it again
Expand Down Expand Up @@ -335,7 +335,7 @@ func (tm *TabletManager) RestartReplication(ctx context.Context, semiSync bool)
}

// Start replication
return tm.MysqlDaemon.StartReplication(ctx, tm.hookExtraEnv())
return tm.startReplicationRecoverable(ctx)
}

// StartReplicationUntilAfter will start the replication and let it catch up
Expand Down Expand Up @@ -524,7 +524,8 @@ func (tm *TabletManager) InitReplica(ctx context.Context, parent *topodatapb.Tab
if err := tm.MysqlDaemon.SetReplicationPosition(ctx, pos); err != nil {
return err
}
if err := tm.MysqlDaemon.SetReplicationSource(ctx, ti.MysqlHostname, ti.MysqlPort, 0, false, true); err != nil {

if err := tm.setReplicationSourceRecoverable(ctx, ti.MysqlHostname, ti.MysqlPort, 0, false, true); err != nil {
return err
}

Expand Down Expand Up @@ -949,23 +950,19 @@ func (tm *TabletManager) setReplicationSourceLocked(ctx context.Context, parentA
}
if status.SourceHost != host || status.SourcePort != port || heartbeatInterval != 0 {
// This handles both changing the address and starting replication.
if err := tm.MysqlDaemon.SetReplicationSource(ctx, host, port, heartbeatInterval, wasReplicating, shouldbeReplicating); err != nil {
if err := tm.handleRelayLogError(ctx, err); err != nil {
return err
}
if err := tm.setReplicationSourceRecoverable(ctx, host, port, heartbeatInterval, wasReplicating, shouldbeReplicating); err != nil {
return err
}
} else if shouldbeReplicating {
// The address is correct. We need to restart replication so that any semi-sync changes if any
// are taken into account
// are taken into account. We don't attempt to recover from the known recoverable errors here
// because recovery requires running `STOP REPLICA` in order to reset the replication metadata.
// If we error the first time, we're likely to error the second time as well.
if err := tm.MysqlDaemon.StopReplication(ctx, tm.hookExtraEnv()); err != nil {
if err := tm.handleRelayLogError(ctx, err); err != nil {
return err
}
return err
}
Comment on lines 961 to 963

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Recover metadata-init errors when STOP REPLICA fails

In the status.SourceHost == host && status.SourcePort == port && shouldbeReplicating path, StopReplication errors are now returned directly, so recoverable metadata-init failures (1201/1871/1872) from STOP REPLICA no longer trigger the reset-and-restart self-heal. This branch is exercised when the source is already correct (e.g. no-op/planned reparent and force-start flows), so a transient/corrupt replication metadata condition that previously recovered can now abort reparent/replication setup instead of repairing itself.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, see #19560 (comment)

if err := tm.MysqlDaemon.StartReplication(ctx, tm.hookExtraEnv()); err != nil {
if err := tm.handleRelayLogError(ctx, err); err != nil {
return err
}
if err := tm.startReplicationRecoverable(ctx); err != nil {
return err
}
}

Expand Down Expand Up @@ -1231,25 +1228,118 @@ func (tm *TabletManager) fixSemiSyncAndReplication(ctx context.Context, tabletTy
if err := tm.MysqlDaemon.StopReplication(ctx, tm.hookExtraEnv()); err != nil {
return vterrors.Wrap(err, "failed to StopReplication")
}
if err := tm.MysqlDaemon.StartReplication(ctx, tm.hookExtraEnv()); err != nil {
if err := tm.startReplicationRecoverable(ctx); err != nil {
return vterrors.Wrap(err, "failed to StartReplication")
}
return nil
Comment thread
mattlord marked this conversation as resolved.
}

// handleRelayLogError resets replication of the instance.
// This is required because sometimes MySQL gets stuck due to improper initialization of
// master info structure or related failures and throws errors like
// ERROR 1201 (HY000): Could not initialize master info structure; more error messages can be found in the MySQL error log
// These errors can only be resolved by resetting the replication, otherwise START REPLICA fails.
func (tm *TabletManager) handleRelayLogError(ctx context.Context, err error) error {
// attempt to fix this error:
// Replica failed to initialize relay log info structure from the repository (errno 1872) (sqlstate HY000) during query: START REPLICA
// startReplicationRecoverable starts replication and handles recoverable errors by resetting replication.
func (tm *TabletManager) startReplicationRecoverable(ctx context.Context) error {
err := tm.MysqlDaemon.StartReplication(ctx, tm.hookExtraEnv())
if err == nil {
return nil
}

// Try to recover from the error.
if err := tm.handleRecoverableReplicationInitError(ctx, err); err != nil {
return err
}

return nil
}

// setReplicationSourceRecoverable configures the requested replication source and optionally starts
// replication afterward. When possible, certain errors are recovered by reinitializing replication
// metadata.
func (tm *TabletManager) setReplicationSourceRecoverable(ctx context.Context, host string, port int32, heartbeatInterval float64, wasReplicating bool, shouldStartReplication bool) error {
// Let's first try to apply the requested source without starting replication afterwards. If the
// replica was replicating before, we stop replication first.
err := tm.MysqlDaemon.SetReplicationSource(ctx, host, port, heartbeatInterval, wasReplicating, false)
if err == nil {
// We succeeded, let's start replication but only if it was requested.
if !shouldStartReplication {
return nil
}

return tm.startReplicationRecoverable(ctx)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid running postflight hook in SetReplicationSource path

When shouldStartReplication is true, this helper now starts replication via startReplicationRecoverable, which calls MysqlDaemon.StartReplication and therefore runs the postflight_start_slave hook. Previously, SetReplicationSource(..., startReplicationAfter=true) performed CHANGE REPLICATION SOURCE + START REPLICA without invoking that hook, so this change can make reparent/init flows fail in environments where the postflight hook is present and returns an error, even though source reconfiguration and SQL start succeeded. Please preserve the previous hook behavior for source-change starts (or make hook execution explicit/opt-in for this path).

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is certainly a behavior change, but I'm not sure it's a wrong change. I'd expect the hook to run in this case, but it may cause some unexpected behavior.

}

// We hit an error. If the error is not one of the recoverable ones, we can't recover and should return it.
if !isRecoverableReplicationInitializationError(err) {
return err
}

log.Warn(
"Encountered recoverable replication initialization error while changing replication source, resetting "+
"replication parameters and reapplying source",
slog.String("source_host", host),
slog.Int("source_port", int(port)),
slog.Any("error", err),
)

// If the replica was running when the source-change attempt failed, stop it
// explicitly before resetting replication metadata.
if wasReplicating {
if err := tm.MysqlDaemon.StopReplication(ctx, tm.hookExtraEnv()); err != nil {
return err
}
}

// Recover from the error by reinitializing replication metadata through
// `RESET REPLICA ALL`.
if err := tm.MysqlDaemon.ResetReplicationParameters(ctx); err != nil {
return err
}

// Now that we've reinitialized the replication metadata, try setting the source again.
if err := tm.MysqlDaemon.SetReplicationSource(ctx, host, port, heartbeatInterval, false, false); err != nil {
return err
}

// The replication source has finally been set. Let's also start replication if it was requested.
if shouldStartReplication {
return tm.startReplicationRecoverable(ctx)
}

return nil
}

// recoverableReplicationInitializationErrorCodes is the set of replication initialization error
// codes that can be recovered from by reinitializing replication metadata.
// MySQL used 1871/1872 for master-info and relay-log-info initialization errors
// through 8.0.32, and reassigned those numbers in 8.0.33 to connection-metadata
// and applier-metadata initialization errors.
var recoverableReplicationInitializationErrorCodes = map[sqlerror.ErrorCode]struct{}{
sqlerror.ERMasterInfo: {},
sqlerror.ERReplicaConnectionMetadataInitRepository: {},
sqlerror.ERReplicaApplierMetadataInitRepository: {},
Comment thread
mhamza15 marked this conversation as resolved.
}

// isRecoverableReplicationInitializationError reports whether an error can be recovered from by
// reinitializing replication metadata.
func isRecoverableReplicationInitializationError(err error) bool {
sqlErr, ok := sqlerror.NewSQLErrorFromError(err).(*sqlerror.SQLError)
Comment thread
mattlord marked this conversation as resolved.
if !ok || sqlErr == nil {
return false
}

Comment thread
mhamza15 marked this conversation as resolved.
_, ok = recoverableReplicationInitializationErrorCodes[sqlErr.Number()]
return ok
}

// handleRecoverableReplicationInitError repairs recoverable replication initialization
// failures by restarting replication.
func (tm *TabletManager) handleRecoverableReplicationInitError(ctx context.Context, err error) error {
// Attempt to self-heal by restarting replication when initialization fails.
// see https://bugs.mysql.com/bug.php?id=83713 or https://github.com/vitessio/vitess/issues/5067
// The same fix also works for https://github.com/vitessio/vitess/issues/10955.
if strings.Contains(err.Error(), "Replica failed to initialize relay log info structure from the repository") ||
strings.Contains(err.Error(), "Could not initialize master info structure") {
// Stop, reset and start replication again to resolve this error
if isRecoverableReplicationInitializationError(err) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we don't have access to the MySQL error codes here? It feels quite brittle to check against the error message strings - but if that's the only thing we can do here it's fine.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was thinking the same thing, but it gets flattened earlier: https://github.com/vitessio/vitess/blob/main/go/vt/mysqlctl/query.go?plain=1#L84. It's definitely doable and preferable, but I think it'd require a bit of a refactor that I'll leave as a follow-up.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 that sqlerror.NewSQLErrorFromError sucks, but it's in other code 🤷

I just wanted to point out many RPCs map sqlerrors -> vterrors.Code. For example, we return vtrpcpb.Code_UNAVAILABLE if the error is of a class that probably means unavailable

Comment thread
mattlord marked this conversation as resolved.
Comment thread
mhamza15 marked this conversation as resolved.
log.Warn(
"Encountered recoverable replication initialization error, restarting replication",
slog.Any("error", err),
)

if err := tm.MysqlDaemon.RestartReplication(ctx, tm.hookExtraEnv()); err != nil {
return err
}
Expand Down
Loading
Loading