Overview of the Issue
During a planned reparent, it is possible that PromoteSlaveWhenCaughtUp succeeds without an error but the outer context has been canceled. In that situation, we still call UndoDemoteMaster with a background context.
By design, the old master discovers that it is no longer master and changes its type to REPLICA. However, the call to UndoDemoteMaster is unnecessary insufficient in this situation (as explained further below).
When the outer context is canceled at a point where PromoteSlaveWhenCaughtUp has completed, other replicas are still pointing to the old master, so we can end up with a situation like this:
New Master <---- Old Master <---- Other replicas
To prevent this, we should consider two additional safeguards:
- add logic in
PromoteSlaveWhenCaughtUp to revert any partial changes in case of errors.
- create an
UndoPromoteSlave that is called along with UndoDemoteMaster.
Code snippet for reference:
https://github.com/vitessio/vitess/blame/master/go/vt/wrangler/reparent.go#L605
rp, err = wr.tmc.PromoteSlaveWhenCaughtUp(promoteCtx, masterElectTabletInfo.Tablet, rp)
if err != nil || (ctx.Err() != nil && ctx.Err() == context.DeadlineExceeded) {
// If we fail to promote the new master, try to roll back to the
// original master before aborting.
// It is possible that we have used up the original context, or that
// not enough time is left on it before it times out.
// But at this point we really need to be able to Undo so as not to
// leave the cluster in a bad state.
// So we create a fresh context based on context.Background().
undoCtx, undoCancel := context.WithTimeout(context.Background(), *topo.RemoteOperationTimeout)
defer undoCancel()
if err1 := wr.tmc.UndoDemoteMaster(undoCtx, oldMasterTabletInfo.Tablet); err1 != nil {
log.Warningf("Encountered error %v while trying to undo DemoteMaster", err1)
}
return fmt.Errorf("master-elect tablet %v failed to catch up with replication or be upgraded to master: %v", masterElectTabletAliasStr, err)
}
Overview of the Issue
During a planned reparent, it is possible that
PromoteSlaveWhenCaughtUpsucceeds without an error but the outer context has been canceled. In that situation, we still callUndoDemoteMasterwith a background context.By design, the old master discovers that it is no longer master and changes its type to REPLICA. However, the call to UndoDemoteMaster is
unnecessaryinsufficient in this situation (as explained further below).When the outer context is canceled at a point where
PromoteSlaveWhenCaughtUphas completed, other replicas are still pointing to the old master, so we can end up with a situation like this:To prevent this, we should consider two additional safeguards:
PromoteSlaveWhenCaughtUpto revert any partial changes in case of errors.UndoPromoteSlavethat is called along withUndoDemoteMaster.Code snippet for reference:
https://github.com/vitessio/vitess/blame/master/go/vt/wrangler/reparent.go#L605