Skip to content

PlannedReparent improvements #5991

@deepthi

Description

@deepthi

Overview of the Issue

During a planned reparent, it is possible that PromoteSlaveWhenCaughtUp succeeds without an error but the outer context has been canceled. In that situation, we still call UndoDemoteMaster with a background context.
By design, the old master discovers that it is no longer master and changes its type to REPLICA. However, the call to UndoDemoteMaster is unnecessary insufficient in this situation (as explained further below).
When the outer context is canceled at a point where PromoteSlaveWhenCaughtUp has completed, other replicas are still pointing to the old master, so we can end up with a situation like this:

               New Master <---- Old Master <---- Other replicas

To prevent this, we should consider two additional safeguards:

  • add logic in PromoteSlaveWhenCaughtUp to revert any partial changes in case of errors.
  • create an UndoPromoteSlave that is called along with UndoDemoteMaster.

Code snippet for reference:
https://github.com/vitessio/vitess/blame/master/go/vt/wrangler/reparent.go#L605

		rp, err = wr.tmc.PromoteSlaveWhenCaughtUp(promoteCtx, masterElectTabletInfo.Tablet, rp)
		if err != nil || (ctx.Err() != nil && ctx.Err() == context.DeadlineExceeded) {
			// If we fail to promote the new master, try to roll back to the
			// original master before aborting.
			// It is possible that we have used up the original context, or that
			// not enough time is left on it before it times out.
			// But at this point we really need to be able to Undo so as not to
			// leave the cluster in a bad state.
			// So we create a fresh context based on context.Background().
			undoCtx, undoCancel := context.WithTimeout(context.Background(), *topo.RemoteOperationTimeout)
			defer undoCancel()
			if err1 := wr.tmc.UndoDemoteMaster(undoCtx, oldMasterTabletInfo.Tablet); err1 != nil {
				log.Warningf("Encountered error %v while trying to undo DemoteMaster", err1)
			}
			return fmt.Errorf("master-elect tablet %v failed to catch up with replication or be upgraded to master: %v", masterElectTabletAliasStr, err)
		}

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions