Bug Report: race condition in MoveTables ... Complete --rename-tables #20135

@mcrauwel

Overview of the Issue

MoveTables ... Complete --rename-tables has a race window where the reverse
workflow's apply path can hit a renamed source table and permanently error
the reverse stream. The reverse workflow is not stopped or drained before
the source tables are renamed; it is only deleted afterward.

Observed behavior

Observed when running MoveTables ... Complete --rename-tables=true on a
Tables-type workflow (one source keyspace, one target keyspace, table moves)
whose forward workflow is healthy (Frozen) and whose reverse workflow is
still active.

Real-world timeline from a production cutover (single source shard, all
times UTC, all on the same source primary tablet):

Δt          Event
T+0 ms      vtctld logs Renaming table <src_db>.<tbl1> to <src_db>._<tbl1>_old (traffic_switcher.go, removeSourceTables)
T+292 ms    source vttablet schema engine confirms: created [_<tbl1>_old], dropped [<tbl1>]
T+792 ms    reverse workflow stream errors: error applying event: Table '<src_db>.<tbl1>' doesn't exist (errno 1146) (sqlstate 42S02)
T+792 ms    controller.go:317 classifies as unrecoverable and parks the stream in permanent error state

dropSourceReverseVReplicationStreams deletes the reverse stream row from
_vt.vreplication only after the rename completes. By then the in-flight
apply on the now-renamed table has already failed and the controller has
already marked the stream errored, so deleting the row doesn't recover
anything; it just leaves an orphaned reverse-workflow entry that operators
have to clean up manually.

Expected behavior

Complete --rename-tables is documented/intended to atomically finalize
the cutover: tear down the reverse workflow AND rename source tables. From
an operator's perspective there should be no window in which the reverse
workflow can apply to a table that Complete has already renamed.

Root cause

In dropSources (go/vt/vtctl/workflow/server.go, the path that
MoveTablesComplete takes):

  1. validateWorkflowHasCompleted — only reads the forward workflow
    on the targets and checks that its streams are Frozen
    (go/vt/vtctl/workflow/utils.go, the ReadVReplicationWorkflow call
    uses ts.WorkflowName(), which is the forward name). The reverse
    workflow's state is never inspected.
  2. removeSourceTables(ctx, removalType) — issues RENAME TABLE <src_db>.<tbl> TO <src_db>._<tbl>_old on each source primary
    (go/vt/vtctl/workflow/traffic_switcher.go, in removeSourceTables).
    The reverse workflow is still running on the source primary at this
    point.
  3. dropArtifacts → dropSourceReverseVReplicationStreams: only now
    does it DELETE FROM _vt.vreplication for the reverse streams.

Between steps 2 and 3 the reverse vreplicator is still:

  • subscribed to the target keyspace's binlog stream,
  • holding events in its in-memory apply pipeline,
  • writing applied events back to the source DB.

Any DML for a just-renamed table — whether it arrived during the window or
was already buffered at the moment of rename — fails with 1146. The
controller (go/vt/vttablet/tabletmanager/vreplication/controller.go,
around line 317) treats 1146 as unrecoverable and the stream stays in
error state forever, even though the row gets deleted milliseconds later.
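
To make the window concrete, here is a toy model of steps 2 and 3 in the order described above. It does not use any Vitess code: sourceDB, reverseStream, and completeCurrentOrder are hypothetical names invented for illustration, and the "unrecoverable" handling is simplified to setting a state string.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSuchTable stands in for MySQL errno 1146 (ER_NO_SUCH_TABLE).
var errNoSuchTable = errors.New("table doesn't exist (errno 1146)")

// sourceDB is a toy stand-in for the source primary: a set of table names.
type sourceDB map[string]bool

// rename models RENAME TABLE <tbl> TO _<tbl>_old.
func (db sourceDB) rename(tbl string) {
	delete(db, tbl)
	db["_"+tbl+"_old"] = true
}

// reverseStream is a toy stand-in for one reverse vreplication stream.
type reverseStream struct {
	buffered []string // target table of each buffered DML event
	state    string   // "Running" or "Error"
	deleted  bool     // true once the _vt.vreplication row is gone
}

// applyBuffered applies buffered events until one hits a missing table.
func (s *reverseStream) applyBuffered(db sourceDB) error {
	if s.deleted {
		return nil // no row, nothing left to apply
	}
	for _, tbl := range s.buffered {
		if !db[tbl] {
			s.state = "Error" // modeled as unrecoverable, as in the report
			return fmt.Errorf("error applying event on %s: %w", tbl, errNoSuchTable)
		}
	}
	s.buffered = nil
	return nil
}

// completeCurrentOrder replays steps 2 and 3 in the reported order.
func completeCurrentOrder(db sourceDB, s *reverseStream) error {
	db.rename("tbl1")          // step 2: rename while the stream is live
	err := s.applyBuffered(db) // the stream applies inside the window
	s.deleted = true           // step 3: row deleted, too late to help
	return err
}

func main() {
	db := sourceDB{"tbl1": true}
	s := &reverseStream{buffered: []string{"tbl1"}, state: "Running"}
	err := completeCurrentOrder(db, s)
	fmt.Println(s.state, errors.Is(err, errNoSuchTable)) // prints "Error true"
}
```

A single buffered event for the renamed table is enough to put the modeled stream into the error state before its row is deleted.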

Suggested fix

Close the window by either:

Option A — reorder (smaller change): in dropSources, swap the order
so reverse streams are deleted before source tables are renamed. The
controller will stop trying to apply once the row is gone, eliminating
the rename-vs-apply race. Forward streams are already Frozen so they
won't observe the rename either.
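
A minimal sketch of the Option A ordering, assuming the two cleanup actions can be called independently. The sourceCleaner interface, dropSourcesReordered, and the recorder test double are hypothetical names for illustration only; they are not the actual Vitess signatures.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// sourceCleaner is a hypothetical interface, not a Vitess type. It names the
// two cleanup actions from the report so their order can be shown in isolation.
type sourceCleaner interface {
	DropReverseStreams(ctx context.Context) error
	RenameSourceTables(ctx context.Context) error
}

// dropSourcesReordered sketches Option A: delete the reverse streams first and
// rename only if that succeeded, so no live stream can hit a renamed table.
func dropSourcesReordered(ctx context.Context, c sourceCleaner) error {
	if err := c.DropReverseStreams(ctx); err != nil {
		return fmt.Errorf("dropping reverse streams: %w", err)
	}
	if err := c.RenameSourceTables(ctx); err != nil {
		return fmt.Errorf("renaming source tables: %w", err)
	}
	return nil
}

// recorder is a test double that records the order of calls.
type recorder struct {
	calls   []string
	dropErr error
}

func (r *recorder) DropReverseStreams(context.Context) error {
	r.calls = append(r.calls, "drop-reverse-streams")
	return r.dropErr
}

func (r *recorder) RenameSourceTables(context.Context) error {
	r.calls = append(r.calls, "rename-source-tables")
	return nil
}

// errDrop is an illustrative failure for the drop step.
var errDrop = errors.New("tablet unreachable")

// runRecorded runs the reordered flow and returns the observed call order.
func runRecorded(dropErr error) ([]string, error) {
	r := &recorder{dropErr: dropErr}
	err := dropSourcesReordered(context.Background(), r)
	return r.calls, err
}

func main() {
	calls, err := runRecorded(nil)
	fmt.Println(calls, err)

	calls, err = runRecorded(errDrop)
	fmt.Println(calls, err)
}
```

The design choice worth noting is the early return: if the reverse streams cannot be dropped, the rename is not attempted, so a partial failure leaves the source tables untouched.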

Option B — explicit drain (more robust): before removeSourceTables,
explicitly stop the reverse workflow and wait for its applied position
to catch up to the latest source-side binlog position (or simply wait
for its apply queue to drain and confirm streams are in Stopped
state). Then proceed with the rename, then delete the streams.
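
The drain step in Option B could look roughly like the sketch below: stop, then poll until the stream reports Stopped with nothing queued, bounded by a context deadline. The reverseWorkflow interface, its Stop and Status methods, and the fakeWorkflow double are assumptions made up for this example, not existing Vitess APIs.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// reverseWorkflow is a hypothetical interface, not a Vitess API.
type reverseWorkflow interface {
	Stop(ctx context.Context) error
	// Status reports the stream state and how many events are still queued.
	Status(ctx context.Context) (state string, pending int, err error)
}

// stopAndDrain stops the workflow, then polls until it is Stopped with an
// empty apply queue or until ctx expires.
func stopAndDrain(ctx context.Context, wf reverseWorkflow, poll time.Duration) error {
	if err := wf.Stop(ctx); err != nil {
		return fmt.Errorf("stopping reverse workflow: %w", err)
	}
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		state, pending, err := wf.Status(ctx)
		if err != nil {
			return fmt.Errorf("reading reverse workflow status: %w", err)
		}
		if state == "Stopped" && pending == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("reverse workflow did not drain: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// fakeWorkflow drains one queued event per Status call unless stuck is set.
type fakeWorkflow struct {
	pending int
	stuck   bool
	stopped bool
}

func (f *fakeWorkflow) Stop(context.Context) error {
	f.stopped = true
	return nil
}

func (f *fakeWorkflow) Status(context.Context) (string, int, error) {
	if f.pending > 0 && !f.stuck {
		f.pending--
	}
	if f.stopped {
		return "Stopped", f.pending, nil
	}
	return "Running", f.pending, nil
}

// drainDemo runs stopAndDrain against a fake with a 200ms deadline.
func drainDemo(wf *fakeWorkflow) error {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	return stopAndDrain(ctx, wf, 5*time.Millisecond)
}

func main() {
	fmt.Println(drainDemo(&fakeWorkflow{pending: 3}))
	err := drainDemo(&fakeWorkflow{pending: 3, stuck: true})
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```

Bounding the wait with a deadline means a stream that never drains makes Complete fail loudly before any table is renamed, rather than hanging or proceeding into the race.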

Option B is safer if there's any concern about events that haven't yet
been read from the binlog at all (Option A doesn't drain those), though
those are arguably fine to discard once Complete has been called.

Either way validateWorkflowHasCompleted should probably grow a check
on the reverse workflow's state as well, not just the forward.
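
As a rough illustration of that extra check, the sketch below validates both directions. The stream struct and validateBothDirections are hypothetical simplifications; the real validateWorkflowHasCompleted reads workflow state from the tablets, and requiring reverse streams to be exactly "Stopped" is an assumption taken from Option B.

```go
package main

import "fmt"

// stream is a simplified, hypothetical view of one vreplication stream.
type stream struct {
	ID     int
	State  string // e.g. "Running", "Stopped", "Error"
	Frozen bool
}

// validateBothDirections checks that every forward stream is frozen and that
// no reverse stream is still in a state other than Stopped.
func validateBothDirections(forward, reverse []stream) error {
	for _, s := range forward {
		if !s.Frozen {
			return fmt.Errorf("forward stream %d is not frozen", s.ID)
		}
	}
	for _, s := range reverse {
		if s.State != "Stopped" {
			return fmt.Errorf("reverse stream %d is %s, want Stopped", s.ID, s.State)
		}
	}
	return nil
}

func main() {
	forward := []stream{{ID: 1, State: "Stopped", Frozen: true}}

	// A live reverse stream should block the rename.
	fmt.Println(validateBothDirections(forward, []stream{{ID: 2, State: "Running"}}))

	// A stopped reverse stream passes.
	fmt.Println(validateBothDirections(forward, []stream{{ID: 2, State: "Stopped"}}))
}
```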

Reproduction Steps

  1. Set up a MoveTables workflow between two keyspaces with at least one
    moderately busy table on the target. (Continuous writes on the target
    side after SwitchTraffic increase the odds of a buffered reverse event
    landing on the rename.)
  2. SwitchTraffic so the forward workflow goes Frozen and the reverse
    workflow takes over.
  3. While the reverse workflow has activity (in-flight DMLs), run
    MoveTables ... Complete --rename-tables=true.
  4. Observe errno 1146 apply errors on the reverse workflow streams and
    the controller parking them in error state, even though Complete
    reports success.

Probability scales with reverse-workflow throughput at the moment of
Complete and with the number of tables in the move. We hit it on a
production cutover; it occurs at a noticeable but non-deterministic rate.

Binary Version

PlanetScale Vitess v22

Operating System and Environment details

PlanetScale instance

Log Fragments
