Bug Report: race condition in `MoveTables ... Complete --rename-tables` #20135

### Overview of the Issue

`MoveTables ... Complete --rename-tables` has a race window where the reverse
workflow's apply path can hit a renamed source table and permanently error
the reverse stream. The reverse workflow is not stopped or drained before
the source tables are renamed; it is only deleted afterward.

## Observed behavior

The failure occurred while running `MoveTables ... Complete --rename-tables=true`
on a Tables-type workflow (one source keyspace, one target keyspace, table
moves) with a healthy forward workflow (Frozen) and an active reverse workflow.

Real-world timeline from a production cutover (single source shard, all
times UTC, all on the same source primary tablet):

| Δt        | Event |
|-----------|-------|
| T+0  ms   | vtctld logs `Renaming table <src_db>.<tbl1> to <src_db>._<tbl1>_old` (`traffic_switcher.go`, `removeSourceTables`) |
| T+292 ms  | source vttablet schema engine confirms: `created [_<tbl1>_old], dropped [<tbl1>]` |
| T+792 ms  | reverse workflow stream errors: `error applying event: Table '<src_db>.<tbl1>' doesn't exist (errno 1146) (sqlstate 42S02)` |
| T+792 ms  | `controller.go:317` classifies as unrecoverable and parks the stream in permanent error state |

`dropSourceReverseVReplicationStreams` deletes the reverse stream row from
`_vt.vreplication` after the rename completes — but the in-flight apply on
the now-renamed table has already failed and the controller has already
marked the stream errored, so deleting the row doesn't recover anything;
it just leaves an orphaned reverse-workflow entry that operators have to
clean up manually.

## Expected behavior

`Complete --rename-tables` is documented/intended to atomically finalize
the cutover: tear down the reverse workflow AND rename source tables. From
an operator's perspective there should be no window in which the reverse
workflow can apply to a table that Complete has already renamed.

## Root cause

In `dropSources` (`go/vt/vtctl/workflow/server.go`, the path that
`MoveTablesComplete` takes):

1. `validateWorkflowHasCompleted` — only reads the **forward** workflow
   on the targets and checks that its streams are `Frozen`
   (`go/vt/vtctl/workflow/utils.go`, the `ReadVReplicationWorkflow` call
   uses `ts.WorkflowName()`, which is the forward name). The reverse
   workflow's state is never inspected.
2. `removeSourceTables(ctx, removalType)` — issues `RENAME TABLE
   <src_db>.<tbl> TO <src_db>._<tbl>_old` on each source primary
   (`go/vt/vtctl/workflow/traffic_switcher.go`, in `removeSourceTables`).
   The reverse workflow is still running on the source primary at this
   point.
3. `dropArtifacts` → `dropSourceReverseVReplicationStreams` — only now
   does it `DELETE FROM _vt.vreplication` for the reverse streams.

Between steps 2 and 3 the reverse vreplicator is still:
- subscribed to the target keyspace's binlog stream,
- holding events in its in-memory apply pipeline,
- writing applied events back to the source DB.

Any DML for a just-renamed table — whether it arrived during the window or
was already buffered at the moment of rename — fails with `1146`. The
controller (`go/vt/vttablet/tabletmanager/vreplication/controller.go`,
around line 317) treats `1146` as unrecoverable and the stream stays in
error state forever, even though the row gets deleted milliseconds later.

## Suggested fix

Close the window by either:

**Option A — reorder** (smaller change): in `dropSources`, swap the order
so reverse streams are deleted before source tables are renamed. The
controller will stop trying to apply once the row is gone, eliminating
the rename-vs-apply race. Forward streams are already Frozen so they
won't observe the rename either.

**Option B — explicit drain** (more robust): before `removeSourceTables`,
explicitly stop the reverse workflow and wait for its applied position
to catch up to the latest source-side binlog position (or simply wait
for its apply queue to drain and confirm streams are in `Stopped`
state). Then proceed with the rename, then delete the streams.

Option B is safer if there's any concern about events that haven't yet
been read from the binlog at all (Option A doesn't drain those), though
those are arguably fine to discard once Complete has been called.

Either way `validateWorkflowHasCompleted` should probably grow a check
on the reverse workflow's state as well, not just the forward.

### Reproduction Steps

1. Set up a MoveTables workflow between two keyspaces with at least one
   moderately busy table on the target. (Continuous writes on the target
   side after SwitchTraffic increase the odds of a buffered reverse event
   landing on the rename.)
2. SwitchTraffic so the forward workflow goes Frozen and the reverse
   workflow takes over.
3. While the reverse workflow has activity (in-flight DMLs), run
   `MoveTables ... Complete --rename-tables=true`.
4. Observe `errno 1146` apply errors on the reverse workflow streams and
   the controller parking them in error state, even though Complete
   reports success.

Probability scales with reverse-workflow throughput at the moment of
Complete and with the number of tables in the move. We hit it on a
production cutover; it reproduces intermittently rather than
deterministically.

### Binary Version

```sh
PlanetScale Vitess v22
```

### Operating System and Environment details

```sh
PlanetScale instance
```

### Log Fragments

```sh

```
