[release-22.0] OnlineDDL: always close lock connection (#19586)#19720
Conversation
|
Hello @timvaillancourt, there are conflicts in this backport. Please address them in order to merge this Pull Request. You can execute the snippet below to reset your branch and resolve the conflict manually. Make sure you replace |
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
|
Resolved conflicts for backport of #19586. Conflict: Additional fix: replaced |
There was a problem hiding this comment.
Pull request overview
Backport to release-22.0 of an OnlineDDL cutover safety fix to reduce the risk of orphaned LOCK TABLES blocking traffic by ensuring the lock-connection is actively released/closed during cutover teardown.
Changes:
- Add a lock-connection teardown defer that attempts
UNLOCK TABLESand then closes the connection viaKill(...). - Refactor lock-wait timeout handling by introducing
initConnectionSessionTimeout(...)and delegatinginitConnectionLockWaitTimeout(...)to it. - Minor formatting/cleanup adjustments in
executor.go.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| defer func() { | ||
| // Always attempt UNLOCK TABLES first, as it releases locks immediately on this | ||
| // connection. Then kill the connection as a fallback to guarantee any held locks | ||
| // are released, even if UNLOCK TABLES were to fail. | ||
| unlockCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second) | ||
| defer cancel() | ||
| if _, err := lockConn.Conn.Exec(unlockCtx, sqlUnlockTables, 1, false); err != nil { | ||
| log.Warningf("Failed to UNLOCK TABLES in OnlineDDL migration %s: %v", onlineDDL.UUID, err) | ||
| } | ||
| if err := lockConn.Conn.Kill("closing lock tables connection", 0); err != nil { | ||
| log.Warningf("Failed to kill lock tables connection in OnlineDDL migration %s: %v", onlineDDL.UUID, err) | ||
| } |
There was a problem hiding this comment.
The defer comment says killing the lock connection is a fallback if UNLOCK TABLES fails, but the code calls lockConn.Conn.Kill(...) unconditionally. Either make the kill conditional (e.g., only if UNLOCK TABLES errors) or adjust the comment to reflect the intended always-kill behavior.
| func (e *Executor) initConnectionSessionTimeout(ctx context.Context, conn *connpool.Conn, variable string, timeout time.Duration) (deferFunc func(), err error) { | ||
| deferFunc = func() {} | ||
|
|
||
| if _, err := conn.Exec(ctx, `set @lock_wait_timeout=@@session.lock_wait_timeout`, 0, false); err != nil { | ||
| return deferFunc, vterrors.Errorf(vtrpcpb.Code_UNKNOWN, "could not read lock_wait_timeout: %v", err) | ||
| saveQuery, err := sqlparser.ParseAndBind( | ||
| fmt.Sprintf("set @%s=@@session.%s", variable, variable), | ||
| ) | ||
| if err != nil { | ||
| return deferFunc, err | ||
| } | ||
| if _, err := conn.Exec(ctx, saveQuery, 0, false); err != nil { | ||
| return deferFunc, vterrors.Wrapf(err, "could not read %s", variable) | ||
| } | ||
| setQuery, err := sqlparser.ParseAndBind( | ||
| fmt.Sprintf("set @@session.%s=%%a", variable), | ||
| sqltypes.Int64BindVariable(int64(timeout.Seconds())), | ||
| ) |
There was a problem hiding this comment.
initConnectionSessionTimeout builds SET statements by interpolating variable into the SQL (e.g. @@session.%s). Since this is not bindable, it would be safer to validate/whitelist allowed variable names inside this helper to prevent accidental misuse (or future injection risks) if it ever gets called with non-constant input.
| return deferFunc, err | ||
| } | ||
| deferFunc = func() { | ||
| conn.Exec(ctx, restoreQuery, 0, false) |
There was a problem hiding this comment.
The restore closure returned by initConnectionSessionTimeout uses the caller's ctx when executing the restore query. If ctx is canceled/expired during unwind, the restore will be skipped and the modified session variable may leak into a pooled connection (notably for renameConn when renameWasSuccessful is true and the connection is recycled). Consider restoring using a fresh context.WithTimeout(context.Background(), ...) and (optionally) logging restore failures.
| conn.Exec(ctx, restoreQuery, 0, false) | |
| restoreCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second) | |
| defer cancel() | |
| if _, err := conn.Exec(restoreCtx, restoreQuery, 0, false); err != nil { | |
| log.Warningf("initConnectionSessionTimeout: failed to restore session variable %s: %v", variable, err) | |
| } |
| // fixCompletedTimestampDone fixes a nil `completed_timestamp` columns, see | ||
| // https://github.com/vitessio/vitess/issues/13927 | ||
| // The fix is in release-18.0 | ||
| // TODO: remove in release-19.0 |
There was a problem hiding this comment.
This TODO references removing the workaround in release-19.0, but this is the release-22.0 branch. Since these lines are being touched, it would help to either update the TODO to the correct target (or explain why it still needs to exist) to avoid misleading future maintainers.
| // TODO: remove in release-19.0 | |
| // TODO: remove once it is safe to assume all clusters have been upgraded past release-18.0 |
Description
This is a backport of #19586