smartconnpool: consolidate shutdown signaling on lifetime context by arthurschreiber · Pull Request #20186 · vitessio/vitess

arthurschreiber · 2026-05-26T17:16:14Z

Description

The smartconnpool tracked "is the pool open" with two atomic pointers: a close channel pointer and a lifetime context pointer. The two had distinct timing — lifetime.cancel() fired at the start of Close, but close(closeChan) only fired after the drain completed. As a result, queued Get waiters and background workers were kept alive throughout the drain, even though there was nothing useful left for them to do.

This PR drops the close field entirely and uses the lifetime context's Done() as the single shutdown signal. runWorker and wait.waitForConn now take a context.Context instead of a <-chan struct{}, which is also more idiomatic Go.

Observable behavior changes

IsOpen() returns false as soon as CloseWithContext is entered, not after the drain completes. The only external caller in the repo (semisyncmonitor.go:167) is a startup probe and is unaffected.
Queued Get waiters return ErrConnPoolClosed at drain start rather than being parked until the drain finishes. They were unservicable during the drain anyway because tryReturnConn closes any Recycled conn while active > capacity.
The closeIdleResources reopen loop's in-flight connect call is cancelled at drain start, so it returns ctx.Err() instead of running to completion and producing a fresh conn that has to be discarded immediately.

Both behavior shifts are pinned by new tests: TestIsOpenIsFalseDuringDrain and TestWaiterUnblocksAtDrainStart.

Related Issue(s)

Stacked on top of #20122 (smartconnpool shutdown fixes). The diff against main will narrow naturally once #20122 merges and this PR is retargeted.

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

Deployment Notes

No user-facing config change. The minor behavior shifts on IsOpen() timing and waiter wake-up are described above; neither is reachable through any current external caller as a correctness-affecting change.

AI Disclosure

This PR was authored primarily by Claude Code — I provided direction and reviewed the design.

Two changes that together let Close return promptly when the MySQL backend is unhealthy: 1. tryReturnConn now closes connections eagerly once capacity has been driven to 0. Without this, in-flight Recycles during Close get handed to queued waiters instead of landing on a stack, and the setCapacity drain loop never observes the conns it needs to close. 2. The idle worker's connection reopen now uses a context that is cancelled at the start of Close. Previously connReopen used context.Background(), so a Close that races with an idle-tick reopen was extended by the full backend connect timeout per expired connection, ignoring PoolCloseTimeout. The context is held behind an atomic pointer to keep ConnPool's heap footprint stable; the lock-free stack's atomic-128 ops rely on Go's allocator placing the struct in a size class with 16-byte alignment, and inlining context.Context + CancelFunc directly crossed a size class boundary that no longer satisfied that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Switches the new shutdown tests to t.Cleanup for releasing the blocked connector and closing the pool so resources unwind even when an assertion fires partway through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

t.Context is cancelled just before t.Cleanup runs, so the blocked connector can wait on it directly instead of managing a dedicated release channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

The struct is the pool's open/close-bounded context, used by any code path that calls into a user-supplied callback during a shutdown — not just worker goroutines. Renaming makes the abstraction clearer and leaves room for future callers (e.g. Taint/Recycle's connNew) to use the same mechanism without a misleading name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Pooled.Taint and Pooled.Recycle on a closed conn both route through put(nil), which used to synchronously call connNew(context.Background()) to open a replacement. With a slow or unreachable backend this turned a non-blocking cleanup into a full MySQL connect-timeout round-trip on the caller's goroutine — a particularly bad shape for Taint, which is called from error-handling code paths. put(nil) now just decrements the active count and returns. The pool will reopen the slot lazily on the next Get via getNew, which is already designed to do exactly that. Steady-state behavior is unchanged in healthy pools; under an unhealthy backend the pool drains to whatever number of connections it can sustain instead of stalling callers waiting for connects that will time out. The TestOpen assertion that encoded the old eager-reopen behavior is updated to match: after p.put(nil) the slot is free until demand triggers a reopen. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

The maxLifetime branch of put used to synchronously call connReopen(context.Background()) to swap in a fresh connection, mirroring the put(nil) eager-reopen pattern. Under a slow or unreachable backend this turned every Recycle of an aged conn into a full MySQL connect-timeout round-trip on the caller's goroutine — worse than the Taint case because Recycle is the hot path. put now closes the maxLifetime-exceeded conn and frees the slot, the same way the put(nil) case does. The next Get lazily reopens via getNew. TestConnReopen and TestMaxLifetime, which encoded the old eager-reopen behavior, are updated to match. TestConnReopen still covers the idle-worker's (separate, still-eager) reopen path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Previous commits made put(nil) and put(maxLifetime) lazy — just decrement active and let the next Get reopen via getNew. That stranded any waiter already queued in the waitlist: the new caller's Get sees active < capacity, opens a fresh conn for itself via getNew, and the existing waiter is never fed. With nothing on a stack, the expire worker can't help either, so the waiter waits until its own ctx expires. Replace the lazy approach with an async spawn: put(nil) and put(maxLifetime) now go pool.openReplacement(), which runs connNew off the caller's goroutine and hands the result to a waiter (via tryReturnConn) on success or drops the active count on connect failure. The caller never blocks on connect, and queued waiters are still served whenever the backend will let us open one. Tests: - TestTaintWakesWaiter and TestRecycleMaxLifetimeWakesWaiter cover the regression directly. - TestTaintDoesNotBlockOnConnect and TestRecycleDoesNotBlockOnMaxLifetimeReopen keep their elapsed-time assertions but drop the lazy-decrement-specific ones (connectCalls, Active) that the new approach intentionally doesn't satisfy. - TestOpen, TestMaxLifetime, TestConnReopen, and TestCreateFailOnPut wrap the post-put assertions in EventuallyWithT since the reopen now completes asynchronously. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

connReopen used to read Setting() *after* replacing dbconn.Conn with a freshly-connected conn, so it always read nil — the conn silently migrated from settings[bucket] to clean on every reopen. The idle worker has done this since it was written; the new maxLifetime async reopen inherits the same path. Fixing connReopen to capture Setting before the replace looked obvious but it broke the get()/getWithSetting error paths, which rely on the bare-reopen behavior: when ResetSetting fails they call connReopen expecting it to leave the conn settingless so the caller can then ApplySetting (or omit it). Capturing-and-re-applying the old Setting there leaks an unwanted Setting back to the get() caller and mis-routes the conn to the wrong stack. So instead: keep connReopen bare, and move the capture+re-apply into the two callers that actually want preservation — closeIdleResources (idle worker) and reopenExpired (maxLifetime path). Both capture Setting before conn.Close, call connReopen, and ApplySetting on the new Conn. Two new tests cover the preserved-Setting behavior: - TestRecycleMaxLifetimePreservesSetting - TestIdleWorkerReopenPreservesSetting Both verify the reopened conn lands in settings[bucket] rather than clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Previous commits moved the replacement-open work for put(nil) and the maxLifetime branch onto background goroutines. That solved the hot-path stall under a sick backend but introduced async-test surgery in places that didn't need it. Step back: keep put synchronous and route its connect calls through pool.connectCtx() — the same lifetime context the idle worker already uses. A Close still cancels in-flight put-time connects (the original stall surface). Healthy pools see no change. Under a sick backend during normal operation, put still blocks on the connect timeout exactly as it did before this PR — that's an accepted trade-off, since fixing it is a bigger design conversation. Setting preservation on the maxLifetime path stays (capture before close, ApplySetting after the reopen). openReplacement and reopenExpired helpers are removed. The two "doesn't block" tests are reframed as the property the sync+lifetime approach actually provides: - TestTaintConnectCancelsOnClose - TestRecycleMaxLifetimeReopenCancelsOnClose Each parks the connector and asserts Close unblocks the in-flight reopen. TestTaintWakesWaiter and TestRecycleMaxLifetimeWakesWaiter are kept — with sync they're trivially satisfied but they're cheap defensive regression tests for the waiter hand-off. EventuallyWithT wrappers added for the async behavior in TestOpen, TestMaxLifetime, TestConnReopen, and TestCreateFailOnPut are reverted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

…eter connReopen now takes the Setting that should end up applied on the reopened conn (or nil for a bare reopen). Callers refreshing a conn in place pass the previous Setting through so the reopened conn lands back in the same settings stack; callers recovering from a ResetSetting failure pass nil so the conn comes back settingless and the caller can re-apply or omit. Removes the duplicated "capture Setting + ApplySetting after reopen" dance from closeIdleResources and the maxLifetime branch of put — the post-reopen apply now lives in connReopen itself. Net behavior matches the previous commit; the change is purely about making the contract explicit at the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Resolves the conflict in closeIdleResources by adopting upstream's two-loop structure (#20079: close all expired conns and drop their active slot first, then reopen up to capacity with CAS to respect mid-flight capacity changes) while keeping this branch's setting preservation (capture each conn's Setting before close, pass through connReopen's new setting parameter on reopen) and lifetime context (pool.connectCtx() so a Close cancels the in-flight reopen). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

The test polled for Metrics.IdleClosed() > 10 and then immediately asserted Active == 10. After the two-loop closeIdleResources from #20079, there's a brief window between "first loop dropped active for each closed conn" and "second loop's CAS restored active for each reopened conn" — the Eventually poll can land inside that window under stress and the next assertion sees Active < 10. Fold both conditions into a single EventuallyWithT so the poll keeps trying until they're both true simultaneously. Drop the trailing totalInStack <= 10 check: it covered the same intent (no leaked conns) but walked the lock-free clean stack without synchronization against the idle worker, and the assertion was trivially permissive since totalInStack can't exceed capacity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Pull the duplicated select/default close-channel polling into a tryReceiveClose helper, fold the repeated watchdog-failure cleanup sequences into an abort closure, and rename WatchdogDelay to Watchdog so both helpers use the same name for the same constant. Inline the unused worker := i alias. Add doc comments to the top-level test functions and to the storm Taint() fan-out so the reason for the goroutine wrapping (Taint synchronously enters the blocked connect path) is explicit. No behavior change. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Replace tryReturnConn's eager-close trigger with active > capacity. The old capacity == 0 check fires only when capacity is fully drained and misses partial reductions: SetCapacity(N → M) with 0 < M < N and queued waiters cycles Recycles to waiters without ever pushing to a stack, so the drain loop spins until ctx expires. active > capacity is the actual invariant we want to defend ("we have more conns out than configured"), and it strictly subsumes the previous check — capacity == 0 implies active >= 0 > -1, so anything the old check fired on, the new one fires on too. Add TestSetCapacityReductionDrainsWithWaiters that holds N conns, queues hold-forever waiters, then SetCapacity(1) with a 2s deadline. Without the change the test hangs at active=N, borrowed=N; with the change SetCapacity returns promptly and active drops to the target. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Resolved conflicts in go/pools/smartconnpool/pool.go where upstream's bulk idle-stack sweep (#20136) overlapped with this branch's shutdown fixes: - put(): kept upstream's cached `now` for the maxLifetime check while threading this branch's Setting and connectCtx through connReopen. - tryReturnConn(): kept this branch's active>capacity drain guard and folded in upstream's new updateIdleTime parameter. - closeIdleResources(): adopted upstream's PopAll-based linked-list bulk processing and read each expired conn's Setting() inline in the reopen loop (Close() doesn't touch the setting field). TestIdleTimeoutStopsReopeningWhenPoolCloses bypasses pool.Open(), so pool.lifetime was nil and connectCtx() returned a pre-cancelled context, breaking the test's connect-blocking setup. Set pool.lifetime explicitly in the test to match real Open() semantics. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Setting() is a pure field accessor that Close() doesn't touch, so there's no need to stash it in a local before calling connReopen — Go's argument evaluation runs Setting() before connReopen replaces dbconn.Conn anyway. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

The async-spawn approach was reverted; the replacement conn is opened synchronously on the caller's goroutine via connNew. Update the test doc comment so future readers aren't misled. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

…l-shutdown-fixes Signed-off-by: Arthur Schreiber <arthur@planetscale.com> # Conflicts: # go/pools/smartconnpool/pool.go # go/pools/smartconnpool/stress_test.go

The pool tracked "is the pool open" twice: a `close atomic.Pointer[chan struct{}]` and a `lifetime atomic.Pointer[lifetime]` (which carried a cancellable context for in-flight Connector calls). The two phases had distinct timing — `lifetime.cancel()` fired at the start of Close, but `close(closeChan)` only fired after the drain finished — which meant queued waiters and background workers were held through the drain even though there was nothing useful left for them to do. Drop the `close` field and use the lifetime context's `Done()` as the single shutdown signal. `runWorker` and `wait.waitForConn` take a `context.Context` instead of `<-chan struct{}`. `IsOpen` now flips false as soon as `CloseWithContext` is entered, and queued `Get` waiters return `ErrConnPoolClosed` at drain start rather than being parked until the drain completes (during which they were unservicable anyway — `tryReturnConn` already closes any Recycled conn while `active > capacity`). The new behavior is pinned by `TestIsOpenIsFalseDuringDrain` and `TestWaiterUnblocksAtDrainStart`. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

vitess-bot · 2026-05-26T17:16:42Z

Copilot

Pull request overview

This PR simplifies smartconnpool shutdown semantics by removing the separate “close channel” signal and using the pool lifetime context.Context as the single shutdown signal, so that workers, queued waiters, and in-flight connects unblock immediately when CloseWithContext starts.

Changes:

Remove the close channel pointer from ConnPool and key IsOpen() / shutdown checks off the lifetime context pointer.
Update worker loops and waitlist waiting to accept a context.Context and select on ctx.Done().
Add/adjust tests to pin the new observable shutdown behaviors (e.g., IsOpen() false during drain, waiters unblocking at drain start).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
go/pools/smartconnpool/waitlist.go	Switch waiter shutdown signal from a close channel to the pool lifetime context.
go/pools/smartconnpool/waitlist_test.go	Update waitlist tests to pass a lifetime context instead of a close channel.
go/pools/smartconnpool/pool.go	Remove `close` field, consolidate shutdown/open state on lifetime context, and wire workers/waiters to `ctx.Done()`.
go/pools/smartconnpool/pool_test.go	Update existing shutdown-related tests and add new tests for the revised `IsOpen()` and waiter-unblock timing contracts.

+	// Cancel the pool's lifetime context before we start draining: any user
+	// connect callback currently blocked behind it (e.g. the idle worker
+	// reopening an expired connection) unblocks immediately, workers exit,
+	// and queued waiters in wait.waitForConn return ErrConnPoolClosed
+	// without having to wait for the drain to finish.
+	lt := pool.lifetime.Swap(nil)
+	if lt == nil {
 		// already closed
 		pool.capacityMu.Unlock()
 		return nil
 	}
-
-	// Cancel the pool's lifetime context before we start draining: any user
-	// connect callback currently blocked behind it (e.g. the idle worker
-	// reopening an expired connection) needs to unblock now so that
-	// setCapacity can observe active dropping to zero and so that
-	// workers.Wait below isn't held up by a hung connect.
-	if lt := pool.lifetime.Swap(nil); lt != nil {
-		lt.cancel()
-	}
+	lt.cancel()


codecov · 2026-05-26T17:23:17Z

Codecov Report

❌ Patch coverage is 96.55172% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 57.32%. Comparing base (abaebfb) to head (b294679).

Files with missing lines	Patch %	Lines
go/pools/smartconnpool/pool.go	96.29%	1 Missing ⚠️

Additional details and impacted files

@@                           Coverage Diff                           @@
##           arthur/smartconnpool-shutdown-fixes   #20186      +/-   ##
=======================================================================
+ Coverage                                56.72%   57.32%   +0.59%     
=======================================================================
  Files                                        8        8              
  Lines                                      966      963       -3     
=======================================================================
+ Hits                                       548      552       +4     
+ Misses                                     418      411       -7

Flag	Coverage Δ
partial	`57.32% <96.55%> (+0.59%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

arthurschreiber and others added 21 commits May 18, 2026 12:29

smartconnpool: add shutdown stress tests

efdf6d8

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

smartconnpool: clean up shutdown stress tests

6d830af

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Merge remote-tracking branch 'upstream/main' into arthur/smartconnpoo…

abaebfb

…l-shutdown-fixes Signed-off-by: Arthur Schreiber <arthur@planetscale.com> # Conflicts: # go/pools/smartconnpool/pool.go # go/pools/smartconnpool/stress_test.go

Copilot AI review requested due to automatic review settings May 26, 2026 17:16

arthurschreiber added Type: Internal Cleanup Component: VTTablet labels May 26, 2026

Copilot started reviewing on behalf of arthurschreiber May 26, 2026 17:16 View session

github-actions Bot added this to the v25.0.0 milestone May 26, 2026

Copilot AI reviewed May 26, 2026

View reviewed changes

arthurschreiber mentioned this pull request May 26, 2026

smartconnpool: non-blocking lifecycle via generation counter #20187

Open

Base automatically changed from arthur/smartconnpool-shutdown-fixes to main May 27, 2026 16:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

smartconnpool: consolidate shutdown signaling on lifetime context#20186

smartconnpool: consolidate shutdown signaling on lifetime context#20186
arthurschreiber wants to merge 21 commits into
mainfrom
arthur/smartconnpool-lifetime-context-consolidation

arthurschreiber commented May 26, 2026

Uh oh!

vitess-bot Bot commented May 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

codecov Bot commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

arthurschreiber commented May 26, 2026

Description

Observable behavior changes

Related Issue(s)

Checklist

Deployment Notes

AI Disclosure

Uh oh!

vitess-bot Bot commented May 26, 2026

Review Checklist

General

Tests

Documentation

New flags

If a workflow is added or modified:

Backward compatibility

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

codecov Bot commented May 26, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants