smartconnpool: consolidate shutdown signaling on lifetime context#20186
smartconnpool: consolidate shutdown signaling on lifetime context#20186arthurschreiber wants to merge 21 commits into
Conversation
Two changes that together let Close return promptly when the MySQL backend is unhealthy: 1. tryReturnConn now closes connections eagerly once capacity has been driven to 0. Without this, in-flight Recycles during Close get handed to queued waiters instead of landing on a stack, and the setCapacity drain loop never observes the conns it needs to close. 2. The idle worker's connection reopen now uses a context that is cancelled at the start of Close. Previously connReopen used context.Background(), so a Close that races with an idle-tick reopen was extended by the full backend connect timeout per expired connection, ignoring PoolCloseTimeout. The context is held behind an atomic pointer to keep ConnPool's heap footprint stable; the lock-free stack's atomic-128 ops rely on Go's allocator placing the struct in a size class with 16-byte alignment, and inlining context.Context + CancelFunc directly crossed a size class boundary that no longer satisfied that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Switches the new shutdown tests to t.Cleanup for releasing the blocked connector and closing the pool so resources unwind even when an assertion fires partway through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
t.Context is cancelled just before t.Cleanup runs, so the blocked connector can wait on it directly instead of managing a dedicated release channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The struct is the pool's open/close-bounded context, used by any code path that calls into a user-supplied callback during a shutdown — not just worker goroutines. Renaming makes the abstraction clearer and leaves room for future callers (e.g. Taint/Recycle's connNew) to use the same mechanism without a misleading name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Pooled.Taint and Pooled.Recycle on a closed conn both route through put(nil), which used to synchronously call connNew(context.Background()) to open a replacement. With a slow or unreachable backend this turned a non-blocking cleanup into a full MySQL connect-timeout round-trip on the caller's goroutine — a particularly bad shape for Taint, which is called from error-handling code paths. put(nil) now just decrements the active count and returns. The pool will reopen the slot lazily on the next Get via getNew, which is already designed to do exactly that. Steady-state behavior is unchanged in healthy pools; under an unhealthy backend the pool drains to whatever number of connections it can sustain instead of stalling callers waiting for connects that will time out. The TestOpen assertion that encoded the old eager-reopen behavior is updated to match: after p.put(nil) the slot is free until demand triggers a reopen. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The maxLifetime branch of put used to synchronously call connReopen(context.Background()) to swap in a fresh connection, mirroring the put(nil) eager-reopen pattern. Under a slow or unreachable backend this turned every Recycle of an aged conn into a full MySQL connect-timeout round-trip on the caller's goroutine — worse than the Taint case because Recycle is the hot path. put now closes the maxLifetime-exceeded conn and frees the slot, the same way the put(nil) case does. The next Get lazily reopens via getNew. TestConnReopen and TestMaxLifetime, which encoded the old eager-reopen behavior, are updated to match. TestConnReopen still covers the idle-worker's (separate, still-eager) reopen path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Previous commits made put(nil) and put(maxLifetime) lazy — just decrement active and let the next Get reopen via getNew. That stranded any waiter already queued in the waitlist: the new caller's Get sees active < capacity, opens a fresh conn for itself via getNew, and the existing waiter is never fed. With nothing on a stack, the expire worker can't help either, so the waiter waits until its own ctx expires. Replace the lazy approach with an async spawn: put(nil) and put(maxLifetime) now go pool.openReplacement(), which runs connNew off the caller's goroutine and hands the result to a waiter (via tryReturnConn) on success or drops the active count on connect failure. The caller never blocks on connect, and queued waiters are still served whenever the backend will let us open one. Tests: - TestTaintWakesWaiter and TestRecycleMaxLifetimeWakesWaiter cover the regression directly. - TestTaintDoesNotBlockOnConnect and TestRecycleDoesNotBlockOnMaxLifetimeReopen keep their elapsed-time assertions but drop the lazy-decrement-specific ones (connectCalls, Active) that the new approach intentionally doesn't satisfy. - TestOpen, TestMaxLifetime, TestConnReopen, and TestCreateFailOnPut wrap the post-put assertions in EventuallyWithT since the reopen now completes asynchronously. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
connReopen used to read Setting() *after* replacing dbconn.Conn with a freshly-connected conn, so it always read nil — the conn silently migrated from settings[bucket] to clean on every reopen. The idle worker has done this since it was written; the new maxLifetime async reopen inherits the same path. Fixing connReopen to capture Setting before the replace looked obvious but it broke the get()/getWithSetting error paths, which rely on the bare-reopen behavior: when ResetSetting fails they call connReopen expecting it to leave the conn settingless so the caller can then ApplySetting (or omit it). Capturing-and-re-applying the old Setting there leaks an unwanted Setting back to the get() caller and mis-routes the conn to the wrong stack. So instead: keep connReopen bare, and move the capture+re-apply into the two callers that actually want preservation — closeIdleResources (idle worker) and reopenExpired (maxLifetime path). Both capture Setting before conn.Close, call connReopen, and ApplySetting on the new Conn. Two new tests cover the preserved-Setting behavior: - TestRecycleMaxLifetimePreservesSetting - TestIdleWorkerReopenPreservesSetting Both verify the reopened conn lands in settings[bucket] rather than clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Previous commits moved the replacement-open work for put(nil) and the maxLifetime branch onto background goroutines. That solved the hot-path stall under a sick backend but introduced async-test surgery in places that didn't need it. Step back: keep put synchronous and route its connect calls through pool.connectCtx() — the same lifetime context the idle worker already uses. A Close still cancels in-flight put-time connects (the original stall surface). Healthy pools see no change. Under a sick backend during normal operation, put still blocks on the connect timeout exactly as it did before this PR — that's an accepted trade-off, since fixing it is a bigger design conversation. Setting preservation on the maxLifetime path stays (capture before close, ApplySetting after the reopen). openReplacement and reopenExpired helpers are removed. The two "doesn't block" tests are reframed as the property the sync+lifetime approach actually provides: - TestTaintConnectCancelsOnClose - TestRecycleMaxLifetimeReopenCancelsOnClose Each parks the connector and asserts Close unblocks the in-flight reopen. TestTaintWakesWaiter and TestRecycleMaxLifetimeWakesWaiter are kept — with sync they're trivially satisfied but they're cheap defensive regression tests for the waiter hand-off. EventuallyWithT wrappers added for the async behavior in TestOpen, TestMaxLifetime, TestConnReopen, and TestCreateFailOnPut are reverted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
…eter connReopen now takes the Setting that should end up applied on the reopened conn (or nil for a bare reopen). Callers refreshing a conn in place pass the previous Setting through so the reopened conn lands back in the same settings stack; callers recovering from a ResetSetting failure pass nil so the conn comes back settingless and the caller can re-apply or omit. Removes the duplicated "capture Setting + ApplySetting after reopen" dance from closeIdleResources and the maxLifetime branch of put — the post-reopen apply now lives in connReopen itself. Net behavior matches the previous commit; the change is purely about making the contract explicit at the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Resolves the conflict in closeIdleResources by adopting upstream's two-loop structure (#20079: close all expired conns and drop their active slot first, then reopen up to capacity with CAS to respect mid-flight capacity changes) while keeping this branch's setting preservation (capture each conn's Setting before close, pass through connReopen's new setting parameter on reopen) and lifetime context (pool.connectCtx() so a Close cancels the in-flight reopen). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The test polled for Metrics.IdleClosed() > 10 and then immediately asserted Active == 10. After the two-loop closeIdleResources from #20079, there's a brief window between "first loop dropped active for each closed conn" and "second loop's CAS restored active for each reopened conn" — the Eventually poll can land inside that window under stress and the next assertion sees Active < 10. Fold both conditions into a single EventuallyWithT so the poll keeps trying until they're both true simultaneously. Drop the trailing totalInStack <= 10 check: it covered the same intent (no leaked conns) but walked the lock-free clean stack without synchronization against the idle worker, and the assertion was trivially permissive since totalInStack can't exceed capacity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Pull the duplicated select/default close-channel polling into a tryReceiveClose helper, fold the repeated watchdog-failure cleanup sequences into an abort closure, and rename WatchdogDelay to Watchdog so both helpers use the same name for the same constant. Inline the unused worker := i alias. Add doc comments to the top-level test functions and to the storm Taint() fan-out so the reason for the goroutine wrapping (Taint synchronously enters the blocked connect path) is explicit. No behavior change. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Replace tryReturnConn's eager-close trigger with active > capacity. The
old capacity == 0 check fires only when capacity is fully drained and
misses partial reductions: SetCapacity(N → M) with 0 < M < N and
queued waiters cycles Recycles to waiters without ever pushing to a
stack, so the drain loop spins until ctx expires.
active > capacity is the actual invariant we want to defend ("we have
more conns out than configured"), and it strictly subsumes the
previous check — capacity == 0 implies active >= 0 > -1, so anything
the old check fired on, the new one fires on too.
Add TestSetCapacityReductionDrainsWithWaiters that holds N conns,
queues hold-forever waiters, then SetCapacity(1) with a 2s deadline.
Without the change the test hangs at active=N, borrowed=N; with the
change SetCapacity returns promptly and active drops to the target.
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Resolved conflicts in go/pools/smartconnpool/pool.go where upstream's bulk idle-stack sweep (#20136) overlapped with this branch's shutdown fixes: - put(): kept upstream's cached `now` for the maxLifetime check while threading this branch's Setting and connectCtx through connReopen. - tryReturnConn(): kept this branch's active>capacity drain guard and folded in upstream's new updateIdleTime parameter. - closeIdleResources(): adopted upstream's PopAll-based linked-list bulk processing and read each expired conn's Setting() inline in the reopen loop (Close() doesn't touch the setting field). TestIdleTimeoutStopsReopeningWhenPoolCloses bypasses pool.Open(), so pool.lifetime was nil and connectCtx() returned a pre-cancelled context, breaking the test's connect-blocking setup. Set pool.lifetime explicitly in the test to match real Open() semantics. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Setting() is a pure field accessor that Close() doesn't touch, so there's no need to stash it in a local before calling connReopen — Go's argument evaluation runs Setting() before connReopen replaces dbconn.Conn anyway. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The async-spawn approach was reverted; the replacement conn is opened synchronously on the caller's goroutine via connNew. Update the test doc comment so future readers aren't misled. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
…l-shutdown-fixes Signed-off-by: Arthur Schreiber <arthur@planetscale.com> # Conflicts: # go/pools/smartconnpool/pool.go # go/pools/smartconnpool/stress_test.go
The pool tracked "is the pool open" twice: a `close atomic.Pointer[chan
struct{}]` and a `lifetime atomic.Pointer[lifetime]` (which carried a
cancellable context for in-flight Connector calls). The two phases had
distinct timing — `lifetime.cancel()` fired at the start of Close, but
`close(closeChan)` only fired after the drain finished — which meant
queued waiters and background workers were held through the drain even
though there was nothing useful left for them to do.
Drop the `close` field and use the lifetime context's `Done()` as the
single shutdown signal. `runWorker` and `wait.waitForConn` take a
`context.Context` instead of `<-chan struct{}`. `IsOpen` now flips false
as soon as `CloseWithContext` is entered, and queued `Get` waiters
return `ErrConnPoolClosed` at drain start rather than being parked until
the drain completes (during which they were unservicable anyway —
`tryReturnConn` already closes any Recycled conn while `active >
capacity`).
The new behavior is pinned by `TestIsOpenIsFalseDuringDrain` and
`TestWaiterUnblocksAtDrainStart`.
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
There was a problem hiding this comment.
Pull request overview
This PR simplifies smartconnpool shutdown semantics by removing the separate “close channel” signal and using the pool lifetime context.Context as the single shutdown signal, so that workers, queued waiters, and in-flight connects unblock immediately when CloseWithContext starts.
Changes:
- Remove the
closechannel pointer fromConnPooland keyIsOpen()/ shutdown checks off the lifetime context pointer. - Update worker loops and waitlist waiting to accept a
context.Contextand select onctx.Done(). - Add/adjust tests to pin the new observable shutdown behaviors (e.g.,
IsOpen()false during drain, waiters unblocking at drain start).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| go/pools/smartconnpool/waitlist.go | Switch waiter shutdown signal from a close channel to the pool lifetime context. |
| go/pools/smartconnpool/waitlist_test.go | Update waitlist tests to pass a lifetime context instead of a close channel. |
| go/pools/smartconnpool/pool.go | Remove close field, consolidate shutdown/open state on lifetime context, and wire workers/waiters to ctx.Done(). |
| go/pools/smartconnpool/pool_test.go | Update existing shutdown-related tests and add new tests for the revised IsOpen() and waiter-unblock timing contracts. |
| // Cancel the pool's lifetime context before we start draining: any user | ||
| // connect callback currently blocked behind it (e.g. the idle worker | ||
| // reopening an expired connection) unblocks immediately, workers exit, | ||
| // and queued waiters in wait.waitForConn return ErrConnPoolClosed | ||
| // without having to wait for the drain to finish. | ||
| lt := pool.lifetime.Swap(nil) | ||
| if lt == nil { | ||
| // already closed | ||
| pool.capacityMu.Unlock() | ||
| return nil | ||
| } | ||
|
|
||
| // Cancel the pool's lifetime context before we start draining: any user | ||
| // connect callback currently blocked behind it (e.g. the idle worker | ||
| // reopening an expired connection) needs to unblock now so that | ||
| // setCapacity can observe active dropping to zero and so that | ||
| // workers.Wait below isn't held up by a hung connect. | ||
| if lt := pool.lifetime.Swap(nil); lt != nil { | ||
| lt.cancel() | ||
| } | ||
| lt.cancel() |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## arthur/smartconnpool-shutdown-fixes #20186 +/- ##
=======================================================================
+ Coverage 56.72% 57.32% +0.59%
=======================================================================
Files 8 8
Lines 966 963 -3
=======================================================================
+ Hits 548 552 +4
+ Misses 418 411 -7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Description
The smartconnpool tracked "is the pool open" with two atomic pointers: a
closechannel pointer and alifetimecontext pointer. The two had distinct timing —lifetime.cancel()fired at the start ofClose, butclose(closeChan)only fired after the drain completed. As a result, queuedGetwaiters and background workers were kept alive throughout the drain, even though there was nothing useful left for them to do.This PR drops the
closefield entirely and uses the lifetime context'sDone()as the single shutdown signal.runWorkerandwait.waitForConnnow take acontext.Contextinstead of a<-chan struct{}, which is also more idiomatic Go.Observable behavior changes
IsOpen()returns false as soon asCloseWithContextis entered, not after the drain completes. The only external caller in the repo (semisyncmonitor.go:167) is a startup probe and is unaffected.Getwaiters returnErrConnPoolClosedat drain start rather than being parked until the drain finishes. They were unservicable during the drain anyway becausetryReturnConncloses any Recycled conn whileactive > capacity.closeIdleResourcesreopen loop's in-flightconnectcall is cancelled at drain start, so it returnsctx.Err()instead of running to completion and producing a fresh conn that has to be discarded immediately.Both behavior shifts are pinned by new tests:
TestIsOpenIsFalseDuringDrainandTestWaiterUnblocksAtDrainStart.Related Issue(s)
Stacked on top of #20122 (smartconnpool shutdown fixes). The diff against
mainwill narrow naturally once #20122 merges and this PR is retargeted.Checklist
Deployment Notes
No user-facing config change. The minor behavior shifts on
IsOpen()timing and waiter wake-up are described above; neither is reachable through any current external caller as a correctness-affecting change.AI Disclosure
This PR was authored primarily by Claude Code — I provided direction and reviewed the design.