smartconnpool: unblock Close and preserve Setting on reopen #20122
Two changes that together let Close return promptly when the MySQL backend is unhealthy:
1. tryReturnConn now closes connections eagerly once capacity has been driven to 0. Without this, in-flight Recycles during Close get handed to queued waiters instead of landing on a stack, and the setCapacity drain loop never observes the conns it needs to close.
2. The idle worker's connection reopen now uses a context that is cancelled at the start of Close. Previously connReopen used context.Background(), so a Close that races with an idle-tick reopen was extended by the full backend connect timeout per expired connection, ignoring PoolCloseTimeout.
The context is held behind an atomic pointer to keep ConnPool's heap footprint stable; the lock-free stack's atomic-128 ops rely on Go's allocator placing the struct in a size class with 16-byte alignment, and inlining context.Context + CancelFunc directly crossed a size class boundary that no longer satisfied that.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Pull request overview
Two small fixes to smartconnpool.ConnPool to ensure Close() returns promptly when the MySQL backend is unhealthy: connections being recycled during a close are closed eagerly rather than handed to waiters, and the idle-timeout worker's in-flight connect is now cancelled at close start instead of blocking for the full backend connect timeout.
Changes:
- tryReturnConn now closes connections immediately when capacity has been driven to 0 (Close or explicit SetCapacity(0)), preventing waiter-to-waiter handoff loops that starve the setCapacity drain.
- Introduces a heap-allocated workerLifetime{ctx, cancel} (held behind atomic.Pointer to preserve ConnPool's size class and 16-byte alignment for the lock-free stack's atomic-128 ops) that is cancelled at the start of CloseWithContext; the idle worker uses this context for connReopen.
- Two new regression tests: TestCloseDoesNotHandOffToWaiters and TestIdleWorkerConnectCancelsOnClose.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| go/pools/smartconnpool/pool.go | Adds workerLifetime struct + atomic pointer, cancels it at Close start, uses pool worker ctx for idle-worker connReopen, and closes recycled conns eagerly when capacity is 0. |
| go/pools/smartconnpool/pool_test.go | Adds two regression tests covering the no-handoff-after-Close guarantee and the cancellation of the idle-worker reopen on Close. |
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #20122 +/- ##
===========================================
- Coverage 69.67% 56.52% -13.16%
===========================================
Files 1614 8 -1606
Lines 216793 966 -215827
===========================================
- Hits 151044 546 -150498
+ Misses 65749 420 -65329
Switches the new shutdown tests to t.Cleanup for releasing the blocked connector and closing the pool so resources unwind even when an assertion fires partway through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
t.Context is cancelled just before t.Cleanup runs, so the blocked connector can wait on it directly instead of managing a dedicated release channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The struct is the pool's open/close-bounded context, used by any code path that calls into a user-supplied callback during a shutdown — not just worker goroutines. Renaming makes the abstraction clearer and leaves room for future callers (e.g. Taint/Recycle's connNew) to use the same mechanism without a misleading name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Replace tryReturnConn's eager-close trigger with active > capacity. The
old capacity == 0 check fires only when capacity is fully drained and
misses partial reductions: SetCapacity(N → M) with 0 < M < N and
queued waiters cycles Recycles to waiters without ever pushing to a
stack, so the drain loop spins until ctx expires.
active > capacity is the actual invariant we want to defend ("we have
more conns out than configured"), and it subsumes the previous check
wherever that check could fire: the conn being returned is itself
still open, so active >= 1 > 0 whenever capacity == 0. Anything the
old check fired on, the new one fires on too.
Add TestSetCapacityReductionDrainsWithWaiters that holds N conns,
queues hold-forever waiters, then SetCapacity(1) with a 2s deadline.
Without the change the test hangs at active=N, borrowed=N; with the
change SetCapacity returns promptly and active drops to the target.
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
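The difference between the two triggers can be seen in a toy model. This is a minimal sketch, assuming only two counters and a stream of Recycle calls while waiters are queued; `toyPool`, `recycleAll` and the two predicate functions are hypothetical names, not code from pool.go.

```go
package main

import "fmt"

// toyPool models only the two counters the guard compares (hypothetical).
type toyPool struct {
	active   int // open connections, borrowed or idle
	capacity int // configured maximum
}

// closeIfCapacityZero is the old trigger: it fires only on a full drain.
func closeIfCapacityZero(p toyPool) bool { return p.capacity == 0 }

// closeIfOverCapacity is the new trigger: it fires on any excess.
func closeIfOverCapacity(p toyPool) bool { return p.active > p.capacity }

// recycleAll simulates `returns` Recycle calls while waiters are queued and
// reports how many connections remain open afterwards.
func recycleAll(p toyPool, returns int, shouldClose func(toyPool) bool) int {
	for i := 0; i < returns; i++ {
		if shouldClose(p) {
			p.active-- // connection is closed instead of handed off
		}
		// otherwise the connection goes straight to a waiter; active is unchanged
	}
	return p.active
}

func main() {
	shrink := toyPool{active: 4, capacity: 1} // SetCapacity(4 → 1)
	fmt.Println(recycleAll(shrink, 4, closeIfCapacityZero)) // 4: the drain never makes progress
	fmt.Println(recycleAll(shrink, 4, closeIfOverCapacity)) // 1: active drops to the target
}
```

With the old predicate the partial reduction never closes anything, which matches the hang at active=N that the commit message describes; with the new one, handoff to waiters resumes once active reaches the target.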
mattlord
left a comment
LGTM!
One tiny nit / non-blocking cleanup: go/pools/smartconnpool/pool_test.go:1943-1946 still says the replacement conn is opened off the caller goroutine, but the final implementation is synchronous again. I’d update that comment to avoid confusing future us.
Resolved conflicts in go/pools/smartconnpool/pool.go where upstream's bulk idle-stack sweep (#20136) overlapped with this branch's shutdown fixes:
- put(): kept upstream's cached `now` for the maxLifetime check while threading this branch's Setting and connectCtx through connReopen.
- tryReturnConn(): kept this branch's active>capacity drain guard and folded in upstream's new updateIdleTime parameter.
- closeIdleResources(): adopted upstream's PopAll-based linked-list bulk processing and read each expired conn's Setting() inline in the reopen loop (Close() doesn't touch the setting field).
TestIdleTimeoutStopsReopeningWhenPoolCloses bypasses pool.Open(), so pool.lifetime was nil and connectCtx() returned a pre-cancelled context, breaking the test's connect-blocking setup. Set pool.lifetime explicitly in the test to match real Open() semantics.
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Setting() is a pure field accessor that Close() doesn't touch, so there's no need to stash it in a local before calling connReopen — Go's argument evaluation runs Setting() before connReopen replaces dbconn.Conn anyway. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
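The evaluation-order point can be checked in isolation. The following is a sketch with hypothetical types (`setting`, `conn`, `pooled`) and hypothetical functions (`reopenKeeping`, `reopenReadingLate`), not the real connReopen. Go evaluates call arguments before the function body runs, so an accessor in the argument list sees the old connection, while the same accessor called after the replacement sees the fresh one.

```go
package main

import "fmt"

type setting struct{ name string }

// conn is a stand-in for a backend connection carrying an applied setting.
type conn struct{ setting *setting }

// Setting is a plain field accessor.
func (c *conn) Setting() *setting { return c.setting }

// pooled wraps a conn that can be swapped out on reopen.
type pooled struct{ conn *conn }

// reopenKeeping replaces the conn and reapplies the setting the caller passed
// in. Passing nil gives a bare, settingless reopen.
func reopenKeeping(p *pooled, s *setting) {
	p.conn = &conn{}
	p.conn.setting = s
}

// reopenReadingLate shows the failure mode: it reads the setting only after
// the conn has been replaced, so it always sees nil.
func reopenReadingLate(p *pooled) {
	p.conn = &conn{}
	p.conn.setting = p.conn.Setting()
}

func main() {
	s := &setting{name: "sql_mode=ANSI"}

	a := &pooled{conn: &conn{setting: s}}
	// a.conn.Setting() is evaluated here, before reopenKeeping swaps a.conn.
	reopenKeeping(a, a.conn.Setting())
	fmt.Println(a.conn.Setting() == s) // true

	b := &pooled{conn: &conn{setting: s}}
	reopenReadingLate(b)
	fmt.Println(b.conn.Setting() == nil) // true
}
```

An explicit parameter also leaves room for the bare-reopen case, where a caller deliberately passes nil.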
The async-spawn approach was reverted; the replacement conn is opened synchronously on the caller's goroutine via connNew. Update the test doc comment so future readers aren't misled. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
…l-shutdown-fixes
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
# Conflicts:
# go/pools/smartconnpool/pool.go
# go/pools/smartconnpool/stress_test.go
… tests Replace requireReceive and requireMaxLifetimeExpired with the canonical Go patterns at each call site: a blocking select with time.After, and a direct time.Sleep on the configured maxLifetime. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
go/pools/smartconnpool/pool.go:483
- In put(nil), the pool now attempts connNew(pool.connectCtx()) even when the pool is already closed (pool.close.Load()==nil). Since connectors are not required to honor ctx cancellation (e.g. tests’ newConnector ignores ctx), a late Recycle/Taint after Close can block in the user connector and even attempt opening new backend connections after shutdown. Restore the fast-path that decrements active and returns without calling connect when the pool is already closed.
	if conn == nil {
		// Taint, or Recycle on a closed conn: open a replacement so a
		// queued waiter can be served. We use the pool's lifetime context
		// so a Close in flight unblocks this connect, instead of waiting
		// for the backend's connect timeout.
		var err error
		conn, err = pool.connNew(pool.connectCtx())
		if err != nil {
			pool.closedConn()
			return
		}
// close — they should fail fast rather than open a connection that has
// nowhere to go.
var alreadyCancelled = func() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
It's too bad the context package doesn't have a "nice" pre-cancelled context.
I just wanted to add an alternative in case it is somehow more efficient or preferable:
var alreadyCancelled, _ = context.WithDeadline(context.Background(), time.Time{})
LGTM @arthurschreiber. Added one alternative option if it's interesting to you.
Thanks. I don't think this makes any difference either way, so I'll keep it as-is! 🙇‍♂️
Description
Four small changes to the smartconnpool. The first three address shutdown stalls when the MySQL backend is unhealthy; the fourth fixes a long-standing silent loss of connection settings on reopen. Each is scoped to a specific code path; healthy pools see no behavior change.
1. tryReturnConn closes eagerly when the pool is over capacity
Whenever pool.active > pool.capacity (driven either by Close()/SetCapacity(0) taking capacity to 0 or by any SetCapacity(N) reducing it below the current active count), Recycled connections are closed immediately instead of being handed to whichever waiter is still in the waitlist. Without this, conns get passed waiter-to-waiter during the drain and the setCapacity loop's 10ms-sleep polls never observe a non-empty stack to close. The check is active > capacity rather than capacity == 0 so partial reductions get the same drain semantics as a full Close: SetCapacity(N → M) with 0 < M < N and a non-empty waitlist also gets a clean drain.
2. Pool-wide lifetime context, cancelled on Close
The pool now holds a context.Context (in a small heap-allocated lifetime struct, behind an atomic.Pointer) that is cancelled at the very start of CloseWithContext, before setCapacity(0).
The pointer indirection is deliberate: inlining context.Context (16B) + CancelFunc (8B) directly into ConnPool pushed the struct's size across a Go allocator size-class boundary and exposed a latent 16-byte alignment requirement of the lock-free stack's atomic-128 ops, causing SIGBUS on ARM64. Keeping the inline footprint at 8 bytes preserves the size class.
3. The idle worker, put(nil), and the maxLifetime put branch all use the lifetime ctx for their connect calls
Previously these three sites passed context.Background() to connect(...), so a Close() racing with an in-flight reopen was extended by the full db-connect-timeout-ms per blocked connect, completely ignoring PoolCloseTimeout. They now use pool.connectCtx(), so a Close unblocks any in-flight connect immediately, provided the connector honors ctx cancellation. put stays synchronous on the caller's goroutine (the same shape as before this PR) but no longer holds up Close.
4. connReopen no longer silently drops the connection's Setting
connReopen used to read Setting() after replacing dbconn.Conn with a freshly-connected conn, so it always read nil and the conn silently migrated from settings[bucket] to clean on every reopen, both for the idle worker and for the maxLifetime path.
Fixing connReopen itself to unconditionally capture Setting before the replace broke the get()/getWithSetting error paths, which rely on the bare-reopen behavior to fall back to a settingless conn when ResetSetting fails. So connReopen now takes an explicit setting *Setting parameter: callers refreshing a conn in place (closeIdleResources and the maxLifetime branch of put) capture the previous Setting and pass it through; callers recovering from a ResetSetting failure pass nil for a bare reopen.
What this PR does not do
Under a sick backend during normal operation (not Close), Taint and Recycle still synchronously block on the connect attempt for the full backend connect timeout. That matches the behavior before this PR; fixing it requires moving the connect off the caller's goroutine, which has its own design trade-offs and is a separate conversation. The Close-time stalls are the immediate operational problem this PR addresses.
Related Issue(s)
None. Discovered during a deep review of the smartconnpool stall surface.
Checklist
New and updated coverage:
- TestCloseDoesNotHandOffToWaiters: Close doesn't hand Recycled conns to waiters once capacity is 0.
- TestSetCapacityReductionDrainsWithWaiters: SetCapacity(N → M) with 0 < M < N and hold-forever waiters completes within its deadline; recycled conns are closed when active > capacity rather than handed off and stuck.
- TestIdleWorkerConnectCancelsOnClose: idle worker's connect unblocks when Close fires.
- TestTaintConnectCancelsOnClose: Taint's synchronous reopen unblocks when Close fires.
- TestRecycleMaxLifetimeReopenCancelsOnClose: the maxLifetime reopen in put unblocks when Close fires.
- TestTaintWakesWaiter and TestRecycleMaxLifetimeWakesWaiter: queued waiters are still served by the synchronous replacement.
- TestRecycleMaxLifetimePreservesSetting and TestIdleWorkerReopenPreservesSetting: reopened conns return to the same settings stack.
- TestIdleTimeoutDoesntLeaveLingeringConnection: tightened to account for the two-loop idle refresh path without racing the worker.
- TestStressCloseDuringReconnectStorm: stress-covers Close racing with blocked reconnects and verifies shutdown completes without leaking conns.
- TestStressWaiterStormDuringDrain: stress-covers queued waiters during Close drain and verifies returned conns are closed rather than handed off.
Existing TestCloseDuringWaitForConn × 50 iterations under -race continues to pass.
Backport justification
These are production-reliability fixes for vttablet shutdown against an unhealthy backend. With a dead/unreachable MySQL:
- connReopen, plus the put(nil) and maxLifetime branches of put, all blocked for the full backend connect timeout per in-flight connect, ignoring PoolCloseTimeout. Observed Close latency goes from the intended ~10s up to many minutes.
- Recycles in flight during Close get cycled waiter-to-waiter instead of landing on a stack the drain loop can see, adding more polling latency.
A vttablet restart against a dead primary is exactly when fast shutdown matters most. The changes are scoped to the Close, drain, and reopen paths; healthy pools see no behavior change beyond the (long-broken) setting preservation now actually working.
Deployment Notes
One user-visible behavior change: connections being Recycled while the pool is over its configured capacity (during Close(), SetCapacity(0), or any SetCapacity(N) that reduces capacity below current active) are now closed rather than handed to queued waiters. Those waiters will return with ErrConnPoolClosed once Close() finishes signalling the close channel, or with their own ctx timeout otherwise.
AI Disclosure
This PR was implemented by Claude Code based on a deep review and step-by-step fix plan I directed.