You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
smartconnpool has three operations that mutate pool lifecycle state, all serialized today behind capacityMu, and all blocking:
SetCapacity(ctx, newcap) — operator-driven capacity change. Blocks on a 10ms sleep loop waiting for borrowed conns to be Recycled so it can pop and close them.
reopen() — refresh-worker triggered on backend rotation (e.g. DNS change). Does setCapacity(0) + setCapacity(originalCap) to forcibly cycle every conn so they reconnect to the new backend.
Close() / CloseWithContext(ctx) — pool teardown. Blocks until every conn is returned (or the ctx expires).
None of the blocking semantics are load-bearing for correctness. They each conflate two concerns: transitioning the pool state (cheap, atomic) and waiting for callers to give conns back (slow, depends on caller behavior).
Proposal
Make all three lifecycle ops non-blocking lock-free atomic transitions plus a bounded drain. Introduce a generation counter as the keystone primitive that makes reopen non-blocking and lets capacityMu be removed entirely.
typeConnPool[CConnection] struct {
...generation atomic.Uint64// bumped on each reopen()
}
typePooled[CConnection] struct {
...genuint64// recorded on connNew
}
tryReturnConn grows a triple-check that jointly covers every "this conn should die now" condition:
Each lifecycle op becomes a single atomic state transition plus a bounded stack drain:
Op
State change
Drain
Returns
SetCapacity
capacity.Swap(newcap)
PopAll-snapshot of idle stacks
immediately
reopen
generation.Add(1)
PopAll-snapshot of idle stacks
immediately
Close
lifetime.Swap(nil) + cancel
PopAll-snapshot of idle stacks
after workers.Wait() (bounded by ~100ms worker tick)
Borrowed conns drain naturally through tryReturnConn on Recycle/Taint. Stack drains are PopAll-snapshot (like closeIdleResources already does) so a new-gen Get-then-Recycle landing on the stack mid-drain isn't churned.
Why Close doesn't need to block
The blocking semantic today conflates pool teardown with joining callers. The pool can be fully torn down without waiting:
Correctness: borrowed conns still get properly closed when Recycled — tryReturnConn sees lifetime == nil and closes the conn instead of stacking it. They can never reach any future user. No leak path.
OS resource cleanup: borrowed conns close at the MySQL/TCP layer when their holder Recycles. The blocking semantic doesn't make this happen sooner — it just observes the moment.
Process exit: irrelevant — the OS reaps everything.
Pool reconfig within a process: the old pool's borrowed conns become unreachable through any future API (different *ConnPool instance); they close themselves when Recycled.
The one thing Close does still wait on is workers.Wait() — internal background goroutines exit on lifetime.ctx.Done() within one tick (~100ms). That's bounded and prevents pool GC issues; it's not the same as waiting on arbitrary user goroutines.
Migration for callers that genuinely want the old semantic
Add an opt-in helper for callers that actually need "no MySQL conns remain":
// WaitForDrain blocks until active drops to zero or ctx expires.// Callers that need stuck-conn diagnostics or strict resource-accounting// after Close should call this explicitly.func (pool*ConnPool[C]) WaitForDrain(ctx context.Context) error
In-tree callers can be audited individually. Most don't need it — they either run on process shutdown (OS reaps) or are tests that should join their own goroutines.
Why the mutex drops out
With the triple-check in place, the previously hazardous races all become harmless:
SetCapacity racing with Close — Close swaps lifetime to nil; a concurrent SetCapacity that already passed the lifetime gate might swap capacity back up. Today's blocking world: tryReturnConn would push returned conns onto stacks no one drains → leak. With the triple-check: lifetime == nil short-circuits and the conn is closed regardless.
SetCapacity vs SetCapacity — last writer wins on the target; possible transient over-drain. Harmless — Get reopens on demand.
SetCapacity vs reopen — independent atomics (capacity vs generation), independent drains. Final state may briefly jiggle but converges.
reopen vs reopen — generation is monotonic; both increments visible; both drains harmless.
reopen vs Close — reopen bumping generation on a closed pool is a no-op (lifetime check at top still holds; any returned new-gen conn dies via the lifetime branch of tryReturnConn).
Close vs Close — lifetime.Swap(nil) is single-shot: one caller observes the old value, the other observes nil. Single-goroutine-only contract on Open/Close remains but no longer needs mutex enforcement.
Tradeoffs
SetCapacity API change: drops the ctx parameter (no longer blocks long enough to care). Public-breaking for downstream users of smartconnpool. In-tree callers convert trivially; a deprecated shim that ignores ctx is easy if needed.
Close / CloseWithContext behavior change: returns before borrowed conns are returned. The Active() == 0 post-Close invariant is replaced by "active monotonically decreases to 0 as Recycles flow in." Callers that need the old semantic call WaitForDrain(ctx) explicitly. The deprecation of the PoolCloseTimeout diagnostic is the largest user-visible change in this proposal and deserves a release note.
Pooled[C] size: +8 bytes for gen uint64 (or +4 bytes with uint32 and acceptance of a 4B-cycle wrap — at one reopen per DNS event, centuries).
One extra atomic load per Recycle in tryReturnConn. Negligible — same cache line as the existing pool atomics.
Larger conceptual surface: "stale-generation" is a third lifecycle state alongside "closed" (lifetime nil) and "over-capacity" (active > capacity). Documented in one place.
Error semantics: SetCapacity returns ErrConnPoolClosed if lifetime.Load() == nil; today it would race and return a context error in similar conditions.
Best done after #20122 (smartconnpool shutdown fixes) and #20186 (lifetime context consolidation) land, since both reshape the surrounding lifecycle code. The generation counter sits naturally on top of the lifetime-context model from #20186.
Background
smartconnpoolhas three operations that mutate pool lifecycle state, all serialized today behindcapacityMu, and all blocking:SetCapacity(ctx, newcap)— operator-driven capacity change. Blocks on a 10ms sleep loop waiting for borrowed conns to beRecycled so it can pop and close them.reopen()— refresh-worker triggered on backend rotation (e.g. DNS change). DoessetCapacity(0)+setCapacity(originalCap)to forcibly cycle every conn so they reconnect to the new backend.Close()/CloseWithContext(ctx)— pool teardown. Blocks until every conn is returned (or the ctx expires).None of the blocking semantics are load-bearing for correctness. They each conflate two concerns: transitioning the pool state (cheap, atomic) and waiting for callers to give conns back (slow, depends on caller behavior).
Proposal
Make all three lifecycle ops non-blocking lock-free atomic transitions plus a bounded drain. Introduce a generation counter as the keystone primitive that makes
reopennon-blocking and letscapacityMube removed entirely.tryReturnConngrows a triple-check that jointly covers every "this conn should die now" condition:Each lifecycle op becomes a single atomic state transition plus a bounded stack drain:
SetCapacitycapacity.Swap(newcap)PopAll-snapshot of idle stacksreopengeneration.Add(1)PopAll-snapshot of idle stacksCloselifetime.Swap(nil)+ cancelPopAll-snapshot of idle stacksworkers.Wait()(bounded by ~100ms worker tick)Borrowed conns drain naturally through
tryReturnConnonRecycle/Taint. Stack drains arePopAll-snapshot (likecloseIdleResourcesalready does) so a new-gen Get-then-Recycle landing on the stack mid-drain isn't churned.Why
Closedoesn't need to blockThe blocking semantic today conflates pool teardown with joining callers. The pool can be fully torn down without waiting:
tryReturnConnseeslifetime == niland closes the conn instead of stacking it. They can never reach any future user. No leak path.*ConnPoolinstance); they close themselves when Recycled.The one thing
Closedoes still wait on isworkers.Wait()— internal background goroutines exit onlifetime.ctx.Done()within one tick (~100ms). That's bounded and prevents pool GC issues; it's not the same as waiting on arbitrary user goroutines.Migration for callers that genuinely want the old semantic
Add an opt-in helper for callers that actually need "no MySQL conns remain":
In-tree callers can be audited individually. Most don't need it — they either run on process shutdown (OS reaps) or are tests that should join their own goroutines.
Why the mutex drops out
With the triple-check in place, the previously hazardous races all become harmless:
SetCapacityracing withClose— Close swapslifetimeto nil; a concurrentSetCapacitythat already passed the lifetime gate might swapcapacityback up. Today's blocking world:tryReturnConnwould push returned conns onto stacks no one drains → leak. With the triple-check:lifetime == nilshort-circuits and the conn is closed regardless.SetCapacityvsSetCapacity— last writer wins on the target; possible transient over-drain. Harmless — Get reopens on demand.SetCapacityvsreopen— independent atomics (capacityvsgeneration), independent drains. Final state may briefly jiggle but converges.reopenvsreopen— generation is monotonic; both increments visible; both drains harmless.reopenvsClose—reopenbumping generation on a closed pool is a no-op (lifetime check at top still holds; any returned new-gen conn dies via the lifetime branch oftryReturnConn).ClosevsClose—lifetime.Swap(nil)is single-shot: one caller observes the old value, the other observes nil. Single-goroutine-only contract on Open/Close remains but no longer needs mutex enforcement.Tradeoffs
SetCapacityAPI change: drops thectxparameter (no longer blocks long enough to care). Public-breaking for downstream users ofsmartconnpool. In-tree callers convert trivially; a deprecated shim that ignoresctxis easy if needed.Close/CloseWithContextbehavior change: returns before borrowed conns are returned. TheActive() == 0post-Close invariant is replaced by "active monotonically decreases to 0 as Recycles flow in." Callers that need the old semantic callWaitForDrain(ctx)explicitly. The deprecation of thePoolCloseTimeoutdiagnostic is the largest user-visible change in this proposal and deserves a release note.Pooled[C]size: +8 bytes forgen uint64(or +4 bytes withuint32and acceptance of a 4B-cycle wrap — at one reopen per DNS event, centuries).tryReturnConn. Negligible — same cache line as the existing pool atomics.SetCapacityreturnsErrConnPoolClosediflifetime.Load() == nil; today it would race and return a context error in similar conditions.capacityMu → setCapacityMurename discussed in smartconnpool: consolidate shutdown signaling on lifetime context #20186 is unnecessary if the mutex goes away entirely.Staging
Best done after #20122 (smartconnpool shutdown fixes) and #20186 (lifetime context consolidation) land, since both reshape the surrounding lifecycle code. The generation counter sits naturally on top of the lifetime-context model from #20186.