smartconnpool: non-blocking lifecycle via generation counter #20187

@arthurschreiber

Background

smartconnpool has three operations that mutate pool lifecycle state, all serialized today behind capacityMu, and all blocking:

  • SetCapacity(ctx, newcap) — operator-driven capacity change. Blocks on a 10ms sleep loop waiting for borrowed conns to be Recycled so it can pop and close them.
  • reopen() — refresh-worker triggered on backend rotation (e.g. DNS change). Does setCapacity(0) + setCapacity(originalCap) to forcibly cycle every conn so they reconnect to the new backend.
  • Close() / CloseWithContext(ctx) — pool teardown. Blocks until every conn is returned (or the ctx expires).

None of the blocking semantics are load-bearing for correctness. They each conflate two concerns: transitioning the pool state (cheap, atomic) and waiting for callers to give conns back (slow, depends on caller behavior).

Proposal

Make all three lifecycle ops non-blocking lock-free atomic transitions plus a bounded drain. Introduce a generation counter as the keystone primitive that makes reopen non-blocking and lets capacityMu be removed entirely.

type ConnPool[C Connection] struct {
    ...
    generation atomic.Uint64 // bumped on each reopen()
}

type Pooled[C Connection] struct {
    ...
    gen uint64 // recorded on connNew
}

tryReturnConn grows a triple-check that jointly covers every "this conn should die now" condition:

if pool.lifetime.Load() == nil ||
   conn.gen != pool.generation.Load() ||
   pool.active.Load() > pool.capacity.Load() {
    conn.Close()
    pool.closedConn()
    return false
}
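The check above can be exercised in isolation. The following is a minimal sketch, not the smartconnpool implementation: `pool`, `conn`, `lifetimeState`, `newPool`, and `get` are hypothetical stand-ins, the idle stack is a plain slice, and `active` is assumed to count every open conn (borrowed plus idle).

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// lifetimeState is a hypothetical stand-in for whatever the pool's lifetime
// pointer refers to; a nil pointer means the pool is closed.
type lifetimeState struct{}

// pool is a toy stand-in for ConnPool[C] holding only the fields the check reads.
type pool struct {
	lifetime   atomic.Pointer[lifetimeState]
	generation atomic.Uint64
	capacity   atomic.Int64
	active     atomic.Int64
	idle       []*conn // toy, single-goroutine stack (assumption for brevity)
}

// conn is a toy stand-in for Pooled[C].
type conn struct {
	gen    uint64 // generation recorded when the conn was created
	closed bool
}

func newPool(capacity int64) *pool {
	p := &pool{}
	p.lifetime.Store(&lifetimeState{})
	p.capacity.Store(capacity)
	return p
}

// get creates a conn stamped with the current generation.
func (p *pool) get() *conn {
	p.active.Add(1)
	return &conn{gen: p.generation.Load()}
}

// tryReturnConn keeps the conn only if the pool is open, the conn belongs to
// the current generation, and the pool is not over capacity.
func (p *pool) tryReturnConn(c *conn) bool {
	if p.lifetime.Load() == nil ||
		c.gen != p.generation.Load() ||
		p.active.Load() > p.capacity.Load() {
		c.closed = true  // stand-in for conn.Close()
		p.active.Add(-1) // stand-in for pool.closedConn()
		return false
	}
	p.idle = append(p.idle, c)
	return true
}

func main() {
	p := newPool(2)
	a, b := p.get(), p.get()
	fmt.Println(p.tryReturnConn(a)) // true: current generation, within capacity
	p.generation.Add(1)             // simulate reopen()
	fmt.Println(p.tryReturnConn(b)) // false: stale generation, conn is closed
}
```

Each branch retires a conn independently of the others, which is why the later race analysis can treat the three conditions separately.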

Each lifecycle op becomes a single atomic state transition plus a bounded stack drain:

Op          | State change                | Drain                          | Returns
SetCapacity | capacity.Swap(newcap)       | PopAll-snapshot of idle stacks | immediately
reopen      | generation.Add(1)           | PopAll-snapshot of idle stacks | immediately
Close       | lifetime.Swap(nil) + cancel | PopAll-snapshot of idle stacks | after workers.Wait() (bounded by ~100ms worker tick)

Borrowed conns drain naturally through tryReturnConn on Recycle/Taint. Stack drains are PopAll-snapshot (like closeIdleResources already does) so a new-gen Get-then-Recycle landing on the stack mid-drain isn't churned.
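To illustrate what "atomic transition plus PopAll-snapshot drain" could look like for reopen, here is a rough sketch. It assumes a Treiber-style stack whose `PopAll` detaches the whole list with one `Swap`; the `stack`, `node`, and `pool` types are hypothetical and are not the real smartconnpool stacks. The sketch closes everything in the snapshot, which is a simplification.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// node is a hypothetical idle-stack entry carrying the conn's generation.
type node struct {
	gen  uint64
	next *node
}

// stack is a minimal lock-free stack supporting only Push and PopAll.
type stack struct{ top atomic.Pointer[node] }

// Push links n in front of the current top with a CAS retry loop.
func (s *stack) Push(n *node) {
	for {
		old := s.top.Load()
		n.next = old
		if s.top.CompareAndSwap(old, n) {
			return
		}
	}
}

// PopAll detaches the entire list in a single atomic swap. Nodes pushed
// after the swap are not part of the returned snapshot.
func (s *stack) PopAll() *node { return s.top.Swap(nil) }

type pool struct {
	generation atomic.Uint64
	idle       stack
	closed     atomic.Int64 // counts conns closed by drains in this sketch
}

// reopen bumps the generation, then closes only the conns that were idle at
// the moment of the snapshot. It never waits for borrowed conns.
func (p *pool) reopen() {
	p.generation.Add(1)
	for n := p.idle.PopAll(); n != nil; n = n.next {
		p.closed.Add(1) // stand-in for conn.Close() + pool.closedConn()
	}
}

func main() {
	p := &pool{}
	p.idle.Push(&node{gen: 0})
	p.idle.Push(&node{gen: 0})
	p.reopen()
	// A new-generation conn recycled after the snapshot stays on the stack.
	p.idle.Push(&node{gen: p.generation.Load()})
	fmt.Println(p.closed.Load(), p.idle.top.Load() != nil) // prints "2 true"
}
```

Because the drain walks a detached list rather than popping until empty, its work is bounded by the number of conns idle at the instant of the swap.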

Why Close doesn't need to block

The blocking semantic today conflates pool teardown with joining callers. The pool can be fully torn down without waiting:

  • Correctness: borrowed conns still get properly closed when Recycled — tryReturnConn sees lifetime == nil and closes the conn instead of stacking it. They can never reach any future user. No leak path.
  • OS resource cleanup: borrowed conns close at the MySQL/TCP layer when their holder Recycles. The blocking semantic doesn't make this happen sooner — it just observes the moment.
  • Process exit: irrelevant — the OS reaps everything.
  • Pool reconfig within a process: the old pool's borrowed conns become unreachable through any future API (different *ConnPool instance); they close themselves when Recycled.

The one thing Close does still wait on is workers.Wait() — internal background goroutines exit on lifetime.ctx.Done() within one tick (~100ms). That's bounded and prevents pool GC issues; it's not the same as waiting on arbitrary user goroutines.

Migration for callers that genuinely want the old semantic

Add an opt-in helper for callers that actually need "no MySQL conns remain":

// WaitForDrain blocks until active drops to zero or ctx expires.
// Callers that need stuck-conn diagnostics or strict resource-accounting
// after Close should call this explicitly.
func (pool *ConnPool[C]) WaitForDrain(ctx context.Context) error
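One possible body for this helper is a simple poll on the active counter. The sketch below is an assumption about how it might be implemented, not part of the proposal text: the 10ms poll interval, the toy `pool` type, and the `drainWithin` convenience wrapper are all hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// pool is a toy stand-in exposing only the active-conn counter.
type pool struct{ active atomic.Int64 }

// WaitForDrain blocks until active drops to zero or ctx expires.
func (p *pool) WaitForDrain(ctx context.Context) error {
	ticker := time.NewTicker(10 * time.Millisecond) // poll interval is an assumption
	defer ticker.Stop()
	for p.active.Load() > 0 {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
	return nil
}

// drainWithin is a hypothetical convenience wrapper that applies a timeout.
func drainWithin(p *pool, d time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	return p.WaitForDrain(ctx)
}

func main() {
	p := &pool{}
	p.active.Store(1)
	go func() {
		time.Sleep(30 * time.Millisecond)
		p.active.Add(-1) // the last borrowed conn is recycled and closed
	}()
	fmt.Println(drainWithin(p, time.Second)) // <nil>

	p.active.Store(1) // a conn that is never returned
	fmt.Println(drainWithin(p, 50*time.Millisecond)) // context deadline exceeded
}
```

A caller that wants the old blocking behavior would call Close and then WaitForDrain with its own deadline, keeping the wait out of the pool's lifecycle path.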

In-tree callers can be audited individually. Most don't need it — they either run on process shutdown (OS reaps) or are tests that should join their own goroutines.

Why the mutex drops out

With the triple-check in place, the previously hazardous races all become harmless:

  • SetCapacity racing with Close — Close swaps lifetime to nil; a concurrent SetCapacity that already passed the lifetime gate might swap capacity back up. Today's blocking world: tryReturnConn would push returned conns onto stacks no one drains → leak. With the triple-check: lifetime == nil short-circuits and the conn is closed regardless.
  • SetCapacity vs SetCapacity — last writer wins on the target; possible transient over-drain. Harmless — Get reopens on demand.
  • SetCapacity vs reopen — independent atomics (capacity vs generation), independent drains. Final state may briefly jiggle but converges.
  • reopen vs reopen — generation is monotonic; both increments visible; both drains harmless.
  • reopen vs Close: reopen bumping generation on a closed pool is a no-op (lifetime check at top still holds; any returned new-gen conn dies via the lifetime branch of tryReturnConn).
  • Close vs Close: lifetime.Swap(nil) is single-shot, so one caller observes the old value and the other observes nil. Single-goroutine-only contract on Open/Close remains but no longer needs mutex enforcement.
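The single-shot property of the swap in the Close vs Close case can be demonstrated on its own. This is a small sketch with hypothetical `pool`, `lifetimeState`, and `raceClose` names. It only shows that exactly one concurrent caller observes the non-nil value. It does not model the rest of teardown.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// lifetimeState is a hypothetical stand-in for the pool's lifetime value.
type lifetimeState struct{}

type pool struct{ lifetime atomic.Pointer[lifetimeState] }

// Close reports whether this call was the one that transitioned the pool
// from open to closed. Only the caller that swaps out a non-nil value wins.
func (p *pool) Close() bool {
	return p.lifetime.Swap(nil) != nil
}

// raceClose runs n concurrent Close calls on a fresh pool and returns how
// many of them observed the open state.
func raceClose(n int) int64 {
	p := &pool{}
	p.lifetime.Store(&lifetimeState{})

	var winners atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if p.Close() {
				winners.Add(1)
			}
		}()
	}
	wg.Wait()
	return winners.Load()
}

func main() {
	fmt.Println(raceClose(8)) // prints 1
}
```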

Tradeoffs

  1. SetCapacity API change: drops the ctx parameter (no longer blocks long enough to care). Public-breaking for downstream users of smartconnpool. In-tree callers convert trivially; a deprecated shim that ignores ctx is easy if needed.
  2. Close / CloseWithContext behavior change: returns before borrowed conns are returned. The Active() == 0 post-Close invariant is replaced by "active monotonically decreases to 0 as Recycles flow in." Callers that need the old semantic call WaitForDrain(ctx) explicitly. The deprecation of the PoolCloseTimeout diagnostic is the largest user-visible change in this proposal and deserves a release note.
  3. Pooled[C] size: +8 bytes for gen uint64 (or +4 bytes with uint32 and acceptance of a 4B-cycle wrap — at one reopen per DNS event, centuries).
  4. One extra atomic load per Recycle in tryReturnConn. Negligible — same cache line as the existing pool atomics.
  5. Larger conceptual surface: "stale-generation" is a third lifecycle state alongside "closed" (lifetime nil) and "over-capacity" (active > capacity). Documented in one place.
  6. Error semantics: SetCapacity returns ErrConnPoolClosed if lifetime.Load() == nil; today it would race and return a context error in similar conditions.
  7. Mutex naming becomes moot: the capacityMu → setCapacityMu rename discussed in smartconnpool: consolidate shutdown signaling on lifetime context #20186 is unnecessary if the mutex goes away entirely.
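Items 1 and 6 could combine roughly as follows. This is a sketch under stated assumptions: the name `SetCapacityContext` for the deprecated shim is hypothetical, the error message text is invented for the example, and the idle-stack drain is omitted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrConnPoolClosed mirrors the error named in the proposal; the message
// text here is an assumption.
var ErrConnPoolClosed = errors.New("connection pool is closed")

// lifetimeState is a hypothetical stand-in for the pool's lifetime value.
type lifetimeState struct{}

type pool struct {
	lifetime atomic.Pointer[lifetimeState]
	capacity atomic.Int64
}

// SetCapacity is the non-blocking form: gate on lifetime, swap the target,
// and return. The PopAll drain of idle stacks is left out of this sketch.
func (p *pool) SetCapacity(newcap int64) error {
	if p.lifetime.Load() == nil {
		return ErrConnPoolClosed
	}
	p.capacity.Swap(newcap)
	return nil
}

// SetCapacityContext is a hypothetical compatibility shim.
//
// Deprecated: the context is ignored because SetCapacity no longer blocks.
func (p *pool) SetCapacityContext(_ context.Context, newcap int64) error {
	return p.SetCapacity(newcap)
}

func main() {
	p := &pool{}
	fmt.Println(p.SetCapacity(8)) // pool was never opened: connection pool is closed

	p.lifetime.Store(&lifetimeState{})
	fmt.Println(p.SetCapacityContext(context.Background(), 8), p.capacity.Load()) // <nil> 8
}
```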

Staging

Best done after #20122 (smartconnpool shutdown fixes) and #20186 (lifetime context consolidation) land, since both reshape the surrounding lifecycle code. The generation counter sits naturally on top of the lifetime-context model from #20186.
