smartconnpool: non-blocking lifecycle via generation counter

## Background

`smartconnpool` has three operations that mutate pool lifecycle state, all serialized today behind `capacityMu`, and all blocking:

- `SetCapacity(ctx, newcap)` — operator-driven capacity change. Blocks on a 10ms sleep loop waiting for borrowed conns to be `Recycle`d so it can pop and close them.
- `reopen()` — refresh-worker triggered on backend rotation (e.g. DNS change). Does `setCapacity(0)` + `setCapacity(originalCap)` to forcibly cycle every conn so they reconnect to the new backend.
- `Close()` / `CloseWithContext(ctx)` — pool teardown. Blocks until every conn is returned (or the ctx expires).

None of the blocking semantics are load-bearing for correctness. They each conflate two concerns: *transitioning the pool state* (cheap, atomic) and *waiting for callers to give conns back* (slow, depends on caller behavior).

## Proposal

Make all three lifecycle ops non-blocking lock-free atomic transitions plus a bounded drain. Introduce a generation counter as the keystone primitive that makes `reopen` non-blocking and lets `capacityMu` be removed entirely.

```go
type ConnPool[C Connection] struct {
    ...
    generation atomic.Uint64 // bumped on each reopen()
}

type Pooled[C Connection] struct {
    ...
    gen uint64 // recorded on connNew
}
```

`tryReturnConn` grows a triple-check that jointly covers every "this conn should die now" condition:

```go
if pool.lifetime.Load() == nil ||
   conn.gen != pool.generation.Load() ||
   pool.active.Load() > pool.capacity.Load() {
    conn.Close()
    pool.closedConn()
    return false
}
```

Each lifecycle op becomes a single atomic state transition plus a bounded stack drain:

| Op            | State change                  | Drain                                   | Returns           |
|---------------|-------------------------------|-----------------------------------------|-------------------|
| `SetCapacity` | `capacity.Swap(newcap)`       | `PopAll`-snapshot of idle stacks        | immediately       |
| `reopen`      | `generation.Add(1)`           | `PopAll`-snapshot of idle stacks        | immediately       |
| `Close`       | `lifetime.Swap(nil)` + cancel | `PopAll`-snapshot of idle stacks        | after `workers.Wait()` (bounded by ~100ms worker tick) |

Borrowed conns drain naturally through `tryReturnConn` on `Recycle`/`Taint`. Stack drains are `PopAll`-snapshot (like `closeIdleResources` already does) so a new-gen Get-then-Recycle landing on the stack mid-drain isn't churned.

## Why `Close` doesn't need to block

The blocking semantic today conflates *pool teardown* with *joining callers*. The pool can be fully torn down without waiting:

- **Correctness**: borrowed conns still get properly closed when Recycled — `tryReturnConn` sees `lifetime == nil` and closes the conn instead of stacking it. They can never reach any future user. No leak path.
- **OS resource cleanup**: borrowed conns close at the MySQL/TCP layer when their holder Recycles. The blocking semantic doesn't make this happen sooner — it just observes the moment.
- **Process exit**: irrelevant — the OS reaps everything.
- **Pool reconfig within a process**: the old pool's borrowed conns become unreachable through any future API (different `*ConnPool` instance); they close themselves when Recycled.

The one thing `Close` does still wait on is `workers.Wait()` — internal background goroutines exit on `lifetime.ctx.Done()` within one tick (~100ms). That's bounded and prevents pool GC issues; it's not the same as waiting on arbitrary user goroutines.

## Migration for callers that genuinely want the old semantic

Add an opt-in helper for callers that actually need "no MySQL conns remain":

```go
// WaitForDrain blocks until active drops to zero or ctx expires.
// Callers that need stuck-conn diagnostics or strict resource-accounting
// after Close should call this explicitly.
func (pool *ConnPool[C]) WaitForDrain(ctx context.Context) error
```

In-tree callers can be audited individually. Most don't need it — they either run on process shutdown (OS reaps) or are tests that should join their own goroutines.

## Why the mutex drops out

With the triple-check in place, the previously hazardous races all become harmless:

- **`SetCapacity` racing with `Close`** — Close swaps `lifetime` to nil; a concurrent `SetCapacity` that already passed the lifetime gate might swap `capacity` back up. Today's blocking world: `tryReturnConn` would push returned conns onto stacks no one drains → leak. With the triple-check: `lifetime == nil` short-circuits and the conn is closed regardless.
- **`SetCapacity` vs `SetCapacity`** — last writer wins on the target; possible transient over-drain. Harmless — Get reopens on demand.
- **`SetCapacity` vs `reopen`** — independent atomics (`capacity` vs `generation`), independent drains. Final state may briefly jiggle but converges.
- **`reopen` vs `reopen`** — generation is monotonic; both increments visible; both drains harmless.
- **`reopen` vs `Close`** — `reopen` bumping generation on a closed pool is a no-op (lifetime check at top still holds; any returned new-gen conn dies via the lifetime branch of `tryReturnConn`).
- **`Close` vs `Close`** — `lifetime.Swap(nil)` is single-shot: one caller observes the old value, the other observes nil. Single-goroutine-only contract on Open/Close remains but no longer needs mutex enforcement.

## Tradeoffs

1. **`SetCapacity` API change**: drops the `ctx` parameter (no longer blocks long enough to care). Public-breaking for downstream users of `smartconnpool`. In-tree callers convert trivially; a deprecated shim that ignores `ctx` is easy if needed.
2. **`Close` / `CloseWithContext` behavior change**: returns before borrowed conns are returned. The `Active() == 0` post-Close invariant is replaced by "active monotonically decreases to 0 as Recycles flow in." Callers that need the old semantic call `WaitForDrain(ctx)` explicitly. The deprecation of the `PoolCloseTimeout` diagnostic is the largest user-visible change in this proposal and deserves a release note.
3. **`Pooled[C]` size**: +8 bytes for `gen uint64` (or +4 bytes with `uint32` and acceptance of a 4B-cycle wrap — at one reopen per DNS event, centuries).
4. **One extra atomic load per Recycle** in `tryReturnConn`. Negligible — same cache line as the existing pool atomics.
5. **Larger conceptual surface**: "stale-generation" is a third lifecycle state alongside "closed" (lifetime nil) and "over-capacity" (active > capacity). Documented in one place.
6. **Error semantics**: `SetCapacity` returns `ErrConnPoolClosed` if `lifetime.Load() == nil`; today it would race and return a context error in similar conditions.
7. **Mutex naming becomes moot**: the `capacityMu → setCapacityMu` rename discussed in #20186 is unnecessary if the mutex goes away entirely.

## Staging

Best done after #20122 (smartconnpool shutdown fixes) and #20186 (lifetime context consolidation) land, since both reshape the surrounding lifecycle code. The generation counter sits naturally on top of the lifetime-context model from #20186.

Op	State change	Drain	Returns
`SetCapacity`	`capacity.Swap(newcap)`	`PopAll`-snapshot of idle stacks	immediately
`reopen`	`generation.Add(1)`	`PopAll`-snapshot of idle stacks	immediately
`Close`	`lifetime.Swap(nil)` + cancel	`PopAll`-snapshot of idle stacks	after `workers.Wait()` (bounded by ~100ms worker tick)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

smartconnpool: non-blocking lifecycle via generation counter #20187

Background

Proposal

Why `Close` doesn't need to block

Migration for callers that genuinely want the old semantic

Why the mutex drops out

Tradeoffs

Staging

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

smartconnpool: non-blocking lifecycle via generation counter #20187

Description

Background

Proposal

Why Close doesn't need to block

Migration for callers that genuinely want the old semantic

Why the mutex drops out

Tradeoffs

Staging

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Why `Close` doesn't need to block