Skip to content

smartconnpool: consolidate shutdown signaling on lifetime context#20186

Draft
arthurschreiber wants to merge 21 commits into
mainfrom
arthur/smartconnpool-lifetime-context-consolidation
Draft

smartconnpool: consolidate shutdown signaling on lifetime context#20186
arthurschreiber wants to merge 21 commits into
mainfrom
arthur/smartconnpool-lifetime-context-consolidation

Conversation

@arthurschreiber
Copy link
Copy Markdown
Member

Description

The smartconnpool tracked "is the pool open" with two atomic pointers: a close channel pointer and a lifetime context pointer. The two had distinct timing — lifetime.cancel() fired at the start of Close, but close(closeChan) only fired after the drain completed. As a result, queued Get waiters and background workers were kept alive throughout the drain, even though there was nothing useful left for them to do.

This PR drops the close field entirely and uses the lifetime context's Done() as the single shutdown signal. runWorker and wait.waitForConn now take a context.Context instead of a <-chan struct{}, which is also more idiomatic Go.

Observable behavior changes

  • IsOpen() returns false as soon as CloseWithContext is entered, not after the drain completes. The only external caller in the repo (semisyncmonitor.go:167) is a startup probe and is unaffected.
  • Queued Get waiters return ErrConnPoolClosed at drain start rather than being parked until the drain finishes. They were unservicable during the drain anyway because tryReturnConn closes any Recycled conn while active > capacity.
  • The closeIdleResources reopen loop's in-flight connect call is cancelled at drain start, so it returns ctx.Err() instead of running to completion and producing a fresh conn that has to be discarded immediately.

Both behavior shifts are pinned by new tests: TestIsOpenIsFalseDuringDrain and TestWaiterUnblocksAtDrainStart.

Related Issue(s)

Stacked on top of #20122 (smartconnpool shutdown fixes). The diff against main will narrow naturally once #20122 merges and this PR is retargeted.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

No user-facing config change. The minor behavior shifts on IsOpen() timing and waiter wake-up are described above; neither is reachable through any current external caller as a correctness-affecting change.

AI Disclosure

This PR was authored primarily by Claude Code — I provided direction and reviewed the design.

arthurschreiber and others added 21 commits May 18, 2026 12:29
Two changes that together let Close return promptly when the MySQL
backend is unhealthy:

1. tryReturnConn now closes connections eagerly once capacity has been
   driven to 0. Without this, in-flight Recycles during Close get
   handed to queued waiters instead of landing on a stack, and the
   setCapacity drain loop never observes the conns it needs to close.

2. The idle worker's connection reopen now uses a context that is
   cancelled at the start of Close. Previously connReopen used
   context.Background(), so a Close that races with an idle-tick reopen
   was extended by the full backend connect timeout per expired
   connection, ignoring PoolCloseTimeout.

The context is held behind an atomic pointer to keep ConnPool's heap
footprint stable; the lock-free stack's atomic-128 ops rely on Go's
allocator placing the struct in a size class with 16-byte alignment,
and inlining context.Context + CancelFunc directly crossed a size
class boundary that no longer satisfied that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Switches the new shutdown tests to t.Cleanup for releasing the blocked
connector and closing the pool so resources unwind even when an
assertion fires partway through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
t.Context is cancelled just before t.Cleanup runs, so the blocked
connector can wait on it directly instead of managing a dedicated
release channel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The struct is the pool's open/close-bounded context, used by any code
path that calls into a user-supplied callback during a shutdown — not
just worker goroutines. Renaming makes the abstraction clearer and
leaves room for future callers (e.g. Taint/Recycle's connNew) to use
the same mechanism without a misleading name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Pooled.Taint and Pooled.Recycle on a closed conn both route through
put(nil), which used to synchronously call connNew(context.Background())
to open a replacement. With a slow or unreachable backend this turned a
non-blocking cleanup into a full MySQL connect-timeout round-trip on
the caller's goroutine — a particularly bad shape for Taint, which is
called from error-handling code paths.

put(nil) now just decrements the active count and returns. The pool
will reopen the slot lazily on the next Get via getNew, which is
already designed to do exactly that. Steady-state behavior is
unchanged in healthy pools; under an unhealthy backend the pool drains
to whatever number of connections it can sustain instead of stalling
callers waiting for connects that will time out.

The TestOpen assertion that encoded the old eager-reopen behavior is
updated to match: after p.put(nil) the slot is free until demand
triggers a reopen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The maxLifetime branch of put used to synchronously call
connReopen(context.Background()) to swap in a fresh connection,
mirroring the put(nil) eager-reopen pattern. Under a slow or
unreachable backend this turned every Recycle of an aged conn into a
full MySQL connect-timeout round-trip on the caller's goroutine —
worse than the Taint case because Recycle is the hot path.

put now closes the maxLifetime-exceeded conn and frees the slot, the
same way the put(nil) case does. The next Get lazily reopens via
getNew.

TestConnReopen and TestMaxLifetime, which encoded the old eager-reopen
behavior, are updated to match. TestConnReopen still covers the
idle-worker's (separate, still-eager) reopen path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Previous commits made put(nil) and put(maxLifetime) lazy — just
decrement active and let the next Get reopen via getNew. That stranded
any waiter already queued in the waitlist: the new caller's Get sees
active < capacity, opens a fresh conn for itself via getNew, and the
existing waiter is never fed. With nothing on a stack, the expire
worker can't help either, so the waiter waits until its own ctx
expires.

Replace the lazy approach with an async spawn: put(nil) and
put(maxLifetime) now go pool.openReplacement(), which runs connNew off
the caller's goroutine and hands the result to a waiter (via
tryReturnConn) on success or drops the active count on connect
failure. The caller never blocks on connect, and queued waiters are
still served whenever the backend will let us open one.

Tests:

- TestTaintWakesWaiter and TestRecycleMaxLifetimeWakesWaiter cover the
  regression directly.
- TestTaintDoesNotBlockOnConnect and
  TestRecycleDoesNotBlockOnMaxLifetimeReopen keep their elapsed-time
  assertions but drop the lazy-decrement-specific ones (connectCalls,
  Active) that the new approach intentionally doesn't satisfy.
- TestOpen, TestMaxLifetime, TestConnReopen, and TestCreateFailOnPut
  wrap the post-put assertions in EventuallyWithT since the reopen now
  completes asynchronously.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
connReopen used to read Setting() *after* replacing dbconn.Conn with a
freshly-connected conn, so it always read nil — the conn silently
migrated from settings[bucket] to clean on every reopen. The idle
worker has done this since it was written; the new maxLifetime async
reopen inherits the same path.

Fixing connReopen to capture Setting before the replace looked obvious
but it broke the get()/getWithSetting error paths, which rely on the
bare-reopen behavior: when ResetSetting fails they call connReopen
expecting it to leave the conn settingless so the caller can then
ApplySetting (or omit it). Capturing-and-re-applying the old Setting
there leaks an unwanted Setting back to the get() caller and
mis-routes the conn to the wrong stack.

So instead: keep connReopen bare, and move the capture+re-apply into
the two callers that actually want preservation — closeIdleResources
(idle worker) and reopenExpired (maxLifetime path). Both capture
Setting before conn.Close, call connReopen, and ApplySetting on the
new Conn.

Two new tests cover the preserved-Setting behavior:

- TestRecycleMaxLifetimePreservesSetting
- TestIdleWorkerReopenPreservesSetting

Both verify the reopened conn lands in settings[bucket] rather than
clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Previous commits moved the replacement-open work for put(nil) and the
maxLifetime branch onto background goroutines. That solved the
hot-path stall under a sick backend but introduced async-test surgery
in places that didn't need it.

Step back: keep put synchronous and route its connect calls through
pool.connectCtx() — the same lifetime context the idle worker already
uses. A Close still cancels in-flight put-time connects (the original
stall surface). Healthy pools see no change. Under a sick backend
during normal operation, put still blocks on the connect timeout
exactly as it did before this PR — that's an accepted trade-off, since
fixing it is a bigger design conversation.

Setting preservation on the maxLifetime path stays (capture before
close, ApplySetting after the reopen). openReplacement and
reopenExpired helpers are removed.

The two "doesn't block" tests are reframed as the property the
sync+lifetime approach actually provides:

- TestTaintConnectCancelsOnClose
- TestRecycleMaxLifetimeReopenCancelsOnClose

Each parks the connector and asserts Close unblocks the in-flight
reopen. TestTaintWakesWaiter and TestRecycleMaxLifetimeWakesWaiter are
kept — with sync they're trivially satisfied but they're cheap
defensive regression tests for the waiter hand-off.

EventuallyWithT wrappers added for the async behavior in TestOpen,
TestMaxLifetime, TestConnReopen, and TestCreateFailOnPut are reverted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
…eter

connReopen now takes the Setting that should end up applied on the
reopened conn (or nil for a bare reopen). Callers refreshing a conn in
place pass the previous Setting through so the reopened conn lands
back in the same settings stack; callers recovering from a
ResetSetting failure pass nil so the conn comes back settingless and
the caller can re-apply or omit.

Removes the duplicated "capture Setting + ApplySetting after reopen"
dance from closeIdleResources and the maxLifetime branch of put — the
post-reopen apply now lives in connReopen itself. Net behavior matches
the previous commit; the change is purely about making the contract
explicit at the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Resolves the conflict in closeIdleResources by adopting upstream's
two-loop structure (#20079: close all expired conns and drop their
active slot first, then reopen up to capacity with CAS to respect
mid-flight capacity changes) while keeping this branch's setting
preservation (capture each conn's Setting before close, pass through
connReopen's new setting parameter on reopen) and lifetime context
(pool.connectCtx() so a Close cancels the in-flight reopen).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The test polled for Metrics.IdleClosed() > 10 and then immediately
asserted Active == 10. After the two-loop closeIdleResources from
#20079, there's a brief window between "first loop dropped active for
each closed conn" and "second loop's CAS restored active for each
reopened conn" — the Eventually poll can land inside that window
under stress and the next assertion sees Active < 10.

Fold both conditions into a single EventuallyWithT so the poll keeps
trying until they're both true simultaneously. Drop the trailing
totalInStack <= 10 check: it covered the same intent (no leaked
conns) but walked the lock-free clean stack without synchronization
against the idle worker, and the assertion was trivially permissive
since totalInStack can't exceed capacity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Pull the duplicated select/default close-channel polling into a
tryReceiveClose helper, fold the repeated watchdog-failure cleanup
sequences into an abort closure, and rename WatchdogDelay to Watchdog so
both helpers use the same name for the same constant. Inline the unused
worker := i alias. Add doc comments to the top-level test functions and
to the storm Taint() fan-out so the reason for the goroutine wrapping
(Taint synchronously enters the blocked connect path) is explicit.

No behavior change.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Replace tryReturnConn's eager-close trigger with active > capacity. The
old capacity == 0 check fires only when capacity is fully drained and
misses partial reductions: SetCapacity(N → M) with 0 < M < N and
queued waiters cycles Recycles to waiters without ever pushing to a
stack, so the drain loop spins until ctx expires.

active > capacity is the actual invariant we want to defend ("we have
more conns out than configured"), and it strictly subsumes the
previous check — capacity == 0 implies active >= 0 > -1, so anything
the old check fired on, the new one fires on too.

Add TestSetCapacityReductionDrainsWithWaiters that holds N conns,
queues hold-forever waiters, then SetCapacity(1) with a 2s deadline.
Without the change the test hangs at active=N, borrowed=N; with the
change SetCapacity returns promptly and active drops to the target.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Resolved conflicts in go/pools/smartconnpool/pool.go where upstream's
bulk idle-stack sweep (#20136) overlapped with this branch's shutdown
fixes:

- put(): kept upstream's cached `now` for the maxLifetime check while
  threading this branch's Setting and connectCtx through connReopen.
- tryReturnConn(): kept this branch's active>capacity drain guard and
  folded in upstream's new updateIdleTime parameter.
- closeIdleResources(): adopted upstream's PopAll-based linked-list
  bulk processing and read each expired conn's Setting() inline in the
  reopen loop (Close() doesn't touch the setting field).

TestIdleTimeoutStopsReopeningWhenPoolCloses bypasses pool.Open(), so
pool.lifetime was nil and connectCtx() returned a pre-cancelled
context, breaking the test's connect-blocking setup. Set pool.lifetime
explicitly in the test to match real Open() semantics.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Setting() is a pure field accessor that Close() doesn't touch, so
there's no need to stash it in a local before calling connReopen —
Go's argument evaluation runs Setting() before connReopen replaces
dbconn.Conn anyway.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The async-spawn approach was reverted; the replacement conn is opened
synchronously on the caller's goroutine via connNew. Update the test
doc comment so future readers aren't misled.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
…l-shutdown-fixes

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

# Conflicts:
#	go/pools/smartconnpool/pool.go
#	go/pools/smartconnpool/stress_test.go
The pool tracked "is the pool open" twice: a `close atomic.Pointer[chan
struct{}]` and a `lifetime atomic.Pointer[lifetime]` (which carried a
cancellable context for in-flight Connector calls). The two phases had
distinct timing — `lifetime.cancel()` fired at the start of Close, but
`close(closeChan)` only fired after the drain finished — which meant
queued waiters and background workers were held through the drain even
though there was nothing useful left for them to do.

Drop the `close` field and use the lifetime context's `Done()` as the
single shutdown signal. `runWorker` and `wait.waitForConn` take a
`context.Context` instead of `<-chan struct{}`. `IsOpen` now flips false
as soon as `CloseWithContext` is entered, and queued `Get` waiters
return `ErrConnPoolClosed` at drain start rather than being parked until
the drain completes (during which they were unservicable anyway —
`tryReturnConn` already closes any Recycled conn while `active >
capacity`).

The new behavior is pinned by `TestIsOpenIsFalseDuringDrain` and
`TestWaiterUnblocksAtDrainStart`.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings May 26, 2026 17:16
@github-actions github-actions Bot added this to the v25.0.0 milestone May 26, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 26, 2026
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot Bot commented May 26, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR simplifies smartconnpool shutdown semantics by removing the separate “close channel” signal and using the pool lifetime context.Context as the single shutdown signal, so that workers, queued waiters, and in-flight connects unblock immediately when CloseWithContext starts.

Changes:

  • Remove the close channel pointer from ConnPool and key IsOpen() / shutdown checks off the lifetime context pointer.
  • Update worker loops and waitlist waiting to accept a context.Context and select on ctx.Done().
  • Add/adjust tests to pin the new observable shutdown behaviors (e.g., IsOpen() false during drain, waiters unblocking at drain start).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
go/pools/smartconnpool/waitlist.go Switch waiter shutdown signal from a close channel to the pool lifetime context.
go/pools/smartconnpool/waitlist_test.go Update waitlist tests to pass a lifetime context instead of a close channel.
go/pools/smartconnpool/pool.go Remove close field, consolidate shutdown/open state on lifetime context, and wire workers/waiters to ctx.Done().
go/pools/smartconnpool/pool_test.go Update existing shutdown-related tests and add new tests for the revised IsOpen() and waiter-unblock timing contracts.

Comment on lines +326 to +337
// Cancel the pool's lifetime context before we start draining: any user
// connect callback currently blocked behind it (e.g. the idle worker
// reopening an expired connection) unblocks immediately, workers exit,
// and queued waiters in wait.waitForConn return ErrConnPoolClosed
// without having to wait for the drain to finish.
lt := pool.lifetime.Swap(nil)
if lt == nil {
// already closed
pool.capacityMu.Unlock()
return nil
}

// Cancel the pool's lifetime context before we start draining: any user
// connect callback currently blocked behind it (e.g. the idle worker
// reopening an expired connection) needs to unblock now so that
// setCapacity can observe active dropping to zero and so that
// workers.Wait below isn't held up by a hung connect.
if lt := pool.lifetime.Swap(nil); lt != nil {
lt.cancel()
}
lt.cancel()
@codecov
Copy link
Copy Markdown

codecov Bot commented May 26, 2026

Codecov Report

❌ Patch coverage is 96.55172% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 57.32%. Comparing base (abaebfb) to head (b294679).

Files with missing lines Patch % Lines
go/pools/smartconnpool/pool.go 96.29% 1 Missing ⚠️
Additional details and impacted files
@@                           Coverage Diff                           @@
##           arthur/smartconnpool-shutdown-fixes   #20186      +/-   ##
=======================================================================
+ Coverage                                56.72%   57.32%   +0.59%     
=======================================================================
  Files                                        8        8              
  Lines                                      966      963       -3     
=======================================================================
+ Hits                                       548      552       +4     
+ Misses                                     418      411       -7     
Flag Coverage Δ
partial 57.32% <96.55%> (+0.59%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Base automatically changed from arthur/smartconnpool-shutdown-fixes to main May 27, 2026 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: VTTablet NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says Type: Internal Cleanup

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants