smartconnpool: unblock Close and preserve Setting on reopen by arthurschreiber · Pull Request #20122 · vitessio/vitess

arthurschreiber · 2026-05-18T12:29:50Z

Description

Four small changes to the smartconnpool. The first three address shutdown stalls when the MySQL backend is unhealthy; the fourth fixes a long-standing silent loss of connection settings on reopen. Each is scoped to a specific code path; healthy pools see no behavior change.

1. tryReturnConn closes eagerly when the pool is over capacity

Whenever pool.active > pool.capacity — driven either by Close()/SetCapacity(0) taking capacity to 0 or by any SetCapacity(N) reducing it below the current active count — Recycled connections are closed immediately instead of being handed to whichever waiter is still in the waitlist. Without this, conns get passed waiter-to-waiter during the drain and the setCapacity loop's 10ms-sleep polls never observe a non-empty stack to close. The check is active > capacity rather than capacity == 0 so partial reductions get the same drain semantics as a full Close — SetCapacity(N → M) with 0 < M < N and a non-empty waitlist also gets a clean drain.

2. Pool-wide lifetime context, cancelled on Close

The pool now holds a context.Context (in a small heap-allocated lifetime struct, behind an atomic.Pointer) that is cancelled at the very start of CloseWithContext, before setCapacity(0).

The pointer indirection is deliberate: inlining context.Context (16B) + CancelFunc (8B) directly into ConnPool pushed the struct's size across a Go allocator size-class boundary and exposed a latent 16-byte alignment requirement of the lock-free stack's atomic-128 ops, causing SIGBUS on ARM64. Keeping the inline footprint at 8 bytes preserves the size class.

3. The idle worker, put(nil), and the maxLifetime put branch all use the lifetime ctx for their connect calls

Previously these three sites passed context.Background() to connect(...), so a Close() racing with an in-flight reopen was extended by the full db-connect-timeout-ms per blocked connect, completely ignoring PoolCloseTimeout. They now use pool.connectCtx(), so a Close unblocks any in-flight connect immediately. put stays synchronous on the caller's goroutine — the same shape as before this PR — but no longer holds up Close.

4. connReopen no longer silently drops the connection's Setting

connReopen used to read Setting() after replacing dbconn.Conn with a freshly-connected conn, so it always read nil and the conn silently migrated from settings[bucket] to clean on every reopen — both for the idle worker and for the maxLifetime path.

Fixing connReopen itself to unconditionally capture Setting before the replace broke the get()/getWithSetting error paths, which rely on the bare-reopen behavior to fall back to a settingless conn when ResetSetting fails. So connReopen now takes an explicit setting *Setting parameter: callers refreshing a conn in place (closeIdleResources and the maxLifetime branch of put) capture the previous Setting and pass it through; callers recovering from a ResetSetting failure pass nil for a bare reopen.

What this PR does not do

Under a sick backend during normal operation (not Close), Taint and Recycle still synchronously block on the connect attempt for the full backend connect timeout. That matches the behavior before this PR — fixing it requires moving the connect off the caller's goroutine, which has its own design trade-offs and is a separate conversation. The Close-time stalls are the immediate operational problem this PR addresses.

Related Issue(s)

None. Discovered during a deep review of the smartconnpool stall surface.

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

New and updated coverage:

TestCloseDoesNotHandOffToWaiters — Close doesn't hand Recycled conns to waiters once capacity is 0.
TestSetCapacityReductionDrainsWithWaiters — SetCapacity(N → M) with 0 < M < N and hold-forever waiters completes within its deadline; recycled conns are closed when active > capacity rather than handed off and stuck.
TestIdleWorkerConnectCancelsOnClose — idle worker's connect unblocks when Close fires.
TestTaintConnectCancelsOnClose — Taint's synchronous reopen unblocks when Close fires.
TestRecycleMaxLifetimeReopenCancelsOnClose — the maxLifetime reopen in put unblocks when Close fires.
TestTaintWakesWaiter and TestRecycleMaxLifetimeWakesWaiter — queued waiters are still served by the synchronous replacement.
TestRecycleMaxLifetimePreservesSetting and TestIdleWorkerReopenPreservesSetting — reopened conns return to the same settings stack.
TestIdleTimeoutDoesntLeaveLingeringConnection — tightened to account for the two-loop idle refresh path without racing the worker.
TestStressCloseDuringReconnectStorm — stress-covers Close racing with blocked reconnects and verifies shutdown completes without leaking conns.
TestStressWaiterStormDuringDrain — stress-covers queued waiters during Close drain and verifies returned conns are closed rather than handed off.

Existing TestCloseDuringWaitForConn × 50 iterations under -race continues to pass.

Backport justification

These are production-reliability fixes for vttablet shutdown against an unhealthy backend. With a dead/unreachable MySQL:

The idle worker's connReopen, plus the put(nil) and maxLifetime branches of put, all blocked for the full backend connect timeout per in-flight connect, ignoring PoolCloseTimeout. Observed Close latency goes from the intended ~10s up to many minutes.
Concurrent Recycles during Close get cycled waiter-to-waiter instead of landing on a stack the drain loop can see, adding more polling latency.
Settings configured on connections silently disappear after every idle-timeout refresh and every conn rotation that goes through the reopen path, forcing repeated re-application.

A vttablet restart against a dead primary is exactly when fast shutdown matters most. The changes are scoped to the Close, drain, and reopen paths; healthy pools see no behavior change beyond the (long-broken) setting preservation now actually working.

Deployment Notes

One user-visible behavior change: connections being Recycled while the pool is over its configured capacity — during Close(), SetCapacity(0), or any SetCapacity(N) that reduces capacity below current active — are now closed rather than handed to queued waiters. Those waiters will return with ErrConnPoolClosed once Close() finishes signalling the close channel, or with their own ctx timeout otherwise.

AI Disclosure

This PR was implemented by Claude Code based on a deep review and step-by-step fix plan I directed.

Two changes that together let Close return promptly when the MySQL backend is unhealthy: 1. tryReturnConn now closes connections eagerly once capacity has been driven to 0. Without this, in-flight Recycles during Close get handed to queued waiters instead of landing on a stack, and the setCapacity drain loop never observes the conns it needs to close. 2. The idle worker's connection reopen now uses a context that is cancelled at the start of Close. Previously connReopen used context.Background(), so a Close that races with an idle-tick reopen was extended by the full backend connect timeout per expired connection, ignoring PoolCloseTimeout. The context is held behind an atomic pointer to keep ConnPool's heap footprint stable; the lock-free stack's atomic-128 ops rely on Go's allocator placing the struct in a size class with 16-byte alignment, and inlining context.Context + CancelFunc directly crossed a size class boundary that no longer satisfied that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

vitess-bot · 2026-05-18T12:30:23Z

Copilot

Pull request overview

Two small fixes to smartconnpool.ConnPool to ensure Close() returns promptly when the MySQL backend is unhealthy: connections being recycled during a close are closed eagerly rather than handed to waiters, and the idle-timeout worker's in-flight connect is now cancelled at close start instead of blocking for the full backend connect timeout.

Changes:

tryReturnConn now closes connections immediately when capacity has been driven to 0 (Close or explicit SetCapacity(0)), preventing waiter-to-waiter handoff loops that starve the setCapacity drain.
Introduces a heap-allocated workerLifetime{ctx, cancel} (held behind atomic.Pointer to preserve ConnPool's size class and 16-byte alignment for the lock-free stack's atomic-128 ops) that is cancelled at the start of CloseWithContext; the idle worker uses this context for connReopen.
Two new regression tests: TestCloseDoesNotHandOffToWaiters and TestIdleWorkerConnectCancelsOnClose.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
go/pools/smartconnpool/pool.go	Adds `workerLifetime` struct + atomic pointer, cancels it at Close start, uses pool worker ctx for idle-worker connReopen, and closes recycled conns eagerly when capacity is 0.
go/pools/smartconnpool/pool_test.go	Adds two regression tests covering the no-handoff-after-Close guarantee and the cancellation of the idle-worker reopen on Close.

codecov · 2026-05-18T12:35:10Z

Codecov Report

❌ Patch coverage is 90.47619% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.52%. Comparing base (70c7a72) to head (03e65e6).
⚠️ Report is 277 commits behind head on main.

Files with missing lines	Patch %	Lines
go/pools/smartconnpool/pool.go	90.47%	2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (70c7a72) and HEAD (03e65e6). Click for more details.

HEAD has 1 upload less than BASE

Flag BASE (70c7a72) HEAD (03e65e6)

1 0

Additional details and impacted files

@@             Coverage Diff             @@
##             main   #20122       +/-   ##
===========================================
- Coverage   69.67%   56.52%   -13.16%     
===========================================
  Files        1614        8     -1606     
  Lines      216793      966   -215827     
===========================================
- Hits       151044      546   -150498     
+ Misses      65749      420    -65329

Flag	Coverage Δ
partial	`56.52% <90.47%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Switches the new shutdown tests to t.Cleanup for releasing the blocked connector and closing the pool so resources unwind even when an assertion fires partway through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

t.Context is cancelled just before t.Cleanup runs, so the blocked connector can wait on it directly instead of managing a dedicated release channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

The struct is the pool's open/close-bounded context, used by any code path that calls into a user-supplied callback during a shutdown — not just worker goroutines. Renaming makes the abstraction clearer and leaves room for future callers (e.g. Taint/Recycle's connNew) to use the same mechanism without a misleading name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Replace tryReturnConn's eager-close trigger with active > capacity. The old capacity == 0 check fires only when capacity is fully drained and misses partial reductions: SetCapacity(N → M) with 0 < M < N and queued waiters cycles Recycles to waiters without ever pushing to a stack, so the drain loop spins until ctx expires. active > capacity is the actual invariant we want to defend ("we have more conns out than configured"), and it strictly subsumes the previous check — capacity == 0 implies active >= 0 > -1, so anything the old check fired on, the new one fires on too. Add TestSetCapacityReductionDrainsWithWaiters that holds N conns, queues hold-forever waiters, then SetCapacity(1) with a 2s deadline. Without the change the test hangs at active=N, borrowed=N; with the change SetCapacity returns promptly and active drops to the target. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

mattlord

LGTM!

One tiny nit / non-blocking cleanup: go/pools/smartconnpool/pool_test.go:1943-1946 still says the replacement conn is opened off the caller goroutine, but the final implementation is synchronous again. I’d update that comment to avoid confusing future us.

Resolved conflicts in go/pools/smartconnpool/pool.go where upstream's bulk idle-stack sweep (#20136) overlapped with this branch's shutdown fixes: - put(): kept upstream's cached `now` for the maxLifetime check while threading this branch's Setting and connectCtx through connReopen. - tryReturnConn(): kept this branch's active>capacity drain guard and folded in upstream's new updateIdleTime parameter. - closeIdleResources(): adopted upstream's PopAll-based linked-list bulk processing and read each expired conn's Setting() inline in the reopen loop (Close() doesn't touch the setting field). TestIdleTimeoutStopsReopeningWhenPoolCloses bypasses pool.Open(), so pool.lifetime was nil and connectCtx() returned a pre-cancelled context, breaking the test's connect-blocking setup. Set pool.lifetime explicitly in the test to match real Open() semantics. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Setting() is a pure field accessor that Close() doesn't touch, so there's no need to stash it in a local before calling connReopen — Go's argument evaluation runs Setting() before connReopen replaces dbconn.Conn anyway. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

The async-spawn approach was reverted; the replacement conn is opened synchronously on the caller's goroutine via connNew. Update the test doc comment so future readers aren't misled. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

…l-shutdown-fixes Signed-off-by: Arthur Schreiber <arthur@planetscale.com> # Conflicts: # go/pools/smartconnpool/pool.go # go/pools/smartconnpool/stress_test.go

… tests Replace requireReceive and requireMaxLifetimeExpired with the canonical Go patterns at each call site: a blocking select with time.After, and a direct time.Sleep on the configured maxLifetime. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

go/pools/smartconnpool/pool.go:483

In put(nil), the pool now attempts connNew(pool.connectCtx()) even when the pool is already closed (pool.close.Load()==nil). Since connectors are not required to honor ctx cancellation (e.g. tests’ newConnector ignores ctx), a late Recycle/Taint after Close can block in the user connector and even attempt opening new backend connections after shutdown. Restore the fast-path that decrements active and returns without calling connect when the pool is already closed.

	if conn == nil {
		// Taint, or Recycle on a closed conn: open a replacement so a
		// queued waiter can be served. We use the pool's lifetime context
		// so a Close in flight unblocks this connect, instead of waiting
		// for the backend's connect timeout.
		var err error
		conn, err = pool.connNew(pool.connectCtx())
		if err != nil {
			pool.closedConn()
			return
		}

timvaillancourt · 2026-05-27T15:54:54Z

+// close — they should fail fast rather than open a connection that has
+// nowhere to go.
+var alreadyCancelled = func() context.Context {
+	ctx, cancel := context.WithCancel(context.Background())


It's too bad the context package doesn't have a "nice" pre-cancelled context

I just wanted to add an alternative in case it is somehow more efficent or preferable:

var alreadyCancelled, _ = context.WithDeadline(context.Background(), time.Time{})

timvaillancourt

LGTM @arthurschreiber. Added one alternative option if it's interesting to you

arthurschreiber · 2026-05-27T16:50:51Z

Added one alternative option if it's interesting to you

Thanks. I don't think this makes any difference either way, so I'll keep it as-is! 🙇‍♂️

Copilot AI review requested due to automatic review settings May 18, 2026 12:29

arthurschreiber added Type: Bug Component: Query Serving labels May 18, 2026

github-actions Bot added this to the v25.0.0 milestone May 18, 2026

Copilot started reviewing on behalf of arthurschreiber May 18, 2026 12:30 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

arthurschreiber and others added 2 commits May 18, 2026 12:38

Copilot AI review requested due to automatic review settings May 18, 2026 12:40

Copilot started reviewing on behalf of arthurschreiber May 18, 2026 12:41 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

arthurschreiber marked this pull request as ready for review May 18, 2026 12:47

Copilot AI review requested due to automatic review settings May 18, 2026 12:47

arthurschreiber requested a review from harshit-gangal as a code owner May 18, 2026 12:47

Copilot AI review requested due to automatic review settings May 21, 2026 14:21

Copilot started reviewing on behalf of arthurschreiber May 21, 2026 14:21 View session

Copilot AI reviewed May 21, 2026

View reviewed changes

mattlord approved these changes May 22, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 23, 2026 10:51

Copilot started reviewing on behalf of arthurschreiber May 23, 2026 10:51 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 23, 2026 10:57

Copilot started reviewing on behalf of arthurschreiber May 23, 2026 10:57 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

Merge remote-tracking branch 'upstream/main' into arthur/smartconnpoo…

abaebfb

…l-shutdown-fixes Signed-off-by: Arthur Schreiber <arthur@planetscale.com> # Conflicts: # go/pools/smartconnpool/pool.go # go/pools/smartconnpool/stress_test.go

This was referenced May 26, 2026

smartconnpool: consolidate shutdown signaling on lifetime context #20186

Draft

smartconnpool: non-blocking lifecycle via generation counter #20187

Open

Copilot AI review requested due to automatic review settings May 27, 2026 13:45

Copilot started reviewing on behalf of arthurschreiber May 27, 2026 13:45 View session

arthurschreiber enabled auto-merge (squash) May 27, 2026 13:48

Copilot AI reviewed May 27, 2026

View reviewed changes

timvaillancourt reviewed May 27, 2026

View reviewed changes

timvaillancourt approved these changes May 27, 2026

View reviewed changes

arthurschreiber merged commit e396da4 into main May 27, 2026
171 of 172 checks passed

arthurschreiber deleted the arthur/smartconnpool-shutdown-fixes branch May 27, 2026 16:51

This was referenced May 27, 2026

[release-23.0] smartconnpool: unblock Close and preserve Setting on reopen (#20122) #20200

Draft

[release-24.0] smartconnpool: unblock Close and preserve Setting on reopen (#20122) #20201

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

smartconnpool: unblock Close and preserve Setting on reopen#20122

smartconnpool: unblock Close and preserve Setting on reopen#20122
arthurschreiber merged 21 commits into
mainfrom
arthur/smartconnpool-shutdown-fixes

arthurschreiber commented May 18, 2026 •

edited

Loading

Uh oh!

vitess-bot Bot commented May 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

codecov Bot commented May 18, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

mattlord left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

timvaillancourt May 27, 2026

Uh oh!

timvaillancourt left a comment •

edited

Loading

Uh oh!

arthurschreiber commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

arthurschreiber commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

What this PR does not do

Related Issue(s)

Checklist

Backport justification

Deployment Notes

AI Disclosure

Uh oh!

vitess-bot Bot commented May 18, 2026

Review Checklist

General

Tests

Documentation

New flags

If a workflow is added or modified:

Backward compatibility

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

codecov Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

mattlord left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

timvaillancourt May 27, 2026

Choose a reason for hiding this comment

Uh oh!

timvaillancourt left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arthurschreiber commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

arthurschreiber commented May 18, 2026 •

edited

Loading

codecov Bot commented May 18, 2026 •

edited

Loading

timvaillancourt left a comment •

edited

Loading