Skip to content

smartconnpool: unblock Close and preserve Setting on reopen#20122

Merged
arthurschreiber merged 21 commits into
mainfrom
arthur/smartconnpool-shutdown-fixes
May 27, 2026
Merged

smartconnpool: unblock Close and preserve Setting on reopen#20122
arthurschreiber merged 21 commits into
mainfrom
arthur/smartconnpool-shutdown-fixes

Conversation

@arthurschreiber
Copy link
Copy Markdown
Member

@arthurschreiber arthurschreiber commented May 18, 2026

Description

Four small changes to the smartconnpool. The first three address shutdown stalls when the MySQL backend is unhealthy; the fourth fixes a long-standing silent loss of connection settings on reopen. Each is scoped to a specific code path; healthy pools see no behavior change.

1. tryReturnConn closes eagerly when the pool is over capacity

Whenever pool.active > pool.capacity — driven either by Close()/SetCapacity(0) taking capacity to 0 or by any SetCapacity(N) reducing it below the current active count — Recycled connections are closed immediately instead of being handed to whichever waiter is still in the waitlist. Without this, conns get passed waiter-to-waiter during the drain and the setCapacity loop's 10ms-sleep polls never observe a non-empty stack to close. The check is active > capacity rather than capacity == 0 so partial reductions get the same drain semantics as a full Close — SetCapacity(N → M) with 0 < M < N and a non-empty waitlist also gets a clean drain.

2. Pool-wide lifetime context, cancelled on Close

The pool now holds a context.Context (in a small heap-allocated lifetime struct, behind an atomic.Pointer) that is cancelled at the very start of CloseWithContext, before setCapacity(0).

The pointer indirection is deliberate: inlining context.Context (16B) + CancelFunc (8B) directly into ConnPool pushed the struct's size across a Go allocator size-class boundary and exposed a latent 16-byte alignment requirement of the lock-free stack's atomic-128 ops, causing SIGBUS on ARM64. Keeping the inline footprint at 8 bytes preserves the size class.

3. The idle worker, put(nil), and the maxLifetime put branch all use the lifetime ctx for their connect calls

Previously these three sites passed context.Background() to connect(...), so a Close() racing with an in-flight reopen was extended by the full db-connect-timeout-ms per blocked connect, completely ignoring PoolCloseTimeout. They now use pool.connectCtx(), so a Close unblocks any in-flight connect immediately. put stays synchronous on the caller's goroutine — the same shape as before this PR — but no longer holds up Close.

4. connReopen no longer silently drops the connection's Setting

connReopen used to read Setting() after replacing dbconn.Conn with a freshly-connected conn, so it always read nil and the conn silently migrated from settings[bucket] to clean on every reopen — both for the idle worker and for the maxLifetime path.

Fixing connReopen itself to unconditionally capture Setting before the replace broke the get()/getWithSetting error paths, which rely on the bare-reopen behavior to fall back to a settingless conn when ResetSetting fails. So connReopen now takes an explicit setting *Setting parameter: callers refreshing a conn in place (closeIdleResources and the maxLifetime branch of put) capture the previous Setting and pass it through; callers recovering from a ResetSetting failure pass nil for a bare reopen.

What this PR does not do

Under a sick backend during normal operation (not Close), Taint and Recycle still synchronously block on the connect attempt for the full backend connect timeout. That matches the behavior before this PR — fixing it requires moving the connect off the caller's goroutine, which has its own design trade-offs and is a separate conversation. The Close-time stalls are the immediate operational problem this PR addresses.

Related Issue(s)

None. Discovered during a deep review of the smartconnpool stall surface.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

New and updated coverage:

  • TestCloseDoesNotHandOffToWaiters — Close doesn't hand Recycled conns to waiters once capacity is 0.
  • TestSetCapacityReductionDrainsWithWaitersSetCapacity(N → M) with 0 < M < N and hold-forever waiters completes within its deadline; recycled conns are closed when active > capacity rather than handed off and stuck.
  • TestIdleWorkerConnectCancelsOnClose — idle worker's connect unblocks when Close fires.
  • TestTaintConnectCancelsOnCloseTaint's synchronous reopen unblocks when Close fires.
  • TestRecycleMaxLifetimeReopenCancelsOnClose — the maxLifetime reopen in put unblocks when Close fires.
  • TestTaintWakesWaiter and TestRecycleMaxLifetimeWakesWaiter — queued waiters are still served by the synchronous replacement.
  • TestRecycleMaxLifetimePreservesSetting and TestIdleWorkerReopenPreservesSetting — reopened conns return to the same settings stack.
  • TestIdleTimeoutDoesntLeaveLingeringConnection — tightened to account for the two-loop idle refresh path without racing the worker.
  • TestStressCloseDuringReconnectStorm — stress-covers Close racing with blocked reconnects and verifies shutdown completes without leaking conns.
  • TestStressWaiterStormDuringDrain — stress-covers queued waiters during Close drain and verifies returned conns are closed rather than handed off.

Existing TestCloseDuringWaitForConn × 50 iterations under -race continues to pass.

Backport justification

These are production-reliability fixes for vttablet shutdown against an unhealthy backend. With a dead/unreachable MySQL:

  • The idle worker's connReopen, plus the put(nil) and maxLifetime branches of put, all blocked for the full backend connect timeout per in-flight connect, ignoring PoolCloseTimeout. Observed Close latency goes from the intended ~10s up to many minutes.
  • Concurrent Recycles during Close get cycled waiter-to-waiter instead of landing on a stack the drain loop can see, adding more polling latency.
  • Settings configured on connections silently disappear after every idle-timeout refresh and every conn rotation that goes through the reopen path, forcing repeated re-application.

A vttablet restart against a dead primary is exactly when fast shutdown matters most. The changes are scoped to the Close, drain, and reopen paths; healthy pools see no behavior change beyond the (long-broken) setting preservation now actually working.

Deployment Notes

One user-visible behavior change: connections being Recycled while the pool is over its configured capacity — during Close(), SetCapacity(0), or any SetCapacity(N) that reduces capacity below current active — are now closed rather than handed to queued waiters. Those waiters will return with ErrConnPoolClosed once Close() finishes signalling the close channel, or with their own ctx timeout otherwise.

AI Disclosure

This PR was implemented by Claude Code based on a deep review and step-by-step fix plan I directed.

Two changes that together let Close return promptly when the MySQL
backend is unhealthy:

1. tryReturnConn now closes connections eagerly once capacity has been
   driven to 0. Without this, in-flight Recycles during Close get
   handed to queued waiters instead of landing on a stack, and the
   setCapacity drain loop never observes the conns it needs to close.

2. The idle worker's connection reopen now uses a context that is
   cancelled at the start of Close. Previously connReopen used
   context.Background(), so a Close that races with an idle-tick reopen
   was extended by the full backend connect timeout per expired
   connection, ignoring PoolCloseTimeout.

The context is held behind an atomic pointer to keep ConnPool's heap
footprint stable; the lock-free stack's atomic-128 ops rely on Go's
allocator placing the struct in a size class with 16-byte alignment,
and inlining context.Context + CancelFunc directly crossed a size
class boundary that no longer satisfied that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings May 18, 2026 12:29
@github-actions github-actions Bot added this to the v25.0.0 milestone May 18, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 18, 2026
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot Bot commented May 18, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Two small fixes to smartconnpool.ConnPool to ensure Close() returns promptly when the MySQL backend is unhealthy: connections being recycled during a close are closed eagerly rather than handed to waiters, and the idle-timeout worker's in-flight connect is now cancelled at close start instead of blocking for the full backend connect timeout.

Changes:

  • tryReturnConn now closes connections immediately when capacity has been driven to 0 (Close or explicit SetCapacity(0)), preventing waiter-to-waiter handoff loops that starve the setCapacity drain.
  • Introduces a heap-allocated workerLifetime{ctx, cancel} (held behind atomic.Pointer to preserve ConnPool's size class and 16-byte alignment for the lock-free stack's atomic-128 ops) that is cancelled at the start of CloseWithContext; the idle worker uses this context for connReopen.
  • Two new regression tests: TestCloseDoesNotHandOffToWaiters and TestIdleWorkerConnectCancelsOnClose.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
go/pools/smartconnpool/pool.go Adds workerLifetime struct + atomic pointer, cancels it at Close start, uses pool worker ctx for idle-worker connReopen, and closes recycled conns eagerly when capacity is 0.
go/pools/smartconnpool/pool_test.go Adds two regression tests covering the no-handoff-after-Close guarantee and the cancellation of the idle-worker reopen on Close.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 90.47619% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.52%. Comparing base (70c7a72) to head (03e65e6).
⚠️ Report is 277 commits behind head on main.

Files with missing lines Patch % Lines
go/pools/smartconnpool/pool.go 90.47% 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (70c7a72) and HEAD (03e65e6). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (70c7a72) HEAD (03e65e6)
1 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #20122       +/-   ##
===========================================
- Coverage   69.67%   56.52%   -13.16%     
===========================================
  Files        1614        8     -1606     
  Lines      216793      966   -215827     
===========================================
- Hits       151044      546   -150498     
+ Misses      65749      420    -65329     
Flag Coverage Δ
partial 56.52% <90.47%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

arthurschreiber and others added 2 commits May 18, 2026 12:38
Switches the new shutdown tests to t.Cleanup for releasing the blocked
connector and closing the pool so resources unwind even when an
assertion fires partway through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
t.Context is cancelled just before t.Cleanup runs, so the blocked
connector can wait on it directly instead of managing a dedicated
release channel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings May 18, 2026 12:40
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

The struct is the pool's open/close-bounded context, used by any code
path that calls into a user-supplied callback during a shutdown — not
just worker goroutines. Renaming makes the abstraction clearer and
leaves room for future callers (e.g. Taint/Recycle's connNew) to use
the same mechanism without a misleading name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
@arthurschreiber arthurschreiber added Component: VTTablet Backport to: release-23.0 Needs to be backport to release-23.0 Backport to: release-24.0 Needs to be backport to release-24.0 and removed Component: Query Serving NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 18, 2026
@arthurschreiber arthurschreiber marked this pull request as ready for review May 18, 2026 12:47
Copilot AI review requested due to automatic review settings May 18, 2026 12:47
Copilot AI review requested due to automatic review settings May 21, 2026 14:21
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Replace tryReturnConn's eager-close trigger with active > capacity. The
old capacity == 0 check fires only when capacity is fully drained and
misses partial reductions: SetCapacity(N → M) with 0 < M < N and
queued waiters cycles Recycles to waiters without ever pushing to a
stack, so the drain loop spins until ctx expires.

active > capacity is the actual invariant we want to defend ("we have
more conns out than configured"), and it strictly subsumes the
previous check — capacity == 0 implies active >= 0 > -1, so anything
the old check fired on, the new one fires on too.

Add TestSetCapacityReductionDrainsWithWaiters that holds N conns,
queues hold-forever waiters, then SetCapacity(1) with a 2s deadline.
Without the change the test hangs at active=N, borrowed=N; with the
change SetCapacity returns promptly and active drops to the target.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copy link
Copy Markdown
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

One tiny nit / non-blocking cleanup: go/pools/smartconnpool/pool_test.go:1943-1946 still says the replacement conn is opened off the caller goroutine, but the final implementation is synchronous again. I’d update that comment to avoid confusing future us.

Resolved conflicts in go/pools/smartconnpool/pool.go where upstream's
bulk idle-stack sweep (#20136) overlapped with this branch's shutdown
fixes:

- put(): kept upstream's cached `now` for the maxLifetime check while
  threading this branch's Setting and connectCtx through connReopen.
- tryReturnConn(): kept this branch's active>capacity drain guard and
  folded in upstream's new updateIdleTime parameter.
- closeIdleResources(): adopted upstream's PopAll-based linked-list
  bulk processing and read each expired conn's Setting() inline in the
  reopen loop (Close() doesn't touch the setting field).

TestIdleTimeoutStopsReopeningWhenPoolCloses bypasses pool.Open(), so
pool.lifetime was nil and connectCtx() returned a pre-cancelled
context, breaking the test's connect-blocking setup. Set pool.lifetime
explicitly in the test to match real Open() semantics.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings May 23, 2026 10:51
Setting() is a pure field accessor that Close() doesn't touch, so
there's no need to stash it in a local before calling connReopen —
Go's argument evaluation runs Setting() before connReopen replaces
dbconn.Conn anyway.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

The async-spawn approach was reverted; the replacement conn is opened
synchronously on the caller's goroutine via connNew. Update the test
doc comment so future readers aren't misled.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings May 23, 2026 10:57
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

…l-shutdown-fixes

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

# Conflicts:
#	go/pools/smartconnpool/pool.go
#	go/pools/smartconnpool/stress_test.go
… tests

Replace requireReceive and requireMaxLifetimeExpired with the canonical
Go patterns at each call site: a blocking select with time.After, and a
direct time.Sleep on the configured maxLifetime.

Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings May 27, 2026 13:45
@arthurschreiber arthurschreiber enabled auto-merge (squash) May 27, 2026 13:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

go/pools/smartconnpool/pool.go:483

  • In put(nil), the pool now attempts connNew(pool.connectCtx()) even when the pool is already closed (pool.close.Load()==nil). Since connectors are not required to honor ctx cancellation (e.g. tests’ newConnector ignores ctx), a late Recycle/Taint after Close can block in the user connector and even attempt opening new backend connections after shutdown. Restore the fast-path that decrements active and returns without calling connect when the pool is already closed.
	if conn == nil {
		// Taint, or Recycle on a closed conn: open a replacement so a
		// queued waiter can be served. We use the pool's lifetime context
		// so a Close in flight unblocks this connect, instead of waiting
		// for the backend's connect timeout.
		var err error
		conn, err = pool.connNew(pool.connectCtx())
		if err != nil {
			pool.closedConn()
			return
		}

// close — they should fail fast rather than open a connection that has
// nowhere to go.
var alreadyCancelled = func() context.Context {
ctx, cancel := context.WithCancel(context.Background())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's too bad the context package doesn't have a "nice" pre-cancelled context

I just wanted to add an alternative in case it is somehow more efficent or preferable:

var alreadyCancelled, _ = context.WithDeadline(context.Background(), time.Time{})

Copy link
Copy Markdown
Contributor

@timvaillancourt timvaillancourt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @arthurschreiber. Added one alternative option if it's interesting to you

@arthurschreiber
Copy link
Copy Markdown
Member Author

Added one alternative option if it's interesting to you

Thanks. I don't think this makes any difference either way, so I'll keep it as-is! 🙇‍♂️

@arthurschreiber arthurschreiber merged commit e396da4 into main May 27, 2026
171 of 172 checks passed
@arthurschreiber arthurschreiber deleted the arthur/smartconnpool-shutdown-fixes branch May 27, 2026 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Backport to: release-23.0 Needs to be backport to release-23.0 Backport to: release-24.0 Needs to be backport to release-24.0 Component: VTTablet Type: Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants