smartconnpool: don't hand returned connections to expired waiters by arthurschreiber · Pull Request #20308 · vitessio/vitess

arthurschreiber · 2026-06-11T23:02:01Z

Description

A production v23 vttablet stalled completely under a timeout storm: ~18k goroutines serialized on the stream pool's waitlist mutex, with every pool connection trapped in the return path. The root cause is in the waitlist: tryReturnConn hands a returned connection to the front-most waiter even when that waiter's context has already expired. Each such handoff parks the connection on a dead request — the unbuffered channel send blocks until the expired waiter gets through the contended mutex — the request then fails instantly and recycles the connection to the next expired waiter. Useful throughput drops to zero while new requests keep joining the list.

The fix stores the request context on the waitlist entry, and tryReturnConnSlow evicts waiters whose context has already expired as it scans for a handoff target: it removes them from the list and wakes them with a nil. Disambiguation is by the value received on the channel — a real connection is a handoff, a nil is an eviction — so the entry's channel is buffered (cap 1) and a waitResult helper maps a nil back to the request's own context error. Because a listed waiter's buffer is always empty (only a returner sends, and only while removing the waiter from the list), both the eviction wake-up and the handoff send complete without ever blocking — including the eviction send made under the waitlist mutex. If every waiter has expired, the connection goes back onto the stack where new requests grab it directly.

An earlier revision of this PR skipped expired waiters but left them on the list, on the theory they'd remove themselves. Under a sustained timeout storm that only relocated the convoy: the expired entries pile up as a long prefix at the head of the list, and every later return re-scans that whole prefix under the mutex — the same O(n)-under-mutex stall, moved from the waiter's exit path to the returner's handoff path. A customer's closed-loop reproduction still wedged on that revision. Evicting during the scan keeps the list from accumulating a dead prefix at all.

This also removes the other half of the collapse: a timed-out waiter could only leave the waitlist by scanning the entire list under the mutex to learn whether it was still a member — O(n) per exit, ~18k entries deep in the incident. go/list now exposes RemoveIfPresent, an O(1) membership check via the owning-list pointer the list already maintains, and both waitForConn fallback paths use it, so the backlog drains at mutex-handoff speed instead of scan speed. Remove keeps its panicking must-be-present contract for the returner's claim path, where absence would mean list corruption.

One surprise along the way: the new context field grew ConnPool past an allocator size bucket where its 128-bit atomics lose their 16-byte alignment, which faults (SIGBUS) on arm64 and amd64. Go has no supported way to demand 16-byte alignment, so this had always worked by allocator accident. The waitlist now lives behind a pointer so its size can never change ConnPool's allocation again, and the constraint is documented on connStack next to the atomics it affects.

Two tests guard the behavior. The first reproduces the production incident with real actors: requests whose deadlines expire while queued, live traffic queued behind them, and returners parked on the waitlist mutex (the test holds the mutex to play the role of the production convoy); on the old code it fails immediately with connections handed to dead requests, and passes deterministically with the fix. The second (TestWaitlistConvoyDrainsUnderConcurrentReturns) is a bounded reproduction of the relocated convoy specifically: a large expired prefix with live waiters queued behind it and several concurrent returners — it asserts the list drains to empty and every live waiter is served, and fails on the skip-and-leave revision (the expired prefix is never removed).

Benchmarks

Added benchmarks over the waitForConn/tryReturnConn paths — waitlist_bench_test.go (steady-state handoff at 1/8/64 waiters, and the empty-list fast path) and waitlist_evict_bench_test.go (repeated returns against a list fronted by an expired prefix).

Comparing this branch against the pre-eviction revision (benchstat, 10 runs; the handoff was re-confirmed with the per-iteration allocation hoisted out and GOMAXPROCS pinned, 20 runs, to remove GC/scheduler jitter):

Path	Before	After	Δ
Handoff (geomean, 1/8/64 waiters)	180.5 ns	180.6 ns	+0.06% (within ±1%)
Empty-list fast path	~0.8 ns	~0.8 ns	unchanged
Allocations (all paths)	—	—	0 added
32 returns, 256-deep expired prefix	66.5 µs	5.4 µs	−92%
32 returns, 2048-deep expired prefix	525 µs	48 µs	−91%

The common handoff path is neutral — a buffered send to an already-parked receiver does the same direct hand-off as the unbuffered one, so buffering costs nothing here. The expired-prefix path is ~11× faster: that is the relocated convoy being removed — skip-and-leave re-scans the whole prefix on every return, eviction drains it once.

Backport Reasoning

This fixes a tablet-wide production outage: under a timeout storm the connection pool livelocks, the tablet serves zero queries, and only a restart recovers it. There is no configuration workaround. The defective handoff logic exists unchanged on all supported release branches, and the fix is small, self-contained to go/pools/smartconnpool and go/list, and changes no public API or behavior other than no longer handing connections to requests whose context has expired. Requesting backport to release-23.0 (version the incident occurred on) and release-24.0.

Related Issue(s)

Fixes Bug Report: vttablet stalls completely when the connection pool waitlist fills with timed-out requests #20310
Related: smartconnpool: non-blocking lifecycle via generation counter #20187 (waitlist handling requirements for the connection-pool generation-counter redesign)

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

Deployment Notes

No deployment steps. Behavioral change: the connection pool no longer hands returned connections to requests whose context has already expired — those requests fail with the usual timeout error, and the connection goes to a live waiter or back into the pool.

AI Disclosure

This PR was written primarily by Claude Code (analysis of the goroutine dump, the fix, the tests, and the benchmarks), with human direction and review.

When a connection is returned to the pool, tryReturnConn hands it to the front-most waiter in the waitlist even if that waiter's context has already expired. Under a timeout storm this collapses the pool: every returned connection is parked on a dead request, fails instantly, and is recycled to the next expired waiter, while the unbuffered handoff leaves returners blocked in a channel send until their expired target fights through the contended waitlist mutex. A production vttablet stalled completely this way, with ~18k goroutines serialized on the stream pool's waitlist mutex and every pool connection trapped in the return path doing no useful work. Store the request context on the waitlist entry and skip waiters whose context has expired when picking a handoff target. Expired waiters stay in the list and still remove themselves: absence from the list is how a waiter knows a returner is committed to handing it a connection, so only the waiter may remove its own entry on the timeout path. If every waiter has expired, the connection goes back onto the stack where new requests pick it up directly. The new field grew ConnPool past an allocator size bucket in which its 128-bit atomics lose their 16-byte alignment, which faults on arm64 and amd64 (the Go allocator only provides 16-byte alignment for certain object sizes, and there is no supported way to demand it). The waitlist now lives behind a pointer so that its size can never again change ConnPool's allocation, and NewPool asserts the alignment at startup so a future layout or toolchain change fails fast instead of crashing with SIGBUS under traffic. The new stress test reproduces the production stall deterministically: a backlog of expired waiters pinned in the waitlist swallows every returned connection on the old code, starving live waiters queued behind it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

vitess-bot · 2026-06-11T23:02:28Z

Drop the 16-byte-alignment panic from NewPool and document the constraint on connStack instead, next to the 128-bit atomic it protects: the alignment depends on where the allocator places the ConnPool allocation, which is why the waitlist is held behind a pointer and why growing ConnPool needs care. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Copilot

Pull request overview

This PR fixes a pathological stall in smartconnpool under timeout storms by preventing returned connections from being handed off to waiters whose request context has already expired, ensuring returned connections can reach live waiters (or go back to the pool) instead of being repeatedly parked on dead requests.

Changes:

Store the waiting request context.Context on waitlist entries and skip expired waiters when selecting a handoff target.
Move the waitlist behind a pointer in ConnPool to stabilize ConnPool’s allocation size, and add a startup alignment assertion to fail fast if 128-bit stack atomics are misaligned.
Add unit and stress tests that deterministically reproduce the expired-waiter backlog scenario and validate correct handoff behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

File	Description
go/pools/smartconnpool/waitlist.go	Attach request contexts to waiters and skip expired waiters during connection handoff selection.
go/pools/smartconnpool/waitlist_test.go	Add focused unit tests covering expired-waiter skipping and claim-window semantics.
go/pools/smartconnpool/stress_test.go	Add a stress regression test reproducing the production stall via an expired-waiter backlog.
go/pools/smartconnpool/pool.go	Make `wait` a pointer to stabilize `ConnPool` size and add a fast-fail alignment assertion for 128-bit atomics.

Comments suppressed due to low confidence (1)

go/pools/smartconnpool/pool.go:215

The 16-byte alignment assertion in NewPool is currently unconditional, but the 128-bit atomic implementation that faults on misalignment is only built on amd64/arm64 (go/atomic2/atomic128.go). On other GOARCH values, atomic2 uses the spinlock implementation (go/atomic2/atomic128_spinlock.go), which does not require 16-byte alignment. As written, this can panic at startup on non-amd64/arm64 platforms even though they would otherwise function correctly.

Consider gating the check to amd64/arm64 (runtime.GOARCH) or moving the assert into an amd64/arm64 build-tagged file so it only runs where misalignment is actually fatal.

	pool.config.maxCapacity = config.Capacity
	pool.config.maxIdleCount = config.MaxIdleCount
	pool.config.maxLifetime.Store(config.MaxLifetime.Nanoseconds())
	pool.config.idleTimeout.Store(config.IdleTimeout.Nanoseconds())
	pool.config.refreshInterval.Store(config.RefreshInterval.Nanoseconds())
	pool.config.logWait = config.LogWait
	pool.config.maxWaiters = config.MaxWaiters
	pool.wait.init()

codecov · 2026-06-11T23:11:49Z

Codecov Report

❌ Patch coverage is 93.93939% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.40%. Comparing base (70c7a72) to head (d02ff2c).
⚠️ Report is 323 commits behind head on main.

Files with missing lines	Patch %	Lines
go/pools/smartconnpool/waitlist.go	92.59%	2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (70c7a72) and HEAD (d02ff2c). Click for more details.

HEAD has 1 upload less than BASE

Flag BASE (70c7a72) HEAD (d02ff2c)

1 0

Additional details and impacted files

@@             Coverage Diff             @@
##             main   #20308       +/-   ##
===========================================
- Coverage   69.67%   60.40%    -9.27%     
===========================================
  Files        1614        9     -1605     
  Lines      216793     1043   -215750     
===========================================
- Hits       151044      630   -150414     
+ Misses      65749      413    -65336

Flag	Coverage Δ
partial	`60.40% <93.93%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…ress test Replace the synthetic pinned-waiter stress test with one built from real actors: requests whose deadlines expire while queued on the waitlist, live traffic queued behind them, and returner goroutines parked on the waitlist mutex. The mutex convoy that pins expired waiters in the list at production scale is recreated by holding the waitlist mutex from the test while the backlog expires, so the returners scan a list full of expired-but-listed waiters exactly as in the incident. The assertions are the customer-visible contract and are independent of timing: no request whose context has already expired may be handed a connection, and every live request must be served. On the unfixed code the test fails in milliseconds with connections delivered to dead requests; with the fix it passes deterministically. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

A waiter that wakes on context expiry or pool close could only leave the waitlist by scanning the entire list under the waitlist mutex to learn whether it was still a member — O(n) per exit, with ~18k entries in the production incident, all serialized on one mutex. The membership bit is already maintained by the list itself: go/list.Element tracks its owning list, set on insert and cleared on remove. Expose that as list.RemoveIfPresent, which removes the element and reports whether it did, instead of panicking like Remove. Both fallback paths in waitForConn now use it, so timed-out waiters exit in constant time and the backlog drains at mutex-handoff speed instead of scan speed. Remove keeps its panicking must-be-present contract for the returner's claim path, where absence would mean list corruption. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

Expired waiters are skipped by the returner's aging loop, so they keep age == 0 until they remove themselves and inflated maybeStarvingCount, causing the starving worker to pop a connection it can't hand out. Skip them the same way tryReturnConnSlow does. Also replace the loop's stale comment, which described a force parameter that no longer exists. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

mattlord

LGTM. I'm not so sure about the backport... but I wouldn't block it.

The previous change skipped expired waiters during a return but left them on the waitlist, so under a timeout storm an expired prefix accumulated at the head of the list and every later return re-scanned it under the waitlist mutex -- O(prefix) per handoff. That is the same O(n)-under-mutex stall this PR set out to remove, relocated from the waiter's exit path to the returner's handoff path; a customer's closed-loop reproduction wedged on the patched branch. Have the returner evict expired waiters as it scans: remove them from the list and wake them with a nil. Disambiguation is by the received value -- a real connection is a handoff, a nil is an eviction -- so the conn channel is buffered (cap 1) and a waitResult helper maps a nil back to the context error. Because a listed waiter's buffer is always empty (only a returner sends, and only while removing the waiter from the list), both the eviction nil-send and the handoff send complete without blocking, including the eviction send made under the mutex. Add a bounded convoy stress test plus benchmarks covering the handoff, the empty-list fast path, and the expired-prefix scan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

arthurschreiber added Type: Bug Component: Query Serving labels Jun 11, 2026

Copilot AI review requested due to automatic review settings June 11, 2026 23:02

arthurschreiber added Backport to: release-23.0 Needs to be backport to release-23.0 Backport to: release-24.0 Needs to be backport to release-24.0 labels Jun 11, 2026

github-actions Bot added this to the v25.0.0 milestone Jun 11, 2026

Copilot started reviewing on behalf of arthurschreiber June 11, 2026 23:02 View session

Copilot AI reviewed Jun 11, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings June 12, 2026 07:33

Copilot started reviewing on behalf of arthurschreiber June 12, 2026 07:33 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

arthurschreiber and others added 2 commits June 12, 2026 08:05

smartconnpool: drop ineffectual assignment flagged by golangci-lint

ef24516

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Copilot AI review requested due to automatic review settings June 12, 2026 08:09

Copilot started reviewing on behalf of arthurschreiber June 12, 2026 08:10 View session

arthurschreiber marked this pull request as ready for review June 12, 2026 08:12

arthurschreiber requested a review from harshit-gangal as a code owner June 12, 2026 08:12

Copilot AI reviewed Jun 12, 2026

View reviewed changes

mattlord approved these changes Jun 12, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings June 14, 2026 15:29

Copilot started reviewing on behalf of arthurschreiber June 14, 2026 15:30 View session

Copilot AI reviewed Jun 14, 2026

View reviewed changes

arthurschreiber self-assigned this Jun 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

smartconnpool: don't hand returned connections to expired waiters#20308

smartconnpool: don't hand returned connections to expired waiters#20308
arthurschreiber wants to merge 7 commits into
mainfrom
arthur/smartconnpool-skip-expired-waiters

arthurschreiber commented Jun 11, 2026 •

edited

Loading

Uh oh!

vitess-bot Bot commented Jun 11, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

codecov Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

mattlord left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

arthurschreiber commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Benchmarks

Backport Reasoning

Related Issue(s)

Checklist

Deployment Notes

AI Disclosure

Uh oh!

vitess-bot Bot commented Jun 11, 2026

Review Checklist

General

Tests

Documentation

New flags

If a workflow is added or modified:

Backward compatibility

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

codecov Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

mattlord left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

arthurschreiber commented Jun 11, 2026 •

edited

Loading

codecov Bot commented Jun 11, 2026 •

edited

Loading