Skip to content

smartconnpool: don't hand returned connections to expired waiters#20308

Open
arthurschreiber wants to merge 7 commits into
mainfrom
arthur/smartconnpool-skip-expired-waiters
Open

smartconnpool: don't hand returned connections to expired waiters#20308
arthurschreiber wants to merge 7 commits into
mainfrom
arthur/smartconnpool-skip-expired-waiters

Conversation

@arthurschreiber

@arthurschreiber arthurschreiber commented Jun 11, 2026

Copy link
Copy Markdown
Member

Description

A production v23 vttablet stalled completely under a timeout storm: ~18k goroutines serialized on the stream pool's waitlist mutex, with every pool connection trapped in the return path. The root cause is in the waitlist: tryReturnConn hands a returned connection to the front-most waiter even when that waiter's context has already expired. Each such handoff parks the connection on a dead request — the unbuffered channel send blocks until the expired waiter gets through the contended mutex — the request then fails instantly and recycles the connection to the next expired waiter. Useful throughput drops to zero while new requests keep joining the list.

The fix stores the request context on the waitlist entry, and tryReturnConnSlow evicts waiters whose context has already expired as it scans for a handoff target: it removes them from the list and wakes them with a nil. Disambiguation is by the value received on the channel — a real connection is a handoff, a nil is an eviction — so the entry's channel is buffered (cap 1) and a waitResult helper maps a nil back to the request's own context error. Because a listed waiter's buffer is always empty (only a returner sends, and only while removing the waiter from the list), both the eviction wake-up and the handoff send complete without ever blocking — including the eviction send made under the waitlist mutex. If every waiter has expired, the connection goes back onto the stack where new requests grab it directly.

An earlier revision of this PR skipped expired waiters but left them on the list, on the theory they'd remove themselves. Under a sustained timeout storm that only relocated the convoy: the expired entries pile up as a long prefix at the head of the list, and every later return re-scans that whole prefix under the mutex — the same O(n)-under-mutex stall, moved from the waiter's exit path to the returner's handoff path. A customer's closed-loop reproduction still wedged on that revision. Evicting during the scan keeps the list from accumulating a dead prefix at all.

This also removes the other half of the collapse: a timed-out waiter could only leave the waitlist by scanning the entire list under the mutex to learn whether it was still a member — O(n) per exit, ~18k entries deep in the incident. go/list now exposes RemoveIfPresent, an O(1) membership check via the owning-list pointer the list already maintains, and both waitForConn fallback paths use it, so the backlog drains at mutex-handoff speed instead of scan speed. Remove keeps its panicking must-be-present contract for the returner's claim path, where absence would mean list corruption.

One surprise along the way: the new context field grew ConnPool past an allocator size bucket where its 128-bit atomics lose their 16-byte alignment, which faults (SIGBUS) on arm64 and amd64. Go has no supported way to demand 16-byte alignment, so this had always worked by allocator accident. The waitlist now lives behind a pointer so its size can never change ConnPool's allocation again, and the constraint is documented on connStack next to the atomics it affects.

Two tests guard the behavior. The first reproduces the production incident with real actors: requests whose deadlines expire while queued, live traffic queued behind them, and returners parked on the waitlist mutex (the test holds the mutex to play the role of the production convoy); on the old code it fails immediately with connections handed to dead requests, and passes deterministically with the fix. The second (TestWaitlistConvoyDrainsUnderConcurrentReturns) is a bounded reproduction of the relocated convoy specifically: a large expired prefix with live waiters queued behind it and several concurrent returners — it asserts the list drains to empty and every live waiter is served, and fails on the skip-and-leave revision (the expired prefix is never removed).

Benchmarks

Added benchmarks over the waitForConn/tryReturnConn paths — waitlist_bench_test.go (steady-state handoff at 1/8/64 waiters, and the empty-list fast path) and waitlist_evict_bench_test.go (repeated returns against a list fronted by an expired prefix).

Comparing this branch against the pre-eviction revision (benchstat, 10 runs; the handoff was re-confirmed with the per-iteration allocation hoisted out and GOMAXPROCS pinned, 20 runs, to remove GC/scheduler jitter):

Path Before After Δ
Handoff (geomean, 1/8/64 waiters) 180.5 ns 180.6 ns +0.06% (within ±1%)
Empty-list fast path ~0.8 ns ~0.8 ns unchanged
Allocations (all paths) 0 added
32 returns, 256-deep expired prefix 66.5 µs 5.4 µs −92%
32 returns, 2048-deep expired prefix 525 µs 48 µs −91%

The common handoff path is neutral — a buffered send to an already-parked receiver does the same direct hand-off as the unbuffered one, so buffering costs nothing here. The expired-prefix path is ~11× faster: that is the relocated convoy being removed — skip-and-leave re-scans the whole prefix on every return, eviction drains it once.

Backport Reasoning

This fixes a tablet-wide production outage: under a timeout storm the connection pool livelocks, the tablet serves zero queries, and only a restart recovers it. There is no configuration workaround. The defective handoff logic exists unchanged on all supported release branches, and the fix is small, self-contained to go/pools/smartconnpool and go/list, and changes no public API or behavior other than no longer handing connections to requests whose context has expired. Requesting backport to release-23.0 (version the incident occurred on) and release-24.0.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

No deployment steps. Behavioral change: the connection pool no longer hands returned connections to requests whose context has already expired — those requests fail with the usual timeout error, and the connection goes to a live waiter or back into the pool.

AI Disclosure

This PR was written primarily by Claude Code (analysis of the goroutine dump, the fix, the tests, and the benchmarks), with human direction and review.

When a connection is returned to the pool, tryReturnConn hands it to the
front-most waiter in the waitlist even if that waiter's context has
already expired. Under a timeout storm this collapses the pool: every
returned connection is parked on a dead request, fails instantly, and is
recycled to the next expired waiter, while the unbuffered handoff leaves
returners blocked in a channel send until their expired target fights
through the contended waitlist mutex. A production vttablet stalled
completely this way, with ~18k goroutines serialized on the stream
pool's waitlist mutex and every pool connection trapped in the return
path doing no useful work.

Store the request context on the waitlist entry and skip waiters whose
context has expired when picking a handoff target. Expired waiters stay
in the list and still remove themselves: absence from the list is how a
waiter knows a returner is committed to handing it a connection, so only
the waiter may remove its own entry on the timeout path. If every waiter
has expired, the connection goes back onto the stack where new requests
pick it up directly.

The new field grew ConnPool past an allocator size bucket in which its
128-bit atomics lose their 16-byte alignment, which faults on arm64 and
amd64 (the Go allocator only provides 16-byte alignment for certain
object sizes, and there is no supported way to demand it). The waitlist
now lives behind a pointer so that its size can never again change
ConnPool's allocation, and NewPool asserts the alignment at startup so a
future layout or toolchain change fails fast instead of crashing with
SIGBUS under traffic.

The new stress test reproduces the production stall deterministically:
a backlog of expired waiters pinned in the waitlist swallows every
returned connection on the old code, starving live waiters queued
behind it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings June 11, 2026 23:02
@arthurschreiber arthurschreiber added Backport to: release-23.0 Needs to be backport to release-23.0 Backport to: release-24.0 Needs to be backport to release-24.0 labels Jun 11, 2026
@github-actions github-actions Bot added this to the v25.0.0 milestone Jun 11, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jun 11, 2026
@vitess-bot

vitess-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

Drop the 16-byte-alignment panic from NewPool and document the
constraint on connStack instead, next to the 128-bit atomic it
protects: the alignment depends on where the allocator places the
ConnPool allocation, which is why the waitlist is held behind a
pointer and why growing ConnPool needs care.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a pathological stall in smartconnpool under timeout storms by preventing returned connections from being handed off to waiters whose request context has already expired, ensuring returned connections can reach live waiters (or go back to the pool) instead of being repeatedly parked on dead requests.

Changes:

  • Store the waiting request context.Context on waitlist entries and skip expired waiters when selecting a handoff target.
  • Move the waitlist behind a pointer in ConnPool to stabilize ConnPool’s allocation size, and add a startup alignment assertion to fail fast if 128-bit stack atomics are misaligned.
  • Add unit and stress tests that deterministically reproduce the expired-waiter backlog scenario and validate correct handoff behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

File Description
go/pools/smartconnpool/waitlist.go Attach request contexts to waiters and skip expired waiters during connection handoff selection.
go/pools/smartconnpool/waitlist_test.go Add focused unit tests covering expired-waiter skipping and claim-window semantics.
go/pools/smartconnpool/stress_test.go Add a stress regression test reproducing the production stall via an expired-waiter backlog.
go/pools/smartconnpool/pool.go Make wait a pointer to stabilize ConnPool size and add a fast-fail alignment assertion for 128-bit atomics.
Comments suppressed due to low confidence (1)

go/pools/smartconnpool/pool.go:215

  • The 16-byte alignment assertion in NewPool is currently unconditional, but the 128-bit atomic implementation that faults on misalignment is only built on amd64/arm64 (go/atomic2/atomic128.go). On other GOARCH values, atomic2 uses the spinlock implementation (go/atomic2/atomic128_spinlock.go), which does not require 16-byte alignment. As written, this can panic at startup on non-amd64/arm64 platforms even though they would otherwise function correctly.

Consider gating the check to amd64/arm64 (runtime.GOARCH) or moving the assert into an amd64/arm64 build-tagged file so it only runs where misalignment is actually fatal.

	pool.config.maxCapacity = config.Capacity
	pool.config.maxIdleCount = config.MaxIdleCount
	pool.config.maxLifetime.Store(config.MaxLifetime.Nanoseconds())
	pool.config.idleTimeout.Store(config.IdleTimeout.Nanoseconds())
	pool.config.refreshInterval.Store(config.RefreshInterval.Nanoseconds())
	pool.config.logWait = config.LogWait
	pool.config.maxWaiters = config.MaxWaiters
	pool.wait.init()

@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 93.93939% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.40%. Comparing base (70c7a72) to head (d02ff2c).
⚠️ Report is 323 commits behind head on main.

Files with missing lines Patch % Lines
go/pools/smartconnpool/waitlist.go 92.59% 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (70c7a72) and HEAD (d02ff2c). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (70c7a72) HEAD (d02ff2c)
1 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #20308       +/-   ##
===========================================
- Coverage   69.67%   60.40%    -9.27%     
===========================================
  Files        1614        9     -1605     
  Lines      216793     1043   -215750     
===========================================
- Hits       151044      630   -150414     
+ Misses      65749      413    -65336     
Flag Coverage Δ
partial 60.40% <93.93%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…ress test

Replace the synthetic pinned-waiter stress test with one built from real
actors: requests whose deadlines expire while queued on the waitlist,
live traffic queued behind them, and returner goroutines parked on the
waitlist mutex. The mutex convoy that pins expired waiters in the list
at production scale is recreated by holding the waitlist mutex from the
test while the backlog expires, so the returners scan a list full of
expired-but-listed waiters exactly as in the incident.

The assertions are the customer-visible contract and are independent of
timing: no request whose context has already expired may be handed a
connection, and every live request must be served. On the unfixed code
the test fails in milliseconds with connections delivered to dead
requests; with the fix it passes deterministically.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings June 12, 2026 07:33

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@arthurschreiber arthurschreiber removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jun 12, 2026
arthurschreiber and others added 2 commits June 12, 2026 08:05
A waiter that wakes on context expiry or pool close could only leave
the waitlist by scanning the entire list under the waitlist mutex to
learn whether it was still a member — O(n) per exit, with ~18k entries
in the production incident, all serialized on one mutex. The membership
bit is already maintained by the list itself: go/list.Element tracks
its owning list, set on insert and cleared on remove.

Expose that as list.RemoveIfPresent, which removes the element and
reports whether it did, instead of panicking like Remove. Both fallback
paths in waitForConn now use it, so timed-out waiters exit in constant
time and the backlog drains at mutex-handoff speed instead of scan
speed. Remove keeps its panicking must-be-present contract for the
returner's claim path, where absence would mean list corruption.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings June 12, 2026 08:09
@arthurschreiber arthurschreiber marked this pull request as ready for review June 12, 2026 08:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

Expired waiters are skipped by the returner's aging loop, so they keep
age == 0 until they remove themselves and inflated maybeStarvingCount,
causing the starving worker to pop a connection it can't hand out.
Skip them the same way tryReturnConnSlow does. Also replace the loop's
stale comment, which described a force parameter that no longer exists.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>

@mattlord mattlord left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm not so sure about the backport... but I wouldn't block it.

The previous change skipped expired waiters during a return but left them
on the waitlist, so under a timeout storm an expired prefix accumulated at
the head of the list and every later return re-scanned it under the
waitlist mutex -- O(prefix) per handoff. That is the same O(n)-under-mutex
stall this PR set out to remove, relocated from the waiter's exit path to
the returner's handoff path; a customer's closed-loop reproduction wedged
on the patched branch.

Have the returner evict expired waiters as it scans: remove them from the
list and wake them with a nil. Disambiguation is by the received value --
a real connection is a handoff, a nil is an eviction -- so the conn channel
is buffered (cap 1) and a waitResult helper maps a nil back to the context
error. Because a listed waiter's buffer is always empty (only a returner
sends, and only while removing the waiter from the list), both the eviction
nil-send and the handoff send complete without blocking, including the
eviction send made under the mutex.

Add a bounded convoy stress test plus benchmarks covering the handoff, the
empty-list fast path, and the expired-prefix scan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Copilot AI review requested due to automatic review settings June 14, 2026 15:29

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

@arthurschreiber arthurschreiber self-assigned this Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Backport to: release-23.0 Needs to be backport to release-23.0 Backport to: release-24.0 Needs to be backport to release-24.0 Component: Query Serving Type: Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug Report: vttablet stalls completely when the connection pool waitlist fills with timed-out requests

3 participants