smartconnpool: don't hand returned connections to expired waiters#20308
smartconnpool: don't hand returned connections to expired waiters#20308arthurschreiber wants to merge 7 commits into
Conversation
When a connection is returned to the pool, tryReturnConn hands it to the front-most waiter in the waitlist even if that waiter's context has already expired. Under a timeout storm this collapses the pool: every returned connection is parked on a dead request, fails instantly, and is recycled to the next expired waiter, while the unbuffered handoff leaves returners blocked in a channel send until their expired target fights through the contended waitlist mutex. A production vttablet stalled completely this way, with ~18k goroutines serialized on the stream pool's waitlist mutex and every pool connection trapped in the return path doing no useful work. Store the request context on the waitlist entry and skip waiters whose context has expired when picking a handoff target. Expired waiters stay in the list and still remove themselves: absence from the list is how a waiter knows a returner is committed to handing it a connection, so only the waiter may remove its own entry on the timeout path. If every waiter has expired, the connection goes back onto the stack where new requests pick it up directly. The new field grew ConnPool past an allocator size bucket in which its 128-bit atomics lose their 16-byte alignment, which faults on arm64 and amd64 (the Go allocator only provides 16-byte alignment for certain object sizes, and there is no supported way to demand it). The waitlist now lives behind a pointer so that its size can never again change ConnPool's allocation, and NewPool asserts the alignment at startup so a future layout or toolchain change fails fast instead of crashing with SIGBUS under traffic. The new stress test reproduces the production stall deterministically: a backlog of expired waiters pinned in the waitlist swallows every returned connection on the old code, starving live waiters queued behind it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Drop the 16-byte-alignment panic from NewPool and document the constraint on connStack instead, next to the 128-bit atomic it protects: the alignment depends on where the allocator places the ConnPool allocation, which is why the waitlist is held behind a pointer and why growing ConnPool needs care. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes a pathological stall in smartconnpool under timeout storms by preventing returned connections from being handed off to waiters whose request context has already expired, ensuring returned connections can reach live waiters (or go back to the pool) instead of being repeatedly parked on dead requests.
Changes:
- Store the waiting request
context.Contexton waitlist entries and skip expired waiters when selecting a handoff target. - Move the waitlist behind a pointer in
ConnPoolto stabilizeConnPool’s allocation size, and add a startup alignment assertion to fail fast if 128-bit stack atomics are misaligned. - Add unit and stress tests that deterministically reproduce the expired-waiter backlog scenario and validate correct handoff behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| go/pools/smartconnpool/waitlist.go | Attach request contexts to waiters and skip expired waiters during connection handoff selection. |
| go/pools/smartconnpool/waitlist_test.go | Add focused unit tests covering expired-waiter skipping and claim-window semantics. |
| go/pools/smartconnpool/stress_test.go | Add a stress regression test reproducing the production stall via an expired-waiter backlog. |
| go/pools/smartconnpool/pool.go | Make wait a pointer to stabilize ConnPool size and add a fast-fail alignment assertion for 128-bit atomics. |
Comments suppressed due to low confidence (1)
go/pools/smartconnpool/pool.go:215
- The 16-byte alignment assertion in NewPool is currently unconditional, but the 128-bit atomic implementation that faults on misalignment is only built on amd64/arm64 (go/atomic2/atomic128.go). On other GOARCH values, atomic2 uses the spinlock implementation (go/atomic2/atomic128_spinlock.go), which does not require 16-byte alignment. As written, this can panic at startup on non-amd64/arm64 platforms even though they would otherwise function correctly.
Consider gating the check to amd64/arm64 (runtime.GOARCH) or moving the assert into an amd64/arm64 build-tagged file so it only runs where misalignment is actually fatal.
pool.config.maxCapacity = config.Capacity
pool.config.maxIdleCount = config.MaxIdleCount
pool.config.maxLifetime.Store(config.MaxLifetime.Nanoseconds())
pool.config.idleTimeout.Store(config.IdleTimeout.Nanoseconds())
pool.config.refreshInterval.Store(config.RefreshInterval.Nanoseconds())
pool.config.logWait = config.LogWait
pool.config.maxWaiters = config.MaxWaiters
pool.wait.init()
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #20308 +/- ##
===========================================
- Coverage 69.67% 60.40% -9.27%
===========================================
Files 1614 9 -1605
Lines 216793 1043 -215750
===========================================
- Hits 151044 630 -150414
+ Misses 65749 413 -65336
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…ress test Replace the synthetic pinned-waiter stress test with one built from real actors: requests whose deadlines expire while queued on the waitlist, live traffic queued behind them, and returner goroutines parked on the waitlist mutex. The mutex convoy that pins expired waiters in the list at production scale is recreated by holding the waitlist mutex from the test while the backlog expires, so the returners scan a list full of expired-but-listed waiters exactly as in the incident. The assertions are the customer-visible contract and are independent of timing: no request whose context has already expired may be handed a connection, and every live request must be served. On the unfixed code the test fails in milliseconds with connections delivered to dead requests; with the fix it passes deterministically. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
A waiter that wakes on context expiry or pool close could only leave the waitlist by scanning the entire list under the waitlist mutex to learn whether it was still a member — O(n) per exit, with ~18k entries in the production incident, all serialized on one mutex. The membership bit is already maintained by the list itself: go/list.Element tracks its owning list, set on insert and cleared on remove. Expose that as list.RemoveIfPresent, which removes the element and reports whether it did, instead of panicking like Remove. Both fallback paths in waitForConn now use it, so timed-out waiters exit in constant time and the backlog drains at mutex-handoff speed instead of scan speed. Remove keeps its panicking must-be-present contract for the returner's claim path, where absence would mean list corruption. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Expired waiters are skipped by the returner's aging loop, so they keep age == 0 until they remove themselves and inflated maybeStarvingCount, causing the starving worker to pop a connection it can't hand out. Skip them the same way tryReturnConnSlow does. Also replace the loop's stale comment, which described a force parameter that no longer exists. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
mattlord
left a comment
There was a problem hiding this comment.
LGTM. I'm not so sure about the backport... but I wouldn't block it.
The previous change skipped expired waiters during a return but left them on the waitlist, so under a timeout storm an expired prefix accumulated at the head of the list and every later return re-scanned it under the waitlist mutex -- O(prefix) per handoff. That is the same O(n)-under-mutex stall this PR set out to remove, relocated from the waiter's exit path to the returner's handoff path; a customer's closed-loop reproduction wedged on the patched branch. Have the returner evict expired waiters as it scans: remove them from the list and wake them with a nil. Disambiguation is by the received value -- a real connection is a handoff, a nil is an eviction -- so the conn channel is buffered (cap 1) and a waitResult helper maps a nil back to the context error. Because a listed waiter's buffer is always empty (only a returner sends, and only while removing the waiter from the list), both the eviction nil-send and the handoff send complete without blocking, including the eviction send made under the mutex. Add a bounded convoy stress test plus benchmarks covering the handoff, the empty-list fast path, and the expired-prefix scan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Description
A production v23 vttablet stalled completely under a timeout storm: ~18k goroutines serialized on the stream pool's waitlist mutex, with every pool connection trapped in the return path. The root cause is in the waitlist:
tryReturnConnhands a returned connection to the front-most waiter even when that waiter's context has already expired. Each such handoff parks the connection on a dead request — the unbuffered channel send blocks until the expired waiter gets through the contended mutex — the request then fails instantly and recycles the connection to the next expired waiter. Useful throughput drops to zero while new requests keep joining the list.The fix stores the request context on the waitlist entry, and
tryReturnConnSlowevicts waiters whose context has already expired as it scans for a handoff target: it removes them from the list and wakes them with anil. Disambiguation is by the value received on the channel — a real connection is a handoff, anilis an eviction — so the entry's channel is buffered (cap 1) and awaitResulthelper maps anilback to the request's own context error. Because a listed waiter's buffer is always empty (only a returner sends, and only while removing the waiter from the list), both the eviction wake-up and the handoff send complete without ever blocking — including the eviction send made under the waitlist mutex. If every waiter has expired, the connection goes back onto the stack where new requests grab it directly.An earlier revision of this PR skipped expired waiters but left them on the list, on the theory they'd remove themselves. Under a sustained timeout storm that only relocated the convoy: the expired entries pile up as a long prefix at the head of the list, and every later return re-scans that whole prefix under the mutex — the same O(n)-under-mutex stall, moved from the waiter's exit path to the returner's handoff path. A customer's closed-loop reproduction still wedged on that revision. Evicting during the scan keeps the list from accumulating a dead prefix at all.
This also removes the other half of the collapse: a timed-out waiter could only leave the waitlist by scanning the entire list under the mutex to learn whether it was still a member — O(n) per exit, ~18k entries deep in the incident.
go/listnow exposesRemoveIfPresent, an O(1) membership check via the owning-list pointer the list already maintains, and bothwaitForConnfallback paths use it, so the backlog drains at mutex-handoff speed instead of scan speed.Removekeeps its panicking must-be-present contract for the returner's claim path, where absence would mean list corruption.One surprise along the way: the new context field grew
ConnPoolpast an allocator size bucket where its 128-bit atomics lose their 16-byte alignment, which faults (SIGBUS) on arm64 and amd64. Go has no supported way to demand 16-byte alignment, so this had always worked by allocator accident. The waitlist now lives behind a pointer so its size can never changeConnPool's allocation again, and the constraint is documented onconnStacknext to the atomics it affects.Two tests guard the behavior. The first reproduces the production incident with real actors: requests whose deadlines expire while queued, live traffic queued behind them, and returners parked on the waitlist mutex (the test holds the mutex to play the role of the production convoy); on the old code it fails immediately with connections handed to dead requests, and passes deterministically with the fix. The second (
TestWaitlistConvoyDrainsUnderConcurrentReturns) is a bounded reproduction of the relocated convoy specifically: a large expired prefix with live waiters queued behind it and several concurrent returners — it asserts the list drains to empty and every live waiter is served, and fails on the skip-and-leave revision (the expired prefix is never removed).Benchmarks
Added benchmarks over the
waitForConn/tryReturnConnpaths —waitlist_bench_test.go(steady-state handoff at 1/8/64 waiters, and the empty-list fast path) andwaitlist_evict_bench_test.go(repeated returns against a list fronted by an expired prefix).Comparing this branch against the pre-eviction revision (
benchstat, 10 runs; the handoff was re-confirmed with the per-iteration allocation hoisted out andGOMAXPROCSpinned, 20 runs, to remove GC/scheduler jitter):The common handoff path is neutral — a buffered send to an already-parked receiver does the same direct hand-off as the unbuffered one, so buffering costs nothing here. The expired-prefix path is ~11× faster: that is the relocated convoy being removed — skip-and-leave re-scans the whole prefix on every return, eviction drains it once.
Backport Reasoning
This fixes a tablet-wide production outage: under a timeout storm the connection pool livelocks, the tablet serves zero queries, and only a restart recovers it. There is no configuration workaround. The defective handoff logic exists unchanged on all supported release branches, and the fix is small, self-contained to
go/pools/smartconnpoolandgo/list, and changes no public API or behavior other than no longer handing connections to requests whose context has expired. Requesting backport torelease-23.0(version the incident occurred on) andrelease-24.0.Related Issue(s)
Checklist
Deployment Notes
No deployment steps. Behavioral change: the connection pool no longer hands returned connections to requests whose context has already expired — those requests fail with the usual timeout error, and the connection goes to a live waiter or back into the pool.
AI Disclosure
This PR was written primarily by Claude Code (analysis of the goroutine dump, the fix, the tests, and the benchmarks), with human direction and review.