smartconnpool: avoid close deadlock with refresh reopen#20157
Conversation
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
There was a problem hiding this comment.
Pull request overview
Fixes a close-time deadlock in smartconnpool by ensuring CloseWithContext does not wait for worker goroutines while holding capacityMu, and by preventing refresh reopens / capacity changes from reviving a pool after it has been closed.
Changes:
- Update
CloseWithContextto releasecapacityMubefore waiting onpool.workers. - Make
reopen()exit early when the pool is closed, and makeSetCapacity()returnErrConnPoolClosedafter close. - Add regression + stress tests that exercise close-during-traffic with concurrent refresh reopen and capacity updates.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| go/pools/smartconnpool/pool.go | Adjusts close/reopen/capacity logic to avoid deadlock and prevent post-close resurrection. |
| go/pools/smartconnpool/pool_test.go | Adds focused regressions for close behavior after SetCapacity(0) and for SetCapacity after close. |
| go/pools/smartconnpool/stress_test.go | Adds a multi-cycle stress test that closes the pool under concurrent traffic, refresh reopen, and capacity changes. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #20157 +/- ##
===========================================
- Coverage 69.67% 54.74% -14.93%
===========================================
Files 1614 8 -1606
Lines 216793 937 -215856
===========================================
- Hits 151044 513 -150531
+ Misses 65749 424 -65325
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Mirror the cleanup applied to the other shutdown stress tests: fold the two watchdog-failure cleanup sequences into a small abort closure with a two-mode signature (pre-close vs. mid-close), rename WatchdogDelay to Watchdog so the constant matches the sibling helpers, inline the unused worker := i alias, and add a doc comment to TestStressCloseDuringTraffic describing the scenario and the invariants it checks. No behavior change. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
| @@ -310,6 +310,7 @@ func (pool *ConnPool[C]) CloseWithContext(ctx context.Context) error { | |||
| closeChan := *pool.close.Swap(nil) | |||
| close(closeChan) | |||
|
|
|||
| pool.capacityMu.Unlock() | |||
| pool.workers.Wait() | |||
There was a problem hiding this comment.
Good catch — fixed in c9149b6. setCapacity's early return now also requires pool.active <= newcap, so a CloseWithContext following a timed-out SetCapacity(0) re-enters the drain loop rather than short-circuiting. Added a regression test (TestCloseWithContextDrainsAfterTimedOutSetCapacityZero) that holds a conn, drives SetCapacity(0) with a cancelled context to force the timeout, then asserts CloseWithContext returns an error rather than silently completing.
…y(0) setCapacity short-circuits when oldcap == newcap. That's fine for a genuine no-op SetCapacity, but it also fires after a prior SetCapacity(0) timed out with borrowed conns still out: capacity is swapped to 0 before the drain loop, so the loop's timeout leaves the pool at capacity=0 with active>0. A subsequent CloseWithContext then calls setCapacity(0) again, hits the early return without re-entering the drain loop, and returns nil despite borrowed conns still being out — in violation of CloseWithContext's documented contract. Tighten the early return to require pool.active <= newcap as well, so a follow-up call resumes draining when the pool is over the target. Borrowed conns released after CloseWithContext returns still get closed cleanly via tryReturnConn's closed-pool branch; this just fixes the caller-visible return value. Add a regression test that holds a conn, runs SetCapacity(0) with a cancelled context to force the timeout, then asserts CloseWithContext also returns an error rather than silently completing. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
mattlord
left a comment
There was a problem hiding this comment.
LGTM! Thanks, @arthurschreiber !
…l-close-lock-order Signed-off-by: Arthur Schreiber <arthur@planetscale.com> # Conflicts: # go/pools/smartconnpool/pool.go
The cherry-pick of 4061458 left conflict markers in pool.go. The upstream PR added an early-return check to tryReturnConn, but the function signature differs between branches: main has `(conn *Pooled[C], updateIdleTime bool)` while release-24.0 has `(conn *Pooled[C])`. Keep release-24.0's signature and apply the new close-state check on top. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
The cherry-pick of 4061458 left conflict markers in pool.go. The upstream PR added an early-return check to tryReturnConn, but the function signature differs between branches: main has `(conn *Pooled[C], updateIdleTime bool)` while release-23.0 has `(conn *Pooled[C])`. Keep release-23.0's signature and apply the new close-state check on top. Signed-off-by: Arthur Schreiber <arthur@planetscale.com>
Description
Fixes a close-time deadlock in
smartconnpoolwhereCloseWithContextcould wait for background workers while still holdingcapacityMu. A refresh worker already insidereopen()could then block on that same mutex, leaving close stuck.This also keeps closed/open state separate from capacity changes in the affected paths.
SetCapacitycan still update desired capacity, butGet,getNew,put(nil), and returned idle connections now check whether the pool is closed before opening or retaining connections. The stress test avoids usingSetCapacity(0)as a synthetic close signal so it exercises realCloseWithContextbehavior.Related Issue(s)
Related to #20122.
Backport justification: this fixes a rare
smartconnpoolshutdown deadlock whenCloseWithContextraces with refresh-drivenreopen(). The race can stall pool shutdown in production, and the fix is contained to connection-pool close/open state handling.Checklist
Local test runs:
go test -count=1 -run '^(TestTxPoolBeginWithPoolConnectionError_Errno2006_Transient)$' -timeout 180s ./go/vt/vttablet/tabletservergo test -count=1 ./go/pools/smartconnpoolgo test -count=1 -race -run '^(TestCloseWithContextAfterSetCapacityZeroClosesPool|TestStressCloseDuringTraffic)$' -timeout 240s ./go/pools/smartconnpoolgo test -count=1 -race -run '^(TestTxPoolBeginWithPoolConnectionError_Errno2006_Transient)$' -timeout 180s ./go/vt/vttablet/tabletserverDeployment Notes
No deployment steps required.
AI Disclosure
This PR was written with AI assistance from Codex, with direction and review from Arthur.