Bug Report: vttablet stalls completely when the connection pool waitlist fills with timed-out requests #20310

### Overview of the Issue

A production v23 vttablet stalled completely under load: it stopped serving all queries and required a restart. A goroutine dump taken during the stall shows the entire tablet serialized on a single mutex inside `smartconnpool` — and the root cause is that the pool's waitlist hands returned connections to requests whose context has already expired.

**Anatomy of the stall (from the goroutine dump)**

Of 19,177 goroutines, 18,264 were blocked on the stream pool waitlist's `wl.mu`, all under `TabletServer.StreamExecute`. Broken down by blocking site (line numbers from `release-23.0`, `go/pools/smartconnpool/waitlist.go`):

| Count | Site | What they are |
|---|---|---|
| 9,342 | `waitlist.go:55` | new requests trying to join the waitlist |
| 8,631 | `waitlist.go:89` | timed-out waiters (`ctx.Done`) trying to remove themselves |
| 288 | `waitlist.go:149` | connection returners (`Recycle` → `tryReturnConnSlow`) waiting for the mutex |
| 152 | `waitlist.go:179` | returners blocked in the unbuffered `target.Value.conn <- conn` handoff |
| 10 | `waitlist.go:60` | waiters actually parked healthily in the select |

Critically, **zero goroutines were doing MySQL I/O**: every single pool connection was trapped in the return path, either queued for the mutex or parked mid-handoff. MySQL was healthy; the pool was eating its own connections.

**Root cause**

When a connection is returned, `tryReturnConnSlow` picks a handoff target by scanning the waitlist for a setting match or an over-age waiter, falling back to the front-most waiter — without ever checking whether that waiter's context has expired. Under a timeout storm the waitlist is dominated by waiters whose deadlines have already passed but which haven't yet removed themselves (self-removal requires reacquiring the same contended mutex and doing an O(n) membership scan). The result is a self-sustaining cycle:

1. A connection is returned and handed to the front-most waiter — almost always expired.
2. The handoff channel is unbuffered, so the returner blocks until the expired waiter crawls through the mutex queue, fails to find itself in the list, and performs the protocol's fallback receive. During that whole time the connection serves nobody (the 152 `chan send` goroutines above).
3. The expired waiter's `Get` then "succeeds", returning a connection to a request whose context is dead. The request fails immediately and recycles the connection — back to step 1, to the next expired waiter.
4. Meanwhile every self-removal does an O(n) scan of an ~18k-entry list under the mutex, the mutex is in starvation-mode FIFO, and new requests (plus client retries) keep joining faster than the list can drain.

Note this is a livelock, not a deadlock: none of the mutex waiters showed a `, N minutes` tag, so the lock was turning over — but every handoff went to a dead request, so useful throughput was exactly zero while the tablet appeared completely hung from the outside.
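The target selection at the heart of this cycle can be illustrated with a toy model. The sketch below is not the `smartconnpool` code and not a proposed patch: `waiter`, `pickFront`, `pickLive`, and `demo` are hypothetical names invented for illustration, connections are plain strings, and the setting-match and over-age scans are left out. It only contrasts a front-most fallback that ignores the waiter's context with the same scan plus a `ctx.Err()` check.

```go
package main

import (
	"context"
	"fmt"
)

// waiter is a hypothetical stand-in for a waitlist entry: the request's
// context plus the unbuffered channel used to hand it a connection.
type waiter struct {
	name string
	ctx  context.Context
	conn chan string
}

// pickFront models the fallback described above: take the front-most
// waiter without looking at its context.
func pickFront(list []*waiter) *waiter {
	if len(list) == 0 {
		return nil
	}
	return list[0]
}

// pickLive is the same scan with a context check added: waiters whose
// context has already ended are skipped.
func pickLive(list []*waiter) *waiter {
	for _, w := range list {
		if w.ctx.Err() == nil {
			return w
		}
	}
	return nil
}

// demo builds a waitlist of two expired waiters followed by one live waiter
// and reports which waiter each strategy selects.
func demo() (front, live string) {
	expired, cancel := context.WithCancel(context.Background())
	cancel() // the request is dead, but its waiter is still listed

	list := []*waiter{
		{name: "expired-1", ctx: expired, conn: make(chan string)},
		{name: "expired-2", ctx: expired, conn: make(chan string)},
		{name: "live-1", ctx: context.Background(), conn: make(chan string)},
	}
	return pickFront(list).name, pickLive(list).name
}

func main() {
	front, live := demo()
	fmt.Println("front-most pick:", front)
	fmt.Println("context-aware pick:", live)
}
```

In this model the front-most pick lands on `expired-1` while the context-aware pick reaches `live-1`. A check like this is inherently racy in a real pool, since a waiter can still expire between the check and the channel send, so it would narrow the window rather than close it.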

**Impact**

- The affected tablet serves no queries at all; only a restart recovers it.
- Any pool built on `smartconnpool` is affected; the incident hit the stream connection pool via `StreamExecute` traffic.
- The code has been present since `smartconnpool` was introduced; the dump is from v23.

### Reproduction Steps

The full production stall is emergent at scale (thousands of queued goroutines and a convoyed waitlist mutex; at small scale the system self-heals because expired waiters remove themselves quickly). The contract violation at its core can be reproduced deterministically in a `smartconnpool` test:

1. Open a pool with small capacity and `Get` all connections so subsequent requests queue on the waitlist.
2. Queue a backlog of requests whose contexts will be cancelled while they wait, then a few live requests behind them.
3. Hold `wl.mu` (standing in for the production mutex convoy), park the returning goroutines on it by recycling the held connections, then cancel the backlog's contexts so the expired waiters pile up behind the returners.
4. Release the mutex. The returners scan a waitlist full of expired-but-still-listed waiters, and connections are delivered to requests whose deadlines have already passed (`Get` returns a connection with a nil error despite the caller's context being dead) instead of going to the live requests or back into the pool.
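The outcome in step 4 can also be shown in isolation with a self-contained toy model. This is a sketch under stated assumptions rather than a test against the real pool: `waitlist`, `waiter`, `join`, `remove`, `await`, `pickFront`, and `reproduce` are hypothetical names, connections are plain strings, and joining and waiting are split into separate calls so the ordering (the returner picks the waiter before the waiter removes itself) is forced instead of depending on mutex scheduling.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type waiter struct {
	ctx  context.Context
	conn chan string // unbuffered handoff channel
}

type waitlist struct {
	mu   sync.Mutex
	list []*waiter
}

// join appends a waiter for ctx to the back of the list.
func (wl *waitlist) join(ctx context.Context) *waiter {
	w := &waiter{ctx: ctx, conn: make(chan string)}
	wl.mu.Lock()
	defer wl.mu.Unlock()
	wl.list = append(wl.list, w)
	return w
}

// remove drops w from the list with an O(n) scan under the mutex.
// It reports false when w is no longer listed.
func (wl *waitlist) remove(w *waiter) bool {
	wl.mu.Lock()
	defer wl.mu.Unlock()
	for i, e := range wl.list {
		if e == w {
			wl.list = append(wl.list[:i], wl.list[i+1:]...)
			return true
		}
	}
	return false
}

// pickFront pops the front-most waiter without checking its context.
func (wl *waitlist) pickFront() *waiter {
	wl.mu.Lock()
	defer wl.mu.Unlock()
	if len(wl.list) == 0 {
		return nil
	}
	w := wl.list[0]
	wl.list = wl.list[1:]
	return w
}

// await blocks until w receives a connection or its context ends.
func (wl *waitlist) await(w *waiter) (string, error) {
	select {
	case c := <-w.conn:
		return c, nil
	case <-w.ctx.Done():
	}
	if wl.remove(w) {
		return "", w.ctx.Err()
	}
	// No longer listed: a returner already chose this waiter, so it must
	// take the connection even though its context is dead.
	return <-w.conn, nil
}

// reproduce forces the problematic ordering and returns what the waiter sees.
func reproduce() (conn string, getErr, ctxErr error) {
	wl := &waitlist{}
	ctx, cancel := context.WithCancel(context.Background())
	w := wl.join(ctx)
	cancel() // the request expires while still listed

	target := wl.pickFront()                // the returner gets to the list first
	go func() { target.conn <- "conn-1" }() // blocks until the waiter receives

	conn, getErr = wl.await(w)
	return conn, getErr, ctx.Err()
}

func main() {
	conn, getErr, ctxErr := reproduce()
	fmt.Printf("conn=%q err=%v ctx.Err()=%v\n", conn, getErr, ctxErr)
}
```

Whichever `select` case fires first, the waiter in this model ends up holding `conn-1` with a nil error while `ctx.Err()` is non-nil, which is the contract violation described in step 4.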

### Binary Version

```sh
vttablet, v23 (observed in production)
The defective handoff logic exists on main and all supported release branches.
```

### Operating System and Environment details

```sh
Linux, amd64
The defect is in pure-Go pool logic and is architecture- and platform-independent.
```

### Log Fragments

```sh
goroutine 11279468700 [sync.Mutex.Lock]:           # x9,342 - joining the waitlist
vitess.io/vitess/go/pools/smartconnpool.(*waitlist[...]).waitForConn(...)
        vitess.io/vitess/go/pools/smartconnpool/waitlist.go:55
vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).getStreamConn(...)
vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).StreamExecute(...)

goroutine 11279506485 [sync.Mutex.Lock]:           # x8,631 - timed out, trying to self-remove
vitess.io/vitess/go/pools/smartconnpool.(*waitlist[...]).waitForConn(...)
        vitess.io/vitess/go/pools/smartconnpool/waitlist.go:89

goroutine 11279341158 [chan send]:                 # x152 - connection parked on a dead request
vitess.io/vitess/go/pools/smartconnpool.(*waitlist[...]).tryReturnConnSlow(...)
        vitess.io/vitess/go/pools/smartconnpool/waitlist.go:179
vitess.io/vitess/go/pools/smartconnpool.(*Pooled[...]).Recycle(...)
vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).Stream(...)
```

