Bug Report: vttablet stalls completely when the connection pool waitlist fills with timed-out requests #20310

@arthurschreiber

Overview of the Issue

A production v23 vttablet stalled completely under load: it stopped serving all queries and required a restart. A goroutine dump taken during the stall shows the entire tablet serialized on a single mutex inside smartconnpool — and the root cause is that the pool's waitlist hands returned connections to requests whose context has already expired.

Anatomy of the stall (from the goroutine dump)

Of 19,177 goroutines, 18,264 were blocked on the stream pool waitlist's wl.mu, all under TabletServer.StreamExecute. Broken down by blocking site (line numbers from release-23.0, go/pools/smartconnpool/waitlist.go):

| Count | Site | What they are |
|---|---|---|
| 9,342 | waitlist.go:55 | new requests trying to join the waitlist |
| 8,631 | waitlist.go:89 | timed-out waiters (ctx.Done) trying to remove themselves |
| 288 | waitlist.go:149 | connection returners (Recycle → tryReturnConnSlow) waiting for the mutex |
| 152 | waitlist.go:179 | returners blocked in the unbuffered target.Value.conn <- conn handoff |
| 10 | waitlist.go:60 | waiters actually parked healthily in the select |

Critically, zero goroutines were doing MySQL I/O: every single pool connection was trapped in the return path, either queued for the mutex or parked mid-handoff. MySQL was healthy; the pool was eating its own connections.
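
A breakdown like the one above can be produced mechanically from a text goroutine dump. The following is a minimal sketch, assuming the standard "goroutine N [state]:" header format; the tally function, the regular expressions, and the sampleDump constant (trimmed from the log fragments in this report) are illustrative names, not part of Vitess.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

var (
	// Captures the wait state, stopping before any ", N minutes" suffix.
	headerRE = regexp.MustCompile(`^goroutine \d+ \[([^\],]+)`)
	// Captures the first indented file:line frame inside waitlist.go.
	siteRE = regexp.MustCompile(`^\s+\S*smartconnpool/(waitlist\.go:\d+)`)
)

// sampleDump is a trimmed, hypothetical input for demonstration only.
const sampleDump = `goroutine 11279468700 [sync.Mutex.Lock]:
vitess.io/vitess/go/pools/smartconnpool.(*waitlist[...]).waitForConn(...)
        vitess.io/vitess/go/pools/smartconnpool/waitlist.go:55

goroutine 11279506485 [sync.Mutex.Lock]:
vitess.io/vitess/go/pools/smartconnpool.(*waitlist[...]).waitForConn(...)
        vitess.io/vitess/go/pools/smartconnpool/waitlist.go:89

goroutine 11279341158 [chan send]:
vitess.io/vitess/go/pools/smartconnpool.(*waitlist[...]).tryReturnConnSlow(...)
        vitess.io/vitess/go/pools/smartconnpool/waitlist.go:179
`

// tally counts goroutines per "state @ first waitlist.go frame".
func tally(dump string) map[string]int {
	counts := make(map[string]int)
	state, seen := "", false
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		line := sc.Text()
		if m := headerRE.FindStringSubmatch(line); m != nil {
			state, seen = m[1], false // a new goroutine starts here
			continue
		}
		if state == "" || seen {
			continue // only the first matching frame per goroutine counts
		}
		if m := siteRE.FindStringSubmatch(line); m != nil {
			counts[state+" @ "+m[1]]++
			seen = true
		}
	}
	return counts
}

func main() {
	counts := tally(sampleDump)
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%6d  %s\n", counts[k], k)
	}
}
```

Keying on the first frame inside waitlist.go, rather than the top frame, skips the runtime and sync frames that sit above it in a full dump.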

Root cause

When a connection is returned, tryReturnConnSlow picks a handoff target by scanning the waitlist for a setting match or an over-age waiter, falling back to the front-most waiter — without ever checking whether that waiter's context has expired. Under a timeout storm the waitlist is dominated by waiters whose deadlines have already passed but which haven't yet removed themselves (self-removal requires reacquiring the same contended mutex and doing an O(n) membership scan). The result is a self-sustaining cycle:

  1. A connection is returned and handed to the front-most waiter — almost always expired.
  2. The handoff channel is unbuffered, so the returner blocks until the expired waiter crawls through the mutex queue, fails to find itself in the list, and performs the protocol's fallback receive. During that whole time the connection serves nobody (the 152 chan send goroutines above).
  3. The expired waiter's Get then "succeeds", returning a connection to a request whose context is dead. The request fails immediately and recycles the connection — back to step 1, to the next expired waiter.
  4. Meanwhile every self-removal does an O(n) scan of an ~18k-entry list under the mutex, the mutex is in starvation-mode FIFO, and new requests (plus client retries) keep joining faster than the list can drain.
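
The missing check in step 1 can be pictured with a small model. This is a sketch only: the waiter type and the pickFrontmost, pickLive and buildBacklog helpers are hypothetical names chosen for illustration, not the smartconnpool code and not a proposed patch.

```go
package main

import (
	"context"
	"fmt"
)

// waiter is a hypothetical stand-in for a waitlist entry.
type waiter struct {
	id  int
	ctx context.Context
}

// buildBacklog returns nExpired already-cancelled waiters followed by
// nLive waiters whose context is still valid.
func buildBacklog(nExpired, nLive int) []*waiter {
	list := make([]*waiter, 0, nExpired+nLive)
	for i := 0; i < nExpired; i++ {
		ctx, cancel := context.WithCancel(context.Background())
		cancel() // expired, but still listed
		list = append(list, &waiter{id: i, ctx: ctx})
	}
	for i := 0; i < nLive; i++ {
		list = append(list, &waiter{id: nExpired + i, ctx: context.Background()})
	}
	return list
}

// pickFrontmost models the behaviour described above: the front-most
// waiter is chosen without looking at its context.
func pickFrontmost(list []*waiter) *waiter {
	if len(list) == 0 {
		return nil
	}
	return list[0]
}

// pickLive skips waiters whose context is already done and returns the
// first one that can still use a connection, or nil if there is none.
func pickLive(list []*waiter) *waiter {
	for _, w := range list {
		if w.ctx.Err() == nil {
			return w
		}
	}
	return nil
}

func main() {
	list := buildBacklog(3, 1)
	fmt.Println("front-most pick expired:", pickFrontmost(list).ctx.Err() != nil) // true
	fmt.Println("live pick id:", pickLive(list).id)                               // 3
}
```

With a backlog of three expired waiters ahead of one live waiter, the front-most pick lands on a dead request, while the context-aware pick reaches the live one.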

Note that this is a livelock, not a deadlock: none of the mutex waiters carried a ", N minutes" wait-duration tag in the dump, so the lock was turning over. But every handoff went to a dead request, so useful throughput was exactly zero and the tablet appeared completely hung from the outside.

Impact

  • The affected tablet serves no queries at all; only a restart recovers it.
  • Any pool built on smartconnpool is affected; the incident hit the stream connection pool via StreamExecute traffic.
  • The code has been present since smartconnpool was introduced; the dump is from v23.

Reproduction Steps

The full production stall is emergent at scale (thousands of queued goroutines and a convoyed waitlist mutex; at small scale the system self-heals because expired waiters remove themselves quickly). The contract violation at its core can be reproduced deterministically in a smartconnpool test:

  1. Open a pool with small capacity and Get all connections so subsequent requests queue on the waitlist.
  2. Queue a backlog of requests whose contexts will be cancelled while they wait, then a few live requests behind them.
  3. Hold wl.mu (standing in for the production mutex convoy), park the returning goroutines on it by recycling the held connections, then cancel the backlog's contexts so the expired waiters pile up behind the returners.
  4. Release the mutex. The returners scan a waitlist full of expired-but-still-listed waiters, and connections are delivered to requests whose deadlines have already passed (Get returns a connection with a nil error despite the caller's context being dead) instead of going to the live requests or back into the pool.
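
The contract violation in step 4 can be shown in isolation with a toy model. The sketch below does not use the smartconnpool API; conn, waiter, toyPool and every method on them are hypothetical stand-ins that mimic only the handoff protocol described in this report (unbuffered channel, self-removal under the mutex, fallback receive).

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// conn, waiter and toyPool are hypothetical stand-ins, not Vitess types.
type conn struct{ id int }

type waiter struct {
	ctx context.Context
	ch  chan *conn // unbuffered handoff channel
}

type toyPool struct {
	mu      sync.Mutex
	waiters []*waiter
}

// join appends a new waiter to the list.
func (p *toyPool) join(ctx context.Context) *waiter {
	w := &waiter{ctx: ctx, ch: make(chan *conn)}
	p.mu.Lock()
	p.waiters = append(p.waiters, w)
	p.mu.Unlock()
	return w
}

// wait parks until a connection arrives or the context ends. On expiry it
// tries to remove itself; if it is no longer listed, a returner has already
// chosen it, so it must complete the handoff (the fallback receive).
func (p *toyPool) wait(w *waiter) (*conn, error) {
	select {
	case c := <-w.ch:
		return c, nil
	case <-w.ctx.Done():
	}
	p.mu.Lock()
	found := false
	for i, x := range p.waiters {
		if x == w {
			p.waiters = append(p.waiters[:i], p.waiters[i+1:]...)
			found = true
			break
		}
	}
	p.mu.Unlock()
	if found {
		return nil, w.ctx.Err()
	}
	return <-w.ch, nil
}

// popFrontLocked picks the front-most waiter without checking its context.
// The caller must hold p.mu.
func (p *toyPool) popFrontLocked() *waiter {
	if len(p.waiters) == 0 {
		return nil
	}
	w := p.waiters[0]
	p.waiters = p.waiters[1:]
	return w
}

type outcome struct {
	conn   *conn
	err    error
	ctxErr error
}

func reproduce() outcome {
	p := &toyPool{}
	ctx, cancel := context.WithCancel(context.Background())
	w := p.join(ctx)
	done := make(chan outcome)
	go func() {
		c, err := p.wait(w)
		done <- outcome{c, err, ctx.Err()}
	}()

	p.mu.Lock()                  // stands in for the contended mutex
	cancel()                     // the waiter expires but cannot self-remove yet
	target := p.popFrontLocked() // the returner picks the expired waiter
	p.mu.Unlock()
	target.ch <- &conn{id: 1} // blocks until the expired waiter receives

	return <-done
}

func main() {
	o := reproduce()
	// Prints: got conn: true, err: <nil>, ctx err: context canceled
	fmt.Printf("got conn: %v, err: %v, ctx err: %v\n", o.conn != nil, o.err, o.ctxErr)
}
```

Because the waiter is popped while the mutex is held, it can never find itself in the list, so the model always ends with a connection delivered alongside a nil error to a caller whose context is already cancelled.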

Binary Version

vttablet, v23 (observed in production)
The defective handoff logic exists on main and all supported release branches.

Operating System and Environment details

Linux, amd64
The defect is in pure-Go pool logic and is architecture- and platform-independent.

Log Fragments

goroutine 11279468700 [sync.Mutex.Lock]:           # x9,342 - joining the waitlist
vitess.io/vitess/go/pools/smartconnpool.(*waitlist[...]).waitForConn(...)
        vitess.io/vitess/go/pools/smartconnpool/waitlist.go:55
vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).getStreamConn(...)
vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).StreamExecute(...)

goroutine 11279506485 [sync.Mutex.Lock]:           # x8,631 - timed out, trying to self-remove
vitess.io/vitess/go/pools/smartconnpool.(*waitlist[...]).waitForConn(...)
        vitess.io/vitess/go/pools/smartconnpool/waitlist.go:89

goroutine 11279341158 [chan send]:                 # x152 - connection parked on a dead request
vitess.io/vitess/go/pools/smartconnpool.(*waitlist[...]).tryReturnConnSlow(...)
        vitess.io/vitess/go/pools/smartconnpool/waitlist.go:179
vitess.io/vitess/go/pools/smartconnpool.(*Pooled[...]).Recycle(...)
vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).Stream(...)
