Skip to content

vtgate: use fixed retry interval for health check reconnection#19967

Open
khkim6040 wants to merge 9 commits into
vitessio:mainfrom
khkim6040:fix/healthcheck-retry-backoff-19894
Open

vtgate: use fixed retry interval for health check reconnection#19967
khkim6040 wants to merge 9 commits into
vitessio:mainfrom
khkim6040:fix/healthcheck-retry-backoff-19894

Conversation

@khkim6040

@khkim6040 khkim6040 commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Description

The checkConn reconnect loop used to double retryDelay on every stream
failure (capping at healthCheckTimeout, default 60s). After a prolonged
tablet outage each vtgate sits at a different point on that backoff curve, so a
recovered tablet is rediscovered anywhere from a few seconds to ~60s later
depending on the vtgate — the stagger described in #19894.

This PR drops the exponential backoff and retries on a fixed base interval
(--healthcheck-retry-delay, default 5s) with +/-25% jitter. The fixed base
keeps rediscovery prompt; the jitter stops every vtgate from reconnecting to a
recovered tablet in lockstep (a thundering-herd we'd otherwise create by making
the interval perfectly fixed).

Related Issue(s)

Fixes #19894

Checklist

  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

--healthcheck-retry-delay (default 5s) now drives a fixed retry interval with
+/-25% jitter instead of the initial value of an exponential backoff. Operators
who relied on the backoff growing to throttle retries during long outages may
want to raise it; 5s is fine for most deployments.

AI Disclosure

Written with Claude Code — I gave the direction and chose the design (fixed
interval + jitter over a pure fixed interval); Claude wrote the code and tests.

Replace exponential backoff in checkConn with a fixed retry interval
(default 5s). Previously, the retry delay doubled on each failure up to
healthCheckTimeout (default 60s), causing vtgates to take up to 60s to
rediscover tablets after prolonged outages.

Fixes vitessio#19894

Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
Add TestHealthCheckRetryDelayIsFixed to verify that repeated stream
failures do not cause the retry delay to grow. The test sends multiple
consecutive errors and asserts that the tablet is still rediscovered
within a short, fixed window once it recovers.

Also update a stale comment referencing exponential backoff in
TestHealthCheckTimeout.

Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
Copilot AI review requested due to automatic review settings April 27, 2026 12:22
@github-actions github-actions Bot added this to the v25.0.0 milestone Apr 27, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Apr 27, 2026
@vitess-bot

vitess-bot Bot commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes VTGate’s tablet healthcheck reconnection behavior to use a fixed retry interval (instead of exponential backoff) so recovered tablets are rediscovered promptly and consistently across vtgate instances.

Changes:

  • Removed exponential backoff logic in tabletHealthCheck.checkConn, keeping retries at a fixed hc.retryDelay.
  • Added a regression test intended to verify the retry delay remains fixed across repeated stream failures.
  • Updated a stale test comment referencing exponential backoff.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
go/vt/discovery/tablet_health_check.go Removes exponential backoff in the StreamHealth reconnect loop in favor of a fixed retry interval.
go/vt/discovery/healthcheck_test.go Adds a new regression test for fixed retry delay behavior and updates an existing test comment.

Comment on lines +384 to +388
Serving: true,
PrimaryTermStartTimestamp: 0,
RealtimeStats: &querypb.RealtimeStats{ReplicationLagSeconds: 1, CpuUsage: 0.2},
}
input <- shr

Copilot AI Apr 27, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input <- shr can block until fakeConn.StreamHealth is running again (because input/hcChan is unbuffered). That means the test’s 100ms timeout starts after the reconnect happens, so it won’t actually catch an exponential-backoff regression. Make the health stream channel buffered (e.g. size 1) or send shr asynchronously and start the timer before/while sending so the assertion includes the retry sleep time.

Copilot uses AI. Check for mistakes.
Comment on lines 319 to 327
// Streaming RPC failed e.g. because vttablet was restarted or took too long.
// Sleep until the next retry is up or the context is done/canceled.
// We use a fixed retry interval instead of exponential backoff so that
// vtgate rediscovers recovered tablets promptly. See #19894.
select {
case <-thc.ctx.Done():
return
case <-time.After(retryDelay):
// Exponentially back-off to prevent tight-loop.
retryDelay *= 2
// Limit the retry delay backoff to the health check timeout
if retryDelay > hc.healthCheckTimeout {
retryDelay = hc.healthCheckTimeout
}
}

Copilot AI Apr 27, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the switch to a fixed retry interval, retryDelay now directly controls how aggressively vtgate reconnects after stream failures. However, vtgate’s --healthcheck-retry-delay default is currently 2ms (see go/vt/vtgate/vtgate.go and generated flags docs), which would cause very tight reconnect loops and time.After allocations during outages. Consider updating the vtgate default to a saner value (e.g. discovery’s 5s default) and/or enforcing a reasonable minimum retryDelay here to avoid runaway retries.

Copilot uses AI. Check for mistakes.
@promptless

promptless Bot commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Promptless prepared a documentation update related to this change.

Triggered by this PR

Added documentation to the vtgate configuration guide explaining that vtgate now uses a fixed retry interval (controlled by --healthcheck-retry-delay, default 5 seconds) when reconnecting to tablets after connection loss, replacing the previous exponential backoff behavior.

Review: Document vtgate health check fixed retry interval

@github-actions

Copy link
Copy Markdown
Contributor

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

@github-actions github-actions Bot added the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label May 28, 2026
khkim6040 added 2 commits June 1, 2026 21:04
Replace the fixed retry interval in checkConn with the same base delay plus
+/-25% jitter, and remove the now-dead local retryDelay variable and its
no-op reset. A purely fixed interval re-synchronizes vtgate instances that
started retrying together during a shared outage, so they would all reconnect
to a recovered tablet in lockstep. Jittering the interval spreads those
reconnects out while still rediscovering recovered tablets promptly. See vitessio#19894.

Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
Rename TestHealthCheckRetryDelayIsFixed to TestHealthCheckRetryDelayIsBounded
since the interval now carries jitter rather than being strictly fixed, and
update its comments accordingly. Replace t.Fatal with require.Fail per the
project test conventions, and widen the timeout to 30s: the success path
returns almost immediately, so a healthy run never approaches it, while a
regression that re-grows the delay still trips it without CI flakiness.

Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
@khkim6040

Copy link
Copy Markdown
Contributor Author

Reworked this from a pure fixed interval to a fixed base (5s) with a +- 25% jitter.

A perfectly fixed interval re-synchronizes the vtgates that started retrying together during a shared outage, so they'd all reconnect to a recovered tablet in lockstep. The jitter keeps rediscovery fast (#19894) while spreading those reconnects out, so no thundering herd. The dead retryDelay reset is gone and the regression test is updated.

@deepthi when you have a moment — I went with fixed+jitter here, but if you'd prefer keeping the backoff with a lower cap instead, happy to switch. Does this direction look right to you?

@github-actions github-actions Bot removed the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label Jun 2, 2026
@mattlord mattlord added Component: Cluster management Component: Query Serving Type: Enhancement Logical improvement (somewhere between a bug and feature) and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jun 2, 2026
@codecov

codecov Bot commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.42%. Comparing base (70c7a72) to head (87bbe40).
⚠️ Report is 323 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main   #19967       +/-   ##
===========================================
+ Coverage   69.67%   84.42%   +14.74%     
===========================================
  Files        1614      319     -1295     
  Lines      216793    57403   -159390     
===========================================
- Hits       151044    48460   -102584     
+ Misses      65749     8943    -56806     
Flag Coverage Δ
partial 84.42% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@mattlord mattlord left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In go/vt/vtgate/vtgate.go:88, go/vt/discovery/tablet_health_check.go:326 the PR makes --healthcheck-retry-delay the steady-state retry interval, but vtgate’s default is still 2ms, not the 5s claimed in the PR/issue. With jitter, a default vtgate will now retry every ~1.5-2.5ms per down tablet for the full outage, creating a very hot reconnect loop and time.After allocation churn across the fleet. Please update the vtgate default to discovery.DefaultHealthCheckRetryDelay/5s and regenerate flag docs, or enforce a sane minimum before using the delay as a fixed interval.

In go/vt/discovery/healthcheck_test.go:386-396 the regression test still does not measure the retry sleep after recovery. input <- shr uses an unbuffered channel, so it blocks until the healthcheck goroutine has already reconnected; the timeout starts only after the old exponential backoff would have elapsed. This means the test would pass even if exponential backoff were restored. Make the channel buffered or send the healthy response asynchronously and start timing before the send can unblock.

khkim6040 added 2 commits June 4, 2026 23:17
Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
Copilot AI review requested due to automatic review settings June 4, 2026 14:19

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment on lines +335 to +341
func retryInterval(base time.Duration) time.Duration {
if base <= 0 {
return base
}
// Spread uniformly within [base-base/4, base+base/4).
return base - base/4 + time.Duration(rand.Int64N(int64(base/2)))
}
Comment on lines +32 to +34
const base = 5 * time.Second
lower := time.Duration(float64(base) * 0.75)
upper := time.Duration(float64(base) * 1.25)
Comment on lines +400 to +411
start := time.Now()
input <- shr

// The fixed retry delay (~10ms +/- jitter) rediscovers the tablet within a
// single cycle. The old exponential backoff would sleep ~320ms after
// numErrors failures (10ms * 2^5), so a regression that regrows the delay
// blows past this bound. The 5s arm only trips on a true hang.
select {
case result := <-resultChan:
assert.True(t, result.Serving, "tablet should be serving after recovery")
assert.Less(t, time.Since(start), 100*time.Millisecond,
"rediscovery took too long after recovery; retry delay may be growing exponentially")
Comment thread go/vt/vtgate/vtgate.go
Comment on lines +87 to +90
// healthCheckRetryDelay is the fixed interval between healthcheck reconnection
// attempts. Since it is now a steady-state interval rather than a backoff seed,
// it defaults to discovery's 5s instead of a few milliseconds.
healthCheckRetryDelay = discovery.DefaultHealthCheckRetryDelay
Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
@khkim6040

Copy link
Copy Markdown
Contributor Author

Thanks @mattlord! Fixed in 2ac3c00 and f591d6d:

  • Default 2ms -> 5s (discovery.DefaultHealthCheckRetryDelay) + regenerated flag docs.
  • Regression test now buffers the channel and times before the send, so it actually measures the retry sleep and fails if backoff returns.

The fixed 5s health check retry interval can leave vtgate without a reconnected
stream when the DML runs, so accept "is either down or nonexistent" alongside
"wrong tablet type" as the VT15001 reason.

Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
Copilot AI review requested due to automatic review settings June 13, 2026 08:01

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment on lines 322 to 326
select {
case <-thc.ctx.Done():
return
case <-time.After(retryDelay):
// Exponentially back-off to prevent tight-loop.
retryDelay *= 2
// Limit the retry delay backoff to the health check timeout
if retryDelay > hc.healthCheckTimeout {
retryDelay = hc.healthCheckTimeout
}
case <-time.After(retryInterval(hc.retryDelay)):
}
Comment on lines +335 to +345
func retryInterval(base time.Duration) time.Duration {
// half is the jitter span passed to rand.Int64N, which panics on n <= 0. For
// a non-positive base, or one so small that base/2 truncates to zero (sub-2ns),
// there is no room to jitter, so return the base unchanged.
half := base / 2
if half <= 0 {
return base
}
// Spread uniformly within [base-base/4, base+base/4).
return base - base/4 + time.Duration(rand.Int64N(int64(half)))
}
Comment on lines +370 to +382
fc := createFakeConn(tablet, input)
fc.errCh = make(chan error)
hc.AddTablet(tablet)

// Drain the initial not-serving notification from AddTablet.
<-resultChan

// Send multiple consecutive stream errors to simulate a prolonged outage
// where the tablet is unreachable for many retry cycles.
const numErrors = 8 // with exponential backoff the post-recovery sleep would reach ~1.28s (10ms*2^7)
for range numErrors {
fc.errCh <- errors.New("connection refused")
<-resultChan // drain the not-serving update
…-backoff-19894

Signed-off-by: Gwanho Kim <khkim6040@gmail.com>

# Conflicts:
#	go/test/endtoend/reparent/newfeaturetest/reparent_with_open_tx_test.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: Cluster management Component: Query Serving Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug Report: vtgate health check retry backoff grows too large, causing slow tablet rediscovery after prolonged outages

3 participants