vtgate: use fixed retry interval for health check reconnection by khkim6040 · Pull Request #19967 · vitessio/vitess

khkim6040 · 2026-04-27T12:22:53Z

Description

The checkConn reconnect loop used to double retryDelay on every stream
failure (capping at healthCheckTimeout, default 60s). After a prolonged
tablet outage each vtgate sits at a different point on that backoff curve, so a
recovered tablet is rediscovered anywhere from a few seconds to ~60s later
depending on the vtgate — the stagger described in #19894.

This PR drops the exponential backoff and retries on a fixed base interval
(--healthcheck-retry-delay, default 5s) with +/-25% jitter. The fixed base
keeps rediscovery prompt; the jitter stops every vtgate from reconnecting to a
recovered tablet in lockstep (a thundering-herd we'd otherwise create by making
the interval perfectly fixed).

Related Issue(s)

Fixes #19894

Checklist

Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

Deployment Notes

--healthcheck-retry-delay (default 5s) now drives a fixed retry interval with
+/-25% jitter instead of the initial value of an exponential backoff. Operators
who relied on the backoff growing to throttle retries during long outages may
want to raise it; 5s is fine for most deployments.

AI Disclosure

Written with Claude Code — I gave the direction and chose the design (fixed
interval + jitter over a pure fixed interval); Claude wrote the code and tests.

Replace exponential backoff in checkConn with a fixed retry interval (default 5s). Previously, the retry delay doubled on each failure up to healthCheckTimeout (default 60s), causing vtgates to take up to 60s to rediscover tablets after prolonged outages. Fixes vitessio#19894 Signed-off-by: Gwanho Kim <khkim6040@gmail.com>

Add TestHealthCheckRetryDelayIsFixed to verify that repeated stream failures do not cause the retry delay to grow. The test sends multiple consecutive errors and asserts that the tablet is still rediscovered within a short, fixed window once it recovers. Also update a stale comment referencing exponential backoff in TestHealthCheckTimeout. Signed-off-by: Gwanho Kim <khkim6040@gmail.com>

vitess-bot · 2026-04-27T12:23:25Z

Copilot

Pull request overview

This PR changes VTGate’s tablet healthcheck reconnection behavior to use a fixed retry interval (instead of exponential backoff) so recovered tablets are rediscovered promptly and consistently across vtgate instances.

Changes:

Removed exponential backoff logic in tabletHealthCheck.checkConn, keeping retries at a fixed hc.retryDelay.
Added a regression test intended to verify the retry delay remains fixed across repeated stream failures.
Updated a stale test comment referencing exponential backoff.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
go/vt/discovery/tablet_health_check.go	Removes exponential backoff in the StreamHealth reconnect loop in favor of a fixed retry interval.
go/vt/discovery/healthcheck_test.go	Adds a new regression test for fixed retry delay behavior and updates an existing test comment.

Copilot · 2026-04-27T12:27:59Z

+		Serving:                   true,
+		PrimaryTermStartTimestamp: 0,
+		RealtimeStats:             &querypb.RealtimeStats{ReplicationLagSeconds: 1, CpuUsage: 0.2},
+	}
+	input <- shr


input <- shr can block until fakeConn.StreamHealth is running again (because input/hcChan is unbuffered). That means the test’s 100ms timeout starts after the reconnect happens, so it won’t actually catch an exponential-backoff regression. Make the health stream channel buffered (e.g. size 1) or send shr asynchronously and start the timer before/while sending so the assertion includes the retry sleep time.

Copilot · 2026-04-27T12:27:59Z

 		// Streaming RPC failed e.g. because vttablet was restarted or took too long.
 		// Sleep until the next retry is up or the context is done/canceled.
+		// We use a fixed retry interval instead of exponential backoff so that
+		// vtgate rediscovers recovered tablets promptly. See #19894.
 		select {
 		case <-thc.ctx.Done():
 			return
 		case <-time.After(retryDelay):
-			// Exponentially back-off to prevent tight-loop.
-			retryDelay *= 2
-			// Limit the retry delay backoff to the health check timeout
-			if retryDelay > hc.healthCheckTimeout {
-				retryDelay = hc.healthCheckTimeout
-			}
 		}


With the switch to a fixed retry interval, retryDelay now directly controls how aggressively vtgate reconnects after stream failures. However, vtgate’s --healthcheck-retry-delay default is currently 2ms (see go/vt/vtgate/vtgate.go and generated flags docs), which would cause very tight reconnect loops and time.After allocations during outages. Consider updating the vtgate default to a saner value (e.g. discovery’s 5s default) and/or enforcing a reasonable minimum retryDelay here to avoid runaway retries.

promptless · 2026-04-27T12:28:25Z

Promptless prepared a documentation update related to this change.

Triggered by this PR

Added documentation to the vtgate configuration guide explaining that vtgate now uses a fixed retry interval (controlled by --healthcheck-retry-delay, default 5 seconds) when reconnecting to tablets after connection loss, replacing the previous exponential backoff behavior.

Review: Document vtgate health check fixed retry interval

github-actions · 2026-05-28T01:40:46Z

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

Push additional commits to the associated branch.
Remove the stale label.
Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

Replace the fixed retry interval in checkConn with the same base delay plus +/-25% jitter, and remove the now-dead local retryDelay variable and its no-op reset. A purely fixed interval re-synchronizes vtgate instances that started retrying together during a shared outage, so they would all reconnect to a recovered tablet in lockstep. Jittering the interval spreads those reconnects out while still rediscovering recovered tablets promptly. See vitessio#19894. Signed-off-by: Gwanho Kim <khkim6040@gmail.com>

Rename TestHealthCheckRetryDelayIsFixed to TestHealthCheckRetryDelayIsBounded since the interval now carries jitter rather than being strictly fixed, and update its comments accordingly. Replace t.Fatal with require.Fail per the project test conventions, and widen the timeout to 30s: the success path returns almost immediately, so a healthy run never approaches it, while a regression that re-grows the delay still trips it without CI flakiness. Signed-off-by: Gwanho Kim <khkim6040@gmail.com>

khkim6040 · 2026-06-01T12:21:36Z

Reworked this from a pure fixed interval to a fixed base (5s) with a +- 25% jitter.

A perfectly fixed interval re-synchronizes the vtgates that started retrying together during a shared outage, so they'd all reconnect to a recovered tablet in lockstep. The jitter keeps rediscovery fast (#19894) while spreading those reconnects out, so no thundering herd. The dead retryDelay reset is gone and the regression test is updated.

@deepthi when you have a moment — I went with fixed+jitter here, but if you'd prefer keeping the backoff with a lower cap instead, happy to switch. Does this direction look right to you?

codecov · 2026-06-02T23:46:43Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.42%. Comparing base (70c7a72) to head (87bbe40).
⚠️ Report is 323 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main   #19967       +/-   ##
===========================================
+ Coverage   69.67%   84.42%   +14.74%     
===========================================
  Files        1614      319     -1295     
  Lines      216793    57403   -159390     
===========================================
- Hits       151044    48460   -102584     
+ Misses      65749     8943    -56806

Flag	Coverage Δ
partial	`84.42% <100.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

mattlord

In go/vt/vtgate/vtgate.go:88, go/vt/discovery/tablet_health_check.go:326 the PR makes --healthcheck-retry-delay the steady-state retry interval, but vtgate’s default is still 2ms, not the 5s claimed in the PR/issue. With jitter, a default vtgate will now retry every ~1.5-2.5ms per down tablet for the full outage, creating a very hot reconnect loop and time.After allocation churn across the fleet. Please update the vtgate default to discovery.DefaultHealthCheckRetryDelay/5s and regenerate flag docs, or enforce a sane minimum before using the delay as a fixed interval.

In go/vt/discovery/healthcheck_test.go:386-396 the regression test still does not measure the retry sleep after recovery. input <- shr uses an unbuffered channel, so it blocks until the healthcheck goroutine has already reconnected; the timeout starts only after the old exponential backoff would have elapsed. This means the test would pass even if exponential backoff were restored. Make the channel buffered or send the healthy response asynchronously and start timing before the send can unblock.

Signed-off-by: Gwanho Kim <khkim6040@gmail.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

+func retryInterval(base time.Duration) time.Duration {
+	if base <= 0 {
+		return base
+	}
+	// Spread uniformly within [base-base/4, base+base/4).
+	return base - base/4 + time.Duration(rand.Int64N(int64(base/2)))
+}


+	const base = 5 * time.Second
+	lower := time.Duration(float64(base) * 0.75)
+	upper := time.Duration(float64(base) * 1.25)


+	start := time.Now()
+	input <- shr
+
+	// The fixed retry delay (~10ms +/- jitter) rediscovers the tablet within a
+	// single cycle. The old exponential backoff would sleep ~320ms after
+	// numErrors failures (10ms * 2^5), so a regression that regrows the delay
+	// blows past this bound. The 5s arm only trips on a true hang.
+	select {
+	case result := <-resultChan:
+		assert.True(t, result.Serving, "tablet should be serving after recovery")
+		assert.Less(t, time.Since(start), 100*time.Millisecond,
+			"rediscovery took too long after recovery; retry delay may be growing exponentially")


+	// healthCheckRetryDelay is the fixed interval between healthcheck reconnection
+	// attempts. Since it is now a steady-state interval rather than a backoff seed,
+	// it defaults to discovery's 5s instead of a few milliseconds.
+	healthCheckRetryDelay = discovery.DefaultHealthCheckRetryDelay


Signed-off-by: Gwanho Kim <khkim6040@gmail.com>

khkim6040 · 2026-06-04T14:37:56Z

Thanks @mattlord! Fixed in 2ac3c00 and f591d6d:

Default 2ms -> 5s (discovery.DefaultHealthCheckRetryDelay) + regenerated flag docs.
Regression test now buffers the channel and times before the send, so it actually measures the retry sleep and fails if backoff returns.

The fixed 5s health check retry interval can leave vtgate without a reconnected stream when the DML runs, so accept "is either down or nonexistent" alongside "wrong tablet type" as the VT15001 reason. Signed-off-by: Gwanho Kim <khkim6040@gmail.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

 		select {
 		case <-thc.ctx.Done():
 			return
-		case <-time.After(retryDelay):
-			// Exponentially back-off to prevent tight-loop.
-			retryDelay *= 2
-			// Limit the retry delay backoff to the health check timeout
-			if retryDelay > hc.healthCheckTimeout {
-				retryDelay = hc.healthCheckTimeout
-			}
+		case <-time.After(retryInterval(hc.retryDelay)):
 		}


+func retryInterval(base time.Duration) time.Duration {
+	// half is the jitter span passed to rand.Int64N, which panics on n <= 0. For
+	// a non-positive base, or one so small that base/2 truncates to zero (sub-2ns),
+	// there is no room to jitter, so return the base unchanged.
+	half := base / 2
+	if half <= 0 {
+		return base
+	}
+	// Spread uniformly within [base-base/4, base+base/4).
+	return base - base/4 + time.Duration(rand.Int64N(int64(half)))
+}


+	fc := createFakeConn(tablet, input)
+	fc.errCh = make(chan error)
+	hc.AddTablet(tablet)
+
+	// Drain the initial not-serving notification from AddTablet.
+	<-resultChan
+
+	// Send multiple consecutive stream errors to simulate a prolonged outage
+	// where the tablet is unreachable for many retry cycles.
+	const numErrors = 8 // with exponential backoff the post-recovery sleep would reach ~1.28s (10ms*2^7)
+	for range numErrors {
+		fc.errCh <- errors.New("connection refused")
+		<-resultChan // drain the not-serving update


…-backoff-19894 Signed-off-by: Gwanho Kim <khkim6040@gmail.com> # Conflicts: # go/test/endtoend/reparent/newfeaturetest/reparent_with_open_tx_test.go

khkim6040 added 2 commits April 27, 2026 21:15

Copilot AI review requested due to automatic review settings April 27, 2026 12:22

khkim6040 requested review from frouioui and timvaillancourt as code owners April 27, 2026 12:22

github-actions Bot added this to the v25.0.0 milestone Apr 27, 2026

Copilot started reviewing on behalf of khkim6040 April 27, 2026 12:23 View session

Copilot AI reviewed Apr 27, 2026

View reviewed changes

github-actions Bot added the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label May 28, 2026

khkim6040 added 2 commits June 1, 2026 21:04

github-actions Bot removed the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label Jun 2, 2026

mattlord requested review from arthurschreiber and mattlord June 2, 2026 23:33

mattlord reviewed Jun 2, 2026

View reviewed changes

khkim6040 added 2 commits June 4, 2026 23:17

discovery: measure reconnect interval in retry-delay regression test

2ac3c00

Signed-off-by: Gwanho Kim <khkim6040@gmail.com>

vtgate: default health check retry delay to fixed 5s interval

f591d6d

Signed-off-by: Gwanho Kim <khkim6040@gmail.com>

Copilot AI review requested due to automatic review settings June 4, 2026 14:19

khkim6040 requested review from harshit-gangal and systay as code owners June 4, 2026 14:19

Copilot AI reviewed Jun 4, 2026

View reviewed changes

discovery: guard retryInterval jitter and de-flake its regression test

7868952

Signed-off-by: Gwanho Kim <khkim6040@gmail.com>

Copilot AI review requested due to automatic review settings June 13, 2026 08:01

Copilot AI reviewed Jun 13, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into fix/healthcheck-retry…

87bbe40

…-backoff-19894 Signed-off-by: Gwanho Kim <khkim6040@gmail.com> # Conflicts: # go/test/endtoend/reparent/newfeaturetest/reparent_with_open_tx_test.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vtgate: use fixed retry interval for health check reconnection#19967

vtgate: use fixed retry interval for health check reconnection#19967
khkim6040 wants to merge 9 commits into
vitessio:mainfrom
khkim6040:fix/healthcheck-retry-backoff-19894

khkim6040 commented Apr 27, 2026 •

edited

Loading

Uh oh!

vitess-bot Bot commented Apr 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

promptless Bot commented Apr 27, 2026

Uh oh!

github-actions Bot commented May 28, 2026

Uh oh!

khkim6040 commented Jun 1, 2026

Uh oh!

codecov Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

mattlord left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

khkim6040 commented Jun 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

khkim6040 commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issue(s)

Checklist

Deployment Notes

AI Disclosure

Uh oh!

vitess-bot Bot commented Apr 27, 2026

Review Checklist

General

Tests

Documentation

New flags

If a workflow is added or modified:

Backward compatibility

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

promptless Bot commented Apr 27, 2026

Uh oh!

github-actions Bot commented May 28, 2026

Uh oh!

khkim6040 commented Jun 1, 2026

Uh oh!

codecov Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mattlord left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

khkim6040 commented Jun 4, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

khkim6040 commented Apr 27, 2026 •

edited

Loading

codecov Bot commented Jun 2, 2026 •

edited

Loading