fix(dns): add a lower-bound negative TTL#4449

Open
cratelyn wants to merge 8 commits into main from kate/dns.defensive-zero-ttl-checks

@cratelyn cratelyn commented Mar 10, 2026

see linkerd/linkerd2#14954.

some user reports describe situations in which, when the linkerd control
plane's destination controller is OOM-killed, DNS resolution can
momentarily cause the proxy to compute a negative-TTL duration of zero.

this causes a panic in production environments, because
`tokio::time::interval` asserts that it has not been provided a duration
of zero.

this manifests in errors that look like this:

thread 'main' panicked at linkerd/app/core/src/control.rs:118:49: period must be non-zero.

this branch introduces changes to enforce a lower bound on negative TTLs.
this covers TTLs that are zero, which would cause a panic, and TTLs that
are pathologically short, which could cause proxies encountering
resolution errors to thrash the DNS server while trying to recover.

this commit patches `linkerd-dns::ResolveError::negative_ttl()` so that
it will now log a warning and instead return `None` when a negative TTL
of zero is encountered. a shared `duration_from_error()` helper
(bikeshedding welcome) helps do this for both A/AAAA and SRV records.
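
as a rough illustration of the behavior described in this commit message, a minimal sketch using only the standard library might look like the following. the function name `nonzero_negative_ttl` is hypothetical and is not the helper from this branch, and `eprintln!` stands in for whatever logging the proxy actually uses.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the shared helper described above: a negative
/// TTL of zero is treated as "no usable TTL" instead of being passed along
/// to code that would build a zero-period interval.
fn nonzero_negative_ttl(ttl: Duration) -> Option<Duration> {
    if ttl.is_zero() {
        // The commit logs a warning at this point; this sketch writes to
        // stderr so that it has no external dependencies.
        eprintln!("ignoring negative TTL of zero");
        return None;
    }
    Some(ttl)
}
```

returning `None` lets a caller fall back to its default recovery strategy, rather than handing a zero duration to a timer that asserts a non-zero period.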

X-Ref: #3807
Signed-off-by: katelyn martin <kate@buoyant.io>
@cratelyn cratelyn self-assigned this Mar 10, 2026
@cratelyn cratelyn requested review from unleashed and removed request for unleashed March 10, 2026 21:56
@cratelyn

i will tend to linter errors in the morning.

Signed-off-by: katelyn martin <kate@buoyant.io>
@cratelyn cratelyn marked this pull request as ready for review March 11, 2026 15:54
@cratelyn cratelyn requested a review from a team as a code owner March 11, 2026 15:54
@cratelyn cratelyn requested a review from unleashed March 11, 2026 15:54
cratelyn added a commit that referenced this pull request Mar 11, 2026
`linkerd_app_core::control` provides utilities used by the data plane to
communicate with the linkerd control plane. alongside features such as
load-balancing and configurable settings like connection timeout
durations, this includes an error recovery strategy that respects a DNS
record's negative TTL.

as of today, we do this within an inline, anonymous closure.

this commit pulls this business logic out of an inline closure, and into
an explicit pair of structures.

`ResolveRecover` is the `Recover` implementation that identifies the
proper backoff strategy when presented with a given boxed error.
`ResolveBackoff` is the sum type that encompasses either a TTL-driven
interval or an exponential backoff.

see also, #4449. that introduces some additional
guardrails to prevent panicking if a negative ttl of zero is
encountered.
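
to make the "sum type" idea above concrete, here is a simplified sketch, under the assumption that the two strategies can be modeled as plain delays. the `Backoff` enum, its fields, and both methods are illustrative names only. they are not the `ResolveBackoff` or `ResolveRecover` types from the referenced commit, which work with asynchronous streams rather than bare durations.

```rust
use std::time::Duration;

/// Hypothetical sum type: either a fixed, TTL-driven retry interval, or an
/// exponential backoff. Names and fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
enum Backoff {
    Ttl(Duration),
    Exponential { next: Duration, max: Duration },
}

impl Backoff {
    /// Choose a strategy from the optional negative TTL carried by an error.
    /// Treating a zero TTL like a missing one is an assumption of this sketch.
    fn from_negative_ttl(ttl: Option<Duration>, base: Duration, max: Duration) -> Self {
        match ttl {
            Some(ttl) if !ttl.is_zero() => Backoff::Ttl(ttl),
            _ => Backoff::Exponential { next: base, max },
        }
    }

    /// Return the delay to wait before the next attempt, advancing the
    /// exponential state when that variant is in use.
    fn next_delay(&mut self) -> Duration {
        match self {
            Backoff::Ttl(ttl) => *ttl,
            Backoff::Exponential { next, max } => {
                let delay = *next;
                // Double the delay for the next attempt, capped at `max`.
                *next = (*next).saturating_mul(2).min(*max);
                delay
            }
        }
    }
}
```

naming both cases in one enum keeps the choice of strategy in a single `match`, instead of spreading it through an inline closure.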

Signed-off-by: katelyn martin <kate@buoyant.io>
@unleashed unleashed left a comment

Looks good 👍, just a couple minor comments

#4449 (comment)

Signed-off-by: katelyn martin <kate@buoyant.io>
we follow a convention in the proxy of "punning" fairly liberally.

this comment was pointing to the internal ResolveError, not the
hickory_resolver version of this type.

#4449 (comment)

Signed-off-by: katelyn martin <kate@buoyant.io>
we introduced logic for enforcing a minimum TTL in
#3807. this commit moves that logic from the outer
layer in `linkerd-dns-resolve` and into the `linkerd-dns` library.

this will help us reuse/consolidate the same logic for *negative* TTLs.

Signed-off-by: katelyn martin <kate@buoyant.io>
Signed-off-by: katelyn martin <kate@buoyant.io>
this introduces a new function to `linkerd_dns::minimum_ttl`, for
working with `Duration`s.

this is used in `negative_ttl_of()` so that we not only check for TTLs
of zero, but also for pathologically small TTLs.
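
the lower bound described here amounts to clamping a `Duration` against a floor. a minimal sketch follows, where the function name `clamp_to_minimum` and the explicit `minimum` parameter are assumptions made for illustration. the actual function added to `linkerd_dns::minimum_ttl`, and the floor it uses, are not shown in this conversation.

```rust
use std::time::Duration;

/// Hypothetical helper: raise `ttl` to at least `minimum`. This covers both
/// a TTL of zero and a TTL that is non-zero but pathologically small.
fn clamp_to_minimum(ttl: Duration, minimum: Duration) -> Duration {
    // `Duration` is totally ordered, so `max` picks the longer of the two.
    ttl.max(minimum)
}
```

with a non-zero floor, this never yields a zero duration, so the single check handles the panic case and the thrashing case together.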

Signed-off-by: katelyn martin <kate@buoyant.io>
this tweaks our `sleep_until_expired` function so that it provides a
similar signature to `with_minimum_duration`. this way, callers that
interact with Instants and Durations both have common interfaces.
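
for callers that hold a deadline rather than a length of time, the same floor can be expressed against an `Instant`. the sketch below assumes a synchronous helper named `expiry_with_minimum`, which is hypothetical. the real `sleep_until_expired` is not reproduced here and its signature may differ.

```rust
use std::time::{Duration, Instant};

/// Hypothetical `Instant` counterpart to a minimum-duration clamp: move an
/// expiry forward so that it is never sooner than `now + minimum`.
fn expiry_with_minimum(now: Instant, expiry: Instant, minimum: Duration) -> Instant {
    // `Instant` is totally ordered, so `max` picks the later deadline.
    expiry.max(now + minimum)
}
```

passing `now` in explicitly keeps the helper deterministic, which makes the lower bound straightforward to unit test.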

Signed-off-by: katelyn martin <kate@buoyant.io>
@cratelyn cratelyn changed the title from "fix(dns): check for negative_ttl of zero" to "fix(dns-resolve): add a lower-bound negative TTL" Mar 12, 2026
cratelyn commented Mar 12, 2026

i've renamed this pull request: after review feedback prompted additional changes, it now does slightly more than check for negative TTLs of zero. i have also updated the pull request description to reflect that this goes beyond zero TTLs and also enforces a lower bound.

@cratelyn cratelyn changed the title from "fix(dns-resolve): add a lower-bound negative TTL" to "fix(dns): add a lower-bound negative TTL" Mar 12, 2026
@unleashed unleashed left a comment

Looks good 🚀
