Conversation

@cratelyn (Member) commented Jul 1, 2025

this is based on #3987.

in #3987 (see linkerd/linkerd2#13821) we discovered that some of the types that implement FmtLabels could disagree with their Hash implementations when used as registry keys; i.e., two values might emit identical label sets but hash to distinct values, leaving the registry with duplicate entries for a single label set.

#3987 solves two bugs. this pull request proposes a follow-on change, introducing exhaustive bindings to implementations of FmtLabels, to prevent this category of bug from recurring in the future.

this change means that adding a field to any of these label structures, e.g. OutboundEndpointLabels or HTTPLocalRateLimitLabels, will cause a compilation error unless the new field is handled in the corresponding FmtLabels implementation.
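
for illustration, here is a minimal, self-contained sketch of the pattern. the trait and structure below are simplified stand-ins rather than the proxy's actual definitions:

```rust
use std::fmt;

// a stand-in for the proxy's `FmtLabels` trait, for illustration only.
trait FmtLabels {
    fn fmt_labels(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result;
}

// a hypothetical label structure; the real ones have more fields.
struct ExampleLabels {
    target_addr: std::net::SocketAddr,
    authority: Option<String>,
}

impl FmtLabels for ExampleLabels {
    fn fmt_labels(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // exhaustive binding: if a field is later added to `ExampleLabels`,
        // this destructuring stops compiling until the new field is either
        // formatted below or explicitly ignored with `field: _`.
        let Self {
            target_addr,
            authority,
        } = self;
        write!(f, "target_addr=\"{target_addr}\"")?;
        if let Some(authority) = authority {
            write!(f, ",authority=\"{authority}\"")?;
        }
        Ok(())
    }
}
```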

🔖 a note

in writing this pull request, i noticed one label that i believe is unintentionally being elided. i've refrained from changing behavior in this pull request, but i note it below as an example of this syntax surfacing exactly the category of bug i hope to guard against here.

cratelyn added 23 commits July 1, 2025 15:19
these impl blocks are all `FmtLabels`, following another series of the
same, above. we don't need another one of these comments.

Signed-off-by: katelyn martin <[email protected]>
see #3262 (618838e), which introduced this label.

to preserve behavior, this label remains unused.

X-Ref: #3262
Signed-off-by: katelyn martin <[email protected]>
Comment on lines +434 to +435
// TODO(kate): this label is not currently emitted.
zone_locality: _,
@cratelyn (Member, Author)

in the process of writing this pull request, i found this:

this label, zone_locality, is not included in our outbound endpoint labels. this was added in #3262. i've refrained from adding it here, to avoid changing any behavior. should we file a ticket for this, and is this elision intentional?

Member

There's at least no harm in omitting it, since it's not at a higher cardinality than the target_addr

@cratelyn cratelyn marked this pull request as ready for review July 1, 2025 20:38
@cratelyn cratelyn requested a review from a team as a code owner July 1, 2025 20:38
cratelyn added a commit that referenced this pull request Jul 2, 2025
### 🖼️ background

the linkerd2 proxy implements, registers, and exports Prometheus metrics using a variety of systems, for historical reasons. new metrics broadly rely upon the official [`prometheus-client`](https://github.com/prometheus/client_rust/) library, whose interfaces are reexported for internal consumption in the [`linkerd_metrics::prom`](https://github.com/linkerd/linkerd2-proxy/blob/main/linkerd/metrics/src/lib.rs#L30-L60) namespace.

other metrics, however, predate this library, and rely on the metrics registry implemented in the workspace's [`linkerd-metrics`](https://github.com/linkerd/linkerd2-proxy/tree/main/linkerd/metrics) library.
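
for a sense of what the newer system looks like, here is a minimal example using `prometheus-client` directly. note that this is my own sketch: proxy code would go through the `linkerd_metrics::prom` reexports rather than the paths below.

```rust
use prometheus_client::encoding::text::encode;
use prometheus_client::metrics::counter::Counter;
use prometheus_client::registry::Registry;

fn main() {
    // register a counter, increment it, and render the registry in the
    // prometheus text exposition format.
    let mut registry = Registry::default();
    let requests: Counter = Counter::default();
    registry.register("requests", "Count of requests", requests.clone());

    requests.inc();

    let mut output = String::new();
    encode(&mut output, &registry).expect("encoding to a string should not fail");
    println!("{output}");
}
```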

### 🐛 bug report

* linkerd/linkerd2#13821

linkerd/linkerd2#13821 reported a bug in which duplicate metrics could be observed and subsequently dropped by Prometheus when upgrading the control plane via helm with an existing workload running.

### 🦋 reproduction example

for posterity, i'll note the reproduction steps here.

i used these steps to identify the `2025.3.2` edge release as the affected release. upgrading from `2025.2.3` to `2025.3.1` did not exhibit this behavior. see below for more discussion about the cause.

generate certificates via <https://linkerd.io/2.18/tasks/generate-certificates/>

using these two deployments, courtesy of @GTRekter:

<details>
<summary>**💾 click to expand: app deployment**</summary>

```yaml
apiVersion: v1 
kind: Namespace 
metadata: 
  name: simple-app 
  annotations: 
    linkerd.io/inject: enabled 
---
apiVersion: v1 
kind: Service 
metadata: 
  name: simple-app-v1 
  namespace: simple-app 
spec: 
  selector: 
    app: simple-app-v1 
    version: v1 
  ports: 
    - port: 80 
      targetPort: 5678
---
apiVersion: apps/v1 
kind: Deployment 
metadata: 
  name: simple-app-v1 
  namespace: simple-app 
spec: 
  replicas: 1 
  selector: 
    matchLabels: 
      app: simple-app-v1 
      version: v1 
  template: 
    metadata: 
      labels: 
        app: simple-app-v1 
        version: v1 
    spec: 
      containers: 
        - name: http-app 
          image: hashicorp/http-echo:latest 
          args: 
            - "-text=Simple App v1" 
          ports: 
            - containerPort: 5678 
```
</details>

<details>
<summary>**🤠 click to expand: client deployment**</summary>

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: traffic
  namespace: simple-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: traffic
  template:
    metadata:
      labels:
        app: traffic
    spec:
      containers:
      - name: traffic
        image: curlimages/curl:latest
        command:
          - /bin/sh
          - -c
          - |
            while true; do
              TIMESTAMP_SEND=$(date '+%Y-%m-%d %H:%M:%S')
              PAYLOAD="{\"timestamp\":\"$TIMESTAMP_SEND\",\"test_id\":\"sniff_me\",\"message\":\"hello-world\"}"
              echo "$TIMESTAMP_SEND - Sending payload: $PAYLOAD"
              RESPONSE=$(curl -s -X POST \
                -H "Content-Type: application/json" \
                -d "$PAYLOAD" \
                http://simple-app-v1.simple-app.svc.cluster.local:80)
              TIMESTAMP_RESPONSE=$(date '+%Y-%m-%d %H:%M:%S')
              echo "$TIMESTAMP_RESPONSE - RESPONSE: $RESPONSE"
              sleep 1
            done
```
</details>

and this prometheus configuration:

<details>
<summary>**🔥 click to expand: prometheus configuration**</summary>

```yaml
global:
  scrape_interval: 10s

scrape_configs:
  - job_name: 'pod'
    scrape_interval: 10s
    static_configs:
    - targets: ['localhost:4191']
      labels:
        group: 'traffic'
```
</details>

we will perform the following steps:

```sh
# install the edge release

# specify the versions we'll migrate between.
export FROM="2025.3.1"
export TO="2025.3.2"

# create a cluster, and add the helm charts.
kind create cluster
helm repo add linkerd-edge https://helm.linkerd.io/edge

# install linkerd's crd's and control plane.
helm install linkerd-crds linkerd-edge/linkerd-crds \
  -n linkerd --create-namespace --version $FROM

helm install linkerd-control-plane \
  -n linkerd \
  --set-file identityTrustAnchorsPEM=cert/ca.crt \
  --set-file identity.issuer.tls.crtPEM=cert/issuer.crt \
  --set-file identity.issuer.tls.keyPEM=cert/issuer.key \
  --version $FROM \
  linkerd-edge/linkerd-control-plane

# install a simple app and a client to drive traffic.
kubectl apply -f duplicate-metrics-simple-app.yml
kubectl apply -f duplicate-metrics-traffic.yml

# bind the traffic pod's metrics port to the host.
kubectl port-forward -n simple-app deploy/traffic 4191

# start prometheus, begin scraping metrics
prometheus --config.file=prometheus.yml
```

now, open a browser and query `irate(request_total[1m])`.

next, upgrade the control plane:

```sh
helm upgrade linkerd-crds linkerd-edge/linkerd-crds \
  -n linkerd --create-namespace --version $TO
helm upgrade linkerd-control-plane \
  -n linkerd \
  --set-file identityTrustAnchorsPEM=cert/ca.crt \
  --set-file identity.issuer.tls.crtPEM=cert/issuer.crt \
  --set-file identity.issuer.tls.keyPEM=cert/issuer.key \
  --version $TO \
  linkerd-edge/linkerd-control-plane
```

prometheus will begin emitting warnings regarding 34 time series being dropped.

in your browser, querying `irate(request_total[1m])` once more will show that
the rate of requests has stopped, due to the new time series being dropped.

next, restart the workloads...

```sh
kubectl rollout restart deployment -n simple-app simple-app-v1 traffic
```

prometheus warnings will go away, as reported in linkerd/linkerd2#13821.

### 🔍 related changes

* linkerd/linkerd2#13699
* linkerd/linkerd2#13715

in linkerd/linkerd2#13715 and linkerd/linkerd2#13699, we made some changes to the destination controller. from the "Cautions" section of the `2025.3.2` edge release:

> Additionally, this release changes the default for `outbound-transport-mode`
> to `transport-header`, which will result in all traffic between meshed
> proxies flowing on port 4143, rather than using the original destination
> port.

linkerd/linkerd2#13699 (_included in `edge-25.3.1`_) introduced this outbound transport-protocol configuration surface, but maintained the default behavior, while linkerd/linkerd2#13715 (_included in `edge-25.3.2`_) altered the default behavior to route meshed traffic via port 4143.

this is a visible change in behavior that can be observed when upgrading the mesh from a version that preceded this change. this means that when upgrading across `edge-25.3.2`, such as from the `2025.2.1` to `2025.3.2` versions of the helm charts, or from the `2025.2.3` to the `2025.3.4` versions of the helm charts (_reported upstream in linkerd/linkerd2#13821_), the freshly upgraded destination controller pods will begin routing meshed traffic differently.

i'll state explicitly, _that_ is not a bug! it is, however, an important clue to bear in mind: data plane pods that were started with the previous control plane version, and continue running after the control plane upgrade, will have seen both routing patterns. seeing duplicate time series for the affected metrics indicates that our metrics system is keying identical label sets under distinct hashes.

### 🐛 the bug(s)

we define a collection of structures to model labels for inbound and outbound endpoints' metrics:

```rust
// linkerd/app/core/src/metrics.rs

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum EndpointLabels {
    Inbound(InboundEndpointLabels),
    Outbound(OutboundEndpointLabels),
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct InboundEndpointLabels {
    pub tls: tls::ConditionalServerTls,
    pub authority: Option<http::uri::Authority>,
    pub target_addr: SocketAddr,
    pub policy: RouteAuthzLabels,
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct OutboundEndpointLabels {
    pub server_id: tls::ConditionalClientTls,
    pub authority: Option<http::uri::Authority>,
    pub labels: Option<String>,
    pub zone_locality: OutboundZoneLocality,
    pub target_addr: SocketAddr,
}
```

\- <https://github.com/linkerd/linkerd2-proxy/blob/main/linkerd/app/core/src/metrics.rs>

pay particular attention to the derived `Hash` implementation, and note the `tls::ConditionalClientTls` and `tls::ConditionalServerTls` types used in each of these labels. these are used by some of our types, like `TlsConnect`, to emit prometheus labels using our legacy system's `FmtLabels` trait:

```rust
// linkerd/app/core/src/transport/labels.rs

impl FmtLabels for TlsConnect<'_> {
    fn fmt_labels(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self.0 {
            Conditional::None(tls::NoClientTls::Disabled) => {
                write!(f, "tls=\"disabled\"")
            }
            Conditional::None(why) => {
                write!(f, "tls=\"no_identity\",no_tls_reason=\"{}\"", why)
            }
            Conditional::Some(tls::ClientTls { server_id, .. }) => {
                write!(f, "tls=\"true\",server_id=\"{}\"", server_id)
            }
        }
    }
}
```

\- <https://github.com/linkerd/linkerd2-proxy/blob/99316f78987975a074ea63453c0dd21546fa4a48/linkerd/app/core/src/transport/labels.rs#L151-L165>

note the `ClientTls` case, which ignores fields in the client tls information:

```rust
// linkerd/tls/src/client.rs

/// A stack parameter that configures a `Client` to establish a TLS connection.
#[derive(Clone, Debug, Eq, PartialEq, Hash)]
pub struct ClientTls {
    pub server_name: ServerName,
    pub server_id: ServerId,
    pub alpn: Option<AlpnProtocols>,
}
```

\- <https://github.com/linkerd/linkerd2-proxy/blob/99316f78987975a074ea63453c0dd21546fa4a48/linkerd/tls/src/client.rs#L20-L26>

this means that two `ClientTls` structures with distinct server names or ALPN protocols can emit an identical set of labels while hashing to different registry keys. for brevity, i won't walk through `ServerTls` and its corresponding `TlsAccept<'_>` label implementation here, though it exhibits the same issue.
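
to make the failure mode concrete, here is a self-contained sketch with simplified stand-in types (the names and values below are made up, not the proxy's real definitions):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// a simplified stand-in for `ClientTls`: only `server_id` is rendered as a
// label, but every field participates in the derived `Hash`.
#[derive(Hash)]
struct ClientTlsLike {
    server_id: String,
    alpn: Option<Vec<u8>>,
}

// mirrors the `TlsConnect` formatting above: the `alpn` field is ignored.
fn fmt_labels(tls: &ClientTlsLike) -> String {
    format!("tls=\"true\",server_id=\"{}\"", tls.server_id)
}

fn hash_key<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let a = ClientTlsLike { server_id: "web.ns.svc".into(), alpn: None };
    let b = ClientTlsLike { server_id: "web.ns.svc".into(), alpn: Some(b"h2".to_vec()) };

    // identical label sets…
    assert_eq!(fmt_labels(&a), fmt_labels(&b));
    // …but (almost certainly) distinct keys, so a registry keyed on the full
    // structure would hold two entries that render as duplicate time series.
    assert_ne!(hash_key(&a), hash_key(&b));
}
```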

### 🔨 the fix

this pull request introduces two new types: `ClientTlsLabels` and `ServerTlsLabels`. these continue to implement `Hash`, for use as a key in our metrics registry, and for use in formatting labels.

`ClientTlsLabels` and `ServerTlsLabels` each resemble `ClientTls` and `ServerTls`, respectively, but do not contain any fields that are elided in label formatting, to prevent duplicate metrics from being emitted.
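
for a rough sense of the shape of the fix, here is a paraphrased sketch; the field set is my inference from the `TlsConnect` formatting above, not the exact definitions introduced in #3987:

```rust
// the registry key type retains only fields that are actually rendered as
// labels, so the derived `Hash` and the emitted label set cannot disagree.
#[derive(Clone, Debug, Eq, PartialEq, Hash)]
pub struct ClientTlsLabels {
    pub server_id: ServerId,
}

impl From<&ClientTls> for ClientTlsLabels {
    fn from(tls: &ClientTls) -> Self {
        Self {
            server_id: tls.server_id.clone(),
        }
    }
}
```

`ServerTlsLabels` follows the same idea for the inbound side.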

relatedly, #3988 audits our existing `FmtLabels` implementations and makes use of exhaustive bindings, to prevent this category of problem in the short-term future. ideally, we might eventually consider replacing the metrics interfaces in `linkerd-metrics`, but that is strictly kept out-of-scope for the purposes of this particular fix.

---

* fix: do not key transport metrics registry on `ClientTls`

Signed-off-by: katelyn martin <[email protected]>

* fix: do not key transport metrics registry on `ServerTls`

Signed-off-by: katelyn martin <[email protected]>

---------

Signed-off-by: katelyn martin <[email protected]>
Base automatically changed from kate/app-core.duplicate-tls-metrics to main July 2, 2025 16:38
@cratelyn cratelyn merged commit 34b46ab into main Jul 3, 2025
15 checks passed
@cratelyn cratelyn deleted the kate/app-core.duplicate-tls-metrics-follow-up.exhaustive-bindings branch July 3, 2025 15:56