Description
What is the issue?
Hi,
we recently upgraded from edge-25.4.2 to edge-25.7.4. Since the upgrade, pods no longer terminate properly because the linkerd-proxy container remains in an error state after the main container has completed.
How can it be reproduced?
To troubleshoot this, I wrote a simple demo pod:
apiVersion: v1
kind: Pod
metadata:
  name: simple-idle-pod
  annotations:
    config.linkerd.io/proxy-log-level: debug
spec:
  containers:
    - name: idle-container
      image: ubuntu
      command: ["/bin/bash", "-c"]
      args:
        - |
          while true; do
            if [ -f /mnt/killpod ]; then
              echo "File /mnt/killpod found. Exiting."
              exit 0
            fi
            sleep 5
          done
  restartPolicy: Never

If I now connect to the main container and create /mnt/killpod, it terminates gracefully with exit code 0. The linkerd-proxy container, however, reports an error with exit code 2, see here.
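The reproduction steps above could be scripted roughly as follows. This is only a sketch: it assumes the demo pod runs in the current namespace and that `kubectl` points at the affected cluster. The `POD` variable and the function names `trigger_exit` and `exit_codes` are made up for illustration.

```shell
# Name of the demo pod from the manifest above (override via the environment).
POD="${POD:-simple-idle-pod}"

# Create the marker file that the demo container's loop polls for.
trigger_exit() {
  kubectl exec "$POD" -c idle-container -- touch /mnt/killpod
}

# Print "<container-name> <exit-code>" per container; the exit code is
# empty for containers that have not terminated. With native sidecars the
# proxy is listed under initContainerStatuses, so both lists are queried.
exit_codes() {
  kubectl get pod "$POD" -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{" "}{.state.terminated.exitCode}{"\n"}{end}{range .status.containerStatuses[*]}{.name}{" "}{.state.terminated.exitCode}{"\n"}{end}'
}
```

Calling `trigger_exit`, waiting a few seconds, and then calling `exit_codes` should show the exit code of `idle-container` next to that of `linkerd-proxy`.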
Logs, error output, etc
According to the proxy logs, everything looks fine and exits normally, no errors are thrown. The network validator logs look fine as well.
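For anyone who wants to look at the same logs, a sketch like the following could collect them. The container name `linkerd-network-validator` is an assumption about how the validator init container is named in this setup, and `POD` and `collect_logs` are hypothetical names.

```shell
# Name of the demo pod (override via the environment).
POD="${POD:-simple-idle-pod}"

# Print the logs of the proxy and of the network validator, each under a
# small header. The validator container name is an assumption.
collect_logs() {
  for c in linkerd-proxy linkerd-network-validator; do
    echo "== $c =="
    kubectl logs "$POD" -c "$c"
  done
}
```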
output of linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
    is running version 25.7.4 but the latest edge version is 25.7.5
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 25.7.4 but the latest edge version is 25.7.5
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-destination-745c4964db-f5jz9 (edge-25.7.4)
        * linkerd-destination-745c4964db-n88xm (edge-25.7.4)
        * linkerd-identity-6fcb4976b4-5988p (edge-25.7.4)
        * linkerd-identity-6fcb4976b4-tbn8l (edge-25.7.4)
        * linkerd-proxy-injector-8474cc9fdf-jrc7w (edge-25.7.4)
        * linkerd-proxy-injector-8474cc9fdf-zh78k (edge-25.7.4)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints

Status check results are √
Environment
- Kubernetes Version: 1.31.9
- Cluster Environment: AKS
- Host OS: Ubuntu
- Linkerd version: 25.7.4
Possible solution
We tested several versions to narrow down when the issue started. Of the versions we tested, the last good one is 25.6.2 and the first bad one is 25.6.3:
- 25.7.4 -> issue occurs
- 25.7.2 -> issue occurs
- 25.6.4 -> issue occurs
- 25.6.3 -> issue occurs
- 25.6.2 -> ok
- 25.6.1 -> ok
- 25.5.5 -> ok
If we deploy edge-25.7.4 but explicitly add the following values to our Helm values file to force the proxy to use 25.6.2, the issue also doesn't occur, so it seems to be related to the proxy itself:
proxy:
  image:
    name: cr.l5d.io/linkerd/proxy
    version: edge-25.6.2

This is probably a workaround at best. I'm not sure I'm comfortable with a version skew between the Linkerd control plane and its proxies.
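To check which proxy image a pod actually runs after applying such an override, something like the following could be used. It is a sketch that assumes the injected container is named `linkerd-proxy`; `POD` and `proxy_image` are hypothetical names.

```shell
# Name of the pod to inspect (override via the environment).
POD="${POD:-simple-idle-pod}"

# Print the image of the linkerd-proxy container from the pod spec.
# With proxy.nativeSidecar enabled the proxy is declared under
# initContainers, otherwise under containers, so both are queried.
proxy_image() {
  kubectl get pod "$POD" -o jsonpath='{.spec.initContainers[?(@.name=="linkerd-proxy")].image}{.spec.containers[?(@.name=="linkerd-proxy")].image}{"\n"}'
}
```

If the override took effect, the printed image should carry the edge-25.6.2 tag, while `linkerd check` would still report edge-25.7.4 for the control plane.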
Additional context
Below is our values file. All other settings are not changed from their default values:
linkerd2-cni:
  # -|- Tolerations section, See the
  # [K8S documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)
  # for more information
  tolerations:
    - key: "CriticalAddonsOnly"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
    # This toleration will be removed once all workloads are migrated to the new nodepool / Userpoolv3
    - key: "Userpoolv3"
      operator: Equal
      value: "true"
      effect: NoSchedule
linkerd-control-plane:
  # This should be the app version of the linkerd-control-plane chart we are using
  linkerdVersion: edge-25.7.4
  # proxy configuration
  proxy:
    # -- Enables the proxy's /shutdown admin endpoint
    enableShutdownEndpoint: true
    # -- Enable KEP-753 native sidecars
    # This is an experimental feature. It requires Kubernetes >= 1.29.
    # If enabled, .proxy.waitBeforeExitSeconds should not be used.
    nativeSidecar: true
  # -- Failure policy for the proxy injector
  webhookFailurePolicy: Fail
  identity:
    # -- If the linkerd-identity-trust-roots ConfigMap has already been created
    externalCA: true
    issuer:
      scheme: kubernetes.io/tls
  # proxy injector configuration
  proxyInjector:
    # -- Do not create a secret resource for the policyValidator webhook.
    # If this is set to `true`, the value `policyValidator.caBundle` must be set
    # or the ca bundle must be injected with cert-manager ca injector using
    # `policyValidator.injectCaFrom` or `policyValidator.injectCaFromSecret` (see below).
    externalSecret: true
    # -- Inject the CA bundle from a Secret.
    # If set, the `cert-manager.io/inject-ca-from-secret` annotation will be added to the webhook.
    # The Secret must have the CA Bundle stored in the `ca.crt` key and have
    # the `cert-manager.io/allow-direct-injection` annotation set to `true`.
    # See the cert-manager [CA Injector Docs](https://cert-manager.io/docs/concepts/ca-injector/#injecting-ca-data-from-a-secret-resource)
    # for more information.
    injectCaFrom: "linkerd/linkerd-proxy-injector-k8s-tls"
    objectSelector:
      matchExpressions:
        - key: linkerd.io/control-plane-component
          operator: DoesNotExist
        - key: linkerd.io/cni-resource
          operator: DoesNotExist
        - key: linkerd.io/wait-pod
          operator: DoesNotExist
  # service profile validator configuration
  profileValidator:
    # -- Do not create a secret resource for the profileValidator webhook.
    # If this is set to `true`, the value `proxyInjector.caBundle` must be set
    # or the ca bundle must be injected with cert-manager ca injector using
    # `proxyInjector.injectCaFrom` or `proxyInjector.injectCaFromSecret` (see below).
    externalSecret: true
    # -- Inject the CA bundle from a cert-manager Certificate.
    # See the cert-manager [CA Injector Docs](https://cert-manager.io/docs/concepts/ca-injector/#injecting-ca-data-from-a-certificate-resource)
    # for more information.
    injectCaFrom: "linkerd/linkerd-sp-validator-k8s-tls"
  # policy validator configuration
  policyValidator:
    # -- Do not create a secret resource for the policyValidator webhook.
    # If this is set to `true`, the value `policyValidator.caBundle` must be set
    # or the ca bundle must be injected with cert-manager ca injector using
    # `policyValidator.injectCaFrom` or `policyValidator.injectCaFromSecret` (see below).
    externalSecret: true
    # -- Inject the CA bundle from a cert-manager Certificate.
    # See the cert-manager [CA Injector Docs](https://cert-manager.io/docs/concepts/ca-injector/#injecting-ca-data-from-a-certificate-resource)
    # for more information.
    injectCaFrom: "linkerd/linkerd-policy-validator-k8s-tls"
  # -|- Tolerations section, See the
  # [K8S documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)
  # for more information
  tolerations:
    - key: "CriticalAddonsOnly"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  # -|- NodeAffinity section, See the
  # [K8S documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity)
  # for more information
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.azure.com/mode
              operator: In
              values:
                - system
linkerd-crds: {}

Would you like to work on fixing this bug?
None