
Proxy remains in error state after main container completed #14289

@micke-post

Description

What is the issue?

Hi,
we recently upgraded from edge-25.4.2 to edge-25.7.4. Since this upgrade, pods can no longer terminate properly because the linkerd-proxy container persists in an error state after the main container has completed.

How can it be reproduced?

To troubleshoot this, I wrote a simple demo pod:

apiVersion: v1
kind: Pod
metadata:
  name: simple-idle-pod
  annotations:
    # assumes the pod is created in a namespace with linkerd.io/inject: enabled,
    # so the proxy is injected automatically
    config.linkerd.io/proxy-log-level: debug
spec:
  containers:
  - name: idle-container
    image: ubuntu
    command: ["/bin/bash", "-c"]
    args:
      - |
        while true; do
          if [ -f /mnt/killpod ]; then
            echo "File /mnt/killpod found. Exiting."
            exit 0
          fi
          sleep 5
        done
  restartPolicy: Never

If I now connect to the main container and create /mnt/killpod, the main container terminates gracefully with exit code 0, while the linkerd-proxy container reports an error with exit code 2.
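A minimal sketch of how this can be triggered and observed (pod and container names as in the manifest above; the jsonpath assumes native sidecars, so the proxy shows up under initContainerStatuses):

# create the file that tells the main container to exit
kubectl exec simple-idle-pod -c idle-container -- touch /mnt/killpod

# once the pod has finished, compare the exit codes of the
# native-sidecar proxy (an init container) and the main container
kubectl get pod simple-idle-pod -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{": "}{.state.terminated.exitCode}{"\n"}{end}'
kubectl get pod simple-idle-pod -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state.terminated.exitCode}{"\n"}{end}'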

Logs, error output, etc

According to the proxy logs, everything looks fine: the proxy shuts down normally and no errors are thrown. The network validator logs look fine as well.
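For reference, the logs were pulled roughly like this (container names are the ones Linkerd injects; linkerd-network-validator is only present when the CNI plugin is used):

kubectl logs simple-idle-pod -c linkerd-proxy
kubectl logs simple-idle-pod -c linkerd-network-validator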

output of linkerd check -o short

linkerd-version
---------------
‼ cli is up-to-date
    is running version 25.7.4 but the latest edge version is 25.7.5
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 25.7.4 but the latest edge version is 25.7.5
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-destination-745c4964db-f5jz9 (edge-25.7.4)
        * linkerd-destination-745c4964db-n88xm (edge-25.7.4)
        * linkerd-identity-6fcb4976b4-5988p (edge-25.7.4)
        * linkerd-identity-6fcb4976b4-tbn8l (edge-25.7.4)
        * linkerd-proxy-injector-8474cc9fdf-jrc7w (edge-25.7.4)
        * linkerd-proxy-injector-8474cc9fdf-zh78k (edge-25.7.4)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints

Status check results are √

Environment

  • Kubernetes Version: 1.31.9
  • Cluster Environment: AKS
  • Host OS: Ubuntu
  • Linkerd version: 25.7.4

Possible solution

We tested multiple versions to narrow down when the issue started; the regression appears to have been introduced between edge-25.6.2 and edge-25.6.3:

  • 25.7.4 -> issue occurs
  • 25.7.2 -> issue occurs
  • 25.6.4 -> issue occurs
  • 25.6.3 -> issue occurs
  • 25.6.2 -> ok
  • 25.6.1 -> ok
  • 25.5.5 -> ok

If we deploy edge-25.7.4 but explicitly add the following values to our Helm file to pin the proxy to edge-25.6.2, the issue does not occur either, so it seems to be related to the proxy itself:

  proxy:
    image:
      name: cr.l5d.io/linkerd/proxy
      version: edge-25.6.2
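
To confirm the pin is effective, the injected proxy image can be checked on a meshed pod (a sketch; with native sidecars the proxy appears as an init container):

kubectl get pod simple-idle-pod -o jsonpath='{.spec.initContainers[?(@.name=="linkerd-proxy")].image}'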

This is a workaround at best; I'm not sure I'm comfortable having a version skew between the Linkerd control plane and its proxies.

Additional context

Below is our values file; all other settings are left at their default values:

linkerd2-cni:
  # -|- Tolerations section, See the
  # [K8S documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)
  # for more information
  tolerations:
  - key: "CriticalAddonsOnly"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

  # This toleration will be removed once all workloads are migrated to the new nodepool / Userpoolv3
  - key: "Userpoolv3"
    operator: Equal
    value: "true"
    effect: NoSchedule

linkerd-control-plane:
  # This should be the app version of the linkerd-control-plane chart we are using
  linkerdVersion: edge-25.7.4

  # proxy configuration
  proxy:
    # -- Enables the proxy's /shutdown admin endpoint
    enableShutdownEndpoint: true
    # -- Enable KEP-753 native sidecars
    # This is an experimental feature. It requires Kubernetes >= 1.29.
    # If enabled, .proxy.waitBeforeExitSeconds should not be used.
    nativeSidecar: true

  # -- Failure policy for the proxy injector
  webhookFailurePolicy: Fail

  identity:
    # -- If the linkerd-identity-trust-roots ConfigMap has already been created
    externalCA: true

    issuer:
      scheme: kubernetes.io/tls

  # proxy injector configuration
  proxyInjector:
    # -- Do not create a secret resource for the proxyInjector webhook.
    # If this is set to `true`, the value `proxyInjector.caBundle` must be set
    # or the ca bundle must be injected with the cert-manager ca injector using
    # `proxyInjector.injectCaFrom` or `proxyInjector.injectCaFromSecret` (see below).
    externalSecret: true
    # -- Inject the CA bundle from a cert-manager Certificate.
    # See the cert-manager [CA Injector Docs](https://cert-manager.io/docs/concepts/ca-injector/#injecting-ca-data-from-a-certificate-resource)
    # for more information.
    injectCaFrom: "linkerd/linkerd-proxy-injector-k8s-tls"

    objectSelector:
      matchExpressions:
      - key: linkerd.io/control-plane-component
        operator: DoesNotExist
      - key: linkerd.io/cni-resource
        operator: DoesNotExist
      - key: linkerd.io/wait-pod
        operator: DoesNotExist

  # service profile validator configuration
  profileValidator:
    # -- Do not create a secret resource for the profileValidator webhook.
    # If this is set to `true`, the value `profileValidator.caBundle` must be set
    # or the ca bundle must be injected with the cert-manager ca injector using
    # `profileValidator.injectCaFrom` or `profileValidator.injectCaFromSecret` (see below).
    externalSecret: true

    # -- Inject the CA bundle from a cert-manager Certificate.
    # See the cert-manager [CA Injector Docs](https://cert-manager.io/docs/concepts/ca-injector/#injecting-ca-data-from-a-certificate-resource)
    # for more information.
    injectCaFrom: "linkerd/linkerd-sp-validator-k8s-tls"

  # policy validator configuration
  policyValidator:
    # -- Do not create a secret resource for the policyValidator webhook.
    # If this is set to `true`, the value `policyValidator.caBundle` must be set
    # or the ca bundle must be injected with the cert-manager ca injector using
    # `policyValidator.injectCaFrom` or `policyValidator.injectCaFromSecret` (see below).
    externalSecret: true

    # -- Inject the CA bundle from a cert-manager Certificate.
    # See the cert-manager [CA Injector Docs](https://cert-manager.io/docs/concepts/ca-injector/#injecting-ca-data-from-a-certificate-resource)
    # for more information.
    injectCaFrom: "linkerd/linkerd-policy-validator-k8s-tls"
  
  # -|- Tolerations section, See the
  # [K8S documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)
  # for more information
  tolerations:
  - key: "CriticalAddonsOnly"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

  # -|- NodeAffinity section, See the
  # [K8S documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity)
  # for more information
  nodeAffinity: 
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.azure.com/mode
          operator: In
          values:
          - system

linkerd-crds: {}
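
For context on the nativeSidecar setting above: with KEP-753 native sidecars enabled, the injector adds the proxy as an init container with restartPolicy: Always rather than as a regular container. A simplified sketch of what the injected pod spec looks like (not our literal output):

spec:
  initContainers:
  - name: linkerd-proxy                        # injected by Linkerd
    image: cr.l5d.io/linkerd/proxy:edge-25.7.4
    restartPolicy: Always                      # KEP-753: marks this init container as a native sidecar
  containers:
  - name: idle-container
    image: ubuntu

This is why the proxy's exit code shows up under initContainerStatuses, and it appears to be why the pod as a whole ends up in an error state when the proxy exits with code 2.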

Would you like to work on fixing this bug?

None
