pipelines: Components fail to start in Cilium-backed cluster due to missing NetworkPolicy resources #3471

@christian-heusel

Validation Checklist

  • I confirm that this is a Kubeflow-related issue.
  • I am reporting this in the appropriate repository.
  • I have followed the Kubeflow installation guidelines.
  • The issue report is detailed and includes version numbers where applicable.
  • I have considered adding my company to the adopters page to support Kubeflow.

Version

master

Detailed Description

Several Kubeflow Pipelines components fail to become ready in a cluster where Cilium is the CNI
(configured alongside Istio CNI per the Kubeflow README). The pods enter CrashLoopBackOff because
inter-component gRPC and HTTP connections are blocked by Cilium with "Operation not permitted".

Affected pods:

  • metadata-writer — cannot reach the MLMD gRPC store (metadata-grpc-deployment) on port 8080
  • ml-pipeline-persistenceagent — cannot reach ml-pipeline API server on port 8888
  • ml-pipeline and ml-pipeline-scheduledworkflow — fail for similar connectivity reasons

The pipeline component manifests define no NetworkPolicy resources for inter-component traffic,
so Cilium's default-deny policy blocks all cross-pod connections that are not explicitly permitted.
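One way to check this on an affected cluster is to list the policy objects that exist for the namespace. The sketch below assumes `kubectl` access and that the Cilium CRDs are installed; `list_policies` is a hypothetical helper name, not part of Kubeflow or Cilium.

```shell
# Hypothetical helper: list policy objects that could permit traffic
# between pods in a namespace (defaults to "kubeflow").
list_policies() {
  ns="${1:-kubeflow}"
  # Standard Kubernetes NetworkPolicy objects in the namespace.
  kubectl get networkpolicy -n "$ns"
  # Cilium-specific namespaced policies (CRD exists only when Cilium is installed).
  kubectl get ciliumnetworkpolicies -n "$ns"
  # Cilium cluster-wide policies, which are not namespaced.
  kubectl get ciliumclusterwidenetworkpolicies
}
```

Running `list_policies kubeflow` and finding no policy that covers the pipeline components would be consistent with the diagnosis above.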

Steps to Reproduce

  1. Install Kubeflow from the master branch on a Kubernetes cluster that uses Cilium as the CNI (see the
    Cilium + Istio setup described in a comment on #3455, "Update kubeflow/notebooks manifests from v2.0.0-alpha.2").
  2. Observe the following pods in the kubeflow namespace stuck in CrashLoopBackOff:
    NAME                                                     READY   STATUS             RESTARTS
    metadata-grpc-deployment-589ccc5c9d-zndb2                1/2     CrashLoopBackOff   2495
    metadata-writer-6c7657b97c-xnpf7                         1/2     CrashLoopBackOff   1214
    ml-pipeline-85cb9cdd7-b7l8g                              1/2     CrashLoopBackOff   2038
    ml-pipeline-scheduledworkflow-67c5dbfbb8-jkgcc           1/2     CrashLoopBackOff   1349
    
  3. Inspect logs:
    $ kubectl logs -n kubeflow deployment/metadata-writer --tail=10
    Failed to access the Metadata store. Exception: "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.96.186.182:8080: connect: Operation not permitted (1)"
    RuntimeError: Could not connect to the Metadata store
    
    $ kubectl logs -n kubeflow deployment/ml-pipeline-persistenceagent --tail=10
    level=fatal msg="Error creating ML pipeline API Server client: Failed to initialize pipeline client. Error: ... dial tcp 10.96.23.93:8888: connect: operation not permitted"
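
To pick out the failing pods from step 2 without reading the table by eye, a small filter over the `kubectl get pods` output may help. This is a minimal sketch: `crashlooping_pods` is a hypothetical helper, and it assumes the default column order (NAME, READY, STATUS, RESTARTS).

```shell
# Hypothetical helper: print the names of pods whose STATUS column is
# CrashLoopBackOff. Reads `kubectl get pods` table output on stdin.
# The header line is skipped implicitly because its third field is "STATUS".
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n kubeflow | crashlooping_pods
```

Each name it prints can then be passed to `kubectl logs -n kubeflow <pod>` to look for the "Operation not permitted" errors shown above.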
    

Screenshots or Videos

N/A
