Validation Checklist
The issue report is detailed and includes version numbers where applicable.
I have considered adding my company to the adopters page to support Kubeflow.
Version
master
Detailed Description
Several Kubeflow Pipelines components fail to become ready in a cluster where Cilium is the CNI
(configured alongside Istio CNI per the Kubeflow README). The pods enter CrashLoopBackOff because
inter-component gRPC and HTTP connections are blocked by Cilium with "Operation not permitted".
Affected pods:
metadata-writer — cannot reach the MLMD gRPC store (metadata-grpc-deployment) on port 8080
ml-pipeline-persistenceagent — cannot reach ml-pipeline API server on port 8888
ml-pipeline and ml-pipeline-scheduledworkflow — fail for similar connectivity reasons
The pipeline component manifests define no NetworkPolicy resources for inter-component traffic,
so Cilium's default-deny policy blocks all cross-pod connections that are not explicitly permitted.
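To make the missing piece concrete, here is a minimal sketch of a helper that prints a NetworkPolicy admitting TCP traffic on one port from one set of pods to another. The function name `render_allow_policy` and its arguments are hypothetical and not part of the Kubeflow manifests. Whether such a policy resolves the failure in this cluster is untested.

```shell
#!/bin/sh
# Hypothetical helper (not part of Kubeflow): prints a NetworkPolicy manifest
# that allows ingress on one TCP port from pods matching one label to pods
# matching another label in the kubeflow namespace.
# Usage: render_allow_policy <policy-name> <target key=value> <source key=value> <port>
render_allow_policy() {
  name=$1
  target=$2
  src=$3
  port=$4
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ${name}
  namespace: kubeflow
spec:
  podSelector:
    matchLabels:
      ${target%%=*}: ${target#*=}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              ${src%%=*}: ${src#*=}
      ports:
        - protocol: TCP
          port: ${port}
EOF
}
```

A possible invocation for the metadata-writer to MLMD path is shown below. The label values are assumptions and should be checked with `kubectl get pods -n kubeflow --show-labels` before applying anything:

```shell
render_allow_policy allow-metadata-writer-to-grpc \
  component=metadata-grpc-server app=metadata-writer 8080 | kubectl apply -f -
```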
Observe the following pods in the kubeflow namespace stuck in CrashLoopBackOff:
NAME                                             READY   STATUS             RESTARTS
metadata-grpc-deployment-589ccc5c9d-zndb2        1/2     CrashLoopBackOff   2495
metadata-writer-6c7657b97c-xnpf7                 1/2     CrashLoopBackOff   1214
ml-pipeline-85cb9cdd7-b7l8g                      1/2     CrashLoopBackOff   2038
ml-pipeline-scheduledworkflow-67c5dbfbb8-jkgcc   1/2     CrashLoopBackOff   1349
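For anyone trying to reproduce this, a listing like the one above can be narrowed to the failing pods with a small filter. This is a sketch only: `crashlooping_pods` is a hypothetical helper name, and it assumes the default `kubectl get pods` column order, with STATUS as the third column.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints the
# names of pods whose STATUS column (third field) is CrashLoopBackOff.
# The header row is skipped naturally because its third field is "STATUS".
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}
```

Usage would look like `kubectl get pods -n kubeflow | crashlooping_pods`.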
Inspect logs:
$ kubectl logs -n kubeflow deployment/metadata-writer --tail=10
Failed to access the Metadata store. Exception: "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.96.186.182:8080: connect: Operation not permitted (1)"
RuntimeError: Could not connect to the Metadata store
$ kubectl logs -n kubeflow deployment/ml-pipeline-persistenceagent --tail=10
level=fatal msg="Error creating ML pipeline API Server client: Failed to initialize pipeline client. Error: ... dial tcp 10.96.23.93:8888: connect: operation not permitted"
Steps to Reproduce
master branch on a Kubernetes cluster using Cilium as CNI (see Cilium + Istio setup described in Update kubeflow/notebooks manifests from v2.0.0-alpha.2 #3455 (comment)).
Pods in the kubeflow namespace remain stuck in CrashLoopBackOff.
Screenshots or Videos
N/A