[feature] Enhance support and visualization for pod lifecycle failure in Kubeflow Pipelines

### Feature Area


/area frontend
/area backend




### What feature would you like to see?

A pipeline run in Kubeflow Pipelines (KFP) fails for one of two reasons: the user-defined pipeline script returns an error, or a pod lifecycle failure occurs. Pod lifecycle failures can be classified by lifecycle stage: provisioning level (e.g., ImagePullBackOff or Unschedulable), runtime level (e.g., CrashLoopBackOff or OOMKilled), or node level (e.g., NodeLost or Preempted).
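To make the classification concrete, here is a minimal sketch of a lookup table from pod status reason to lifecycle category. It only covers the example reasons named above; `FailureCategory`, `REASON_TO_CATEGORY`, and `classify_reason` are hypothetical names for illustration, not existing KFP code.

```python
from enum import Enum
from typing import Optional


class FailureCategory(Enum):
    """The three lifecycle stages proposed in this issue (hypothetical type)."""
    PROVISIONING = "provisioning"
    RUNTIME = "runtime"
    NODE = "node"


# Only the example reasons listed in this issue; a real table would need
# to be longer and agreed on by maintainers.
REASON_TO_CATEGORY = {
    "ImagePullBackOff": FailureCategory.PROVISIONING,
    "Unschedulable": FailureCategory.PROVISIONING,
    "CrashLoopBackOff": FailureCategory.RUNTIME,
    "OOMKilled": FailureCategory.RUNTIME,
    "NodeLost": FailureCategory.NODE,
    "Preempted": FailureCategory.NODE,
}


def classify_reason(reason: str) -> Optional[FailureCategory]:
    """Return the lifecycle category for a reason, or None if it is unknown."""
    return REASON_TO_CATEGORY.get(reason)
```

Returning `None` for unknown reasons keeps benign or unrecognized statuses from being treated as failures by default.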

KFP currently provides limited support when a pipeline run hits a pod lifecycle failure. While the Kubernetes CLI lets users watch the status of their pipeline pods in real time from the terminal, the KFP UI provides only minimal visualization. When a pod lifecycle failure occurs, the UI shows the pipeline frozen at the current pod: not progressing, finishing, or failing. The cause of the failure is not displayed, which makes for a confusing and frustrating user experience. KFP is intended to function as a Kubernetes abstraction for data scientists and AI engineers, so not all KFP users are experienced with Kubernetes. Requiring Kubernetes tools and skills leaks that abstraction.

The proposed solution has two parts: failure timeouts and enhanced visualization. First, time limits for the three categories of pod lifecycle failure outlined above would be configurable as environment variables on the API server. A default time limit of one hour would be provided, and users could optionally specify time limits per pod status (e.g., ImagePullBackOff). Second, pod lifecycle failures should be logged and displayed in the UI in the same way pipeline script failures are displayed, so that users get real-time pod lifecycle updates.
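As a rough illustration of how the timeout lookup could behave, the sketch below resolves a time limit with the precedence "per-status override, then per-category setting, then the one-hour default". The environment variable prefix `POD_FAILURE_TIMEOUT_SECONDS_` and both function names are assumptions made up for this example; the actual variable names and where the check would live in the API server are open for discussion.

```python
import os
from typing import Mapping, Optional

DEFAULT_TIMEOUT_SECONDS = 3600  # the one-hour default proposed in this issue

# Hypothetical naming scheme, e.g. POD_FAILURE_TIMEOUT_SECONDS_PROVISIONING
# or POD_FAILURE_TIMEOUT_SECONDS_IMAGEPULLBACKOFF.
ENV_PREFIX = "POD_FAILURE_TIMEOUT_SECONDS_"


def resolve_timeout(
    reason: str,
    category: str,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Pick the time limit for a pod status reason.

    A per-status variable wins over a per-category variable, which wins
    over the default. A non-integer value raises ValueError.
    """
    env = os.environ if env is None else env
    for key in (ENV_PREFIX + reason.upper(), ENV_PREFIX + category.upper()):
        raw = env.get(key)
        if raw is not None:
            return int(raw)
    return DEFAULT_TIMEOUT_SECONDS


def has_timed_out(first_seen: float, now: float, timeout_seconds: int) -> bool:
    """True once the failure status has persisted for the full time limit."""
    return now - first_seen >= timeout_seconds
```

For example, with only `POD_FAILURE_TIMEOUT_SECONDS_PROVISIONING=600` set, a pod stuck in ImagePullBackOff would be failed after ten minutes, while runtime and node-level failures would keep the one-hour default.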


### What is the use case or pain point?

Users whose pipeline runs hit pod lifecycle failures cannot see the failure in the UI and cannot control how long a run waits before it times out.

### Is there a workaround currently?
The current workaround for the missing failure timeout is to manually delete the pipeline run. The current workaround for the limited UI visualization is to run `kubectl get pods -w` in a terminal. There is no workaround for failure visualization directly within the UI.
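For anyone scripting the CLI workaround in the meantime, the same information can be read from the pod object, for example the JSON printed by `kubectl get pod <pod-name> -o json`. The sketch below collects status reasons from the standard Pod status fields (`status.reason`, the `PodScheduled` condition, and container `waiting`/`terminated` states). The function name is hypothetical, and the result may include benign reasons such as ContainerCreating, so callers would still need to filter against a list of known failure reasons.

```python
from typing import Any, Dict, List


def collect_status_reasons(pod: Dict[str, Any]) -> List[str]:
    """Collect candidate failure reasons from a parsed Pod JSON object."""
    status = pod.get("status") or {}
    reasons: List[str] = []

    # Pod-level reason, when the API server sets one.
    if status.get("reason"):
        reasons.append(status["reason"])

    # Scheduling problems surface on the PodScheduled condition.
    for cond in status.get("conditions") or []:
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason")
        ):
            reasons.append(cond["reason"])

    # Container-level reasons: waiting states and non-zero exits.
    for cs in status.get("containerStatuses") or []:
        state = cs.get("state") or {}
        waiting = state.get("waiting") or {}
        if waiting.get("reason"):
            reasons.append(waiting["reason"])
        terminated = state.get("terminated") or {}
        if terminated.get("reason") and terminated.get("exitCode", 0) != 0:
            reasons.append(terminated["reason"])

    return reasons
```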


---


Love this idea? Give it a 👍.

[feature] Enhance support and visualization for pod lifecycle failure in Kubeflow Pipelines #12843

Feature Area

What feature would you like to see?

What is the use case or pain point?

Is there a workaround currently?

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[feature] Enhance support and visualization for pod lifecycle failure in Kubeflow Pipelines #12843

Description

Feature Area

What feature would you like to see?

What is the use case or pain point?

Is there a workaround currently?

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions