feat(proposals): add KEP-12843 pod lifecycle failure support and visualization by khushiiagrawal · Pull Request #13517 · kubeflow/pipelines

khushiiagrawal · 2026-06-12T18:13:35Z

Description of your changes:

Adds the proposal for #12843 - surfacing pod lifecycle failures (ImagePullBackOff, OOMKilled, Unschedulable, etc.) in the KFP run details UI.

Currently, when a task pod fails at the Kubernetes level, the task node sits in a green running state forever with no error shown. The Argo node message that describes the failure is dropped during conversion and never reaches the UI.

This KEP proposes:

- Persisting the Argo node failure message on the Task model via a new nullable `LifecycleFailureMessage` column (additive schema change, no migration required)
- Returning it through the existing `task_details[].error` field in the v2beta1 `GetRun` response (no new API surface)
- Overriding the node visual state to Failed in the frontend and showing the failure reason in the side panel banner, even when MLMD still reports `RUNNING`

The initial implementation draft is open in #13516.

Related: Fixes #12843

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>

google-oss-prow · 2026-06-12T18:13:42Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zazulam for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow · 2026-06-12T18:13:47Z

Hi @khushiiagrawal. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a KEP document proposing end-to-end surfacing of Argo pod lifecycle failure messages in KFP (persisted in Task, exposed via existing task_details[].error, and rendered in the V2 run details UI).

Changes:

- Introduces a new KEP covering motivation, backend/frontend design, and risks/mitigations for lifecycle failure visibility.
- Documents the intended DB schema change (new nullable `LifecycleFailureMessage` column) and API/UI behavior.
- Provides a test plan and manual verification steps for failure/success/retry scenarios.

ntny · 2026-06-13T20:27:46Z

/lgtm

ntny · 2026-06-13T20:27:52Z

I really like the idea of using the message from AWF status.nodes[].message instead of simply surfacing pod events.

This also works in cases where the pod was never created at all (for example, when resource quotas are exceeded or Kyverno blocks pod creation due to security policies).

Good catch on filtering transient messages such as PodInitializing and ContainerCreating.

One design question that may be worth adding to the KEP: can Argo node messages be parsed into a more meaningful task lifecycle status in addition to surfacing the raw text message? Most of the useful messages here correspond to standard Kubernetes problems such as OOMKilled, image pull errors, resource/scheduling issues, or configuration/admission failures. For unknown or newly introduced failure patterns, the structured status could stay Unknown while the raw message would still give users the original diagnostic details.
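A minimal sketch of what such a classification could look like, assuming simple case-insensitive substring matching. The category names, the `ClassifyNodeMessage` function, and the matched substrings are illustrative assumptions, not the taxonomy from the KEP or any existing KFP code:

```go
package main

import "strings"

// LifecycleCategory is a hypothetical structured status derived from an
// Argo node message. The names below are illustrative only.
type LifecycleCategory string

const (
	CategoryOOMKilled  LifecycleCategory = "OOMKilled"
	CategoryImagePull  LifecycleCategory = "ImagePull"
	CategoryScheduling LifecycleCategory = "Scheduling"
	CategoryAdmission  LifecycleCategory = "Admission"
	CategoryUnknown    LifecycleCategory = "Unknown"
)

// classificationRules maps lowercase substrings to categories.
// The first matching rule wins. The substrings are assumptions about
// what typical messages contain, not an exhaustive list.
var classificationRules = []struct {
	needle   string
	category LifecycleCategory
}{
	{"oomkilled", CategoryOOMKilled},
	{"errimagepull", CategoryImagePull},
	{"imagepullbackoff", CategoryImagePull},
	{"unschedulable", CategoryScheduling},
	{"failedscheduling", CategoryScheduling},
	{"exceeded quota", CategoryAdmission},
	{"admission webhook", CategoryAdmission},
}

// ClassifyNodeMessage returns a structured category for a raw node message.
// Unrecognized messages fall back to CategoryUnknown so that the raw text
// can still be shown to the user unchanged.
func ClassifyNodeMessage(message string) LifecycleCategory {
	lower := strings.ToLower(message)
	for _, rule := range classificationRules {
		if strings.Contains(lower, rule.needle) {
			return rule.category
		}
	}
	return CategoryUnknown
}
```

With this shape, a newly introduced failure pattern needs no code change to remain visible: it is reported as Unknown alongside the raw message until a rule is added for it.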

Concerns

Reporting latency is one concern. Fast reporting is critical for recurring (cron) runs. Unlike regular runs, cron runs are created in MySQL through persistenceagent reporting.

There is already at least one path that depends on the run row existing before task execution finishes. After a task completes:

1. The driver writes a cache entry through CreateExecutionCache:
   https://github.com/kubeflow/pipelines/blob/master/backend/src/v2/cacheutils/cache.go#L140-L145
2. That goes through TaskServer.CreateTaskV1:
   https://github.com/kubeflow/pipelines/blob/master/backend/src/apiserver/server/task_server.go#L36-L50
3. ResourceManager.CreateTask first calls GetRun(t.RunID):
   https://github.com/kubeflow/pipelines/blob/master/backend/src/apiserver/resource/resource_manager.go#L991-L995

If a driver starts executing before the persistence agent has reported the cron run (ScheduledWorkflow), this path can fail because the run row does not exist yet.

Because of that, any additional processing of lifecycle messages should not significantly slow down the existing persistence agent reporting loop. It would be good to either ensure that this work has negligible impact on reporting latency or keep heavier processing on a separate path.

Could you also add a small performance measurement to the KEP for the persistence-agent reporting path before and after this change?

Possible future improvements

One open question is whether task updates should be deduplicated in the persistence-agent reporting path. Today the agent reports task status snapshots from the workflow, and this proposal extends that existing path with lifecycle failure messages. This is probably fine for the initial implementation, but it may make sense in the future to avoid sending/upserting updates for tasks whose effective state has not changed, to reduce unnecessary database writes.
Lifecycle messages add one extra nuance here: even when the task state is unchanged, Kubernetes/Argo messages may change in ways that do not provide meaningful new information. For example, an invalid image can alternate between ErrImagePull and ImagePullBackOff. If this becomes noisy, we may want to suppress repeated or semantically equivalent message updates as well.
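One way the suppression of equivalent updates could be sketched, assuming messages start with the Kubernetes reason string. The `equivalentReasons` table and both function names are hypothetical and only illustrate the idea:

```go
package main

import "strings"

// equivalentReasons maps reason prefixes to a shared key. Messages whose
// reasons share a key are treated as carrying no new information.
// This grouping is an assumption made for illustration.
var equivalentReasons = map[string]string{
	"ErrImagePull":     "image-pull",
	"ImagePullBackOff": "image-pull",
}

// normalizeMessage collapses semantically equivalent messages to one key
// and returns other messages trimmed but otherwise unchanged.
func normalizeMessage(message string) string {
	trimmed := strings.TrimSpace(message)
	for reason, key := range equivalentReasons {
		if strings.HasPrefix(trimmed, reason) {
			return key
		}
	}
	return trimmed
}

// shouldWriteUpdate reports whether a task snapshot differs enough from
// the previously stored one to justify a database write.
func shouldWriteUpdate(prevState, nextState, prevMessage, nextMessage string) bool {
	if prevState != nextState {
		return true
	}
	return normalizeMessage(prevMessage) != normalizeMessage(nextMessage)
}
```

Under these assumptions, a task flipping between ErrImagePull and ImagePullBackOff while still Running would produce no write, whereas a state change or a genuinely different message would.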
Another possible future improvement would be to distinguish between transient and terminal failures.
For example, OOMKilled is definitely terminal, and it may make sense to immediately transition the task into a Failed state in MySQL. On the other hand, ErrImagePull, ImagePullBackOff, or scheduling-related issues such as FailedScheduling are often transient conditions. In such cases, it may be worth introducing a separate Warning task state in MySQL rather than treating them as either Running or Failed.

Not suggesting this for the current KEP, but it may be worth keeping in mind as a future enhancement.

khushiiagrawal · 2026-06-15T09:21:46Z

@ntny Thanks for the thorough review! I've updated the KEP to address each point:

- Added a structured classification section with the proposed category taxonomy and the rationale for deferring it to a follow-up.
- Added a latency section covering the resolveNodeLifecycleMessages traversal and the cron run race condition. The traversal is in-memory only and doesn't affect the existing DB write path; a before/after measurement will be in the implementation PR.
- Added a Future Work section capturing the transient vs terminal distinction, message deduplication, and task update deduplication points.

Let me know if anything needs more detail!

khushiiagrawal · 2026-06-15T09:23:48Z

Hi @alyssacgoins @HumairAK @mprahl @james-jwu, I would really appreciate your thoughts and review on this.
Thanks!

ntny · 2026-06-15T17:14:54Z

/lgtm

alyssacgoins · 2026-06-15T17:40:23Z

+
+### Non-Goals
+
+1. Configurable per-class timeouts (for example, fail after one hour of `ImagePullBackOff`). The original issue mentions this. It is a useful follow-up but it is a separate piece of work and is not covered here.


Just curious - is there a specific reason why configurable timeouts are out of scope for this feature?

Yes. Timeouts need per-class timestamp tracking or a sweep mechanism in the persistence agent, which felt like a separate chunk of work from the core visibility fix. I've updated the Non-Goals section with a short rationale explaining this.

@alyssacgoins let me know what you think.

That makes sense to me. Once you've completed and merged this feature, be sure to open a follow-up issue for this.

alyssacgoins · 2026-06-15T18:34:11Z

+
+Frontend: existing `RunDetailsV2.test.tsx` continues to pass without changes, which is the regression bar. New behavior is exercised through the test cases that already cover `updateFlowElementsState`.
+
+### Manual Verification (E2E)


The manual verification you've included here is quite useful for verifying changes locally. Can you add some info on how you plan to validate/verify this feature (particularly frontend-wise) through integration testing in CI?

Good point, I've added a CI Validation section to the Test Plan covering a bad image API test and a frontend integration test for the banner. Both will ship with the implementation PR.

alyssacgoins · 2026-06-15T18:35:55Z

+
+Rollback is straightforward. Reverting the API server image and either dropping the column or leaving it in place unused restores the previous behavior. The column is informational only, so there is no data loss.
+
+## Frontend Considerations


Can you emphasize here that the frontend will clearly display pod-level failures with this feature?

alyssacgoins · 2026-06-15T18:45:37Z

+
+On the UI side, the run details page renders the pipeline graph from MLMD execution records. The driver pod writes the MLMD execution row in `RUNNING` state before the user container starts. If the user container then fails to start at all, the MLMD record stays at `RUNNING` forever, and the graph node renders green and spins indefinitely.
+
+KFP positions itself as a Kubernetes abstraction for data scientists and ML engineers. Many of its users are not Kubernetes operators. When the UI shows a task that never finishes and never fails, with no message, the only path forward is to ask someone else to run `kubectl get pods`. That breaks the abstraction the product is built on.


This is a great paragraph. This expresses the exact core of the problem, and what we're trying to solve with this feature. Great work 🙂

alyssacgoins · 2026-06-15T18:53:48Z

+
+Add `LifecycleFailureMessage` to `taskColumns`, to `scanRows`, and to the `CreateTask` and `CreateOrUpdateTasks` insert paths.
+
+In `patchTask`, do not preserve the previous value of this column. The fresh filtered value computed from the latest workflow state always wins. This is what makes the field self-clearing if a pod recovers on retry.


The overwrite logic you propose here is a departure from the current `patchTask` logic in backend/src/apiserver/storage/task_store.go, which fills only empty fields. Your logic makes sense in the context of this feature, but make sure it does not break any existing `patchTask` behavior. You may need to create a separate method.

Thanks for flagging this Alyssa. There's actually no special overwrite logic added. LifecycleFailureMessage is just left out of the preserve-if-empty loop entirely, so the fresh value from the workflow sync always wins and the existing behavior for all other fields stays the same. I've updated the KEP wording to make this clearer.
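The behavior described in this thread can be sketched roughly as follows. The `Task` struct, its fields other than `LifecycleFailureMessage`, and the exact-match transient filter are simplified assumptions for illustration and do not mirror the real `task_store.go` code:

```go
package main

import "strings"

// Task is a simplified, hypothetical stand-in for the KFP Task model.
type Task struct {
	Name                    string
	PodName                 string
	LifecycleFailureMessage string
}

// transientMessages lists node messages treated as normal startup noise.
// Matching on the exact trimmed string is an assumption of this sketch.
var transientMessages = map[string]bool{
	"PodInitializing":   true,
	"ContainerCreating": true,
}

// filterLifecycleMessage drops transient messages and keeps everything else.
func filterLifecycleMessage(message string) string {
	trimmed := strings.TrimSpace(message)
	if transientMessages[trimmed] {
		return ""
	}
	return trimmed
}

// patchTask merges an incoming task snapshot into the stored one.
func patchTask(existing, incoming Task) Task {
	merged := incoming

	// Preserve-if-empty: an empty incoming field keeps the stored value.
	if merged.Name == "" {
		merged.Name = existing.Name
	}
	if merged.PodName == "" {
		merged.PodName = existing.PodName
	}

	// LifecycleFailureMessage is deliberately left out of the
	// preserve-if-empty handling, so the fresh filtered value always wins
	// and the field clears itself when the pod recovers.
	merged.LifecycleFailureMessage = filterLifecycleMessage(incoming.LifecycleFailureMessage)
	return merged
}
```

In this sketch, a sync that arrives with no failure message after a successful retry clears the stored message, while the other fields keep their existing preserve-if-empty behavior.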

alyssacgoins · 2026-06-15T18:57:01Z

+Backend, in `backend/src/apiserver/storage/task_store_test.go`: existing CRUD tests cover the new column once it is added to the column list. No new test files are required.
+
+Frontend: existing `RunDetailsV2.test.tsx` continues to pass without changes, which is the regression bar. New behavior is exercised through the test cases that already cover `updateFlowElementsState`.
+


Can you add some information on additional frontend unit test coverage for this feature?

Sure, updated the Test Plan with specific test cases covering the state override and banner rendering in DynamicFlow and RuntimeNodeDetailsV2.

alyssacgoins · 2026-06-15T18:58:10Z

@jeffspahr can we get your thoughts here from the frontend perspective?

alyssacgoins · 2026-06-15T18:58:44Z

@droctothorpe @zazulam can we get your thoughts?

ntny · 2026-06-15T19:46:21Z

I'm all for this proposal. The only risk I see is hypothetical: if KFP wants to add another backend in addition to Argo in the future, would this KEP block that?
What makes this less concerning to me is that most of the work seems to be on the persistence agent and UI side.

alyssacgoins · 2026-06-15T20:09:55Z

I'm all for this proposal. The only risk I see is hypothetical: if KFP wants to add another backend in addition to Argo in the future, would this KEP block that? What makes this less concerning to me is that most of the work seems to be on the persistence agent and UI side.

@ntny If KFP were to add an additional backend in the future, I think the design would just need to propagate pod failure events by querying the K8s API directly. @khushiiagrawal includes this as an alternative in her doc. This alternative is unnecessary in this case because the Argo message already contains the information, but it's an option for a non-Argo backend.

khushiiagrawal · 2026-06-15T21:09:30Z

@alyssacgoins I have addressed the changes, PTAL!
Thanks.

ntny · 2026-06-16T08:39:35Z

+
+A KFP pipeline task can fail in two ways. The first is a user-script failure, where the Python code inside the container raises an exception. The second is a pod lifecycle failure, where the Kubernetes pod backing the task never reaches a healthy running state, or is killed by the system. Common examples of the second category are `ImagePullBackOff`, `Unschedulable`, `OOMKilled`, `CrashLoopBackOff`, and `NodeLost`.
+
+KFP handles the first case well today. The second case is currently invisible in the UI. The task node stays in a running state forever, no error is shown, and the user has no way to find out what happened without using `kubectl`. This proposal threads the pod lifecycle failure message that Argo Workflows already records all the way through the API server and into the run details page, so the task node turns red and the failure reason is shown in the side panel, just like a user-script failure is shown today.


I agree terminal vs transient classification can be deferred, but I’d handle FailedScheduling / Unschedulable carefully in the initial design.

On a well-utilized cluster, a pod may wait for resources and then get scheduled once other workloads finish. Showing that task as red could be misleading because it may recover without user action, and frequent red -> Running transitions would look strange in the UI.
This feels more like a warning.

Also, as far as I understand, the current V2 UI maps node status from MLMD executions rather than tasks, while there is ongoing MLMD-removal work to move it to tasks:

feat(backend): switch runtime MLMD tracking to task APIs #13478

feat(backend): Replace MLMD with KFP Server APIs #12430

@ntny @khushiiagrawal can we add a color (such as yellow) that indicates warning rather than failure for these cases? That way we're still communicating status.
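A rough sketch of how a warning level could be derived from the message, assuming the reason string appears somewhere in the message text. The `Severity` type, the `SeverityFor` function, and the list of reasons treated as recoverable are hypothetical and not part of the KEP:

```go
package main

import "strings"

// Severity is a hypothetical display level for a lifecycle message.
type Severity string

const (
	SeverityNone    Severity = ""
	SeverityWarning Severity = "Warning" // could be rendered yellow
	SeverityFailed  Severity = "Failed"  // could be rendered red
)

// recoverableReasons lists reasons that often resolve without user action.
// Which reasons belong here is a design decision left open in this thread.
var recoverableReasons = []string{
	"Unschedulable",
	"FailedScheduling",
	"ErrImagePull",
	"ImagePullBackOff",
}

// SeverityFor maps a lifecycle message to a display severity. An empty
// message yields SeverityNone, a recoverable reason yields SeverityWarning,
// and any other message yields SeverityFailed.
func SeverityFor(message string) Severity {
	if strings.TrimSpace(message) == "" {
		return SeverityNone
	}
	for _, reason := range recoverableReasons {
		if strings.Contains(message, reason) {
			return SeverityWarning
		}
	}
	return SeverityFailed
}
```

Treating every unrecognized message as Failed is the conservative choice in this sketch; a real design might prefer a separate Unknown level.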

alyssacgoins · 2026-06-16T19:50:52Z

+- A dedicated test pipeline with a deliberately bad container image that verifies the failing task's `error.message` contains the expected pod lifecycle failure string via the v2beta1 `GetRun` API.
+- A frontend integration test (in `test/frontend-integration-test/`) that submits the bad-image pipeline, waits for the task node to turn red, and asserts the banner text in the side panel.
+
+These CI additions will be part of the implementation PR rather than this KEP.


Suggested change

These CI additions will be part of the implementation PR rather than this KEP.

You can remove this line - the KEP is a feature outline, so any work here will always be a part of the implementation PR

alyssacgoins · 2026-06-16T19:53:29Z

+
+The existing KFP end-to-end test suite in `.github/workflows/e2e-test.yml` runs full pipeline runs against a live cluster and verifies final task states. Once this change lands, one or more of the following will be added to cover the lifecycle failure path in CI:
+
+- A dedicated test pipeline with a deliberately bad container image that verifies the failing task's `error.message` contains the expected pod lifecycle failure string via the v2beta1 `GetRun` API.


Can you create a matrix here outlining each scenario, the resulting pod status, and the expected behavior?
You can use this matrix from my Literal Inputs KEP as reference.

proposals: add KEP-12843 pod lifecycle failure support

9884f0b

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>

Copilot AI review requested due to automatic review settings June 12, 2026 18:13

google-oss-prow Bot requested review from HumairAK, james-jwu and mprahl June 12, 2026 18:13

google-oss-prow Bot added needs-ok-to-test size/L labels Jun 12, 2026

Copilot AI reviewed Jun 12, 2026

docs: update README for KEP-12843 with additional backend file details and schema change clarification

77913e8

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>

khushiiagrawal mentioned this pull request Jun 12, 2026

feat(backend,frontend): surface pod lifecycle failures in the KFP UI. Fixes #12843 #13516

Open

google-oss-prow Bot assigned ntny Jun 13, 2026

google-oss-prow Bot added the lgtm label Jun 13, 2026

docs: enhance README for KEP-12843 with structured message classification and reporting latency details

610c19c

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>

google-oss-prow Bot removed the lgtm label Jun 15, 2026

google-oss-prow Bot added the lgtm label Jun 15, 2026

alyssacgoins suggested changes Jun 15, 2026

google-oss-prow Bot assigned alyssacgoins Jun 15, 2026

google-oss-prow Bot removed the lgtm label Jun 15, 2026

Merge branch 'master' of https://github.com/kubeflow/pipelines into kep/12843-pod-lifecycle-failure

194c7d3

docs: clarify non-goals and frontend behavior for KEP-12843 pod lifecycle failure proposal

109c64e

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>

ntny reviewed Jun 16, 2026

alyssacgoins suggested changes Jun 16, 2026
