feat(proposals): add KEP-12843 pod lifecycle failure support and visualization#13517

Open
khushiiagrawal wants to merge 5 commits into
kubeflow:masterfrom
khushiiagrawal:kep/12843-pod-lifecycle-failure

Conversation

@khushiiagrawal

Contributor

Description of your changes:

Adds the proposal for #12843 - surfacing pod lifecycle failures (ImagePullBackOff, OOMKilled, Unschedulable, etc.) in the KFP run details UI.

Currently, when a task pod fails at the Kubernetes level, the task node sits in a green running state forever with no error shown. The Argo node message that describes the failure is dropped during conversion and never reaches the UI.

This KEP proposes:

  • Persisting the Argo node failure message on the Task model via a new nullable LifecycleFailureMessage column (additive schema change, no migration required)
  • Returning it through the existing task_details[].error field in the v2beta1 GetRun response (no new API surface)
  • Overriding the node visual state to Failed in the frontend and showing the failure reason in the side panel banner, even when MLMD still reports RUNNING
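A minimal sketch of how the first two bullets might fit together, assuming the nullable column is modeled as a pointer and that the API error object carries a `message` field. The `Task`, `Status`, `TaskDetail`, and `ToTaskDetail` names are hypothetical stand-ins for illustration, not the actual KFP types.

```go
package main

import "fmt"

// Task is a simplified, hypothetical stand-in for the stored task row.
type Task struct {
	Name string
	// A nil pointer models SQL NULL: no lifecycle failure has been recorded.
	LifecycleFailureMessage *string
}

// Status is a hypothetical stand-in for the object behind task_details[].error.
type Status struct {
	Message string
}

// TaskDetail is a hypothetical stand-in for one task_details[] entry.
type TaskDetail struct {
	DisplayName string
	Error       *Status
}

// ToTaskDetail copies the persisted lifecycle failure message into the
// existing error field, and leaves the field nil when there is nothing to show.
func ToTaskDetail(t Task) TaskDetail {
	d := TaskDetail{DisplayName: t.Name}
	if t.LifecycleFailureMessage != nil && *t.LifecycleFailureMessage != "" {
		d.Error = &Status{Message: *t.LifecycleFailureMessage}
	}
	return d
}

func main() {
	msg := "ImagePullBackOff: Back-off pulling image \"example.invalid/missing:latest\""
	failing := ToTaskDetail(Task{Name: "train", LifecycleFailureMessage: &msg})
	healthy := ToTaskDetail(Task{Name: "eval"})
	fmt.Println(failing.Error != nil, healthy.Error == nil) // true true
}
```

Because the message travels in an existing field, a client that already reads the error object would need no changes to see it.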

The initial implementation draft is open in #13516.

Related: Fixes #12843

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Copilot AI review requested due to automatic review settings June 12, 2026 18:13
@google-oss-prow


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zazulam for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow


Hi @khushiiagrawal. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a KEP document proposing end-to-end surfacing of Argo pod lifecycle failure messages in KFP (persisted in Task, exposed via existing task_details[].error, and rendered in the V2 run details UI).

Changes:

  • Introduces a new KEP covering motivation, backend/frontend design, and risks/mitigations for lifecycle failure visibility.
  • Documents intended DB schema change (new nullable LifecycleFailureMessage column) and API/UI behavior.
  • Provides a test plan and manual verification steps for failure/success/retry scenarios.

Comment thread proposals/12843-pod-lifecycle-failure/README.md
Comment thread proposals/12843-pod-lifecycle-failure/README.md Outdated
Comment thread proposals/12843-pod-lifecycle-failure/README.md Outdated
…s and schema change clarification

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@ntny

ntny commented Jun 13, 2026

Contributor

/lgtm

@ntny

ntny commented Jun 13, 2026

Contributor

I really like the idea of using the message from AWF status.nodes[].message instead of simply surfacing pod events.

This also works in cases where the pod was never created at all (for example, when resource quotas are exceeded or Kyverno blocks pod creation due to security policies).

Good catch on filtering transient messages such as PodInitializing and ContainerCreating.

One design question that may be worth adding to the KEP: can Argo node messages be parsed into a more meaningful task lifecycle status in addition to surfacing the raw text message? Most of the useful messages here correspond to standard Kubernetes problems such as OOMKilled, image pull errors, resource/scheduling issues, or configuration/admission failures. For unknown or newly introduced failure patterns, the structured status could stay Unknown while the raw message would still give users the original diagnostic details.
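To make the question concrete, here is a rough sketch of such a classifier. It assumes the Argo node message contains the usual Kubernetes reason strings as substrings; the `Category` type, the marker lists, and the `Classify` function are hypothetical and are not taken from the KEP.

```go
package main

import (
	"fmt"
	"strings"
)

// Category is a hypothetical structured status derived from a raw node message.
type Category string

const (
	CategoryNone       Category = "" // empty or transient message, nothing to surface
	CategoryOOM        Category = "OOMKilled"
	CategoryImagePull  Category = "ImagePull"
	CategoryScheduling Category = "Scheduling"
	CategoryAdmission  Category = "Admission"
	CategoryUnknown    Category = "Unknown"
)

// transientMarkers are normal startup messages that should be filtered out.
var transientMarkers = []string{"PodInitializing", "ContainerCreating"}

// rules maps assumed message substrings to categories; the first match wins.
var rules = []struct {
	marker   string
	category Category
}{
	{"OOMKilled", CategoryOOM},
	{"ErrImagePull", CategoryImagePull},
	{"ImagePullBackOff", CategoryImagePull},
	{"Unschedulable", CategoryScheduling},
	{"FailedScheduling", CategoryScheduling},
	{"exceeded quota", CategoryAdmission},
	{"admission webhook", CategoryAdmission},
}

// Classify returns a category for the message and falls back to Unknown so
// that unrecognized failures are still surfaced with their raw text.
func Classify(message string) Category {
	msg := strings.TrimSpace(message)
	if msg == "" {
		return CategoryNone
	}
	for _, t := range transientMarkers {
		if strings.Contains(msg, t) {
			return CategoryNone
		}
	}
	for _, r := range rules {
		if strings.Contains(msg, r.marker) {
			return r.category
		}
	}
	return CategoryUnknown
}

func main() {
	for _, m := range []string{
		"PodInitializing",
		"ImagePullBackOff: Back-off pulling image",
		"OOMKilled (exit code 137)",
		"some newly introduced failure",
	} {
		fmt.Printf("%q -> %q\n", m, Classify(m))
	}
}
```

Substring matching is brittle if message formats change between Argo or Kubernetes versions, which is one reason to keep the raw message alongside any structured status.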

Concerns

Reporting latency is one concern. Fast reporting is critical for recurring (cron) runs. Unlike regular runs, cron runs are created in MySQL through persistenceagent reporting.

There is already at least one path that depends on the run row existing before task execution finishes. After a task completes:

Possible future improvements

  1. One open question is whether task updates should be deduplicated in the persistence-agent reporting path. Today the agent reports task status snapshots from the workflow, and this proposal extends that existing path with lifecycle failure messages. This is probably fine for the initial implementation, but it may make sense in the future to avoid sending/upserting updates for tasks whose effective state has not changed, to reduce unnecessary database writes.
    Lifecycle messages add one extra nuance here: even when the task state is unchanged, Kubernetes/Argo messages may change in ways that do not provide meaningful new information. For example, an invalid image can alternate between ErrImagePull and ImagePullBackOff. If this becomes noisy, we may want to suppress repeated or semantically equivalent message updates as well.

  2. Another possible future improvement would be to distinguish between transient and terminal failures.
    For example, OOMKilled is definitely terminal, and it may make sense to immediately transition the task into a Failed state in MySQL. On the other hand, ErrImagePull, ImagePullBackOff, or scheduling-related issues such as FailedScheduling are often transient conditions. In such cases, it may be worth introducing a separate Warning task state in MySQL rather than treating them as either Running or Failed.

Not suggesting this for the current KEP, but it may be worth keeping in mind as a future enhancement.
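The deduplication idea in point 1 could look roughly like the sketch below. It assumes the agent can compare the previous and next snapshot of a task before writing; `TaskSnapshot`, `canonical`, and `ShouldReport` are hypothetical names, not part of the persistence agent.

```go
package main

import (
	"fmt"
	"strings"
)

// TaskSnapshot is a hypothetical, simplified view of what the agent reports.
type TaskSnapshot struct {
	State   string
	Message string
}

// imagePullMarkers alternate for the same root cause: an image that cannot be pulled.
var imagePullMarkers = []string{"ErrImagePull", "ImagePullBackOff"}

// canonical collapses semantically equivalent messages into one comparison key.
func canonical(message string) string {
	for _, m := range imagePullMarkers {
		if strings.Contains(message, m) {
			return "image-pull"
		}
	}
	return strings.TrimSpace(message)
}

// ShouldReport returns true only when the state changed or the message
// changed in a way that carries new information.
func ShouldReport(prev, next TaskSnapshot) bool {
	if prev.State != next.State {
		return true
	}
	return canonical(prev.Message) != canonical(next.Message)
}

func main() {
	prev := TaskSnapshot{State: "RUNNING", Message: "ErrImagePull"}
	flip := TaskSnapshot{State: "RUNNING", Message: "ImagePullBackOff"}
	oom := TaskSnapshot{State: "RUNNING", Message: "OOMKilled"}
	fmt.Println(ShouldReport(prev, flip)) // false: same root cause
	fmt.Println(ShouldReport(prev, oom))  // true: new information
}
```

Collapsing every image pull message into one key would also hide a change of image name within the message, so a real implementation would need to decide how much detail the key keeps.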

@google-oss-prow google-oss-prow Bot added the lgtm label Jun 13, 2026
…tion and reporting latency details

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@google-oss-prow google-oss-prow Bot removed the lgtm label Jun 15, 2026
@khushiiagrawal

Contributor Author

@ntny Thanks for the thorough review! I've updated the KEP to address each point:

  • Added a structured classification section with the proposed category taxonomy and rationale for deferring it to a follow up
  • Added a latency section covering the resolveNodeLifecycleMessages traversal and the cron run race condition. The traversal is in-memory only and doesn't affect the existing DB write path; a before/after measurement will be in the implementation PR
  • Added a Future Work section capturing the transient vs terminal distinction, message deduplication, and task update deduplication points

Let me know if anything needs more detail!

@khushiiagrawal

Contributor Author

Hi @alyssacgoins @HumairAK @mprahl @james-jwu, I would really appreciate your thoughts and review on this.
Thanks!

@ntny

ntny commented Jun 15, 2026

Contributor

/lgtm

@google-oss-prow google-oss-prow Bot added the lgtm label Jun 15, 2026

### Non-Goals

1. Configurable per-class timeouts (for example, fail after one hour of `ImagePullBackOff`). The original issue mentions this. It is a useful follow-up but it is a separate piece of work and is not covered here.

Contributor

Just curious - is there a specific reason why configurable timeouts are out of scope for this feature?

Contributor Author

Yes. Timeouts need per-class timestamp tracking or a sweep mechanism in the persistence agent, which felt like a separate chunk of work from the core visibility fix. I've updated the Non-Goals section with a short rationale explaining this.

Contributor Author

@alyssacgoins let me know what you think.

Contributor

That makes sense to me. Once you've completed/merged this feature, be sure to add a follow-up issue for this.


Frontend: existing `RunDetailsV2.test.tsx` continues to pass without changes, which is the regression bar. New behavior is exercised through the test cases that already cover `updateFlowElementsState`.

### Manual Verification (E2E)

Contributor

The manual verification you've included here is quite useful for verifying changes locally. Can you add some info on how you plan to validate/verify this feature (particularly frontend-wise) through integration testing in CI?

Contributor Author

Good point, I've added a CI Validation section to the Test Plan covering a bad image API test and a frontend integration test for the banner. Both will ship with the implementation PR.


Rollback is straightforward. Reverting the API server image and either dropping the column or leaving it in place unused restores the previous behavior. The column is informational only, so there is no data loss.

## Frontend Considerations

Contributor

Can you emphasize here that the frontend will clearly display pod-level failures with this feature?


On the UI side, the run details page renders the pipeline graph from MLMD execution records. The driver pod writes the MLMD execution row in `RUNNING` state before the user container starts. If the user container then fails to start at all, the MLMD record stays at `RUNNING` forever, and the graph node renders green and spins indefinitely.

KFP positions itself as a Kubernetes abstraction for data scientists and ML engineers. Many of its users are not Kubernetes operators. When the UI shows a task that never finishes and never fails, with no message, the only path forward is to ask someone else to run `kubectl get pods`. That breaks the abstraction the product is built on.

Contributor

This is a great paragraph. This expresses the exact core of the problem, and what we're trying to solve with this feature. Great work 🙂


Add `LifecycleFailureMessage` to `taskColumns`, to `scanRows`, and to the `CreateTask` and `CreateOrUpdateTasks` insert paths.

In `patchTask`, do not preserve the previous value of this column. The fresh filtered value computed from the latest workflow state always wins. This is what makes the field self-clearing if a pod recovers on retry.

Contributor

The overwrite logic you propose here is a departure from the current backend/src/apiserver/storage/task_store.go/patch_task() logic, which fills only empty fields. Your logic makes sense in the context of this feature, but make sure it does not break any existing patch_task() behavior - you may need to create a separate method.

Contributor Author

Thanks for flagging this Alyssa. There's actually no special overwrite logic added. LifecycleFailureMessage is just left out of the preserve-if-empty loop entirely, so the fresh value from the workflow sync always wins and the existing behavior for all other fields stays the same. I've updated the KEP wording to make this clearer.
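As a rough illustration of the behavior described in this thread, the sketch below assumes the patch step keeps a stored value whenever the incoming value is empty, and simply leaves the lifecycle message out of that rule. The `Task` fields and this `patchTask` body are simplified, hypothetical stand-ins and do not reproduce the real `task_store.go` code.

```go
package main

import "fmt"

// Task is a simplified, hypothetical stand-in for the stored task row.
type Task struct {
	Name                    string
	PodName                 string
	LifecycleFailureMessage string
}

// patchTask merges an incoming task into the stored one.
func patchTask(stored, incoming Task) Task {
	merged := incoming

	// Preserve-if-empty: fall back to the stored value when the
	// incoming value is empty.
	if merged.Name == "" {
		merged.Name = stored.Name
	}
	if merged.PodName == "" {
		merged.PodName = stored.PodName
	}

	// LifecycleFailureMessage is deliberately absent from the rule above.
	// The value computed from the latest workflow state always wins, so an
	// empty incoming value clears a previously stored failure.
	return merged
}

func main() {
	stored := Task{Name: "train", PodName: "train-pod-1", LifecycleFailureMessage: "ImagePullBackOff"}
	// The latest sync carries no failure message because the pod recovered on retry.
	incoming := Task{}
	merged := patchTask(stored, incoming)
	fmt.Printf("%q %q %q\n", merged.Name, merged.PodName, merged.LifecycleFailureMessage)
	// Output: "train" "train-pod-1" ""
}
```

Under this reading, no separate method is needed: the other fields keep their existing merge behavior and only the new field is self-clearing.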

Backend, in `backend/src/apiserver/storage/task_store_test.go`: existing CRUD tests cover the new column once it is added to the column list. No new test files are required.

Frontend: existing `RunDetailsV2.test.tsx` continues to pass without changes, which is the regression bar. New behavior is exercised through the test cases that already cover `updateFlowElementsState`.

Contributor

Can you add some information on additional frontend unit test coverage for this feature?

Contributor Author

Sure, updated the Test Plan with specific test cases covering the state override and banner rendering in DynamicFlow and RuntimeNodeDetailsV2.

@alyssacgoins

Contributor

@jeffspahr can we get your thoughts here from the frontend perspective?

@alyssacgoins

Contributor

@droctothorpe @zazulam can we get your thoughts?

@ntny

ntny commented Jun 15, 2026

Contributor

I'm all for this proposal. The only risk I see is hypothetical: if KFP wants to add another backend in addition to Argo in the future, would this KEP block that?
What makes this less concerning to me is that most of the work seems to be on the persistence agent and UI side.

@alyssacgoins

Contributor

I'm all for this proposal. The only risk I see is hypothetical: if KFP wants to add another backend in addition to Argo in the future, would this KEP block that? What makes this less concerning to me is that most of the work seems to be on the persistence agent and UI side.

@ntny If KFP were to add an additional backend in the future, I think the design would just need to propagate pod failure events by querying the K8s API directly. @khushiiagrawal includes this as an alternative in her doc here. This alternative is unnecessary in this case because the Argo message already contains the information, but it's an option for a non-Argo backend.

…ycle failure proposal

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@khushiiagrawal

Contributor Author

@alyssacgoins I have addressed the changes, PTAL!
Thanks.


A KFP pipeline task can fail in two ways. The first is a user-script failure, where the Python code inside the container raises an exception. The second is a pod lifecycle failure, where the Kubernetes pod backing the task never reaches a healthy running state, or is killed by the system. Common examples of the second category are `ImagePullBackOff`, `Unschedulable`, `OOMKilled`, `CrashLoopBackOff`, and `NodeLost`.

KFP handles the first case well today. The second case is currently invisible in the UI. The task node stays in a running state forever, no error is shown, and the user has no way to find out what happened without using `kubectl`. This proposal threads the pod lifecycle failure message that Argo Workflows already records all the way through the API server and into the run details page, so the task node turns red and the failure reason is shown in the side panel, just like a user-script failure is shown today.

@ntny ntny Jun 16, 2026

Contributor

I agree terminal vs transient classification can be deferred, but I’d handle FailedScheduling / Unschedulable carefully in the initial design.

On a well-utilized cluster, a pod may wait for resources and then get scheduled once other workloads finish. Showing that task as red could be misleading because it may recover without user action, and frequent red -> Running transitions would look strange in the UI.
This feels more like a warning.

Also, as far as I understand, the current V2 UI maps node status from MLMD executions rather than tasks, while there is ongoing MLMD-removal work to move it to tasks.

Contributor

@ntny @khushiiagrawal can we add a color (such as yellow) that indicates warning rather than failure for these cases? That way we're still communicating status.
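One possible reading of the warning suggestion, sketched as decision logic. It assumes the override only applies while the reported state is still RUNNING and that scheduling-related messages are the recoverable ones; `DisplayState`, `StateWarning`, the marker list, and `ResolveDisplayState` are hypothetical and not part of the KEP or the current UI code.

```go
package main

import (
	"fmt"
	"strings"
)

// DisplayState is a hypothetical visual state for a graph node.
type DisplayState string

const (
	StateRunning DisplayState = "RUNNING"
	StateWarning DisplayState = "WARNING" // hypothetical new state, for example rendered yellow
	StateFailed  DisplayState = "FAILED"
)

// recoverableMarkers lists assumed message substrings for conditions that
// may clear without user action.
var recoverableMarkers = []string{"Unschedulable", "FailedScheduling"}

// ResolveDisplayState overrides a RUNNING node when a lifecycle message is
// present: recoverable conditions become a warning, everything else a failure.
// Any other reported state is returned unchanged.
func ResolveDisplayState(reported DisplayState, lifecycleMessage string) DisplayState {
	if reported != StateRunning || lifecycleMessage == "" {
		return reported
	}
	for _, m := range recoverableMarkers {
		if strings.Contains(lifecycleMessage, m) {
			return StateWarning
		}
	}
	return StateFailed
}

func main() {
	fmt.Println(ResolveDisplayState(StateRunning, ""))                                  // RUNNING
	fmt.Println(ResolveDisplayState(StateRunning, "Unschedulable: insufficient cpu")) // WARNING
	fmt.Println(ResolveDisplayState(StateRunning, "OOMKilled"))                         // FAILED
}
```

Which conditions count as recoverable is the open design question in this thread; the marker list here only reflects the scheduling cases raised above.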

- A dedicated test pipeline with a deliberately bad container image that verifies the failing task's `error.message` contains the expected pod lifecycle failure string via the v2beta1 `GetRun` API.
- A frontend integration test (in `test/frontend-integration-test/`) that submits the bad-image pipeline, waits for the task node to turn red, and asserts the banner text in the side panel.

These CI additions will be part of the implementation PR rather than this KEP.

Contributor

Suggested change
These CI additions will be part of the implementation PR rather than this KEP.

You can remove this line - the KEP is a feature outline, so any work here will always be a part of the implementation PR


The existing KFP end-to-end test suite in `.github/workflows/e2e-test.yml` runs full pipeline runs against a live cluster and verifies final task states. Once this change lands, one or more of the following will be added to cover the lifecycle failure path in CI:

- A dedicated test pipeline with a deliberately bad container image that verifies the failing task's `error.message` contains the expected pod lifecycle failure string via the v2beta1 `GetRun` API.

Contributor

Can you create a matrix here outlining scenario, the resulting pod status, and expected behavior?
You can use this matrix from my Literal Inputs KEP as reference.


Development

Successfully merging this pull request may close these issues.

[feature] Enhance support and visualization for pod lifecycle failure in Kubeflow Pipelines

4 participants