feat(proposals): add KEP-12843 pod lifecycle failure support and visualization#13517
feat(proposals): add KEP-12843 pod lifecycle failure support and visualization#13517khushiiagrawal wants to merge 5 commits into
Conversation
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @khushiiagrawal. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a KEP document proposing end-to-end surfacing of Argo pod lifecycle failure messages in KFP (persisted in Task, exposed via existing task_details[].error, and rendered in the V2 run details UI).
Changes:
- Introduces a new KEP covering motivation, backend/frontend design, and risks/mitigations for lifecycle failure visibility.
- Documents intended DB schema change (new nullable
LifecycleFailureMessagecolumn) and API/UI behavior. - Provides a test plan and manual verification steps for failure/success/retry scenarios.
…s and schema change clarification Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
|
/lgtm |
|
I really like the idea of using the message from This also works in cases where the pod was never created at all (for example, when resource quotas are exceeded or Kyverno blocks pod creation due to security policies). Good catch on filtering transient messages such as PodInitializing and ContainerCreating. One design question that may be worth adding to the KEP: can Argo node messages be parsed into a more meaningful task lifecycle status in addition to surfacing he raw text message? Most of the useful messages here correspond to standard Kubernetes problems such as ConcernsReporting latency is one concern. Fast reporting is critical for recurring (cron) runs. Unlike regular runs, cron runs are created in MySQL through persistenceagent reporting. There is already at least one path that depends on the run row existing before task execution finishes. After a task completes:
Possible future improvements
Not suggesting this for the current KEP, but it may be worth keeping in mind as a future enhancement. |
…tion and reporting latency details Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
|
@ntny Thanks for the thorough review! I've updated the KEP to address each point:
Let me know if anything needs more detail! |
|
Hii @alyssacgoins @HumairAK @mprahl @james-jwu , Would really appreciate your thoughts and review on this. |
|
/lgtm |
|
|
||
| ### Non-Goals | ||
|
|
||
| 1. Configurable per-class timeouts (for example, fail after one hour of `ImagePullBackOff`). The original issue mentions this. It is a useful follow-up but it is a separate piece of work and is not covered here. |
There was a problem hiding this comment.
Just curious - is there a specific reason why configurable timeouts are out of scope for this feature?
There was a problem hiding this comment.
Yes. Timeouts need per-class timestamp tracking or a sweep mechanism in the persistence agent, which felt like a separate chunk of work from the core visibility fix. I've updated the Non-Goals section with a short rationale explaining this.
There was a problem hiding this comment.
@alyssacgoins let me know what you think.
There was a problem hiding this comment.
That make sense to me - once you've completed/merged this feature, be sure to add a follow-up issue for this
|
|
||
| Frontend: existing `RunDetailsV2.test.tsx` continues to pass without changes, which is the regression bar. New behavior is exercised through the test cases that already cover `updateFlowElementsState`. | ||
|
|
||
| ### Manual Verification (E2E) |
There was a problem hiding this comment.
The manual verification you've included here is quite useful for verifying changes locally. Can you add some info on how you plan to validate/verify this feature (particularly frontend-wise) through integration testing in CI?
There was a problem hiding this comment.
Good point, I've added a CI Validation section to the Test Plan covering a bad image API test and a frontend integration test for the banner. Both will ship with the implementation PR.
|
|
||
| Rollback is straightforward. Reverting the API server image and either dropping the column or leaving it in place unused restores the previous behavior. The column is informational only, so there is no data loss. | ||
|
|
||
| ## Frontend Considerations |
There was a problem hiding this comment.
Can you emphasize here that the frontend will clearly desiplay pod-level failures with this feature?
|
|
||
| On the UI side, the run details page renders the pipeline graph from MLMD execution records. The driver pod writes the MLMD execution row in `RUNNING` state before the user container starts. If the user container then fails to start at all, the MLMD record stays at `RUNNING` forever, and the graph node renders green and spins indefinitely. | ||
|
|
||
| KFP positions itself as a Kubernetes abstraction for data scientists and ML engineers. Many of its users are not Kubernetes operators. When the UI shows a task that never finishes and never fails, with no message, the only path forward is to ask someone else to run `kubectl get pods`. That breaks the abstraction the product is built on. |
There was a problem hiding this comment.
This is a great paragraph. This expresses the exact core of the problem, and what we're trying to solve with this feature. Great work 🙂
|
|
||
| Add `LifecycleFailureMessage` to `taskColumns`, to `scanRows`, and to the `CreateTask` and `CreateOrUpdateTasks` insert paths. | ||
|
|
||
| In `patchTask`, do not preserve the previous value of this column. The fresh filtered value computed from the latest workflow state always wins. This is what makes the field self-clearing if a pod recovers on retry. |
There was a problem hiding this comment.
The overwrite logic you propose here is a departure from the current backend/src/apiserver/storage/task_store.go/patch_task() logic, which fills only empty fields. Your logic makes sense in the context of this feature, but make sure it does not break any existing patch_task() behavior - you may need to create a separate method.
There was a problem hiding this comment.
Thanks for flagging this Alyssa. There's actually no special overwrite logic added. LifecycleFailureMessage is just left out of the preserve-if-empty loop entirely, so the fresh value from the workflow sync always wins and the existing behavior for all other fields stays the same. I've updated the KEP wording to make this clearer.
| Backend, in `backend/src/apiserver/storage/task_store_test.go`: existing CRUD tests cover the new column once it is added to the column list. No new test files are required. | ||
|
|
||
| Frontend: existing `RunDetailsV2.test.tsx` continues to pass without changes, which is the regression bar. New behavior is exercised through the test cases that already cover `updateFlowElementsState`. | ||
|
|
There was a problem hiding this comment.
Can you add some information on additional frontend unit test coverage for this feature?
There was a problem hiding this comment.
Sure, updated the Test Plan with specific test cases covering the state override and banner rendering in DynamicFlow and RuntimeNodeDetailsV2.
|
@jeffspahr can we get your thoughts here from the frontend perspective? |
|
@droctothorpe @zazulam can we get your thoughts? |
|
I'm all for this proposal. The only risk I see is hypothetical: if KFP wants to add another backend in addition to Argo in the future, would this KEP block that? |
@ntny If KFP was to add an additional backend in the future, I think the design would just need to propagate pod failure events by querying the K8s API directly - @khushiiagrawal includes this as an alternative in her doc here. This alternative is unnecessary in this case because the Argo message already contains the information, but it's an option for a non-Argo backend. |
…ep/12843-pod-lifecycle-failure
…ycle failure proposal Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
|
@alyssacgoins I have addressed the changes, PTAL! |
|
|
||
| A KFP pipeline task can fail in two ways. The first is a user-script failure, where the Python code inside the container raises an exception. The second is a pod lifecycle failure, where the Kubernetes pod backing the task never reaches a healthy running state, or is killed by the system. Common examples of the second category are `ImagePullBackOff`, `Unschedulable`, `OOMKilled`, `CrashLoopBackOff`, and `NodeLost`. | ||
|
|
||
| KFP handles the first case well today. The second case is currently invisible in the UI. The task node stays in a running state forever, no error is shown, and the user has no way to find out what happened without using `kubectl`. This proposal threads the pod lifecycle failure message that Argo Workflows already records all the way through the API server and into the run details page, so the task node turns red and the failure reason is shown in the side panel, just like a user-script failure is shown today. |
There was a problem hiding this comment.
I agree terminal vs transient classification can be deferred, but I’d handle FailedScheduling / Unschedulable carefully in the initial design.
On a well-utilized cluster, a pod may wait for resources, then get scheduled once other workloads finish. Showing that task as red could be misleading because it may recover without user action. Frequent red -> Running transitions would look strange in the UI because the task may recover without user action.
This feels more like a warning
Also, as far as I understand, the current V2 UI maps node status from MLMD executions rather than tasks, while there is ongoing MLMD-removal work to move it into tasks
There was a problem hiding this comment.
@ntny @khushiiagrawal can we add a color (such as yellow) that indicates warning rather than failure for these cases? That way we're still communicating status.
| - A dedicated test pipeline with a deliberately bad container image that verifies the failing task's `error.message` contains the expected pod lifecycle failure string via the v2beta1 `GetRun` API. | ||
| - A frontend integration test (in `test/frontend-integration-test/`) that submits the bad-image pipeline, waits for the task node to turn red, and asserts the banner text in the side panel. | ||
|
|
||
| These CI additions will be part of the implementation PR rather than this KEP. |
There was a problem hiding this comment.
| These CI additions will be part of the implementation PR rather than this KEP. |
You can remove this line - the KEP is a feature outline, so any work here will always be a part of the implementation PR
|
|
||
| The existing KFP end-to-end test suite in `.github/workflows/e2e-test.yml` runs full pipeline runs against a live cluster and verifies final task states. Once this change lands, one or more of the following will be added to cover the lifecycle failure path in CI: | ||
|
|
||
| - A dedicated test pipeline with a deliberately bad container image that verifies the failing task's `error.message` contains the expected pod lifecycle failure string via the v2beta1 `GetRun` API. |
There was a problem hiding this comment.
Can you create a matrix here outlining scenario, the resulting pod status, and expected behavior?
You can use this matrix from my Literal Inputs KEP as reference.
Description of your changes:
Adds the proposal for #12843 - surfacing pod lifecycle failures (
ImagePullBackOff,OOMKilled,Unschedulable, etc.) in the KFP run details UI.Currently, when a task pod fails at the Kubernetes level, the task node sits in a green running state forever with no error shown. The Argo node message that describes the failure is dropped during conversion and never reaches the UI.
This KEP proposes:
Taskmodel via a new nullableLifecycleFailureMessagecolumn (additive schema change, no migration required)task_details[].errorfield in the v2beta1GetRunresponse (no new API surface)RUNNINGThe initial implementation draft is open in #13516.
Related: Fixes #12843