Fix/workflow state on conflict 2 by Ayush-kathil · Pull Request #13531 · kubeflow/pipelines

Ayush-kathil · 2026-06-16T05:35:25Z

What this PR does

This is a follow-up to PR #13508 that adds extra state validation checks during RetryRun and TerminateRun.

When the API server encounters a conflict during a retry or terminate operation, it refetches the latest workflow state and tries again. However, in rare cases (like two manual requests racing each other), the first request might successfully move the workflow into a Running or terminal state, while the second request refetches it and blindly applies the operation again.

To fix this:

RetryRun: We now verify with CanRetry() that the refetched workflow is still in a retryable state (e.g., Failed or Error) before trying to update it again.
TerminateRun: We check with IsInFinalState() whether the workflow has already reached a terminal state before patching activeDeadlineSeconds.

This ensures we don't accidentally retry a running workflow or redundantly terminate an already completed one.
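The re-validation described above can be sketched in isolation. This is a minimal model, not the PR's code: the `workflow` type, the phase constants, and the helpers `retryAfterConflict` and `terminateAfterConflict` are hypothetical stand-ins, and `canRetry` / `isInFinalState` only imitate the idea behind `CanRetry()` and `IsInFinalState()`. How the actual change reports a skipped operation is not shown in this description.

```go
package main

import (
	"errors"
	"fmt"
)

// phase is a simplified stand-in for a workflow phase (assumption).
type phase string

const (
	phaseRunning   phase = "Running"
	phaseFailed    phase = "Failed"
	phaseError     phase = "Error"
	phaseSucceeded phase = "Succeeded"
)

// workflow is a hypothetical, minimal model of the refetched workflow.
type workflow struct {
	name  string
	phase phase
}

// canRetry imitates CanRetry(): only failed or errored workflows qualify.
func (w workflow) canRetry() bool {
	return w.phase == phaseFailed || w.phase == phaseError
}

// isInFinalState imitates IsInFinalState(): the workflow has finished.
func (w workflow) isInFinalState() bool {
	return w.phase == phaseFailed || w.phase == phaseError || w.phase == phaseSucceeded
}

var errNotRetryable = errors.New("workflow is no longer in a retryable state")

// retryAfterConflict re-validates the refetched workflow before the retry
// is applied a second time.
func retryAfterConflict(latest workflow, apply func(workflow) error) error {
	if !latest.canRetry() {
		return fmt.Errorf("%w: %s is %s", errNotRetryable, latest.name, latest.phase)
	}
	return apply(latest)
}

// terminateAfterConflict skips the patch when the workflow already finished.
// The boolean reports whether a patch was attempted.
func terminateAfterConflict(latest workflow, patch func(workflow) error) (bool, error) {
	if latest.isInFinalState() {
		return false, nil
	}
	return true, patch(latest)
}

func main() {
	noop := func(workflow) error { return nil }

	// A racing request already moved the workflow to Running.
	err := retryAfterConflict(workflow{name: "run-a", phase: phaseRunning}, noop)
	fmt.Println("retry rejected:", errors.Is(err, errNotRetryable))

	// A racing request already let the workflow finish.
	attempted, _ := terminateAfterConflict(workflow{name: "run-b", phase: phaseSucceeded}, noop)
	fmt.Println("terminate patch attempted:", attempted)
}
```

In this sketch the second of two racing requests sees the state left behind by the first and stops, rather than re-applying the operation.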

Related Issues

Follow-up to #13508
Fixes #13507

google-oss-prow · 2026-06-16T05:35:31Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chensun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

backend/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow · 2026-06-16T05:35:36Z

Hi @Ayush-kathil. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Improves workflow termination and run retry behavior by avoiding unnecessary terminate patches and adding retry logic around Kubernetes workflow update/create conflicts.

Changes:

Skip terminate patching when a workflow is already in a final state.
Add bounded retry loop for workflow update/create during RetryRun, handling conflict and already-exists errors.
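A bounded retry loop of the kind summarized above might look like the following sketch. It uses only the standard library; `errConflict` and `retryOnConflict` are hypothetical names, and the conflict error is a plain sentinel rather than a Kubernetes API error. The reviewed hunks sleep for a fixed 100 ms, while a later commit message in this PR mentions exponential backoff and a select on `ctx.Done()`; the sketch follows the latter idea and is not taken from the PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errConflict is a hypothetical stand-in for a Kubernetes conflict error.
var errConflict = errors.New("conflict")

// retryOnConflict calls op up to maxAttempts times. It retries only on
// errConflict, doubles the wait between attempts, and returns early if
// ctx is cancelled while waiting.
func retryOnConflict(ctx context.Context, maxAttempts int, initial time.Duration, op func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	delay := initial
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		lastErr = op()
		if lastErr == nil || !errors.Is(lastErr, errConflict) {
			return lastErr // success, or an error that retrying will not fix
		}
		if attempt == maxAttempts-1 {
			break // no point in waiting after the last attempt
		}
		timer := time.NewTimer(delay)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}
		delay *= 2
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

// demo simulates an operation that conflicts twice and then succeeds.
func demo() (int, error) {
	calls := 0
	err := retryOnConflict(context.Background(), 5, time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errConflict
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := demo()
	fmt.Println("calls:", calls, "err:", err)
}
```

Waiting on a timer inside a select, instead of calling `time.Sleep`, lets a cancelled request stop promptly instead of finishing the sleep first.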

+			if apierrors.IsConflict(errors.Unwrap(updateError)) || apierrors.IsConflict(updateError) {
+				finalErr = updateError
+				time.Sleep(100 * time.Millisecond)
+				continue


+				if apierrors.IsAlreadyExists(errors.Unwrap(createError)) || apierrors.IsAlreadyExists(createError) {
+					finalErr = createError
+					time.Sleep(100 * time.Millisecond)
+					continue
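The two hunks above test both the error and its `errors.Unwrap` result. As a general Go point, `errors.Unwrap` peels off only one layer, whereas `errors.As` walks the whole chain. The sketch below illustrates that difference with a hypothetical `conflictError` type standing in for a Kubernetes API conflict; it makes no claim about how the errors in this PR are actually wrapped.

```go
package main

import (
	"errors"
	"fmt"
)

// conflictError is a hypothetical stand-in for a Kubernetes API conflict error.
type conflictError struct{ resource string }

func (e *conflictError) Error() string { return "conflict updating " + e.resource }

// isConflict reports whether any error in err's chain is a *conflictError.
func isConflict(err error) bool {
	var ce *conflictError
	return errors.As(err, &ce)
}

// checks wraps a conflict twice, then compares a one-level unwrap with a
// walk over the whole chain.
func checks() (oneLevel, wholeChain bool) {
	base := &conflictError{resource: "workflow"}
	wrapped := fmt.Errorf("api server: %w", fmt.Errorf("update failed: %w", base))

	// One Unwrap call reaches only the middle wrapper, not the conflict.
	_, oneLevel = errors.Unwrap(wrapped).(*conflictError)
	// errors.As keeps unwrapping until it finds a match or runs out.
	wholeChain = isConflict(wrapped)
	return oneLevel, wholeChain
}

func main() {
	oneLevel, wholeChain := checks()
	fmt.Println("one-level unwrap finds conflict:", oneLevel)
	fmt.Println("errors.As finds conflict:", wholeChain)
}
```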


 	// First try to update workflow
 	// If fail to get the workflow, return error.


+			return util.NewInternalServerError(ctx.Err(), "Failed to retry run %s due to context cancellation", runId)
+		}
+
+		latestWorkflow, updateError := r.getWorkflowClient(namespace).Get(ctx, newExecSpec.ExecutionName(), v1.GetOptions{})


 		}
-		newExecSpec = newCreatedWorkflow
+		if updateError != nil {
+			if apierrors.IsConflict(errors.Unwrap(updateError)) || apierrors.IsConflict(updateError) {


+			newExecSpec.SetVersion("")
+			newCreatedWorkflow, createError := r.getWorkflowClient(namespace).Create(ctx, newExecSpec, v1.CreateOptions{})
+			if createError != nil {
+				if apierrors.IsAlreadyExists(errors.Unwrap(createError)) || apierrors.IsAlreadyExists(createError) {


Ayush-kathil added 2 commits June 9, 2026 15:55

fix(backend): fix RetryRun concurrency conflict handling. Fixes kubef…

cfda12c

…low#13507 Signed-off-by: Ayush Gupta <kathilshiva@gmail.com>

fix(backend): verify workflow state on conflict refetch for retry and…

883565b

… terminate Signed-off-by: Ayush Gupta <kathilshiva@gmail.com>

Copilot AI review requested due to automatic review settings June 16, 2026 05:35

google-oss-prow Bot requested review from HumairAK and hbelmiro June 16, 2026 05:35

google-oss-prow Bot added size/M needs-ok-to-test labels Jun 16, 2026

Copilot AI reviewed Jun 16, 2026

Ayush-kathil added 2 commits June 16, 2026 11:10

fix(backend): use exponential backoff and select on ctx.Done() in Ret…

bea18a2

…ryRun Signed-off-by: Ayush Gupta <kathilshiva@gmail.com>

Merge branch 'master' into fix/workflow-state-on-conflict-2

42bd666

Ayush-kathil wants to merge 4 commits into kubeflow:master from Ayush-kathil:fix/workflow-state-on-conflict-2
