Fix/workflow state on conflict 2 #13531

Open
Ayush-kathil wants to merge 4 commits into kubeflow:master from Ayush-kathil:fix/workflow-state-on-conflict-2

Conversation

@Ayush-kathil

What this PR does

This is a follow-up to PR #13508 that adds extra state validation checks during RetryRun and TerminateRun.

When the API server encounters a conflict during a retry or terminate operation, it refetches the latest workflow state and tries again. In rare cases, such as two manual requests racing each other, the first request can successfully move the workflow into a Running or terminal state, after which the second request refetches the workflow and applies the same operation again without checking its new state.

To fix this:

  • RetryRun: We now verify with .CanRetry() that the refetched workflow is still in a retryable state (e.g., Failed or Error) before trying to update it again.
  • TerminateRun: We check with IsInFinalState() whether the refetched workflow has already reached a final state, and skip patching activeDeadlineSeconds if it has.

This ensures we don't accidentally retry a running workflow or redundantly terminate one that has already completed.
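
The two guards above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: the `workflow` and `phase` types, the set of phases treated as final, and the helpers `retryAfterRefetch` and `terminateAfterRefetch` are all hypothetical stand-ins for the real execution spec and its `.CanRetry()` / `IsInFinalState()` methods.

```go
package main

import (
	"errors"
	"fmt"
)

// phase is a hypothetical stand-in for the workflow phase.
type phase string

const (
	phaseRunning   phase = "Running"
	phaseFailed    phase = "Failed"
	phaseError     phase = "Error"
	phaseSucceeded phase = "Succeeded"
)

// workflow is a hypothetical, minimal model of a refetched workflow.
type workflow struct {
	name  string
	phase phase
}

// canRetry models the idea behind .CanRetry(): only failed or errored
// workflows qualify (assumption made for this sketch).
func (w workflow) canRetry() bool {
	return w.phase == phaseFailed || w.phase == phaseError
}

// isInFinalState models the idea behind IsInFinalState() (assumed phase set).
func (w workflow) isInFinalState() bool {
	return w.phase == phaseSucceeded || w.phase == phaseFailed || w.phase == phaseError
}

var errNotRetryable = errors.New("workflow is no longer in a retryable state")

// retryAfterRefetch runs after a conflict has forced a refetch. It
// re-validates the state before applying the retry a second time.
func retryAfterRefetch(latest workflow, apply func(workflow) error) error {
	if !latest.canRetry() {
		return errNotRetryable
	}
	return apply(latest)
}

// terminateAfterRefetch skips the patch when the workflow is already final.
// The boolean reports whether the patch was actually applied.
func terminateAfterRefetch(latest workflow, patch func(workflow) error) (bool, error) {
	if latest.isInFinalState() {
		return false, nil
	}
	if err := patch(latest); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	noop := func(workflow) error { return nil }

	// The racing request already moved the workflow to Running.
	err := retryAfterRefetch(workflow{name: "run-1", phase: phaseRunning}, noop)
	fmt.Println("retry on running workflow:", err)

	// The workflow finished before the terminate request was retried.
	patched, _ := terminateAfterRefetch(workflow{name: "run-2", phase: phaseSucceeded}, noop)
	fmt.Println("terminate patch applied:", patched)
}
```

In this sketch a non-retryable workflow surfaces as an error, while an already-final workflow makes terminate a silent no-op; whether the real handlers report these cases the same way is not stated in the description.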

Related Issues

Follow-up to #13508
Fixes #13507

…low#13507

Signed-off-by: Ayush Gupta <kathilshiva@gmail.com>
… terminate

Signed-off-by: Ayush Gupta <kathilshiva@gmail.com>
Copilot AI review requested due to automatic review settings June 16, 2026 05:35
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chensun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow

Hi @Ayush-kathil. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Improves workflow termination and run retry behavior by avoiding unnecessary terminate patches and adding retry logic around Kubernetes workflow update/create conflicts.

Changes:

  • Skip terminate patching when a workflow is already in a final state.
  • Add bounded retry loop for workflow update/create during RetryRun, handling conflict and already-exists errors.
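
A bounded retry loop of the kind summarized above might look like the sketch below. It uses only the standard library: `errConflict` and `errAlreadyExists` are hypothetical sentinels standing in for the Kubernetes conflict and already-exists errors, and `retryBounded`, `isTransient`, and `demo` are illustrative names, not functions from this PR. Unlike a plain `time.Sleep`, the wait here also watches the context, so cancellation ends the loop early.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinels standing in for 409 Conflict and AlreadyExists.
var (
	errConflict      = errors.New("conflict")
	errAlreadyExists = errors.New("already exists")
)

// isTransient reports whether an error is worth another attempt.
func isTransient(err error) bool {
	return errors.Is(err, errConflict) || errors.Is(err, errAlreadyExists)
}

// retryBounded calls op at most maxAttempts times. It waits for backoff
// between attempts, returns immediately on non-transient errors, and stops
// early if ctx is cancelled while waiting.
func retryBounded(ctx context.Context, maxAttempts int, backoff time.Duration, op func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if attempt > 0 {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(backoff):
			}
		}
		lastErr = op()
		if lastErr == nil {
			return nil
		}
		if !isTransient(lastErr) {
			return lastErr
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

// demo simulates an operation that conflicts `failures` times before
// succeeding, and returns how many times it was called.
func demo(failures, maxAttempts int) (int, error) {
	calls := 0
	err := retryBounded(context.Background(), maxAttempts, time.Millisecond, func() error {
		calls++
		if calls <= failures {
			return errConflict
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := demo(2, 5)
	fmt.Println("calls:", calls, "err:", err)

	calls, err = demo(10, 3)
	fmt.Println("calls:", calls, "err:", err)
}
```

The final error wraps the last transient error with `%w`, so callers can still inspect the underlying cause after the attempts are exhausted.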

Comment on lines +1153 to +1156

    if apierrors.IsConflict(errors.Unwrap(updateError)) || apierrors.IsConflict(updateError) {
    finalErr = updateError
    time.Sleep(100 * time.Millisecond)
    continue

Comment on lines +1162 to +1165

    if apierrors.IsAlreadyExists(errors.Unwrap(createError)) || apierrors.IsAlreadyExists(createError) {
    finalErr = createError
    time.Sleep(100 * time.Millisecond)
    continue

Comment on lines 1134 to 1135

    // First try to update workflow
    // If fail to get the workflow, return error.
    return util.NewInternalServerError(ctx.Err(), "Failed to retry run %s due to context cancellation", runId)
    }

    latestWorkflow, updateError := r.getWorkflowClient(namespace).Get(ctx, newExecSpec.ExecutionName(), v1.GetOptions{})
    }
    newExecSpec = newCreatedWorkflow
    if updateError != nil {
    if apierrors.IsConflict(errors.Unwrap(updateError)) || apierrors.IsConflict(updateError) {
    newExecSpec.SetVersion("")
    newCreatedWorkflow, createError := r.getWorkflowClient(namespace).Create(ctx, newExecSpec, v1.CreateOptions{})
    if createError != nil {
    if apierrors.IsAlreadyExists(errors.Unwrap(createError)) || apierrors.IsAlreadyExists(createError) {

Development

Successfully merging this pull request may close these issues.

[backend] RetryRun fails with concurrent Argo workflow modification (409 Conflict + already exists)

2 participants