[backend] RetryRun fails with concurrent Argo workflow modification (409 Conflict + already exists) #13507

@kaikaila

Description

When calling RetryRun on a recently-failed pipeline run, the API server returns an InternalServerError because the RetryRun implementation does not handle Kubernetes optimistic concurrency conflicts (409 Conflict). The failure is consistently reproducible in E2E tests that immediately retry a failed run.

Error

Failed to retry a run: InternalServerError: Failed to retry run <run-id> due to error updating and creating a workflow.
Update error: Operation cannot be fulfilled on workflows.argoproj.io "<wf-name>": the object has been modified;
please apply your changes to the latest version and try again:
workflows.argoproj.io "<wf-name>" already exists, the server was not able to generate a unique name for the object (code: 13)

Root Cause

ResourceManager.RetryRun() in backend/src/apiserver/resource/resource_manager.go follows a GET → modify → UPDATE pattern on the Argo Workflow object. When the Argo workflow-controller concurrently updates the same object (e.g. finalizing status fields right after a pipeline failure), the resourceVersion that RetryRun() read is stale by the time it writes, and the UPDATE is rejected with a 409 Conflict.

The fallback CREATE also fails because the original workflow object still exists in Kubernetes under the same name.

Unlike ReconcileSwfCrs() in the same file, which explicitly handles apierrors.IsConflict(err) with a continue/retry loop, RetryRun() has no such handling.
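
To make the failure sequence concrete, here is a minimal in-memory sketch of it. It does not use the real KFP or Kubernetes clients: `store`, `workflow`, `naiveRetry`, `errConflict`, and `errAlreadyExists` are hypothetical stand-ins that only mimic a version check on update and a name check on create.

```go
package main

import (
	"errors"
	"fmt"
)

// All names below are hypothetical stand-ins, not KFP or Kubernetes APIs.
var (
	errConflict      = errors.New("conflict: the object has been modified")
	errAlreadyExists = errors.New("already exists")
)

// workflow is a toy object carrying only the fields needed to show the race.
type workflow struct {
	Name            string
	ResourceVersion int
	Phase           string
}

// store mimics an API server that enforces optimistic concurrency.
type store struct {
	objects map[string]workflow
}

func (s *store) Get(name string) (workflow, bool) {
	wf, ok := s.objects[name]
	return wf, ok
}

// Update succeeds only if the caller's ResourceVersion matches the stored one.
func (s *store) Update(wf workflow) error {
	cur, ok := s.objects[wf.Name]
	if !ok {
		return errors.New("not found")
	}
	if cur.ResourceVersion != wf.ResourceVersion {
		return errConflict
	}
	wf.ResourceVersion++
	s.objects[wf.Name] = wf
	return nil
}

// Create fails when an object with the same name is already stored.
func (s *store) Create(wf workflow) error {
	if _, ok := s.objects[wf.Name]; ok {
		return errAlreadyExists
	}
	wf.ResourceVersion = 1
	s.objects[wf.Name] = wf
	return nil
}

// naiveRetry reproduces the failing sequence: GET, a concurrent write by
// another actor, UPDATE with a stale version, then a fallback CREATE.
func naiveRetry(s *store, name string, concurrentWrite func()) (updateErr, createErr error) {
	wf, _ := s.Get(name)
	concurrentWrite() // e.g. the controller finalizing status fields
	wf.Phase = "Running"
	updateErr = s.Update(wf)
	if updateErr != nil {
		createErr = s.Create(wf)
	}
	return updateErr, createErr
}

func main() {
	s := &store{objects: map[string]workflow{
		"wf": {Name: "wf", ResourceVersion: 1, Phase: "Failed"},
	}}
	updateErr, createErr := naiveRetry(s, "wf", func() {
		cur, _ := s.Get("wf")
		_ = s.Update(cur) // bumps ResourceVersion behind the caller's back
	})
	fmt.Println("update:", updateErr)
	fmt.Println("create:", createErr)
}
```

In this toy model both calls fail for the same reasons as in the error above: the update carries a stale version, and the create collides with the object that is still stored under the same name.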

Steps to Reproduce

Reliably triggered by the E2E test:

MLflow Integration > Failed pipeline + RetryRun with MLflow >
[It] Should reopen MLflow runs on retry and then reflect the retried status
[MLflow, FullRegression, MLflowFailure]
backend/test/end2end/mlflow_e2e_test.go:478

The test uses fail_v2.yaml (a pipeline designed to fail), waits for FAILED state, then immediately calls RetryRun. At that point the Argo workflow-controller is still processing the workflow, creating the race window.

Observed in

Suggested Fix

Add apierrors.IsConflict handling in RetryRun(), following the existing pattern in ReconcileSwfCrs():

// Current (no conflict handling)
err = r.getWorkflowClient(namespace).Update(ctx, workflow)
if err != nil {
    return util.NewInternalServerError(err, "Failed to retry run ...")
}

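// Suggested: retry on 409 Conflict
for {
    err = r.getWorkflowClient(namespace).Update(ctx, workflow)
    if apierrors.IsConflict(err) {
        // re-fetch the latest version so the next Update carries a fresh resourceVersion
        workflow, err = r.getWorkflowClient(namespace).Get(ctx, workflow.Name, ...)
        if err != nil { ... }
        // reapply retry mutations
        continue
    }
    break
}
// A non-conflict error from Update is still fatal
if err != nil {
    return util.NewInternalServerError(err, "Failed to retry run ...")
}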
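
The loop above can be sketched as a self-contained program with a cap on the number of attempts, so that a persistent conflict cannot spin forever. This is an illustration under stated assumptions, not the KFP implementation: `store`, `workflow`, `updateWithRetry`, `beforeUpdate`, and `maxAttempts` are hypothetical names, and the store is an in-memory stand-in for the API server.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for illustration; not KFP or Kubernetes APIs.
var errConflict = errors.New("conflict: the object has been modified")

type workflow struct {
	Name            string
	ResourceVersion int
	Phase           string
}

// store mimics optimistic concurrency; beforeUpdate lets the example inject
// a concurrent writer that runs just before each Update is evaluated.
type store struct {
	objects      map[string]workflow
	beforeUpdate func(s *store)
}

func (s *store) Get(name string) (workflow, error) {
	wf, ok := s.objects[name]
	if !ok {
		return workflow{}, errors.New("not found")
	}
	return wf, nil
}

func (s *store) Update(wf workflow) error {
	if s.beforeUpdate != nil {
		s.beforeUpdate(s)
	}
	cur, ok := s.objects[wf.Name]
	if !ok {
		return errors.New("not found")
	}
	if cur.ResourceVersion != wf.ResourceVersion {
		return errConflict
	}
	wf.ResourceVersion++
	s.objects[wf.Name] = wf
	return nil
}

// updateWithRetry re-fetches and reapplies mutate on every attempt, so each
// Update is sent with the latest ResourceVersion. It gives up after
// maxAttempts conflicts instead of looping forever.
func updateWithRetry(s *store, name string, maxAttempts int, mutate func(*workflow)) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		wf, err := s.Get(name)
		if err != nil {
			return err
		}
		mutate(&wf)
		err = s.Update(wf)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err // only conflicts are worth retrying
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, errConflict)
}

func main() {
	interference := 2 // the simulated controller wins the race twice
	s := &store{objects: map[string]workflow{
		"wf": {Name: "wf", ResourceVersion: 1, Phase: "Failed"},
	}}
	s.beforeUpdate = func(s *store) {
		if interference > 0 {
			interference--
			cur := s.objects["wf"]
			cur.ResourceVersion++
			s.objects["wf"] = cur
		}
	}
	err := updateWithRetry(s, "wf", 5, func(wf *workflow) { wf.Phase = "Running" })
	fmt.Println(err, s.objects["wf"].Phase)
}
```

Against a real cluster, client-go's `retry.RetryOnConflict` helper (package `k8s.io/client-go/util/retry`) provides the same re-fetch-and-retry shape with backoff. Whether it fits RetryRun() depends on how the retry mutations are computed, which this sketch does not model.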

Alternatively, consider a Delete + Create approach for stronger idempotency guarantees.
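
A rough sketch of the Delete + Create idea, again against an in-memory stand-in rather than the real clients. `store`, `workflow`, `replace`, `errNotFound`, and `errAlreadyExists` are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for illustration; not KFP or Kubernetes APIs.
var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

type workflow struct {
	Name  string
	Phase string
}

type store struct{ objects map[string]workflow }

func (s *store) Delete(name string) error {
	if _, ok := s.objects[name]; !ok {
		return errNotFound
	}
	delete(s.objects, name)
	return nil
}

func (s *store) Create(wf workflow) error {
	if _, ok := s.objects[wf.Name]; ok {
		return errAlreadyExists
	}
	s.objects[wf.Name] = wf
	return nil
}

// replace deletes any existing object of the same name and then creates the
// new one. A missing object is not an error, so repeating the call is safe.
func replace(s *store, wf workflow) error {
	if err := s.Delete(wf.Name); err != nil && !errors.Is(err, errNotFound) {
		return fmt.Errorf("delete %q: %w", wf.Name, err)
	}
	if err := s.Create(wf); err != nil {
		return fmt.Errorf("create %q: %w", wf.Name, err)
	}
	return nil
}

func main() {
	s := &store{objects: map[string]workflow{"wf": {Name: "wf", Phase: "Failed"}}}
	fmt.Println(replace(s, workflow{Name: "wf", Phase: "Running"}))
	fmt.Println(replace(s, workflow{Name: "wf", Phase: "Running"})) // repeating the call also succeeds
}
```

One caveat the toy store hides: in Kubernetes, deletion can complete asynchronously (for example when the object carries finalizers), so a Create issued right after Delete may still see "already exists" and would need to wait or retry. Deleting the object also discards whatever state it held, which may or may not be acceptable for RetryRun().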

Labels: area/backend, kind/bug
