Description
Calling RetryRun on a recently failed pipeline run makes the API server return an InternalServerError, because the RetryRun implementation does not handle Kubernetes optimistic concurrency conflicts (409 Conflict). The failure is consistently reproducible in E2E tests that retry a failed run immediately.
Error
Failed to retry a run: InternalServerError: Failed to retry run <run-id> due to error updating and creating a workflow.
Update error: Operation cannot be fulfilled on workflows.argoproj.io "<wf-name>": the object has been modified;
please apply your changes to the latest version and try again:
workflows.argoproj.io "<wf-name>" already exists, the server was not able to generate a unique name for the object (code: 13)
Root Cause
ResourceManager.RetryRun() in backend/src/apiserver/resource/resource_manager.go follows a GET → modify → UPDATE pattern on the Argo Workflow object. When the Argo workflow-controller concurrently updates the same object (e.g. finalizing status fields right after a pipeline failure), the resourceVersion held by RetryRun() becomes stale and the UPDATE returns a 409 Conflict.
The fallback CREATE also fails because the original workflow object still exists in Kubernetes under the same name.
Unlike ReconcileSwfCrs() in the same file, which explicitly handles apierrors.IsConflict(err) with a continue/retry loop, RetryRun() has no such handling.
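The race described above can be reproduced in isolation. The following is a minimal sketch, not the KFP or Kubernetes code: `store`, `object`, and `errConflict` are hypothetical stand-ins for the API server, the Argo Workflow, and the 409 Conflict error, and the integer version only imitates how resourceVersion is compared on UPDATE.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is a hypothetical stand-in for the 409 Conflict error
// that apierrors.IsConflict would detect.
var errConflict = errors.New("conflict: the object has been modified")

// object is a hypothetical stand-in for an Argo Workflow.
type object struct {
	ResourceVersion int
	Phase           string
}

// store is a hypothetical in-memory stand-in for the Kubernetes API server.
type store struct {
	current object
}

// Get returns a copy of the stored object, including its version.
func (s *store) Get() object { return s.current }

// Update succeeds only if the caller's copy carries the current version.
func (s *store) Update(o object) error {
	if o.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.current = o
	return nil
}

func main() {
	s := &store{current: object{ResourceVersion: 1, Phase: "Failed"}}

	// RetryRun reads the workflow (GET).
	stale := s.Get()

	// Meanwhile the controller finalizes status fields and writes first.
	fresh := s.Get()
	_ = s.Update(fresh)

	// RetryRun's UPDATE now carries an outdated version and is rejected.
	stale.Phase = "Running"
	err := s.Update(stale)
	fmt.Println(errors.Is(err, errConflict)) // prints "true"
}
```

The rejected write leaves the stored object untouched, which is why the caller has to re-read and reapply its changes instead of resending the same payload.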
Steps to Reproduce
Reliably triggered by the E2E test:
MLflow Integration > Failed pipeline + RetryRun with MLflow >
[It] Should reopen MLflow runs on retry and then reflect the retried status
[MLflow, FullRegression, MLflowFailure]
backend/test/end2end/mlflow_e2e_test.go:478
The test uses fail_v2.yaml (a pipeline designed to fail), waits for FAILED state, then immediately calls RetryRun. At that point the Argo workflow-controller is still processing the workflow, creating the race window.
Observed in
End to End Critical Scenario MLflow Tests - K8s v1.34.0 cacheEnabled=false artifactStorage=s3
Suggested Fix
Add apierrors.IsConflict handling in RetryRun(), following the existing pattern in ReconcileSwfCrs():
// Current (no conflict handling)
err = r.getWorkflowClient(namespace).Update(ctx, workflow)
if err != nil {
	return util.NewInternalServerError(err, "Failed to retry run ...")
}

// Suggested: retry on 409 Conflict
for {
	err = r.getWorkflowClient(namespace).Update(ctx, workflow)
	if apierrors.IsConflict(err) {
		// Re-fetch the latest version of the workflow.
		workflow, err = r.getWorkflowClient(namespace).Get(ctx, workflow.Name, ...)
		if err != nil { ... }
		// Reapply the retry mutations to the fresh object, then try again.
		continue
	}
	break
}
// Any remaining (non-conflict) error is reported as before.
if err != nil {
	return util.NewInternalServerError(err, "Failed to retry run ...")
}
Alternatively, consider a Delete + Create approach for stronger idempotency guarantees.
Labels: area/backend, kind/bug