Skip to content

fix(backend): avoid bucket refresh on metadata retry. Fixes #13501#13503

Open
MikeTomlin19 wants to merge 2 commits into
kubeflow:masterfrom
MikeTomlin19:fix/issue-13501-upload-retry
Open

fix(backend): avoid bucket refresh on metadata retry. Fixes #13501#13503
MikeTomlin19 wants to merge 2 commits into
kubeflow:masterfrom
MikeTomlin19:fix/issue-13501-upload-retry

Conversation

@MikeTomlin19

Copy link
Copy Markdown

Summary

  • distinguish object-store upload failures from MLMD RecordArtifact failures during output artifact retry
  • avoid refreshing/replacing the bucket handle when only metadata registration failed
  • keep refreshing the bucket for real upload failures and only swap the handle after refresh succeeds

Verification

  • go test -v -run 'Test_uploadOutputArtifactsWithRetry_refreshesBucketAfterUploadFailure|Test_executeV2_publishLogs/retry_required_-_component_success|Test_executeV2_publishLogs/retry_required_-_component_failure' ./backend/src/v2/component
  • go test ./backend/src/v2/component ./backend/src/v2/objectstore ./backend/src/v2/metadata
  • autoreview --mode local --engine codex --model gpt-5.3-codex-spark returned clean

Fixes #13501

Signed-off-by: MikeTomlin19 <17487859+MikeTomlin19@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 6, 2026 23:28
@google-oss-prow

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mprahl for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow

Copy link
Copy Markdown

Hi @MikeTomlin19. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR refines artifact upload retry behavior by refreshing the object-store bucket only for upload-related failures (not MLMD/metadata failures), and adds test coverage for the new behavior.

Changes:

  • Add a typed upload error (outputArtifactUploadError) to distinguish upload failures from metadata failures.
  • Refresh bucket credentials only when the failure is an upload error; allow injecting a custom bucket opener for testing.
  • Extend tests to assert bucket refresh behavior on upload vs. metadata retry paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
backend/src/v2/component/launcher_v2.go Adds upload error typing and conditional bucket refresh logic with injectable opener.
backend/src/v2/component/launcher_v2_test.go Updates OpenBucketConfig construction and adds tests verifying refresh/no-refresh behavior.

Comment on lines +891 to +902
if errors.As(err, &uploadErr) {
glog.Info("Refreshing credentials before retrying artifacts upload.")
openBucket := objectstore.OpenBucket
if openBucketConfig.open != nil {
openBucket = openBucketConfig.open
}
refreshedBucket, err := openBucket(
openBucketConfig.ctx,
openBucketConfig.k8sClient,
openBucketConfig.namespace,
openBucketConfig.config,
)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled in cfa7726 by returning an explicit wrapped error when an upload failure needs a credential refresh but no OpenBucketConfig is available, instead of dereferencing nil. I also added a regression test for that path.

Comment on lines +897 to +908
refreshedBucket, err := openBucket(
openBucketConfig.ctx,
openBucketConfig.k8sClient,
openBucketConfig.namespace,
openBucketConfig.config,
)
if err != nil {
glog.Infof("Failed to refresh credentials: %v", err)
finalErr = err
continue
}
opts.bucket = refreshedBucket

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left bucket closing unchanged here because uploadOutputArtifactsWithRetry receives the bucket through opts and does not own the caller-provided bucket lifecycle. Closing the old bucket inside this helper would be a behavior change for the caller. The retry now swaps to a refreshed bucket only after open succeeds, preserving the existing ownership semantics.

Signed-off-by: MikeTomlin19 <17487859+MikeTomlin19@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[backend] bug(backend): uploadOutputArtifactsWithRetry discards uploaded blobs when retrying after MLMD RecordArtifact failure

2 participants