
[backend] bug(backend): uploadOutputArtifactsWithRetry discards uploaded blobs when retrying after MLMD RecordArtifact failure #13501

@kaikaila


Describe the bug

When uploadOutputArtifactsWithRetry retries after a RecordArtifact (MLMD) failure, it calls objectstore.OpenBucket to refresh credentials and replaces opts.bucket with the newly opened handle. With in-memory or short-lived bucket drivers, blobs that were successfully uploaded through the original handle during the first attempt are not visible through the new one, so any subsequent read of those blobs (e.g. executor-logs-0) returns "blob not found".

The bug conflates two independent failure modes:

Object store credential expiry → requires a new bucket handle
MLMD registration failure → only requires retrying RecordArtifact

The current retry loop unconditionally refreshes the bucket on every retry, even when the upload succeeded and only the metadata write failed.

Steps to reproduce

Run the failing unit test:

go test -v -run Test_executeV2_publishLogs/retry_required_-_component_success ./backend/src/v2/component/

Expected: test passes, executor-logs-0 is readable from the bucket.

Actual:

Error: blob (key "executor-logs-0") (code=NotFound): blob not found

Root cause

In uploadOutputArtifactsWithRetry (launcher_v2.go), the retry loop replaces opts.bucket with a freshly opened bucket before re-calling uploadOutputArtifacts. For in-memory or short-lived bucket drivers, this creates a new empty bucket, silently discarding anything written in the previous attempt.

The blob upload and MLMD RecordArtifact are both re-executed from scratch on each retry, even though the upload may have already succeeded.

Proposed fix

Separate the retry concerns:

Only refresh the bucket handle when the upload step fails (i.e. an objectstore.UploadBlob error, not a RecordArtifact error).
Retry RecordArtifact independently without re-uploading the blob.

Environment

Component: backend/src/v2/component/launcher_v2.go
Function: uploadOutputArtifactsWithRetry
Relevant test: Test_executeV2_publishLogs/retry_required_-_component_success in launcher_v2_test.go


Impacted by this bug? Give it a 👍.
