fix(packages/container): data race when uploading container blobs concurrently#36524
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
Yes it happens to me. I build three packages at one time and two of them share the first packages base layer and this happens. |
1f09d6f to
5756646
Compare
What is the panic? Only error log means nothing. Is it really caused by such "race condition in BlobUploader"? BlobUploader is designed to be only used in one request, where is the "race condition"? |
Is it true? Every |
You're right that each NewBlobUploader creates a new instance. The race isn't between requests it's within a single request. In PutBlobsUpload (container.go):
The comment at lines 451-453 explains: "Some SDK (e.g.: minio) will close the Reader if it is also a Closer after uploading." So within ONE request, Close() can be called by:
Without mutex protection, if the SDK closes the file in a goroutine or callback while the explicit Close() runs, we get a race on u.file. The mutex makes Close() idempotent by checking if u.file == nil before closing. I don't have a panic stack trace because the issue manifests as 400 Bad Request from the client side Gitea doesn't panic, it returns an error. However, the fix eliminated the intermittent 400 errors during parallel container builds. This fix works for me, without it, there is issue every time i build. If not good enough, I can rollback, trigger panics and improve the details, but this does fix my problem. |
Most (or all) io.ReadClosers are safe to be closed twice. The same to
Why such goroutine can exist? The MinIO SDK uploading should have stopped after saveAsPackageBlob Since it isn't the real cause, I don't think it really fixes. |
Yes, I think we need to figure out the root cause. We need a fix which is theoretically right and does fix a real cause. |
When multiple parallel requests push the same blob digest (common in container registry parallel builds), a race condition can cause a nil pointer dereference panic: 1. Request A and B both call GetOrInsertBlob for the same digest 2. Both SELECT queries find no existing blob (parallel execution) 3. Request A's INSERT succeeds 4. Request B's INSERT fails with duplicate key violation 5. GetOrInsertBlob returns (nil, false, error) 6. The error cleanup code in saveAsPackageBlob (blob.go:71) tries to access pb.HashSHA256 to delete from content store, but pb is nil The fix handles this race by retrying the SELECT after an INSERT failure, returning the existing blob instead of nil. This matches how similar functions like TryInsertPackage and TryInsertFile handle the same race. Observed error in production logs: PUT /v2/.../blobs/uploads/...?digest=sha256:... panic @ container/container.go:400(container.PutBlobsUpload) err=runtime error: invalid memory address or nil pointer dereference The previous mutex-based approach in BlobUploader was incorrect because each request creates its own BlobUploader instance - the race is between different requests, not goroutines within a single request.
5756646 to
59bf7dd
Compare
|
Awesome, it looks right now. 👍 |
|
I will make some improvements to the tests. By the way, no need to rebase or force push then (https://github.com/go-gitea/gitea/blob/main/CONTRIBUTING.md#maintaining-open-prs) |
I am sorry my agent ignored instructions. Thank you for helping to fix this. It's a big blocker for me. |
Found the Real Root CauseThanks for pushing back on this. You were right that the BlobUploader mutex wasn't the fix. I reproduced the issue by reverting to upstream Gitea 1.25.2 and running parallel container builds. Here's what I found in the logs: Notice that the same digest ( The Actual BugThe problem is in if _, err = e.Insert(pb); err != nil {
return nil, false, err // Returns nil for pb!
}When multiple parallel requests try to insert the same blob, they all do a SELECT first and find nothing. Then one INSERT succeeds while the others fail with a duplicate key violation. The function returns The caller in if err != nil {
if !exists {
contentStore.Delete(packages_module.BlobHash256Key(pb.HashSHA256)) // pb is nil here!
}
}This causes the nil pointer panic. The FixI've updated the PR to handle this race properly by retrying the SELECT after an INSERT fails: if _, err = e.Insert(pb); err != nil {
// Another request inserted the same blob between our SELECT and INSERT.
// Retry the SELECT to get the existing blob.
if has, _ = e.Where(hashCond).Get(existing); has {
return existing, true, nil
}
return nil, false, err
}This follows the same pattern used by I've reverted all the BlobUploader mutex changes since you correctly pointed out that each request creates its own instance. |
|
Made some more changes, do the new changes look good to you? |
|
Ah... I see. Yes, the changes look great! The global lock by hash is a cleaner solution. I should have seen this. Thank you. |
…currently (go-gitea#36524) Co-authored-by: wxiaoguang <wxiaoguang@gmail.com>
* giteaofficial/main: fix(packages/container): data race when uploading container blobs concurrently (go-gitea#36524)
…currently (#36524) (#36526) Backport #36524 by @noeljackson Fix data race when uploading container blobs concurrently Co-authored-by: Noel Jackson <n@noeljackson.com> Co-authored-by: wxiaoguang <wxiaoguang@gmail.com>
* origin/main: (1246 commits) fix(packages/container): data race when uploading container blobs concurrently (go-gitea#36524) [skip ci] Updated translations via Crowdin Remove and forbid `@ts-expect-error` (go-gitea#36513) Add resolve/unresolve review comment API endpoints (go-gitea#36441) Fix incorrect vendored detections (go-gitea#36508) Bump alpine to 3.23, add platforms to `docker-dryrun` (go-gitea#36379) Unify repo names in system notices (go-gitea#36491) Allow scroll propagation outside code editor (go-gitea#36502) Refactor ActionsTaskID (go-gitea#36503) Update JS deps, remove `knip`, misc tweaks (go-gitea#36499) [skip ci] Updated translations via Crowdin Fix editorconfig not respected in PR Conversation view (go-gitea#36492) Add FOLDER_ICON_THEME configuration option (go-gitea#36496) Don't create self-references in merged PRs (go-gitea#36490) Use reserved .test TLD for unit tests (go-gitea#36498) Fix bug when list pull request commits (go-gitea#36485) Update some go dependencies (go-gitea#36489) chore: add comments for "api/healthz", clean up test env (go-gitea#36481) [SECURITY] Toolchain Update to Go 1.25.6 (go-gitea#36480) [skip ci] Updated translations via Crowdin ... # Conflicts: # modules/templates/helper.go # options/locale/locale_en-US.ini # routers/web/repo/cherry_pick.go # routers/web/repo/editor.go # routers/web/repo/patch.go # templates/repo/editor/edit.tmpl # web_src/js/features/codeeditor.ts
…currently (go-gitea#36524) Co-authored-by: wxiaoguang <wxiaoguang@gmail.com>
Fix data race when uploading container blobs concurrently