Skip to content

fix: resolve Git Anonymous Resolver excessive memory usage#8677

Merged
tekton-robot merged 2 commits intotektoncd:mainfrom
aThorp96:git-resolver-memory-leak
Apr 28, 2025
Merged

fix: resolve Git Anonymous Resolver excessive memory usage#8677
tekton-robot merged 2 commits intotektoncd:mainfrom
aThorp96:git-resolver-memory-leak

Conversation

@aThorp96
Copy link
Member

@aThorp96 aThorp96 commented Mar 28, 2025

Switch git resolver from go-git library to use git binary.

The go-git library does not resolve deltas efficiently, and as a result is not recommended to be used to clone repositories with full-depth. In one example RemoteResolutionRequest targeting a repository which summed 145Mb, configuring the resolution timeout to 10 minutes and giving the resolver to have 25Gb of memory, the resolver pod was OOM killed after ~6 minutes. Additionally, since go-git's delta resolution does not accept any contexts, the time required and memory used during resolving a large repository will not be cutoff when the context is canceled, impacting the resolver's performance for other concurrent remote resolutions.

Since the git resolver can be provided a commit sha which is not tracked at any ref/head, and because go-git only supports fetching fully-qualified refs, go-git does not support fetching arbitrary revisions. Therefore, in order to guarantee the requested revision is fetched, if we continue to use the go-git library, we must fetch all revisions.

Switching to the git binary significantly improves the time and memory performance of the git resolver (by multiple orders of magnitude, in my testing). Using the git binary also enables the resolver to take advantage of shallow fetching/cloning due to the git-fetch's support for fetching arbitrary revisions, improving both the time and memory performance by another order of magnitude. Note that if the revision is not at any ref head or tag, fetching the revision does depend on the git server enabling uploadpack.allowReachableSHA1InWant. This feature is enabled and supported in Github and Gitlab but not Gitea: go-gitea/gitea#11958
See also: https://git-scm.com/docs/protocol-capabilities#_allow_reachable_sha1_in_want

Resolves #8652

Changes

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

RemoteResolutions using the Git Resolver now use the `git` binary instead of  the Golang library `go-git` to shallow-clone, shallow-fetch, then checkout the provided repository at the given revision. This reduces resolution time and memory significantly. Some git providers such as Gitea may not support fetching revisions if the revision is a SHA which is not reachable via a ref or is not at a ref/head. In general, no user action is required. 

See also: https://git-scm.com/docs/protocol-capabilities#_allow_reachable_sha1_in_want

Resolves https://github.com/tektoncd/pipeline/issues/8652

@tekton-robot tekton-robot added the release-note-none Denotes a PR that doesnt merit a release note. label Mar 28, 2025
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 28, 2025
@tekton-robot tekton-robot requested review from afrittoli and jerop March 28, 2025 20:12
Comment on lines 52 to 81
_, err = repo.execGit(ctx, "clone", repo.url, tmpDir, "--depth=1", "--no-checkout")
if err != nil {
return nil, cleanupFunc, err
}

_, err = repo.execGit(ctx, "fetch", "origin", repo.revision, "--depth=1")
if err != nil {
return nil, cleanupFunc, err
}

_, err = repo.execGit(ctx, "checkout", "FETCH_HEAD")
if err != nil {
return nil, cleanupFunc, err
}

revisionSha, err := repo.execGit(ctx, "rev-list", "-n1", "HEAD")
if err != nil {
return nil, cleanupFunc, err
}
repo.revision = strings.TrimSpace(string(revisionSha))

return &repo, cleanupFunc, nil
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially simply did a full git clone and then simply did git checkout. However doing a shallow clone with --no-checkout, fetching the revision, and then checking the revision out improved the time and space performance significantly. Using the same repository mentioned in the original issue as a benchmark, using a shallow clone reduced the clone time from ~110 seconds by a factor of ten. Using --no-checkout similarly reduced the disk space and time slightly (before the necessary revision was checked-out)

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/resolution/resolver/git/git.go Do not exist 86.0%
pkg/resolution/resolver/git/resolver.go 83.9% 84.8% 0.9

default: "_json_key"
- name: releaseAsLatest
description: Whether to tag and publish this release as Pipelines' latest
description: Whether to tag and publish this release as Pipelines latest
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was not a grammatical error, I just noticed it threw off the syntax highlighting of my editor and both versions seemed equally correct. I am happy to revert if needed.

github.com/tektoncd/pipeline/cmd/nop: ghcr.io/tektoncd/pipeline/github.com/tektoncd/pipeline/combined-base-image:latest
github.com/tektoncd/pipeline/cmd/workingdirinit: ghcr.io/tektoncd/pipeline/github.com/tektoncd/pipeline/combined-base-image:latest

github.com/tektoncd/pipeline/cmd/git-init: cgr.dev/chainguard/git
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The git-init package no longer exists.

@aThorp96
Copy link
Member Author

/kind feat

@tekton-robot
Copy link
Collaborator

@aThorp96: The label(s) kind/feat cannot be applied, because the repository doesn't have them.

Details

In response to this:

/kind feat

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aThorp96
Copy link
Member Author

/kind feature

@tekton-robot tekton-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 28, 2025
@aThorp96
Copy link
Member Author

Cc @vdemeester

@vdemeester vdemeester self-assigned this Mar 28, 2025
@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesnt merit a release note. labels Mar 29, 2025
@waveywaves
Copy link
Member

waveywaves commented Apr 1, 2025

 --- FAIL: TestGitResolver_Clone_Failure (0.06s)
    --- FAIL: TestGitResolver_Clone_Failure/commit_does_not_exist (1.08s)
    --- PASS: TestGitResolver_Clone_Failure/path_does_not_exist (2.06s)
    --- FAIL: TestGitResolver_Clone_Failure/repo_does_not_exist (2.08s)

beta e2e tests are failing
cc @aThorp96

@aThorp96
Copy link
Member Author

aThorp96 commented Apr 1, 2025

@waveywaves yeah, hoping to get to addressing the e2e test failures today or tomorrow.

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2025
@aThorp96 aThorp96 force-pushed the git-resolver-memory-leak branch from 2f93442 to c0e27c1 Compare April 10, 2025 17:54
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 10, 2025
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/resolution/resolver/git/git.go Do not exist 80.4%
pkg/resolution/resolver/git/resolver.go 84.5% 85.4% 1.0

@aThorp96 aThorp96 force-pushed the git-resolver-memory-leak branch from c0e27c1 to 6073888 Compare April 11, 2025 04:12
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/resolution/resolver/git/git.go Do not exist 78.8%
pkg/resolution/resolver/git/resolver.go 84.5% 85.4% 1.0

Comment on lines 30 to 38
type repository struct {
url string
username string
password string
directory string
revision string
}

func resolveRepository(ctx context.Context, url, username, password, revision string) (*repository, func(), error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can export the repository struct and use it in the function as an argument instead of passing four arguments ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can rework this a bit to make it more ergonomic. I am hesitant to export this struct but I think there is a way to work around that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LMK what you think about the refactor

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something for a follow-up?

Comment on lines +109 to +111
volumeMounts:
- name: tmp-clone-volume
mountPath: "/tmp"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mounting a directory at /tmp is necessary because the security context sets the root filesystem as read-only, and since we're using git clone we have to clone into the filesystem.

Additionally, using an empty-dir volume as the /tmp directory allows us to configure the size of the directory in alignment with the pod's memory requests.

@tekton-robot tekton-robot requested a review from vdemeester April 16, 2025 20:30
@vdemeester vdemeester added this to the Pipeline v1.0.0 milestone Apr 22, 2025
@@ -0,0 +1,2 @@
baseImageOverrides:
github.com/tektoncd/pipeline/cmd/resolvers: cgr.dev/chainguard/git@sha256:566235a8ef752f37d285042ee05fc37dbb04293e50f116a231984080fb835693
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would make all resolvers image having to use the git image... isnt that a bit much for example the http-resolver would def not need it...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly all of the resolvers run as plugins together in the same pod(s). So at least while the some resolvers will have unnecessary access to the git binary, all of the resolver pods will need access to the binary if the git resolver is enabled. Not sure if that's that much better though

@chmouel
Copy link
Member

chmouel commented Apr 22, 2025

we used to have git binary used and one of the reason to use go-github is to make it easier to "ship" it since built-in

now there is a issue with backward compatibility for downstream contributor to have that image updated and support (keep git binary updated)

but then i still think it's a good idea to use the git binary directly, since the lib would never be as good as the git binary...

@@ -0,0 +1,2 @@
baseImageOverrides:
github.com/tektoncd/pipeline/cmd/resolvers: cgr.dev/chainguard/git@sha256:566235a8ef752f37d285042ee05fc37dbb04293e50f116a231984080fb835693
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there might be some changes to be done in the tekton/publish.yaml task as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to modifying the task's inline .koconfig like is already done?

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2025

Git clone with `git clone` is supported for anonymous and authenticated cloning.
Git clone with `git clone` is supported for anonymous and authenticated cloning
This mode shallow-clones the git repo before shallow-fetching and checking
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a bit too low level git plumbing to understand?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's fair

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the bit about shallow fetching, but left in the steps shallow-clone, fetch, then checkout. I added a note detailing when users may need to care about the git plumbing, since there is that possibility. Does that read better now?


_, err = repo.execGit(ctx, "clone", repo.url, tmpDir, "--depth=1", "--no-checkout")
if err != nil {
if strings.Contains(err.Error(), "unable to get password from user") {
Copy link
Member

@chmouel chmouel Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't you rely on unix exit code instead? instead trying to get the strings matches ? not sure if that apply but maybe that you can get the output in some other way? (ie: translated strings)

even if that doesn't apply relying on the exit code should be better still

image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to do something cleaner here. The exit code isn't specific to auth errors however.
image

My intention was to 1) maintain mostly similar error messaging to the old implementation 2) without introducing red herrings. I didn't see another way to ensure that auth issues (or private repos, and by proxy non-existent repos) would maintain their authentication required error without errors like unable to look up $domain being surpressed. Do you have any thoughts?

@aThorp96 aThorp96 force-pushed the git-resolver-memory-leak branch from fbba524 to 89b7461 Compare April 22, 2025 19:12
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/resolution/resolver/git/repository.go Do not exist 90.2%
pkg/resolution/resolver/git/resolver.go 84.5% 84.9% 0.4

@aThorp96 aThorp96 requested a review from chmouel April 23, 2025 20:07
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 25, 2025
Switch git resolver from go-git library to use git binary.

The go-git library does not resolve deltas efficiently, and as a result
is not recommended to be used to clone repositories with full-depth.
In one example RemoteResolutionRequest targeting a repository which
summed 145Mb, configuring the resolution timeout to 10 minutes and
giving the resolver to have 25Gb of memory, the resolver pod was OOM
killed after ~6 minutes. Additionally, since go-git's delta resolution
does not accept any contexts, the time required and memory used during
resolving a large repository will not be cutoff when the context is
canceled, impacting the resolver's performance for other concurrent
remote resolutions.

Since the git resolver can be provided a revision which is not tracked
at any ref in the repository, and because go-git only supports fetching
fully-qualified refs, go-git does not support fetching arbitrary
revisions. Therefore, in order to guarantee the requested revision is
fetched, if we continue to use the go-git library we must fetch all
revisions.

Switching to the git binary enables the git resolver to take advantage
of the git-fetch's support for fetching arbitrary revisions. Note that
if the revision is not at any ref head, fetching the revision does
depend on the git server enabling uploadpack.allowReachableSHA1InWant.

Resolves tektoncd#8652

See also: https://git-scm.com/docs/protocol-capabilities#_allow_reachable_sha1_in_want

NOTE: This feature is enabled and supported in Github and Gitlab but
not Gitea: go-gitea/gitea#11958
@aThorp96 aThorp96 force-pushed the git-resolver-memory-leak branch from 89b7461 to 1b28bc8 Compare April 28, 2025 15:51
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 28, 2025
@aThorp96 aThorp96 requested a review from twoGiants April 28, 2025 15:54
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/resolution/resolver/git/repository.go Do not exist 84.9%
pkg/resolution/resolver/git/resolver.go 84.5% 84.9% 0.4

@afrittoli
Copy link
Member

Thanks for this!
/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 28, 2025
@tekton-robot tekton-robot merged commit d4c88bc into tektoncd:main Apr 28, 2025
20 checks passed
@aThorp96 aThorp96 deleted the git-resolver-memory-leak branch April 28, 2025 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

bug: Excessive memory consumption during git Resolver remote resolution

8 participants