`regex_files` defaulted `folder=Path()` and `file=Path()`; both evaluate as `PosixPath('.')`, which is truthy AND a directory. Calling `regex_files(file=foo)` silently ALSO walked the entire current working directory, decoding every file into a Python string and running regex over it. With many `CODE_REPOSITORY` events this allocated 100+ GB (observed in production: 105 GB across 5 calls).
Fix uses `None` defaults with explicit `is not None` checks. Also caps the per-file regex scan at 10 MB — real git ref/object/info files are small; oversized usually means a webserver returned an HTML error page instead of the requested git path.
Adds a regression test that fails on the old code (asserts a decoy file in cwd is NOT scanned when only `file=` is passed).
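To make the failure mode concrete, here is a standalone sketch of that regression-test idea — a stand-in `regex_files` with the fixed `None`-default signature, not the actual bbot module or test code (all names and helpers here are illustrative):

```python
import re
from pathlib import Path

def regex_files(regexes, folder=None, file=None):
    """Stand-in with the fix: None defaults, explicit presence checks.

    The buggy version defaulted folder=Path() and file=Path();
    Path() is PosixPath('.'), truthy and a directory, so a
    file=-only call also walked the whole cwd.
    """
    scanned = []
    if folder is not None and folder.is_dir():
        scanned.extend(p for p in folder.rglob("*") if p.is_file())
    if file is not None and file.is_file():
        scanned.append(file)
    # ... decode each scanned file and run `regexes` over the text ...
    return scanned

def test_file_only_does_not_walk_cwd(tmp_path, monkeypatch):
    monkeypatch.chdir(tmp_path)            # controlled cwd
    decoy = tmp_path / "decoy.txt"         # decoy in cwd
    decoy.write_text("must never be read")
    head = tmp_path / "HEAD"
    head.write_text("ref: refs/heads/main")
    scanned = regex_files([re.compile(r"refs/")], file=head)
    assert scanned == [head]               # old code would include the decoy
```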
Three additional safeguards layered on top of the cwd-walk fix:

1. `download_files` now passes `max_size=10MB` to `helpers.download`. The previous default (500 MB, from the web helper) accepted arbitrarily large responses for any probed git path.
2. `download_object` recursion is capped at 100 levels. Real git object graphs are shallow; anything deeper means an unbounded chain of object references (malicious or corrupt repo).
3. `download_object` now tracks visited object hashes and skips duplicates. Cyclic tree references (A → B → A) no longer recurse forever.

Each safeguard has a regression test in `test_gitdumper_safeguards.py` that fails on the unfixed code (verified) and passes after.
Performance benchmark report: 1 improvement 🚀, 23 unchanged ✅ (Python 3.11.15).
Codecov report (coverage diff vs `dev`, PR #3086):

| | dev | #3086 | +/- |
|---|---:|---:|---:|
| Coverage | 91% | 91% | -0% |
| Files | 440 | 441 | +1 |
| Lines | 37560 | 37655 | +95 |
| Hits | 33973 | 34057 | +84 |
| Misses | 3587 | 3598 | +11 |
## Summary
Production captured a memray trace where a single bbot scan allocated 127 GB peak / 1.7 TB total — and 96.5% of it (122 GB) was attributable to gitdumper. Investigation found one catastrophic bug plus several thinner safeguards worth adding while we were in the area. Each fix has a regression test that fails on the old code and passes after.
### 1. The catastrophic bug — `regex_files` walked the cwd

`gitdumper.regex_files` defaulted both of its path parameters, `folder` and `file`, to `Path()`. `Path()` is `PosixPath('.')` — the current working directory. `bool(Path('.'))` is `True`, and `Path('.').is_dir()` is `True`. So whenever code called `regex_files(file=foo)` to scan one specific file (e.g. `download_current_branch` reading `.git/HEAD`), the function silently also walked the entire cwd, opened every file, decoded it as UTF-8 into a Python string, and ran regex over it.

In production this fired once per `CODE_REPOSITORY` event the scan saw. With many leaked-`.git` targets and bbot installed at `/opt/bbot/` (33,905 files including `.venv/`, wordlists, the test suite, and all source), the cumulative allocation reached the 100+ GB seen in the trace. The actual git files involved (HEAD, config, etc.) were tiny and not the issue at all.

Fix: replace the `Path()` defaults with `None` and check `is not None` explicitly.
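The footgun is easy to reproduce in isolation; everything below is plain stdlib behavior, and the buggy/fixed signatures are paraphrased from this description rather than copied from the module:

```python
from pathlib import Path

# Why both common guards passed for the default argument:
assert Path() == Path(".")     # Path() is the current directory
assert bool(Path("."))         # truthy (Path defines no __bool__)
assert Path(".").is_dir()      # and a directory

# Buggy pattern: a file=-only call still enters the folder branch.
def regex_files_buggy(regexes, folder=Path(), file=Path()):
    if folder and folder.is_dir():   # always True for the default -> walks cwd
        pass  # ...walk folder, read and regex-scan every file...

# Fixed pattern: None defaults, explicit presence checks.
def regex_files_fixed(regexes, folder=None, file=None):
    if folder is not None and folder.is_dir():
        pass  # ...walk only an explicitly requested folder...
    if file is not None:
        pass  # ...scan just this one file...
```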
### 2. Per-file size cap on `regex_file`

Even with the cwd-walk fixed, a folder scan over `.git/` still feeds every file through `file.read()` into a single Python string. A multi-GB pack file would do real damage. Skip files larger than 10 MB — legit git ref/object/info files are tiny; an oversized one always means a webserver returned an HTML error page in place of the requested git path.
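A minimal sketch of the cap, assuming a plain "stat, then read, then scan" flow — the constant and function names are illustrative, not the module's actual code:

```python
from pathlib import Path

MAX_REGEX_FILE_SIZE = 10 * 1024 * 1024  # 10 MB, the cap from this PR

def scan_one_file(path: Path, regexes):
    # Legit git ref/object/info files are tiny; anything oversized is
    # an HTML error page in disguise, so skip it before read() allocates.
    if path.stat().st_size > MAX_REGEX_FILE_SIZE:
        return []
    text = path.read_text(errors="ignore")
    return [m for rx in regexes for m in rx.findall(text)]
```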
### 3. `max_size` on `download_files`

`download_files` was calling `helpers.download(...)` with no `max_size`, so each request accepted up to 500 MB (the web helper default). A misconfigured or malicious server can return arbitrary content for any probed git path. Cap at 10 MB per file.
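The actual change just threads `max_size` through to `helpers.download`; as a self-contained illustration of the underlying idea (a generic stand-in, not bbot's helper), a size-capped download streams the response and aborts once the cap is exceeded:

```python
import urllib.request

MAX_SIZE = 10 * 1024 * 1024  # 10 MB per probed git path

def bounded_download(url: str, dest: str, max_size: int = MAX_SIZE) -> bool:
    """Stream the response to disk, rejecting it once the cap is hit."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        read = 0
        while chunk := resp.read(65536):
            read += len(chunk)
            if read > max_size:
                return False  # oversized response: reject, don't buffer it
            out.write(chunk)
    return True
```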
### 4. Recursion depth limit on `download_object`

`download_object` recurses through git object references via `git cat-file -p` output. Real git object graphs are shallow (commit → tree → tree → blob, depth 10-20 in practice). Without a cap, an unbounded chain of object references could exhaust the Python stack. Cap at 100 levels.
### 5. Cycle detection on `download_object`

Same function, related issue: a malicious or corrupt git repo can have cyclic tree references (object A's content references B, and B's references A). Without tracking visited hashes, the recursion never terminates. Now tracks visited object hashes and skips duplicates.
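Both `download_object` safeguards fit in a few lines. Here is an illustrative recursive walk — the `get_references` callback stands in for parsing `git cat-file -p` output for child object hashes, and the real module's code is structured differently:

```python
MAX_OBJECT_DEPTH = 100  # real git graphs are ~10-20 deep in practice

def walk_objects(obj_hash, get_references, seen=None, depth=0):
    if depth > MAX_OBJECT_DEPTH:
        return                   # unbounded chain: malicious or corrupt repo
    if seen is None:
        seen = set()
    if obj_hash in seen:
        return                   # cycle (A -> B -> A): skip duplicates
    seen.add(obj_hash)
    for child in get_references(obj_hash):  # child object hashes
        walk_objects(child, get_references, seen, depth + 1)
```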
## Verification

Live scan against mijnlieff.nl (the original target the user reported): same target, same module set, the finding is emitted normally, and the scan completes in ~40 s — a ~650× reduction.
Existing module tests (`test_module_gitdumper`) still pass. The new regression suite in `test_step_1/test_gitdumper_safeguards.py` covers all four safeguards — each test was confirmed to fail on the unfixed code before being committed.