
gitdumper bug fix + safeguards #3086

Open

liquidsec wants to merge 2 commits into dev from fix-gitdumper-cwd-walk
Conversation

@liquidsec
Contributor

Summary

Production captured a memray trace where a single bbot scan allocated 127 GB peak / 1.7 TB total, and 96.5% of it (122 GB) was attributable to gitdumper. Investigation found one catastrophic bug, plus several smaller safeguards worth adding while we were in the area. Each fix has a regression test that fails on the old code and passes after.

1. The catastrophic bug — regex_files walked the cwd

gitdumper.regex_files had this signature:

```python
async def regex_files(self, regex, folder=Path(), file=Path(), files=[]):
    if folder:                          # always true ← here's the bomb
        if folder.is_dir():
            for file_path in folder.rglob("*"):
                ...
                content = fh.read()     # decodes every file in cwd
```

Path() is PosixPath('.') — the current working directory. bool(Path('.')) is True. Path('.').is_dir() is True. So whenever code called regex_files(file=foo) to scan one specific file (e.g. download_current_branch reading .git/HEAD), the function silently also walked the entire cwd, opened every file, decoded it as UTF-8 into a Python string, and ran regex over it.
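The footgun is easy to reproduce in isolation. A standalone sketch (the `regex_files` body here is a hypothetical stand-in, not bbot's real code):

```python
import tempfile
from pathlib import Path

# The two facts the bug hinges on:
assert Path() == Path(".")     # an empty Path() is the cwd
assert bool(Path("."))         # ...which is truthy (Path defines no __bool__)
assert Path(".").is_dir()      # ...and a directory

def regex_files(folder=Path(), file=Path()):
    """Stand-in for the buggy default-argument pattern."""
    actions = []
    if folder:                  # always True with the Path() default
        if folder.is_dir():     # cwd is a directory, so the walk fires
            actions.append("walk-folder")
    if file.is_file():
        actions.append("scan-file")
    return actions

# The caller intends to scan one specific file...
tmp = Path(tempfile.mkstemp()[1])
# ...but the cwd walk fires anyway:
assert regex_files(file=tmp) == ["walk-folder", "scan-file"]
```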

In production this fired once per CODE_REPOSITORY event the scan saw. With many leaked-.git targets and bbot installed at /opt/bbot/ (33,905 files including .venv/, wordlists, the test suite, all source), the cumulative allocation reached the 100+ GB seen in the trace. The actual git files involved (HEAD, config, etc.) were tiny and not the issue at all.

Fix: replace Path() defaults with None and check is not None explicitly.
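The corrected pattern, as a minimal sketch (the real patch lives in bbot/modules/gitdumper.py; the body below is illustrative):

```python
import tempfile
from pathlib import Path

def regex_files(folder=None, file=None):
    """Fixed pattern: None defaults with explicit `is not None` checks."""
    actions = []
    if folder is not None and folder.is_dir():
        actions.append("walk-folder")
    if file is not None and file.is_file():
        actions.append("scan-file")
    return actions

# Scanning a single file no longer walks the cwd:
tmp = Path(tempfile.mkstemp()[1])
assert regex_files(file=tmp) == ["scan-file"]
```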

2. Per-file size cap on regex_file

Even with the cwd-walk fixed, a folder scan over .git/ still feeds every file through file.read() into a single Python string. A multi-GB pack file would do real damage. Files larger than 10 MB are now skipped: legitimate git ref/object/info files are tiny, and an oversized file almost always means a webserver returned an HTML error page in place of the requested git path.
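The cheap trick is to check stat() before read() ever allocates the string. A sketch of the guard (`should_scan` is a hypothetical name, not the module's):

```python
import tempfile
from pathlib import Path

MAX_REGEX_FILE_SIZE = 10 * 1024 * 1024  # 10 MB cap

def should_scan(path: Path) -> bool:
    """Check size via stat() before file.read() allocates anything."""
    return path.is_file() and path.stat().st_size <= MAX_REGEX_FILE_SIZE

# A normal git ref file passes:
ref = Path(tempfile.mkstemp()[1])
ref.write_text("ref: refs/heads/main\n")
assert should_scan(ref)

# A file just over the cap (sparse, so it costs no real disk) is skipped:
big = Path(tempfile.mkstemp()[1])
with open(big, "wb") as fh:
    fh.seek(MAX_REGEX_FILE_SIZE)
    fh.write(b"\0")
assert not should_scan(big)
```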

3. max_size on download_files

download_files was calling helpers.download(...) with no max_size, so each request accepted up to 500 MB (the web helper default). A misconfigured / malicious server can return arbitrary content for any probed git path. Cap at 10 MB per file.
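The change itself is one kwarg. Sketched below against a stub: `helpers.download` and its `max_size` / 500 MB default are from the description above, but everything else (URLs, sizes, the stub's body) is invented for illustration:

```python
import asyncio

MAX_DOWNLOAD_SIZE = 10 * 1024 * 1024  # 10 MB, replacing the 500 MB helper default

# Fake Content-Lengths for two probed git paths (illustrative only):
FAKE_RESPONSE_SIZES = {
    "http://target/.git/HEAD": 41,                              # a real ref file
    "http://target/.git/objects/pack/huge.pack": 2 * 1024**3,   # 2 GB "pack"
}

async def download(url, max_size=500 * 1024 * 1024):
    """Stub standing in for helpers.download: refuses oversized responses."""
    if FAKE_RESPONSE_SIZES[url] > max_size:
        return None          # download refused
    return b"fake-body"

async def download_files(urls):
    # The fix: pass max_size explicitly instead of inheriting the 500 MB default.
    return [await download(u, max_size=MAX_DOWNLOAD_SIZE) for u in urls]

results = asyncio.run(download_files(list(FAKE_RESPONSE_SIZES)))
assert [r is not None for r in results] == [True, False]
```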

4. Recursion depth limit on download_object

download_object recurses through git object references via git cat-file -p output. Real git object graphs are shallow (commit → tree → tree → blob, depth 10-20 in practice). Without a cap, an unbounded chain of object references could exhaust the Python stack. Cap at 100 levels.

5. Cycle detection on download_object

Same function, related issue: a malicious or corrupt git repo can have cyclic tree references (object A's content references B, B's references A). Without tracking visited hashes, recursion never terminates. Now tracks visited object hashes and skips duplicates.
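Both download_object safeguards reduce to two extra parameters threaded through the recursion. A simplified sketch over an in-memory object graph (the real function shells out to git cat-file -p; `walk_objects` is a hypothetical name):

```python
MAX_OBJECT_DEPTH = 100

def walk_objects(objects, sha, depth=0, visited=None):
    """Recurse through object references with a depth cap and cycle detection."""
    if visited is None:
        visited = set()
    if depth > MAX_OBJECT_DEPTH:   # safeguard 4: unbounded reference chains
        return visited
    if sha in visited:             # safeguard 5: cyclic tree references
        return visited
    visited.add(sha)
    for child in objects.get(sha, []):
        walk_objects(objects, child, depth + 1, visited)
    return visited

# A cyclic "repo" (A -> B -> A) terminates instead of recursing forever:
assert sorted(walk_objects({"A": ["B"], "B": ["A"]}, "A")) == ["A", "B"]

# A 200-deep chain is cut off at the cap (objects 0..100 visited):
chain = {str(i): [str(i + 1)] for i in range(200)}
assert len(walk_objects(chain, "0")) == MAX_OBJECT_DEPTH + 1
```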

Verification

Live scan against mijnlieff.nl (the original target the user reported):

| Run | Peak RSS |
| --- | --- |
| Before fix | 127 GB (memray trace) |
| After fix | 193 MB |

Same target, same module set; the finding is emitted normally and the scan completes in ~40s. That is a ~650× reduction in peak memory.

Existing module tests (test_module_gitdumper) still pass. New regression suite in test_step_1/test_gitdumper_safeguards.py covers all four safeguards — each test was confirmed to fail on the unfixed code before being committed.

liquidsec added 2 commits May 8, 2026 16:35
regex_files defaulted folder=Path() and file=Path(); both evaluate as
PosixPath('.') which is truthy AND a directory. Calling
regex_files(file=foo) silently ALSO walked the entire current working
directory, decoding every file into a Python string and running regex
over it. With many CODE_REPOSITORY events this allocated 100+ GB
(observed in production: 105 GB across 5 calls).

Fix uses None defaults with explicit `is not None` checks. Also caps
per-file regex scan at 10 MB — real git ref/object/info files are
small; oversized usually means a webserver returned an HTML error page
instead of the requested git path.

Adds a regression test that fails on the old code (asserts a decoy
file in cwd is NOT scanned when only file= is passed).

Three additional safeguards layered on top of the cwd-walk fix:

1. download_files now passes max_size=10MB to helpers.download. Previous
   default (500MB from web helper) accepted arbitrarily large responses
   for any probed git path.

2. download_object recursion is capped at 100 levels. Real git object
   graphs are shallow; deeper means an unbounded chain of object
   references (malicious or corrupt repo).

3. download_object now tracks visited object hashes and skips duplicates.
   Cyclic tree references (A → B → A) no longer recurse forever.

Each safeguard has a regression test in test_gitdumper_safeguards.py
that fails on the unfixed code (verified) and passes after.
@github-actions
Contributor

github-actions Bot commented May 8, 2026

📊 Performance Benchmark Report

Comparing dev (baseline) vs fix-gitdumper-cwd-walk (current)

📈 Detailed Results (All Benchmarks)

📋 Complete results for all benchmarks - includes both significant and insignificant changes

| 🧪 Test Name | 📏 Base | 📏 Current | 📈 Change | 🎯 Status |
| --- | --- | --- | --- | --- |
| Bloom Filter Dns Mutation Tracking Performance | 4.34ms | 4.28ms | -1.2% | |
| Bloom Filter Large Scale Dns Brute Force | 18.20ms | 17.87ms | -1.8% | |
| Large Closest Match Lookup | 359.13ms | 355.07ms | -1.1% | |
| Realistic Closest Match Workload | 192.09ms | 191.18ms | -0.5% | |
| Event Memory Medium Scan | 1784 B/event | 1784 B/event | +0.0% | |
| Event Memory Large Scan | 1768 B/event | 1768 B/event | +0.0% | |
| Event Validation Full Scan Startup Small Batch | 410.84ms | 423.81ms | +3.2% | |
| Event Validation Full Scan Startup Large Batch | 590.28ms | 578.49ms | -2.0% | |
| Make Event Autodetection Small | 31.20ms | 31.00ms | -0.6% | |
| Make Event Autodetection Large | 314.98ms | 318.51ms | +1.1% | |
| Make Event Explicit Types | 14.04ms | 13.91ms | -0.9% | |
| Excavate Single Thread Small | 4.011s | 3.960s | -1.3% | |
| Excavate Single Thread Large | 9.610s | 9.614s | +0.0% | |
| Excavate Parallel Tasks Small | 4.183s | 4.143s | -0.9% | |
| Excavate Parallel Tasks Large | 7.233s | 7.280s | +0.6% | |
| Is Ip Performance | 3.20ms | 3.29ms | +2.8% | |
| Make Ip Type Performance | 11.65ms | 11.75ms | +0.9% | |
| Mixed Ip Operations | 4.61ms | 4.62ms | +0.2% | |
| Memory Use Web Crawl | 48.1 MB | 43.2 MB | -10.1% | 🟢🟢 🚀 |
| Memory Use Subdomain Enum | 19.4 MB | 19.4 MB | +0.0% | |
| Scan Throughput 100 | 7.869s | 7.523s | -4.4% | |
| Scan Throughput 1000 | 38.958s | 37.853s | -2.8% | |
| Typical Queue Shuffle | 62.04µs | 63.27µs | +2.0% | |
| Priority Queue Shuffle | 702.95µs | 712.13µs | +1.3% | |

🎯 Performance Summary

+ 1 improvement 🚀
  23 unchanged ✅

🔍 Significant Changes (>10%)

  • Memory Use Web Crawl: 10.1% 🚀 less memory

🐍 Python Version 3.11.15

@codecov

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 93.85965% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 91%. Comparing base (64c2193) to head (5df3903).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| bbot/modules/gitdumper.py | 84% | 6 Missing ⚠️ |
| bbot/test/test_step_1/test_gitdumper_safeguards.py | 99% | 1 Missing ⚠️ |
Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #3086   +/-   ##
=====================================
- Coverage     91%     91%   -0%     
=====================================
  Files        440     441    +1     
  Lines      37560   37655   +95     
=====================================
+ Hits       33973   34057   +84     
- Misses      3587    3598   +11     

☔ View full report in Codecov by Sentry.