`regex_files` defaulted `folder=Path()` and `file=Path()`; both evaluate as `PosixPath('.')`, which is truthy AND a directory. Calling `regex_files(file=foo)` silently ALSO walked the entire current working directory, decoding every file into a Python string and running regex over it. With many `CODE_REPOSITORY` events this allocated 100+ GB (observed in production: 105 GB across 5 calls).
Fix uses `None` defaults with explicit `is not None` checks. Also caps the per-file regex scan at 10 MB — real git ref/object/info files are small; oversized usually means a webserver returned an HTML error page instead of the requested git path.
Adds a regression test that fails on the old code (asserts a decoy file in cwd is NOT scanned when only `file=` is passed).
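To make the failure mode concrete, here is a standalone sketch of that regression-test idea — a stand-in `regex_files` with the fixed `None`-default signature, not the actual bbot module or test code (all names and helpers here are illustrative):

```python
import re
from pathlib import Path

def regex_files(regexes, folder=None, file=None):
    """Stand-in with the fix: None defaults, explicit presence checks.

    The buggy version defaulted folder=Path() and file=Path();
    Path() is PosixPath('.'), truthy and a directory, so a
    file=-only call also walked the whole cwd.
    """
    scanned = []
    if folder is not None and folder.is_dir():
        scanned.extend(p for p in folder.rglob("*") if p.is_file())
    if file is not None and file.is_file():
        scanned.append(file)
    # ... decode each scanned file and run `regexes` over the text ...
    return scanned

def test_file_only_does_not_walk_cwd(tmp_path, monkeypatch):
    monkeypatch.chdir(tmp_path)            # controlled cwd
    decoy = tmp_path / "decoy.txt"         # decoy in cwd
    decoy.write_text("must never be read")
    head = tmp_path / "HEAD"
    head.write_text("ref: refs/heads/main")
    scanned = regex_files([re.compile(r"refs/")], file=head)
    assert scanned == [head]               # old code would include the decoy
```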
Three additional safeguards layered on top of the cwd-walk fix:

1. `download_files` now passes `max_size=10MB` to `helpers.download`. The previous default (500 MB, from the web helper) accepted arbitrarily large responses for any probed git path.
2. `download_object` recursion is capped at 100 levels. Real git object graphs are shallow; anything deeper means an unbounded chain of object references (malicious or corrupt repo).
3. `download_object` now tracks visited object hashes and skips duplicates. Cyclic tree references (A → B → A) no longer recurse forever.

Each safeguard has a regression test in `test_gitdumper_safeguards.py` that fails on the unfixed code (verified) and passes after.
Performance benchmark report: 1 improvement 🚀, 23 unchanged ✅ (Python 3.11.15).
Codecov report (coverage diff vs `dev`, PR #3086):

| | dev | #3086 | +/- |
|---|---:|---:|---:|
| Coverage | 91% | 91% | -0% |
| Files | 440 | 441 | +1 |
| Lines | 37560 | 37655 | +95 |
| Hits | 33973 | 34057 | +84 |
| Misses | 3587 | 3598 | +11 |
## Summary
Production captured a memray trace where a single bbot scan allocated 127 GB peak / 1.7 TB total — and 96.5% of it (122 GB) was attributable to gitdumper. Investigation found one catastrophic bug plus several thinner safeguards worth adding while we were in the area. Each fix has a regression test that fails on the old code and passes after.
### 1. The catastrophic bug — `regex_files` walked the cwd

`gitdumper.regex_files` defaulted both of its path parameters, `folder` and `file`, to `Path()`. `Path()` is `PosixPath('.')` — the current working directory. `bool(Path('.'))` is `True`, and `Path('.').is_dir()` is `True`. So whenever code called `regex_files(file=foo)` to scan one specific file (e.g. `download_current_branch` reading `.git/HEAD`), the function silently also walked the entire cwd, opened every file, decoded it as UTF-8 into a Python string, and ran regex over it.

In production this fired once per `CODE_REPOSITORY` event the scan saw. With many leaked-`.git` targets and bbot installed at `/opt/bbot/` (33,905 files including `.venv/`, wordlists, the test suite, and all source), the cumulative allocation reached the 100+ GB seen in the trace. The actual git files involved (HEAD, config, etc.) were tiny and not the issue at all.

Fix: replace the `Path()` defaults with `None` and check `is not None` explicitly.
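The footgun is easy to reproduce in isolation; everything below is plain stdlib behavior, and the buggy/fixed signatures are paraphrased from this description rather than copied from the module:

```python
from pathlib import Path

# Why both common guards passed for the default argument:
assert Path() == Path(".")     # Path() is the current directory
assert bool(Path("."))         # truthy (Path defines no __bool__)
assert Path(".").is_dir()      # and a directory

# Buggy pattern: a file=-only call still enters the folder branch.
def regex_files_buggy(regexes, folder=Path(), file=Path()):
    if folder and folder.is_dir():   # always True for the default -> walks cwd
        pass  # ...walk folder, read and regex-scan every file...

# Fixed pattern: None defaults, explicit presence checks.
def regex_files_fixed(regexes, folder=None, file=None):
    if folder is not None and folder.is_dir():
        pass  # ...walk only an explicitly requested folder...
    if file is not None:
        pass  # ...scan just this one file...
```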
### 2. Per-file size cap on `regex_file`

Even with the cwd-walk fixed, a folder scan over `.git/` still feeds every file through `file.read()` into a single Python string. A multi-GB pack file would do real damage. Skip files larger than 10 MB — legit git ref/object/info files are tiny; an oversized one always means a webserver returned an HTML error page in place of the requested git path.
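A minimal sketch of the cap, assuming a plain "stat, then read, then scan" flow — the constant and function names are illustrative, not the module's actual code:

```python
from pathlib import Path

MAX_REGEX_FILE_SIZE = 10 * 1024 * 1024  # 10 MB, the cap from this PR

def scan_one_file(path: Path, regexes):
    # Legit git ref/object/info files are tiny; anything oversized is
    # an HTML error page in disguise, so skip it before read() allocates.
    if path.stat().st_size > MAX_REGEX_FILE_SIZE:
        return []
    text = path.read_text(errors="ignore")
    return [m for rx in regexes for m in rx.findall(text)]
```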
### 3. `max_size` on `download_files`

`download_files` was calling `helpers.download(...)` with no `max_size`, so each request accepted up to 500 MB (the web helper default). A misconfigured or malicious server can return arbitrary content for any probed git path. Cap at 10 MB per file.
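The actual change just threads `max_size` through to `helpers.download`; as a self-contained illustration of the underlying idea (a generic stand-in, not bbot's helper), a size-capped download streams the response and aborts once the cap is exceeded:

```python
import urllib.request

MAX_SIZE = 10 * 1024 * 1024  # 10 MB per probed git path

def bounded_download(url: str, dest: str, max_size: int = MAX_SIZE) -> bool:
    """Stream the response to disk, rejecting it once the cap is hit."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        read = 0
        while chunk := resp.read(65536):
            read += len(chunk)
            if read > max_size:
                return False  # oversized response: reject, don't buffer it
            out.write(chunk)
    return True
```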
### 4. Recursion depth limit on `download_object`

`download_object` recurses through git object references via `git cat-file -p` output. Real git object graphs are shallow (commit → tree → tree → blob, depth 10-20 in practice). Without a cap, an unbounded chain of object references could exhaust the Python stack. Cap at 100 levels.
### 5. Cycle detection on `download_object`

Same function, related issue: a malicious or corrupt git repo can have cyclic tree references (object A's content references B, and B's references A). Without tracking visited hashes, the recursion never terminates. Now tracks visited object hashes and skips duplicates.
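Both `download_object` safeguards fit in a few lines. Here is an illustrative recursive walk — the `get_references` callback stands in for parsing `git cat-file -p` output for child object hashes, and the real module's code is structured differently:

```python
MAX_OBJECT_DEPTH = 100  # real git graphs are ~10-20 deep in practice

def walk_objects(obj_hash, get_references, seen=None, depth=0):
    if depth > MAX_OBJECT_DEPTH:
        return                   # unbounded chain: malicious or corrupt repo
    if seen is None:
        seen = set()
    if obj_hash in seen:
        return                   # cycle (A -> B -> A): skip duplicates
    seen.add(obj_hash)
    for child in get_references(obj_hash):  # child object hashes
        walk_objects(child, get_references, seen, depth + 1)
```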
## Verification

Live scan against mijnlieff.nl (the original target the user reported): same target, same module set, the finding is emitted normally, and the scan completes in ~40 s — a ~650× reduction.
Existing module tests (`test_module_gitdumper`) still pass. The new regression suite in `test_step_1/test_gitdumper_safeguards.py` covers all four safeguards — each test was confirmed to fail on the unfixed code before being committed.