Skip to content

Conversation

@duxiao1212
Copy link
Contributor

Summary:
Add memory management to approx_most_frequent aggregate function to prevent OOM during global aggregation, where spilling is not effective.

Changes:

  1. Modified ApproxMostFrequentStreamSummary::insert() to return eviction info (evicted value and count)
  2. Added memory tracking (liveBytes_, deadBytes_) for StringView accumulator
  3. Added rebuild() to compact string storage when dead bytes exceed threshold
  4. Added per-key size limit to prevent individual oversized keys

Motivation:
For queries like:
SELECT APPROX_MOST_FREQUENT(20, CAST(feature AS JSON), 1000), ...
FROM table LIMIT 10000000
With 10M input rows and ~82KB per row, without memory management:
20251215_232949_18211_b5kzi

The HashStringAllocator accumulates string storage for all inserted keys
When keys are evicted from the Space-Saving summary, their string storage becomes "dead" but is never freed
Memory grows unbounded until OOM
This fix addresses:

Single large keys - Per-key size limit (kMaxTotalStringBytes / capacity) prevents any individual key from consuming excessive memory
Dead memory accumulation - rebuild() triggers when deadBytes > kMaxTotalStringBytes * 0.25, compacting storage by copying only live keys to new allocation and freeing the old storage

Differential Revision: D89728171

@netlify
Copy link

netlify bot commented Dec 23, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 3a0bb48
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/694b11861311c0000821b498

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 23, 2025
@meta-codesync
Copy link

meta-codesync bot commented Dec 23, 2025

@duxiao1212 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D89728171.

@duxiao1212 duxiao1212 marked this pull request as draft December 23, 2025 21:53
@duxiao1212 duxiao1212 marked this pull request as ready for review December 23, 2025 22:00
…Input (facebookincubator#15852)

Summary:

Add memory management to approx_most_frequent aggregate function to prevent OOM during global aggregation, where spilling is not effective.

Changes:
1. Modified ApproxMostFrequentStreamSummary::insert() to return eviction info (evicted value and count)
2. Added memory tracking (liveBytes_, deadBytes_) for StringView accumulator
3. Added rebuild() to compact string storage when dead bytes exceed threshold
4. Added per-key size limit to prevent individual oversized keys

Motivation:
For queries like:
SELECT APPROX_MOST_FREQUENT(20, CAST(feature AS JSON), 1000), ...
FROM table LIMIT 10000000
With 10M input rows and ~82KB per row, without memory management:
**20251215_232949_18211_b5kzi**

The HashStringAllocator accumulates string storage for all inserted keys
When keys are evicted from the Space-Saving summary, their string storage becomes "dead" but is never freed
Memory grows unbounded until OOM
This fix addresses:

Single large keys - Per-key size limit (kMaxTotalStringBytes / capacity) prevents any individual key from consuming excessive memory
Dead memory accumulation - rebuild() triggers when deadBytes > kMaxTotalStringBytes * 0.25, compacting storage by copying only live keys to new allocation and freeing the old storage

Differential Revision: D89728171
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant