fix: Add memory management for approx_most_frequent addSingleGroupRawInput #15852
+130
−18
Summary:
Add memory management to the approx_most_frequent aggregate function to prevent OOM during global aggregation, where spilling is not effective.
Changes:
Motivation:
For queries like:
SELECT APPROX_MOST_FREQUENT(20, CAST(feature AS JSON), 1000), ...
FROM table LIMIT 10000000
With 10M input rows and ~82KB per row, without memory management:
20251215_232949_18211_b5kzi
- The HashStringAllocator accumulates string storage for all inserted keys.
- When keys are evicted from the Space-Saving summary, their string storage becomes "dead" but is never freed.
- Memory grows unbounded until OOM.
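
As a rough upper bound on the scale involved, assuming (hypothetically) that every row contributed a distinct key whose string storage was never reclaimed:

$$
10^{7}\ \text{rows} \times 82\ \text{KB/row} \approx 8.2 \times 10^{11}\ \text{bytes} \approx 820\ \text{GB}
$$

The real footprint depends on how many keys repeat, so this is only an order-of-magnitude illustration.
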
This fix addresses:
- Single large keys: a per-key size limit (kMaxTotalStringBytes / capacity) prevents any individual key from consuming excessive memory.
- Dead memory accumulation: rebuild() triggers when deadBytes > kMaxTotalStringBytes * 0.25, compacting storage by copying only the live keys to a new allocation and freeing the old storage.
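
The two mechanisms can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: `KeyStore`, `Slot`, the `std::string` arena standing in for HashStringAllocator, and the concrete value of `kMaxTotalStringBytes` are all hypothetical and chosen only to keep the example small. Only the per-key limit (`kMaxTotalStringBytes / capacity`) and the 25% dead-bytes threshold come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Assumed budget for illustration only; the real value is not stated here.
constexpr std::size_t kMaxTotalStringBytes = 1024;

// Toy append-only key store. A std::string arena stands in for the
// allocator-backed string storage used by the real aggregate.
class KeyStore {
 public:
  explicit KeyStore(std::size_t capacity)
      : maxKeyBytes_(kMaxTotalStringBytes / capacity) {
    assert(capacity > 0);
  }

  // Stores a key and returns its slot index, or std::nullopt when the key
  // exceeds the per-key size limit.
  std::optional<std::size_t> insert(std::string_view key) {
    if (key.size() > maxKeyBytes_) {
      return std::nullopt;
    }
    slots_.push_back({arena_.size(), key.size(), true});
    arena_.append(key);
    return slots_.size() - 1;
  }

  // Marks a key as evicted. Its bytes stay in the arena as dead storage
  // until the dead total crosses 25% of the budget, which forces a rebuild.
  void evict(std::size_t slot) {
    Slot& s = slots_[slot];
    if (!s.live) {
      return;
    }
    s.live = false;
    deadBytes_ += s.size;
    // Integer form of: deadBytes > kMaxTotalStringBytes * 0.25
    if (deadBytes_ * 4 > kMaxTotalStringBytes) {
      rebuild();
    }
  }

  // Returns the stored key; evicted slots yield an empty view after a rebuild.
  std::string_view get(std::size_t slot) const {
    const Slot& s = slots_[slot];
    return std::string_view(arena_).substr(s.offset, s.size);
  }

  std::size_t storedBytes() const { return arena_.size(); }
  std::size_t deadBytes() const { return deadBytes_; }
  std::size_t rebuilds() const { return rebuilds_; }

 private:
  struct Slot {
    std::size_t offset;
    std::size_t size;
    bool live;
  };

  // Copies only live keys into a fresh arena and drops the old one.
  void rebuild() {
    std::string fresh;
    fresh.reserve(arena_.size() - deadBytes_);
    for (Slot& s : slots_) {
      if (!s.live) {
        s.offset = 0;
        s.size = 0;
        continue;
      }
      const std::size_t newOffset = fresh.size();
      fresh.append(arena_, s.offset, s.size);
      s.offset = newOffset;
    }
    arena_ = std::move(fresh);
    deadBytes_ = 0;
    ++rebuilds_;
  }

  std::size_t maxKeyBytes_;
  std::string arena_;
  std::vector<Slot> slots_;
  std::size_t deadBytes_ = 0;
  std::size_t rebuilds_ = 0;
};
```

With a capacity of 4 and the assumed 1024-byte budget, keys longer than 256 bytes are rejected, and evicting three 100-byte keys pushes the dead total past 256 bytes, so the third eviction compacts the arena down to the one remaining live key.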
Differential Revision: D89728171