[SPARK-32629][SQL] Track metrics of BitSet/OpenHashSet in full outer SHJ
### What changes were proposed in this pull request?
This is a followup of #29342 that does two things:
* Per #29342 (comment), change from Java `HashSet` to Spark's in-house `OpenHashSet` to track matched rows for non-unique join keys (see the sketch after this list). I checked the `OpenHashSet` implementation, which is built from a key index (`OpenHashSet._bitset` as `BitSet`) and a key array (`OpenHashSet._data` as `Array`). Java `HashSet` is built on `HashMap`, which stores values in `Node` linked lists and in theory should take more memory than `OpenHashSet`. I reran the same benchmark query used in #29342 and verified that the query has similar performance between `HashSet` and `OpenHashSet`.
* Track metrics of the extra data structure (`BitSet`/`OpenHashSet`) for full outer SHJ. This depends on the first change, because there seems to be no easy way to get the memory size of a Java `HashSet`.
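The sketch below illustrates the general idea, not the PR's actual code: a dense `BitSet` indexed by build-row position can mark matched rows when join keys are unique, while an `OpenHashSet` (itself backed by a `BitSet` key index plus a key array) tracks matched row indices for non-unique keys, and its memory footprint can be estimated and reported. The class name `MatchedRowTracker`, its methods, and the use of `SizeEstimator` are assumptions for illustration only.

```scala
// Hypothetical sketch of tracking matched build-side rows in full outer
// shuffled hash join, using Spark's internal BitSet/OpenHashSet collections.
import org.apache.spark.util.SizeEstimator
import org.apache.spark.util.collection.{BitSet, OpenHashSet}

class MatchedRowTracker(numBuildRows: Int, uniqueJoinKeys: Boolean) {
  // With unique join keys, a dense BitSet indexed by build-row position suffices;
  // with non-unique keys, matched row indices are kept in an OpenHashSet,
  // which is built from a BitSet (key index) plus a key array.
  private val bitSet: BitSet =
    if (uniqueJoinKeys) new BitSet(numBuildRows) else null
  private val hashSet: OpenHashSet[Long] =
    if (uniqueJoinKeys) null else new OpenHashSet[Long]

  def markMatched(rowIndex: Int): Unit =
    if (uniqueJoinKeys) bitSet.set(rowIndex) else hashSet.add(rowIndex.toLong)

  def isMatched(rowIndex: Int): Boolean =
    if (uniqueJoinKeys) bitSet.get(rowIndex) else hashSet.contains(rowIndex.toLong)

  // Rough estimate of the extra memory held by the tracking structure, e.g. to
  // report through a SQLMetric; the actual PR may compute this differently.
  def estimatedStateSizeInBytes: Long =
    SizeEstimator.estimate(if (uniqueJoinKeys) bitSet else hashSet)
}
```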
### Why are the changes needed?
To surface the memory usage of full outer SHJ more accurately.
This can help users/developers to debug/improve full outer SHJ.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `SQLMetricsSuite.scala`.
Closes #29566 from c21/add-metrics.
Authored-by: Cheng Su <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>