Skip to content

Add distinct to the right side when LOJ + IS NULL is rewritten to Semijoin#24884

Merged
tanjialiang merged 1 commit into
prestodb:masterfrom
tanjialiang:semi_join
Apr 10, 2025
Merged

Add distinct to the right side when LOJ + IS NULL is rewritten to Semijoin#24884
tanjialiang merged 1 commit into
prestodb:masterfrom
tanjialiang:semi_join

Conversation

@tanjialiang

@tanjialiang tanjialiang commented Apr 8, 2025

Copy link
Copy Markdown
Contributor

Description

issue #24510

The join might have a huge right side in cases of following optimization:
When left join has the 'is null' key filter on the right side, it is effectively making the query return rows from the left side where there is a no match on the right side. This currently is optimized by converting the section into a left semi join. But the current issue is the right side (key only) might have large amount of duplication that is completely unnecessary for evaluation of the join. This PR addresses it by adding a distinct aggregation operator before to optimize the performance

Selective Meta internal production queries showed 100x performance gain and 3x memory reduction

== RELEASE NOTE ==
General Changes
Add distinct on the right side optimization on optimizer rule ``LeftJoinNullFilterToSemiJoin``.

@tanjialiang tanjialiang requested a review from kaikalur April 8, 2025 04:16
@tanjialiang tanjialiang requested a review from presto-oss April 8, 2025 04:16
@prestodb-ci prestodb-ci added the from:Meta PR from Meta label Apr 8, 2025
@kaikalur

kaikalur commented Apr 8, 2025

Copy link
Copy Markdown
Contributor

Fix the title to be more specific - add distinct to the right side when LOJ + IS NULL is rewritten to Semijoin

@tanjialiang tanjialiang changed the title Add distinct semi join plan optimization Add distinct to the right side when LOJ + IS NULL is rewritten to Semijoin Apr 8, 2025
@kaikalur

kaikalur commented Apr 8, 2025

Copy link
Copy Markdown
Contributor

Also this only partially addresses the issue linked because there could be other direct uses of semijoin that we don't address. This just improves on the original optimization.

@kaikalur

kaikalur commented Apr 8, 2025

Copy link
Copy Markdown
Contributor

Add couple more tests:

  1. already distinct on the right side (and make sure you don't get double distinct - not incorrect but just unncessary)
  2. add group by like: .. (select custkey from orders group by custkey having count(1) > 5) and see if you get the correct plan

@tanjialiang tanjialiang force-pushed the semi_join branch 2 times, most recently from d1e556a to 98c91f4 Compare April 8, 2025 21:38
kaikalur
kaikalur previously approved these changes Apr 8, 2025

@kaikalur kaikalur left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kaikalur

kaikalur commented Apr 8, 2025

Copy link
Copy Markdown
Contributor

@jaystarshot - please take a look. Thanks!

jaystarshot
jaystarshot previously approved these changes Apr 9, 2025

@jaystarshot jaystarshot left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please also add a release note?

@jaystarshot

Copy link
Copy Markdown
Member

This only applies an agg if the previous join is a Left join, isn't it better to apply this to all semi joins for a general case (as mentioned in the issue) ?

@tanjialiang

Copy link
Copy Markdown
Contributor Author

This only applies an agg if the previous join is a Left join, isn't it better to apply this to all semi joins for a general case (as mentioned in the issue) ?

Hi @jaystarshot, thanks for your review. Yeah we are going to follow up on extracting this out as a more general case support as a followup. We will keep it updated in the linked issue.

@steveburnett

Copy link
Copy Markdown
Contributor

Thanks for the release note! Some formatting nits.

== RELEASE NOTE ==

General Changes
Add distinct on the right side optimization on optimizer rule ``LeftJoinNullFilterToSemiJoin``.

@tanjialiang

Copy link
Copy Markdown
Contributor Author

Hi @jaystarshot could you help with another stamp after rebase? Thank you

@tanjialiang tanjialiang requested a review from jaystarshot April 10, 2025 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:Meta PR from Meta

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants