Skip to content

[SPARK-32129][SQL] Support AQE skew join with Union #28947

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

LantaoJin
Copy link
Contributor

What changes were proposed in this pull request?

In the apply method of OptimizeSkewedJoin, we first match out the UnionExec nodes, then try to optimize their children with current logic.

Why are the changes needed?

Current, the AQE skew join only supports two tables join such as

SMJ
:-Sort
:    +-Shuffle
+-Sort
     +-Shuffle

But if the plan contains a Union, the skew join handling not work:

Union
:-SMJ
:   :-Sort
:   :    +-Shuffle
:   +-Sort
:        +-Shuffle
+-SMJ
:   :-Sort
:   :    +-Shuffle
:   +-Sort
         +-Shuffle

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add a UT.

@LantaoJin
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jun 30, 2020

Test build #124652 has finished for PR 28947 at commit d011e9a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jul 1, 2020

Test build #124705 has finished for PR 28947 at commit d011e9a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Copy link
Contributor Author

retest this please

@LantaoJin
Copy link
Contributor Author

ping @cloud-fan @gatorsmile

@SparkQA
Copy link

SparkQA commented Jul 1, 2020

Test build #124736 has finished for PR 28947 at commit d011e9a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jul 1, 2020

Test build #124740 has finished for PR 28947 at commit d011e9a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

return plan
}

// Try to handle skew join with union case, like
Copy link
Contributor

@cloud-fan cloud-fan Jul 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it more general? It seems like we can optimize any SMJ if its 2 children are both shuffle stages. cc @JkSelf @maryannxue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 3-table join, if they are in the same query stage, it means the shuffles are all leaf, and we will only optimize the first SMJ, as the second SMJ has only one side as shuffle stage.

Copy link
Contributor Author

@LantaoJin LantaoJin Jul 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 3-table join, if they are in the same query stage, it means the shuffles are all leaf, and we will only optimize the first SMJ, as the second SMJ has only one side as shuffle stage.

We have a 3-table skewed join implamentation in our internal code base. But we have replaced the skew join handling logic by community's. So our optimization is not work based on currnet OptimizeSkewedJoin. I will try to re-implement it in current OptimizeSkewedJoin and submit a PR later.

Copy link
Contributor Author

@LantaoJin LantaoJin Jul 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it more general? It seems like we can optimize any SMJ if its 2 children are both shuffle stages. cc @JkSelf @maryannxue

Yes. we usually implemented some optimizations based on our inner usages and issues. So it may be not general. I only see the UNION case so far.

@LantaoJin
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jul 1, 2020

Test build #124785 has finished for PR 28947 at commit d011e9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

Can we make a general approach here? e.g. if we just optimize SMJ with both sides as shuffle stages, we can even optimize the first join of a 3-table join plan.

@LantaoJin
Copy link
Contributor Author

Can we make a general approach here? e.g. if we just optimize SMJ with both sides as shuffle stages, we can even optimize the first join of a 3-table join plan.

#29021 is the PR to handle more general skew pattern includes n-tables join.

@manuzhang
Copy link
Contributor

manuzhang commented Aug 27, 2020

@cloud-fan @LantaoJin
Any progress or any suggestion on where we can move forward with this improvement ? We've seen a lot of skew joins not being handled by AQE due to union.

cc @maryannxue @JkSelf

@github-actions
Copy link

github-actions bot commented Dec 6, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 6, 2020
@github-actions github-actions bot closed this Dec 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants