[SPARK-36063][SQL] Optimize OneRowRelation subqueries #33284


Conversation

allisonwang-db
Contributor

What changes were proposed in this pull request?

This PR adds an optimization for scalar and lateral subqueries with OneRowRelation as leaf nodes. It inlines such subqueries before decorrelation to avoid rewriting them as left outer joins. It also introduces a flag to turn this optimization on or off: spark.sql.optimizer.optimizeOneRowRelationSubquery (default: true).

For example:

```sql
select (select c1) from t
```

Analyzed plan:

```
Project [scalar-subquery#17 [c1#18] AS scalarsubquery(c1)#22]
:  +- Project [outer(c1#18)]
:     +- OneRowRelation
+- LocalRelation [c1#18, c2#19]
```

Optimized plan before this PR:

```
Project [c1#18#25 AS scalarsubquery(c1)#22]
+- Join LeftOuter, (c1#24 <=> c1#18)
   :- LocalRelation [c1#18]
   +- Aggregate [c1#18], [c1#18 AS c1#18#25, c1#18 AS c1#24]
      +- LocalRelation [c1#18]
```

Optimized plan after this PR:

```
LocalRelation [scalarsubquery(c1)#22]
```
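
For illustration, the effect can be inspected by toggling the new flag at runtime (a sketch using standard Spark APIs; it assumes a SparkSession named spark and a table t with a column c1):

```scala
// Compare the optimized plans with the optimization disabled and enabled.
// The config key is the flag introduced by this PR.
spark.conf.set("spark.sql.optimizer.optimizeOneRowRelationSubquery", "false")
spark.sql("select (select c1) from t").explain(true)  // subquery rewritten as a left outer join

spark.conf.set("spark.sql.optimizer.optimizeOneRowRelationSubquery", "true")
spark.sql("select (select c1) from t").explain(true)  // subquery inlined into the parent plan
```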

Why are the changes needed?

To optimize query plans.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit tests.

github-actions bot added the SQL label Jul 9, 2021

SparkQA commented Jul 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45375/


SparkQA commented Jul 9, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45375/


SparkQA commented Jul 9, 2021

Test build #140864 has finished for PR 33284 at commit fede916.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45385/


SparkQA commented Jul 10, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45385/


SparkQA commented Jul 10, 2021

Test build #140874 has finished for PR 33284 at commit 65c1ba0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@allisonwang-db
Contributor Author

cc @cloud-fan

* Rewrite a subquery expression into one or more expressions. The rewrite can only be done
* if there are no nested subqueries in the subquery plan.
*/
private def rewrite(plan: LogicalPlan): LogicalPlan = plan.transformUpWithSubqueries {
Contributor

do we need to handle nested subqueries here? I think the rule OptimizeSubqueries will run this rule again to optimize nested subqueries.

Contributor Author

The reason we need to check subqueries is to handle nested subqueries:

Project [scalar-subquery [a]]
:  +- Project [scalar-subquery [b]] <-- collapsible if transform with nested subqueries first
:     :  +- Project [outer(b) + 1]
:     :     +- OneRowRelation
:     +- Project [outer(a) as b]
:         +- OneRowRelation
+- Relation [a]

A subquery's plan should only be rewritten if it doesn't contain another correlated subquery. If we do not transform the nested subqueries first, we will miss cases like the one above.

*/
private def rewrite(plan: LogicalPlan): LogicalPlan = plan.transformUpWithSubqueries {
case LateralJoin(left, right @ LateralSubquery(OneRowSubquery(projectList), _, _, _), _, None)
if right.plan.subqueriesAll.isEmpty && right.joinCond.isEmpty =>
Contributor

I think subqueries.isEmpty is good enough?

case p: LogicalPlan => p.transformExpressionsUpWithPruning(
_.containsPattern(SCALAR_SUBQUERY)) {
case s @ ScalarSubquery(OneRowSubquery(projectList), _, _, _)
if s.plan.subqueriesAll.isEmpty && s.joinCond.isEmpty =>
Contributor

ditto

private def rewrite(plan: LogicalPlan): LogicalPlan = plan.transformUpWithSubqueries {
case LateralJoin(left, right @ LateralSubquery(OneRowSubquery(projectList), _, _, _), _, None)
if right.plan.subqueriesAll.isEmpty && right.joinCond.isEmpty =>
Project(left.output ++ projectList, left)
Contributor

If the lateral join has a condition, can we just add a Filter above the Project?

Contributor Author

It should be fine for an inner join, but it's trickier for a left outer join. This also applies to subqueries after correlated filters have been pulled up as join conditions. Maybe this can be a separate optimization before RewriteCorrelatedScalarSubqueries / RewriteLateralSubqueries.
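
For the inner-join case, the suggestion could roughly be an extra match arm in the rewrite function quoted above (a sketch only, not this PR's code; cond stands for the lateral join condition):

```scala
// Hypothetical extra case: for an INNER lateral join with a condition, inline the
// OneRowRelation subquery as a Project and keep the condition as a Filter on top.
case LateralJoin(left, right @ LateralSubquery(OneRowSubquery(projectList), _, _, _), Inner, Some(cond))
    if right.plan.subqueriesAll.isEmpty && right.joinCond.isEmpty =>
  Filter(cond, Project(left.output ++ projectList, left))
```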

@@ -4053,6 +4061,8 @@ class SQLConf extends Serializable with Logging {

def decorrelateInnerQueryEnabled: Boolean = getConf(SQLConf.DECORRELATE_INNER_QUERY_ENABLED)

def optimizeOneRowRelationSubquery: Boolean = getConf(SQLConf.OPTIMIZE_ONE_ROW_RELATION_SUBQUERY)
Contributor

nit: it's only called once; we can just call conf.getConf(SQLConf.OPTIMIZE_ONE_ROW_RELATION_SUBQUERY) in the new rule.
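
In code, the suggestion amounts to roughly the following (a hedged sketch; OPTIMIZE_ONE_ROW_RELATION_SUBQUERY and rewrite come from the quoted diff, while the rule name OptimizeOneRowRelationSubquery is assumed here):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.internal.SQLConf

// Guard the rewrite with an inline conf read instead of a dedicated SQLConf accessor.
// (rewrite is the private helper quoted earlier in this review.)
object OptimizeOneRowRelationSubquery extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = {
    if (conf.getConf(SQLConf.OPTIMIZE_ONE_ROW_RELATION_SUBQUERY)) rewrite(plan) else plan
  }
}
```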

val correctAnswer = Project(Seq(x, y), DomainJoin(Seq(x, y), OneRowRelation()))
val innerPlan = Project(Seq(OuterReference(x).as("x1"), OuterReference(y).as("y1")), t0)
val correctAnswer = Project(
Seq(x.as("x1"), y.as("y1"), x, y), DomainJoin(Seq(x, y), t0))
Contributor

will we optimize away the DomainJoin at the end?

Contributor Author

For now, once the domain join is added, it will always be rewritten as an inner join, because the join condition in the subquery might not be null, e.g. select (select c1 where c1 = c2 + 1) from t.


SparkQA commented Jul 15, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45556/


SparkQA commented Jul 15, 2021

Test build #141041 has finished for PR 33284 at commit 8dd685a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

allisonwang-db force-pushed the spark-36063-optimize-subquery-one-row-relation branch from 8dd685a to 73afab1 on July 15, 2021 02:27

SparkQA commented Jul 15, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45559/


SparkQA commented Jul 15, 2021

Test build #141044 has finished for PR 33284 at commit 73afab1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please


SparkQA commented Jul 15, 2021

Test build #141084 has started for PR 33284 at commit 73afab1.


SparkQA commented Jul 15, 2021

Kubernetes integration test unable to build dist.

exiting with code: 141
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45597/


SparkQA commented Jul 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45621/


SparkQA commented Jul 16, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45621/


SparkQA commented Jul 16, 2021

Test build #141108 has finished for PR 33284 at commit c148cfe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

transformUp(g)
Contributor

nit:

transformUp { case plan =>
  val transformed = plan transformExpressionsUp {
    case planExpression: PlanExpression[PlanType] =>
      val newPlan = planExpression.plan.transformUpWithSubqueries(f)
      planExpression.withNewPlan(newPlan)
  }
  f.applyOrElse[PlanType, PlanType](transformed, identity)
}

@@ -435,6 +435,28 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
subqueries ++ subqueries.flatMap(_.subqueriesAll)
}

/**
* Returns a copy of this node where the given partial function has been recursively applied
* first to this node's children, then this node's subqueries, and finally this node itself
Contributor

I think this doc is wrong. We apply the func to subqueries first, then children, then the node itself.

val inner = t0.select('a.as("a1"), 'b.as("b1")).select(('a1 + 'b1).as("c"))
val query = t1.select(ScalarSubquery(inner).as("sub"))
val optimized = Optimize.execute(query.analyze)
val correctAnswer = Project(Alias(Alias(a + b, "c")(), "sub")() :: Nil, t1)
Contributor

nit: it's a bit weird that sometimes we use the DSL and sometimes we construct LogicalPlans directly. Can we be consistent? I think this could be t1.select(('a + 'b).as("c").as("sub")).

Contributor Author

The analyzer removes the extra aliases, which makes the correct answer differ from the optimized plan. I will add an alias clean-up rule to the test optimizer.
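
A minimal sketch of what such a test optimizer could look like (assumptions: the rule added by this PR is named OptimizeOneRowRelationSubquery, and the analyzer's existing CleanupAliases rule is reused for the clean-up; the actual test batches may differ):

```scala
import org.apache.spark.sql.catalyst.analysis.CleanupAliases
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.RuleExecutor

// Test-only optimizer: run the new rule, then strip the nested aliases that the
// analyzer would normally remove, so the optimized plan and the expected plan
// can be compared structurally.
object Optimize extends RuleExecutor[LogicalPlan] {
  val batches =
    Batch("OptimizeOneRowRelationSubquery", Once, OptimizeOneRowRelationSubquery) ::
    Batch("CleanupAliases", Once, CleanupAliases) :: Nil
}
```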

val inner = t0.select('a.as("b")).select(ScalarSubquery(t0.select('b)).as("s"))
val query = t1.select(ScalarSubquery(inner).as("sub"))
val optimized = Optimize.execute(query.analyze)
val correctAnswer = Project(Alias(Alias(a, "s")(), "sub")() :: Nil, t1)
Contributor

ditto

}
}

test("Should not optimize subquery with nested subqueries") {
Contributor

I think we do support nested subqueries; the problem here is the WHERE a = 1?

Contributor Author

Yes. The test title is a bit confusing. Will update.


SparkQA commented Jul 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45794/


SparkQA commented Jul 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45794/


SparkQA commented Jul 20, 2021

Test build #141280 has finished for PR 33284 at commit 7ba1974.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Thanks, merging to master/3.2 (it's a very useful optimization for the new LATERAL JOIN feature in 3.2).

cloud-fan closed this in de8e4be Jul 22, 2021
cloud-fan pushed a commit that referenced this pull request Jul 22, 2021

Closes #33284 from allisonwang-db/spark-36063-optimize-subquery-one-row-relation.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit de8e4be)
Signed-off-by: Wenchen Fan <[email protected]>
@HyukjinKwon
Member

@allisonwang-db
Contributor Author

@HyukjinKwon Thanks for letting me know. The test failures are from branch-3.2 and I will fix them soon.

cloud-fan pushed a commit that referenced this pull request Jul 28, 2021
…lar subqueries

This PR cherry picks #33235 to branch-3.2 to fix test failures introduced by #33284.

### What changes were proposed in this pull request?
This PR allows the `Project` node to host outer references in scalar subqueries when `decorrelateInnerQuery` is enabled. It is already supported by the new decorrelation framework and the `RewriteCorrelatedScalarSubquery` rule.

Note that currently, by default, all correlated subqueries are decorrelated, which is not necessarily the most optimal approach. Consider `SELECT (SELECT c1) FROM t`. This should be optimized as `SELECT c1 FROM t` instead of being rewritten as a left outer join. This will be done in a separate PR to optimize correlated scalar/lateral subqueries with OneRowRelation.

### Why are the changes needed?
To allow more types of correlated scalar subqueries.

### Does this PR introduce _any_ user-facing change?
Yes. This PR allows outer query column references in the SELECT clause of a correlated scalar subquery. For example:
```sql
SELECT (SELECT c1) FROM t;
```
Before this change:
```
org.apache.spark.sql.AnalysisException: Expressions referencing the outer query are not supported
outside of WHERE/HAVING clauses
```

After this change:
```
+------------------+
|scalarsubquery(c1)|
+------------------+
|0                 |
|1                 |
+------------------+
```

### How was this patch tested?
Added unit tests and SQL tests.

(cherry picked from commit ca348e5)
Signed-off-by: allisonwang-db <allison.wangdatabricks.com>

Closes #33527 from allisonwang-db/spark-36028-3.2.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
allisonwang-db deleted the spark-36063-optimize-subquery-one-row-relation branch January 19, 2024 01:22