[SPARK-33399][SQL] Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes #30300

prakharjain09 · 2020-11-09T15:19:29Z

What changes were proposed in this pull request?

This pull request removes unneeded exchanges and sorts by normalizing the output partitioning and sort order information with respect to aliases.

Example: consider this join of three tables:

 |SELECT t2id, t3.id as t3id
 |FROM (
 |    SELECT t1.id as t1id, t2.id as t2id
 |    FROM t1, t2
 |    WHERE t1.id = t2.id
 |) t12, t3
 |WHERE t1id = t3.id

The plan for this looks like:

  *(9) Project [t2id#1034L, id#1004L AS t3id#1035L]
  +- *(9) SortMergeJoin [t1id#1033L], [id#1004L], Inner
     :- *(6) Sort [t1id#1033L ASC NULLS FIRST], false, 0
     :  +- Exchange hashpartitioning(t1id#1033L, 5), true, [id=#1343]   <------------------------------
     :     +- *(5) Project [id#996L AS t1id#1033L, id#1000L AS t2id#1034L]
     :        +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
     :           :- *(2) Sort [id#996L ASC NULLS FIRST], false, 0
     :           :  +- Exchange hashpartitioning(id#996L, 5), true, [id=#1329]
     :           :     +- *(1) Range (0, 10, step=1, splits=2)
     :           +- *(4) Sort [id#1000L ASC NULLS FIRST], false, 0
     :              +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1335]
     :                 +- *(3) Range (0, 20, step=1, splits=2)
     +- *(8) Sort [id#1004L ASC NULLS FIRST], false, 0
        +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1349]
           +- *(7) Range (0, 30, step=1, splits=2)

In this plan, the marked Exchange could have been avoided because the data is already partitioned on "t1.id". The extra Exchange is planned because the AliasAwareOutputPartitioning class handles aliases only for HashPartitioning. This change normalizes all output partitionings based on the aliasing that happens in Project.
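The normalization described above can be pictured with a small toy model. This is only a sketch under stated assumptions: `HashPartitioning`, `PartitioningCollection` and `ProjectColumn` here are simplified, hypothetical stand-ins keyed by attribute name, not Spark's Catalyst classes, and `normalize` is not the code from this PR. It assumes the inner join reports a collection of both sides' hash partitionings and shows how rewriting the keys through the Project's aliases makes the partitioning visible under the name `t1id`.

```scala
// Toy model only: hypothetical stand-ins for illustration, not Spark's Catalyst classes.
object AliasNormalizationSketch {
  sealed trait Partitioning
  final case class HashPartitioning(keys: Seq[String], numPartitions: Int) extends Partitioning
  final case class PartitioningCollection(partitionings: Seq[Partitioning]) extends Partitioning

  /** A projected column `child AS name`. */
  final case class ProjectColumn(child: String, name: String)

  /**
   * Rewrites every partitioning key that the Project renames to its alias,
   * recursing into collections so that all members are normalized.
   * Simplification: if a column is aliased twice, the last alias wins.
   */
  def normalize(p: Partitioning, projectList: Seq[ProjectColumn]): Partitioning = {
    val aliasOf: Map[String, String] = projectList.map(c => c.child -> c.name).toMap
    p match {
      case HashPartitioning(keys, n) =>
        HashPartitioning(keys.map(k => aliasOf.getOrElse(k, k)), n)
      case PartitioningCollection(ps) =>
        PartitioningCollection(ps.map(normalize(_, projectList)))
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed output partitioning of the inner SortMergeJoin in the plan above.
    val joinOutput = PartitioningCollection(Seq(
      HashPartitioning(Seq("id#996L"), 5),
      HashPartitioning(Seq("id#1000L"), 5)))
    // Project [id#996L AS t1id#1033L, id#1000L AS t2id#1034L]
    val project = Seq(
      ProjectColumn("id#996L", "t1id#1033L"),
      ProjectColumn("id#1000L", "t2id#1034L"))
    println(normalize(joinOutput, project))
  }
}
```

Once the keys are expressed as `t1id#1033L`, a parent that requires hash partitioning on `t1id#1033L` with the same number of partitions can be satisfied without another shuffle.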

Why are the changes needed?

To remove unneeded exchanges.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New UT added.

On TPCDS 1000 scale, this change improves the performance of query 95 from 330 seconds to 170 seconds by removing the extra Exchange.

prakharjain09 · 2020-11-09T15:28:23Z

cc - @cloud-fan @dongjoon-hyun @imback82

imback82 · 2020-11-09T19:42:46Z

cc @maropu as well.

maropu · 2020-11-10T00:00:10Z

You need to update the plan stability checks for TPCDS:

spark/sql/core/src/test/scala/org/apache/spark/sql/PlanStabilitySuite.scala

Lines 62 to 65 in 83a8079

     * To re-generate golden files for entire suite, run:
     * {{{
     *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStability[WithStats]Suite"
     * }}}

maropu · 2020-11-10T00:00:15Z

ok to test

SparkQA · 2020-11-10T00:46:46Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35420/

SparkQA · 2020-11-10T01:09:48Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35420/

SparkQA · 2020-11-10T06:25:49Z

Test build #130810 has finished for PR 30300 at commit dd6b841.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

prakharjain09 · 2020-11-11T10:41:17Z

@maropu @viirya Thanks for the review. I have addressed the majority of the review comments.

SparkQA · 2020-11-11T11:25:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35531/

SparkQA · 2020-11-11T11:54:01Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35531/

SparkQA · 2020-11-11T13:53:43Z

Test build #130926 has finished for PR 30300 at commit adf3a66.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-12T11:50:51Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35596/

SparkQA · 2020-11-12T12:14:08Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35596/

maropu

Nice improvement! LGTM

SparkQA · 2020-11-12T15:47:55Z

Test build #130990 has finished for PR 30300 at commit d5a0fbe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-12T17:25:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35615/

SparkQA · 2020-11-12T17:48:50Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35615/

SparkQA · 2020-11-12T21:02:45Z

Test build #131009 has finished for PR 30300 at commit f4fd12e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait HasMaxBlockSizeInMB extends Params
class HasMaxBlockSizeInMB(Params):
case class ElementAt(
case class GetArrayItem(
case class Elt(

imback82

LGTM as well.

SparkQA · 2020-11-13T08:02:23Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35646/

SparkQA · 2020-11-13T08:05:02Z

Test build #131040 has finished for PR 30300 at commit c66874a.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-13T08:31:25Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35646/

cloud-fan · 2020-11-13T08:48:57Z

retest this please

SparkQA · 2020-11-13T09:36:08Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35654/

SparkQA · 2020-11-13T09:58:16Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35654/

SparkQA · 2020-11-13T12:30:34Z

Test build #131048 has finished for PR 30300 at commit c66874a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-16T07:46:15Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35739/

SparkQA · 2020-11-16T08:05:02Z

Test build #131136 has finished for PR 30300 at commit 16e1db2.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
>>> class VectorAccumulatorParam(AccumulatorParam):
fully qualified classname of key Writable class (e.g. \"org.apache.hadoop.io.Text\")
fully qualified classname of key Writable class (e.g. \"org.apache.hadoop.io.Text\")
fully qualified classname of key Writable class (e.g. \"org.apache.hadoop.io.Text\")
fully qualified classname of key Writable class (e.g. \"org.apache.hadoop.io.Text\")
class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with Logging
abstract class AbstractSqlParser extends ParserInterface with Logging
class CatalystSqlParser extends AbstractSqlParser
class SparkSqlParser extends AbstractSqlParser
class SparkSqlAstBuilder extends AstBuilder
class VariableSubstitution

SparkQA · 2020-11-16T08:11:43Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35739/

cloud-fan · 2020-11-16T08:21:43Z

retest this please

SparkQA · 2020-11-16T09:08:36Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35745/

SparkQA · 2020-11-16T09:31:06Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35745/

SparkQA · 2020-11-16T13:32:23Z

Test build #131143 has finished for PR 30300 at commit 16e1db2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
>>> class VectorAccumulatorParam(AccumulatorParam):
fully qualified classname of key Writable class (e.g. \"org.apache.hadoop.io.Text\")
fully qualified classname of key Writable class (e.g. \"org.apache.hadoop.io.Text\")
fully qualified classname of key Writable class (e.g. \"org.apache.hadoop.io.Text\")
fully qualified classname of key Writable class (e.g. \"org.apache.hadoop.io.Text\")
class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with Logging
abstract class AbstractSqlParser extends ParserInterface with Logging
class CatalystSqlParser extends AbstractSqlParser
class SparkSqlParser extends AbstractSqlParser
class SparkSqlAstBuilder extends AstBuilder
class VariableSubstitution

maropu · 2020-11-17T01:36:09Z

Thanks! Merged to master.

prakharjain09 · 2020-11-17T05:28:37Z

Thanks @cloud-fan @maropu @imback82 @viirya for the code reviews and suggestions.

maropu · 2020-11-18T04:51:58Z

NOTE: It seems this update makes TPCDS(sf=20) q95 much faster (176324ms->129644ms). Nice.
https://docs.google.com/spreadsheets/d/1V8xoKR9ElU-rOXMH84gb5BbLEw0XAPTJY8c8aZeIqus/edit?usp=sharing

--

cloud-fan · 2020-12-09T16:36:10Z

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

+        val projects = planned.collect { case p: ProjectExec => p }
+        assert(projects.exists(_.outputPartitioning match {
+          case PartitioningCollection(Seq(HashPartitioning(Seq(k1: AttributeReference), _),
+            HashPartitioning(Seq(k2: AttributeReference), _))) if k1.name == "t1id" =>


Not related to this PR: the ProjectExec outputs only t1id (after column pruning), so returning a PartitioningCollection here is a bit redundant. t1id is the only output, and the other partitionings are simply invalid.

Oh, it looks interesting. Hi, @prakharjain09, are you interested in the improvement above?

@maropu Sure. Basically the idea is to stop propagating partitionings and sortOrders that refer to attributes which are not part of the output set?

Working on this as part of https://issues.apache.org/jira/browse/SPARK-33758.
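The idea raised in this thread, dropping partitionings that refer to attributes the operator no longer outputs, can be sketched roughly as follows. This is a minimal illustration under assumptions: `HashPartitioning` is a hypothetical name-keyed stand-in and `prune` is an invented helper, not the change made for SPARK-33758.

```scala
// Toy model only: hypothetical stand-ins for illustration, not Spark's Catalyst classes.
object PrunePartitioningSketch {
  final case class HashPartitioning(keys: Seq[String], numPartitions: Int)

  /** Keeps only the partitionings whose keys are all present in the operator's output. */
  def prune(candidates: Seq[HashPartitioning], output: Set[String]): Seq[HashPartitioning] =
    candidates.filter(_.keys.forall(output.contains))

  def main(args: Array[String]): Unit = {
    val candidates = Seq(
      HashPartitioning(Seq("t1id"), 5),
      HashPartitioning(Seq("t2id"), 5))
    // After column pruning, the Project outputs only t1id,
    // so the partitioning on t2id can no longer be referenced by a parent.
    println(prune(candidates, Set("t1id")))
  }
}
```

With a single surviving partitioning, the operator could report it directly instead of wrapping it in a collection.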

…rtitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes (#1092)

* [SPARK-31078][SQL] Respect aliases in output ordering

Currently, in the following scenario, an unnecessary `Sort` node is introduced:

```scala
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
  val df = (0 until 20).toDF("i").as("df")
  df.repartition(8, df("i")).write.format("parquet")
    .bucketBy(8, "i").sortBy("i").saveAsTable("t")
  val t1 = spark.table("t")
  val t2 = t1.selectExpr("i as ii")
  t1.join(t2, t1("i") === t2("ii")).explain
}
```

```
== Physical Plan ==
*(3) SortMergeJoin [i#8], [ii#10], Inner
:- *(1) Project [i#8]
:  +- *(1) Filter isnotnull(i#8)
:     +- *(1) ColumnarToRow
:        +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
+- *(2) Sort [ii#10 ASC NULLS FIRST], false, 0    <==== UNNECESSARY
   +- *(2) Project [i#8 AS ii#10]
      +- *(2) Filter isnotnull(i#8)
         +- *(2) ColumnarToRow
            +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
```

Notice that `Sort [ii#10 ASC NULLS FIRST], false, 0` is introduced even though the underlying data is already sorted. This is because `outputOrdering` doesn't handle aliases correctly. This PR proposes to fix this issue.

To better handle aliases in `outputOrdering`.

Yes, now with the fix, the `explain` prints out the following:

```
== Physical Plan ==
*(3) SortMergeJoin [i#8], [ii#10], Inner
:- *(1) Project [i#8]
:  +- *(1) Filter isnotnull(i#8)
:     +- *(1) ColumnarToRow
:        +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
+- *(2) Project [i#8 AS ii#10]
   +- *(2) Filter isnotnull(i#8)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
```

Tests added.

Closes #27842 from imback82/alias_aware_sort_order.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-33399][SQL] Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes

This pull request tries to remove unneeded exchanges/sorts by normalizing the output partitioning and sortorder information correctly with respect to aliases.

Example: consider this join of three tables:

 |SELECT t2id, t3.id as t3id
 |FROM (
 |    SELECT t1.id as t1id, t2.id as t2id
 |    FROM t1, t2
 |    WHERE t1.id = t2.id
 |) t12, t3
 |WHERE t1id = t3.id

The plan for this looks like:

  *(9) Project [t2id#1034L, id#1004L AS t3id#1035L]
  +- *(9) SortMergeJoin [t1id#1033L], [id#1004L], Inner
     :- *(6) Sort [t1id#1033L ASC NULLS FIRST], false, 0
     :  +- Exchange hashpartitioning(t1id#1033L, 5), true, [id=#1343]   <------------------------------
     :     +- *(5) Project [id#996L AS t1id#1033L, id#1000L AS t2id#1034L]
     :        +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
     :           :- *(2) Sort [id#996L ASC NULLS FIRST], false, 0
     :           :  +- Exchange hashpartitioning(id#996L, 5), true, [id=#1329]
     :           :     +- *(1) Range (0, 10, step=1, splits=2)
     :           +- *(4) Sort [id#1000L ASC NULLS FIRST], false, 0
     :              +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1335]
     :                 +- *(3) Range (0, 20, step=1, splits=2)
     +- *(8) Sort [id#1004L ASC NULLS FIRST], false, 0
        +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1349]
           +- *(7) Range (0, 30, step=1, splits=2)

In this plan, the marked exchange could have been avoided as the data is already partitioned on "t1.id". This happens because AliasAwareOutputPartitioning class handles aliases only related to HashPartitioning. This change normalizes all output partitioning based on aliasing happening in Project.

To remove unneeded exchanges.

No

New UT added.

On TPCDS 1000 scale, this change improves the performance of query 95 from 330 seconds to 170 seconds by removing the extra Exchange.

Closes #30300 from prakharjain09/SPARK-33399-outputpartitioning.

Authored-by: Prakhar Jain <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>

* [CARMEL-6306] Fix ut

* [CARMEL-6306] Fix alias not compatible with ebay skew implementation

Co-authored-by: Terry Kim <[email protected]>
Co-authored-by: Prakhar Jain <[email protected]>
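The alias-aware output ordering described in the SPARK-31078 message above can be sketched with a toy model. Everything here is an assumption made for illustration: `SortOrder` is a simplified name-keyed stand-in, `normalize` and `satisfies` are hypothetical helpers, and the prefix check is a deliberately simplified version of how an ordering requirement might be tested.

```scala
// Toy model only: hypothetical stand-ins for illustration, not Spark's Catalyst classes.
object SortOrderAliasSketch {
  final case class SortOrder(column: String, ascending: Boolean)

  /** Renames each sort column to its alias when the Project introduces one. */
  def normalize(ordering: Seq[SortOrder], aliases: Map[String, String]): Seq[SortOrder] =
    ordering.map(o => o.copy(column = aliases.getOrElse(o.column, o.column)))

  /** Simplified check: the required ordering must be a prefix of the provided one. */
  def satisfies(provided: Seq[SortOrder], required: Seq[SortOrder]): Boolean =
    provided.startsWith(required)

  def main(args: Array[String]): Unit = {
    val childOrdering = Seq(SortOrder("i#8", ascending = true)) // bucketed table sorted by i
    val required = Seq(SortOrder("ii#10", ascending = true))    // join key on the aliased side
    val aliases = Map("i#8" -> "ii#10")                         // Project [i#8 AS ii#10]
    println(satisfies(childOrdering, required))                     // ordering not recognized
    println(satisfies(normalize(childOrdering, aliases), required)) // ordering recognized
  }
}
```

Without normalization the required ordering on `ii#10` does not match the child's ordering on `i#8`, which is the situation that leads to the extra `Sort` in the first plan.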

normalize outputPartitioning of Project to handle aliases after inner join

dd6b841

github-actions bot added the SQL label Nov 9, 2020

maropu reviewed Nov 10, 2020

maropu mentioned this pull request Nov 10, 2020

[SPARK-33400][SQL] Normalize sameOrderExpressions in SortOrder to avoid unnecessary sort operations #30302

Closed

viirya reviewed Nov 10, 2020

sql/core/src/main/scala/org/apache/spark/sql/execution/AliasAwareOutputExpression.scala

prakharjain09 added 2 commits November 10, 2020 14:33

Merge remote-tracking branch 'oss/master' into SPARK-33399-outputpartitioning

5278272

add more tests, fix existing

adf3a66

prakharjain09 changed the title ~~[SPARK-33399][SQL] Normalize output partitioning of Project with respect to aliases~~ [SPARK-33399][SQL] Normalize output partitioning with respect to aliases to avoid unneeded exchanges Nov 11, 2020

cloud-fan reviewed Nov 11, 2020

sql/core/src/main/scala/org/apache/spark/sql/execution/AliasAwareOutputExpression.scala

prakharjain09 added 2 commits November 12, 2020 13:29

Merge remote-tracking branch 'oss/master' into SPARK-33399-outputpartitioning

9c3f15f

add fix for sorting also

d5a0fbe

prakharjain09 changed the title ~~[SPARK-33399][SQL] Normalize output partitioning with respect to aliases to avoid unneeded exchanges~~ [SPARK-33399][SQL] Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes Nov 12, 2020

cloud-fan reviewed Nov 12, 2020

sql/core/src/main/scala/org/apache/spark/sql/execution/AliasAwareOutputExpression.scala

cloud-fan approved these changes Nov 12, 2020

maropu approved these changes Nov 12, 2020

maropu reviewed Nov 12, 2020

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

prakharjain09 added 2 commits November 12, 2020 21:47

add more assertions in tests

4d5f688

Merge remote-tracking branch 'oss/master' into SPARK-33399-outputpartitioning

f4fd12e

imback82 approved these changes Nov 12, 2020

sql/core/src/main/scala/org/apache/spark/sql/execution/AliasAwareOutputExpression.scala

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

add more assertions in tests

c66874a

Merge remote-tracking branch 'oss/master' into SPARK-33399-outputpartitioning

16e1db2

cloud-fan approved these changes Nov 16, 2020

ulysses-you mentioned this pull request Nov 17, 2020

[SPARK-33442][SQL] Change Combine Limit to Eliminate limit using max row #30368

Closed

maropu closed this in f5e3302 Nov 17, 2020

cloud-fan reviewed Dec 9, 2020
