[SPARK-32216][SQL] Remove redundant ProjectExec #29031

allisonwang-db · 2020-07-07T23:46:33Z

What changes were proposed in this pull request?

This PR adds a physical rule to remove redundant project nodes. A ProjectExec is redundant when either of the following holds:

1. Its output attributes and their order match its child's output, when the ordering of these attributes is required.
2. Its output attributes match its child's output in any order, when attribute output ordering is not required.

For example:
After Filter:

== Physical Plan ==
*(1) Project [a#14L, b#15L, c#16, key#17] 
+- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
   +- *(1) ColumnarToRow
      +- FileScan parquet [a#14L,b#15L,c#16,key#17]

The Project [a#14L, b#15L, c#16, key#17] is redundant because its output is exactly the same as the Filter's output.

Before Aggregate:

== Physical Plan ==
*(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], output=[sum_a#39L, key#17, last_b#41L])
+- Exchange hashpartitioning(key#17, 5), true, [id=#77]
   +- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
      +- *(1) Project [key#17, a#14L, b#15L]
         +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
            +- *(1) ColumnarToRow
               +- FileScan parquet [a#14L,b#15L,key#17]

The Project [key#17, a#14L, b#15L] is redundant because the hash aggregate doesn't require its child plan's output to be in a specific order.
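
The two conditions and both examples above can be sketched with plain Scala collections. This is only an illustration, not the rule added by this PR: `Attr`, `RedundantProjectSketch`, and `isRedundant` are hypothetical names, and an attribute is reduced to a name plus its expression ID.

```scala
// Hypothetical stand-in for a Catalyst attribute: only the fields the check needs.
final case class Attr(name: String, exprId: Long)

object RedundantProjectSketch {
  // A project is treated as redundant when its output matches the child's output:
  // position by position if the column order is required, in any order otherwise.
  def isRedundant(project: Seq[Attr], child: Seq[Attr], requireOrdering: Boolean): Boolean =
    if (requireOrdering) project.map(_.exprId) == child.map(_.exprId)
    else project.map(_.exprId).sorted == child.map(_.exprId).sorted
}
```

With the "Before Aggregate" plan, the project output `[key#17, a#14L, b#15L]` and the child output `[a#14L, b#15L, key#17]` match only when ordering is not required, which is why the aggregate case can drop the project while an order-sensitive parent could not.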

Why are the changes needed?

It removes unnecessary query nodes and makes the query plan cleaner.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

maryannxue · 2020-07-07T23:49:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

@@ -2935,6 +2944,8 @@ class SQLConf extends Serializable with Logging {

  def subqueryReuseEnabled: Boolean = getConf(SUBQUERY_REUSE_ENABLED)

+  def removeRedundantProjectsEnabled: Boolean = getConf(REMOVE_REDUNDANT_PROJECTS_ENABLED)


nit: Since this conf is used only once, we can remove this variable.

HyukjinKwon · 2020-07-08T03:51:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+    .internal()
+    .doc("Whether to remove redundant project exec node based on children's output and " +
+      "ordering requirement.")
+    .version("3.0.0")


gatorsmile · 2020-07-09T05:42:03Z

ok to test

gatorsmile · 2020-07-09T05:42:09Z

cc @cloud-fan

SparkQA · 2020-07-09T07:05:01Z

Test build #125436 has finished for PR 29031 at commit a24d93f.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-07-09T07:47:20Z

retest this please

cloud-fan · 2020-07-09T07:51:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/RemoveRedundantProjects.scala

+        val keepOrdering = a.aggregateExpressions
+          .exists(ae => ae.mode.equals(Final) || ae.mode.equals(PartialMerge))
+        a.mapChildren(removeProject(_, keepOrdering))
+      case w: WindowExec => w.mapChildren(removeProject(_, false))


WindowExec.output is implemented as child.output ++ windowExpression.map(_.toAttribute). I think we require the ordering for window children.

Instead of setting the ordering requirement to true, I am wondering whether WindowExec should inherit this ordering requirement from its parent. For example, in this case

Project[a, avg, key]
  WindowExec[avg] [key] [a]

WindowExec doesn't actually require its child's columns to be ordered. Is there any scenario where WindowExec must require its child's output columns to be ordered? I am having trouble coming up with a test case for it.
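
To make the concern concrete, here is a minimal sketch that uses plain strings in place of attributes. `WindowOutputSketch` and `windowOutput` are hypothetical names; the only thing taken from the thread is the shape `child.output ++ windowExpression.map(_.toAttribute)`.

```scala
object WindowOutputSketch {
  // Mirrors the shape quoted above: the window's output is the child's columns
  // followed by the columns produced by the window expressions.
  def windowOutput(childOutput: Seq[String], windowColumns: Seq[String]): Seq[String] =
    childOutput ++ windowColumns
}
```

Under this model, a child that emits `[key, a]` gives a window output of `[key, a, avg]`, while a child that emits `[a, key]` gives `[a, key, avg]`. Removing a reordering project below the window therefore changes the window's own output order, which only matters if whatever sits above the window depends on that order.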

cloud-fan · 2020-07-09T07:54:28Z

sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantProjectsSuite.scala

+      spark.range(100).selectExpr("id % 10 as key", "id * 2 as a",
+        "id * 3 as b", "cast(id as string) as c", "array(id, id + 1, id + 3) as d")
+        .write.partitionBy("key").parquet(path)
+      spark.read.parquet(path).createOrReplaceTempView("testView")


Shall we put the view creation in beforeAll? Then we only need to do it once for the entire test suite.

SparkQA · 2020-07-09T11:26:39Z

Test build #125453 has finished for PR 29031 at commit a24d93f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

allisonwang-db · 2020-07-10T03:06:49Z

sql/core/src/main/scala/org/apache/spark/sql/execution/RemoveRedundantProjects.scala

+      case d: DataSourceV2ScanExecBase if !d.supportsColumnar => false
+      case _ =>
+        if (requireOrdering) {
+          project.output.map(_.exprId.id) == child.output.map(_.exprId.id)


@cloud-fan I am wondering if the qualifier in Attribute should be considered here as well (besides exprId). Could an attribute's qualifier in a ProjectExec differ from the one in its child?

I don't think so. AttributeReference.sameRef doesn't consider the qualifier either.

SparkQA · 2020-07-10T07:05:01Z

Test build #125546 has finished for PR 29031 at commit 2613c30.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-07-10T07:17:38Z

sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantProjectsSuite.scala

+    }
+  }
+
+  private def assertProjectExec(query: String, enabled: Integer, disabled: Integer): Unit = {


Integer -> Int?

cloud-fan · 2020-07-10T07:17:44Z

sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantProjectsSuite.scala

+
+class RemoveRedundantProjectsSuite extends QueryTest with SharedSparkSession with SQLTestUtils {
+
+  private def assertProjectExecCount(df: DataFrame, expected: Integer): Unit = {


cloud-fan · 2020-07-10T07:18:19Z

sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantProjectsSuite.scala

+  }
+
+  test("subquery") {
+    testData


I think it's clearer to create a new view here for testing.

SparkQA · 2020-08-04T03:56:44Z

Test build #127019 has finished for PR 29031 at commit 4585a04.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

allisonwang-db · 2020-08-04T17:24:21Z

retest this please

cloud-fan · 2020-08-05T04:58:53Z

retest this please

cloud-fan · 2020-08-05T04:59:08Z

add to whitelist

SparkQA · 2020-08-05T06:51:41Z

Test build #127077 has finished for PR 29031 at commit 4585a04.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-08-05T08:16:26Z

seems this breaks the DPP test, @allisonwang-db please take a look.

SparkQA · 2020-08-06T19:47:34Z

Test build #127150 has finished for PR 29031 at commit abca971.

This patch fails PySpark pip packaging tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2020-08-06T23:17:07Z

Test build #127155 has finished for PR 29031 at commit 1632028.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

allisonwang-db · 2020-08-07T01:42:34Z

retest this please

SparkQA · 2020-08-07T03:59:36Z

Test build #127159 has finished for PR 29031 at commit 6d5cade.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-07T06:26:09Z

Test build #127163 has finished for PR 29031 at commit 6d5cade.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-07T22:39:45Z

Test build #127212 has finished for PR 29031 at commit feabc1f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-08-08T02:23:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/RemoveRedundantProjects.scala

        } else {
-          project.output.map(_.exprId.id).sorted == child.output.map(_.exprId.id).sorted
+          project.output.map(_.exprId.id).sorted == child.output.map(_.exprId.id).sorted &&
+            checkNullability(project.output, child.output)


it should be

val orderedProjectOutput = project.output.sortBy(_.exprId.id)
val orderedChildOutput = child.output.sortBy(_.exprId.id)
orderedProjectOutput.map(_.exprId.id) == orderedChildOutput.map(_.exprId.id) &&
  checkNullability(orderedProjectOutput, orderedChildOutput)
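
The reason for sorting both outputs before the nullability check can be shown with a small sketch. The names `NAttr` and `NullabilitySketch` are hypothetical, and the body of `checkNullability` is an assumption (the thread does not show it): it is taken here to compare attributes pairwise by position and to reject a project attribute that is non-nullable where the child's is nullable.

```scala
// Hypothetical stand-in for a Catalyst attribute with a nullability flag.
final case class NAttr(name: String, exprId: Long, nullable: Boolean)

object NullabilitySketch {
  // Assumed semantics: a positional, pairwise comparison of nullability.
  def checkNullability(output: Seq[NAttr], childOutput: Seq[NAttr]): Boolean =
    output.zip(childOutput).forall { case (p, c) => p.nullable || !c.nullable }

  // Order-insensitive comparison: sort both sides by expression ID first so that
  // the pairwise nullability check lines up the same attribute on each side.
  def sameOutputIgnoringOrder(project: Seq[NAttr], child: Seq[NAttr]): Boolean = {
    val orderedProject = project.sortBy(_.exprId)
    val orderedChild = child.sortBy(_.exprId)
    orderedProject.map(_.exprId) == orderedChild.map(_.exprId) &&
      checkNullability(orderedProject, orderedChild)
  }
}
```

Because the pairwise check is positional, passing the unsorted outputs would compare unrelated attributes with each other whenever the project and its child list the same columns in a different order.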

cloud-fan · 2020-08-10T17:31:44Z

can you rebase/merge with the master branch to fix conflicts?

SparkQA · 2020-08-10T20:42:23Z

Test build #127289 has finished for PR 29031 at commit 8495025.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2020-08-10T22:52:06Z

Test build #127290 has finished for PR 29031 at commit 394126d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-08-11T03:14:12Z

thanks, merging to master!

probot-autolabeler bot added the SQL label Jul 7, 2020

maryannxue reviewed Jul 7, 2020

HyukjinKwon reviewed Jul 8, 2020

cloud-fan reviewed Jul 9, 2020

allisonwang-db commented Jul 10, 2020

cloud-fan reviewed Jul 10, 2020

cloud-fan reviewed Jul 10, 2020

cloud-fan approved these changes Aug 4, 2020

cloud-fan closed this Aug 5, 2020

cloud-fan reopened this Aug 5, 2020

allisonwang-db force-pushed the remove-project branch from abca971 to 1632028 Compare August 6, 2020 20:59

cloud-fan reviewed Aug 8, 2020

allisonwang-db and others added 9 commits August 10, 2020 10:39

remove redundant project exec (cb88819)
address comments (4aa7bd8)
address comments (69e67ff)
address comments (fac1c7e)
fix tests (f4ac613)
fix golden file (e7d3e1d)
add nullability check (b798f5b)
update (ae290bc)
resolve conflict (394126d)

allisonwang-db force-pushed the remove-project branch from 8495025 to 394126d Compare August 10, 2020 18:02

cloud-fan closed this in 1b7443b Aug 11, 2020

allisonwang-db deleted the remove-project branch January 19, 2024 01:21
