Skip to content

[SPARK-32216][SQL] Remove redundant ProjectExec #29031

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

allisonwang-db
Copy link
Contributor

What changes were proposed in this pull request?

This PR added a physical rule to remove redundant project nodes. A ProjectExec is redundant when

  1. It has the same output attributes and order as its child's output when ordering of these attributes is required.
  2. It has the same output attributes as its child's output when attribute output ordering is not required.

For example:
After Filter:

== Physical Plan ==
*(1) Project [a#14L, b#15L, c#16, key#17] 
+- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
   +- *(1) ColumnarToRow
      +- FileScan parquet [a#14L,b#15L,c#16,key#17] 

The Project a#14L, b#15L, c#16, key#17 is redundant because its output is exactly the same as filter's output.

Before Aggregate:

== Physical Plan ==
*(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], output=[sum_a#39L, key#17, last_b#41L])
+- Exchange hashpartitioning(key#17, 5), true, [id=#77]
   +- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
      +- *(1) Project [key#17, a#14L, b#15L]
         +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
            +- *(1) ColumnarToRow
               +- FileScan parquet [a#14L,b#15L,key#17] 

The Project key#17, a#14L, b#15L is redundant because hash aggregate doesn't require child plan's output to be in a specific order.

Why are the changes needed?

It removes unnecessary query nodes and makes query plan cleaner.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

@@ -2935,6 +2944,8 @@ class SQLConf extends Serializable with Logging {

def subqueryReuseEnabled: Boolean = getConf(SUBQUERY_REUSE_ENABLED)

def removeRedundantProjectsEnabled: Boolean = getConf(REMOVE_REDUNDANT_PROJECTS_ENABLED)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Since this conf is used only once, we can remove this variable.

.internal()
.doc("Whether to remove redundant project exec node based on children's output and " +
"ordering requirement.")
.version("3.0.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> 3.1.0

@gatorsmile
Copy link
Member

ok to test

@gatorsmile
Copy link
Member

cc @cloud-fan

@SparkQA
Copy link

SparkQA commented Jul 9, 2020

Test build #125436 has finished for PR 29031 at commit a24d93f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

val keepOrdering = a.aggregateExpressions
.exists(ae => ae.mode.equals(Final) || ae.mode.equals(PartialMerge))
a.mapChildren(removeProject(_, keepOrdering))
case w: WindowExec => w.mapChildren(removeProject(_, false))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WindowExec.output is implemented as child.output ++ windowExpression.map(_.toAttribute). I think we require the ordering for window children.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of setting require ordering to be true, I am wondering should WindowExec inherit this ordering requirement from its parent? For example in this case

Project[a, avg, key]
  WindowExec[avg] [key] [a]

WindowExec actually does't require column to be ordered. Is there any scenario where WindowExec must require child output column to be ordered? I am having trouble coming up with a test case for it.

spark.range(100).selectExpr("id % 10 as key", "id * 2 as a",
"id * 3 as b", "cast(id as string) as c", "array(id, id + 1, id + 3) as d")
.write.partitionBy("key").parquet(path)
spark.read.parquet(path).createOrReplaceTempView("testView")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we put the view creation in beforeAll? Then we only need to do it once for the entire test suite.

@SparkQA
Copy link

SparkQA commented Jul 9, 2020

Test build #125453 has finished for PR 29031 at commit a24d93f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case d: DataSourceV2ScanExecBase if !d.supportsColumnar => false
case _ =>
if (requireOrdering) {
project.output.map(_.exprId.id) == child.output.map(_.exprId.id)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan I am wondering if the qualifier in Attribute should be considered here as well (besides exprId). Would an attribute qualifier in a ProjectExec be different from its child?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. AttributeReferece.sameRef doesn't consider qualifier as well.

@SparkQA
Copy link

SparkQA commented Jul 10, 2020

Test build #125546 has finished for PR 29031 at commit 2613c30.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

private def assertProjectExec(query: String, enabled: Integer, disabled: Integer): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integer -> Int?


class RemoveRedundantProjectsSuite extends QueryTest with SharedSparkSession with SQLTestUtils {

private def assertProjectExecCount(df: DataFrame, expected: Integer): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

}

test("subquery") {
testData
Copy link
Contributor

@cloud-fan cloud-fan Jul 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more clear to create a new view here for testing.

@SparkQA
Copy link

SparkQA commented Aug 4, 2020

Test build #127019 has finished for PR 29031 at commit 4585a04.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@allisonwang-db
Copy link
Contributor Author

retest this please

1 similar comment
@cloud-fan
Copy link
Contributor

retest this please

@cloud-fan cloud-fan closed this Aug 5, 2020
@cloud-fan cloud-fan reopened this Aug 5, 2020
@cloud-fan
Copy link
Contributor

add to whitelist

@SparkQA
Copy link

SparkQA commented Aug 5, 2020

Test build #127077 has finished for PR 29031 at commit 4585a04.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

seems this breaks the DPP test, @allisonwang-db please take a look.

@SparkQA
Copy link

SparkQA commented Aug 6, 2020

Test build #127150 has finished for PR 29031 at commit abca971.

  • This patch fails PySpark pip packaging tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 6, 2020

Test build #127155 has finished for PR 29031 at commit 1632028.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@allisonwang-db
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Aug 7, 2020

Test build #127159 has finished for PR 29031 at commit 6d5cade.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 7, 2020

Test build #127163 has finished for PR 29031 at commit 6d5cade.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 7, 2020

Test build #127212 has finished for PR 29031 at commit feabc1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} else {
project.output.map(_.exprId.id).sorted == child.output.map(_.exprId.id).sorted
project.output.map(_.exprId.id).sorted == child.output.map(_.exprId.id).sorted &&
checkNullability(project.output, child.output)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be

val orderedProjectOutput = project.output.sortBy(_.exprId.id)
val orderedChildOutput = child.output.sortBy(_.exprId.id)
orderedProjectOutput.map(_.expr.id) == orderedChildOutput.map(_.exprId.id) &&
  checkNullability(orderedProjectOutput, orderedChildOutput)

@cloud-fan
Copy link
Contributor

can you rebase/merge with the master branch to fix conflicts?

@SparkQA
Copy link

SparkQA commented Aug 10, 2020

Test build #127289 has finished for PR 29031 at commit 8495025.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 10, 2020

Test build #127290 has finished for PR 29031 at commit 394126d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 1b7443b Aug 11, 2020
@allisonwang-db allisonwang-db deleted the remove-project branch January 19, 2024 01:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants