[SPARK-10339] [SPARK-10334] [SPARK-10301] [SQL] Partitioned table scan can OOM driver and throw a better error message when users need to enable parquet schema merging #8515
Conversation
…d of create one Filter/Project for every partition.
"A Parquet file's schema has different number of fields with the table schema. " + | ||
"Please enable schema merging by setting \"mergeSchema\" to true when load " + | ||
"a Parquet dataset or set spark.sql.parquet.mergeSchema to tru in SQLConf.") | ||
} |
@liancheng @marmbrus I added a check to make sure that a Parquet file's struct has the same number of fields as the corresponding struct in the table schema. If this check fails, we ask users to enable mergeSchema.
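For reference, a minimal sketch of the shape such a check could take — this is an assumed illustration, not the actual Spark code; checkFieldCount is a hypothetical helper and the exception type is chosen only for the sketch:

    import org.apache.spark.sql.types.StructType

    // Hypothetical helper: fail fast when a Parquet file's struct and the
    // corresponding struct in the table schema disagree on field count.
    def checkFieldCount(fileStruct: StructType, tableStruct: StructType): Unit = {
      if (fileStruct.fields.length != tableStruct.fields.length) {
        throw new RuntimeException(
          "A Parquet file's schema has a different number of fields than the table schema. " +
            "Please enable schema merging by setting \"mergeSchema\" to true when loading " +
            "a Parquet dataset, or set spark.sql.parquet.mergeSchema to true in SQLConf.")
      }
    }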
…ields with the table schema.
test this please
"Please enable schema merging by setting \"mergeSchema\" to true when load " + | ||
"a Parquet dataset or set spark.sql.parquet.mergeSchema to true in SQLConf.") | ||
} | ||
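For context, this is what the error message asks users to do; both knobs are standard Spark 1.5 APIs (the snippet assumes a SQLContext named sqlContext and Parquet data at path):

    // Per-read: enable schema merging for this load only.
    val merged = sqlContext.read.option("mergeSchema", "true").parquet(path)
    // Global: enable schema merging for all Parquet reads via SQLConf.
    sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")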
Just a note: this is a quick fix version of #8509.
Test build #41770 has finished for PR 8515 at commit
Test build #41771 has finished for PR 8515 at commit
Would be nice to have the following test case:

    test("SPARK-10334 Projections and filters should be kept in physical plan") {
      withTempPath { dir =>
        val path = dir.getCanonicalPath
        sqlContext.range(2).select('id as 'a, 'id as 'b).write.partitionBy("b").parquet(path)
        val df = sqlContext.read.parquet(path).filter('a === 0).select('b)
        val physicalPlan = df.queryExecution.executedPlan
        assert(physicalPlan.collect { case p: execution.Project => p }.length === 1)
        assert(physicalPlan.collect { case p: execution.Filter => p }.length === 1)
      }
    }

And probably add … Otherwise LGTM. Verified locally that filter push-down and column pruning both work properly.
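As a side note, one way to eyeball that verification yourself (not part of this PR's test suite; assumes a SQLContext named sqlContext and partitioned Parquet data at path) is to print the extended physical plan and look for the pushed filters and pruned columns in the scan node:

    import sqlContext.implicits._

    val df = sqlContext.read.parquet(path).filter('a === 0).select('b)
    df.explain(true)  // pushed filters and pruned columns show up in the scan node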
Test build #41786 has finished for PR 8515 at commit
Thanks, merging to master and branch-1.5.
…n can OOM driver and throw a better error message when users need to enable parquet schema merging This fixes the problem that scanning a partitioned table puts the driver under high memory pressure and can take down the cluster. Also, with this fix, we will be able to correctly show the query plan of a query consuming partitioned tables. https://issues.apache.org/jira/browse/SPARK-10339 https://issues.apache.org/jira/browse/SPARK-10334 Finally, this PR squeezes in a "quick fix" for SPARK-10301. It is not a real fix; it just throws a better error message to let users know what to do. Author: Yin Huai <[email protected]> Closes #8515 from yhuai/partitionedTableScan. (cherry picked from commit 097a7e3) Signed-off-by: Michael Armbrust <[email protected]>
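To illustrate why one Filter/Project per partition pressures the driver, here is a toy, self-contained sketch; the plan ADT below is made up for illustration and is not Spark's internal representation:

    sealed trait Plan
    case class Scan(partition: Int) extends Plan
    case class Filter(child: Plan) extends Plan
    case class Project(child: Plan) extends Plan
    case class Union(children: Seq[Plan]) extends Plan

    // Count the nodes in a plan tree.
    def size(p: Plan): Int = p match {
      case Scan(_)    => 1
      case Filter(c)  => 1 + size(c)
      case Project(c) => 1 + size(c)
      case Union(cs)  => 1 + cs.map(size).sum
    }

    val partitions = 1 to 10000
    // Before the fix: a Filter/Project pair wraps every partition's scan.
    val perPartition = Union(partitions.map(p => Project(Filter(Scan(p)))))
    // After the fix: a single Filter/Project sits above the union of scans.
    val single = Project(Filter(Union(partitions.map(Scan(_)))))
    println(size(perPartition))  // 30001 nodes
    println(size(single))        // 10003 nodes

The node count alone understates the difference: in the real planner each plan node also carries expression trees and metadata, so collapsing the per-partition wrappers presumably saves far more driver memory than the raw counts suggest.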
…or nested structs We used to work around SPARK-10301 with a quick fix in branch-1.5 (PR #8515), but it doesn't cover the case described in SPARK-10428. So this PR backports PR #8509, which had once been considered too big a change to merge into branch-1.5 at the last minute, to fix both SPARK-10301 and SPARK-10428 for Spark 1.5. Also added more test cases for SPARK-10428. This PR looks big, but the essential change is only ~200 loc. All other changes are for testing. In particular, PR #8454 is also backported here because the `ParquetInteroperabilitySuite` introduced in PR #8515 depends on it. This should be safe since #8454 only touches testing code. Author: Cheng Lian <[email protected]> Closes #8583 from liancheng/spark-10301/for-1.5.