
[SPARK-8014] [SQL] Avoid premature metadata discovery when writing a HadoopFsRelation with a save mode other than Append #6583


Closed
liancheng wants to merge 3 commits into master from liancheng/spark-8014

Conversation

liancheng
Contributor

The current code references the schema of the DataFrame to be written before checking the save mode, which triggers expensive metadata discovery prematurely. For save modes other than Append, this metadata discovery is useless: we either ignore its result (for Ignore and ErrorIfExists) or delete the existing files anyway (for Overwrite).

This PR fixes the issue by deferring metadata discovery until after the save mode check.
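For illustration, here is a minimal sketch of the reordered control flow (hypothetical names such as doSave and writeToRelation; not the actual Spark code): the save mode is resolved against the existence of the output path first, and the relation's schema, which is what triggers metadata discovery, is only consulted once we know a write will actually happen.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical sketch only: decide what to do from the save mode *before*
// touching the relation's schema, which is what kicks off metadata discovery.
def save(df: DataFrame, mode: SaveMode, outputPath: Path, fs: FileSystem): Unit = {
  val pathExists = fs.exists(outputPath)

  val doSave = (mode, pathExists) match {
    case (SaveMode.ErrorIfExists, true) =>
      sys.error(s"path $outputPath already exists.")
    case (SaveMode.Ignore, true) =>
      false // nothing to write, so no discovery is needed at all
    case (SaveMode.Overwrite, true) =>
      fs.delete(outputPath, true) // existing files are discarded anyway
      true
    case _ =>
      true // Append, or the path does not exist yet
  }

  if (doSave) {
    // Only now is the relation resolved; for Append this is where the
    // (potentially expensive) metadata discovery legitimately happens.
    writeToRelation(df)
  }
}

// Hypothetical placeholder standing in for the real write path.
def writeToRelation(df: DataFrame): Unit = ()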

// the data.
val project =
  Project(
    r.schema.map(field => new UnresolvedAttribute(Seq(field.name))),
Contributor Author


This r.schema is where metadata discovery is triggered. This PR fixes this issue by moving this projection into InsertIntoHadoopFsRelation.
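In other words (a hedged sketch, not the actual patch; relationSchema and projectToRelationSchema are hypothetical names), the column-reordering projection is built lazily inside the insertion command, so the relation's schema is only consulted once the write is actually going to happen:

import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.types.StructType

// Hedged sketch: construct the projection that reorders the query's output to
// match the relation's schema only when the insertion command runs, instead of
// eagerly at resolution time, where it used to force metadata discovery.
def projectToRelationSchema(query: LogicalPlan, relationSchema: StructType): LogicalPlan =
  Project(
    relationSchema.map(field => new UnresolvedAttribute(Seq(field.name))),
    query)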

@SparkQA

SparkQA commented Jun 2, 2015

Test build #33981 has finished for PR 6583 at commit 8fbd93f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jun 2, 2015

LGTM

@SparkQA

SparkQA commented Jun 2, 2015

Test build #34006 has finished for PR 6583 at commit 1aafabd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SortOrder(child: Expression, direction: SortDirection) extends Expression
    • abstract class BinaryMathExpression(f: (Double, Double) => Double, name: String)

@yhuai
Contributor

yhuai commented Jun 2, 2015

I am merging it to master and branch 1.4.

asfgit pushed a commit that referenced this pull request Jun 2, 2015
…HadoopFsRelation with a save mode other than Append

The current code references the schema of the DataFrame to be written before checking save mode. This triggers expensive metadata discovery prematurely. For save mode other than `Append`, this metadata discovery is useless since we either ignore the result (for `Ignore` and `ErrorIfExists`) or delete existing files (for `Overwrite`) later.

This PR fixes this issue by deferring metadata discovery after save mode checking.

Author: Cheng Lian <[email protected]>

Closes #6583 from liancheng/spark-8014 and squashes the following commits:

1aafabd [Cheng Lian] Updates comments
088abaa [Cheng Lian] Avoids schema merging and partition discovery when data schema and partition schema are defined
8fbd93f [Cheng Lian] Fixes SPARK-8014

(cherry picked from commit 686a45f)
Signed-off-by: Yin Huai <[email protected]>
@asfgit asfgit closed this in 686a45f Jun 2, 2015
@liancheng liancheng deleted the spark-8014 branch June 2, 2015 23:48
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015