
[SPARK-8014] [SQL] Avoid premature metadata discovery when writing a HadoopFsRelation with a save mode other than Append #6583


Closed
liancheng wants to merge 3 commits into master from liancheng/spark-8014

Conversation

liancheng
Contributor

The current code references the schema of the DataFrame to be written before checking the save mode, which triggers expensive metadata discovery prematurely. For save modes other than Append, this metadata discovery is useless: we either ignore its result (for Ignore and ErrorIfExists) or delete the existing files anyway (for Overwrite).

This PR fixes the issue by deferring metadata discovery until after the save mode check.
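For illustration, here is a minimal sketch of the reordered control flow (hypothetical names such as doSave and writeToRelation; not the actual Spark code): the save mode is resolved against the existence of the output path first, and the relation's schema, which is what triggers metadata discovery, is only consulted once we know a write will actually happen.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical sketch only: decide what to do from the save mode *before*
// touching the relation's schema, which is what kicks off metadata discovery.
def save(df: DataFrame, mode: SaveMode, outputPath: Path, fs: FileSystem): Unit = {
  val pathExists = fs.exists(outputPath)

  val doSave = (mode, pathExists) match {
    case (SaveMode.ErrorIfExists, true) =>
      sys.error(s"path $outputPath already exists.")
    case (SaveMode.Ignore, true) =>
      false // nothing to write, so no discovery is needed at all
    case (SaveMode.Overwrite, true) =>
      fs.delete(outputPath, true) // existing files are discarded anyway
      true
    case _ =>
      true // Append, or the path does not exist yet
  }

  if (doSave) {
    // Only now is the relation resolved; for Append this is where the
    // (potentially expensive) metadata discovery legitimately happens.
    writeToRelation(df)
  }
}

// Hypothetical placeholder standing in for the real write path.
def writeToRelation(df: DataFrame): Unit = ()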

// the data.
val project =
  Project(
    r.schema.map(field => new UnresolvedAttribute(Seq(field.name))),
Contributor Author


This r.schema is where metadata discovery is triggered. This PR fixes this issue by moving this projection into InsertIntoHadoopFsRelation.
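In other words (a hedged sketch, not the actual patch; relationSchema and projectToRelationSchema are hypothetical names), the column-reordering projection is built lazily inside the insertion command, so the relation's schema is only consulted once the write is actually going to happen:

import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.types.StructType

// Hedged sketch: construct the projection that reorders the query's output to
// match the relation's schema only when the insertion command runs, instead of
// eagerly at resolution time, where it used to force metadata discovery.
def projectToRelationSchema(query: LogicalPlan, relationSchema: StructType): LogicalPlan =
  Project(
    relationSchema.map(field => new UnresolvedAttribute(Seq(field.name))),
    query)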

@SparkQA

SparkQA commented Jun 2, 2015

Test build #33981 has finished for PR 6583 at commit 8fbd93f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jun 2, 2015

LGTM

@SparkQA

SparkQA commented Jun 2, 2015

Test build #34006 has finished for PR 6583 at commit 1aafabd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SortOrder(child: Expression, direction: SortDirection) extends Expression
    • abstract class BinaryMathExpression(f: (Double, Double) => Double, name: String)

@yhuai
Contributor

yhuai commented Jun 2, 2015

I am merging it to master and branch 1.4.

asfgit pushed a commit that referenced this pull request Jun 2, 2015
…HadoopFsRelation with a save mode other than Append

The current code references the schema of the DataFrame to be written before checking save mode. This triggers expensive metadata discovery prematurely. For save mode other than `Append`, this metadata discovery is useless since we either ignore the result (for `Ignore` and `ErrorIfExists`) or delete existing files (for `Overwrite`) later.

This PR fixes this issue by deferring metadata discovery after save mode checking.

Author: Cheng Lian <[email protected]>

Closes #6583 from liancheng/spark-8014 and squashes the following commits:

1aafabd [Cheng Lian] Updates comments
088abaa [Cheng Lian] Avoids schema merging and partition discovery when data schema and partition schema are defined
8fbd93f [Cheng Lian] Fixes SPARK-8014

(cherry picked from commit 686a45f)
Signed-off-by: Yin Huai <[email protected]>
@asfgit asfgit closed this in 686a45f Jun 2, 2015
@liancheng liancheng deleted the spark-8014 branch June 2, 2015 23:48
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015