[SPARK-4553] [SPARK-5767] [SQL] Wires Parquet data source with the newly introduced write support for data source API #4563


Closed
wants to merge 3 commits

Conversation

liancheng
Contributor

This PR migrates the Parquet data source to the new data source write support API. Now users can also overwrite and append to existing tables. Note that inserting into partitioned tables is not supported yet.

When the Parquet data source is enabled, insertion into Hive Metastore Parquet tables is also fulfilled by the Parquet data source. This is done by the newly introduced HiveMetastoreCatalog.ParquetConversions rule, which is a "proper" implementation of the original hacky HiveStrategies.ParquetConversion. The latter is still preserved, and can be removed together with the old Parquet support in the future.
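For example, appending to an existing Parquet table through the new write path looks roughly like this (an illustrative sketch only: the paths are hypothetical and the exact `save` signature may differ in the final API):

```scala
import org.apache.spark.sql.{SQLContext, SaveMode}

// Illustrative sketch: append a DataFrame to an existing Parquet table
// through the new data source write support. Paths are hypothetical.
val sqlContext = new SQLContext(sc)
val df = sqlContext.parquetFile("/data/events/new-batch")

// SaveMode.Append and SaveMode.Overwrite are now honored by the Parquet
// data source instead of raising an error; SaveMode.ErrorIfExists keeps
// the old fail-fast behavior.
df.save("/data/events/table", "parquet", SaveMode.Append)
```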

TODO:

  • Update outdated comments in newParquet.scala.


@SparkQA

SparkQA commented Feb 12, 2015

Test build #27345 has started for PR 4563 at commit ae17ea8.

  • This patch merges cleanly.

@liancheng
Contributor Author

cc @marmbrus @rxin @yhuai

```diff
@@ -89,7 +89,7 @@ class DefaultSource
     val doSave = if (fs.exists(filesystemPath)) {
       mode match {
         case SaveMode.Append =>
-          sys.error(s"Append mode is not supported by ${this.getClass.getCanonicalName}")
+          true
```
Contributor Author

Enabling append.

@SparkQA

SparkQA commented Feb 12, 2015

Test build #27346 has started for PR 4563 at commit efcc8d2.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 12, 2015

Test build #27345 has finished for PR 4563 at commit ae17ea8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParquetRelation2(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27345/

@SparkQA

SparkQA commented Feb 12, 2015

Test build #27346 has finished for PR 4563 at commit efcc8d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParquetRelation2(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27346/

```diff
@@ -106,12 +106,12 @@ class DefaultSource
         ParquetRelation.createEmpty(
```
Contributor Author

Currently we are still using some utility functions like this one from the old Parquet support code. We can move them into the new data source in the future.

Rewires Parquet data source and the new data source write support

Temporary solution for moving Parquet conversion to analysis phase

Although it works, it's so ugly... I duplicated the whole Analyzer
in Hive Context. Have to fix this.

Cleaner solution for Metastore Parquet table conversion

Fixes compilation errors introduced during rebasing

Minor cleanups

Addresses @yhuai's comments
@liancheng
Contributor Author

Squashed all commits to ease rebasing.

@SparkQA

SparkQA commented Feb 15, 2015

Test build #27515 has started for PR 4563 at commit a83d290.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 15, 2015

Test build #27515 has finished for PR 4563 at commit a83d290.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParquetRelation2(

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27515/

@SparkQA

SparkQA commented Feb 15, 2015

Test build #27517 has started for PR 4563 at commit 2476e82.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 15, 2015

Test build #27517 has finished for PR 4563 at commit 2476e82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParquetRelation2(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27517/

@liancheng
Contributor Author

@yhuai @marmbrus All comments have been addressed; I had to squash all commits to ease rebasing. Should be good to go.

@SparkQA

SparkQA commented Feb 16, 2015

Test build #27544 has started for PR 4563 at commit fa98d27.

  • This patch merges cleanly.

@liancheng
Contributor Author

Many thanks to @yhuai, who helped get rid of a bug in those Parquet test suites which disable the Parquet data source and fall back to the old implementation!

@liancheng
Contributor Author

Removed the withSQLConf trick in test suites like ParquetQuerySuite. A simplified version of the problem can be shown as:

```scala
withSQLConf(SQLConf.PARQUET_USE_DATA_SOURCE_API -> "false") {
  test("some test") {
    ...
  }
}
```

The execution order of this snippet is:

  1. SQLConf.PARQUET_USE_DATA_SOURCE_API is set to false
  2. A ScalaTest test case "some test" is created
  3. SQLConf.PARQUET_USE_DATA_SOURCE_API is reverted (removed)
  4. ScalaTest starts executing test case "some test"
  5. "some test" executes with the default value of SQLConf.PARQUET_USE_DATA_SOURCE_API configuration, which is true

In the last commit, I removed the withSQLConf trick and fell back to beforeAll/afterAll. This introduced hundreds of lines of indentation changes in several test suites, which made this PR twice as large.
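The beforeAll/afterAll approach can be sketched as follows (a simplified illustration, not the exact code in this PR; the `TestSQLContext` conf accessors and the configuration key string are assumed):

```scala
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Sketch of the fix: the configuration is flipped in beforeAll, which runs
// when ScalaTest *executes* the suite, not when the test cases are registered.
// So, unlike withSQLConf at registration time, the setting is actually in
// effect while the tests run.
class OldParquetQuerySuite extends FunSuite with BeforeAndAfterAll {
  private val key = "spark.sql.parquet.useDataSourceApi" // assumed key name
  private var originalValue: Option[String] = None

  override protected def beforeAll(): Unit = {
    originalValue = Option(TestSQLContext.getConf(key, null))
    TestSQLContext.setConf(key, "false")
  }

  override protected def afterAll(): Unit = {
    // Restore the original setting after all tests in this suite finish.
    originalValue match {
      case Some(v) => TestSQLContext.setConf(key, v)
      case None    => TestSQLContext.unsetConf(key)
    }
  }

  test("some test") {
    // Runs with useDataSourceApi == false, exercising the old code path.
  }
}
```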

```scala
      PhysicalRDD(plan.output, sparkContext.emptyRDD[Row]) :: Nil
    } else {
      hiveContext
        .parquetFile(partitionLocations.head, partitionLocations.tail: _*)
```
Contributor Author

This is a bug fix. When no partition is selected, partitionLocations.head throws. In Spark 1.2, parquetFile accepts a single path argument; in that case, parquetFile throws an IllegalArgumentException since the path is empty. This exception is then explicitly caught below.
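The shape of the guard is roughly the following (a simplified sketch; identifiers follow the snippet quoted above, and the surrounding planning code is elided):

```scala
// Sketch of the guard: when partition pruning selects no partitions,
// short-circuit to an empty RDD instead of calling
// partitionLocations.head on an empty sequence, which would throw
// NoSuchElementException.
if (partitionLocations.isEmpty) {
  PhysicalRDD(plan.output, sparkContext.emptyRDD[Row]) :: Nil
} else {
  hiveContext
    .parquetFile(partitionLocations.head, partitionLocations.tail: _*)
}
```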

@SparkQA

SparkQA commented Feb 16, 2015

Test build #27544 has finished for PR 4563 at commit fa98d27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParquetRelation2(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27544/

@liancheng
Contributor Author

@yhuai @marmbrus Thanks for the review! I'm going to merge this, as it contains a bunch of fixes and enables the full power of the new Parquet data source :)

asfgit pushed a commit that referenced this pull request Feb 16, 2015
…wly introduced write support for data source API

This PR migrates the Parquet data source to the new data source write support API. Now users can also overwrite and append to existing tables. Note that inserting into partitioned tables is not supported yet.

When the Parquet data source is enabled, insertion into Hive Metastore Parquet tables is also fulfilled by the Parquet data source. This is done by the newly introduced `HiveMetastoreCatalog.ParquetConversions` rule, which is a "proper" implementation of the original hacky `HiveStrategies.ParquetConversion`. The latter is still preserved, and can be removed together with the old Parquet support in the future.

TODO:

- [x] Update outdated comments in `newParquet.scala`.


Author: Cheng Lian <[email protected]>

Closes #4563 from liancheng/parquet-refining and squashes the following commits:

fa98d27 [Cheng Lian] Fixes test cases which should disable off Parquet data source
2476e82 [Cheng Lian] Fixes compilation error introduced during rebasing
a83d290 [Cheng Lian] Passes Hive Metastore partitioning information to ParquetRelation2

(cherry picked from commit 3ce58cf)
Signed-off-by: Cheng Lian <[email protected]>
@asfgit asfgit closed this in 3ce58cf Feb 16, 2015