[SPARK-4553] [SPARK-5767] [SQL] Wires Parquet data source with the newly introduced write support for data source API #4563


Closed
wants to merge 3 commits

Conversation

liancheng
Contributor

This PR migrates the Parquet data source to the new data source write support API. Now users can also overwrite and append to existing tables. Note that inserting into partitioned tables is not supported yet.

When the Parquet data source is enabled, insertion into Hive Metastore Parquet tables is also fulfilled by the Parquet data source. This is done by the newly introduced HiveMetastoreCatalog.ParquetConversions rule, which is a "proper" implementation of the original hacky HiveStrategies.ParquetConversion. The latter is still preserved, and can be removed together with the old Parquet support in the future.
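For example, appending to an existing Parquet table through the new write path looks roughly like this (an illustrative sketch only: the paths are hypothetical and the exact `save` signature may differ in the final API):

```scala
import org.apache.spark.sql.{SQLContext, SaveMode}

// Illustrative sketch: append a DataFrame to an existing Parquet table
// through the new data source write support. Paths are hypothetical.
val sqlContext = new SQLContext(sc)
val df = sqlContext.parquetFile("/data/events/new-batch")

// SaveMode.Append and SaveMode.Overwrite are now honored by the Parquet
// data source instead of raising an error; SaveMode.ErrorIfExists keeps
// the old fail-fast behavior.
df.save("/data/events/table", "parquet", SaveMode.Append)
```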

TODO:

  • Update outdated comments in newParquet.scala.


@SparkQA

SparkQA commented Feb 12, 2015

Test build #27345 has started for PR 4563 at commit ae17ea8.

  • This patch merges cleanly.

@liancheng
Contributor Author

cc @marmbrus @rxin @yhuai

```diff
@@ -89,7 +89,7 @@ class DefaultSource
     val doSave = if (fs.exists(filesystemPath)) {
       mode match {
         case SaveMode.Append =>
-          sys.error(s"Append mode is not supported by ${this.getClass.getCanonicalName}")
+          true
```
Contributor Author

Enabling append.

@SparkQA

SparkQA commented Feb 12, 2015

Test build #27346 has started for PR 4563 at commit efcc8d2.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 12, 2015

Test build #27345 has finished for PR 4563 at commit ae17ea8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParquetRelation2(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27345/

@SparkQA

SparkQA commented Feb 12, 2015

Test build #27346 has finished for PR 4563 at commit efcc8d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParquetRelation2(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27346/

```diff
@@ -106,12 +106,12 @@ class DefaultSource
         ParquetRelation.createEmpty(
```
Contributor Author

Currently we are still using some utility functions like this one from the old Parquet support code. We can move them into the new data source in the future.

Rewires Parquet data source and the new data source write support

Temporary solution for moving Parquet conversion to analysis phase

Although it works, it's so ugly... I duplicated the whole Analyzer
in Hive Context. Have to fix this.

Cleaner solution for Metastore Parquet table conversion

Fixes compilation errors introduced during rebasing

Minor cleanups

Addresses @yhuai's comments
@liancheng
Contributor Author

Squashed all commits to ease rebasing.

@SparkQA

SparkQA commented Feb 15, 2015

Test build #27515 has started for PR 4563 at commit a83d290.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 15, 2015

Test build #27515 has finished for PR 4563 at commit a83d290.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParquetRelation2(

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27515/

@SparkQA

SparkQA commented Feb 15, 2015

Test build #27517 has started for PR 4563 at commit 2476e82.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 15, 2015

Test build #27517 has finished for PR 4563 at commit 2476e82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParquetRelation2(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27517/

@liancheng
Contributor Author

@yhuai @marmbrus All comments have been addressed; I had to squash all commits to ease rebasing. Should be good to go.

@SparkQA

SparkQA commented Feb 16, 2015

Test build #27544 has started for PR 4563 at commit fa98d27.

  • This patch merges cleanly.

@liancheng
Contributor Author

Many thanks to @yhuai, who helped get rid of a bug in those Parquet test suites which disable the Parquet data source and fall back to the old implementation!

@liancheng
Contributor Author

Removed the withSQLConf trick in test suites like ParquetQuerySuite. A simplified version of the problem can be shown as:

```scala
withSQLConf(SQLConf.PARQUET_USE_DATA_SOURCE_API -> "false") {
  test("some test") {
    ...
  }
}
```

The execution order of this snippet is:

  1. SQLConf.PARQUET_USE_DATA_SOURCE_API is set to false
  2. A ScalaTest test case "some test" is created
  3. SQLConf.PARQUET_USE_DATA_SOURCE_API is reverted (removed)
  4. ScalaTest starts executing test case "some test"
  5. "some test" executes with the default value of SQLConf.PARQUET_USE_DATA_SOURCE_API configuration, which is true

In the last commit, I removed the withSQLConf trick and fell back to beforeAll/afterAll. This introduced hundreds of lines of indentation changes in several test suites, which made this PR twice as large.
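The beforeAll/afterAll approach can be sketched as follows (a simplified illustration, not the exact code in this PR; the `TestSQLContext` conf accessors and the configuration key string are assumed):

```scala
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Sketch of the fix: the configuration is flipped in beforeAll, which runs
// when ScalaTest *executes* the suite, not when the test cases are registered.
// So, unlike withSQLConf at registration time, the setting is actually in
// effect while the tests run.
class OldParquetQuerySuite extends FunSuite with BeforeAndAfterAll {
  private val key = "spark.sql.parquet.useDataSourceApi" // assumed key name
  private var originalValue: Option[String] = None

  override protected def beforeAll(): Unit = {
    originalValue = Option(TestSQLContext.getConf(key, null))
    TestSQLContext.setConf(key, "false")
  }

  override protected def afterAll(): Unit = {
    // Restore the original setting after all tests in this suite finish.
    originalValue match {
      case Some(v) => TestSQLContext.setConf(key, v)
      case None    => TestSQLContext.unsetConf(key)
    }
  }

  test("some test") {
    // Runs with useDataSourceApi == false, exercising the old code path.
  }
}
```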

```scala
      PhysicalRDD(plan.output, sparkContext.emptyRDD[Row]) :: Nil
    } else {
      hiveContext
        .parquetFile(partitionLocations.head, partitionLocations.tail: _*)
```
Contributor Author

This is a bug fix. When no partition is selected, partitionLocations.head throws. In Spark 1.2, parquetFile accepts a single path argument; in that case, parquetFile throws an IllegalArgumentException since the path is empty. This exception is then explicitly caught below.
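The shape of the guard is roughly the following (a simplified sketch; identifiers follow the snippet quoted above, and the surrounding planning code is elided):

```scala
// Sketch of the guard: when partition pruning selects no partitions,
// short-circuit to an empty RDD instead of calling
// partitionLocations.head on an empty sequence, which would throw
// NoSuchElementException.
if (partitionLocations.isEmpty) {
  PhysicalRDD(plan.output, sparkContext.emptyRDD[Row]) :: Nil
} else {
  hiveContext
    .parquetFile(partitionLocations.head, partitionLocations.tail: _*)
}
```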

@SparkQA

SparkQA commented Feb 16, 2015

Test build #27544 has finished for PR 4563 at commit fa98d27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParquetRelation2(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27544/

@liancheng
Contributor Author

@yhuai @marmbrus Thanks for the review! I'm going to merge this, as it contains a bunch of fixes and enables the full power of the new Parquet data source :)

asfgit pushed a commit that referenced this pull request Feb 16, 2015
…wly introduced write support for data source API

This PR migrates the Parquet data source to the new data source write support API. Now users can also overwrite and append to existing tables. Note that inserting into partitioned tables is not supported yet.

When the Parquet data source is enabled, insertion into Hive Metastore Parquet tables is also fulfilled by the Parquet data source. This is done by the newly introduced `HiveMetastoreCatalog.ParquetConversions` rule, which is a "proper" implementation of the original hacky `HiveStrategies.ParquetConversion`. The latter is still preserved, and can be removed together with the old Parquet support in the future.

TODO:

- [x] Update outdated comments in `newParquet.scala`.


Author: Cheng Lian <[email protected]>

Closes #4563 from liancheng/parquet-refining and squashes the following commits:

fa98d27 [Cheng Lian] Fixes test cases which should disable off Parquet data source
2476e82 [Cheng Lian] Fixes compilation error introduced during rebasing
a83d290 [Cheng Lian] Passes Hive Metastore partitioning information to ParquetRelation2

(cherry picked from commit 3ce58cf)
Signed-off-by: Cheng Lian <[email protected]>
@asfgit asfgit closed this in 3ce58cf Feb 16, 2015