[SPARK-4553] [SPARK-5767] [SQL] Wires Parquet data source with the newly introduced write support for data source API #4563
Conversation
Test build #27345 has started for PR 4563 at commit
```diff
@@ -89,7 +89,7 @@ class DefaultSource
     val doSave = if (fs.exists(filesystemPath)) {
       mode match {
         case SaveMode.Append =>
-          sys.error(s"Append mode is not supported by ${this.getClass.getCanonicalName}")
+          true
```
Enabling append.
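For context, a minimal sketch of how the surrounding logic plausibly handles the four `SaveMode`s after this change (the non-`Append` branches are assumptions based on the data source API contract, not the exact committed code; `fs`, `filesystemPath`, and `mode` come from the method shown in the diff):

```scala
import org.apache.spark.sql.SaveMode

// Sketch: decide whether to perform the write, given the save mode and
// whether the destination path already exists.
val doSave = if (fs.exists(filesystemPath)) {
  mode match {
    case SaveMode.Append =>
      true                                   // append to existing data
    case SaveMode.Overwrite =>
      fs.delete(filesystemPath, true)        // drop the old files first
      true
    case SaveMode.ErrorIfExists =>
      sys.error(s"path $filesystemPath already exists.")
    case SaveMode.Ignore =>
      false                                  // leave existing data untouched
  }
} else {
  true                                       // nothing there yet: just write
}
```

With this change, saving with `SaveMode.Append` appends to the existing files instead of failing.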
Test build #27346 has started for PR 4563 at commit
Test build #27345 has finished for PR 4563 at commit
Test PASSed.
Test build #27346 has finished for PR 4563 at commit
Test PASSed.
```diff
@@ -106,12 +106,12 @@ class DefaultSource
     ParquetRelation.createEmpty(
```
Currently we are still using some utility functions like this one from the old Parquet support code. We can move them into the new data source in the future.
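For illustration, here is roughly how such a utility gets called (a sketch only: the parameter list is my assumption based on the 1.2-era Parquet support code, not a verified signature):

```scala
// Hypothetical sketch: create an empty Parquet table (schema/metadata only)
// at the destination, so a subsequent insert has something to write into.
val relation = ParquetRelation.createEmpty(
  pathString,                          // destination directory
  schema.toAttributes,                 // schema as Catalyst attributes
  false,                               // allowExisting: fail if already present
  sqlContext.sparkContext.hadoopConfiguration,
  sqlContext)
```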
Rewires Parquet data source and the new data source write support
Temporary solution for moving Parquet conversion to analysis phase
Although it works, it's so ugly... I duplicated the whole Analyzer in Hive Context. Have to fix this.
Cleaner solution for Metastore Parquet table conversion
Fixes compilation errors introduced during rebasing
Minor cleanups
Addresses @yhuai's comments
Force-pushed from efcc8d2 to a83d290.
Squashed all commits to ease rebasing.
Test build #27515 has started for PR 4563 at commit
Test build #27515 has finished for PR 4563 at commit
Test FAILed.
Test build #27517 has started for PR 4563 at commit
Test build #27517 has finished for PR 4563 at commit
Test PASSed.
Test build #27544 has started for PR 4563 at commit
Many thanks to @yhuai, who helped get rid of a bug in those Parquet test suites which disable the Parquet data source and fall back to the old implementation!
In the last commit, I removed the following pattern, which wraps whole test cases in `withSQLConf`:

```scala
withSQLConf(SQLConf.PARQUET_USE_DATA_SOURCE_API -> "false") {
  test("some test") {
    ...
  }
}
```

The execution order of this snippet is:

1. `withSQLConf` sets `PARQUET_USE_DATA_SOURCE_API` to `false`
2. `test` only registers the test body with ScalaTest
3. `withSQLConf` restores the original configuration
4. ScalaTest runs the registered test body later, after the configuration has already been restored

So the test case never actually runs with the Parquet data source disabled.
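One way to fix it, shown as a minimal sketch assuming the suite mixes in the same `withSQLConf` helper: move the override inside the test body, so it is active when the body actually executes:

```scala
test("some test") {
  // The config override now happens at execution time rather than at
  // registration time, so this body really runs with the data source disabled.
  withSQLConf(SQLConf.PARQUET_USE_DATA_SOURCE_API -> "false") {
    // ... assertions that exercise the old Parquet implementation ...
  }
}
```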
```scala
PhysicalRDD(plan.output, sparkContext.emptyRDD[Row]) :: Nil
} else {
  hiveContext
    .parquetFile(partitionLocations.head, partitionLocations.tail: _*)
```
This is a bug fix. When no partition is selected, `partitionLocations.head` throws. In Spark 1.2, `parquetFile` accepts a single `path` argument; in this case, `parquetFile` throws an `IllegalArgumentException` since `path` is empty. This exception is then explicitly caught below.
Test build #27544 has finished for PR 4563 at commit
Test PASSed.
[SPARK-4553] [SPARK-5767] [SQL] Wires Parquet data source with the newly introduced write support for data source API

This PR migrates the Parquet data source to the new data source write support API. Now users can also overwrite and append to existing tables. Notice that inserting into partitioned tables is not supported yet.

When the Parquet data source is enabled, insertion into Hive Metastore Parquet tables is also fulfilled by the Parquet data source. This is done by the newly introduced `HiveMetastoreCatalog.ParquetConversions` rule, which is a "proper" implementation of the original hacky `HiveStrategies.ParquetConversion`. The latter is still preserved, and can be removed together with the old Parquet support in the future.

TODO:

- [x] Update outdated comments in `newParquet.scala`.

Author: Cheng Lian <[email protected]>

Closes #4563 from liancheng/parquet-refining and squashes the following commits:

fa98d27 [Cheng Lian] Fixes test cases which should disable off Parquet data source
2476e82 [Cheng Lian] Fixes compilation error introduced during rebasing
a83d290 [Cheng Lian] Passes Hive Metastore partitioning information to ParquetRelation2

(cherry picked from commit 3ce58cf)
Signed-off-by: Cheng Lian <[email protected]>
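To illustrate the shape of such an analysis-phase rule, here is a hypothetical sketch (`isParquetTable` and `convertToParquetRelation` are assumed helper names, not the actual `HiveMetastoreCatalog` implementation):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch: during analysis, substitute Metastore Parquet relations with the
// data-source-based Parquet relation so reads and writes go through the
// new data source API instead of the Hive SerDe path.
object ParquetConversions extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // MetastoreRelation is assumed in scope (org.apache.spark.sql.hive)
    case relation: MetastoreRelation if isParquetTable(relation) =>
      convertToParquetRelation(relation)  // assumed helper returning a LogicalPlan
  }
}
```

With such a rule in place, an `INSERT` into a Metastore Parquet table is planned against the data-source-based Parquet relation rather than the old Hive write path, which is what enables the overwrite/append semantics described above.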