[SPARK-14070] [SQL] Use ORC data source for SQL queries on ORC tables #11891
Conversation
ok to test
Test build #53806 has finished for PR 11891 at commit
This is pretty cool -- the Hive path is ridiculously slow. BTW, I tried comparing Parquet vs ORC based on the current Spark master branch. I generated 100 million rows with one double column and one string column:
then read back just the string column:
Parquet with gzip compression takes ~12 secs. Parquet with snappy compression takes ~7 secs. ORC takes ~24 secs. We can definitely optimize the current ORC implementation more too.
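The actual generation and read snippets were not captured in this scrape. A hypothetical reconstruction of the comparison (the paths, column names, and timing helper below are all made up, and a Spark `sqlContext` of that era is assumed) might look like:

```scala
import org.apache.spark.sql.functions._

// Generate 100 million rows with one double column and one string column.
val df = sqlContext.range(100000000L)
  .select(rand().as("d"), concat(lit("str_"), col("id").cast("string")).as("s"))

df.write.format("parquet").save("/tmp/bench_parquet")
df.write.format("orc").save("/tmp/bench_orc")

// Crude timing helper.
def time(f: => Unit): Unit = {
  val start = System.nanoTime()
  f
  println(s"took ${(System.nanoTime() - start) / 1e9} s")
}

// Read back just the string column from each format.
time { sqlContext.read.format("parquet").load("/tmp/bench_parquet").select("s").count() }
time { sqlContext.read.format("orc").load("/tmp/bench_orc").select("s").count() }
```

This requires a running Spark shell with Hive support; it is a sketch of the methodology described above, not the code actually run.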
@marmbrus: I looked at the build failures trying to figure out the cause: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53806/ I need help with running just one of the tests that failed (preferably in IntelliJ). Do you know how to do that?
@rxin: Can you point me to specific features/changes in Parquet which are not in ORC? I am happy to work on adding them to ORC.
I've never gotten test running to work in IntelliJ, unfortunately. You can use SBT as follows:
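The SBT invocation itself appears to have been dropped from this comment. Based on the commands used later in this thread, it presumably looked something like the following (the suite name is only an illustration):

```shell
# Run a single test suite through SBT; the -Phive profile is needed
# for suites living in the hive module.
build/sbt -Phive "hive/test-only *HiveCompatibilitySuite"
```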
For the first failure, this looks like the cause:
It seems we can't build SQL for ORC queries (used in view canonicalization) anymore. @liancheng might have some suggestions.
@@ -597,6 +619,107 @@ private[hive] class HiveMetastoreCatalog(val client: HiveClient, hive: HiveConte
  }
}

private def convertToOrcRelation(metastoreRelation: MetastoreRelation): LogicalRelation = {
How much of this is duplicated with `convertToParquetRelation`? It would be awesome if we could just make this `ConvertToHadoopFSRelation` and handle all the cases (or just ORC/Parquet through a common path). I'd have to look closer, but I think it's likely that the Parquet-specific things (like `parquetOption` with the merged schema) are actually not required anymore.
The fact that this code path is Parquet-specific predates a common interface for reading files in Spark SQL.
Agree. It's probably OK to tolerate this duplication for this PR. I'd like to revisit this code path after finishing the ORC read path migration.
I have done a refactor to have a single method named `convertToLogicalRelation` instead of separate ones for Parquet and ORC.
Test build #53968 has finished for PR 11891 at commit
Test build #53969 has finished for PR 11891 at commit
@liancheng + @marmbrus: Thanks for your comments. I have made the suggested changes except the one related to the test case, which I am not sure how to do. Also, the PR failed to build on Jenkins but it builds fine on my end when I do:
I looked at the Jenkins log and I guess that incremental compilation makes it not see the new class.
Test build #53983 has finished for PR 11891 at commit
@tejasapatil When testing a PR, Jenkins always tries to merge it with the most recent master (or whichever branch the PR is opened against) first. I think the reason for the build failure is that we just merged PR #11836, which probably doesn't conflict with yours from Git's point of view but actually causes a compilation failure. Would you please try to rebase to the most recent master and retry?
@tejasapatil Just saw your previous comment about debugging in IntelliJ. When I have to resort to the debugger, I usually run tests using SBT and debug with IntelliJ's remote debugging feature. This is probably useful for you. Under SBT, you may do this:
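The exact commands seem to have been lost from this comment. A common recipe (a sketch, assuming the `build/sbt` launcher honors `SBT_OPTS`; the port number and suite name are arbitrary) is to start the SBT JVM with the standard JDWP debug agent and attach IntelliJ as a remote debugger:

```shell
# Launch SBT with the JVM debug agent listening on port 5005, then attach
# IntelliJ via a "Remote" run configuration pointing at localhost:5005.
SBT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
  build/sbt -Phive "hive/test-only *HiveCompatibilitySuite"
```

With `suspend=y`, the JVM waits for the debugger to attach before the tests start, so breakpoints set in IntelliJ are hit reliably.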
@liancheng: I did a rebase and that fixed the problem. Also, thanks for the pointer on debugging the tests.
ok to test
retest this please
add to whitelist
Test build #54010 has finished for PR 11891 at commit
Ah, unfortunately we just reverted #11836 because it caused some regressions. This caused another compilation error for your PR... Sorry for the trouble...
Test build #54071 has finished for PR 11891 at commit
Test build #54092 has finished for PR 11891 at commit
Test build #54097 has finished for PR 11891 at commit
- Merged the conversion methods for Parquet and ORC into one single method `convertToLogicalRelation`, as suggested by @marmbrus
- Addressed @liancheng's comment about checking for the bucket spec
- 4 tests were failing because of the change. My change alters the plan for the queries, e.g. this query from `date_serde.q`:

```
select * from date_serde_orc
```

Plan (BEFORE):

```
Project [c1#282,c2#283]
+- MetastoreRelation default, date_serde_orc, None
```

Plan (AFTER):

```
Project [c1#287,c2#288]
+- SubqueryAlias date_serde_orc
   +- Relation[c1#287,c2#288] HadoopFiles
```

Setting `CONVERT_METASTORE_ORC` to `false` by default to mitigate the test failures. The other option was to make `SQLBuilder` work with `Relation`, but that is out of scope for the current PR. In my opinion, it would be better to have this config turned on by default so that anyone trying out Spark out of the box gets better perf. without needing to tweak such configs.

Open items:
- @liancheng's review comment: the test case added does not verify that the new code path is hit
… anyone downloading Spark and trying it out will get the best performance without needing to know about this feature and tweak this config.

For the tests which had failed, made `HiveCompatibilitySuite` set that config to `false` so that they still produce the plan the old way.

Ran `build/sbt scalastyle catalyst/compile sql/compile hive/compile`
Ran `build/sbt -Phive hive/test-only *HiveCompatibilitySuite`
ok to test
Test build #54267 has finished for PR 11891 at commit
@liancheng: I have made all requested changes as per the review and also rebased. Can you please take a look?
ping @liancheng |
Thanks, merging to master. |
What changes were proposed in this pull request?
This patch enables use of `OrcRelation` for SQL queries which read data from Hive tables. Changes in this patch:
- `OrcConversions`, which would alter the plan to use `OrcRelation`. In this diff, the conversion is done only for reads.
- `spark.sql.hive.convertMetastoreOrc` to control the conversion

BEFORE

AFTER
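As a usage sketch (a hypothetical session, not code from the PR; `sqlContext` is assumed to be a `HiveContext` and `some_orc_table` is a made-up Hive ORC table), the new config can be toggled at runtime:

```scala
// Hypothetical usage sketch: toggle the new ORC conversion at runtime.
sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "true")

// With the flag on, reads of Hive ORC tables should go through the ORC data
// source (OrcRelation) rather than the generic Hive SerDe path, which can be
// checked by inspecting the query plan.
sqlContext.sql("SELECT * FROM some_orc_table").explain()
```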
How was this patch tested?
Performance gains
Ran on a production table at Facebook (note that the data was in the DWRF file format, which is similar to ORC).
Best case: when there were no rows matching the predicate in the query (everything is filtered out).
Average case: a subset of rows in the data match the query predicate.