[SPARK-14070] [SQL] Use ORC data source for SQL queries on ORC tables #11891

Closed
wants to merge 8 commits

Conversation

tejasapatil
Contributor

What changes were proposed in this pull request?

This patch enables use of OrcRelation for SQL queries that read data from ORC-backed Hive tables. Changes in this patch:

  • Added a new rule, OrcConversions, which alters the plan to use OrcRelation. In this diff, the conversion is done only for reads.
  • Added a new config, spark.sql.hive.convertMetastoreOrc, to control the conversion.
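
A minimal sketch of toggling the new flag from the shell (assuming a HiveContext bound to `hqlContext`, as in the plans below; this snippet is illustrative and not part of the patch):

```
// Illustrative only: switch the metastore ORC conversion on/off and compare plans.
hqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "true")   // use the ORC data source path
hqlContext.sql("SELECT * FROM orc_table").explain(true)

hqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")  // fall back to HiveTableScan
hqlContext.sql("SELECT * FROM orc_table").explain(true)
```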

BEFORE

scala>  hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None

== Analyzed Logical Plan ==
key: string, value: string
Project [key#171,value#172]
+- MetastoreRelation default, orc_table, None

== Optimized Logical Plan ==
MetastoreRelation default, orc_table, None

== Physical Plan ==
HiveTableScan [key#171,value#172], MetastoreRelation default, orc_table, None

AFTER

scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None

== Analyzed Logical Plan ==
key: string, value: string
Project [key#76,value#77]
+- SubqueryAlias orc_table
   +- Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>

== Optimized Logical Plan ==
Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>

== Physical Plan ==
WholeStageCodegen
:  +- Scan ORC part: struct<>, data: struct<key:string,value:string>[key#76,value#77] InputPaths: file:/user/hive/warehouse/orc_table

How was this patch tested?

  • Added a new unit test and ran the existing unit tests
  • Ran with production-like data

Performance gains

Ran on a production table at Facebook (note that the data was in the DWRF file format, which is similar to ORC).

Best case: no rows match the predicate in the query (everything is filtered out)

|                    | CPU time    | Wall time | Total wall time across all tasks |
|--------------------|-------------|-----------|----------------------------------|
| Without the change | 541,515 sec | 25.0 mins | 165.8 hours                      |
| With the change    | 407 sec     | 1.5 mins  | 15 mins                          |

Average case: a subset of rows in the data match the query predicate

|                    | CPU time    | Wall time | Total wall time across all tasks |
|--------------------|-------------|-----------|----------------------------------|
| Without the change | 624,630 sec | 31.0 mins | 199.0 hours                      |
| With the change    | 14,769 sec  | 5.3 mins  | 7.7 hours                        |

@tejasapatil tejasapatil changed the title Use ORC data source for SQL queries on ORC tables [SPARK-14070] [SQL] Use ORC data source for SQL queries on ORC tables Mar 22, 2016
@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53806 has finished for PR 11891 at commit 0938c08.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 22, 2016

This is pretty cool -- the Hive path is ridiculously slow. BTW, I tried comparing Parquet vs. ORC on the current Spark master branch.

I generated 100 million rows with one double column and one string column:

import org.apache.spark.sql.functions.rand  // for the rand() column expression

sqlContext.range(100 * 1000 * 1000)
  .select(rand().as("numeric"), rand().cast("string").as("string"))
  .write.parquet("testdata/random.parquet")

then read back just the string column:

def measureParquet(): Unit = {
  val start = System.nanoTime
  sqlContext.read.parquet("testdata/random.parquet").selectExpr("count(string)").show()
  val end = System.nanoTime
  print((end - start) / 1000 / 1000)  // elapsed time in milliseconds
}
measureParquet()

Parquet with gzip compression takes ~12 secs. Parquet with snappy compression takes ~7 secs. ORC takes ~24 secs. We can definitely optimize the current ORC implementation more too.
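
For reference, a sketch of the analogous ORC measurement (an assumption based on the Parquet snippet above, not something posted in this thread; `.orc` on the reader/writer requires a Hive-enabled build):

```
import org.apache.spark.sql.functions.rand

// Hypothetical ORC analogue of the Parquet benchmark above.
sqlContext.range(100 * 1000 * 1000)
  .select(rand().as("numeric"), rand().cast("string").as("string"))
  .write.orc("testdata/random.orc")

def measureOrc(): Unit = {
  val start = System.nanoTime
  sqlContext.read.orc("testdata/random.orc").selectExpr("count(string)").show()
  val end = System.nanoTime
  print((end - start) / 1000 / 1000)  // elapsed time in milliseconds
}
measureOrc()
```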

@tejasapatil
Contributor Author

@marmbrus : I looked at the build failures to try to figure out the cause: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53806/

I need help running just one of the tests that failed (preferably in IntelliJ). Do you know how to do that?

@tejasapatil
Contributor Author

@rxin : Can you point me to specific features/changes in Parquet that are not in ORC? I am happy to work on adding them to ORC.

@marmbrus
Contributor

I've never gotten test running to work in IntelliJ, unfortunately. You can use SBT as follows:

spark/$ sbt/sbt -Phive
> hive/test-only *HiveCompatibilitySuite -- -z date_serde
....

For the first failure, this looks like the cause:

[info]   Caused by: java.lang.UnsupportedOperationException: unsupported plan Relation[c1#282,c2#283] HadoopFiles
[info]   
[info]          at org.apache.spark.sql.hive.SQLBuilder.org$apache$spark$sql$hive$SQLBuilder$$toSQL(SQLBuilder.scala:191)
[info]          at org.apache.spark.sql.hive.SQLBuilder.org$apache$spark$sql$hive$SQLBuilder$$toSQL(SQLBuilder.scala:149)
[info]          at org.apache.spark.sql.hive.SQLBuilder.projectToSQL(SQLBuilder.scala:208)
[info]          at org.apache.spark.sql.hive.SQLBuilder.org$apache$spark$sql$hive$SQLBuilder$$toSQL(SQLBuilder.scala:111)
[info]          at org.apache.spark.sql.hive.SQLBuilder.org$apache$spark$sql$hive$SQLBuilder$$toSQL(SQLBuilder.scala:149)
[info]          at org.apache.spark.sql.hive.SQLBuilder.projectToSQL(SQLBuilder.scala:208)
[info]          at org.apache.spark.sql.hive.SQLBuilder.org$apache$spark$sql$hive$SQLBuilder$$toSQL(SQLBuilder.scala:111)
[info]          at org.apache.spark.sql.hive.SQLBuilder.toSQL(SQLBuilder.scala:81)
[info]          at org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1$$anonfun$36.liftedTree1$1(HiveComparisonTest.scala:398)
[info]          ... 57 more
[info]   

It seems we can't build SQL for queries over ORC relations (used in view canonicalization) anymore. @liancheng might have some suggestions.

@@ -597,6 +619,107 @@ private[hive] class HiveMetastoreCatalog(val client: HiveClient, hive: HiveConte
}
}

private def convertToOrcRelation(metastoreRelation: MetastoreRelation): LogicalRelation = {
Contributor

How much of this is duplicated with convertToParquetRelation? It would be awesome if we could just make this ConvertToHadoopFSRelation and handle all the cases (or just ORC/Parquet through a common path). I'd have to look closer, but I think it's likely that the Parquet-specific things (like parquetOption with the merged schema) are actually not required anymore.

The fact that this code path is Parquet-specific predates the common interface for reading files in Spark SQL.

Contributor

Agreed. It's probably OK to tolerate this duplication for this PR. I'd like to revisit this code path after finishing the migration of the ORC read path.

Contributor Author

I have refactored this into a single method named convertToLogicalRelation instead of separate ones for Parquet and ORC.
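
Purely to illustrate the shape of that refactoring (every type below is a hypothetical stand-in, not Spark's internal API or the actual code in this PR):

```
// Illustrative only: one conversion routine parameterized by the file format,
// instead of separate convertToParquetRelation / convertToOrcRelation copies.
trait FileFormatDesc { def name: String; def options: Map[String, String] }
case object ParquetDesc extends FileFormatDesc {
  val name = "parquet"; val options = Map("mergeSchema" -> "false")
}
case object OrcDesc extends FileFormatDesc {
  val name = "orc"; val options = Map.empty[String, String]
}

case class MetastoreTableStub(db: String, table: String, paths: Seq[String])
case class LogicalRelationStub(format: String, options: Map[String, String], paths: Seq[String])

def convertToLogicalRelation(rel: MetastoreTableStub, format: FileFormatDesc): LogicalRelationStub =
  // The real method would also resolve the schema, partitioning, and relation caching;
  // the point here is only that the format becomes a parameter.
  LogicalRelationStub(format.name, format.options, rel.paths)
```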

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53968 has finished for PR 11891 at commit c71bb6c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53969 has finished for PR 11891 at commit c15a39f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

@liancheng + @marmbrus : Thanks for your comments. I have made the suggested changes except the one related to the test case, which I am not sure how to do.

Also, the PR failed to build on Jenkins but it builds fine on my end when I do:

build/sbt scalastyle catalyst/compile sql/compile hive/compile
build/sbt -Pyarn -Phadoop-2.3 -Phive package assembly/assembly

I looked at the Jenkins log, and I guess that incremental compilation makes it not see the new class OrcConversions added in HiveSessionCatalog. Is there a way to avoid that?

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53983 has finished for PR 11891 at commit c4111fa.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

@tejasapatil When testing a PR, Jenkins always tries to merge it with the most recent master (or whichever branch the PR is opened against) first. I think the reason for the build failure is that we just merged PR #11836, which probably doesn't conflict with yours from Git's point of view but actually causes a compilation failure. Would you please try to rebase onto the most recent master and retry?

@liancheng
Contributor

@tejasapatil Just saw your previous comment about debugging in IntelliJ. When I have to resort to the debugger, I usually run tests using SBT and debug with IntelliJ's remote debugging feature. This might be useful for you. Under SBT, you can do this:

> project hive // or switch to any other project the target test case belongs to
> set javaOptions in Test += <remote debugging options>
> test-only ...
...
> session clear // this clears the previously added options and disables remote debugging
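
For concreteness, a typical value for that placeholder (standard JDWP syntax, given here as an illustration rather than anything from this thread) would be:

> set javaOptions in Test += "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"

IntelliJ's remote-debug run configuration can then attach to port 5005.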

@tejasapatil
Contributor Author

@liancheng : I did a rebase and that fixed the problem. Also, thanks for the pointer for debugging the tests.

@tejasapatil
Contributor Author

ok to test

@liancheng
Contributor

retest this please

@liancheng
Contributor

add to whitelist

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54010 has finished for PR 11891 at commit c4111fa.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

Ah, unfortunately we just reverted #11836 because it caused some regressions. This caused another compilation error for your PR... Sorry for the trouble...

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54071 has finished for PR 11891 at commit 0a03a78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54092 has finished for PR 11891 at commit 4b459a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54097 has finished for PR 11891 at commit 4f0fe06.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- Merged the conversion methods for Parquet and ORC into a single method `convertToLogicalRelation`, as suggested by @marmbrus
- Addressed @liancheng's comment about checking for the bucket spec
- 4 tests were failing because of the change, since it alters the plan for the queries.

e.g., query from `date_serde.q`:

```
select * from date_serde_orc
```

Plan (BEFORE)
```
Project [c1#282,c2#283]
+- MetastoreRelation default, date_serde_orc, None
```

Plan (AFTER)
```
Project [c1#287,c2#288]
+- SubqueryAlias date_serde_orc
   +- Relation[c1#287,c2#288] HadoopFiles
```

Setting `CONVERT_METASTORE_ORC` to `false` by default to mitigate test failures. The other option was to make `SQLBuilder` work with `Relation`, but that is out of scope for the current PR. In my opinion, it would be better to have this config turned on by default so that anyone trying out Spark out of the box gets better performance without needing to tweak such configs.

Open items:
- @liancheng's review comment: the test case added does not verify that the new code path is hit
… anyone downloading Spark and trying it out will get the best performance without needing to know about this feature and tweak this config.

For the tests that had failed, made `HiveCompatibilitySuite` set that config to `false` so that they still produce the plan the old way (a rough sketch of this kind of override is shown below).

Ran `build/sbt scalastyle catalyst/compile sql/compile hive/compile`
Ran `build/sbt -Phive hive/test-only *HiveCompatibilitySuite`
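
A rough sketch of the kind of override mentioned above (hypothetical code, not the PR's actual diff; it assumes the `TestHive` singleton and ScalaTest's `BeforeAndAfterAll`):

```
import org.apache.spark.sql.hive.test.TestHive
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Hypothetical sketch: pin spark.sql.hive.convertMetastoreOrc to false for a
// suite whose expected plans assume the old HiveTableScan path, and restore
// the original value afterwards.
class OrcConversionDisabledSuiteSketch extends FunSuite with BeforeAndAfterAll {
  private val confKey = "spark.sql.hive.convertMetastoreOrc"
  private var originalValue: String = _

  override protected def beforeAll(): Unit = {
    super.beforeAll()
    originalValue = TestHive.getConf(confKey, "true")
    TestHive.setConf(confKey, "false")
  }

  override protected def afterAll(): Unit = {
    TestHive.setConf(confKey, originalValue)
    super.afterAll()
  }
}
```
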
@tejasapatil
Contributor Author

ok to test

@SparkQA

SparkQA commented Mar 27, 2016

Test build #54267 has finished for PR 11891 at commit 3c25e7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

@liancheng : I have made all the requested changes per the review and also rebased. Can you please take a look?

@tejasapatil
Contributor Author

ping @liancheng

@marmbrus
Contributor

marmbrus commented Apr 1, 2016

Thanks, merging to master.

@asfgit asfgit closed this in 1e88615 Apr 1, 2016
@tejasapatil tejasapatil deleted the orc_ppd branch April 2, 2016 02:43