[SPARK-14070] [SQL] Use ORC data source for SQL queries on ORC tables #11891

Closed
wants to merge 8 commits

Conversation

tejasapatil
Contributor

What changes were proposed in this pull request?

This patch enables use of OrcRelation for SQL queries that read data from ORC-backed Hive tables. Changes in this patch:

  • Added a new rule, OrcConversions, which alters the plan to use OrcRelation. In this diff, the conversion is done only for reads.
  • Added a new config, spark.sql.hive.convertMetastoreOrc, to control the conversion.
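
A minimal sketch of toggling the new flag from the shell (assuming a HiveContext bound to `hqlContext`, as in the plans below; this snippet is illustrative and not part of the patch):

```
// Illustrative only: switch the metastore ORC conversion on/off and compare plans.
hqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "true")   // use the ORC data source path
hqlContext.sql("SELECT * FROM orc_table").explain(true)

hqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")  // fall back to HiveTableScan
hqlContext.sql("SELECT * FROM orc_table").explain(true)
```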

BEFORE

scala>  hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None

== Analyzed Logical Plan ==
key: string, value: string
Project [key#171,value#172]
+- MetastoreRelation default, orc_table, None

== Optimized Logical Plan ==
MetastoreRelation default, orc_table, None

== Physical Plan ==
HiveTableScan [key#171,value#172], MetastoreRelation default, orc_table, None

AFTER

scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None

== Analyzed Logical Plan ==
key: string, value: string
Project [key#76,value#77]
+- SubqueryAlias orc_table
   +- Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>

== Optimized Logical Plan ==
Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>

== Physical Plan ==
WholeStageCodegen
:  +- Scan ORC part: struct<>, data: struct<key:string,value:string>[key#76,value#77] InputPaths: file:/user/hive/warehouse/orc_table

How was this patch tested?

  • Added a new unit test and ran the existing unit tests
  • Ran with production-like data

Performance gains

Ran on a production table at Facebook (note that the data was in the DWRF file format, which is similar to ORC).

Best case: no rows match the predicate in the query (everything is filtered out)

|                    | CPU time    | Wall time | Total wall time across all tasks |
|--------------------|-------------|-----------|----------------------------------|
| Without the change | 541,515 sec | 25.0 mins | 165.8 hours                      |
| With the change    | 407 sec     | 1.5 mins  | 15 mins                          |

Average case: a subset of rows in the data match the query predicate

|                    | CPU time    | Wall time | Total wall time across all tasks |
|--------------------|-------------|-----------|----------------------------------|
| Without the change | 624,630 sec | 31.0 mins | 199.0 hours                      |
| With the change    | 14,769 sec  | 5.3 mins  | 7.7 hours                        |

@tejasapatil tejasapatil changed the title Use ORC data source for SQL queries on ORC tables [SPARK-14070] [SQL] Use ORC data source for SQL queries on ORC tables Mar 22, 2016
@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53806 has finished for PR 11891 at commit 0938c08.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 22, 2016

This is pretty cool -- the Hive path is ridiculously slow. BTW, I tried comparing Parquet vs. ORC on the current Spark master branch.

I generated 100 million rows with one double column and one string column:

import org.apache.spark.sql.functions.rand  // for the rand() column expression

sqlContext.range(100 * 1000 * 1000)
  .select(rand().as("numeric"), rand().cast("string").as("string"))
  .write.parquet("testdata/random.parquet")

then read back just the string column:

def measureParquet(): Unit = {
  val start = System.nanoTime
  sqlContext.read.parquet("testdata/random.parquet").selectExpr("count(string)").show()
  val end = System.nanoTime
  print((end - start) / 1000 / 1000)  // elapsed time in milliseconds
}
measureParquet()

Parquet with gzip compression takes ~12 secs. Parquet with snappy compression takes ~7 secs. ORC takes ~24 secs. We can definitely optimize the current ORC implementation more too.
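
For reference, a sketch of the analogous ORC measurement (an assumption based on the Parquet snippet above, not something posted in this thread; `.orc` on the reader/writer requires a Hive-enabled build):

```
import org.apache.spark.sql.functions.rand

// Hypothetical ORC analogue of the Parquet benchmark above.
sqlContext.range(100 * 1000 * 1000)
  .select(rand().as("numeric"), rand().cast("string").as("string"))
  .write.orc("testdata/random.orc")

def measureOrc(): Unit = {
  val start = System.nanoTime
  sqlContext.read.orc("testdata/random.orc").selectExpr("count(string)").show()
  val end = System.nanoTime
  print((end - start) / 1000 / 1000)  // elapsed time in milliseconds
}
measureOrc()
```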

@tejasapatil
Contributor Author

@marmbrus : I looked at the build failures to try to figure out the cause: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53806/

I need help running just one of the tests that failed (preferably in IntelliJ). Do you know how to do that?

@tejasapatil
Contributor Author

@rxin : Can you point me to specific features/changes in Parquet that are not in ORC? I am happy to work on adding them to ORC.

@marmbrus
Contributor

I've never gotten test running to work in IntelliJ, unfortunately. You can use SBT as follows:

spark/$ sbt/sbt -Phive
> hive/test-only *HiveCompatibilitySuite -- -z date_serde
....

For the first failure, this looks like the cause:

[info]   Caused by: java.lang.UnsupportedOperationException: unsupported plan Relation[c1#282,c2#283] HadoopFiles
[info]   
[info]          at org.apache.spark.sql.hive.SQLBuilder.org$apache$spark$sql$hive$SQLBuilder$$toSQL(SQLBuilder.scala:191)
[info]          at org.apache.spark.sql.hive.SQLBuilder.org$apache$spark$sql$hive$SQLBuilder$$toSQL(SQLBuilder.scala:149)
[info]          at org.apache.spark.sql.hive.SQLBuilder.projectToSQL(SQLBuilder.scala:208)
[info]          at org.apache.spark.sql.hive.SQLBuilder.org$apache$spark$sql$hive$SQLBuilder$$toSQL(SQLBuilder.scala:111)
[info]          at org.apache.spark.sql.hive.SQLBuilder.org$apache$spark$sql$hive$SQLBuilder$$toSQL(SQLBuilder.scala:149)
[info]          at org.apache.spark.sql.hive.SQLBuilder.projectToSQL(SQLBuilder.scala:208)
[info]          at org.apache.spark.sql.hive.SQLBuilder.org$apache$spark$sql$hive$SQLBuilder$$toSQL(SQLBuilder.scala:111)
[info]          at org.apache.spark.sql.hive.SQLBuilder.toSQL(SQLBuilder.scala:81)
[info]          at org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1$$anonfun$36.liftedTree1$1(HiveComparisonTest.scala:398)
[info]          ... 57 more
[info]   

It seems we can't build SQL for queries over ORC relations (used in view canonicalization) anymore. @liancheng might have some suggestions.

@@ -597,6 +619,107 @@ private[hive] class HiveMetastoreCatalog(val client: HiveClient, hive: HiveConte
}
}

private def convertToOrcRelation(metastoreRelation: MetastoreRelation): LogicalRelation = {
Contributor

How much of this is duplicated with convertToParquetRelation? It would be awesome if we could just make this ConvertToHadoopFSRelation and handle all the cases (or just ORC/Parquet through a common path). I'd have to look closer, but I think it's likely that the Parquet-specific things (like parquetOption with the merged schema) are actually not required anymore.

The fact that this code path is Parquet-specific predates the common interface for reading files in Spark SQL.

Contributor

Agreed. It's probably OK to tolerate this duplication for this PR. I'd like to revisit this code path after finishing the migration of the ORC read path.

Contributor Author

I have refactored this into a single method named convertToLogicalRelation instead of separate ones for Parquet and ORC.
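
Purely to illustrate the shape of that refactoring (every type below is a hypothetical stand-in, not Spark's internal API or the actual code in this PR):

```
// Illustrative only: one conversion routine parameterized by the file format,
// instead of separate convertToParquetRelation / convertToOrcRelation copies.
trait FileFormatDesc { def name: String; def options: Map[String, String] }
case object ParquetDesc extends FileFormatDesc {
  val name = "parquet"; val options = Map("mergeSchema" -> "false")
}
case object OrcDesc extends FileFormatDesc {
  val name = "orc"; val options = Map.empty[String, String]
}

case class MetastoreTableStub(db: String, table: String, paths: Seq[String])
case class LogicalRelationStub(format: String, options: Map[String, String], paths: Seq[String])

def convertToLogicalRelation(rel: MetastoreTableStub, format: FileFormatDesc): LogicalRelationStub =
  // The real method would also resolve the schema, partitioning, and relation caching;
  // the point here is only that the format becomes a parameter.
  LogicalRelationStub(format.name, format.options, rel.paths)
```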

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53968 has finished for PR 11891 at commit c71bb6c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53969 has finished for PR 11891 at commit c15a39f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

@liancheng + @marmbrus : Thanks for your comments. I have made the suggested changes except the one related to the test case, which I am not sure how to do.

Also, the PR failed to build on Jenkins but it builds fine on my end when I do:

build/sbt scalastyle catalyst/compile sql/compile hive/compile
build/sbt -Pyarn -Phadoop-2.3 -Phive package assembly/assembly

I looked at the Jenkins log, and I guess that incremental compilation makes it not see the new class OrcConversions added in HiveSessionCatalog. Is there a way to avoid that?

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53983 has finished for PR 11891 at commit c4111fa.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

@tejasapatil When testing a PR, Jenkins always tries to merge it with the most recent master (or whichever branch the PR is opened against) first. I think the reason for the build failure is that we just merged PR #11836, which probably doesn't conflict with yours from Git's point of view but actually causes a compilation failure. Would you please try to rebase onto the most recent master and retry?

@liancheng
Contributor

@tejasapatil Just saw your previous comment about debugging in IntelliJ. When I have to resort to the debugger, I usually run tests using SBT and debug with IntelliJ's remote debugging feature. This might be useful for you. Under SBT, you can do this:

> project hive // or switch to any other project the target test case belongs to
> set javaOptions in Test += <remote debugging options>
> test-only ...
...
> session clear // this clears the previously added options and disables remote debugging
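
For concreteness, a typical value for that placeholder (standard JDWP syntax, given here as an illustration rather than anything from this thread) would be:

> set javaOptions in Test += "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"

IntelliJ's remote-debug run configuration can then attach to port 5005.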

@tejasapatil
Contributor Author

@liancheng : I did a rebase and that fixed the problem. Also, thanks for the pointer for debugging the tests.

@tejasapatil
Contributor Author

ok to test

@liancheng
Contributor

retest this please

@liancheng
Contributor

add to whitelist

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54010 has finished for PR 11891 at commit c4111fa.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

Ah, unfortunately we just reverted #11836 because it caused some regressions. This caused another compilation error for your PR... Sorry for the trouble...

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54071 has finished for PR 11891 at commit 0a03a78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54092 has finished for PR 11891 at commit 4b459a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54097 has finished for PR 11891 at commit 4f0fe06.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- Merged the conversion methods for Parquet and ORC into a single method `convertToLogicalRelation`, as suggested by @marmbrus
- Addressed @liancheng's comment about checking for the bucket spec
- 4 tests were failing because of the change, since it alters the plan for the queries.

e.g., query from `date_serde.q`:

```
select * from date_serde_orc
```

Plan (BEFORE)
```
Project [c1#282,c2#283]
+- MetastoreRelation default, date_serde_orc, None
```

Plan (AFTER)
```
Project [c1#287,c2#288]
+- SubqueryAlias date_serde_orc
   +- Relation[c1#287,c2#288] HadoopFiles
```

Setting `CONVERT_METASTORE_ORC` to `false` by default to mitigate test failures. The other option was to make `SQLBuilder` work with `Relation`, but that is out of scope for the current PR. In my opinion, it would be better to have this config turned on by default so that anyone trying out Spark out of the box gets better performance without needing to tweak such configs.

Open items:
- @liancheng's review comment: the test case added does not verify that the new code path is hit
… anyone downloading Spark and trying it out will get the best performance without needing to know about this feature and tweak this config.

For the tests that had failed, made `HiveCompatibilitySuite` set that config to `false` so that they still produce the plan the old way (a rough sketch of this kind of override is shown below).

Ran `build/sbt scalastyle catalyst/compile sql/compile hive/compile`
Ran `build/sbt -Phive hive/test-only *HiveCompatibilitySuite`
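
A rough sketch of the kind of override mentioned above (hypothetical code, not the PR's actual diff; it assumes the `TestHive` singleton and ScalaTest's `BeforeAndAfterAll`):

```
import org.apache.spark.sql.hive.test.TestHive
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Hypothetical sketch: pin spark.sql.hive.convertMetastoreOrc to false for a
// suite whose expected plans assume the old HiveTableScan path, and restore
// the original value afterwards.
class OrcConversionDisabledSuiteSketch extends FunSuite with BeforeAndAfterAll {
  private val confKey = "spark.sql.hive.convertMetastoreOrc"
  private var originalValue: String = _

  override protected def beforeAll(): Unit = {
    super.beforeAll()
    originalValue = TestHive.getConf(confKey, "true")
    TestHive.setConf(confKey, "false")
  }

  override protected def afterAll(): Unit = {
    TestHive.setConf(confKey, originalValue)
    super.afterAll()
  }
}
```
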
@tejasapatil
Contributor Author

ok to test

@SparkQA

SparkQA commented Mar 27, 2016

Test build #54267 has finished for PR 11891 at commit 3c25e7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

@liancheng : I have made all the requested changes per the review and also rebased. Can you please take a look?

@tejasapatil
Contributor Author

ping @liancheng

@marmbrus
Contributor

marmbrus commented Apr 1, 2016

Thanks, merging to master.

@asfgit asfgit closed this in 1e88615 Apr 1, 2016
@tejasapatil tejasapatil deleted the orc_ppd branch April 2, 2016 02:43