
Conversation

NathanHowell

This patch comprises a few related pieces of work:

  • Schema inference is performed directly on the JSON token stream
  • `String => Row` conversion populates Spark SQL structures without intermediate types
  • Projection pushdown is implemented via CatalystScan for DataFrame queries
  • The legacy parser remains available by setting `spark.sql.json.useJacksonStreamingAPI` to `false` (see the sketch below)
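
For reference, a minimal sketch of toggling the flag from a Spark shell (the flag name comes from this patch; `setConf` is just the standard SQLContext configuration call, not code from the patch itself):

```
// Opt back into the legacy JsonRDD parser; the new Jackson streaming path is the default.
sqlContext.setConf("spark.sql.json.useJacksonStreamingAPI", "false")
val df = sqlContext.jsonFile("/tmp/lastfm.json") // parsed by the legacy code path
```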

Performance improvements depend on the schema and queries being executed, but it should be faster across the board. Below are benchmarks using the last.fm Million Song dataset:

```
Command                                            | Baseline | Patched
---------------------------------------------------|----------|--------
import sqlContext.implicits._                      |          |
val df = sqlContext.jsonFile("/tmp/lastfm.json")   |    70.0s |   14.6s
df.count()                                         |    28.8s |    6.2s
df.rdd.count()                                     |    35.3s |   21.5s
df.where($"artist" === "Robert Hood").collect()    |    28.3s |   16.9s
```

To prepare this dataset for benchmarking, follow these steps:

```
# Fetch the datasets from http://labrosa.ee.columbia.edu/millionsong/lastfm
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_test.zip \
     http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_train.zip

# Decompress and combine, pipe through `jq -c` to ensure there is one record per line
unzip -p lastfm_test.zip  | jq -c . >  lastfm.json
unzip -p lastfm_train.zip | jq -c . >> lastfm.json
```

@rxin
Contributor

rxin commented Apr 30, 2015

Jenkins, ok to test.

@rxin
Contributor

rxin commented Apr 30, 2015

I won't have time to look at this today, but this is pretty cool.

@NathanHowell
Author

Looks like it may also resolve SPARK-5443.

@rxin
Contributor

rxin commented Apr 30, 2015

Can you put both JIRA tickets in the title? It will then automatically be linked to both tickets.

@NathanHowell changed the title from [SPARK-5938][SQL] Improve JsonRDD performance to [SPARK-5938][SPARK-5443][SQL] Improve JsonRDD performance on Apr 30, 2015
@NathanHowell
Author

Done.

@SparkQA

SparkQA commented Apr 30, 2015

Test build #31399 has finished for PR 5801 at commit 1abf1d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KMeansModel (
    • trait PMMLExportable
  • This patch adds the following new dependencies:
    • jaxb-api-2.2.7.jar
    • jaxb-core-2.2.7.jar
    • jaxb-impl-2.2.7.jar
    • pmml-agent-1.1.15.jar
    • pmml-model-1.1.15.jar
    • pmml-schema-1.1.15.jar
  • This patch removes the following dependencies:
    • activation-1.1.jar
    • jaxb-api-2.2.2.jar
    • jaxb-impl-2.2.3-1.jar

@NathanHowell
Author

Benchmarked a small-ish real dataset... Runs are with 5 executors (for 5 input splits) with data in HDFS:

```
Step                                                | Before | After
----------------------------------------------------|--------|-------
val df = sqlContext.jsonRDD(...) - schema inference | 37.14s | 18.16s
df.count()                                          | 125.8s |  25.7s
df.select("col1").count()                           |  96.9s |  26.5s
```

Not sure why, but the new code seems a bit slower when using projection pushdown. It may be schema dependent, or it may be overhead from evaluating the projection expressions.

@SparkQA

SparkQA commented Apr 30, 2015

Test build #31449 has finished for PR 5801 at commit 55c2f39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@marmbrus
Contributor

marmbrus commented May 1, 2015

/cc @yhuai

@SparkQA

SparkQA commented May 1, 2015

Test build #31526 has finished for PR 5801 at commit 67c381a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NathanHowell
Author

I think it's in a decent state now. If this qualifies for the 1.4.0 merge window, I'll make time to work through any remaining issues.

@SparkQA

SparkQA commented May 1, 2015

Test build #31544 has finished for PR 5801 at commit bd2e929.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented May 1, 2015

@NathanHowell This is great! Is it possible to add a feature flag to choose which code path we use? By default, we would use the new code path, but we would still keep the option to use the old one in case there is any issue. Then, in 1.5, we can remove the old code path. What do you think?
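
A rough sketch of the fallback idea (the flag name is from this PR; the function and path labels below are illustrative, not the actual implementation):

```
import org.apache.spark.sql.SQLContext

// Read the flag, defaulting to the new streaming path; the legacy JsonRDD path stays
// selectable until it can be removed in 1.5.
def selectJsonPath(sqlContext: SQLContext): String = {
  val useStreaming =
    sqlContext.getConf("spark.sql.json.useJacksonStreamingAPI", "true").toBoolean
  if (useStreaming) "JsonRDD2 (Jackson streaming)" else "JsonRDD (legacy)"
}
```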

@NathanHowell
Author

@yhuai Fine with me, I'm reworking the patch set now.

@NathanHowell
Author

@yhuai The updated patches do not test the old code. Do you have an opinion on the best way to address this? I can duplicate the entire JsonSuite or try to do something a bit better...

@marmbrus
Contributor

marmbrus commented May 1, 2015

I'm okay with freezing the old code and not having tests. I just want a quick fallback if a regression is found.

@NathanHowell
Author

@marmbrus sounds good, I'll leave it as is.

@SparkQA

SparkQA commented May 2, 2015

Test build #31626 has finished for PR 5801 at commit ab6ee87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DataFrameStatFunctions(object):

@SparkQA

SparkQA commented May 2, 2015

Test build #31630 has finished for PR 5801 at commit 842846d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

parser.nextToken()
inferField(parser)

case VALUE_STRING if parser.getTextLength < 1 => NullType
Contributor

Does it mean that we get an empty string? If so, can we keep the StringType? Otherwise, I feel we are destroying information.

Author

Yes, an empty string gets inferred as a NullType. After inference is complete, any remaining NullType fields get converted back to a StringType. The old code does this and has test coverage for it, but it does seem a bit odd.
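
A small standalone sketch of the check under discussion (an illustration only, not the patch code; it just shows how an empty string appears on the Jackson token stream):

```
import com.fasterxml.jackson.core.{JsonFactory, JsonToken}

// An empty string carries no type information, so inference treats it as NullType;
// a later record with a real value (or the end-of-inference fallback) yields StringType.
val parser = new JsonFactory().createParser("""{"a": ""}""")
parser.nextToken() // START_OBJECT
parser.nextToken() // FIELD_NAME "a"
parser.nextToken() // VALUE_STRING ""
val inferred =
  if (parser.getCurrentToken == JsonToken.VALUE_STRING && parser.getTextLength < 1) "NullType"
  else "StringType"
println(inferred) // prints: NullType
```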

Contributor

OK, I see. It is for those datasets that use "" as a null. Since this is only for inferring the type, it is safe to use NullType because we can always get StringType from other records. Actually, can we add a comment there to explain its purpose?
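
A tiny illustration of why this is safe (the merge function below is a simplified stand-in, not the actual inference code):

```
import org.apache.spark.sql.types._

// NullType acts as the identity under type merging: any record with a real value widens the
// field, and anything still NullType after inference falls back to StringType.
def mergeTypes(a: DataType, b: DataType): DataType = (a, b) match {
  case (NullType, t) => t
  case (t, NullType) => t
  case (t1, t2) if t1 == t2 => t1
  case _ => StringType // simplified fallback for this sketch
}

mergeTypes(NullType, StringType) // StringType: the "" record does not destroy information
```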

Author

Done.

@rxin
Contributor

rxin commented May 4, 2015

You can try getting some tweets and using those. That's what I usually demo on, but unfortunately I don't think it is legal to make collected tweets public.

There is some stuff here: https://www.opensciencedatacloud.org/publicdata/city-of-chicago-public-datasets/

@NathanHowell
Author

@rxin Thanks, the datasets there are currently not mounted on their rsync endpoint, so I found another dataset (last.fm) that is about 2 GB and timed a few queries.

@rxin
Contributor

rxin commented May 5, 2015

Slightly off topic - @NathanHowell do you know if Jackson allows returning UTF-8 encoded strings directly? If it supports that, we can skip string decoding/encoding altogether, since Spark SQL internally now uses UTF-8 encoded bytes for strings.

@NathanHowell
Author

@rxin It supports writing a UTF-8 encoded byte array, but there doesn't seem to be equivalent support for reads. The best that can be done is converting the current char[] buffer and offset/length directly to a byte[], avoiding an alloc/copy to String.

see: http://fasterxml.github.io/jackson-core/javadoc/2.3.0/com/fasterxml/jackson/core/base/ParserMinimalBase.html#getTextCharacters()
and http://fasterxml.github.io/jackson-core/javadoc/2.3.0/com/fasterxml/jackson/core/JsonGenerator.html#writeUTF8String(byte[], int, int)
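
A sketch of what that read-side workaround could look like (an illustration under stated assumptions, not code from the patch; it ignores Jackson's internal buffer reuse):

```
import java.nio.CharBuffer
import java.nio.charset.StandardCharsets
import com.fasterxml.jackson.core.JsonFactory

// Grab the parser's backing char[] buffer plus offset/length and encode it to UTF-8 bytes
// ourselves, skipping the intermediate java.lang.String allocation.
val parser = new JsonFactory().createParser("""{"artist": "Robert Hood"}""")
parser.nextToken(); parser.nextToken(); parser.nextToken() // position on the VALUE_STRING
val chars  = parser.getTextCharacters
val offset = parser.getTextOffset
val length = parser.getTextLength
val utf8   = StandardCharsets.UTF_8.encode(CharBuffer.wrap(chars, offset, length)) // ByteBuffer
```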

@NathanHowell
Author

@yhuai Is there still time to get this in for 1.4.0?

@marmbrus
Contributor

marmbrus commented May 5, 2015

Yeah, given that there is a flag I think we can still include this.

@yhuai
Contributor

yhuai commented May 5, 2015

@NathanHowell I will do a final check tomorrow. Can you also add the performance numbers for selecting all columns to the description? You can use `df.rdd.count` as the command to compare the two versions.

@@ -160,6 +162,9 @@ private[sql] class SQLConf extends Serializable {

private[spark] def useSqlSerializer2: Boolean = getConf(USE_SQL_SERIALIZER2, "true").toBoolean

private[spark] def useJacksonStreamingAPI: Boolean =
Contributor

Can you add a comment to explain that this is a temporary flag and that we will remove the old code path in 1.5?

@yhuai
Contributor

yhuai commented May 5, 2015

@NathanHowell I played with it. The issue I found is that inserts do not work well because baseRDD is an input parameter of the JSON relation. For example, the following code throws an exception:

```
sql(
      s"""
        |CREATE TEMPORARY TABLE jsonTable (a int, b string)
        |USING org.apache.spark.sql.json.DefaultSource
        |OPTIONS (
        |  path '/tmp/jsonTable'
        |)
      """.stripMargin)
val rdd = sparkContext.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""), 5)
jsonRDD(rdd).registerTempTable("jt")
sql(
      s"""
        |INSERT OVERWRITE TABLE jsonTable SELECT a, b FROM jt
      """.stripMargin)

val rdd1 = sparkContext.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""), 1)
jsonRDD(rdd1).registerTempTable("jt1")
sql(
      s"""
        |INSERT OVERWRITE TABLE jsonTable SELECT a, b FROM jt1
      """.stripMargin)

The exception is something like

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 6.0 failed 1 times, most recent failure: Lost task 3.0 in stage 6.0 (TID 31, localhost): java.io.FileNotFoundException: File file:/tmp/testJson/part-00003 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:106)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:235)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
```

@NathanHowell
Author

@yhuai I'll be able to check on this a bit later today.

@yhuai
Contributor

yhuai commented May 5, 2015

It seems our test cases are not sufficient to catch the problem. Can you also add the following test cases?

In InsertSuite, let's change the `val rdd` defined in `beforeAll` to `val rdd = sparkContext.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""), 5)`. Then, let's change the test "INSERT OVERWRITE a JSONRelation multiple times" to:

test("INSERT OVERWRITE a JSONRelation multiple times") {
  sql(
    s"""
      |INSERT OVERWRITE TABLE jsonTable SELECT a, b FROM jt
    """.stripMargin)
  checkAnswer(
    sql("SELECT a, b FROM jsonTable"),
    (1 to 10).map(i => Row(i, s"str$i"))
  )

  // Writing the table to fewer part files.
  val rdd1 = sparkContext.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""), 5)
  jsonRDD(rdd1).registerTempTable("jt1")
  sql(
    s"""
    |INSERT OVERWRITE TABLE jsonTable SELECT a, b FROM jt1
    """.stripMargin)
  checkAnswer(
    sql("SELECT a, b FROM jsonTable"),
    (1 to 10).map(i => Row(i, s"str$i"))
  )

  // Writing the table to more part files.
  val rdd2 = sparkContext.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""), 10)
  jsonRDD(rdd2).registerTempTable("jt2")
  sql(
    s"""
    |INSERT OVERWRITE TABLE jsonTable SELECT a, b FROM jt2
    """.stripMargin)
  checkAnswer(
    sql("SELECT a, b FROM jsonTable"),
    (1 to 10).map(i => Row(i, s"str$i"))
  )

  sql(
    s"""
      |INSERT OVERWRITE TABLE jsonTable SELECT a * 10, b FROM jt1
    """.stripMargin)
  checkAnswer(
    sql("SELECT a, b FROM jsonTable"),
    (1 to 10).map(i => Row(i * 10, s"str$i"))
  )

  dropTempTable("jt1")
  dropTempTable("jt2")
}
```

Also, add the following in the InsertSuite.

test("save directly to the path of a JSON table") {
  table("jt").selectExpr("a * 5 as a", "b").save(path.toString, "json", SaveMode.Overwrite)
  checkAnswer(
    sql("SELECT a, b FROM jsonTable"),
    (1 to 10).map(i => Row(i * 5, s"str$i"))
  )

  table("jt").save(path.toString, "json", SaveMode.Overwrite)
  checkAnswer(
    sql("SELECT a, b FROM jsonTable"),
    (1 to 10).map(i => Row(i, s"str$i"))
  )
}
```

@SparkQA

SparkQA commented May 6, 2015

Test build #31981 has finished for PR 5801 at commit e91d8c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NathanHowell
Author

@yhuai I've added the tests and fixed the failures. The change was minor: I changed the type of baseRDD back to `=> RDD[String]` and added some comments.

private[sql] class JSONRelation(
// baseRDD needs to be created on scan and not when JSONRelation is
// constructed, so we need a function (call by name) instead of a value
baseRDD: => RDD[String],
Contributor

Can you update the comment to document why it needs to be created on scan?

Contributor

How about we explicitly pass a closure (more reader-friendly)?

@NathanHowell
Author

@rxin yep, I've updated the comment.

// underlying inputs are modified. To be safe, a call-by-name
// value (a function) is used instead of a regular value to
// ensure the RDD is recreated on each and every operation.
baseRDD: => RDD[String],
Contributor

@NathanHowell Do you think a closure here would be better (as mentioned in https://github.com/databricks/scala-style-guide#call_by_name)?

Author

Yes, this has been corrected.
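
For readers following along, a minimal sketch of the two signatures being discussed (the class names are illustrative, not the actual JSONRelation):

```
import org.apache.spark.rdd.RDD

// Call-by-name: the argument is re-evaluated on every reference to baseRDD,
// which keeps the RDD fresh but hides that cost at the call site.
class RelationByName(baseRDD: => RDD[String])

// Explicit closure, per the style guide: callers pass () => ..., and the relation
// calls baseRDD() on each scan, making the re-creation obvious to the reader.
class RelationByClosure(baseRDD: () => RDD[String])
```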

@SparkQA

SparkQA commented May 6, 2015

Test build #32000 has finished for PR 5801 at commit e1187eb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 6, 2015

Test build #32001 has finished for PR 5801 at commit 26fea31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented May 7, 2015

Thank you! LGTM. I am merging it to master and branch 1.4.

asfgit pushed a commit that referenced this pull request May 7, 2015
This patch comprises a few related pieces of work:

* Schema inference is performed directly on the JSON token stream
* `String => Row` conversion populates Spark SQL structures without intermediate types
* Projection pushdown is implemented via CatalystScan for DataFrame queries
* Support for the legacy parser by setting `spark.sql.json.useJacksonStreamingAPI` to `false`

Performance improvements depend on the schema and queries being executed, but it should be faster across the board. Below are benchmarks using the last.fm Million Song dataset:

```
Command                                            | Baseline | Patched
---------------------------------------------------|----------|--------
import sqlContext.implicits._                      |          |
val df = sqlContext.jsonFile("/tmp/lastfm.json")   |    70.0s |   14.6s
df.count()                                         |    28.8s |    6.2s
df.rdd.count()                                     |    35.3s |   21.5s
df.where($"artist" === "Robert Hood").collect()    |    28.3s |   16.9s
```

To prepare this dataset for benchmarking, follow these steps:

```
# Fetch the datasets from http://labrosa.ee.columbia.edu/millionsong/lastfm
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_test.zip \
     http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_train.zip

# Decompress and combine, pipe through `jq -c` to ensure there is one record per line
unzip -p lastfm_test.zip lastfm_train.zip  | jq -c . > lastfm.json
```

Author: Nathan Howell <[email protected]>

Closes #5801 from NathanHowell/json-performance and squashes the following commits:

26fea31 [Nathan Howell] Recreate the baseRDD each for each scan operation
a7ebeb2 [Nathan Howell] Increase coverage of inserts into a JSONRelation
e06a1dd [Nathan Howell] Add comments to the `useJacksonStreamingAPI` config flag
6822712 [Nathan Howell] Split up JsonRDD2 into multiple objects
fa8234f [Nathan Howell] Wrap long lines
b31917b [Nathan Howell] Rename `useJsonRDD2` to `useJacksonStreamingAPI`
15c5d1b [Nathan Howell] JSONRelation's baseRDD need not be lazy
f8add6e [Nathan Howell] Add comments on lack of support for precision and scale DecimalTypes
fa0be47 [Nathan Howell] Remove unused default case in the field parser
80dba17 [Nathan Howell] Add comments regarding null handling and empty strings
842846d [Nathan Howell] Point the empty schema inference test at JsonRDD2
ab6ee87 [Nathan Howell] Add projection pushdown support to JsonRDD/JsonRDD2
f636c14 [Nathan Howell] Enable JsonRDD2 by default, add a flag to switch back to JsonRDD
0bbc445 [Nathan Howell] Improve JSON parsing and type inference performance
7ca70c1 [Nathan Howell] Eliminate arrow pattern, replace with pattern matches

(cherry picked from commit 2d6612c)
Signed-off-by: Yin Huai <[email protected]>
@asfgit asfgit closed this in 2d6612c May 7, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
@NathanHowell NathanHowell deleted the json-performance branch December 8, 2016 00:30