[SPARK-5938][SPARK-5443][SQL] Improve JsonRDD performance #5801
Conversation
Jenkins, ok to test.
I won't have time to look at this today, but this is pretty cool.
Looks like it may also resolve SPARK-5443.
Can you put both JIRA tickets in the title? It will then automatically be linked to both tickets.
Done.
Test build #31399 has finished for PR 5801 at commit
Benchmarked a small-ish real dataset... Runs are with 5 executors (for 5 input splits) with data in HDFS:
Not sure why, but the new code seems a bit slower when using projection pushdowns. It may be schema dependent, or overhead from evaluating the projection expression.
Test build #31449 has finished for PR 5801 at commit
/cc @yhuai
Force-pushed from 55c2f39 to 67c381a.
Test build #31526 has finished for PR 5801 at commit
I think it's in a decent state now. If this qualifies for the 1.4.0 merge window, I'll make time to work through any remaining issues (if any).
Test build #31544 has finished for PR 5801 at commit
@NathanHowell This is great! Is it possible to add a feature flag to choose which code path we use? By default, we use this new code path, but we still keep the option to use the old one in case there is any issue. Then, in 1.5, we can remove the old code path. What do you think?
@yhuai Fine with me, I'm reworking the patch set now.
Force-pushed from bd2e929 to ab6ee87.
@yhuai The updated patches do not test the old code. Do you have an opinion on the best way to address this? I can duplicate the entire JsonSuite or try to do something a bit better...
I'm okay with freezing the old code and not having tests. I just want a
@marmbrus sounds good, I'll leave it as is.
Test build #31626 has finished for PR 5801 at commit
Test build #31630 has finished for PR 5801 at commit
```scala
        parser.nextToken()
        inferField(parser)

      case VALUE_STRING if parser.getTextLength < 1 => NullType
```
Does it mean that we get an empty string? If so, can we keep the `StringType`? Otherwise, I feel we are destroying information.
Yes, an empty string gets inferred as a NullType. After inference is complete, any remaining NullType fields get converted back to a StringType. The old code does this and has test coverage for it, but it does seem a bit odd.
The quoted review context, from sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD2.scala (#5801):

```scala
  /**
   * Infer the type of a json document from the parser's token stream
   */
  private def inferField(parser: JsonParser): DataType = {
    import com.fasterxml.jackson.core.JsonToken._
    parser.getCurrentToken match {
      case null | VALUE_NULL => NullType
      case FIELD_NAME =>
        parser.nextToken()
        inferField(parser)
      case VALUE_STRING if parser.getTextLength < 1 => NullType
```
OK, I see. It is for those datasets that use a `""` as a null. Since it is for inferring the type, it is safe to use NullType because we can always get the StringType from other records. Actually, can we add a comment there to explain its purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
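For readers following the thread, here is a minimal sketch of the behavior being discussed. This is a simplified illustration, not the actual JsonRDD2 code; `inferStringField` and `finalizeType` are hypothetical helper names:

```scala
import com.fasterxml.jackson.core.JsonParser
import org.apache.spark.sql.types._

// During inference, an empty string is treated as NullType so that other
// records can still promote the field to a more specific type when the
// per-record schemas are merged.
def inferStringField(parser: JsonParser): DataType =
  if (parser.getTextLength < 1) NullType else StringType

// After inference, a field that only ever saw nulls (or empty strings)
// falls back to StringType, matching the old JsonRDD behavior.
def finalizeType(dt: DataType): DataType = dt match {
  case NullType => StringType
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = finalizeType(f.dataType))))
  case ArrayType(elementType, containsNull) =>
    ArrayType(finalizeType(elementType), containsNull)
  case other => other
}
```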
You can try to get some tweets and use that. That's what I usually demo on, but unfortunately I don't think it is legal to make collected tweets public. There is some stuff here: https://www.opensciencedatacloud.org/publicdata/city-of-chicago-public-datasets/
@rxin thanks, the datasets there are currently not mounted on their rsync endpoint, so I found another (last.fm) dataset that is about 2G and timed a few queries.
Slightly off topic - @NathanHowell do you know if Jackson allows returning UTF-8 encoded strings directly? If it supports that, we can skip string decoding/encoding altogether, since Spark SQL internally now uses UTF-8 encoded bytes for strings.
@rxin It supports writing a UTF-8 encoded byte array, but there doesn't seem to be equivalent support for reads. The best that can be done is converting the current token's text buffer; see: http://fasterxml.github.io/jackson-core/javadoc/2.3.0/com/fasterxml/jackson/core/base/ParserMinimalBase.html#getTextCharacters()
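To illustrate the read-path limitation, here is a sketch assuming Jackson 2.x; `tokenUtf8Bytes` is a hypothetical helper, not a Spark or Jackson API. The parser hands back a UTF-16 `char[]` slice, so a UTF-8 consumer still pays for an encode:

```scala
import java.nio.{ByteBuffer, CharBuffer}
import java.nio.charset.StandardCharsets
import com.fasterxml.jackson.core.JsonParser

// Jackson exposes the current token's text as a slice of its internal
// char[] buffer, which avoids materializing a java.lang.String, but the
// chars are UTF-16: an explicit UTF-8 encode step is still required.
def tokenUtf8Bytes(parser: JsonParser): Array[Byte] = {
  val chars = CharBuffer.wrap(
    parser.getTextCharacters, parser.getTextOffset, parser.getTextLength)
  val encoded: ByteBuffer = StandardCharsets.UTF_8.encode(chars)
  val out = new Array[Byte](encoded.remaining())
  encoded.get(out)
  out
}
```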
@yhuai Is there still time to get this in for 1.4.0?
Yeah, given that there is a flag, I think we can still include this.
@NathanHowell I will take a final look tomorrow. Can you also add the performance number of selecting all columns in the description? You can use
```
@@ -160,6 +162,9 @@ private[sql] class SQLConf extends Serializable {

   private[spark] def useSqlSerializer2: Boolean = getConf(USE_SQL_SERIALIZER2, "true").toBoolean

   private[spark] def useJacksonStreamingAPI: Boolean =
```
Can you add a comment to explain that it is a temporary flag and that we will remove the old code path in 1.5?
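Something along these lines would satisfy the request. This is a sketch only; the constant name `USE_JACKSON_STREAMING_API` and the default value are assumptions, mirroring the `useSqlSerializer2` pattern visible in the diff above:

```scala
// Temporary flag: the Jackson streaming parser is the new default code
// path for JSON. Keep the old JsonRDD path selectable in case of issues;
// both this flag and the old path are slated for removal in 1.5.
private[spark] def useJacksonStreamingAPI: Boolean =
  getConf(USE_JACKSON_STREAMING_API, "true").toBoolean
```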
@NathanHowell I played with it. The issue I found is that
The exception is something like
@yhuai I'll be able to check on this a bit later today.
Seems our test cases are not sufficient to catch the problem. Can you also add the following test cases? In
Also, add the following in the
Test build #31981 has finished for PR 5801 at commit
@yhuai I've added the tests and fixed the failures; the change was minor... changed the type of baseRDD back to
```scala
private[sql] class JSONRelation(
    // baseRDD needs to be created on scan and not when JSONRelation is
    // constructed, so we need a function (call by name) instead of a value
    baseRDD: => RDD[String],
```
Can you update the comment to document why it needs to be created on scan?
How about we explicitly pass a closure (more reader-friendly)?
Force-pushed from e91d8c5 to e1187eb.
@rxin yep, I've updated the comment.
```scala
    // underlying inputs are modified. To be safe, a call-by-name
    // value (a function) is used instead of a regular value to
    // ensure the RDD is recreated on each and every operation.
    baseRDD: => RDD[String],
```
@NathanHowell Do you think a closure here would be better (as mentioned in https://github.com/databricks/scala-style-guide#call_by_name)?
Yes, this has been corrected.
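For readers skimming the thread, the two parameter styles under discussion look roughly like this. This is an illustrative sketch; the class names are hypothetical, not the actual JSONRelation code:

```scala
import org.apache.spark.rdd.RDD

// Call-by-name: reads like a plain value at the call site, but the
// expression is re-evaluated on every use, so the RDD is recreated
// on each scan. The deferred evaluation is easy to miss when reading.
class ByNameRelation(baseRDD: => RDD[String]) {
  def scan(): RDD[String] = baseRDD // looks like a plain field access
}

// Explicit closure: identical semantics, but the () => type makes the
// deferred evaluation visible, which is what the style guide prefers.
class ClosureRelation(baseRDD: () => RDD[String]) {
  def scan(): RDD[String] = baseRDD() // evaluation is explicit
}
```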
Test build #32000 has finished for PR 5801 at commit
Force-pushed from e1187eb to 26fea31.
Test build #32001 has finished for PR 5801 at commit
Thank you! LGTM. I am merging it to master and branch 1.4.
This patch comprises a few related pieces of work:

* Schema inference is performed directly on the JSON token stream
* `String => Row` conversion populates Spark SQL structures without intermediate types
* Projection pushdown is implemented via CatalystScan for DataFrame queries
* Support for the legacy parser by setting `spark.sql.json.useJacksonStreamingAPI` to `false`

Performance improvements depend on the schema and queries being executed, but it should be faster across the board. Below are benchmarks using the last.fm Million Song dataset:

```
Command                                          | Baseline | Patched
-------------------------------------------------|----------|--------
import sqlContext.implicits._                    |          |
val df = sqlContext.jsonFile("/tmp/lastfm.json") | 70.0s    | 14.6s
df.count()                                       | 28.8s    | 6.2s
df.rdd.count()                                   | 35.3s    | 21.5s
df.where($"artist" === "Robert Hood").collect()  | 28.3s    | 16.9s
```

To prepare this dataset for benchmarking, follow these steps:

```
# Fetch the datasets from http://labrosa.ee.columbia.edu/millionsong/lastfm
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_test.zip \
  http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_train.zip

# Decompress and combine, pipe through `jq -c` to ensure there is one record per line
unzip -p lastfm_test.zip lastfm_train.zip | jq -c . > lastfm.json
```

Author: Nathan Howell <[email protected]>

Closes #5801 from NathanHowell/json-performance and squashes the following commits:

26fea31 [Nathan Howell] Recreate the baseRDD for each scan operation
a7ebeb2 [Nathan Howell] Increase coverage of inserts into a JSONRelation
e06a1dd [Nathan Howell] Add comments to the `useJacksonStreamingAPI` config flag
6822712 [Nathan Howell] Split up JsonRDD2 into multiple objects
fa8234f [Nathan Howell] Wrap long lines
b31917b [Nathan Howell] Rename `useJsonRDD2` to `useJacksonStreamingAPI`
15c5d1b [Nathan Howell] JSONRelation's baseRDD need not be lazy
f8add6e [Nathan Howell] Add comments on lack of support for precision and scale DecimalTypes
fa0be47 [Nathan Howell] Remove unused default case in the field parser
80dba17 [Nathan Howell] Add comments regarding null handling and empty strings
842846d [Nathan Howell] Point the empty schema inference test at JsonRDD2
ab6ee87 [Nathan Howell] Add projection pushdown support to JsonRDD/JsonRDD2
f636c14 [Nathan Howell] Enable JsonRDD2 by default, add a flag to switch back to JsonRDD
0bbc445 [Nathan Howell] Improve JSON parsing and type inference performance
7ca70c1 [Nathan Howell] Eliminate arrow pattern, replace with pattern matches

(cherry picked from commit 2d6612c)
Signed-off-by: Yin Huai <[email protected]>