
[WIP][SPARK-25129][SQL] Revert mapping com.databricks.spark.avro to org.apache.spark.sql.avro #22119


Closed

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

In https://issues.apache.org/jira/browse/SPARK-24924, the data source provider com.databricks.spark.avro is mapped to the new package org.apache.spark.sql.avro.

Avro is an external module and is not loaded by default, so we should not prevent users from using "com.databricks.spark.avro".
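For context, the mapping in question lives in Spark's data source resolution path. The following is a minimal Scala sketch of the idea, with simplified names and structure; it is an illustration of the mechanism, not the verbatim Spark source.

// Minimal sketch of the legacy-name mapping consulted during data source
// resolution (simplified; not the verbatim Spark source).
object DataSourceLookup {
  // SPARK-24924 added the avro entry below; this PR proposes removing it so
  // that "com.databricks.spark.avro" is no longer redirected to the
  // built-in module.
  private val backwardCompatibilityMap: Map[String, String] = Map(
    "com.databricks.spark.csv" ->
      "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
    "com.databricks.spark.avro" ->
      "org.apache.spark.sql.avro.AvroFileFormat"
  )

  // Resolve a user-supplied provider name to the class name that backs it.
  def resolveProvider(provider: String): String =
    backwardCompatibilityMap.getOrElse(provider, provider)
}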

How was this patch tested?

Unit test

throw new AnalysisException(
  s"Failed to find data source: ${provider1.toLowerCase(Locale.ROOT)}. " +
  "AVRO is built-in data source since Spark 2.4. Please deploy the application " +
  "as per https://spark.apache.org/docs/latest/avro-data-source.html#deploying")
Member Author

I am writing documentation for the Avro data source. Let's merge this PR after the README is done.

@gengliangwang
Member Author

@HyukjinKwon
Member

Sorry if I missed some comments somewhere, but just for clarification: should we do the same for CSV in 3.0.0? The inconsistency should also be taken into account. Actually, a configuration makes more sense to me; we could then remove the mapping in 3.0.0.

@@ -503,7 +495,7 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     // get the same values back.
     withTempPath { tempDir =>
       val name = "AvroTest"
-      val namespace = "org.apache.spark.avro"
+      val namespace = "com.databricks.spark.avro"
Contributor
Why change the namespace in the test?

@@ -637,6 +635,12 @@ object DataSource extends Logging {
             "Hive built-in ORC data source must be used with Hive support enabled. " +
             "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
             "'native'")
+        } else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
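Combining this hunk with the error message shown earlier, the added branch plausibly reads as follows. This is a reconstruction for readability; the exact condition in the patch may differ.

} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
    provider1 == "com.databricks.spark.avro") {
  throw new AnalysisException(
    s"Failed to find data source: ${provider1.toLowerCase(Locale.ROOT)}. " +
    "AVRO is built-in data source since Spark 2.4. Please deploy the application " +
    "as per https://spark.apache.org/docs/latest/avro-data-source.html#deploying")
}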
Contributor

Do we have the same check for Kafka?

@cloud-fan
Contributor

If we all agree this Databricks mapping is not reasonable, I think it's OK to have this inconsistency and remove the mapping for CSV in 3.0.

It's weird to make the same mistake just to keep things consistent (again, assuming we agree this is a mistake).

@gatorsmile
Member

For details, see the discussion in the JIRA https://issues.apache.org/jira/browse/SPARK-24924

@gengliangwang
Member Author

gengliangwang commented Aug 16, 2018

CSV is loaded by default, while Avro is not, so having a backward-compatibility mapping only for CSV still makes sense.
But to keep things consistent, let's remove the mapping for CSV in 3.0.
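Concretely, the CSV mapping means the legacy Databricks name still resolves to the built-in source. A small hedged illustration, assuming an active SparkSession named spark and a placeholder path:

// Both reads resolve to Spark's built-in CSV source, thanks to the
// long-standing backward-compatibility mapping for the legacy name.
val viaBuiltinName = spark.read
  .format("csv")
  .option("header", "true")
  .load("/tmp/people.csv")

val viaLegacyName = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/tmp/people.csv")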

@SparkQA

SparkQA commented Aug 16, 2018

Test build #94841 has finished for PR 22119 at commit 656790e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

This particular inconsistency could confuse users, because CSV's mapping has existed for a long time. I think a configuration makes this safer, since I believe both approaches make sense.

@gengliangwang
Member Author

I am not sure how useful the configuration for Avro would be.

For the Hive table example @dongjoon-hyun mentioned in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16570702&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16570702, it seems the table should either keep using the Databricks package or have its table property set manually. Otherwise, the org.apache.spark.sql.avro package may someday change the previous behavior and cause a regression.

@tgravescs
Contributor

Sorry, I'm a bit confused by what is going on here. It looks like you just reverted the change. I thought we were simply adding a config so it's configurable whether the entry is in the mapping table or not? This gives some backward compatibility with the Hive table provider, but allows users to turn it off and use their own version of the Avro package.

I don't necessarily agree with just reverting this either. How do users then update the Hive provider?

See https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16571908&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16571908 for my thoughts on our options.

@gengliangwang
Member Author

gengliangwang commented Aug 16, 2018

@tgravescs I saw your comments. I just feel we can make it simpler by reverting it.
For Hive tables that used Databricks spark-avro, the tables can still use the Databricks package (since the built-in spark-avro is not loaded by default), or be manually migrated to the built-in one, which makes more sense.

@gengliangwang
Member Author

gengliangwang commented Aug 16, 2018

But it seems that creating a configuration makes everyone happy...
I will wait another day to gather more thoughts before making further code changes.

@tgravescs
Contributor

How do users manually migrate and keep compatibility? That is the problem I have. I am all for reverting if we have an easy way for users to migrate to the internal one.

Note that one of the problems is that if users change a table's provider property from databricks.avro to plain Spark avro, then all the older versions of Spark can't read that table. When you are dealing with a multi-tenant environment, that doesn't work. You have to take a more phased approach until all the users get onto the newer versions of Spark that support internal Avro. This mapping would let people choose between the internal version and the existing Databricks avro, and it works across all Spark versions in use.
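To make the compatibility problem concrete: Spark records a table's provider in the metastore under the table property spark.sql.sources.provider, so rewriting it is a one-way door for older clients. In the hedged sketch below, the table name is a placeholder and the ALTER statement is hypothetical; whether Spark allows altering this property directly may vary by version.

// Inspect how the provider is recorded for a table; it is stored in the
// spark.sql.sources.provider table property and shown in the output.
spark.sql("DESCRIBE TABLE EXTENDED events").show(truncate = false)

// Hypothetical one-shot migration: once the provider is rewritten to the
// built-in module, Spark versions without built-in avro can no longer read
// the table; hence the phased approach described above.
spark.sql(
  "ALTER TABLE events SET TBLPROPERTIES " +
  "('spark.sql.sources.provider' = 'org.apache.spark.sql.avro')")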

@dongjoon-hyun
Member

dongjoon-hyun commented Aug 16, 2018

+1 for @tgravescs's comments. In terms of usability, the mapping and configuration will be easier for most customers.

Regarding the following comment from @gengliangwang, technically there are no published Databricks avro artifacts for Spark 2.4 (master branch) as of today. I assume that @gengliangwang will release one on the same day as Apache Spark 2.4, but it would be great if we did not rely on that kind of undesirable assumption, which is beyond the Apache community's control.

For Hive tables that used Databricks spark-avro, the tables can still use the Databricks package (since the built-in spark-avro is not loaded by default)

Additionally, the third-party spark-avro will go into maintenance mode like spark-csv, and Spark 3.0 may want to read old spark-avro-generated tables.

@gengliangwang
Member Author

@tgravescs @dongjoon-hyun Thanks for the explanation. We should add a configuration instead of reverting.
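For reference, such a switch would typically be declared as a legacy flag in SQLConf (a Spark-internal API). The sketch below assumes a config name modeled on Spark's legacy-flag convention; see the follow-up PR for the actual change.

import org.apache.spark.sql.internal.SQLConf

// Sketch of a SQLConf entry gating the legacy-name mapping. The config name
// is an assumption modeled on Spark's legacy-flag convention.
val REPLACE_DATABRICKS_SPARK_AVRO =
  SQLConf.buildConf("spark.sql.legacy.replaceDatabricksSparkAvro.enabled")
    .doc("If enabled, the data source provider com.databricks.spark.avro is " +
      "mapped to the built-in Avro data source module for backward " +
      "compatibility.")
    .booleanConf
    .createWithDefault(true)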

@gengliangwang
Member Author

Closing this one and opening #22133.

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…uilt-in module configurable

In https://issues.apache.org/jira/browse/SPARK-24924, the data source provider com.databricks.spark.avro is mapped to the new package org.apache.spark.sql.avro.

As per the discussion in the [Jira](https://issues.apache.org/jira/browse/SPARK-24924) and PR apache#22119, we should make the mapping configurable.

This PR also improves the error message shown when the Avro/Kafka data source is not found.

Unit test

Closes apache#22133 from gengliangwang/configurable_avro_mapping.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Xiao Li <[email protected]>
(cherry picked from commit ac0174e)

RB=1526614
BUG=LIHADOOP-43392
R=fli,mshen,yezhou,edlu
A=fli