
[WIP][SPARK-25129][SQL] Revert mapping com.databricks.spark.avro to org.apache.spark.sql.avro #22119


Closed

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

In https://issues.apache.org/jira/browse/SPARK-24924, the data source provider com.databricks.spark.avro is mapped to the new package org.apache.spark.sql.avro.

Avro is an external module and is not loaded by default, so we should not prevent users from using "com.databricks.spark.avro".
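For context, the mapping in question lives in Spark's data source resolution path. The following is a minimal Scala sketch of the idea, with simplified names and structure; it is an illustration of the mechanism, not the verbatim Spark source.

// Minimal sketch of the legacy-name mapping consulted during data source
// resolution (simplified; not the verbatim Spark source).
object DataSourceLookup {
  // SPARK-24924 added the avro entry below; this PR proposes removing it so
  // that "com.databricks.spark.avro" is no longer redirected to the
  // built-in module.
  private val backwardCompatibilityMap: Map[String, String] = Map(
    "com.databricks.spark.csv" ->
      "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
    "com.databricks.spark.avro" ->
      "org.apache.spark.sql.avro.AvroFileFormat"
  )

  // Resolve a user-supplied provider name to the class name that backs it.
  def resolveProvider(provider: String): String =
    backwardCompatibilityMap.getOrElse(provider, provider)
}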

How was this patch tested?

Unit test

throw new AnalysisException(
  s"Failed to find data source: ${provider1.toLowerCase(Locale.ROOT)}. " +
  "AVRO is built-in data source since Spark 2.4. Please deploy the application " +
  "as per https://spark.apache.org/docs/latest/avro-data-source.html#deploying")
Member Author

I am writing documentation for the Avro data source. Let's merge this PR after the README is done.

@gengliangwang
Member Author

@HyukjinKwon
Member

Sorry if I missed some comments somewhere, but just for clarification: should we do the same for CSV in 3.0.0? The inconsistency should also be taken into account. Actually, a configuration makes more sense to me; we could then remove the mapping in 3.0.0.

@@ -503,7 +495,7 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     // get the same values back.
     withTempPath { tempDir =>
       val name = "AvroTest"
-      val namespace = "org.apache.spark.avro"
+      val namespace = "com.databricks.spark.avro"
Contributor
Why change the namespace in the test?

@@ -637,6 +635,12 @@ object DataSource extends Logging {
             "Hive built-in ORC data source must be used with Hive support enabled. " +
             "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
             "'native'")
+        } else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
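Combining this hunk with the error message shown earlier, the added branch plausibly reads as follows. This is a reconstruction for readability; the exact condition in the patch may differ.

} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
    provider1 == "com.databricks.spark.avro") {
  throw new AnalysisException(
    s"Failed to find data source: ${provider1.toLowerCase(Locale.ROOT)}. " +
    "AVRO is built-in data source since Spark 2.4. Please deploy the application " +
    "as per https://spark.apache.org/docs/latest/avro-data-source.html#deploying")
}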
Contributor

Do we have the same check for Kafka?

@cloud-fan
Contributor

If we all agree this Databricks mapping is not reasonable, I think it's OK to have this inconsistency and remove the mapping for CSV in 3.0.

It's weird to make the same mistake just to keep things consistent (again, assuming we agree this is a mistake).

@gatorsmile
Member

For details, see the discussion in the JIRA https://issues.apache.org/jira/browse/SPARK-24924

@gengliangwang
Member Author

gengliangwang commented Aug 16, 2018

CSV is loaded by default, while Avro is not, so having a backward-compatibility mapping only for CSV still makes sense.
But to keep things consistent, let's remove the mapping for CSV in 3.0.
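Concretely, the CSV mapping means the legacy Databricks name still resolves to the built-in source. A small hedged illustration, assuming an active SparkSession named spark and a placeholder path:

// Both reads resolve to Spark's built-in CSV source, thanks to the
// long-standing backward-compatibility mapping for the legacy name.
val viaBuiltinName = spark.read
  .format("csv")
  .option("header", "true")
  .load("/tmp/people.csv")

val viaLegacyName = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/tmp/people.csv")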

@SparkQA

SparkQA commented Aug 16, 2018

Test build #94841 has finished for PR 22119 at commit 656790e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

This particular inconsistency could confuse users, because CSV's mapping has existed for a long time. I think a configuration makes this safer, since I believe both approaches make sense.

@gengliangwang
Member Author

I am not sure how useful the configuration for Avro would be.

For the Hive table example @dongjoon-hyun mentioned in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16570702&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16570702, it seems the table should either keep using the Databricks package or have its table property set manually. Otherwise, the org.apache.spark.sql.avro package may someday change the previous behavior and cause a regression.

@tgravescs
Contributor

Sorry, I'm a bit confused by what is going on here. It looks like you just reverted the change. I thought we were simply adding a config so it's configurable whether the entry is in the mapping table or not? This gives some backward compatibility with the Hive table provider, but allows users to turn it off and use their own version of the Avro package.

I don't necessarily agree with just reverting this either. How do users then update the Hive provider?

See https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16571908&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16571908 for my thoughts on our options.

@gengliangwang
Member Author

gengliangwang commented Aug 16, 2018

@tgravescs I saw your comments. I just feel we can make it simpler by reverting it.
For Hive tables that used Databricks spark-avro, the tables can still use the Databricks package (since the built-in spark-avro is not loaded by default), or be manually migrated to the built-in one, which makes more sense.

@gengliangwang
Member Author

gengliangwang commented Aug 16, 2018

But it seems that creating a configuration makes everyone happy...
I will wait another day to gather more thoughts before making further code changes.

@tgravescs
Contributor

How do users manually migrate and keep compatibility? That is the problem I have. I am all for reverting if we have an easy way for users to migrate to the internal one.

Note that one of the problems is that if users change a table's provider property from databricks.avro to plain Spark avro, then all the older versions of Spark can't read that table. When you are dealing with a multi-tenant environment, that doesn't work. You have to take a more phased approach until all the users get onto the newer versions of Spark that support internal Avro. This mapping would let people choose between the internal version and the existing Databricks avro, and it works across all Spark versions in use.
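To make the compatibility problem concrete: Spark records a table's provider in the metastore under the table property spark.sql.sources.provider, so rewriting it is a one-way door for older clients. In the hedged sketch below, the table name is a placeholder and the ALTER statement is hypothetical; whether Spark allows altering this property directly may vary by version.

// Inspect how the provider is recorded for a table; it is stored in the
// spark.sql.sources.provider table property and shown in the output.
spark.sql("DESCRIBE TABLE EXTENDED events").show(truncate = false)

// Hypothetical one-shot migration: once the provider is rewritten to the
// built-in module, Spark versions without built-in avro can no longer read
// the table; hence the phased approach described above.
spark.sql(
  "ALTER TABLE events SET TBLPROPERTIES " +
  "('spark.sql.sources.provider' = 'org.apache.spark.sql.avro')")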

@dongjoon-hyun
Member

dongjoon-hyun commented Aug 16, 2018

+1 for @tgravescs's comments. In terms of usability, the mapping and configuration will be easier for most customers.

Regarding the following comment from @gengliangwang, technically there are no published Databricks avro artifacts for Spark 2.4 (master branch) as of today. I assume that @gengliangwang will release one on the same day as Apache Spark 2.4, but it would be great if we did not rely on that kind of undesirable assumption, which is beyond the Apache community's control.

For Hive tables that used Databricks spark-avro, the tables can still use the Databricks package (since the built-in spark-avro is not loaded by default)

Additionally, the third-party spark-avro will go into maintenance mode like spark-csv, and Spark 3.0 may want to read old spark-avro-generated tables.

@gengliangwang
Member Author

@tgravescs @dongjoon-hyun Thanks for the explanation. We should add a configuration instead of reverting.
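For reference, such a switch would typically be declared as a legacy flag in SQLConf (a Spark-internal API). The sketch below assumes a config name modeled on Spark's legacy-flag convention; see the follow-up PR for the actual change.

import org.apache.spark.sql.internal.SQLConf

// Sketch of a SQLConf entry gating the legacy-name mapping. The config name
// is an assumption modeled on Spark's legacy-flag convention.
val REPLACE_DATABRICKS_SPARK_AVRO =
  SQLConf.buildConf("spark.sql.legacy.replaceDatabricksSparkAvro.enabled")
    .doc("If enabled, the data source provider com.databricks.spark.avro is " +
      "mapped to the built-in Avro data source module for backward " +
      "compatibility.")
    .booleanConf
    .createWithDefault(true)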

@gengliangwang
Member Author

Closing this one and opening #22133.

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…uilt-in module configurable

In https://issues.apache.org/jira/browse/SPARK-24924, the data source provider com.databricks.spark.avro is mapped to the new package org.apache.spark.sql.avro.

As per the discussion in the [Jira](https://issues.apache.org/jira/browse/SPARK-24924) and PR apache#22119, we should make the mapping configurable.

This PR also improves the error message shown when the Avro/Kafka data source is not found.

Unit test

Closes apache#22133 from gengliangwang/configurable_avro_mapping.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Xiao Li <[email protected]>
(cherry picked from commit ac0174e)

RB=1526614
BUG=LIHADOOP-43392
R=fli,mshen,yezhou,edlu
A=fli