[WIP][SPARK-25129][SQL] Revert mapping com.databricks.spark.avro to org.apache.spark.sql.avro #22119
Conversation
throw new AnalysisException(
  s"Failed to find data source: ${provider1.toLowerCase(Locale.ROOT)}. " +
  "AVRO is a built-in data source since Spark 2.4. Please deploy the application " +
  "as per https://spark.apache.org/docs/latest/avro-data-source.html#deploying")
I am creating documentation for the AVRO data source. Let's merge this PR after the README is done.
Sorry if I missed some comments somewhere, but just for clarification: should we do the same for CSV in 3.0.0? Inconsistency should also be taken into account. Actually, a configuration makes more sense to me, and we can remove it in 3.0.0.
@@ -503,7 +495,7 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
    // get the same values back.
    withTempPath { tempDir =>
      val name = "AvroTest"
-     val namespace = "org.apache.spark.avro"
+     val namespace = "com.databricks.spark.avro"
Why change the namespace in the test?
This just reverts the whole commit in #21878.
This line is from https://github.com/apache/spark/pull/21878/files#diff-9364b0610f92b3cc35a4bc43a80751bfL459
@@ -637,6 +635,12 @@ object DataSource extends Logging {
          "Hive built-in ORC data source must be used with Hive support enabled. " +
          "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
          "'native'")
+     } else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
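The diff is cut off here; below is a plausible reconstruction of the full added branch, combining this condition with the error message quoted earlier. The second half of the condition is an assumption based on the mapping this PR removes, and the snippet is a fragment of DataSource.lookupDataSource's if/else chain rather than standalone code:

```scala
} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
    provider1 == "com.databricks.spark.avro") {
  throw new AnalysisException(
    s"Failed to find data source: ${provider1.toLowerCase(Locale.ROOT)}. " +
      "AVRO is a built-in data source since Spark 2.4. Please deploy the application " +
      "as per https://spark.apache.org/docs/latest/avro-data-source.html#deploying")
// ... remaining branches of the provider lookup elided
```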
Do we have the same check for Kafka?
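(The follow-up #22133 commit message below mentions improving the error message for both Avro and Kafka; a hedged sketch of what the analogous Kafka guard could look like, approximated rather than quoted from Spark's source:)

```scala
} else if (provider1.toLowerCase(Locale.ROOT) == "kafka") {
  throw new AnalysisException(
    s"Failed to find data source: $provider1. Please deploy the application " +
      "as per the deployment section of the " +
      "\"Structured Streaming + Kafka Integration Guide\".")
```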
If we all agree this Databricks mapping is not reasonable, I think it's OK to have this inconsistency and to remove the mapping for CSV in 3.0. It's weird to make the same mistake just to keep things consistent (again, assuming we agree this is a mistake).
For details, see the discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-24924
CSV is loaded by default, while AVRO is not, so keeping a backward-compatibility mapping only for CSV still makes sense.
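For reference, the CSV mapping being discussed lives in DataSource's backward-compatibility table; a simplified sketch (the real map carries more legacy entries than shown here):

```scala
import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat

// Simplified sketch of DataSource.backwardCompatibilityMap: legacy provider
// names are rewritten to the classes that replaced them at lookup time.
val csv = classOf[CSVFileFormat].getCanonicalName
val backwardCompatibilityMap: Map[String, String] = Map(
  "com.databricks.spark.csv" -> csv
  // ... other legacy mappings (ORC, JDBC, text, ...) elided
)
```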
Test build #94841 has finished for PR 22119 at commit
This particular inconsistency could confuse users, because the CSV mapping has existed for a long time. I think a configuration makes this safer, since I believe both sides have a point.
I am not sure how useful the configuration for AVRO is. For the Hive table example @dongjoon-hyun mentioned in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16570702&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16570702, it seems that the table should keep using the one from the Databricks repo, or manually set the table property. Otherwise some day the package
Sorry, I'm a bit confused by what is going on here. It looks like you just reverted the change. I thought we were simply adding a config so it's configurable whether the mapping is in the table or not? This gives some backwards compatibility with the Hive table provider, but allows users to turn it off and use their own version of the Avro package. I don't necessarily agree with just reverting this either. How do users then update the Hive provider? See https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16571908&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16571908 for my thoughts on our options.
@tgravescs I saw your comments. I just feel that we can make it simpler by reverting it.
But it seems that creating a configuration makes everyone happy...
How do users manually migrate while keeping compatibility? That is the problem I have; I am all for reverting if we have an easy way for users to migrate to the internal one. Note that one of the problems is that if users change the table's provider property from the Databricks Avro to the built-in Spark Avro, then all the older versions of Spark can't read that table. When you are dealing with a multi-tenant environment, that doesn't work. You have to take a more phased approach until all the users get onto the newer versions of Spark that support the internal Avro. This mapping would allow people to choose between the internal version and the existing Databricks Avro, and it works with all versions of Spark being used.
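To make the compatibility problem concrete, a hypothetical illustration (the table name is a placeholder and `spark` is assumed to be an existing session, e.g. in spark-shell; the Provider row is how Spark surfaces the persisted source name):

```scala
// A table created with the external package persists its source name in the
// metastore; DESCRIBE FORMATTED surfaces it in the "Provider" row.
spark.sql("DESCRIBE FORMATTED events")
  .filter("col_name = 'Provider'")
  .show(truncate = false)

// Older Spark versions resolve that exact provider string, so rewriting it
// from "com.databricks.spark.avro" to "avro" locks them out of the table.
```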
+1 for @tgravescs's comments. In terms of usability, the mapping plus a configuration will be easier for most customers. Regarding the following comment from @gengliangwang: technically, there is no published Databricks Avro artifact for Spark 2.4 (the master branch) as of today. I assume that @gengliangwang will release it on the same day as Apache Spark 2.4, but it would be great if we didn't rely on that kind of undesirable assumption, which is beyond the Apache community's control.
Additionally, 3rd party
@tgravescs @dongjoon-hyun Thanks for the explanation. We should add a configuration instead of reverting. |
Closing this one and opening #22133.
…uilt-in module configurable

In https://issues.apache.org/jira/browse/SPARK-24924, the data source provider com.databricks.spark.avro is mapped to the new package org.apache.spark.sql.avro. As per the discussion in the [Jira](https://issues.apache.org/jira/browse/SPARK-24924) and PR apache#22119, we should make the mapping configurable. This PR also improves the error message when the Avro/Kafka data source is not found.

Unit test

Closes apache#22133 from gengliangwang/configurable_avro_mapping.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Xiao Li <[email protected]>
(cherry picked from commit ac0174e)
RB=1526614 BUG=LIHADOOP-43392 R=fli,mshen,yezhou,edlu A=fli
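For completeness, a sketch of how the configuration introduced in #22133 would be used. The flag name reflects my understanding of that follow-up and should be verified against the released docs; the path is a placeholder:

```scala
// With the legacy mapping enabled (the default), "com.databricks.spark.avro"
// is rewritten to the built-in org.apache.spark.sql.avro implementation.
// Disabling it lets an externally deployed package claim the name again.
spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "false")
val df = spark.read.format("com.databricks.spark.avro").load("/tmp/in.avro")
```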
What changes were proposed in this pull request?
In https://issues.apache.org/jira/browse/SPARK-24924, the data source provider com.databricks.spark.avro was mapped to the new package org.apache.spark.sql.avro.
Avro is an external module and is not loaded by default, so we should not prevent users from using "com.databricks.spark.avro".
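In other words, with the mapping reverted, a user who deploys the external package keeps working code like the following (a hedged illustration; the path is a placeholder):

```scala
// With the com.databricks:spark-avro package on the classpath, the fully
// qualified name resolves to the external implementation, untouched by Spark.
val df = spark.read
  .format("com.databricks.spark.avro")
  .load("/tmp/data.avro")
```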
How was this patch tested?
Unit test