
[SPARK-36452][SQL]: Add the support in Spark for having group by map datatype column for the scenario that works in Hive #33679


Closed
wants to merge 3 commits

Conversation

@SaurabhChawla100 (Contributor) commented Aug 8, 2021

What changes were proposed in this pull request?

Add support in Spark for GROUP BY on a map datatype column, covering the scenario that works in Hive.
In Hive this scenario works fine:

describe extended complex2;
OK
id                  string 
c1                  map<int, string>   
Detailed Table Information Table(tableName:complex2, dbName:default, owner:abc, createTime:1627994412, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:c1, type:map<int,string>, comment:null)], location:/user/hive/warehouse/complex2, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1

select * from complex2;
OK
1 {1:"u"}
2 {1:"u",2:"uo"}
1 {1:"u",2:"uo"}
Time taken: 0.363 seconds, Fetched: 3 row(s)
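
For reference, an equivalent table and data set can be reproduced from spark-shell roughly as follows (a hypothetical setup; the original Hive DDL is not included in this PR):

// Hypothetical reproduction of the table above; the original DDL is not shown in the PR.
spark.sql("CREATE TABLE complex2 (id STRING, c1 MAP<INT, STRING>)")
spark.sql("INSERT INTO complex2 VALUES ('1', map(1, 'u'))")
spark.sql("INSERT INTO complex2 VALUES ('2', map(1, 'u', 2, 'uo'))")
spark.sql("INSERT INTO complex2 VALUES ('1', map(1, 'u', 2, 'uo'))")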

Working scenario in Hive:

select id, c1, count(*) from complex2 group by id, c1;
OK
1 {1:"u"} 1
1 {1:"u",2:"uo"} 1
2 {1:"u",2:"uo"} 1
Time taken: 1.621 seconds, Fetched: 3 row(s)

Failing scenario in Hive:
When a map type is present in an aggregate expression (Hive rejects this because map types are not comparable):
select id, max(c1), count(*) from complex2 group by id, c1; 
FAILED: UDFArgumentTypeException Cannot support comparison of map<> type or complex type containing map<>.

But in Spark, GROUP BY on the map column fails even in the scenario where the map column is only used in the SELECT without any aggregation, i.e. the case that works in Hive:

scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
org.apache.spark.sql.AnalysisException: expression spark_catalog.default.complex2.`c1` cannot be used as a grouping expression because its data type map<int,string> is not an orderable data type.;
Aggregate [id#1, c1#2], [id#1, c1#2, count(1) AS count(1)#3L]
+- SubqueryAlias spark_catalog.default.complex2
 +- HiveTableRelation [`default`.`complex2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#1, c1#2], Partition Cols: []]
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
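
For context, this error is raised during analysis: CheckAnalysis validates every grouping expression and rejects any whose data type RowOrdering does not consider orderable. Roughly paraphrased (simplified, not the verbatim source):

// Paraphrase of the grouping-expression check in Spark's CheckAnalysis; simplified.
def checkValidGroupingExprs(expr: Expression): Unit = {
  // A grouping expression must have an orderable (sortable) data type,
  // and MapType is not orderable, hence the AnalysisException above.
  if (!RowOrdering.isOrderable(expr.dataType)) {
    failAnalysis(s"expression ${expr.sql} cannot be used as a grouping expression " +
      s"because its data type ${expr.dataType.catalogString} is not an orderable data type.")
  }
}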

Why are the changes needed?

This change is needed so that a grouping expression can have a map type as long as no aggregate expression references that map type. This helps users migrate from Hive to Spark.
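
To illustrate the rule (assumed behavior based on the description above, not output captured from the patch):

// Illustration of the proposed rule; behavior is assumed from the description above.
spark.sql("select id, c1, count(*) from complex2 group by id, c1")      // allowed: no aggregate references c1
spark.sql("select id, max(c1), count(*) from complex2 group by id, c1") // still rejected: max(c1) aggregates a map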

After the code change:

scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
+---+-----------------+--------+                                                
| id|               c1|count(1)|
+---+-----------------+--------+
|  1|         {1 -> u}|       1|
|  2|{1 -> u, 2 -> uo}|       1|
|  1|{1 -> u, 2 -> uo}|       1|
+---+-----------------+--------+

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test and also tested the scenario using spark-shell.
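
A sketch of what such a unit test could look like, in the style of Spark's SQL query suites (hypothetical; the PR's actual test is not reproduced here):

// Hypothetical sketch, not the PR's actual test. Assumes the usual QueryTest /
// SQLTestUtils helpers (test, sql, checkAnswer, withTable) and a default session.
test("SPARK-36452: group by map type column not referenced by any aggregate") {
  withTable("complex2") {
    sql("CREATE TABLE complex2 (id STRING, c1 MAP<INT, STRING>) USING parquet")
    sql("INSERT INTO complex2 VALUES ('1', map(1, 'u'))")
    sql("INSERT INTO complex2 VALUES ('2', map(1, 'u', 2, 'uo'))")
    sql("INSERT INTO complex2 VALUES ('1', map(1, 'u', 2, 'uo'))")
    checkAnswer(
      sql("SELECT id, c1, count(*) FROM complex2 GROUP BY id, c1"),
      Row("1", Map(1 -> "u"), 1L) ::
        Row("2", Map(1 -> "u", 2 -> "uo"), 1L) ::
        Row("1", Map(1 -> "u", 2 -> "uo"), 1L) :: Nil)
  }
}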

github-actions bot added the SQL label Aug 8, 2021
@AmplabJenkins commented

Can one of the admins verify this patch?

SaurabhChawla100 changed the title prefix from [Spark 36452][SQL] to [Spark-36452][SQL] Aug 8, 2021
SaurabhChawla100 changed the title prefix from [Spark-36452][SQL] to [SPARK-36452][SQL] Aug 8, 2021
@@ -97,13 +97,18 @@ object InterpretedOrdering {
 object RowOrdering extends CodeGeneratorWithInterpretedFallback[Seq[SortOrder], BaseOrdering] {

 /**
-  * Returns true iff the data type can be ordered (i.e. can be sorted).
+  * Returns true if the data type can be ordered (i.e. can be sorted).

Member commented (inline, on the doc comment change): iff is an abbreviation of if and only if

   */
- def isOrderable(dataType: DataType): Boolean = dataType match {
+ def isOrderable(dataType: DataType,
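
The new parameter list is truncated in the diff above. A plausible shape for the change, purely as a hypothetical sketch modeled on the existing RowOrdering.isOrderable (the flag name and default are assumptions, not the PR's actual code):

// Hypothetical sketch; the PR's real signature is truncated above. Conceptually,
// isOrderable gains a flag so GROUP BY validation can accept MapType while
// sort-based code paths keep rejecting it.
def isOrderable(dataType: DataType, allowMapType: Boolean = false): Boolean = dataType match {
  case NullType => true
  case _: AtomicType => true
  case struct: StructType => struct.fields.forall(f => isOrderable(f.dataType, allowMapType))
  case array: ArrayType => isOrderable(array.elementType, allowMapType)
  case _: MapType => allowMapType
  case udt: UserDefinedType[_] => isOrderable(udt.sqlType, allowMapType)
  case _ => false
}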
Member commented (on the new signature): Should we fix #31967 first?

SaurabhChawla100 (Contributor, Author) replied: @HyukjinKwon - Thanks for checking this PR. Yes, we can wait for PR #32552; the fix there will work with GROUP BY, ORDER BY, and PARTITION BY in window functions.

@c21 (Contributor) commented Aug 9, 2021

I thought @maropu is still working on this? (#32552)

@SaurabhChawla100 (Contributor, Author) commented:

> I thought @maropu is still working on this? (#32552)

I was not aware that there was already a JIRA for this map issue. Yes, PR #32552 will fix the use case that I am trying to address in this PR.

@github-actions bot commented
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label Nov 18, 2021
github-actions bot closed this Nov 19, 2021