[SPARK-4636] [SQL] Cluster By & Distribute By should follows the Hive behavior #3496
Test build #23919 has started for PR 3496 at commit
Test build #23919 has finished for PR 3496 at commit
Test FAILed.
Test build #23934 has started for PR 3496 at commit
Test build #23934 has finished for PR 3496 at commit
Test PASSed.
This is a good catch, and it would be good to match Hive's semantics here. Perhaps the right thing to do, though, is to change which logical operators are generated in the HiveQL parser instead of changing the planner. It seems a little odd to me that
Also, it would be good to add a comment explaining that we choose a suboptimal plan here in order to respect Hive's semantics. Otherwise future developers may mistake this for a bug.
#3386 has been merged :)
062d82d to 0fd4176
Test build #24932 has started for PR 3496 at commit
Test build #24932 has finished for PR 3496 at commit
Test PASSed.
Thank you @marmbrus & @liancheng, I've updated the code and removed the
@marmbrus just in case you missed this. :)
I do not see any total-order guarantee mentioned in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy. Here is what I found...
For a particular case, whether total order is guaranteed depends on the partitioner. The default partitioner is [...] I do not think we need to make the change.
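To make the point about the partitioner concrete, here is a minimal Python sketch, not Hive or Spark code. It assumes a toy hash partitioner (`key % num_partitions`, a hypothetical stand-in for whatever hash function the engine uses) followed by a sort inside each partition. Every key lands in exactly one partition and each partition is sorted, yet the concatenated output is not totally ordered.

```python
def hash_partition(keys, num_partitions):
    """Assign each key to a partition by key modulo the partition count.

    This is a toy stand-in for a hash partitioner, chosen for illustration.
    """
    parts = [[] for _ in range(num_partitions)]
    for k in keys:
        parts[k % num_partitions].append(k)
    return parts


def sort_within_partitions(parts):
    """Sort every partition independently, as a per-reducer sort would."""
    return [sorted(p) for p in parts]


keys = [5, 2, 9, 4, 7, 2, 8, 1]
parts = sort_within_partitions(hash_partition(keys, 2))

# Reading the partitions back in partition order does not give a global sort.
flattened = [k for p in parts for k in p]
is_total_order = flattened == sorted(flattened)

print(parts)           # [[2, 2, 4, 8], [1, 5, 7, 9]]
print(is_total_order)  # False
```

Both copies of key 2 end up in the same partition, which is the distribution guarantee; the ordering guarantee only holds inside a partition.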
Oh, thank you @yhuai for the correction. I realized that the Hive
Use RangePartitioning (SortPartition) instead of HashPartitioning (Repartition) for Cluster By and Distribute By, as they require that the keys (sort keys) do not overlap among the output partitions.
http://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by
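The "no key overlap" property that the description attributes to range partitioning can be illustrated with a small Python sketch. This is not Spark's implementation: the boundary list `bounds` is hand-picked here as an assumption, and `range_partition`, `hash_partition`, and `ranges_overlap` are hypothetical helper names.

```python
from bisect import bisect_right


def range_partition(keys, bounds):
    """Place each key in the partition whose key range contains it.

    `bounds` is a sorted list of split points; keys <= bounds[0] go to
    partition 0, and so on. The split points are hand-picked for this sketch.
    """
    parts = [[] for _ in range(len(bounds) + 1)]
    for k in keys:
        # bisect_right counts the split points <= k; bisect_left would make
        # each partition include its upper split point instead.
        parts[bisect_right(bounds, k)].append(k)
    return parts


def hash_partition(keys, num_partitions):
    """Toy hash partitioner (key modulo partition count), for comparison."""
    parts = [[] for _ in range(num_partitions)]
    for k in keys:
        parts[k % num_partitions].append(k)
    return parts


def ranges_overlap(parts):
    """Return True if the [min, max] key spans of any two partitions overlap."""
    spans = sorted((min(p), max(p)) for p in parts if p)
    return any(prev[1] >= cur[0] for prev, cur in zip(spans, spans[1:]))


keys = [5, 2, 9, 4, 7, 2, 8, 1]
by_range = range_partition(keys, [4])
by_hash = hash_partition(keys, 2)

print(ranges_overlap(by_range))  # False
print(ranges_overlap(by_hash))   # True
```

With non-overlapping ranges, sorting each partition and reading the partitions in order yields a globally sorted result, which hash partitioning does not give.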