
[SPARK-4636] [SQL] Cluster By & Distribute By should follows the Hive behavior #3496


Closed
chenghao-intel wants to merge 1 commit

Conversation

chenghao-intel
Contributor

Use RangePartitioning (SortPartition) instead of HashPartitioning (Repartition) for Cluster By and Distribute By, as they require that the keys (sort keys) do not overlap among the output partitions.

http://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by
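
To make the clauses under discussion concrete, here is a minimal Scala sketch of how they could be issued against a Spark 1.x HiveContext. The table `src` and its `key`/`value` columns are illustrative assumptions, not part of this PR.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ClusterByDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cluster-by-demo").setMaster("local[2]"))
    val hiveContext = new HiveContext(sc)

    // DISTRIBUTE BY: rows with the same `key` land in the same output partition;
    // nothing is implied about ordering inside a partition.
    val distributed = hiveContext.sql("SELECT key, value FROM src DISTRIBUTE BY key")

    // CLUSTER BY key is shorthand for DISTRIBUTE BY key SORT BY key:
    // the same partitioning plus a per-partition sort on `key`.
    val clustered = hiveContext.sql("SELECT key, value FROM src CLUSTER BY key")

    distributed.collect().foreach(println)
    clustered.collect().foreach(println)
  }
}
```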

@SparkQA

SparkQA commented Nov 27, 2014

Test build #23919 has started for PR 3496 at commit e57e715.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 27, 2014

Test build #23919 has finished for PR 3496 at commit e57e715.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23919/
Test FAILed.

@SparkQA

SparkQA commented Nov 28, 2014

Test build #23934 has started for PR 3496 at commit 062d82d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 28, 2014

Test build #23934 has finished for PR 3496 at commit 062d82d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MatrixFactorizationModel(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23934/
Test PASSed.

@marmbrus
Contributor

This is a good catch, and it would be good to match Hive's semantics here. Perhaps the right thing to do, though, is to change which logical operators are generated by the HiveQL parser, instead of changing the planner. It seems a little odd to me that Repartition would not behave like a Spark repartition.
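
For illustration only, a sketch of the direction suggested above: keep Repartition with Spark's repartition semantics, and have the HiveQL parser emit a dedicated logical operator for the per-partition sort. The field names here are assumptions; the test output later in this thread shows the PR ultimately adds a node named SortPartitions.

```scala
import org.apache.spark.sql.catalyst.expressions.SortOrder
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

/**
 * Sorts the rows of every partition by `sortExpressions`, without imposing a
 * total (global) order across partitions -- the Hive SORT BY / CLUSTER BY
 * contract, as opposed to ORDER BY.
 */
case class SortPartitions(sortExpressions: Seq[SortOrder], child: LogicalPlan)
  extends UnaryNode {
  override def output = child.output
}
```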

@chenghao-intel
Contributor Author

@marmbrus, you're right. I will update the code after #3386 is merged.

@liancheng
Contributor

Also, it would be good to add a comment explaining that the reason we choose a suboptimal plan here is that we want to respect Hive's semantics. Otherwise, future developers may mistake this for a bug.

@marmbrus
Contributor

#3386 has been merged :)

chenghao-intel changed the title from "[SPARK-4636] [SQL] WIP: Cluster By & Distribute By doesn't follow the result of Hive" to "[SPARK-4636] [SQL] Cluster By & Distribute By should follows the Hive behavior" on Dec 31, 2014
@SparkQA

SparkQA commented Dec 31, 2014

Test build #24932 has started for PR 3496 at commit 0fd4176.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 31, 2014

Test build #24932 has finished for PR 3496 at commit 0fd4176.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SortPartitions(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24932/
Test PASSed.

@chenghao-intel
Contributor Author

Thank you @marmbrus & @liancheng, I've updated the code and removed the WIP from the PR title.

@chenghao-intel
Contributor Author

@marmbrus just in case you missed this. :)

@yhuai
Contributor

yhuai commented Jan 8, 2015

I do not see any total order guarantee mentioned in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy. Here is what I found...

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

Cluster By is a short-cut for both Distribute By and Sort By.

Hive supports SORT BY which sorts the data per reducer.

For a particular case, whether total order is guaranteed depends on the partitioner. The default partitioner is org.apache.hadoop.hive.ql.io.DefaultHivePartitioner (https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/DefaultHivePartitioner.java), which is a hash partitioner.

I do not think we need to make the change.
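
For readers following along, a small self-contained sketch of the Hadoop-style hash routing that a partitioner like DefaultHivePartitioner relies on (the exact key masking is assumed from Hadoop's HashPartitioner); the keys and reducer count are illustrative.

```scala
object HashRoutingDemo {
  // A row is routed to the reducer given by the key's hash code, masked to a
  // non-negative value, modulo the number of reducers. Equal keys therefore
  // always co-locate, but there is no range split across reducers and hence
  // no total order.
  def reducerFor(key: Any, numReducers: Int): Int =
    (key.hashCode() & Integer.MAX_VALUE) % numReducers

  def main(args: Array[String]): Unit = {
    val numReducers = 3
    // Adjacent keys may land on different reducers, and the keys on one
    // reducer need not form a contiguous range.
    Seq("a", "b", "c", "d", "e").foreach { k =>
      println(s"key=$k -> reducer ${reducerFor(k, numReducers)}")
    }
  }
}
```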

@chenghao-intel
Contributor Author

Oh, thank you @yhuai for the correction. After some investigation, I realized that the Hive mapred.reduce.tasks setting will not work in local mode.
