
[SPARK-4636] [SQL] Cluster By & Distribute By should follows the Hive behavior #3496


Closed
chenghao-intel wants to merge 1 commit

Conversation

chenghao-intel
Contributor

Use RangePartitioning (SortPartition) instead of HashPartitioning (Repartition) for Cluster By and Distribute By, as they require that the keys (sort keys) do not overlap among the output partitions.

http://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by
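
To make the clauses under discussion concrete, here is a minimal Scala sketch of how they could be issued against a Spark 1.x HiveContext. The table `src` and its `key`/`value` columns are illustrative assumptions, not part of this PR.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ClusterByDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cluster-by-demo").setMaster("local[2]"))
    val hiveContext = new HiveContext(sc)

    // DISTRIBUTE BY: rows with the same `key` land in the same output partition;
    // nothing is implied about ordering inside a partition.
    val distributed = hiveContext.sql("SELECT key, value FROM src DISTRIBUTE BY key")

    // CLUSTER BY key is shorthand for DISTRIBUTE BY key SORT BY key:
    // the same partitioning plus a per-partition sort on `key`.
    val clustered = hiveContext.sql("SELECT key, value FROM src CLUSTER BY key")

    distributed.collect().foreach(println)
    clustered.collect().foreach(println)
  }
}
```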

@SparkQA

SparkQA commented Nov 27, 2014

Test build #23919 has started for PR 3496 at commit e57e715.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 27, 2014

Test build #23919 has finished for PR 3496 at commit e57e715.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23919/
Test FAILed.

@SparkQA

SparkQA commented Nov 28, 2014

Test build #23934 has started for PR 3496 at commit 062d82d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 28, 2014

Test build #23934 has finished for PR 3496 at commit 062d82d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MatrixFactorizationModel(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23934/
Test PASSed.

@marmbrus
Contributor

This is a good catch, and it would be good to match Hive's semantics here. Perhaps the right thing to do, though, is to change which logical operators are generated by the HiveQL parser, instead of changing the planner. It seems a little odd to me that Repartition would not behave like a Spark repartition.
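
For illustration only, a sketch of the direction suggested above: keep Repartition with Spark's repartition semantics, and have the HiveQL parser emit a dedicated logical operator for the per-partition sort. The field names here are assumptions; the test output later in this thread shows the PR ultimately adds a node named SortPartitions.

```scala
import org.apache.spark.sql.catalyst.expressions.SortOrder
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

/**
 * Sorts the rows of every partition by `sortExpressions`, without imposing a
 * total (global) order across partitions -- the Hive SORT BY / CLUSTER BY
 * contract, as opposed to ORDER BY.
 */
case class SortPartitions(sortExpressions: Seq[SortOrder], child: LogicalPlan)
  extends UnaryNode {
  override def output = child.output
}
```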

@chenghao-intel
Contributor Author

@marmbrus, you're right. I will update the code after #3386 is merged.

@liancheng
Contributor

Also, it would be good to add a comment explaining that the reason we choose a suboptimal plan here is that we want to respect Hive's semantics. Otherwise, future developers may mistake this for a bug.

@marmbrus
Contributor

#3386 has been merged :)

chenghao-intel changed the title from "[SPARK-4636] [SQL] WIP: Cluster By & Distribute By doesn't follow the result of Hive" to "[SPARK-4636] [SQL] Cluster By & Distribute By should follows the Hive behavior" on Dec 31, 2014
@SparkQA

SparkQA commented Dec 31, 2014

Test build #24932 has started for PR 3496 at commit 0fd4176.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 31, 2014

Test build #24932 has finished for PR 3496 at commit 0fd4176.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SortPartitions(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24932/
Test PASSed.

@chenghao-intel
Contributor Author

Thank you @marmbrus & @liancheng, I've updated the code and removed the WIP from the PR title.

@chenghao-intel
Contributor Author

@marmbrus just in case you missed this. :)

@yhuai
Contributor

yhuai commented Jan 8, 2015

I do not see any total order guarantee mentioned in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy. Here is what I found...

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

Cluster By is a short-cut for both Distribute By and Sort By.

Hive supports SORT BY which sorts the data per reducer.

For a particular case, whether total order is guaranteed depends on the partitioner. The default partitioner is org.apache.hadoop.hive.ql.io.DefaultHivePartitioner (https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/DefaultHivePartitioner.java), which is a hash partitioner.

I do not think we need to make the change.
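
For readers following along, a small self-contained sketch of the Hadoop-style hash routing that a partitioner like DefaultHivePartitioner relies on (the exact key masking is assumed from Hadoop's HashPartitioner); the keys and reducer count are illustrative.

```scala
object HashRoutingDemo {
  // A row is routed to the reducer given by the key's hash code, masked to a
  // non-negative value, modulo the number of reducers. Equal keys therefore
  // always co-locate, but there is no range split across reducers and hence
  // no total order.
  def reducerFor(key: Any, numReducers: Int): Int =
    (key.hashCode() & Integer.MAX_VALUE) % numReducers

  def main(args: Array[String]): Unit = {
    val numReducers = 3
    // Adjacent keys may land on different reducers, and the keys on one
    // reducer need not form a contiguous range.
    Seq("a", "b", "c", "d", "e").foreach { k =>
      println(s"key=$k -> reducer ${reducerFor(k, numReducers)}")
    }
  }
}
```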

@chenghao-intel
Contributor Author

Oh, thank you @yhuai for the correction. After some investigation, I realized that the Hive mapred.reduce.tasks setting will not work in local mode.
