[WIP][SPARK-4673][SQL] Optimizing limit using coalesce #3531
Conversation
Test build #23978 has started for PR 3531 at commit
Test build #23978 has finished for PR 3531 at commit
Test PASSed.
    iter.take(limit).map(row => (false, row.copy()))

    if (sortBasedShuffleOn) {
      child.execute().map(_.copy).coalesce(1).mapPartitions { iter =>
        iter.take(limit)
Can we move the map(_.copy) after take(limit)?
You do need to copy() before any take or collect operation, because Spark SQL reuses row objects, and these operations create arrays that end up all containing the same object.
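To illustrate the row-reuse behavior described above, here is a minimal, self-contained sketch; RowReuseDemo is a hypothetical example, not Spark code, and a mutable buffer stands in for the row object Spark SQL reuses:

    // Minimal sketch of the row-reuse pitfall: one mutable buffer stands in
    // for the row object that Spark SQL reuses across an iterator.
    object RowReuseDemo extends App {
      val buffer = Array(0)

      // Each element is the same buffer, mutated in place per "row".
      def rows = Iterator.range(0, 3).map { i => buffer(0) = i; buffer }

      // Without copying, every slot of the collected array aliases the buffer.
      println(rows.take(3).toArray.map(_(0)).mkString(","))               // 2,2,2

      // Copying each row before take materializes independent values.
      println(rows.map(_.clone).take(3).toArray.map(_(0)).mkString(","))  // 0,1,2
    }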
Will this actually always be faster? It seems like in some cases you are just eliminating a bunch of parallelism. Do you have some benchmarks?
Hi @marmbrus, the old version also reduces the parallelism to 1, via a ShuffledRDD; the difference is that this PR uses coalesce to avoid the shuffle.
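For readers following along, a rough sketch of the two strategies being compared; this is a simplification with illustrative names (limitViaShuffle, limitViaCoalesce), not the actual Spark operators:

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Old strategy (simplified): take per partition, shuffle the surviving
    // rows into a single partition, then take again. The shuffle writes and
    // reads every candidate row.
    def limitViaShuffle[T: ClassTag](rdd: RDD[T], limit: Int): RDD[T] =
      rdd.mapPartitions(_.take(limit))
        .repartition(1)
        .mapPartitions(_.take(limit))

    // This PR's strategy (simplified): coalesce to one partition with no
    // shuffle; a single task reads the upstream partitions directly.
    def limitViaCoalesce[T: ClassTag](rdd: RDD[T], limit: Int): RDD[T] =
      rdd.coalesce(1).mapPartitions(_.take(limit))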
Is there an assumption that the LIMIT number is quite small?
I tested with a limit of 5000, and I am running more tests. I do not think the limit number has a big effect.
@scwf I am not sure if this is a good idea in general. Think about a highly selective filter, e.g.

    select * from every_body_in_the_world where company = "Databricks" limit 5;

In this case, with your patch this query is going to run slowly on a single thread to scan all the data.
Yes, I also realize this; it will not always be faster, since in a case like that a single task ends up scanning all the data.
I think it is too risky to do it this way right now. It seems to me the advantage of coalesce only shows up when you have a huge number of partitions without a highly selective filter. Maybe we can have two variants of Limit and, in the optimizer, pick the coalesce one if there is no filter at all?
@rxin Yes, we cannot just switch to coalesce here. I agree with you about the situations where coalesce has an advantage, and I will try to do the optimization with coalesce for the no-filter case. Thanks :)
BTW, it doesn't have to be a new operator; we can also just add a flag to Limit.
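A minimal sketch of that flag idea, using hypothetical stand-in plan classes rather than the actual Catalyst types: an optimizer step enables the coalesce strategy only when no Filter appears below the Limit.

    // Hypothetical stand-in plan nodes; not the real Catalyst classes.
    sealed trait Plan { def children: Seq[Plan] }
    case class Scan(table: String) extends Plan { def children = Nil }
    case class Filter(condition: String, child: Plan) extends Plan { def children = Seq(child) }
    case class Limit(n: Int, child: Plan, useCoalesce: Boolean = false) extends Plan {
      def children = Seq(child)
    }

    // Optimizer step (sketch): turn the flag on only for filter-free subtrees.
    object ChooseLimitStrategy {
      private def hasFilter(p: Plan): Boolean =
        p.isInstanceOf[Filter] || p.children.exists(hasFilter)

      def apply(plan: Plan): Plan = plan match {
        case l: Limit => l.copy(useCoalesce = !hasFilter(l.child))
        case other    => other
      }
    }

Applied to Limit(5, Scan("t")) this sets useCoalesce = true, while Limit(5, Filter("company = 'Databricks'", Scan("t"))) keeps the shuffle path.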
Actually, one more question before you make big changes: executeCollect should be called most of the time (if you run a SQL query). In what cases did you run into this problem?
Ah, I see. We should absolutely fix that one. Once that is fixed, do you think we still need this? It seems very unlikely that execute() will be called on this.
If the limit is in a subquery, execute() will be called, right? But that is really rare :)
It seems to me that case would be rare enough that we probably don't need to care about it at this point. There is a lot of other low-hanging fruit we can optimize.
Agree, closing this.

Sent from my iPhone
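For background on the executeCollect point above: when the limit sits at the root of a query, collecting the result can bypass execute() entirely, because RDD.take already fetches rows incrementally from the driver. A simplified stand-in, not the actual Limit.executeCollect source:

    import org.apache.spark.rdd.RDD

    // take() starts with one partition and scans more only if it still
    // needs rows, so no shuffle or single-partition RDD is built at all.
    def executeCollectSketch[T](child: RDD[T], limit: Int): Array[T] =
      child.take(limit)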
Optimizing limit using coalesce to avoid shuffle.