
[WIP][SPARK-4673][SQL] Optimizing limit using coalesce #3531


Closed
scwf wants to merge 1 commit from the limit branch

Conversation

scwf
Contributor

@scwf scwf commented Dec 1, 2014

Optimizing limit using coalesce to avoid shuffle.
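As background, the difference between the two strategies can be sketched in standalone RDD terms (a hypothetical, simplified example; not the operator code in this patch):

    import org.apache.spark.{SparkConf, SparkContext}

    // Simplified sketch: `rows` and `limit` are made-up stand-ins for the
    // child operator's output and the LIMIT value.
    object LimitSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("limit-sketch").setMaster("local[4]"))
        val rows  = sc.parallelize(1 to 1000000, 100)
        val limit = 5000

        // Existing approach (roughly): take `limit` rows per partition, then
        // shuffle the survivors into a single partition and take `limit` again.
        val viaShuffle = rows.mapPartitions(_.take(limit)).repartition(1).mapPartitions(_.take(limit))

        // This PR's approach: collapse to one partition with coalesce, which
        // avoids writing and reading shuffle files, then take `limit` rows.
        val viaCoalesce = rows.coalesce(1).mapPartitions(_.take(limit))

        println((viaShuffle.count(), viaCoalesce.count()))
        sc.stop()
      }
    }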

@SparkQA

SparkQA commented Dec 1, 2014

Test build #23978 has started for PR 3531 at commit 681243a.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 1, 2014

Test build #23978 has finished for PR 3531 at commit 681243a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23978/

iter.take(limit).map(row => (false, row.copy()))
if (sortBasedShuffleOn) {
  child.execute().map(_.copy).coalesce(1).mapPartitions { iter =>
    iter.take(limit)
Contributor

Can we move the map(_.copy) after take(limit)?
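Sketched against the snippet above, the suggested reordering would be roughly this (illustrative only, not the actual patch):

    // Copy only the rows that survive take(limit), instead of copying
    // every row of the child output up front.
    child.execute().coalesce(1).mapPartitions { iter =>
      iter.take(limit).map(_.copy())
    }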

Contributor Author

Hmm, I will try this. Actually, I am not clear on why we need the copy here; @rxin added it to fix a bug. Hi @rxin, can you explain?

Contributor

You do need to copy() before any take or collect operation, because Spark SQL reuses row objects, and those operations create arrays that would otherwise all end up containing the same object.
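In plain Scala terms, the pitfall looks roughly like this (a minimal sketch in which an ArrayBuffer stands in for Spark SQL's reused mutable row; not Spark code):

    import scala.collection.mutable.ArrayBuffer

    // One mutable object is reused for every element, as Spark SQL does with rows.
    val reusedRow = ArrayBuffer(0)

    val rows = (1 to 5).iterator.map { i => reusedRow(0) = i; reusedRow }
    val withoutCopy = rows.take(3).toArray
    // withoutCopy holds three references to the same buffer, all showing 3.

    val rowsAgain = (1 to 5).iterator.map { i => reusedRow(0) = i; reusedRow }
    val withCopy = rowsAgain.map(_.clone()).take(3).toArray
    // withCopy holds three distinct buffers: ArrayBuffer(1), ArrayBuffer(2), ArrayBuffer(3).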

@marmbrus
Contributor

marmbrus commented Dec 1, 2014

Will this actually always be faster? It seems like in some cases you are just eliminating a bunch of parallelism. Do you have any benchmarks?

@scwf
Contributor Author

scwf commented Dec 2, 2014

Hi @marmbrus, the old version also reduces the parallelism to 1, but via a ShuffledRDD; the difference is that this PR uses coalesce to do the same thing while avoiding the shuffle (which writes files to disk and reads them back).
In my test, the new version is about 3x faster on a 95 GB dataset.

@chenghao-intel
Contributor

Is there an assumption that the LIMIT number is quite small?

@scwf
Contributor Author

scwf commented Dec 2, 2014

I tested with a limit of 5000, and I am running more tests. I do not think the limit value has a big effect.

@rxin
Contributor

rxin commented Dec 2, 2014

@scwf I am not sure if this is a good idea in general. Think about a highly selective filter, e.g.

select * from every_body_in_the_world where company="Databricks" limit 5;

In this case, with your patch, this query is going to run slowly on a single thread that scans all the data ...
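Sketching that scenario in RDD terms (hypothetical names and sizes; not the operator code):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def selectiveLimit(sc: SparkContext): Unit = {
      val everybody: RDD[(Int, String)] =
        sc.parallelize(1 to 10000000, 1000)
          .map(i => (i, if (i % 2000000 == 0) "Databricks" else "Other"))
      val matches = everybody.filter(_._2 == "Databricks")

      // coalesce(1) is a narrow dependency, so the whole filtered lineage is
      // computed inside the single coalesced task: one thread scans all the data.
      val viaCoalesce = matches.coalesce(1).mapPartitions(_.take(5))

      // The shuffle-based limit filters and takes 5 rows per partition in
      // parallel; only the few surviving rows are shuffled to one partition.
      val viaShuffle = matches.mapPartitions(_.take(5)).repartition(1).mapPartitions(_.take(5))

      println((viaCoalesce.count(), viaShuffle.count()))
    }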

@scwf
Contributor Author

scwf commented Dec 2, 2014

Yes, I realize this too; it will not always be faster, since coalesce(1) makes the query run on a single thread.

@rxin
Contributor

rxin commented Dec 2, 2014

I think it is too risky to do it this way right now. It seems to me the advantage of coalesce only shows up when you have a huge number of partitions and no highly selective filter. Maybe we can have two variants of Limit, and in the optimizer we pick the coalesce one if there is no filter at all?

@scwf
Contributor Author

scwf commented Dec 2, 2014

@rxin Yes, we cannot simply switch to coalesce here. I agree with you about the situations where coalesce has an advantage, and I will try to apply the coalesce optimization only when there is no filter. Thanks ;)

@scwf scwf changed the title [SPARK-4673][SQL] Optimizing limit using coalesce [WIP][SPARK-4673][SQL] Optimizing limit using coalesce Dec 2, 2014
@rxin
Contributor

rxin commented Dec 2, 2014

BTW it doesn't have to be a new operator. Can also just add a flag to Limit.
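As a purely illustrative sketch of that idea, with a toy plan ADT rather than Catalyst classes (not code from this PR): keep a single Limit node carrying a flag, and have a rewrite pass enable the coalesce strategy only when no Filter appears below the Limit.

    // Toy plan ADT; all names here are hypothetical.
    sealed trait Plan { def children: Seq[Plan] }
    case class Scan(table: String) extends Plan { val children = Seq.empty[Plan] }
    case class Filter(condition: String, child: Plan) extends Plan { val children = Seq(child) }
    case class Limit(n: Int, child: Plan, useCoalesce: Boolean = false) extends Plan {
      val children = Seq(child)
    }

    def containsFilter(p: Plan): Boolean =
      p.isInstanceOf[Filter] || p.children.exists(containsFilter)

    // Rewrite pass: mark a Limit as eligible for coalesce only if no Filter
    // exists anywhere below it.
    def chooseLimitStrategy(p: Plan): Plan = p match {
      case Limit(n, child, _) =>
        Limit(n, chooseLimitStrategy(child), useCoalesce = !containsFilter(child))
      case Filter(c, child) => Filter(c, chooseLimitStrategy(child))
      case other            => other
    }

    // chooseLimitStrategy(Limit(5, Scan("t")))
    //   => useCoalesce = true
    // chooseLimitStrategy(Limit(5, Filter("company = 'Databricks'", Scan("t"))))
    //   => useCoalesce = false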

@rxin
Contributor

rxin commented Dec 2, 2014

Actually, one more question before you make big changes: executeCollect should be called most of the time (if you run a SQL query). In what cases did you run into this problem?

@scwf
Contributor Author

scwf commented Dec 2, 2014

Hi @rxin, bin/spark-sql does not call executeCollect; I filed a PR for that: #3547

@rxin
Contributor

rxin commented Dec 2, 2014

Ah I see. We should absolutely fix that one. Once that is fixed, do you think we still need this? It seems very unlikely execute() will be called on this.

@scwf
Contributor Author

scwf commented Dec 2, 2014

If the limit is in a subquery, execute() will be called, right? But that is really rare :)

@rxin
Contributor

rxin commented Dec 2, 2014

It seems to me that case would be rare enough that we probably don't need to care about it at this point. There is a lot of other low-hanging fruit that we can optimize.

@scwf
Contributor Author

scwf commented Dec 2, 2014

Agreed, closing this.


@scwf scwf closed this Dec 2, 2014
@scwf scwf deleted the limit branch January 7, 2015 09:54