Skip to content

[SPARK-5969][PySpark] Fix descending pyspark.rdd.sortByKey. #4761

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

foxik
Copy link
Contributor

@foxik foxik commented Feb 25, 2015

The samples should always be sorted in ascending order, because bisect.bisect_left is used on it. The reverse order of the result is already achieved in rangePartitioner by reversing the found index.

The current implementation also work, but always uses only two partitions -- the first one and the last one (because the bisect_left return returns either "beginning" or "end" for a descending sequence).

The samples should always be sorted in ascending order, because
bisect.bisect_left is used on it. The reverse order of the result
is already achieved in rangePartitioner by reversing the found index.
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@JoshRosen
Copy link
Contributor

Could you add a regression test for this issue? It looks like you have one in the JIRA ticket, so adding one hopefully should not be much work. Take a look at python/pyspark/tests.py for examples of this. Make sure to add a comment referencing the JIRA number.

@foxik
Copy link
Contributor Author

foxik commented Mar 5, 2015

I added the regression test. It also tests that sortByKey returns sorted sequence and tests also ascending sequence, which are not strictly necessary for SPARK-5969, but I added them anyway.

@JoshRosen
Copy link
Contributor

Jenkins, this is ok to test.

@SparkQA
Copy link

SparkQA commented Mar 5, 2015

Test build #28315 has started for PR 4761 at commit bc2647f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 5, 2015

Test build #28315 has finished for PR 4761 at commit bc2647f.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28315/
Test FAILed.

@foxik foxik force-pushed the fix-descending-sort branch from bc2647f to 95896b5 Compare March 6, 2015 08:37
@foxik
Copy link
Contributor Author

foxik commented Mar 6, 2015

I have amended the regression test commit to pass lint-python.

@SparkQA
Copy link

SparkQA commented Mar 6, 2015

Test build #28334 has started for PR 4761 at commit 95896b5.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 6, 2015

Test build #28334 has finished for PR 4761 at commit 95896b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28334/
Test PASSed.

@JoshRosen
Copy link
Contributor

@davies, does this look good to you? Sorry for letting this patch fall off my radar (slowly getting caught up on a backlog of reviews). If things look good, I can fix the merge conflict (which is probably just a conflict in tests) and get this committed.

@davies
Copy link
Contributor

davies commented Apr 8, 2015

LGTM

@JoshRosen
Copy link
Contributor

Alright, merging this into master (1.4.0) now. Thanks!

@asfgit asfgit closed this in 0375134 Apr 10, 2015
@JoshRosen
Copy link
Contributor

Should this be backported anywhere?

@davies
Copy link
Contributor

davies commented Apr 10, 2015

This is a bug since the beginning (0.8), could we back port it for all 1.0+ branches?

asfgit pushed a commit that referenced this pull request Apr 10, 2015
The samples should always be sorted in ascending order, because bisect.bisect_left is used on it. The reverse order of the result is already achieved in rangePartitioner by reversing the found index.

The current implementation also work, but always uses only two partitions -- the first one and the last one (because the bisect_left return returns either "beginning" or "end" for a descending sequence).

Author: Milan Straka <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Josh Rosen <[email protected]>

Closes #4761 from foxik/fix-descending-sort and squashes the following commits:

95896b5 [Milan Straka] Add regression test for SPARK-5969.
5757490 [Milan Straka] Fix descending pyspark.rdd.sortByKey.
asfgit pushed a commit that referenced this pull request Apr 10, 2015
The samples should always be sorted in ascending order, because bisect.bisect_left is used on it. The reverse order of the result is already achieved in rangePartitioner by reversing the found index.

The current implementation also work, but always uses only two partitions -- the first one and the last one (because the bisect_left return returns either "beginning" or "end" for a descending sequence).

Author: Milan Straka <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Josh Rosen <[email protected]>

Closes #4761 from foxik/fix-descending-sort and squashes the following commits:

95896b5 [Milan Straka] Add regression test for SPARK-5969.
5757490 [Milan Straka] Fix descending pyspark.rdd.sortByKey.
@JoshRosen
Copy link
Contributor

I've cherry-picked the fix into branch-1.2 (1.2.3) and branch-1.3 (1.3.2). I'm going to omit the the pre-1.2 backports for now because I hit a test merge conflict and also because I think it's unlikely that we're going to release another 1.1.x version.

@davies
Copy link
Contributor

davies commented Apr 10, 2015

Make sense, thank you!

markhamstra pushed a commit to markhamstra/spark that referenced this pull request Apr 15, 2015
The samples should always be sorted in ascending order, because bisect.bisect_left is used on it. The reverse order of the result is already achieved in rangePartitioner by reversing the found index.

The current implementation also work, but always uses only two partitions -- the first one and the last one (because the bisect_left return returns either "beginning" or "end" for a descending sequence).

Author: Milan Straka <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Josh Rosen <[email protected]>

Closes apache#4761 from foxik/fix-descending-sort and squashes the following commits:

95896b5 [Milan Straka] Add regression test for SPARK-5969.
5757490 [Milan Straka] Fix descending pyspark.rdd.sortByKey.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants