[SPARK-5969][PySpark] Fix descending pyspark.rdd.sortByKey. #4761
The samples should always be sorted in ascending order, because bisect.bisect_left is used on them. The reversed order of the result is already achieved in rangePartitioner by reversing the found index.
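As a rough illustration of the description above, here is a minimal pure-Python sketch of a range partitioner that keeps its boundaries ascending and reverses only the resulting index. The helper name `make_range_partitioner`, the boundary values, and the keys are made up for this example; it is not the actual PySpark code, only a sketch of the idea under those assumptions.

```python
import bisect


def make_range_partitioner(bounds, num_partitions, ascending=True):
    """Hypothetical stand-in for rangePartitioner.

    `bounds` must be sorted ascending, because bisect_left assumes that.
    """
    def partition(key):
        # Index of the first boundary that is >= key.
        p = bisect.bisect_left(bounds, key)
        # For a descending sort, flip the partition index instead of
        # reversing the boundaries themselves.
        return p if ascending else num_partitions - 1 - p
    return partition


bounds = [10, 20, 30]        # ascending boundaries for 4 partitions (made-up values)
keys = [5, 15, 25, 35]

asc = [make_range_partitioner(bounds, 4, ascending=True)(k) for k in keys]
desc = [make_range_partitioner(bounds, 4, ascending=False)(k) for k in keys]

print(asc)   # [0, 1, 2, 3]
print(desc)  # [3, 2, 1, 0]
```

With ascending boundaries, every partition receives keys in both directions; only the mapping from key range to partition number is mirrored.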
Can one of the admins verify this patch?
Could you add a regression test for this issue? It looks like you have one in the JIRA ticket, so adding one hopefully should not be much work. Take a look at
I added the regression test. It also checks that sortByKey returns a sorted sequence and covers the ascending case as well; neither is strictly necessary for SPARK-5969, but I added them anyway.
Jenkins, this is ok to test.
Test build #28315 has started for PR 4761 at commit
Test build #28315 has finished for PR 4761 at commit
Test FAILed.
bc2647f to 95896b5
I have amended the regression test commit to pass lint-python.
Test build #28334 has started for PR 4761 at commit
Test build #28334 has finished for PR 4761 at commit
Test PASSed.
@davies, does this look good to you? Sorry for letting this patch fall off my radar (slowly getting caught up on a backlog of reviews). If things look good, I can fix the merge conflict (which is probably just a conflict in tests) and get this committed.
LGTM
Alright, merging this into
Should this be backported anywhere?
This bug has been present since the beginning (0.8). Could we backport the fix to all 1.0+ branches?
The samples should always be sorted in ascending order, because bisect.bisect_left is used on them. The reversed order of the result is already achieved in rangePartitioner by reversing the found index.
The current implementation also works, but always uses only two partitions -- the first one and the last one (because bisect_left returns either "beginning" or "end" for a descending sequence).
Author: Milan Straka <[email protected]>
This patch had conflicts when merged, resolved by Committer: Josh Rosen <[email protected]>
Closes #4761 from foxik/fix-descending-sort and squashes the following commits:
95896b5 [Milan Straka] Add regression test for SPARK-5969.
5757490 [Milan Straka] Fix descending pyspark.rdd.sortByKey.
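The "only two partitions" effect described in the commit message can be reproduced without Spark. The snippet below is a small sketch with made-up boundaries and keys; it simply calls `bisect.bisect_left` on a descending list, which is outside what that function is designed for, and collects the indices it returns.

```python
import bisect

# Boundaries sorted in descending order, as the unfixed code would have
# produced for a descending sort (values are made up for illustration).
descending_bounds = [30, 20, 10]
keys = [5, 12, 15, 25, 28, 35]

# bisect_left assumes ascending input; on descending input its binary
# search collapses to one of the two ends of the list.
indices = [bisect.bisect_left(descending_bounds, k) for k in keys]

print(indices)       # [0, 0, 0, 3, 3, 3]
print(set(indices))  # {0, 3}
```

Every key lands at index 0 or at index `len(descending_bounds)`, so after mapping indices to partitions only the first and the last partition would ever receive data, even though the final output is still sorted.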
I've cherry-picked the fix into
Makes sense, thank you!