Skip to content

SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys #986

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

mateiz
Copy link
Contributor

@mateiz mateiz commented Jun 5, 2014

The current implementation reads one key with the next hash code as it finishes reading the keys with the current hash code, which may cause it to miss some matches of the next key. This can cause operations like join to give the wrong result when reduce tasks spill to disk and there are hash collisions, as values won't be matched together. This PR fixes it by not reading in that next key, using a peeking iterator instead.

ExternalAppendOnlyMap, instead use a buffered iterator to only read
values with the current hash code.
@AmplabJenkins
Copy link

Merged build triggered.

@mridulm
Copy link
Contributor

mridulm commented Jun 5, 2014

Ah, nasty bug, great catch !

@mateiz
Copy link
Contributor Author

mateiz commented Jun 5, 2014

Yeah, it is. BTW I'm going to also try to add a test that catches this (our current hash collision ones don't), so please don't merge it yet.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build triggered.

@mateiz
Copy link
Contributor Author

mateiz commented Jun 6, 2014

Added test now.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15488/

@mateiz
Copy link
Contributor Author

mateiz commented Jun 6, 2014

@pwendell does it look okay to you?

@pwendell
Copy link
Contributor

pwendell commented Jun 6, 2014

Thanks Matei, LGTM. We should definitely get this in 0.9.2.

@asfgit asfgit closed this in b45c13e Jun 6, 2014
asfgit pushed a commit that referenced this pull request Jun 6, 2014
The current implementation reads one key with the next hash code as it finishes reading the keys with the current hash code, which may cause it to miss some matches of the next key. This can cause operations like join to give the wrong result when reduce tasks spill to disk and there are hash collisions, as values won't be matched together. This PR fixes it by not reading in that next key, using a peeking iterator instead.

Author: Matei Zaharia <[email protected]>

Closes #986 from mateiz/spark-2043 and squashes the following commits:

0959514 [Matei Zaharia] Added unit test for having many hash collisions
892debb [Matei Zaharia] SPARK-2043: don't read a key with the next hash code in ExternalAppendOnlyMap, instead use a buffered iterator to only read values with the current hash code.

(cherry picked from commit b45c13e)
Signed-off-by: Matei Zaharia <[email protected]>
asfgit pushed a commit that referenced this pull request Jun 6, 2014
The current implementation reads one key with the next hash code as it finishes reading the keys with the current hash code, which may cause it to miss some matches of the next key. This can cause operations like join to give the wrong result when reduce tasks spill to disk and there are hash collisions, as values won't be matched together. This PR fixes it by not reading in that next key, using a peeking iterator instead.

Author: Matei Zaharia <[email protected]>

Closes #986 from mateiz/spark-2043 and squashes the following commits:

0959514 [Matei Zaharia] Added unit test for having many hash collisions
892debb [Matei Zaharia] SPARK-2043: don't read a key with the next hash code in ExternalAppendOnlyMap, instead use a buffered iterator to only read values with the current hash code.

(cherry picked from commit b45c13e)
Signed-off-by: Matei Zaharia <[email protected]>
@mateiz
Copy link
Contributor Author

mateiz commented Jun 6, 2014

Alright, merged this.

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
The current implementation reads one key with the next hash code as it finishes reading the keys with the current hash code, which may cause it to miss some matches of the next key. This can cause operations like join to give the wrong result when reduce tasks spill to disk and there are hash collisions, as values won't be matched together. This PR fixes it by not reading in that next key, using a peeking iterator instead.

Author: Matei Zaharia <[email protected]>

Closes apache#986 from mateiz/spark-2043 and squashes the following commits:

0959514 [Matei Zaharia] Added unit test for having many hash collisions
892debb [Matei Zaharia] SPARK-2043: don't read a key with the next hash code in ExternalAppendOnlyMap, instead use a buffered iterator to only read values with the current hash code.
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
The current implementation reads one key with the next hash code as it finishes reading the keys with the current hash code, which may cause it to miss some matches of the next key. This can cause operations like join to give the wrong result when reduce tasks spill to disk and there are hash collisions, as values won't be matched together. This PR fixes it by not reading in that next key, using a peeking iterator instead.

Author: Matei Zaharia <[email protected]>

Closes apache#986 from mateiz/spark-2043 and squashes the following commits:

0959514 [Matei Zaharia] Added unit test for having many hash collisions
892debb [Matei Zaharia] SPARK-2043: don't read a key with the next hash code in ExternalAppendOnlyMap, instead use a buffered iterator to only read values with the current hash code.
wangyum pushed a commit that referenced this pull request May 26, 2023
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants