[SPARK-47547][CORE] Add BloomFilter V2 and use it as default #50933
Closed. ishnagy wants to merge 46 commits into apache:master from ishnagy:SPARK-47547_bloomfilter_fpp_degradation
3c5a843 SPARK-47547 BloomFilter fpp degradation: addressing the int32 truncation (ishnagy)
08cbfeb SPARK-47547 BloomFilter fpp degradation: fixing test data repetition … (ishnagy)
e3cb08e SPARK-47547 BloomFilter fpp degradation: scrambling the high 32bytes … (ishnagy)
c4e3f58 SPARK-47547 BloomFilter fpp degradation: random distribution fpp test (ishnagy)
1a0b66f SPARK-47547 BloomFilter fpp degradation: javadoc for test methods, ch… (ishnagy)
d912b66 SPARK-47547 BloomFilter fpp degradation: make seed serialization back… (ishnagy)
f589e2c SPARK-47547 BloomFilter fpp degradation: counting discarded odd items… (ishnagy)
f597c76 SPARK-47547 BloomFilter fpp degradation: refactoring FPP counting log… (ishnagy)
4ea633d SPARK-47547 BloomFilter fpp degradation: checkstyle fix (ishnagy)
6696106 SPARK-47547 BloomFilter fpp degradation: fix test bug (ishnagy)
b75e187 SPARK-47547 BloomFilter fpp degradation: parallelization friendly tes… (ishnagy)
2d8a9f1 SPARK-47547 BloomFilter fpp degradation: parallelization friendly tes… (ishnagy)
4a30794 SPARK-47547 BloomFilter fpp degradation: parallelization friendly tes… (ishnagy)
d9d6980 SPARK-47547 BloomFilter fpp degradation: addressing concerns around d… (ishnagy)
39a46c9 SPARK-47547 BloomFilter fpp degradation: cut down test cases to decre… (ishnagy)
7f235e7 Merge branch 'master' into SPARK-47547_bloomfilter_fpp_degradation (ishnagy)
16be3a9 SPARK-47547 BloomFilter fpp degradation: revert creating a new SlowTe… (ishnagy)
e91b5ca SPARK-47547 BloomFilter fpp degradation: disable progress logging by … (ishnagy)
897c1d4 SPARK-47547 BloomFilter fpp degradation: adjust tolerance and fail on… (ishnagy)
013bfe4 SPARK-47547 BloomFilter fpp degradation: make V1/V2 distinction in Bl… (ishnagy)
6d44c1e SPARK-47547 BloomFilter fpp degradation: scrambling test input withou… (ishnagy)
925bf12 SPARK-47547 BloomFilter fpp degradation: parallelizing BloomFilter re… (ishnagy)
6f28882 SPARK-47547 BloomFilter fpp degradation: add seed to equals/hashCode (ishnagy)
ed6caac SPARK-47547 BloomFilter fpp degradation: checkstyle fix (ishnagy)
7d4ef74 SPARK-47547 BloomFilter fpp degradation: remove dependency between lo… (ishnagy)
c52ead3 Merge branch 'master' into SPARK-47547_bloomfilter_fpp_degradation (ishnagy)
0ab8276 SPARK-47547 BloomFilter fpp degradation: running /dev/scalafmt (ishnagy)
d2477bf SPARK-47547 BloomFilter fpp degradation: javadoc comment for the V2 enum (ishnagy)
413c4fe SPARK-47547 BloomFilter fpp degradation: reindent with 2 spaces (ishnagy)
4599fcb SPARK-47547 BloomFilter fpp degradation: (recover empty line in Bloom… (ishnagy)
1ee2e13 SPARK-47547 BloomFilter fpp degradation: JEP-361 style switches (ishnagy)
c501b2a SPARK-47547 BloomFilter fpp degradation: removing Objects::equals (ishnagy)
1f5cfb6 SPARK-47547 BloomFilter fpp degradation: add missing seed comparison … (ishnagy)
f60d55f SPARK-47547 BloomFilter fpp degradation: checkstyle (ishnagy)
0314963 SPARK-47547 BloomFilter fpp degradation: BloomFilterBase abstract par… (ishnagy)
f2df338 SPARK-47547 BloomFilter fpp degradation: pull up long and byte hashin… (ishnagy)
4aaff83 SPARK-47547 BloomFilter fpp degradation: checkstyle (ishnagy)
e214bd7 SPARK-47547 BloomFilter fpp degradation: removing unnecessary line wr… (ishnagy)
99f7343 SPARK-47547 BloomFilter fpp degradation: moving junit-pioneer version… (ishnagy)
58e3066 SPARK-47547 BloomFilter fpp degradation: (empty line juggling) (ishnagy)
c06cb38 SPARK-47547 BloomFilter fpp degradation: pull up common hash scatteri… (ishnagy)
b99ef3a SPARK-47547 BloomFilter fpp degradation: (empty line juggling) (ishnagy)
ce3ad76 SPARK-47547 BloomFilter fpp degradation: remove redundant default cas… (ishnagy)
626e459 SPARK-47547 BloomFilter fpp degradation: properly capitalize InputStr… (ishnagy)
b0f5b45 SPARK-47547 BloomFilter fpp degradation: indenting method parameters … (ishnagy)
6849dbe SPARK-47547 BloomFilter fpp degradation: removing junit-pioneer from … (ishnagy)
199 changes: 199 additions & 0 deletions
common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterBase.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,199 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.spark.util.sketch; | ||
|
||
import java.util.Objects; | ||
|
||
abstract class BloomFilterBase extends BloomFilter { | ||
|
||
public static final int DEFAULT_SEED = 0; | ||
|
||
protected int seed; | ||
protected int numHashFunctions; | ||
protected BitArray bits; | ||
|
||
protected BloomFilterBase(int numHashFunctions, long numBits) { | ||
this(numHashFunctions, numBits, DEFAULT_SEED); | ||
} | ||
|
||
protected BloomFilterBase(int numHashFunctions, long numBits, int seed) { | ||
this(new BitArray(numBits), numHashFunctions, seed); | ||
} | ||
|
||
protected BloomFilterBase(BitArray bits, int numHashFunctions, int seed) { | ||
this.bits = bits; | ||
this.numHashFunctions = numHashFunctions; | ||
this.seed = seed; | ||
} | ||
|
||
protected BloomFilterBase() {} | ||
|
||
@Override | ||
public boolean equals(Object other) { | ||
if (other == this) { | ||
return true; | ||
} | ||
|
||
if (!(other instanceof BloomFilterBase that)) { | ||
return false; | ||
} | ||
|
||
return | ||
this.getClass() == that.getClass() | ||
&& this.numHashFunctions == that.numHashFunctions | ||
&& this.seed == that.seed | ||
// TODO: this.bits can be null temporarily, during deserialization, | ||
// should we worry about this? | ||
&& this.bits.equals(that.bits); | ||
} | ||
|
||
@Override | ||
public int hashCode() { | ||
return Objects.hash(numHashFunctions, seed, bits); | ||
} | ||
|
||
@Override | ||
public double expectedFpp() { | ||
return Math.pow((double) bits.cardinality() / bits.bitSize(), numHashFunctions); | ||
} | ||
|
||
@Override | ||
public long bitSize() { | ||
return bits.bitSize(); | ||
} | ||
|
||
@Override | ||
public boolean put(Object item) { | ||
if (item instanceof String str) { | ||
return putString(str); | ||
} else if (item instanceof byte[] bytes) { | ||
return putBinary(bytes); | ||
} else { | ||
return putLong(Utils.integralToLong(item)); | ||
} | ||
} | ||
|
||
protected HiLoHash hashLongToIntPair(long item, int seed) { | ||
// Here we first hash the input long element into 2 int hash values, h1 and h2, then produce n | ||
// hash values by `h1 + i * h2` with 1 <= i <= numHashFunctions. | ||
// Note that `CountMinSketch` use a different strategy, it hash the input long element with | ||
// every i to produce n hash values. | ||
// TODO: the strategy of `CountMinSketch` looks more advanced, should we follow it here? | ||
int h1 = Murmur3_x86_32.hashLong(item, seed); | ||
int h2 = Murmur3_x86_32.hashLong(item, h1); | ||
return new HiLoHash(h1, h2); | ||
} | ||
|
||
protected HiLoHash hashBytesToIntPair(byte[] item, int seed) { | ||
int h1 = Murmur3_x86_32.hashUnsafeBytes(item, Platform.BYTE_ARRAY_OFFSET, item.length, seed); | ||
int h2 = Murmur3_x86_32.hashUnsafeBytes(item, Platform.BYTE_ARRAY_OFFSET, item.length, h1); | ||
return new HiLoHash(h1, h2); | ||
} | ||
|
||
protected abstract boolean scatterHashAndSetAllBits(HiLoHash inputHash); | ||
|
||
protected abstract boolean scatterHashAndGetAllBits(HiLoHash inputHash); | ||
|
||
@Override | ||
public boolean putString(String item) { | ||
return putBinary(Utils.getBytesFromUTF8String(item)); | ||
} | ||
|
||
@Override | ||
public boolean putBinary(byte[] item) { | ||
HiLoHash hiLoHash = hashBytesToIntPair(item, seed); | ||
return scatterHashAndSetAllBits(hiLoHash); | ||
} | ||
|
||
@Override | ||
public boolean mightContainString(String item) { | ||
return mightContainBinary(Utils.getBytesFromUTF8String(item)); | ||
} | ||
|
||
@Override | ||
public boolean mightContainBinary(byte[] item) { | ||
HiLoHash hiLoHash = hashBytesToIntPair(item, seed); | ||
return scatterHashAndGetAllBits(hiLoHash); | ||
} | ||
|
||
public boolean putLong(long item) { | ||
HiLoHash hiLoHash = hashLongToIntPair(item, seed); | ||
return scatterHashAndSetAllBits(hiLoHash); | ||
} | ||
|
||
@Override | ||
public boolean mightContainLong(long item) { | ||
HiLoHash hiLoHash = hashLongToIntPair(item, seed); | ||
return scatterHashAndGetAllBits(hiLoHash); | ||
} | ||
|
||
@Override | ||
public boolean mightContain(Object item) { | ||
if (item instanceof String str) { | ||
return mightContainString(str); | ||
} else if (item instanceof byte[] bytes) { | ||
return mightContainBinary(bytes); | ||
} else { | ||
return mightContainLong(Utils.integralToLong(item)); | ||
} | ||
} | ||
|
||
@Override | ||
public boolean isCompatible(BloomFilter other) { | ||
if (other == null) { | ||
return false; | ||
} | ||
|
||
if (!(other instanceof BloomFilterBase that)) { | ||
return false; | ||
} | ||
|
||
return | ||
this.getClass() == that.getClass() | ||
&& this.bitSize() == that.bitSize() | ||
&& this.numHashFunctions == that.numHashFunctions | ||
&& this.seed == that.seed; | ||
} | ||
|
||
@Override | ||
public BloomFilter mergeInPlace(BloomFilter other) throws IncompatibleMergeException { | ||
BloomFilterBase otherImplInstance = checkCompatibilityForMerge(other); | ||
|
||
this.bits.putAll(otherImplInstance.bits); | ||
return this; | ||
} | ||
|
||
@Override | ||
public BloomFilter intersectInPlace(BloomFilter other) throws IncompatibleMergeException { | ||
BloomFilterBase otherImplInstance = checkCompatibilityForMerge(other); | ||
|
||
this.bits.and(otherImplInstance.bits); | ||
return this; | ||
} | ||
|
||
@Override | ||
public long cardinality() { | ||
return this.bits.cardinality(); | ||
} | ||
|
||
protected abstract BloomFilterBase checkCompatibilityForMerge(BloomFilter other) | ||
throws IncompatibleMergeException; | ||
|
||
public record HiLoHash(int hi, int lo) {} | ||
|
||
} |
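The double-hashing scheme described in the comment of hashLongToIntPair (`h1 + i * h2` for 1 <= i <= numHashFunctions) and the expectedFpp formula can be illustrated with a minimal stand-alone sketch. Everything in it is hypothetical and not taken from the PR: the class name DoubleHashingSketch, the mix function (a simple stand-in for Murmur3_x86_32), and java.util.BitSet in place of BitArray. Combining the two hashes in 64-bit arithmetic is only an assumption about what "addressing the int32 truncation" aims at; the actual scatterHashAndSetAllBits implementations are not shown in this diff.

```java
import java.util.BitSet;

/** Hypothetical illustration only; not the Spark implementation. */
public class DoubleHashingSketch {
  private final BitSet bits;
  private final int numBits;
  private final int numHashFunctions;
  private final int seed;

  public DoubleHashingSketch(int numBits, int numHashFunctions, int seed) {
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashFunctions = numHashFunctions;
    this.seed = seed;
  }

  // Stand-in 32-bit mixer (assumption): the diff uses Murmur3_x86_32.hashLong here.
  static int mix(long item, int seed) {
    int h = seed ^ (int) item ^ (((int) (item >>> 32)) * 0x9E3779B9);
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
  }

  // i-th bit index from the hash pair: (h1 + i * h2) mod numBits.
  // The sum is computed as a long so it cannot overflow an int (assumption about intent).
  private int index(int h1, int h2, int i) {
    long combined = (long) h1 + (long) i * h2;
    return (int) Math.floorMod(combined, (long) numBits);
  }

  /** Sets all k bits for the item; returns true if at least one bit changed. */
  public boolean putLong(long item) {
    int h1 = mix(item, seed);
    int h2 = mix(item, h1); // second hash is seeded with the first, as in the diff
    boolean changed = false;
    for (int i = 1; i <= numHashFunctions; i++) {
      int idx = index(h1, h2, i);
      changed |= !bits.get(idx);
      bits.set(idx);
    }
    return changed;
  }

  /** Returns false as soon as one of the k bits is unset. */
  public boolean mightContainLong(long item) {
    int h1 = mix(item, seed);
    int h2 = mix(item, h1);
    for (int i = 1; i <= numHashFunctions; i++) {
      if (!bits.get(index(h1, h2, i))) {
        return false;
      }
    }
    return true;
  }

  /** Same formula as expectedFpp() in the diff: (set bits / total bits) ^ k. */
  public double expectedFpp() {
    return Math.pow((double) bits.cardinality() / numBits, numHashFunctions);
  }

  public static void main(String[] args) {
    DoubleHashingSketch filter = new DoubleHashingSketch(1024, 3, 0);
    filter.putLong(42L);
    System.out.println(filter.mightContainLong(42L));
    System.out.println(filter.expectedFpp());
  }
}
```

Because only two base hashes are computed per item, the cost of an insert or lookup stays at two hash evaluations regardless of numHashFunctions; the trade-off is that the k indices are correlated through h1 and h2 rather than independent.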
If we want to add V2, we need to add a new comment block for V2 because the above comment is for V1 only.
spark/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java
Line 48 in 46b6ccb
added some comments in d2477bf