-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size #1025
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer comments addressed: - commons-math3 is now a test-only dependency. bumped up to v3.3 - comments added to explain what computeFraction is doing - fixed the unit for computeFraction to use BinomialDitro for without replacement sampling - stylistic fixes
…exact sample size
Build triggered. |
Build triggered. |
Build triggered. |
@@ -46,7 +50,8 @@ import org.apache.spark.Partitioner.defaultPartitioner | |||
import org.apache.spark.SparkContext._ | |||
import org.apache.spark.partial.{BoundedDouble, PartialResult} | |||
import org.apache.spark.serializer.Serializer | |||
import org.apache.spark.util.SerializableHyperLogLog | |||
import org.apache.spark.util.{Utils, SerializableHyperLogLog} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SerializableHyperLogLog no longer exists
Build started. |
Build finished. |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15579/ |
Build started. |
Build finished. |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15581/ |
Build started. |
Build finished. |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15582/ |
…to the aggregate function
Build triggered. |
Build started. |
Build finished. All automated tests passed. |
All automated tests passed. |
|
||
import org.apache.commons.math3.distribution.{PoissonDistribution, NormalDistribution} | ||
|
||
private[spark] object PoissonBounds { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though private, this object needs some documentation. I understand this is a little unfair ... :p
Well, the other "sample" functions are already approximate anyway. I kind of like this here because it conveys that it's more expensive. The other thing is that if we want the Exact one to be experimental, we can't just make it a parameter. |
fix poisson mean for waitlisting add unit tests for Java
@dorx I removed commons-math3 from dependencies, separated Other than those changes, the refactoring was completely based on your implementation. I did some renaming but I'm not taking any credit for the implementation. Some of the refactoring was unnecessary, which was partially because I did this around 5am and my head was not clear. I'm really sorry for the unnecessary code changes. Since this is apparently your contribution, do you mind re-opening the PR? Thanks! |
QA tests have started for PR 1025. This patch DID NOT merge cleanly! |
QA tests have started for PR 1025. This patch merges cleanly. |
QA results for PR 1025: |
QA tests have started for PR 1025. This patch merges cleanly. |
QA results for PR 1025: |
QA results for PR 1025: |
QA tests have started for PR 1025. This patch merges cleanly. |
QA results for PR 1025: |
*/ | ||
def getLowerBound(delta: Double, n: Long, fraction: Double): Double = { | ||
val gamma = - math.log(delta) / n * (2.0 / 3.0) | ||
math.max(minSamplingRate, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be okay to return 0
here. The minSamplingRate
should be applied to getUpperBound
.
QA tests have started for PR 1025. This patch merges cleanly. |
QA results for PR 1025: |
LGTM. Merged into master. Thanks!! |
…exact sample size Implemented stratified sampling that guarantees exact sample size using ScaRSR with two passes over the RDD for sampling without replacement and three passes for sampling with replacement. Author: Doris Xin <[email protected]> Author: Xiangrui Meng <[email protected]> Closes apache#1025 from dorx/stratified and squashes the following commits: 245439e [Doris Xin] moved minSamplingRate to getUpperBound eaf5771 [Doris Xin] bug fixes. 17a381b [Doris Xin] fixed a merge issue and a failed unit ea7d27f [Doris Xin] merge master b223529 [Xiangrui Meng] use approx bounds for poisson fix poisson mean for waitlisting add unit tests for Java b3013a4 [Xiangrui Meng] move math3 back to test scope eecee5f [Doris Xin] Merge branch 'master' into stratified f4c21f3 [Doris Xin] Reviewer comments a10e68d [Doris Xin] style fix a2bf756 [Doris Xin] Merge branch 'master' into stratified 680b677 [Doris Xin] use mapPartitionWithIndex instead 9884a9f [Doris Xin] style fix bbfb8c9 [Doris Xin] Merge branch 'master' into stratified ee9d260 [Doris Xin] addressed reviewer comments 6b5b10b [Doris Xin] Merge branch 'master' into stratified 254e03c [Doris Xin] minor fixes and Java API. 4ad516b [Doris Xin] remove unused imports from PairRDDFunctions bd9dc6e [Doris Xin] unit bug and style violation fixed 1fe1cff [Doris Xin] Changed fractionByKey to a map to enable arg check 944a10c [Doris Xin] [SPARK-2145] Add lower bound on sampling rate 0214a76 [Doris Xin] cleanUp 90d94c0 [Doris Xin] merge master 9e74ab5 [Doris Xin] Separated out most of the logic in sampleByKey 7327611 [Doris Xin] merge master 50581fc [Doris Xin] added a TODO for logging in python 46f6c8c [Doris Xin] fixed the NPE caused by closures being cleaned before being passed into the aggregate function 7e1a481 [Doris Xin] changed the permission on SamplingUtil 1d413ce [Doris Xin] fixed checkstyle issues 9ee94ee [Doris Xin] [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample 7cab53a [Doris Xin] fixed import bug in rdd.py ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD 1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
…ommand (#1025) * check auth for subquerys in condition * add skip auth method for dataframe * add skip auth method for dataframe * add skip auth method for dataframe * add skip auth method for dataframe
…#1025) Co-authored-by: Egor Krivokon <>
…#1025) Co-authored-by: Egor Krivokon <>
Implemented stratified sampling that guarantees exact sample size using ScaRSR with two passes over the RDD for sampling without replacement and three passes for sampling with replacement.