Skip to content

[SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size #1025

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 33 commits into from

Conversation

dorx
Copy link
Contributor

@dorx dorx commented Jun 9, 2014

Implemented stratified sampling that guarantees exact sample size using ScaRSR with two passes over the RDD for sampling without replacement and three passes for sampling with replacement.

dorx added 5 commits May 29, 2014 15:22
Reviewer comments addressed:
- commons-math3 is now a test-only dependency. bumped up to v3.3
- comments added to explain what computeFraction is doing
- fixed the unit for computeFraction to use BinomialDitro for without
replacement sampling
- stylistic fixes
@AmplabJenkins
Copy link

Build triggered.

@AmplabJenkins
Copy link

Build triggered.

@AmplabJenkins
Copy link

Build triggered.

@@ -46,7 +50,8 @@ import org.apache.spark.Partitioner.defaultPartitioner
import org.apache.spark.SparkContext._
import org.apache.spark.partial.{BoundedDouble, PartialResult}
import org.apache.spark.serializer.Serializer
import org.apache.spark.util.SerializableHyperLogLog
import org.apache.spark.util.{Utils, SerializableHyperLogLog}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SerializableHyperLogLog no longer exists

@AmplabJenkins
Copy link

Build started.

@AmplabJenkins
Copy link

Build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15579/

@AmplabJenkins
Copy link

Build started.

@AmplabJenkins
Copy link

Build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15581/

@AmplabJenkins
Copy link

Build started.

@AmplabJenkins
Copy link

Build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15582/

@AmplabJenkins
Copy link

Build triggered.

@AmplabJenkins
Copy link

Build started.

@AmplabJenkins
Copy link

Build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15589/


import org.apache.commons.math3.distribution.{PoissonDistribution, NormalDistribution}

private[spark] object PoissonBounds {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though private, this object needs some documentation. I understand this is a little unfair ... :p

@mateiz
Copy link
Contributor

mateiz commented Jul 25, 2014

Well, the other "sample" functions are already approximate anyway. I kind of like this here because it conveys that it's more expensive. The other thing is that if we want the Exact one to be experimental, we can't just make it a parameter.

fix poisson mean for waitlisting
add unit tests for Java
@dorx dorx closed this Jul 26, 2014
@mengxr
Copy link
Contributor

mengxr commented Jul 27, 2014

@dorx I removed commons-math3 from dependencies, separated sampleByKey and sampleByKeyExact, and corrected the math in waitlisting in sampling with replacement.

Other than those changes, the refactoring was completely based on your implementation. I did some renaming but I'm not taking any credit for the implementation. Some of the refactoring was unnecessary, which was partially because I did this around 5am and my head was not clear. I'm really sorry for the unnecessary code changes.

Since this is apparently your contribution, do you mind re-opening the PR? Thanks!

@dorx dorx reopened this Jul 28, 2014
@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA tests have started for PR 1025. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17293/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA tests have started for PR 1025. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17296/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA results for PR 1025:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17296/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA tests have started for PR 1025. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17300/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA results for PR 1025:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17300/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA results for PR 1025:
- This patch FAILED unit tests.

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17293/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA tests have started for PR 1025. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17307/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA results for PR 1025:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17307/consoleFull

*/
def getLowerBound(delta: Double, n: Long, fraction: Double): Double = {
val gamma = - math.log(delta) / n * (2.0 / 3.0)
math.max(minSamplingRate,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be okay to return 0 here. The minSamplingRate should be applied to getUpperBound.

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA tests have started for PR 1025. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17368/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA results for PR 1025:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17368/consoleFull

@mengxr
Copy link
Contributor

mengxr commented Jul 29, 2014

LGTM. Merged into master. Thanks!!

@asfgit asfgit closed this in dc96536 Jul 29, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…exact sample size

Implemented stratified sampling that guarantees exact sample size using ScaRSR with two passes over the RDD for sampling without replacement and three passes for sampling with replacement.

Author: Doris Xin <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes apache#1025 from dorx/stratified and squashes the following commits:

245439e [Doris Xin] moved minSamplingRate to getUpperBound
eaf5771 [Doris Xin] bug fixes.
17a381b [Doris Xin] fixed a merge issue and a failed unit
ea7d27f [Doris Xin] merge master
b223529 [Xiangrui Meng] use approx bounds for poisson fix poisson mean for waitlisting add unit tests for Java
b3013a4 [Xiangrui Meng] move math3 back to test scope
eecee5f [Doris Xin] Merge branch 'master' into stratified
f4c21f3 [Doris Xin] Reviewer comments
a10e68d [Doris Xin] style fix
a2bf756 [Doris Xin] Merge branch 'master' into stratified
680b677 [Doris Xin] use mapPartitionWithIndex instead
9884a9f [Doris Xin] style fix
bbfb8c9 [Doris Xin] Merge branch 'master' into stratified
ee9d260 [Doris Xin] addressed reviewer comments
6b5b10b [Doris Xin] Merge branch 'master' into stratified
254e03c [Doris Xin] minor fixes and Java API.
4ad516b [Doris Xin] remove unused imports from PairRDDFunctions
bd9dc6e [Doris Xin] unit bug and style violation fixed
1fe1cff [Doris Xin] Changed fractionByKey to a map to enable arg check
944a10c [Doris Xin] [SPARK-2145] Add lower bound on sampling rate
0214a76 [Doris Xin] cleanUp
90d94c0 [Doris Xin] merge master
9e74ab5 [Doris Xin] Separated out most of the logic in sampleByKey
7327611 [Doris Xin] merge master
50581fc [Doris Xin] added a TODO for logging in python
46f6c8c [Doris Xin] fixed the NPE caused by closures being cleaned before being passed into the aggregate function
7e1a481 [Doris Xin] changed the permission on SamplingUtil
1d413ce [Doris Xin] fixed checkstyle issues
9ee94ee [Doris Xin] [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
wangyum pushed a commit that referenced this pull request May 26, 2023
…ommand (#1025)

* check auth for subquerys in condition

* add skip auth method for dataframe

* add skip auth method for dataframe

* add skip auth method for dataframe

* add skip auth method for dataframe
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants