[SPARK-4327] [PySpark] Python API for RDD.randomSplit() #3193

Closed
wants to merge 9 commits into apache:master from davies:randomSplit

Conversation

@davies
Contributor

davies commented Nov 10, 2014

```
pyspark.RDD.randomSplit(self, weights, seed=None)
    Randomly splits this RDD with the provided weights.

    :param weights: weights for splits, will be normalized if they don't sum to 1
    :param seed: random seed
    :return: split RDDs in a list

    >>> rdd = sc.parallelize(range(10), 1)
    >>> rdd1, rdd2, rdd3 = rdd.randomSplit([0.4, 0.6, 1.0], 11)
    >>> rdd1.collect()
    [3, 6]
    >>> rdd2.collect()
    [0, 5, 7]
    >>> rdd3.collect()
    [1, 2, 4, 8, 9]
```
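
To make the semantics above concrete, here is a minimal, self-contained sketch of the range-sampling idea behind randomSplit (the helper names random_split_sketch and keep_range are hypothetical, and this is a sketch rather than the code in this PR): the weights are normalized into cumulative [low, high) ranges, and each resulting RDD re-scans the parent with the same per-partition seed, keeping the rows whose uniform draw falls into its range.

```
import random

def random_split_sketch(rdd, weights, seed=None):
    # Normalize the weights so they sum to 1, then turn them into
    # cumulative [low, high) acceptance ranges, one range per split.
    total = float(sum(weights))
    bounds = [0.0]
    for w in weights:
        bounds.append(bounds[-1] + w / total)
    if seed is None:
        seed = random.randint(0, 2 ** 32 - 1)

    def keep_range(low, high):
        def part(split_index, iterator):
            # Re-seeding per partition with the same base seed means every
            # split sees the identical stream of draws, so each row lands
            # in exactly one of the resulting RDDs.
            rng = random.Random(seed ^ split_index)
            for row in iterator:
                if low <= rng.random() < high:
                    yield row
        return rdd.mapPartitionsWithIndex(part, preservesPartitioning=True)

    return [keep_range(low, high) for low, high in zip(bounds, bounds[1:])]
```

Because every split replays the same deterministic stream of draws, the splits are disjoint and together cover the parent RDD exactly once, which is why the three doctest outputs above partition all ten elements.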

@davies
Contributor Author

davies commented Nov 10, 2014

cc @mengxr @JoshRosen @mateiz, since this PR adds a public API.

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23172 has started for PR 3193 at commit 41fce54.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23172 has finished for PR 3193 at commit 41fce54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23172/


```
/**
 * A helper to convert java.util.List[Double] into Array[Double]
 * @param list
```
Contributor

Do you mind removing these empty Scaladoc tags?

@JoshRosen
Contributor

@mengxr I think that you should review this, since I don't understand how randomSplit is implemented in Scala or what sorts of properties / correctness guarantees it's supposed to exhibit.

@davies
Contributor Author

davies commented Nov 12, 2014

@mengxr Good catch! I have updated it.

@SparkQA

SparkQA commented Nov 12, 2014

Test build #23283 has started for PR 3193 at commit 0d9b256.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 13, 2014

Test build #23283 has finished for PR 3193 at commit 0d9b256.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23283/

@davies
Contributor Author

davies commented Nov 13, 2014

@JoshRosen @mengxr I have moved the implementation into Python; it avoids the problem of changing the batchSize. Please review it again.

@SparkQA

SparkQA commented Nov 13, 2014

Test build #23320 has started for PR 3193 at commit f866bcf.

  • This patch merges cleanly.

```
@@ -111,7 +112,7 @@ def func(self, split, iterator):
                 yield obj
         else:
             for obj in iterator:
-                if self.getUniformSample(split) <= self._fraction:
+                if self._lowbound <= self.getUniformSample(split) < self._fraction:
```
Contributor

There is an issue with the name here. Maybe we should keep the name fraction and rename lowbound to acceptanceRangeStart. Then check acceptanceRangeStart + fraction <= 1.0 + eps in the constructor, and call RDDSampler(False, ub - lb, seed, lb).func in randomSplit.
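
As a rough illustration of the suggested check (a sketch only, using the names proposed in the comment above; not the code that was merged):

```
class RDDSampler(object):
    # Sketch of the suggestion: keep `fraction` as the width of the
    # acceptance range and validate that the range stays within [0, 1].
    def __init__(self, withReplacement, fraction, seed=None, acceptanceRangeStart=0.0):
        eps = 1e-6
        if acceptanceRangeStart + fraction > 1.0 + eps:
            raise ValueError("acceptance range [%g, %g) falls outside [0, 1]"
                             % (acceptanceRangeStart, acceptanceRangeStart + fraction))
        self._withReplacement = withReplacement
        self._fraction = fraction
        self._seed = seed
        self._acceptanceRangeStart = acceptanceRangeStart
```

randomSplit would then construct each split as RDDSampler(False, ub - lb, seed, lb), as suggested.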

@mengxr
Contributor

mengxr commented Nov 13, 2014

@davies Did you compare the performance?

@davies
Contributor Author

davies commented Nov 13, 2014

@mengxr I ran randomSplit([0.2, 0.3]) on an RDD of a million ints; the Scala version finished in 16.4 seconds and the Python version in 20.5 seconds (25% slower).

@SparkQA

SparkQA commented Nov 13, 2014

Test build #23325 has started for PR 3193 at commit 4dfa2cd.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 13, 2014

Test build #23320 has finished for PR 3193 at commit f866bcf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23320/

@SparkQA

SparkQA commented Nov 13, 2014

Test build #23326 has started for PR 3193 at commit f5fdf63.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23351 has started for PR 3193 at commit 51649f5.

  • This patch merges cleanly.

@davies
Contributor Author

davies commented Nov 14, 2014

@mengxr I have simplified RDDSampler by removing numpy; the reasoning has been added to the description of this PR. Please re-review it.

@mengxr
Contributor

mengxr commented Nov 14, 2014

@davies Did you measure only the rdd.sample(...).count()? Sampling 1 million items took about 0.6s without replacement and 2.5s with replacement on my computer. I think we use the same MacBook model, or yours is better. :)

Maybe part of the time in your case was spent on broadcasting the RDD. Could you try the following:

```
from pyspark.mllib.random import RandomRDDs
rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
rdd.count()
rdd.sample(True, 0.9).count()
```

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23351 has finished for PR 3193 at commit 51649f5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RDDRangeSampler(RDDSamplerBase):

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23351/

@davies
Contributor Author

davies commented Nov 14, 2014

@mengxr I got the same results as you (using your test code); I will update the results in the description.

@davies
Contributor Author

davies commented Nov 14, 2014

@mengxr I have updated it; numpy is even slower when withReplacement is False.

@davies
Contributor Author

davies commented Nov 14, 2014

Removed the test for withReplacement=True, because random.expovariate() is not consistent between Python 2.6 and 2.7.
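
For context on why expovariate() matters here: with-replacement sampling needs a Poisson-distributed duplication count per row, which can be built from exponential inter-arrival times. A minimal sketch of that construction (the helper name poisson_sample is hypothetical):

```
import random

def poisson_sample(rng, mean):
    # Count how many Exp(1) inter-arrival times fit before `mean`
    # elapses; that count is Poisson(mean)-distributed.
    elapsed = rng.expovariate(1.0)
    count = 0
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count
```

If expovariate() yields different sequences on 2.6 and 2.7 for the same seed, the counts (and hence a doctest's sampled output) differ across versions, so the test cannot be pinned to one expected output.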

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23359 has started for PR 3193 at commit f583023.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23359 has finished for PR 3193 at commit f583023.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RDDRangeSampler(RDDSamplerBase):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23359/

@mengxr
Contributor

mengxr commented Nov 14, 2014

@davies I'm a little concerned about removing numpy.random in this PR: 1) it is beyond the scope of this PR, and 2) it brings a performance regression. Since we are comparing Python's random vs. numpy's random, we can easily measure the performance outside Spark; numpy's random is about 2x faster than Python's random on my machine. Besides speed, another issue is the quality of the RNG, whose specification we would need to spend more time on.
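
One way to compare the two RNGs outside Spark is a quick timeit run; a sketch (absolute numbers, and the ratio, will vary with the machine and with whether the numpy draws are batched):

```
import timeit

# One million uniform draws: one call per draw in pure Python vs. one
# batched call in numpy; adjust to match how the sampler actually draws.
py_time = timeit.timeit(
    "[rng.random() for _ in range(1000000)]",
    setup="import random; rng = random.Random(42)",
    number=1)
np_time = timeit.timeit(
    "rng.random_sample(1000000)",
    setup="import numpy; rng = numpy.random.RandomState(42)",
    number=1)
print("python random: %.3fs, numpy random: %.3fs" % (py_time, np_time))
```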

@davies
Contributor Author

davies commented Nov 14, 2014

For 1), I could put the refactoring in another JIRA/PR.

For the performance regression, I think it's an acceptable trade-off between performance and code maintainability. There are lots of ways to improve the performance of PySpark, such as numpy/Cython/numba/pypy/pandas; we should balance the dependencies against the complexity.

Actually, the current approach introduces problems: if numpy is available on the driver but not installed on the slaves, the job will fail. Someone tried to fix this in #2313, but that PR may introduce another problem: sample() would be non-reproducible if some of the slaves have numpy but others do not. This complicates things a lot without contributing a huge performance gain.

@mengxr
Contributor

mengxr commented Nov 14, 2014

For the quality of the RNG, both Python and numpy use the Mersenne Twister (http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.RandomState.html):

The Python stdlib module “random” also contains a Mersenne Twister pseudo-random number generator with a number of methods that are similar to the ones available in RandomState. RandomState, besides being NumPy-aware, has the advantage that it provides a much larger number of probability distributions to choose from.

I can tell from its source code that numpy.random uses MT19937, and Python presumably implements the same RNG. So quality-wise, there should be no issue with always using Python's random.

But for the performance/code-complexity trade-off, maybe @JoshRosen should decide.

@davies
Contributor Author

davies commented Nov 15, 2014

@JoshRosen What do you think of this? The MLlib tests may be blocked by this.

@JoshRosen
Contributor

I don't really feel qualified to give an opinion here.

@davies
Contributor Author

davies commented Nov 15, 2014

@mengxr @JoshRosen I have reverted the numpy changes, since they were blocking this PR; let's think about it later.

@SparkQA

SparkQA commented Nov 15, 2014

Test build #23427 has started for PR 3193 at commit 78bf997.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 15, 2014

Test build #23427 has finished for PR 3193 at commit 78bf997.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RDDRangeSampler(RDDSamplerBase):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23427/

@mengxr
Contributor

mengxr commented Nov 19, 2014

LGTM. Merged into master and branch-1.2. Thanks! Let's create a new JIRA for the python random vs. numpy random discussion.

@davies
Contributor Author

davies commented Nov 19, 2014

I have created https://issues.apache.org/jira/browse/SPARK-4477 for it.

davies pushed a commit to davies/spark that referenced this pull request Nov 19, 2014
```
pyspark.RDD.randomSplit(self, weights, seed=None)
    Randomly splits this RDD with the provided weights.

    :param weights: weights for splits, will be normalized if they don't sum to 1
    :param seed: random seed
    :return: split RDDs in a list

    >>> rdd = sc.parallelize(range(10), 1)
    >>> rdd1, rdd2, rdd3 = rdd.randomSplit([0.4, 0.6, 1.0], 11)
    >>> rdd1.collect()
    [3, 6]
    >>> rdd2.collect()
    [0, 5, 7]
    >>> rdd3.collect()
    [1, 2, 4, 8, 9]
```

Author: Davies Liu <[email protected]>

Closes apache#3193 from davies/randomSplit and squashes the following commits:

78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()
@davies
Contributor Author

davies commented Nov 19, 2014

merged.

@davies davies closed this Nov 19, 2014
davies pushed a commit that referenced this pull request Nov 19, 2014
```
pyspark.RDD.randomSplit(self, weights, seed=None)
    Randomly splits this RDD with the provided weights.

    :param weights: weights for splits, will be normalized if they don't sum to 1
    :param seed: random seed
    :return: split RDDs in a list

    >>> rdd = sc.parallelize(range(10), 1)
    >>> rdd1, rdd2, rdd3 = rdd.randomSplit([0.4, 0.6, 1.0], 11)
    >>> rdd1.collect()
    [3, 6]
    >>> rdd2.collect()
    [0, 5, 7]
    >>> rdd3.collect()
    [1, 2, 4, 8, 9]
```

Author: Davies Liu <[email protected]>

Closes #3193 from davies/randomSplit and squashes the following commits:

78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()

(cherry picked from commit 7f22fa8)
Signed-off-by: Xiangrui Meng <[email protected]>