SPARK-1438 RDD make seed optional in RDD methods sam... #462

Closed
wants to merge 1 commit into from

arun-rama
Contributor

It's probably better to let the underlying language implementation take care of the default seed if the user specifies none. This was easier to do in Python, as the default value for seed in random and numpy random is None.

On the Scala/Java side it might mean propagating an Option or null (oh no!) down the chain to where the Random is constructed. But it looks like the convention in some other methods is to use System.nanoTime, so I followed that convention.
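
For illustration, the Python-side idea can be sketched as below. This is a minimal sketch, not the code in this patch: `sample` here is a hypothetical standalone helper rather than the PySpark method, and it only relies on the standard-library behavior that `random.Random(None)` seeds itself from a system source.

```python
import random


def sample(items, fraction, seed=None):
    """Keep each item with probability `fraction`.

    Hypothetical helper for illustration only; not Spark's API.
    """
    # With seed=None the generator seeds itself from OS entropy (or the
    # clock), so callers who omit the seed do not get a fixed sample.
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


data = list(range(100))

# Passing the same explicit seed twice reproduces the same sample.
first = sample(data, 0.1, seed=42)
second = sample(data, 0.1, seed=42)
```

Leaving the default as `None` means the decision about how to seed stays with the language runtime, which is the behavior argued for above.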

Conflict with overloaded method in sql.SchemaRDD
SchemaRDD defines an overloaded method
sample(fraction, withReplacement=false, seed=math.random)

So SchemaRDD had two sample methods with the same parameters in a different order. I believe the author intended to override the RDD.sample method, not overload it, so I changed it.

Also, Scala does not allow more than one overloaded method to have default params, so this code had to be modified. I'm not sure whether existing application code might break because of this. If we need to keep things backward compatible, 3 new methods can be introduced (without default params) like this:
sample(fraction)
sample(fraction, withReplacement)
sample(fraction, withReplacement, seed)

Added some tests for the Scala RDD takeSample method. I was able to test the Java side manually.

@AmplabJenkins

Can one of the admins verify this patch?

/**
* Return a sampled subset of this RDD.
*/
def sample(withReplacement: Boolean, fraction: Double, seed: Long): JavaPairRDD[K, V] =
Contributor

I think it's important to use Int instead of Long, since current code is written against Int. If we change to Long, old code using the sample API will fail to compile because of a type mismatch.

Member

Long is going to be more standard; certainly Java uses long seeds in its APIs. It gives more bits of seed, which is good too. There may be some API changes but anyone calling with an Int seed should be able to call an API with a Long seed right?

Contributor

Yes, it is more standard to use Long, but we would need to rewrite some code.
We should ask @mateiz or @pwendell for advice. Maybe they chose Int for some reason we don't know.

Contributor Author

We can have deprecated overloaded methods for backward compatibility if needed.

@advancedxy
Contributor

About the seed selection, maybe we should keep all the implementations in sync.
That is to say: if we use System.nanoTime in Scala, we should use the equivalent in Java and Python; if we use math.random, we should use the equivalent in Java and Python (or port it to the other languages).

withReplacement: Boolean = true,
seed: Int = (math.random * 1000).toInt) =
fraction: Double,
seed: Long) =
Contributor

Did you intend to remove the default behavior here?

Contributor Author

Scala does not allow multiple overloaded methods to have default params. So if we make seed default in RDD.sample, this one has to be modified, and I modified it in a standard way. Also, I believe the author intended to override RDD.sample, not overload it. More details are in the PR description.

@pwendell
Contributor

Could you make this pull request into the master branch instead of into branch-1.0? Thanks

@arun-rama
Contributor Author

New PR against master instead of branch-1.0: #477

@arun-rama arun-rama closed this Apr 22, 2014
@arun-rama arun-rama deleted the branch-1.0 branch April 22, 2014 05:35
@arun-rama arun-rama restored the branch-1.0 branch April 22, 2014 05:38
asfgit pushed a commit that referenced this pull request Apr 25, 2014
copying from previous pull request #462

It's probably better to let the underlying language implementation take care of the default seed. This was easier to do in Python, as the default value for seed in random and numpy random is None.

On the Scala/Java side it might mean propagating an Option or null (oh no!) down the chain to where the Random is constructed. But it looks like the convention in some other methods is to use System.nanoTime, so I followed that convention.

Conflict with the overloaded method sql.SchemaRDD.sample, which also defines default params:
sample(fraction, withReplacement=false, seed=math.random)
Scala does not allow more than one overloaded method to have default params. I believe the author intended to override the RDD.sample method, not overload it, so I changed it.

If backward compatibility is important, 3 new methods can be introduced (without default params) like this:
sample(fraction)
sample(fraction, withReplacement)
sample(fraction, withReplacement, seed)

Added some tests for the scala RDD takeSample method.

Author: Arun Ramakrishnan <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <[email protected]>

Closes #477 from smartnut007/master and squashes the following commits:

07bb06e [Arun Ramakrishnan] SPARK-1438 fixing more space formatting issues
b9ebfe2 [Arun Ramakrishnan] SPARK-1438 removing redundant import of random in python rddsampler
8d05b1a [Arun Ramakrishnan] SPARK-1438 RDD . Replace System.nanoTime with a Random generated number. python: use a separate instance of Random instead of seeding language api global Random instance.
69619c6 [Arun Ramakrishnan] SPARK-1438 fix spacing issue
0c247db [Arun Ramakrishnan] SPARK-1438 RDD language apis to support optional seed in RDD methods sample/takeSample

(cherry picked from commit 35e3d19)
Signed-off-by: Matei Zaharia <[email protected]>
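
Commit 8d05b1a above mentions using a separate instance of Random in Python instead of seeding the language's global Random instance. A rough sketch of why that matters follows. `take_sample_isolated` is a hypothetical name chosen for illustration and is not the code in the commit. It only shows that a private `random.Random` leaves the module-level generator's state alone.

```python
import random


def take_sample_isolated(items, num, seed=None):
    """Draw `num` items using a private generator (hypothetical helper)."""
    # A dedicated Random instance: seeding it does not touch the state of
    # the module-level generator that other code may be relying on.
    rng = random.Random(seed)
    return rng.sample(items, num)


# Record what the global generator yields right after being seeded with 7.
random.seed(7)
expected = random.random()

# Reseed, run the sampler with its own seed, then draw from the global
# generator again: the value is unchanged because the sampler is isolated.
random.seed(7)
picked = take_sample_isolated(list(range(50)), 5, seed=123)
global_unaffected = random.random() == expected
```

Had the helper called `random.seed(seed)` instead, it would have reset the shared generator as a side effect for every other caller in the process.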
asfgit pushed a commit that referenced this pull request Apr 25, 2014
pwendell added a commit to pwendell/spark that referenced this pull request May 12, 2014
Remove Typesafe Config usage and conf files to fix nested property names

With Typesafe Config we had the subtle problem of no longer allowing
nested property names, which are used for a few of our properties:
http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html

This PR is for branch 0.9 but should be added into master too.
(cherry picked from commit 34e911c)

Signed-off-by: Patrick Wendell <[email protected]>
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Jan 8, 2015
erictu pushed a commit to erictu/interval-tree that referenced this pull request Sep 16, 2015
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Nov 7, 2017
j-esse pushed a commit to j-esse/spark that referenced this pull request Jan 24, 2019
## Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)

https://issues.apache.org/jira/browse/SPARK-26200

## What changes were proposed in this pull request?

Row type is handled differently depending on _needSerializeAnyField value. When _needSerializeAnyField, Row is handled as tuple which leads to column values being transposed (see upstream ticket for details).

## How was this patch tested?

Unit test.
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
…esignate

Remove enable_services of designate devstack conf
RolatZhang pushed a commit to RolatZhang/spark that referenced this pull request Aug 15, 2022