
[SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, Java #3094


Closed
wants to merge 3 commits into from

Conversation

jkbradley
Member

Summary

  • Made it easier to construct default Strategy and BoostingStrategy and to set parameters using simple types.
  • Added Scala and Java examples for GradientBoostedTrees
  • small cleanups and fixes

Details

GradientBoosting bug fixes (“bug” = bad default options)

  • Force boostingStrategy.weakLearnerParams.algo = Regression
  • Force boostingStrategy.weakLearnerParams.impurity = impurity.Variance
  • Only persist data if not yet persisted (since it causes an error if persisted twice)

BoostingStrategy

  • numEstimators: renamed to numIterations
  • removed subsamplingRate (duplicated by Strategy)
  • removed categoricalFeaturesInfo since it belongs with the weak learner params (since boosting can be oblivious to feature type)
  • Changed algo to var (not val) and added @BeanProperty, with overload taking String argument
  • Added assertValid() method
  • Updated defaultParams() method and eliminated defaultWeakLearnerParams() since that belongs in Strategy
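The String overload for `algo` mentioned above can be illustrated with a small stand-alone sketch (the enum and class here are simplified stand-ins for illustration, not the actual MLlib source):

```scala
// Simplified stand-in for the pattern described above: `algo` is a var
// (not a val), and a String-taking overload lets Java callers avoid
// Scala enumeration types. Names mirror the PR description only.
object Algo extends Enumeration {
  val Classification, Regression = Value
}

class BoostingStrategySketch {
  var algo: Algo.Value = Algo.Regression  // var, so callers can reassign it

  // String overload for Java interoperability, e.g. setAlgo("Classification")
  def setAlgo(algo: String): Unit = this.algo = algo match {
    case "Classification" => Algo.Classification
    case "Regression"     => Algo.Regression
    case other => throw new IllegalArgumentException(s"Unknown algo: $other")
  }
}

val strategy = new BoostingStrategySketch
strategy.setAlgo("Classification")
println(strategy.algo)  // Classification
```

Accepting plain Strings (and rejecting unknown ones at call time) is what makes the setter usable from Java without importing Scala enumeration internals.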

Strategy (for DecisionTree)

  • Changed algo to var (not val) and added @BeanProperty, with overload taking String argument
  • Added setCategoricalFeaturesInfo method taking Java Map.
  • Cleaned up assertValid
  • Changed val’s to def’s since parameters can now be changed.
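The Java-Map setter noted above boils down to a boxed-to-unboxed map conversion. A hedged sketch (class and field names are illustrative, not the MLlib source):

```scala
import scala.collection.JavaConverters._

// Illustrative sketch of a setter that accepts a java.util.Map, converting
// boxed java.lang.Integer keys/values to a Scala Map[Int, Int].
class StrategySketch {
  var categoricalFeaturesInfo: Map[Int, Int] = Map.empty

  def setCategoricalFeaturesInfo(
      javaMap: java.util.Map[java.lang.Integer, java.lang.Integer]): Unit = {
    categoricalFeaturesInfo =
      javaMap.asScala.map { case (k, v) => (k.intValue, v.intValue) }.toMap
  }
}

val jm = new java.util.HashMap[java.lang.Integer, java.lang.Integer]()
jm.put(0, 2)  // feature 0 is categorical with 2 categories
jm.put(3, 5)  // feature 3 has 5 categories
val strategy = new StrategySketch
strategy.setCategoricalFeaturesInfo(jm)
println(strategy.categoricalFeaturesInfo)
```

Without such an overload, Java callers would have to construct a `scala.collection.immutable.Map[Int, Int]` by hand, which is awkward from Java code.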

CC: @manishamde @mengxr @codedeft

@jkbradley
Member Author

By the way, I had made a JIRA for this, but the website seems down, so I can't look up the JIRA number. I'll tag it later.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22888 has started for PR 3094 at commit e9b8410.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22888 has finished for PR 3094 at commit e9b8410.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaGradientBoostedTrees
    • case class Params(

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22888/
Test FAILed.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22889 has started for PR 3094 at commit 52013d5.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22889 has finished for PR 3094 at commit 52013d5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaGradientBoostedTrees
    • case class Params(
    • class RDDFunctions[T: ClassTag](self: RDD[T]) extends Serializable

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22889/
Test FAILed.

@manishamde
Contributor

@jkbradley Thanks! I will take a look and get back.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22900 has started for PR 3094 at commit 7a27e22.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22900 has finished for PR 3094 at commit 7a27e22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaGradientBoostedTrees
    • case class Params(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22900/
Test PASSed.

@codedeft

codedeft commented Nov 5, 2014

@jkbradley @manishamde Is there a story for TreeBoost improvement for Gradient Boosting? TreeBoost basically improves the gradient estimation at each iteration by re-calculating tree node predictions to minimize the loss values.

This can be done through an additional aggregation step after training each tree. To be efficient, it'll probably have to be added after optimizations on GB.
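For concreteness, the re-estimation step described above can be sketched for absolute-error loss, where the loss-minimizing leaf prediction is the median of the residuals routed to that leaf. This is a plain-Scala illustration under that assumption, not the MLlib implementation:

```scala
// TreeBoost-style refinement sketch: after fitting a regression tree to
// gradients, replace each leaf's prediction with the value minimizing the
// actual loss over the examples in that leaf. For absolute error that
// minimizer is the median (for squared error it is the mean, which the
// gradient-fitted tree already produces, so no refinement is needed there).
def median(xs: Seq[Double]): Double = {
  val sorted = xs.sorted
  val n = sorted.length
  if (n % 2 == 1) sorted(n / 2)
  else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
}

// residualsByLeaf: leaf id -> residuals (label - current ensemble prediction)
def refineLeaves(residualsByLeaf: Map[Int, Seq[Double]]): Map[Int, Double] =
  residualsByLeaf.map { case (leaf, rs) => leaf -> median(rs) }

val refined = refineLeaves(Map(
  0 -> Seq(1.0, 2.0, 10.0),  // median 2.0 is robust to the outlier 10.0
  1 -> Seq(-1.0, 0.0)))
println(refined)  // Map(0 -> 2.0, 1 -> -0.5)
```

In a distributed setting this is one extra aggregation pass per tree, and an exact median would likely be replaced by an approximate quantile.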

@manishamde
Contributor

@codedeft Not yet. I was planning to but forgot to do so. Feel free to create one or I can create it if you prefer.

You are correct. We need to add (possibly approximate) aggregations for distributed median, etc.

Another major task is to move to the new MLlib API once it's ready and add support for conversion/prediction using internal formats.

Did I miss out on any other optimizations?

@jkbradley jkbradley changed the title [mllib] GradientBoosting API cleanup and examples in Scala, Java [SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, Java Nov 5, 2014
@codedeft

codedeft commented Nov 5, 2014

Sounds good. I'll create a story for this.

In addition to using internal formats for more efficiency, perhaps there are also some minor things such as separating label as a different RDD from features so that label can be updated while features stay the same.

Maybe all these are under the same MLLib API optimization umbrella.

@jkbradley
Member Author

@manishamde Your comment pretty much covers it, I believe. I hope these optimizations/parameters can remain within the same algorithm, but we should discuss it if any of them merit new APIs or separate implementations.

@jkbradley
Member Author

@codedeft The long-foretold new ML API will help with these things. A WIP PR with the Pipeline concept is out now, but a remake of the internal class hierarchy is still being designed. The class hierarchy in particular should make a lot of these things easier (e.g., separating labels and features).

You can see the Pipeline PR here: #3099

The class hierarchy JIRA is here: https://issues.apache.org/jira/browse/SPARK-3702. The design doc is a bit out of date, and I've been working on a branch of the Pipeline PR. It will take a bit of time for me to merge the new Pipeline changes, but I'll ping you when I have a branch ready. Basically, I'm trying to look ahead and think of general use cases and algorithms (mainly for prediction) to figure out good abstractions, while minimizing the burden on developers. Feedback would be awesome (though it might make sense to wait until I clean it up this week).

@@ -70,7 +70,7 @@ import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
*/
@Experimental
class Strategy (
Contributor


Note on binary compatibility: If we add new parameters later, it will change the constructor signature. Then we need to provide auxiliary constructors to maintain binary compatibility, which makes the code hard to maintain in the long run. We can hide constructors with parameters and only expose the one without any parameters. Then check the required parameters at runtime by adding def validate().
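The suggestion above (hide the parameterized constructors, expose only the no-arg one, and check required parameters at runtime) can be sketched as follows; names are illustrative, not MLlib code:

```scala
// Sketch of the binary-compatibility pattern suggested above: the primary
// constructor is private, so adding parameters later cannot break the
// public signature; callers use the no-arg constructor plus setters, and
// required parameters are checked at runtime by validate().
class ParamsSketch private (
    private var maxDepth: Option[Int],
    private var numIterations: Option[Int]) {

  def this() = this(None, None)  // the only public constructor

  def setMaxDepth(d: Int): this.type = { maxDepth = Some(d); this }
  def setNumIterations(n: Int): this.type = { numIterations = Some(n); this }

  // Runtime check replaces compile-time enforcement via constructor args.
  def validate(): Unit = {
    require(maxDepth.isDefined, "maxDepth must be set")
    require(numIterations.isDefined, "numIterations must be set")
  }
}

val p = new ParamsSketch().setMaxDepth(5).setNumIterations(100)
p.validate()  // passes; calling validate() before setting both would throw
```

The trade-off is that a missing required parameter surfaces as a runtime error rather than a compile error, in exchange for a constructor signature that never changes.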

Member Author


This has been a long-standing problem, with several iterations of adding new parameters. I agree with your suggested approach, but had stuck with the old one for consistency. I'm OK with changing the API. Should we now?

@codedeft

codedeft commented Nov 5, 2014

@jkbradley @manishamde @mengxr
This is probably not the right place to communicate this. But FYI, I created a separate story for refining tree predictions for GB.

https://issues.apache.org/jira/browse/SPARK-4240

@manishamde
Contributor

@codedeft Thanks for creating the JIRA and informing us.

asfgit pushed a commit that referenced this pull request Nov 5, 2014
…la, Java


Author: Joseph K. Bradley <[email protected]>

Closes #3094 from jkbradley/gbt-api and squashes the following commits:

7a27e22 [Joseph K. Bradley] scalastyle fix
52013d5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into gbt-api
e9b8410 [Joseph K. Bradley] Summary of changes

(cherry picked from commit 5b3b6f6)
Signed-off-by: Xiangrui Meng <[email protected]>
@mengxr
Contributor

mengxr commented Nov 5, 2014

LGTM. I've merged this into master and branch-1.2. Thanks! (We need to think about binary compatibility when we remove the @experimental tag.)

@asfgit asfgit closed this in 5b3b6f6 Nov 5, 2014
@jkbradley jkbradley deleted the gbt-api branch December 4, 2014 20:28
6 participants