[SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, Java #3094
Conversation
### Summary

* Made it easier to construct default Strategy and BoostingStrategy and to set parameters using simple types.
* Added Scala and Java examples for GradientBoostedTrees
* Small cleanups and fixes

### Details

GradientBoosting bug fixes ("bug" = bad default options)

* Force boostingStrategy.weakLearnerParams.algo = Regression
* Force boostingStrategy.weakLearnerParams.impurity = impurity.Variance
* Only persist data if not yet persisted (since persisting twice causes an error)

BoostingStrategy

* numEstimators: renamed to numIterations
* Removed subsamplingRate (duplicated by Strategy)
* Removed categoricalFeaturesInfo since it belongs with the weak learner params (boosting can be oblivious to feature type)
* Changed algo to var (not val) and added @BeanProperty, with an overload taking a String argument
* Added assertValid() method
* Updated defaultParams() method and eliminated defaultWeakLearnerParams(), since that belongs in Strategy

Strategy (for DecisionTree)

* Changed algo to var (not val) and added @BeanProperty, with an overload taking a String argument
* Added setCategoricalFeaturesInfo method taking a Java Map
* Cleaned up assertValid
* Changed vals to defs since parameters can now be changed

CC: @manishamde @mengxr @codedeft
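As a usage illustration, here is a minimal sketch of the resulting API. It assumes the names from the notes above (`BoostingStrategy.defaultParams(String)`, `weakLearnerParams`, and a `GradientBoostedTrees.train` entry point), which may differ slightly from the exact code in this commit:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils

object GradientBoostingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GradientBoostingExample"))
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").cache()

    // Start from defaults; "Regression" picks the forced weak-learner settings
    // (algo = Regression, impurity = Variance) described above.
    val boostingStrategy = BoostingStrategy.defaultParams("Regression")
    boostingStrategy.numIterations = 10             // renamed from numEstimators
    boostingStrategy.weakLearnerParams.maxDepth = 3 // weak-learner params live on Strategy

    val model = GradientBoostedTrees.train(data, boostingStrategy)

    // Evaluate training MSE as a quick sanity check.
    val mse = data.map { p =>
      val err = model.predict(p.features) - p.label
      err * err
    }.mean()
    println(s"Training MSE: $mse")
    sc.stop()
  }
}
```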
By the way, I had made a JIRA for this, but the website seems down, so I can't look up the JIRA number. I'll tag it later.
Test build #22888 has started for PR 3094 at commit
Test build #22888 has finished for PR 3094 at commit
Test FAILed.
Test build #22889 has started for PR 3094 at commit
Test build #22889 has finished for PR 3094 at commit
Test FAILed.
@jkbradley Thanks! I will take a look and get back.
Test build #22900 has started for PR 3094 at commit
Test build #22900 has finished for PR 3094 at commit
Test PASSed.
@jkbradley @manishamde Is there a story for the TreeBoost improvement to Gradient Boosting? TreeBoost basically improves the gradient estimate at each iteration by re-calculating the tree node predictions so that they minimize the loss directly. This can be done with an additional aggregation step after training each tree. To be efficient, it will probably have to be added after the planned optimizations on GB.
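For readers unfamiliar with TreeBoost: the refinement described above is the leaf re-estimation step from Friedman's gradient boosting formulation. After the m-th tree is fit to the pseudo-residuals, each leaf's prediction is replaced by the constant that minimizes the actual loss over the points in that leaf (standard notation, not taken from this thread):

```latex
% For each leaf region R_{jm} of the m-th tree, pick the constant that
% minimizes the true loss L, rather than keeping the fitted mean residual:
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i,\, F_{m-1}(x_i) + \gamma\bigr)
% Then update the ensemble with learning rate \nu:
F_m(x) = F_{m-1}(x) + \nu \sum_{j} \gamma_{jm}\, \mathbf{1}\{x \in R_{jm}\}
```

Computing each $\gamma_{jm}$ is exactly the "additional aggregation step after training each tree" mentioned above.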
@codedeft Not yet. I was planning to but forgot. Feel free to create one, or I can create it if you prefer. You are correct: we need to add (possibly approximate) aggregations for distributed median, etc. Another major task is to move to the new MLlib API once it's ready and to add support for conversion/prediction using internal formats. Did I miss any other optimizations?
Sounds good. I'll create a story for this. In addition to using internal formats for efficiency, there are also some smaller things, such as keeping the labels in a separate RDD from the features so that labels can be updated while the features stay the same. Maybe all of these fall under the same MLlib API optimization umbrella.
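A minimal sketch of that label/feature separation (a hypothetical helper, not code from this PR): cache the feature RDD once and zip in freshly computed labels, e.g. the per-iteration pseudo-residuals, whenever a weak learner needs labeled points:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: features stay cached and unchanged across boosting
// iterations; only the labels RDD (e.g. pseudo-residuals) is recomputed.
// Requires both RDDs to share partitioning and element order.
def relabel(features: RDD[Vector], labels: RDD[Double]): RDD[LabeledPoint] =
  features.zip(labels).map { case (f, y) => LabeledPoint(y, f) }
```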
@manishamde Your comment pretty much covers it, I believe. I hope these optimizations/parameters can remain within the same algorithm, but we should discuss it if any of them merit new APIs or separate implementations.
@codedeft The long-foretold new ML API will help with these things. A WIP PR with the Pipeline concept is out now, but a remake of the internal class hierarchy is still being designed. The class hierarchy in particular should make a lot of these things easier (e.g., separating labels and features). You can see the Pipeline PR at apache/spark#3099, and the class hierarchy JIRA at https://issues.apache.org/jira/browse/SPARK-3702. The design doc is a bit out of date, and I've been working on a branch of the Pipeline PR. It will take a bit of time for me to merge the new Pipeline changes, but I'll ping you when I have a branch ready. Basically, I'm trying to look ahead and think of general use cases and algorithms (mainly for prediction) to figure out good abstractions while minimizing the burden on developers. Feedback would be awesome (though it might make sense to wait until I clean it up this week).
@@ -70,7 +70,7 @@ import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
 */
@Experimental
class Strategy (
Note on binary compatibility: if we add new parameters later, the constructor signature changes, and we would then need auxiliary constructors to maintain binary compatibility, which makes the code hard to maintain in the long run. We could hide the constructors that take parameters and expose only the no-argument one, then check the required parameters at runtime by adding a `def validate()` method.
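A minimal sketch of the proposed pattern (illustrative names only, not this PR's code): keep the parameterized constructor private, expose the no-argument constructor with mutable @BeanProperty fields, and check required parameters at runtime:

```scala
import scala.beans.BeanProperty

// Hypothetical, simplified stand-in for Strategy: adding a new field later
// only touches the private constructor, so the public (no-argument)
// constructor's signature never changes.
class ExampleStrategy private (
    @BeanProperty var algo: String,
    @BeanProperty var maxDepth: Int) {

  // Only the parameterless constructor is public.
  def this() = this(null, 5)

  // Required parameters are checked at runtime instead of at construction.
  def validate(): Unit = {
    require(algo == "Classification" || algo == "Regression",
      s"algo must be Classification or Regression, got: $algo")
    require(maxDepth >= 0, s"maxDepth must be nonnegative, got: $maxDepth")
  }
}
```

Callers would then write `val s = new ExampleStrategy(); s.setAlgo("Regression"); s.validate()`, so new parameters never break existing binaries.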
This has been a long-standing problem, through several iterations of adding new parameters. I agree with your suggested approach but had stuck with the old one for consistency. I'm OK with changing the API. Should we do it now?
@jkbradley @manishamde @mengxr |
@codedeft Thanks for creating the JIRA and informing us.
Merged as the following squashed commit (the commit message repeats the PR description above):

[SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, Java

Author: Joseph K. Bradley <[email protected]>

Closes #3094 from jkbradley/gbt-api and squashes the following commits:

7a27e22 [Joseph K. Bradley] scalastyle fix
52013d5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into gbt-api
e9b8410 [Joseph K. Bradley] Summary of changes

(cherry picked from commit 5b3b6f6)
Signed-off-by: Xiangrui Meng <[email protected]>
LGTM. I've merged this into master and branch-1.2. Thanks! (We need to think about binary compatibility when we remove the @Experimental tag.)