[SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, Java #3094
Conversation
### Summary

* Made it easier to construct default Strategy and BoostingStrategy and to set parameters using simple types.
* Added Scala and Java examples for GradientBoostedTrees
* Small cleanups and fixes

### Details

GradientBoosting bug fixes ("bug" = bad default options)

* Force boostingStrategy.weakLearnerParams.algo = Regression
* Force boostingStrategy.weakLearnerParams.impurity = impurity.Variance
* Only persist data if not yet persisted (since persisting twice causes an error)

BoostingStrategy

* numEstimators: renamed to numIterations
* Removed subsamplingRate (duplicated by Strategy)
* Removed categoricalFeaturesInfo since it belongs with the weak learner params (boosting can be oblivious to feature type)
* Changed algo to var (not val) and added @BeanProperty, with an overload taking a String argument
* Added assertValid() method
* Updated defaultParams() method and eliminated defaultWeakLearnerParams(), since that belongs in Strategy

Strategy (for DecisionTree)

* Changed algo to var (not val) and added @BeanProperty, with an overload taking a String argument
* Added setCategoricalFeaturesInfo method taking a Java Map
* Cleaned up assertValid
* Changed vals to defs since parameters can now be changed

CC: @manishamde @mengxr @codedeft
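As a usage illustration, here is a minimal sketch of the resulting API. It assumes the names from the notes above (`BoostingStrategy.defaultParams(String)`, `weakLearnerParams`, and a `GradientBoostedTrees.train` entry point), which may differ slightly from the exact code in this commit:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils

object GradientBoostingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GradientBoostingExample"))
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").cache()

    // Start from defaults; "Regression" picks the forced weak-learner settings
    // (algo = Regression, impurity = Variance) described above.
    val boostingStrategy = BoostingStrategy.defaultParams("Regression")
    boostingStrategy.numIterations = 10             // renamed from numEstimators
    boostingStrategy.weakLearnerParams.maxDepth = 3 // weak-learner params live on Strategy

    val model = GradientBoostedTrees.train(data, boostingStrategy)

    // Evaluate training MSE as a quick sanity check.
    val mse = data.map { p =>
      val err = model.predict(p.features) - p.label
      err * err
    }.mean()
    println(s"Training MSE: $mse")
    sc.stop()
  }
}
```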
By the way, I had made a JIRA for this, but the website seems down, so I can't look up the JIRA number. I'll tag it later.
Test build #22888 has started for PR 3094 at commit
Test build #22888 has finished for PR 3094 at commit
Test FAILed.
Test build #22889 has started for PR 3094 at commit
Test build #22889 has finished for PR 3094 at commit
Test FAILed.
@jkbradley Thanks! I will take a look and get back.
Test build #22900 has started for PR 3094 at commit
Test build #22900 has finished for PR 3094 at commit
Test PASSed.
@jkbradley @manishamde Is there a story for the TreeBoost improvement to Gradient Boosting? TreeBoost basically improves the gradient estimate at each iteration by re-calculating the tree node predictions so that they minimize the loss directly. This can be done with an additional aggregation step after training each tree. To be efficient, it will probably have to be added after the planned optimizations on GB.
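For readers unfamiliar with TreeBoost: the refinement described above is the leaf re-estimation step from Friedman's gradient boosting formulation. After the m-th tree is fit to the pseudo-residuals, each leaf's prediction is replaced by the constant that minimizes the actual loss over the points in that leaf (standard notation, not taken from this thread):

```latex
% For each leaf region R_{jm} of the m-th tree, pick the constant that
% minimizes the true loss L, rather than keeping the fitted mean residual:
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i,\, F_{m-1}(x_i) + \gamma\bigr)
% Then update the ensemble with learning rate \nu:
F_m(x) = F_{m-1}(x) + \nu \sum_{j} \gamma_{jm}\, \mathbf{1}\{x \in R_{jm}\}
```

Computing each $\gamma_{jm}$ is exactly the "additional aggregation step after training each tree" mentioned above.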
@codedeft Not yet. I was planning to but forgot. Feel free to create one, or I can create it if you prefer. You are correct: we need to add (possibly approximate) aggregations for distributed median, etc. Another major task is to move to the new MLlib API once it's ready and to add support for conversion/prediction using internal formats. Did I miss any other optimizations?
Sounds good. I'll create a story for this. In addition to using internal formats for efficiency, there are also some smaller things, such as keeping the labels in a separate RDD from the features so that labels can be updated while the features stay the same. Maybe all of these fall under the same MLlib API optimization umbrella.
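A minimal sketch of that label/feature separation (a hypothetical helper, not code from this PR): cache the feature RDD once and zip in freshly computed labels, e.g. the per-iteration pseudo-residuals, whenever a weak learner needs labeled points:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: features stay cached and unchanged across boosting
// iterations; only the labels RDD (e.g. pseudo-residuals) is recomputed.
// Requires both RDDs to share partitioning and element order.
def relabel(features: RDD[Vector], labels: RDD[Double]): RDD[LabeledPoint] =
  features.zip(labels).map { case (f, y) => LabeledPoint(y, f) }
```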
@manishamde Your comment pretty much covers it, I believe. I hope these optimizations/parameters can remain within the same algorithm, but we should discuss it if any of them merit new APIs or separate implementations.
@codedeft The long-foretold new ML API will help with these things. A WIP PR with the Pipeline concept is out now, but a remake of the internal class hierarchy is still being designed. The class hierarchy in particular should make a lot of these things easier (e.g., separating labels and features). You can see the Pipeline PR at apache/spark#3099, and the class hierarchy JIRA at https://issues.apache.org/jira/browse/SPARK-3702. The design doc is a bit out of date, and I've been working on a branch of the Pipeline PR. It will take a bit of time for me to merge the new Pipeline changes, but I'll ping you when I have a branch ready. Basically, I'm trying to look ahead and think of general use cases and algorithms (mainly for prediction) to figure out good abstractions while minimizing the burden on developers. Feedback would be awesome (though it might make sense to wait until I clean it up this week).
@@ -70,7 +70,7 @@ import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
 */
@Experimental
class Strategy (
Note on binary compatibility: if we add new parameters later, the constructor signature changes, and we would then need auxiliary constructors to maintain binary compatibility, which makes the code hard to maintain in the long run. We could hide the constructors that take parameters and expose only the no-argument one, then check the required parameters at runtime by adding a `def validate()` method.
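A minimal sketch of the proposed pattern (illustrative names only, not this PR's code): keep the parameterized constructor private, expose the no-argument constructor with mutable @BeanProperty fields, and check required parameters at runtime:

```scala
import scala.beans.BeanProperty

// Hypothetical, simplified stand-in for Strategy: adding a new field later
// only touches the private constructor, so the public (no-argument)
// constructor's signature never changes.
class ExampleStrategy private (
    @BeanProperty var algo: String,
    @BeanProperty var maxDepth: Int) {

  // Only the parameterless constructor is public.
  def this() = this(null, 5)

  // Required parameters are checked at runtime instead of at construction.
  def validate(): Unit = {
    require(algo == "Classification" || algo == "Regression",
      s"algo must be Classification or Regression, got: $algo")
    require(maxDepth >= 0, s"maxDepth must be nonnegative, got: $maxDepth")
  }
}
```

Callers would then write `val s = new ExampleStrategy(); s.setAlgo("Regression"); s.validate()`, so new parameters never break existing binaries.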
This has been a long-standing problem, through several iterations of adding new parameters. I agree with your suggested approach but had stuck with the old one for consistency. I'm OK with changing the API. Should we do it now?
@jkbradley @manishamde @mengxr |
@codedeft Thanks for creating the JIRA and informing us.
Merged as the following squashed commit (the commit message repeats the PR description above):

[SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, Java

Author: Joseph K. Bradley <[email protected]>

Closes #3094 from jkbradley/gbt-api and squashes the following commits:

7a27e22 [Joseph K. Bradley] scalastyle fix
52013d5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into gbt-api
e9b8410 [Joseph K. Bradley] Summary of changes

(cherry picked from commit 5b3b6f6)
Signed-off-by: Xiangrui Meng <[email protected]>
LGTM. I've merged this into master and branch-1.2. Thanks! (We need to think about binary compatibility when we remove the @Experimental tag.)