[SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix #3461
… and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification). Added examples to programming guide for all ensembles.
Test build #23852 has started for PR 3461 at commit
Test build #23852 has finished for PR 3461 at commit
Test FAILed.
…all bug in same example in the programming guide.
Test build #23858 has started for PR 3461 at commit
Test build #23858 has finished for PR 3461 at commit
Test FAILed.
Test build #23861 has started for PR 3461 at commit
Note: I'm working on updating the decision tree programming guide further too (with more info about parameters).
Test build #23861 has finished for PR 3461 at commit
Test PASSed.
Test build #23879 has started for PR 3461 at commit
OK! I think everything's updated, though I'm sure people will have feedback.
Test build #23879 has finished for PR 3461 at commit
Test PASSed.
## Usage tips

We include a few guidelines for using decision trees by discussing the various parameters.
There are many parameters, put in order here with the most important first. New users should mainly consider the "Problem specification parameters" section below and the `maxDepth` parameter.
Rephrase? The parameters are listed here in descending order of importance.
will do
Test build #23997 has started for PR 3461 at commit
Test build #23997 has finished for PR 3461 at commit
Test PASSed.
@jkbradley The GBDT section looks good to me, but the subsection on Comparison with RFs could possibly be moved towards the end. It breaks the flow in my opinion.
@jkbradley minor: Shall we merge RF and GBT into a single section called "tree ensembles (random forests and gradient-boosted trees)" (on the same level as decision trees)? Then we can move the comparison part to the bottom (or to the very beginning).
@mengxr Sure, that seems like a good solution to the suggestion from @manishamde
@mengxr @manishamde Just pushed an update. Let me know if there's anything else. Thanks!
Test build #24116 has started for PR 3461 at commit
Test build #24117 has started for PR 3461 at commit
Test build #24116 has finished for PR 3461 at commit
Test PASSed.
@jkbradley LGTM! (I don't know a fix to include code examples in jekyll markdown)
LGTM. Thanks a lot for the user doc and migration guide! Waiting for Jenkins ...
@mengxr Don't merge yet, one more small update.
@mengxr OK should be good to go if the update looks OK to you.
Test build #24121 has started for PR 3461 at commit
Test build #24117 has finished for PR 3461 at commit
Test PASSed.
It looks good. I've merged it into master and branch-1.2. Thanks!!
…bles + DecisionTree API fix

Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)

Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
* Use train/test split, and compute test error instead of training error.
* Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)

Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming

I have run all examples and relevant unit tests.

CC: mengxr manishamde codedeft

Author: Joseph K. Bradley <[email protected]>
Author: Joseph K. Bradley <[email protected]>

Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:

70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file. added header. Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split. tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide. still need to update their examples

(cherry picked from commit 657a888)
Signed-off-by: Xiangrui Meng <[email protected]>
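For readers skimming this thread, the two changes most visible in the updated examples (the `numClasses` parameter name, and reporting test error on a held-out split rather than training error) can be sketched roughly as follows. This is an illustrative sketch, not the code from the PR: `random_split`, `error_rate`, and `train_and_evaluate` are hypothetical helper names, the 70/30 split and the tree settings are assumptions, and `numClasses=2` assumes a binary-labelled dataset. The first two helpers are plain Python so the evaluation logic can be checked without a cluster; the last one needs a working PySpark installation.

```python
import random


def random_split(rows, train_fraction=0.7, seed=42):
    """Randomly partition rows into (train, test) lists (hypothetical helper)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row independently lands in train with probability train_fraction.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


def error_rate(labels_and_predictions):
    """Fraction of (label, prediction) pairs that disagree."""
    pairs = list(labels_and_predictions)
    if not pairs:
        raise ValueError("no pairs to evaluate")
    wrong = sum(1 for label, pred in pairs if label != pred)
    return wrong / float(len(pairs))


def train_and_evaluate(sc, path):
    """PySpark variant. Assumes `sc` is a SparkContext and `path` is a LIBSVM file."""
    # Imported lazily so the helpers above work without Spark installed.
    from pyspark.mllib.tree import DecisionTree
    from pyspark.mllib.util import MLUtils

    data = MLUtils.loadLibSVMFile(sc, path)
    training, test = data.randomSplit([0.7, 0.3])
    # Note the parameter name: numClasses, not numClassesForClassification.
    model = DecisionTree.trainClassifier(
        training, numClasses=2, categoricalFeaturesInfo={},
        impurity="gini", maxDepth=5, maxBins=32)
    predictions = model.predict(test.map(lambda lp: lp.features))
    pairs = test.map(lambda lp: lp.label).zip(predictions)
    # Error is measured on the held-out test split only.
    return error_rate(pairs.collect())
```

Collecting the pairs to the driver keeps the sketch short; on a large test set one would more likely count mismatches on the RDD itself.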
Test build #24121 has finished for PR 3461 at commit
Test FAILed.