[SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix #3461

Closed
wants to merge 14 commits into from

jkbradley
Member

Major changes:

  • Added programming guide sections for tree ensembles
  • Added examples for tree ensembles
  • Updated DecisionTree programming guide with more info on parameters
  • API change: Standardized the tree parameter for the number of classes (for classification)

Minor changes:

  • Updated decision tree documentation
  • Updated existing tree and tree ensemble examples
    • Use train/test split, and compute test error instead of training error.
    • Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)
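For readers less familiar with the distinction, here is a minimal sketch of the train/test pattern described above, including deriving the class count from the data. It uses plain Python and a hypothetical majority-class predictor instead of the MLlib API; `split_data`, `count_classes`, `train_majority`, and `held_out_error` are illustrative names and are not taken from the updated examples.

```python
import random


def split_data(points, train_fraction=0.7, seed=42):
    """Shuffle labeled points and split them into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def count_classes(points):
    """Number of classes, assuming labels are the integers 0, 1, ..., k-1."""
    return int(max(label for label, _ in points)) + 1


def train_majority(train_points):
    """Hypothetical stand-in model: always predict the most common training label."""
    counts = {}
    for label, _ in train_points:
        counts[label] = counts.get(label, 0) + 1
    majority = max(counts, key=counts.get)
    return lambda features: majority


def held_out_error(model, points):
    """Fraction of points the model labels incorrectly."""
    if not points:
        return 0.0
    wrong = sum(1 for label, features in points if model(features) != label)
    return wrong / len(points)


# Toy data: (label, features) pairs with three classes.
data = [(i % 3, [float(i)]) for i in range(30)]
train_set, test_set = split_data(data)
num_classes = count_classes(data)   # computed from the data, then actually used
model = train_majority(train_set)   # fit on the training split only
print("numClasses =", num_classes)
print("Test error =", held_out_error(model, test_set))
```

The point of the sketch is only that the error is measured on points the model never saw during training, which gives a less optimistic estimate than training error.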

Note: I know this is a lot of lines, but most is covered by:

  • Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.)
  • New examples (which were copied from the programming guide)
  • The "numClasses" renaming

I have run all examples and relevant unit tests.

CC: @mengxr @manishamde @codedeft

@SparkQA

SparkQA commented Nov 26, 2014

Test build #23852 has started for PR 3461 at commit 706d332.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 26, 2014

Test build #23852 has finished for PR 3461 at commit 706d332.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaRandomForestClassification
    • public final class JavaRandomForestRegression

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23852/

…all bug in same example in the programming guide.
@SparkQA

SparkQA commented Nov 26, 2014

Test build #23858 has started for PR 3461 at commit 2b60b6e.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 26, 2014

Test build #23858 has finished for PR 3461 at commit 2b60b6e.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaRandomForestExample

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23858/

@SparkQA

SparkQA commented Nov 26, 2014

Test build #23861 has started for PR 3461 at commit 375204c.

  • This patch merges cleanly.

@jkbradley
Member Author

Note: I'm working on updating the decision tree programming guide further too (with more info about parameters).

@SparkQA

SparkQA commented Nov 26, 2014

Test build #23861 has finished for PR 3461 at commit 375204c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaRandomForestExample

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23861/

@SparkQA

SparkQA commented Nov 26, 2014

Test build #23879 has started for PR 3461 at commit b9f8576.

  • This patch merges cleanly.

@jkbradley
Member Author

OK! I think everything's updated, though I'm sure people will have feedback.

@SparkQA

SparkQA commented Nov 26, 2014

Test build #23879 has finished for PR 3461 at commit b9f8576.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaRandomForestExample

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23879/

## Usage tips

We include a few guidelines for using decision trees by discussing the various parameters.
There are many parameters, put in order here with the most important first. New users should mainly consider the "Problem specification parameters" section below and the `maxDepth` parameter.
Contributor

Rephrase? The parameters are listed here in descending order of importance.

Member Author

will do

@SparkQA

SparkQA commented Dec 1, 2014

Test build #23997 has started for PR 3461 at commit 6fab846.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 1, 2014

Test build #23997 has finished for PR 3461 at commit 6fab846.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaRandomForestExample

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23997/

@manishamde
Contributor

@jkbradley The GBDT section looks good to me, but the subsection on comparison with RFs could be moved towards the end. It breaks the flow, in my opinion.

@mengxr
Contributor

mengxr commented Dec 2, 2014

@jkbradley minor: Shall we merge RF and GBT into a single section called "tree ensembles (random forests and gradient-boosted trees)" (on the same level as decision trees)? Then we can move the comparison part to the bottom (or to the very beginning).

@jkbradley
Member Author

@mengxr Sure, that seems like a good solution to the suggestion from @manishamde
Will do.

@jkbradley
Member Author

@mengxr @manishamde Just pushed an update. Let me know if there's anything else--thanks!
Do you happen to know if there's a way to include the code examples (the .scala or .java or .py files themselves) in the Markdown (to avoid copying the code in manually)? I can't find a way (since it seems like jekyll expects any includes to be in a special includes folder).

@SparkQA

SparkQA commented Dec 3, 2014

Test build #24116 has started for PR 3461 at commit 8e87f8f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 4, 2014

Test build #24117 has started for PR 3461 at commit d1de753.

  • This patch merges cleanly.

@jkbradley changed the title from "[SPARK-4580] [SPARK-4610] [mllib] Documentation for tree ensembles + DecisionTree API fix" to "[SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix" on Dec 4, 2014
@SparkQA

SparkQA commented Dec 4, 2014

Test build #24116 has finished for PR 3461 at commit 8e87f8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaRandomForestExample

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24116/

@manishamde
Contributor

@jkbradley LGTM!

(I don't know of a way to include code examples in Jekyll Markdown.)

@mengxr
Contributor

mengxr commented Dec 4, 2014

LGTM. Thanks a lot for the user doc and migration guide! Waiting for Jenkins ...

@jkbradley
Member Author

@mengxr Don't merge yet, one more small update.

@jkbradley
Member Author

@mengxr OK should be good to go if the update looks OK to you.

@SparkQA

SparkQA commented Dec 4, 2014

Test build #24121 has started for PR 3461 at commit 70a75f3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 4, 2014

Test build #24117 has finished for PR 3461 at commit d1de753.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaRandomForestExample

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24117/

@mengxr
Contributor

mengxr commented Dec 4, 2014

It looks good. I've merged it into master and branch-1.2. Thanks!!

asfgit pushed a commit that referenced this pull request Dec 4, 2014
…bles + DecisionTree API fix

Author: Joseph K. Bradley <[email protected]>
Author: Joseph K. Bradley <[email protected]>

Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:

70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file.  added header.  Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split.  tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide.  still need to update their examples

(cherry picked from commit 657a888)
Signed-off-by: Xiangrui Meng <[email protected]>
@asfgit closed this in 657a888 on Dec 4, 2014
@SparkQA

SparkQA commented Dec 4, 2014

Test build #24121 has finished for PR 3461 at commit 70a75f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaRandomForestExample

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24121/

@jkbradley deleted the ensemble-docs branch on December 4, 2014 20:27