forked from apache/spark
Commit 407ea9f
[SPARK-3022] [SPARK-3041] [mllib] Call findBins once per level + unordered feature bug fix
DecisionTree improvements:
(1) TreePoint representation to avoid binning multiple times
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
(3) Timing for DecisionTree internals
Details:
(1) TreePoint representation to avoid binning multiple times
[https://issues.apache.org/jira/browse/SPARK-3022]
Added private[tree] TreePoint class for representing binned feature values.
The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached. This avoids the previous problem of re-computing bins multiple times.
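The idea of binning once up front can be sketched in plain Scala, without Spark. This is an illustration only, not the code from this commit: `LabeledPointLite`, `TreePointSketch`, `TreePointConversion`, and the threshold-based `findBin` are hypothetical names, and only continuous features are handled.

```scala
// Hypothetical, simplified stand-ins; these are NOT the actual MLlib classes.
final case class LabeledPointLite(label: Double, features: Array[Double])
final case class TreePointSketch(label: Double, binnedFeatures: Array[Int])

object TreePointConversion {
  // Bin index for one continuous feature value, given ascending split thresholds.
  // Bin i holds values <= thresholds(i) that did not fit an earlier bin;
  // values above every threshold go to the last bin (index thresholds.length).
  def findBin(value: Double, thresholds: Array[Double]): Int = {
    val idx = thresholds.indexWhere(t => value <= t)
    if (idx == -1) thresholds.length else idx
  }

  // Bin every feature of every point exactly once. The result can then be
  // reused at every tree level instead of calling findBin again.
  def convert(
      points: Seq[LabeledPointLite],
      thresholds: Array[Array[Double]]): Seq[TreePointSketch] =
    points.map { p =>
      val bins = Array.tabulate(p.features.length) { f =>
        findBin(p.features(f), thresholds(f))
      }
      TreePointSketch(p.label, bins)
    }
}
```

In the real change the converted data is an RDD that is cached, so the per-level passes read the stored bin indices rather than recomputing them.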
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
[https://issues.apache.org/jira/browse/SPARK-3041]
isSampleValid used to treat unordered categorical features incorrectly: it treated the bins as if they were indexed by feature values, rather than by subsets of values/categories.
* The bug showed up only for unordered features (multi-class classification with categorical features of low arity).
* Fix: Index bins correctly for unordered categorical features.
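To make the distinction concrete, here is a minimal sketch of bins indexed by category subsets. The bitmask encoding (split index `s` stands for the subset whose bitmask is `s + 1`) is an assumption chosen for illustration, and `UnorderedSplits` is a hypothetical name; the encoding used inside MLlib may differ.

```scala
object UnorderedSplits {
  // Number of distinct binary partitions of `arity` categories, excluding
  // the empty split and mirror images: 2^(arity - 1) - 1.
  def numSplits(arity: Int): Int = (1 << (arity - 1)) - 1

  // Assumed encoding: split index s (0-based) represents the set of
  // categories whose bits are set in the bitmask s + 1.
  def categoriesOf(splitIndex: Int, arity: Int): Set[Int] =
    (0 until arity).filter(c => (((splitIndex + 1) >> c) & 1) == 1).toSet

  // A sample goes left under a split iff its category is in that split's subset.
  def goesLeft(category: Int, splitIndex: Int): Boolean =
    (((splitIndex + 1) >> category) & 1) == 1
}
```

With arity 3 there are three splits, with subsets {0}, {1}, and {0, 1}. Split index 2 therefore refers to the subset {0, 1}, not to category 2, which is the kind of confusion that indexing bins by feature value produces.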
(3) Timing for DecisionTree internals
Added tree/impl/TimeTracker.scala class which is private[tree] for now, for timing key parts of DT code.
Prints timing info via logDebug.
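A label-based timer of this kind might look roughly like the following. This is a hedged sketch, not the contents of TimeTracker.scala: the class name `TimeTrackerSketch` and its method names are assumptions.

```scala
import scala.collection.mutable

// Hypothetical sketch of a label-based timer; not the actual TimeTracker API.
class TimeTrackerSketch {
  private val starts = mutable.HashMap.empty[String, Long]
  private val totals = mutable.HashMap.empty[String, Long]

  // Begin timing the section identified by `label`.
  def start(label: String): Unit = {
    require(!starts.contains(label), s"timer '$label' already started")
    starts(label) = System.nanoTime()
  }

  // Stop timing `label`, add the elapsed time to its running total,
  // and return the elapsed time of this interval in seconds.
  def stop(label: String): Double = {
    val begin = starts.remove(label).getOrElse(
      throw new IllegalStateException(s"timer '$label' was never started"))
    val elapsed = System.nanoTime() - begin
    totals(label) = totals.getOrElse(label, 0L) + elapsed
    elapsed / 1e9
  }

  // Accumulated seconds for `label`, or 0.0 if it was never timed.
  def totalSeconds(label: String): Double = totals.getOrElse(label, 0L) / 1e9

  // One line per label, suitable for passing to a debug logger.
  override def toString: String =
    totals.toSeq.sortBy(_._1)
      .map { case (label, ns) => s"$label: ${ns / 1e9} s" }
      .mkString("\n")
}
```

Typical use would be to wrap a section between `start("someLabel")` and `stop("someLabel")`, then log `toString` at debug level once training finishes.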
CC: mengxr manishamde chouqin
Very similar update, with one bug fix. Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043
Author: Joseph K. Bradley <[email protected]>
Closes apache#1950 from jkbradley/dt-opt1 and squashes the following commits:
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review. 1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
430d782 [Joseph K. Bradley] Added more debug info on binning error. Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up. Removed debugging println calls from DecisionTree. Made TreePoint extend Serializable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
(cherry picked from commit c703229)
Signed-off-by: Xiangrui Meng <[email protected]>
1 parent 63376a0, commit 407ea9f
5 files changed (+449, -207 lines)
- mllib/src
- main/scala/org/apache/spark/mllib/tree
- configuration
- impl
- test/scala/org/apache/spark/mllib/tree