[SPARK-7555][docs] Add doc for elastic net in ml-guide and mllib-guide #6504
Test build #33760 has finished for PR 6504 at commit
@@ -157,6 +174,49 @@ There are now several algorithms in the Pipelines API which are not in the lower
* [Feature Extraction, Transformation, and Selection](ml-features.html)
* [Ensembles](ml-ensembles.html)

## Linear Methods with Elastic Net Regularization

In MLlib, we implement popular linear methods such as logistic regression and linear least squares with L1 or L2 regularization. Refer to [the linear methods section](mllib-linear-methods.html) for details. In `spark.ml`, we add the [Elastic net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf), which is a hybrid of L1 and L2 regularization. Mathematically it is defined as a linear combination of the L1-norm and the L2-norm:
> In `spark.ml`, we add the Elastic net, which is a hybrid of L1 and L2 regularization.
This makes it sound like it's only in spark.ml. Can this please instead say that we provide a Pipelines API? The main thing needed here is a code example, which should demonstrate how to do L1, L2, and a mix. (But I like the note about how it uses a different optimizer.)
@coderxiang Thanks for the doc! In addition to the link to the Elastic Net paper, could you please add a link to the Wikipedia page [http://en.wikipedia.org/wiki/Elastic_net_regularization]? I'll try generating the doc once the code examples are there.
I'm OK with using the Algorithm guide section for now; we can create a separate page later on.
@jkbradley thanks for the comments. I highlighted the Pipelines API in the revision and included the wiki page.
Test build #33859 has finished for PR 6504 at commit
Agreed. We should create a separate page for this. For example code, in https://github.com/apache/spark/pull/6576/files , I have full running apps in Scala, but these may not be suitable here, since in the documentation we probably only need a code snippet to demonstrate the APIs. @coderxiang can you add a Scala example based on my PR? Thanks.
@dbtsai sounds good. Let me start with your Scala code.
@coderxiang One more comment: Could you please mention the phrases "Lasso" and "Ridge Regression" in the spark.ml doc? E.g., say how to set elasticNetParam to get Lasso and Ridge Regression. That might help users who are not familiar with Elastic Net.
@jkbradley @dbtsai I removed the pipeline and also updated the definition of elastic net to use only one lambda.
Test build #34029 has finished for PR 6504 at commit
jenkins test this please
{% highlight scala %}

import scala.collection.mutable
Do you need these 2 imports?? I think you can remove these 2 + others below. Can you please see how many can be removed and still be able to run the example in the Spark shell?
Nice, it's much simpler now. The current PR looks good except for the Java/Python examples. Who is going to add them? If you need me to, then I can send an update to this PR.
@jkbradley I was planning to reuse the code from the code example that @dbtsai is working on. But feel free to update this PR, especially for the Python code.
@jkbradley: had an offline discussion with @dbtsai, I can work on the Java and Python code examples. What do you think?
What example do you want to work on? Example in the documentation or in the example code base?
@dbtsai the example code in this user guide. Are you working on the example code base? I'm sure they share lots of code.
Test build #34791 has finished for PR 6504 at commit
jenkins test this please
Test build #34790 has finished for PR 6504 at commit
Test build #34796 has finished for PR 6504 at commit
@dbtsai @jkbradley finished the Java and Python sample code.
Test build #34965 has finished for PR 6504 at commit
`\[
\alpha \|\wv\|_1 + (1-\alpha) \frac{1}{2}\|\wv\|_2^2, \alpha \in [0, 1].
\]`
By setting $\alpha$ properly, it contains both L1 and L2 regularization as special cases. For example, if a [linear regression](/api/scala/index.html#org.apache.spark.ml.regression.LinearRegression) model is trained with the elastic net parameter $\alpha$ set to $1$, it is equivalent to a [Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model. On the other hand, if $\alpha$ is set to $0$, the trained model reduces to a [ridge regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model. We implement Pipelines API for both linear regression and logistic regression with elastic net regularization.
It's odd to link to the Scala API for linear regression but Wikipedia for other models. How about all of these links go to Wikipedia (following what the spark.mllib guide does)?
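For readers following the thread, here is a minimal sketch of the penalty quoted in the diff hunk above, written in plain Python rather than against the Spark API. It only illustrates how $\alpha = 1$ and $\alpha = 0$ recover the L1 (Lasso) and L2 (ridge) terms. The function name `elastic_net_penalty` and its arguments are hypothetical names chosen for illustration and are not part of this PR.

```python
def elastic_net_penalty(weights, alpha):
    """Illustrative only: alpha * ||w||_1 + (1 - alpha) * 0.5 * ||w||_2^2.

    `weights` is any iterable of floats; `alpha` must lie in [0, 1].
    This is a hypothetical helper, not a Spark function.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(w) for w in weights)         # L1-norm of the weights
    l2_squared = sum(w * w for w in weights)  # squared L2-norm of the weights
    return alpha * l1 + (1.0 - alpha) * 0.5 * l2_squared


w = [3.0, -4.0]
lasso_penalty = elastic_net_penalty(w, 1.0)  # only the L1 term remains
ridge_penalty = elastic_net_penalty(w, 0.0)  # only the L2 term remains
mixed_penalty = elastic_net_penalty(w, 0.5)  # equal blend of both terms
```

With `w = [3.0, -4.0]` the L1-norm is 7 and the squared L2-norm is 25, so the three calls evaluate to 7.0, 12.5, and 9.75 respectively.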
@coderxiang My apologies for the long delay! It looks good, save for the minor comments above. Since there are merge issues, would you mind going ahead and moving this doc to a new Markdown file, linked from the algorithm section? That new file can be for Linear Methods and be called ml-linear-methods.md Thank you!
@jkbradley thanks for the comments! Just uploaded the change.
Test build #37093 has finished for PR 6504 at commit
@jkbradley is this one ready to go?
LGTM merging with master
jkbradley I put the elastic net under the **Algorithm guide** section. Also add the formula of elastic net in mllib-linear `mllib-linear-methods#regularizers`.

dbtsai I left the code tab for you to add example code. Do you think it is the right place?

Author: Shuo Xiang <[email protected]>

Closes #6504 from coderxiang/elasticnet and squashes the following commits:

f6061ee [Shuo Xiang] typo
90a7c88 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
0610a36 [Shuo Xiang] move out the elastic net to ml-linear-methods
8747190 [Shuo Xiang] merge master
706d3f7 [Shuo Xiang] add python code
9bc2b4c [Shuo Xiang] typo
db32a60 [Shuo Xiang] java code sample
aab3b3a [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
a0dae07 [Shuo Xiang] simplify code
d8616fd [Shuo Xiang] Update the definition of elastic net. Add scala code; Mention Lasso and Ridge
df5bd14 [Shuo Xiang] use wikipeida page in ml-linear-methods.md
78d9366 [Shuo Xiang] address comments
8ce37c2 [Shuo Xiang] Merge branch 'elasticnet' of github.com:coderxiang/spark into elasticnet
8f24848 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
998d766 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
89f10e4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
9262a72 [Shuo Xiang] update
7e07d12 [Shuo Xiang] update
b32f21a [Shuo Xiang] add doc for elastic net in sparkml
937eef1 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
180b496 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
aa0717d [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
98804c9 [Shuo Xiang] fix bug in topBykey and update test

(cherry picked from commit 303c120)
Signed-off-by: Joseph K. Bradley <[email protected]>