
Commit 303c120

coderxiang authored and jkbradley committed
[SPARK-7555] [DOCS] Add doc for elastic net in ml-guide and mllib-guide
jkbradley I put the elastic net under the **Algorithm guide** section. Also add the formula of elastic net in mllib-linear `mllib-linear-methods#regularizers`. dbtsai I left the code tab for you to add example code. Do you think it is the right place?

Author: Shuo Xiang <[email protected]>

Closes #6504 from coderxiang/elasticnet and squashes the following commits:

f6061ee [Shuo Xiang] typo
90a7c88 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
0610a36 [Shuo Xiang] move out the elastic net to ml-linear-methods
8747190 [Shuo Xiang] merge master
706d3f7 [Shuo Xiang] add python code
9bc2b4c [Shuo Xiang] typo
db32a60 [Shuo Xiang] java code sample
aab3b3a [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
a0dae07 [Shuo Xiang] simplify code
d8616fd [Shuo Xiang] Update the definition of elastic net. Add scala code; Mention Lasso and Ridge
df5bd14 [Shuo Xiang] use wikipeida page in ml-linear-methods.md
78d9366 [Shuo Xiang] address comments
8ce37c2 [Shuo Xiang] Merge branch 'elasticnet' of github.com:coderxiang/spark into elasticnet
8f24848 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
998d766 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
89f10e4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
9262a72 [Shuo Xiang] update
7e07d12 [Shuo Xiang] update
b32f21a [Shuo Xiang] add doc for elastic net in sparkml
937eef1 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
180b496 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
aa0717d [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
98804c9 [Shuo Xiang] fix bug in topBykey and update test
1 parent 9716a72 commit 303c120

File tree

3 files changed: +188 lines, −25 lines


docs/ml-guide.md

Lines changed: 31 additions & 0 deletions
@@ -3,6 +3,24 @@ layout: global
 title: Spark ML Programming Guide
 ---
 
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+
 Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
 high-level APIs that help users create and tune practical machine learning pipelines.
 
@@ -154,6 +172,19 @@ Parameters belong to specific instances of `Estimator`s and `Transformer`s.
 For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
 This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
 
+# Algorithm Guides
+
+There are now several algorithms in the Pipelines API which are not in the lower-level MLlib API, so we link to documentation for them here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in Pipelines.
+
+**Pipelines API Algorithm Guides**
+
+* [Feature Extraction, Transformation, and Selection](ml-features.html)
+* [Ensembles](ml-ensembles.html)
+
+**Algorithms in `spark.ml`**
+
+* [Linear methods with elastic net regularization](ml-linear-methods.html)
+
 # Code Examples
 
 This section gives code examples illustrating the functionality discussed above.
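An editorial aside on the `ParamMap` usage shown in the context lines above (a sketch, not part of this commit; it assumes a DataFrame `training` of labeled data like the ones used later in this commit): in the Python API a param map is expressed as a plain dict passed to `fit`, overriding the estimator's current settings for that one call.

{% highlight python %}
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression()
# Override maxIter and regParam for this single fit() call only.
model = lr.fit(training, {lr.maxIter: 10, lr.regParam: 0.3})
{% endhighlight %}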

docs/ml-linear-methods.md

Lines changed: 129 additions & 0 deletions (new file)

@@ -0,0 +1,129 @@
---
layout: global
title: Linear Methods - ML
displayTitle: <a href="ml-guide.html">ML</a> - Linear Methods
---


`\[
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\av}{\mathbf{\alpha}}
\newcommand{\bv}{\mathbf{b}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\id}{\mathbf{I}}
\newcommand{\ind}{\mathbf{1}}
\newcommand{\0}{\mathbf{0}}
\newcommand{\unit}{\mathbf{e}}
\newcommand{\one}{\mathbf{1}}
\newcommand{\zero}{\mathbf{0}}
\]`

In MLlib, we implement popular linear methods such as logistic regression and linear least squares with L1 or L2 regularization. Refer to [the linear methods in MLlib](mllib-linear-methods.html) for details. In `spark.ml`, we also include a Pipelines API for [elastic net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid of L1 and L2 regularization proposed in [this paper](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf). Mathematically, it is defined as a convex combination of the L1 norm and the squared L2 norm:
`\[
\alpha \|\wv\|_1 + (1-\alpha) \frac{1}{2}\|\wv\|_2^2, \quad \alpha \in [0, 1].
\]`
By setting $\alpha$ properly, elastic net contains both L1 and L2 regularization as special cases. For example, if a [linear regression](https://en.wikipedia.org/wiki/Linear_regression) model is trained with the elastic net parameter $\alpha$ set to $1$, it is equivalent to a [Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model. On the other hand, if $\alpha$ is set to $0$, the trained model reduces to a [ridge regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model. We implement the Pipelines API for both linear regression and logistic regression with elastic net regularization.
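For intuition, here is a minimal editorial sketch (not part of the committed docs) of the penalty defined above. In the examples that follow, `elasticNetParam` plays the role of $\alpha$, while `regParam` sets the overall regularization strength that scales this penalty.

{% highlight python %}
import numpy as np

def elastic_net_penalty(w, alpha):
    # alpha * ||w||_1 + (1 - alpha) * 0.5 * ||w||_2^2, with alpha in [0, 1]
    return alpha * np.sum(np.abs(w)) + (1 - alpha) * 0.5 * np.dot(w, w)

w = np.array([0.5, -1.0, 2.0])
print(elastic_net_penalty(w, 1.0))  # alpha = 1: pure L1 (Lasso) -> 3.5
print(elastic_net_penalty(w, 0.0))  # alpha = 0: pure L2 (ridge) -> 2.625
{% endhighlight %}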
**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">

{% highlight scala %}

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.util.MLUtils

// Load training data
val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(training)

// Print the weights and intercept for logistic regression
println(s"Weights: ${lrModel.weights} Intercept: ${lrModel.intercept}")

{% endhighlight %}

</div>

<div data-lang="java" markdown="1">
62+
63+
{% highlight java %}
64+
65+
import org.apache.spark.ml.classification.LogisticRegression;
66+
import org.apache.spark.ml.classification.LogisticRegressionModel;
67+
import org.apache.spark.mllib.regression.LabeledPoint;
68+
import org.apache.spark.mllib.util.MLUtils;
69+
import org.apache.spark.SparkConf;
70+
import org.apache.spark.SparkContext;
71+
import org.apache.spark.sql.DataFrame;
72+
import org.apache.spark.sql.SQLContext;
73+
74+
public class LogisticRegressionWithElasticNetExample {
75+
public static void main(String[] args) {
76+
SparkConf conf = new SparkConf()
77+
.setAppName("Logistic Regression with Elastic Net Example");
78+
79+
SparkContext sc = new SparkContext(conf);
80+
SQLContext sql = new SQLContext(sc);
81+
String path = "sample_libsvm_data.txt";
82+
83+
// Load training data
84+
DataFrame training = sql.createDataFrame(MLUtils.loadLibSVMFile(sc, path).toJavaRDD(), LabeledPoint.class);
85+
86+
LogisticRegression lr = new LogisticRegression()
87+
.setMaxIter(10)
88+
.setRegParam(0.3)
89+
.setElasticNetParam(0.8)
90+
91+
// Fit the model
92+
LogisticRegressionModel lrModel = lr.fit(training);
93+
94+
// Print the weights and intercept for logistic regression
95+
System.out.println("Weights: " + lrModel.weights() + " Intercept: " + lrModel.intercept());
96+
}
97+
}
98+
{% endhighlight %}
99+
</div>
100+
101+
<div data-lang="python" markdown="1">
102+
103+
{% highlight python %}
104+
105+
from pyspark.ml.classification import LogisticRegression
106+
from pyspark.mllib.regression import LabeledPoint
107+
from pyspark.mllib.util import MLUtils
108+
109+
# Load training data
110+
training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
111+
112+
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
113+
114+
# Fit the model
115+
lrModel = lr.fit(training)
116+
117+
# Print the weights and intercept for logistic regression
118+
print("Weights: " + str(lrModel.weights))
119+
print("Intercept: " + str(lrModel.intercept))
120+
{% endhighlight %}
121+
122+
</div>
123+
124+
</div>
125+
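A possible follow-up to the examples above (an editorial sketch, not from this commit): the L1 component of elastic net tends to drive weights to exactly zero, which can be checked directly on the fitted model from the Python tab.

{% highlight python %}
import numpy as np

# Assumes `lrModel` from the Python example above.
w = np.asarray(lrModel.weights.toArray())
print("zero weights: %d out of %d" % (int((w == 0).sum()), w.size))
{% endhighlight %}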
### Optimization

The optimization algorithm underlying the implementation is called [Orthant-Wise Limited-memory Quasi-Newton](http://research-srv.microsoft.com/en-us/um/people/jfgao/paper/icml07scalable.pdf)
(OWL-QN). It is an extension of L-BFGS that can effectively handle L1 regularization and elastic net.
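An editorial note on why OWL-QN suffices here (our reading, not stated in the commit): only the L1 term of the elastic net penalty is non-smooth; the L2 part is differentiable and can be folded into the smooth objective that standard L-BFGS already handles. With $f$ the data loss and $\lambda$ the overall regularization strength:

`\[
\min_{\wv} \; \underbrace{f(\wv) + \lambda (1-\alpha) \frac{1}{2} \|\wv\|_2^2}_{\text{smooth: standard L-BFGS}} + \underbrace{\lambda \alpha \|\wv\|_1}_{\text{non-smooth: OWL-QN}}
\]`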

docs/mllib-linear-methods.md

Lines changed: 28 additions & 25 deletions (most paired −/+ lines below display identically; the changes are apparently whitespace-only)
@@ -10,26 +10,26 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Linear Methods
 
 `\[
 \newcommand{\R}{\mathbb{R}}
-\newcommand{\E}{\mathbb{E}}
+\newcommand{\E}{\mathbb{E}}
 \newcommand{\x}{\mathbf{x}}
 \newcommand{\y}{\mathbf{y}}
 \newcommand{\wv}{\mathbf{w}}
 \newcommand{\av}{\mathbf{\alpha}}
 \newcommand{\bv}{\mathbf{b}}
 \newcommand{\N}{\mathbb{N}}
 \newcommand{\id}{\mathbf{I}}
-\newcommand{\ind}{\mathbf{1}}
-\newcommand{\0}{\mathbf{0}}
-\newcommand{\unit}{\mathbf{e}}
-\newcommand{\one}{\mathbf{1}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
 \newcommand{\zero}{\mathbf{0}}
 \]`
 
 ## Mathematical formulation
 
 Many standard *machine learning* methods can be formulated as a convex optimization problem, i.e.
 the task of finding a minimizer of a convex function `$f$` that depends on a variable vector
-`$\wv$` (called `weights` in the code), which has `$d$` entries.
+`$\wv$` (called `weights` in the code), which has `$d$` entries.
 Formally, we can write this as the optimization problem `$\min_{\wv \in\R^d} \; f(\wv)$`, where
 the objective function is of the form
 `\begin{equation}
@@ -39,7 +39,7 @@ the objective function is of the form
 \ .
 \end{equation}`
 Here the vectors `$\x_i\in\R^d$` are the training data examples, for `$1\le i\le n$`, and
-`$y_i\in\R$` are their corresponding labels, which we want to predict.
+`$y_i\in\R$` are their corresponding labels, which we want to predict.
 We call the method *linear* if $L(\wv; \x, y)$ can be expressed as a function of $\wv^T x$ and $y$.
 Several of MLlib's classification and regression algorithms fall into this category,
 and are discussed here.
@@ -99,6 +99,9 @@ regularizers in MLlib:
 <tr>
 <td>L1</td><td>$\|\wv\|_1$</td><td>$\mathrm{sign}(\wv)$</td>
 </tr>
+<tr>
+<td>elastic net</td><td>$\alpha \|\wv\|_1 + (1-\alpha)\frac{1}{2}\|\wv\|_2^2$</td><td>$\alpha \mathrm{sign}(\wv) + (1-\alpha) \wv$</td>
+</tr>
 </tbody>
 </table>
 
@@ -107,7 +110,7 @@ of `$\wv$`.
 
 L2-regularized problems are generally easier to solve than L1-regularized due to smoothness.
 However, L1 regularization can help promote sparsity in weights, leading to smaller and more interpretable models, the latter of which can be useful for feature selection.
-It is not recommended to train models without any regularization,
+[Elastic net](http://en.wikipedia.org/wiki/Elastic_net_regularization) is a combination of L1 and L2 regularization. It is not recommended to train models without any regularization,
 especially when the number of training examples is small.
 
 ### Optimization
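To make the elastic net row added to the table above concrete, here is a minimal editorial NumPy sketch of one (sub)gradient step; `lam` (the table's regularization parameter $\lambda$, i.e. `regParam`) and the step size `eta` are hypothetical values.

{% highlight python %}
import numpy as np

def elastic_net_subgradient(w, alpha):
    # The table's gradient/sub-gradient column: alpha * sign(w) + (1 - alpha) * w
    return alpha * np.sign(w) + (1 - alpha) * w

def subgradient_step(w, loss_grad, alpha, lam, eta):
    # One step on: data loss + lam * elastic-net penalty
    return w - eta * (loss_grad + lam * elastic_net_subgradient(w, alpha))

w = np.array([0.5, -1.0, 2.0])
loss_grad = np.array([0.1, -0.2, 0.3])  # placeholder gradient of the data loss
print(subgradient_step(w, loss_grad, alpha=0.8, lam=0.1, eta=0.5))
{% endhighlight %}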
@@ -531,16 +534,16 @@ sameModel = LogisticRegressionModel.load(sc, "myModelPath")
 ### Linear least squares, Lasso, and ridge regression
 
 
-Linear least squares is the most common formulation for regression problems.
+Linear least squares is the most common formulation for regression problems.
 It is a linear method as described above in equation `$\eqref{eq:regPrimal}$`, with the loss
 function in the formulation given by the squared loss:
 `\[
 L(\wv;\x,y) := \frac{1}{2} (\wv^T \x - y)^2.
 \]`
 
 Various related regression methods are derived by using different types of regularization:
-[*ordinary least squares*](http://en.wikipedia.org/wiki/Ordinary_least_squares) or
-[*linear least squares*](http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)) uses
+[*ordinary least squares*](http://en.wikipedia.org/wiki/Ordinary_least_squares) or
+[*linear least squares*](http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)) uses
 no regularization; [*ridge regression*](http://en.wikipedia.org/wiki/Ridge_regression) uses L2
 regularization; and [*Lasso*](http://en.wikipedia.org/wiki/Lasso_(statistics)) uses L1
 regularization. For all of these models, the average loss or training error, $\frac{1}{n} \sum_{i=1}^n (\wv^T x_i - y_i)^2$, is
@@ -552,7 +555,7 @@ known as the [mean squared error](http://en.wikipedia.org/wiki/Mean_squared_error).
 
 <div data-lang="scala" markdown="1">
 The following example demonstrates how to load training data and parse it as an RDD of LabeledPoint.
-The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
+The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
 values. We compute the mean squared error at the end to evaluate
 [goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
 
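For concreteness (an editorial sketch, not part of the commit), the squared loss and the mean squared training error described above, in NumPy:

{% highlight python %}
import numpy as np

def squared_loss(w, x, y):
    # L(w; x, y) = 0.5 * (w^T x - y)^2
    return 0.5 * (np.dot(w, x) - y) ** 2

X = np.array([[1.0, 2.0], [3.0, 4.0]])  # rows are examples x_i
y = np.array([1.0, 2.0])
w = np.array([0.5, 0.0])

mse = np.mean((X.dot(w) - y) ** 2)  # the average training error (MSE)
print(mse)  # 0.25 on this toy data
{% endhighlight %}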
@@ -614,7 +617,7 @@ public class LinearRegression {
   public static void main(String[] args) {
     SparkConf conf = new SparkConf().setAppName("Linear Regression Example");
     JavaSparkContext sc = new JavaSparkContext(conf);
-
+
     // Load and parse the data
     String path = "data/mllib/ridge-data/lpsa.data";
     JavaRDD<String> data = sc.textFile(path);
@@ -634,7 +637,7 @@ public class LinearRegression {
 
     // Building the model
     int numIterations = 100;
-    final LinearRegressionModel model =
+    final LinearRegressionModel model =
       LinearRegressionWithSGD.train(JavaRDD.toRDD(parsedData), numIterations);
 
     // Evaluate model on training examples and compute training error
@@ -665,7 +668,7 @@ public class LinearRegression {
 
 <div data-lang="python" markdown="1">
 The following example demonstrates how to load training data and parse it as an RDD of LabeledPoint.
-The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
+The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
 values. We compute the mean squared error at the end to evaluate
 [goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
 
@@ -706,8 +709,8 @@ a dependency.
 
 ### Streaming linear regression
 
-When data arrive in a streaming fashion, it is useful to fit regression models online,
-updating the parameters of the model as new data arrives. MLlib currently supports
+When data arrive in a streaming fashion, it is useful to fit regression models online,
+updating the parameters of the model as new data arrives. MLlib currently supports
 streaming linear regression using ordinary least squares. The fitting is similar
 to that performed offline, except fitting occurs on each batch of data, so that
 the model continually updates to reflect the data from the stream.
@@ -722,7 +725,7 @@ online to the first stream, and make predictions on the second stream.
 
 <div data-lang="scala" markdown="1">
 
-First, we import the necessary classes for parsing our input data and creating the model.
+First, we import the necessary classes for parsing our input data and creating the model.
 
 {% highlight scala %}
 
@@ -734,7 +737,7 @@ import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
 
 Then we make input streams for training and testing data. We assume a StreamingContext `ssc`
 has already been created; see the [Spark Streaming Programming Guide](streaming-programming-guide.html#initializing)
-for more info. For this example, we use labeled points in training and testing streams,
+for more info. For this example, we use labeled points in training and testing streams,
 but in practice you will likely want to use unlabeled vectors for test data.
 
 {% highlight scala %}
@@ -754,7 +757,7 @@ val model = new StreamingLinearRegressionWithSGD()
 
 {% endhighlight %}
 
-Now we register the streams for training and testing and start the job.
+Now we register the streams for training and testing and start the job.
 Printing predictions alongside true labels lets us easily see the result.
 
 {% highlight scala %}
@@ -764,14 +767,14 @@ model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
 
 ssc.start()
 ssc.awaitTermination()
-
+
 {% endhighlight %}
 
 We can now save text files with data to the training or testing folders.
-Each line should be a data point formatted as `(y,[x1,x2,x3])` where `y` is the label
-and `x1,x2,x3` are the features. Anytime a text file is placed in `/training/data/dir`
-the model will update. Anytime a text file is placed in `/testing/data/dir` you will see predictions.
-As you feed more data to the training directory, the predictions
+Each line should be a data point formatted as `(y,[x1,x2,x3])` where `y` is the label
+and `x1,x2,x3` are the features. Anytime a text file is placed in `/training/data/dir`
+the model will update. Anytime a text file is placed in `/testing/data/dir` you will see predictions.
+As you feed more data to the training directory, the predictions
 will get better!
 
 </div>
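An editorial aside on the `(y,[x1,x2,x3])` format described above: MLlib parses it with `LabeledPoint.parse`; the hand-rolled parser below is only a sketch of the format, not the library's implementation.

{% highlight python %}
def parse_point(line):
    # "(1.0,[0.5,2.0,3.0])" -> (1.0, [0.5, 2.0, 3.0])
    label, features = line.strip().strip("()").split(",", 1)
    values = [float(v) for v in features.strip("[]").split(",")]
    return float(label), values

print(parse_point("(1.0,[0.5,2.0,3.0])"))
{% endhighlight %}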
