docs/ml-guide.md (31 additions, 0 deletions)

@@ -3,6 +3,24 @@ layout: global
title: Spark ML Programming Guide
---

`\[
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\av}{\mathbf{\alpha}}
\newcommand{\bv}{\mathbf{b}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\id}{\mathbf{I}}
\newcommand{\ind}{\mathbf{1}}
\newcommand{\0}{\mathbf{0}}
\newcommand{\unit}{\mathbf{e}}
\newcommand{\one}{\mathbf{1}}
\newcommand{\zero}{\mathbf{0}}
\]`

Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
high-level APIs that help users create and tune practical machine learning pipelines.

@@ -154,6 +172,19 @@ Parameters belong to specific instances of `Estimator`s and `Transformer`s.
For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
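
The same idea carries over to the Python API, where a param map is a plain dict keyed by `Param` objects. A minimal sketch (hypothetical instance names, mirroring the Scala snippet above):

{% highlight python %}
from pyspark.ml.classification import LogisticRegression

lr1 = LogisticRegression()
lr2 = LogisticRegression()

# Each Param is bound to the instance that owns it, so the two maxIter
# settings below do not collide even though they share a name.
paramMap = {lr1.maxIter: 10, lr2.maxIter: 20}
{% endhighlight %}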
# Algorithm Guides

There are now several algorithms in the Pipelines API which are not in the lower-level MLlib API, so we link to their documentation here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in Pipelines.

**Pipelines API Algorithm Guides**

* [Feature Extraction, Transformation, and Selection](ml-features.html)
* [Ensembles](ml-ensembles.html)

**Algorithms in `spark.ml`**

* [Linear methods with elastic net regularization](ml-linear-methods.html)

# Code Examples
This section gives code examples illustrating the functionality discussed above.

docs/ml-linear-methods.md

displayTitle: <a href="ml-guide.html">ML</a> - Linear Methods
---

`\[
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\av}{\mathbf{\alpha}}
\newcommand{\bv}{\mathbf{b}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\id}{\mathbf{I}}
\newcommand{\ind}{\mathbf{1}}
\newcommand{\0}{\mathbf{0}}
\newcommand{\unit}{\mathbf{e}}
\newcommand{\one}{\mathbf{1}}
\newcommand{\zero}{\mathbf{0}}
\]`

In MLlib, we implement popular linear methods such as logistic regression and linear least squares with L1 or L2 regularization. Refer to [the linear methods in MLlib](mllib-linear-methods.html) for details. In `spark.ml`, we also include a Pipelines API for [Elastic net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid of L1 and L2 regularization proposed in [this paper](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf). Mathematically, it is defined as a linear combination of the L1-norm and the L2-norm:

`\[
\alpha \left( \lambda \|\wv\|_1 \right) + (1 - \alpha) \left( \frac{\lambda}{2} \|\wv\|_2^2 \right), \quad \alpha \in [0, 1], \lambda \geq 0
\]`
By setting $\alpha$ properly, elastic net contains both L1 and L2 regularization as special cases. For example, if a [linear regression](https://en.wikipedia.org/wiki/Linear_regression) model is trained with the elastic net parameter $\alpha$ set to $1$, it is equivalent to a [Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model. On the other hand, if $\alpha$ is set to $0$, the trained model reduces to a [ridge regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model. We implement the Pipelines API for both linear regression and logistic regression with elastic net regularization.
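
To make those two special cases concrete, the hypothetical snippet below (not part of this patch) configures `LinearRegression` estimators at the two extremes of `elasticNetParam`:

{% highlight python %}
from pyspark.ml.regression import LinearRegression

# elasticNetParam is the mixing parameter alpha:
#   alpha = 1.0 -> pure L1 penalty (Lasso)
#   alpha = 0.0 -> pure L2 penalty (ridge regression)
lasso = LinearRegression(regParam=0.1, elasticNetParam=1.0)
ridge = LinearRegression(regParam=0.1, elasticNetParam=0.0)
{% endhighlight %}

The patch's full Python example follows.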
{% highlight python %}
from pyspark.ml.classification import LogisticRegression
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

# Load training data as a DataFrame of LabeledPoint rows
training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print the weights and intercept for logistic regression
print("Weights: " + str(lrModel.weights))
print("Intercept: " + str(lrModel.intercept))
{% endhighlight %}

</div>

</div>

### Optimization
The optimization algorithm underlying the implementation is called [Orthant-Wise Limited-memory Quasi-Newton](http://research-srv.microsoft.com/en-us/um/people/jfgao/paper/icml07scalable.pdf)
(OWL-QN). It is an extension of L-BFGS that can effectively handle L1 regularization and elastic net.
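
Concretely, the objective being minimized has the following standard elastic-net form, written with the macros defined above; this is a generic formulation, with $L$ denoting the per-example loss (e.g. logistic or squared loss), not a quotation from the implementation:

`\[
\min_{\wv \in \R^d} \frac{1}{n} \sum_{i=1}^n L(\wv; \x_i, y_i) + \lambda \left( \alpha \|\wv\|_1 + \frac{1 - \alpha}{2} \|\wv\|_2^2 \right)
\]`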
L2-regularized problems are generally easier to solve than L1-regularized ones due to smoothness.
However, L1 regularization can help promote sparsity in the weights, leading to smaller and more interpretable models; the latter can be useful for feature selection.
[Elastic net](http://en.wikipedia.org/wiki/Elastic_net_regularization) is a combination of L1 and L2 regularization. It is not recommended to train models without any regularization,
especially when the number of training examples is small.