Commit 98e8af7

Merge pull request #65 from markhamstra/csd-1.4

SKIPME merging Apache branch-1.4 bug fixes

2 parents 9e54b8a + 0715408; commit 98e8af7

File tree: 8 files changed, +248 −34 lines changed


core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala

Lines changed: 11 additions & 1 deletion

@@ -20,6 +20,7 @@ package org.apache.spark
 import java.util.concurrent.TimeUnit
 
 import scala.collection.mutable
+import scala.util.control.ControlThrowable
 
 import com.codahale.metrics.{Gauge, MetricRegistry}
 
@@ -204,7 +205,16 @@ private[spark] class ExecutorAllocationManager(
     listenerBus.addListener(listener)
 
     val scheduleTask = new Runnable() {
-      override def run(): Unit = Utils.logUncaughtExceptions(schedule())
+      override def run(): Unit = {
+        try {
+          schedule()
+        } catch {
+          case ct: ControlThrowable =>
+            throw ct
+          case t: Throwable =>
+            logWarning(s"Uncaught exception in thread ${Thread.currentThread().getName}", t)
+        }
+      }
     }
     executor.scheduleAtFixedRate(scheduleTask, 0, intervalMillis, TimeUnit.MILLISECONDS)
   }
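Why the explicit `ControlThrowable` case: Scala implements non-local control flow (e.g. `scala.util.control.Breaks`) by throwing subclasses of `ControlThrowable`, so a blanket `case t: Throwable` would silently swallow a `break` and disable the construct that threw it. A minimal self-contained sketch of the same catch pattern — the `runLogged` helper and the demo loop are illustrative, not part of the commit:

import scala.util.control.{Breaks, ControlThrowable}

object SafeCatchDemo {
  // Run `body`, logging ordinary failures but always re-throwing
  // ControlThrowable so Scala control-flow constructs keep working.
  def runLogged(body: => Unit): Unit = {
    try body catch {
      case ct: ControlThrowable => throw ct // control flow, not an error
      case t: Throwable =>
        println(s"Uncaught exception in thread ${Thread.currentThread().getName}: $t")
    }
  }

  def main(args: Array[String]): Unit = {
    val loop = new Breaks
    loop.breakable {
      for (i <- 1 to 5) {
        // break() throws a ControlThrowable; because runLogged re-throws it,
        // breakable still sees it and the loop terminates at i == 3.
        runLogged { if (i == 3) loop.break() }
        println(i)
      }
    }
  }
}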

core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala

Lines changed: 3 additions & 1 deletion

@@ -684,7 +684,9 @@ private[ui] class StagePage(parent: StagesTab) extends WebUIPage("stage") {
     val gettingResultTime = getGettingResultTime(info)
 
     val maybeAccumulators = info.accumulables
-    val accumulatorsReadable = maybeAccumulators.map{acc => s"${acc.name}: ${acc.update.get}"}
+    val accumulatorsReadable = maybeAccumulators.map { acc =>
+      StringEscapeUtils.escapeHtml4(s"${acc.name}: ${acc.update.get}")
+    }
 
     val maybeInput = metrics.flatMap(_.inputMetrics)
     val inputSortable = maybeInput.map(_.bytesRead.toString).getOrElse("")
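`StringEscapeUtils.escapeHtml4` comes from Apache Commons Lang 3, which is already on Spark's classpath. A small illustrative sketch of what the escaping prevents — the accumulator name below is a hypothetical attacker-controlled string, not from the commit:

import org.apache.commons.lang3.StringEscapeUtils

object EscapeDemo {
  def main(args: Array[String]): Unit = {
    // Accumulator names are user-supplied, so they may contain markup.
    val name = "<script>alert('xss')</script>"
    val update = "42"
    // Unescaped, this string would be interpolated into the stage page HTML
    // and interpreted by the browser; escaped, it renders as inert text.
    println(StringEscapeUtils.escapeHtml4(s"$name: $update"))
    // prints: &lt;script&gt;alert('xss')&lt;/script&gt;: 42
  }
}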

docs/configuration.md

Lines changed: 1 addition & 1 deletion

@@ -665,7 +665,7 @@ Apart from these, the following properties are also available, and may be useful
 <td>
   Initial size of Kryo's serialization buffer. Note that there will be one buffer
   <i>per core</i> on each worker. This buffer will grow up to
-  <code>spark.kryoserializer.buffer.max.mb</code> if needed.
+  <code>spark.kryoserializer.buffer.max</code> if needed.
 </td>
 </tr>
 <tr>
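This doc fix tracks the Spark 1.4 rename of size-valued Kryo settings: the old `spark.kryoserializer.buffer.max.mb` (a bare megabyte count) was deprecated in favor of `spark.kryoserializer.buffer.max` (a size string with units). A minimal configuration sketch, with illustrative values rather than defaults:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("KryoBufferExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "64k")     // initial per-core buffer
  .set("spark.kryoserializer.buffer.max", "64m") // limit the buffer may grow to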

docs/ml-guide.md

Lines changed: 31 additions & 0 deletions

@@ -3,6 +3,24 @@ layout: global
 title: Spark ML Programming Guide
 ---
 
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+
 Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
 high-level APIs that help users create and tune practical machine learning pipelines.
 
@@ -154,6 +172,19 @@ Parameters belong to specific instances of `Estimator`s and `Transformer`s.
 For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
 This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
 
+# Algorithm Guides
+
+There are now several algorithms in the Pipelines API which are not in the lower-level MLlib API, so we link to documentation for them here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in Pipelines.
+
+**Pipelines API Algorithm Guides**
+
+* [Feature Extraction, Transformation, and Selection](ml-features.html)
+* [Ensembles](ml-ensembles.html)
+
+**Algorithms in `spark.ml`**
+
+* [Linear methods with elastic net regularization](ml-linear-methods.html)
+
 # Code Examples
 
 This section gives code examples illustrating the functionality discussed above.

docs/ml-linear-methods.md

Lines changed: 129 additions & 0 deletions

@@ -0,0 +1,129 @@
+---
+layout: global
+title: Linear Methods - ML
+displayTitle: <a href="ml-guide.html">ML</a> - Linear Methods
+---
+
+
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+
+In MLlib, we implement popular linear methods such as logistic regression and linear least squares with L1 or L2 regularization. Refer to [the linear methods in mllib](mllib-linear-methods.html) for details. In `spark.ml`, we also include the Pipelines API for [elastic net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid of L1 and L2 regularization proposed in [this paper](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf). Mathematically, it is defined as a linear combination of the L1-norm and the L2-norm:
+`\[
+\alpha \|\wv\|_1 + (1-\alpha) \frac{1}{2}\|\wv\|_2^2, \quad \alpha \in [0, 1].
+\]`
+By setting $\alpha$ properly, elastic net contains both L1 and L2 regularization as special cases. For example, if a [linear regression](https://en.wikipedia.org/wiki/Linear_regression) model is trained with the elastic net parameter $\alpha$ set to $1$, it is equivalent to a [Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model. On the other hand, if $\alpha$ is set to $0$, the trained model reduces to a [ridge regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model. We implement the Pipelines API for both linear regression and logistic regression with elastic net regularization.
+
+**Examples**
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+{% highlight scala %}
+
+import org.apache.spark.ml.classification.LogisticRegression
+import org.apache.spark.mllib.util.MLUtils
+
+// Load training data
+val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
+
+val lr = new LogisticRegression()
+  .setMaxIter(10)
+  .setRegParam(0.3)
+  .setElasticNetParam(0.8)
+
+// Fit the model
+val lrModel = lr.fit(training)
+
+// Print the weights and intercept for logistic regression
+println(s"Weights: ${lrModel.weights} Intercept: ${lrModel.intercept}")
+
+{% endhighlight %}
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+{% highlight java %}
+
+import org.apache.spark.ml.classification.LogisticRegression;
+import org.apache.spark.ml.classification.LogisticRegressionModel;
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.util.MLUtils;
+import org.apache.spark.SparkConf;
+import org.apache.spark.SparkContext;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.SQLContext;
+
+public class LogisticRegressionWithElasticNetExample {
+  public static void main(String[] args) {
+    SparkConf conf = new SparkConf()
+      .setAppName("Logistic Regression with Elastic Net Example");
+
+    SparkContext sc = new SparkContext(conf);
+    SQLContext sql = new SQLContext(sc);
+    String path = "data/mllib/sample_libsvm_data.txt";
+
+    // Load training data
+    DataFrame training = sql.createDataFrame(MLUtils.loadLibSVMFile(sc, path).toJavaRDD(), LabeledPoint.class);
+
+    LogisticRegression lr = new LogisticRegression()
+      .setMaxIter(10)
+      .setRegParam(0.3)
+      .setElasticNetParam(0.8);
+
+    // Fit the model
+    LogisticRegressionModel lrModel = lr.fit(training);
+
+    // Print the weights and intercept for logistic regression
+    System.out.println("Weights: " + lrModel.weights() + " Intercept: " + lrModel.intercept());
+  }
+}
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+
+from pyspark.ml.classification import LogisticRegression
+from pyspark.mllib.regression import LabeledPoint
+from pyspark.mllib.util import MLUtils
+
+# Load training data
+training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
+
+lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
+
+# Fit the model
+lrModel = lr.fit(training)
+
+# Print the weights and intercept for logistic regression
+print("Weights: " + str(lrModel.weights))
+print("Intercept: " + str(lrModel.intercept))
+{% endhighlight %}
+
+</div>
+
+</div>
+
+### Optimization
+
+The optimization algorithm underlying the implementation is called [Orthant-Wise Limited-memory Quasi-Newton](http://research-srv.microsoft.com/en-us/um/people/jfgao/paper/icml07scalable.pdf)
+(OWL-QN). It is an extension of L-BFGS that can effectively handle L1 regularization and elastic net.
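The new page shows only the logistic regression variant, though its text says linear regression with elastic net is implemented too. A hypothetical companion sketch for the linear regression case — same data file, illustrative parameter values, with `sc` assumed to be an existing SparkContext as in the page's other examples:

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.util.MLUtils

// Load the same LibSVM data used in the logistic regression example.
val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()

// elasticNetParam = 0.8 weights the penalty toward L1 (sparser weights);
// regParam scales the overall regularization strength.
val lir = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lirModel = lir.fit(training)
println(s"Weights: ${lirModel.weights} Intercept: ${lirModel.intercept}")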

docs/mllib-linear-methods.md

Lines changed: 28 additions & 25 deletions (most changes strip trailing whitespace, so several removed/added pairs below differ only in an invisible trailing space)

@@ -10,26 +10,26 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Linear Methods
 
 `\[
 \newcommand{\R}{\mathbb{R}}
-\newcommand{\E}{\mathbb{E}} 
+\newcommand{\E}{\mathbb{E}}
 \newcommand{\x}{\mathbf{x}}
 \newcommand{\y}{\mathbf{y}}
 \newcommand{\wv}{\mathbf{w}}
 \newcommand{\av}{\mathbf{\alpha}}
 \newcommand{\bv}{\mathbf{b}}
 \newcommand{\N}{\mathbb{N}}
 \newcommand{\id}{\mathbf{I}}
-\newcommand{\ind}{\mathbf{1}} 
-\newcommand{\0}{\mathbf{0}} 
-\newcommand{\unit}{\mathbf{e}} 
-\newcommand{\one}{\mathbf{1}} 
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
 \newcommand{\zero}{\mathbf{0}}
 \]`
 
 ## Mathematical formulation
 
 Many standard *machine learning* methods can be formulated as a convex optimization problem, i.e.
 the task of finding a minimizer of a convex function `$f$` that depends on a variable vector
-`$\wv$` (called `weights` in the code), which has `$d$` entries. 
+`$\wv$` (called `weights` in the code), which has `$d$` entries.
 Formally, we can write this as the optimization problem `$\min_{\wv \in\R^d} \; f(\wv)$`, where
 the objective function is of the form
 `\begin{equation}
@@ -39,7 +39,7 @@ the objective function is of the form
 \ .
 \end{equation}`
 Here the vectors `$\x_i\in\R^d$` are the training data examples, for `$1\le i\le n$`, and
-`$y_i\in\R$` are their corresponding labels, which we want to predict. 
+`$y_i\in\R$` are their corresponding labels, which we want to predict.
 We call the method *linear* if $L(\wv; \x, y)$ can be expressed as a function of $\wv^T x$ and $y$.
 Several of MLlib's classification and regression algorithms fall into this category,
 and are discussed here.
@@ -99,6 +99,9 @@ regularizers in MLlib:
 <tr>
   <td>L1</td><td>$\|\wv\|_1$</td><td>$\mathrm{sign}(\wv)$</td>
 </tr>
+<tr>
+  <td>elastic net</td><td>$\alpha \|\wv\|_1 + (1-\alpha)\frac{1}{2}\|\wv\|_2^2$</td><td>$\alpha \mathrm{sign}(\wv) + (1-\alpha) \wv$</td>
+</tr>
 </tbody>
 </table>
 
@@ -107,7 +110,7 @@ of `$\wv$`.
 
 L2-regularized problems are generally easier to solve than L1-regularized due to smoothness.
 However, L1 regularization can help promote sparsity in weights leading to smaller and more interpretable models, the latter of which can be useful for feature selection.
-It is not recommended to train models without any regularization,
+[Elastic net](http://en.wikipedia.org/wiki/Elastic_net_regularization) is a combination of L1 and L2 regularization. It is not recommended to train models without any regularization,
 especially when the number of training examples is small.
 
 ### Optimization
@@ -527,16 +530,16 @@ print("Training Error = " + str(trainErr))
 ### Linear least squares, Lasso, and ridge regression
 
 
-Linear least squares is the most common formulation for regression problems. 
+Linear least squares is the most common formulation for regression problems.
 It is a linear method as described above in equation `$\eqref{eq:regPrimal}$`, with the loss
 function in the formulation given by the squared loss:
 `\[
 L(\wv;\x,y) :=  \frac{1}{2} (\wv^T \x - y)^2.
 \]`
 
 Various related regression methods are derived by using different types of regularization:
-[*ordinary least squares*](http://en.wikipedia.org/wiki/Ordinary_least_squares) or 
-[*linear least squares*](http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)) uses 
+[*ordinary least squares*](http://en.wikipedia.org/wiki/Ordinary_least_squares) or
+[*linear least squares*](http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)) uses
 no regularization; [*ridge regression*](http://en.wikipedia.org/wiki/Ridge_regression) uses L2
 regularization; and [*Lasso*](http://en.wikipedia.org/wiki/Lasso_(statistics)) uses L1
 regularization. For all of these models, the average loss or training error, $\frac{1}{n} \sum_{i=1}^n (\wv^T x_i - y_i)^2$, is
@@ -548,7 +551,7 @@ known as the [mean squared error](http://en.wikipedia.org/wiki/Mean_squared_erro
 
 <div data-lang="scala" markdown="1">
 The following example demonstrate how to load training data, parse it as an RDD of LabeledPoint.
-The example then uses LinearRegressionWithSGD to build a simple linear model to predict label 
+The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
 values. We compute the mean squared error at the end to evaluate
 [goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
 
@@ -610,7 +613,7 @@ public class LinearRegression {
   public static void main(String[] args) {
     SparkConf conf = new SparkConf().setAppName("Linear Regression Example");
     JavaSparkContext sc = new JavaSparkContext(conf);
-    
+
     // Load and parse the data
     String path = "data/mllib/ridge-data/lpsa.data";
     JavaRDD<String> data = sc.textFile(path);
@@ -630,7 +633,7 @@ public class LinearRegression {
 
     // Building the model
     int numIterations = 100;
-    final LinearRegressionModel model = 
+    final LinearRegressionModel model =
       LinearRegressionWithSGD.train(JavaRDD.toRDD(parsedData), numIterations);
 
     // Evaluate model on training examples and compute training error
@@ -661,7 +664,7 @@ public class LinearRegression {
 
 <div data-lang="python" markdown="1">
 The following example demonstrate how to load training data, parse it as an RDD of LabeledPoint.
-The example then uses LinearRegressionWithSGD to build a simple linear model to predict label 
+The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
 values. We compute the mean squared error at the end to evaluate
 [goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
 
@@ -698,8 +701,8 @@ a dependency.
 
 ###Streaming linear regression
 
-When data arrive in a streaming fashion, it is useful to fit regression models online, 
-updating the parameters of the model as new data arrives. MLlib currently supports
+When data arrive in a streaming fashion, it is useful to fit regression models online,
+updating the parameters of the model as new data arrives. MLlib currently supports
 streaming linear regression using ordinary least squares. The fitting is similar
 to that performed offline, except fitting occurs on each batch of data, so that
 the model continually updates to reflect the data from the stream.
@@ -714,7 +717,7 @@ online to the first stream, and make predictions on the second stream.
 
 <div data-lang="scala" markdown="1">
 
-First, we import the necessary classes for parsing our input data and creating the model. 
+First, we import the necessary classes for parsing our input data and creating the model.
 
 {% highlight scala %}
 
@@ -726,7 +729,7 @@ import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
 
 Then we make input streams for training and testing data. We assume a StreamingContext `ssc`
 has already been created, see [Spark Streaming Programming Guide](streaming-programming-guide.html#initializing)
-for more info. For this example, we use labeled points in training and testing streams, 
+for more info. For this example, we use labeled points in training and testing streams,
 but in practice you will likely want to use unlabeled vectors for test data.
 
 {% highlight scala %}
@@ -746,7 +749,7 @@ val model = new StreamingLinearRegressionWithSGD()
 
 {% endhighlight %}
 
-Now we register the streams for training and testing and start the job. 
+Now we register the streams for training and testing and start the job.
 Printing predictions alongside true labels lets us easily see the result.
 
 {% highlight scala %}
@@ -756,14 +759,14 @@ model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
 
 ssc.start()
 ssc.awaitTermination()
-    
+
 {% endhighlight %}
 
 We can now save text files with data to the training or testing folders.
-Each line should be a data point formatted as `(y,[x1,x2,x3])` where `y` is the label 
-and `x1,x2,x3` are the features. Anytime a text file is placed in `/training/data/dir` 
-the model will update. Anytime a text file is placed in `/testing/data/dir` you will see predictions. 
-As you feed more data to the training directory, the predictions 
+Each line should be a data point formatted as `(y,[x1,x2,x3])` where `y` is the label
+and `x1,x2,x3` are the features. Anytime a text file is placed in `/training/data/dir`
+the model will update. Anytime a text file is placed in `/testing/data/dir` you will see predictions.
+As you feed more data to the training directory, the predictions
 will get better!
 
 </div>
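As a quick sanity check on the elastic net row added to the regularizer table above, differentiating the penalty term by term (taking $\mathrm{sign}(\wv)$ as a subgradient of the non-smooth L1 norm) reproduces the listed gradient:

\[
\frac{\partial}{\partial \wv}\left[ \alpha \|\wv\|_1 + (1-\alpha)\tfrac{1}{2}\|\wv\|_2^2 \right]
  = \alpha\,\mathrm{sign}(\wv) + (1-\alpha)\,\wv ,
\]

matching the `$\alpha \mathrm{sign}(\wv) + (1-\alpha) \wv$` entry in the table.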
