[SPARK-4711] [mllib] [docs] Programming guide advice on choosing optimizer #3569

10 changes: 7 additions & 3 deletions docs/mllib-linear-methods.md
@@ -110,20 +110,24 @@ However, L1 regularization can help promote sparsity in weights leading to small
It is not recommended to train models without any regularization,
especially when the number of training examples is small.

### Optimization

Under the hood, linear methods use convex optimization to minimize their objective functions. MLlib uses two optimization methods, SGD and L-BFGS, described in the [optimization section](mllib-optimization.html). Currently, most algorithm APIs support Stochastic Gradient Descent (SGD), and a few support L-BFGS. Refer to [this optimization section](mllib-optimization.html#Choosing-an-Optimization-Method) for guidelines on choosing between the two.
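
For illustration only, here is a minimal Scala sketch (assuming a live `SparkContext` named `sc` and a LIBSVM-format data file; the path below is a placeholder) showing that the same model family can be trained with either optimizer:

{% highlight scala %}
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format (placeholder path).
val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").cache()

// SGD-backed trainer: most classification/regression APIs provide a *WithSGD variant.
val sgdModel = LogisticRegressionWithSGD.train(training, 100)  // 100 iterations

// L-BFGS-backed trainer: available for a few algorithms, e.g. logistic regression.
val lbfgsModel = new LogisticRegressionWithLBFGS().run(training)
{% endhighlight %}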

## Binary classification

[Binary classification](http://en.wikipedia.org/wiki/Binary_classification)
aims to divide items into two categories: positive and negative. MLlib
supports two linear methods for binary classification: linear Support Vector
Machines (SVMs) and logistic regression. For both methods, MLlib supports
L1 and L2 regularized variants. The training data set is represented by an RDD
of [LabeledPoint](mllib-data-types.html) in MLlib. Note that, in the
mathematical formulation in this guide, a training label $y$ is denoted as
either $+1$ (positive) or $-1$ (negative), which is convenient for the
formulation. *However*, the negative label is represented by $0$ in MLlib
instead of $-1$, to be consistent with multiclass labeling.
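
As a concrete illustration of this labeling convention (the feature values below are made up), a positive and a negative training point would be constructed as:

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Positive example: label 1.0 (written as +1 in the mathematical formulation).
val pos = LabeledPoint(1.0, Vectors.dense(1.2, 0.0, 3.4))

// Negative example: label 0.0 in MLlib, even though the formulation uses -1.
val neg = LabeledPoint(0.0, Vectors.dense(0.1, 2.3, 0.0))
{% endhighlight %}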

### Linear Support Vector Machines (SVMs)

The [linear SVM](http://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM)
is a standard method for large-scale classification tasks. It is a linear method as described above in equation `$\eqref{eq:regPrimal}$`, with the loss function in the formulation given by the hinge loss:
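
For reference (the displayed formula is collapsed in this diff view), the hinge loss in standard notation, with weight vector $\mathbf{w}$ and training example $(\mathbf{x}, y)$, is

`\[
L(\mathbf{w}; \mathbf{x}, y) := \max\{0,\, 1 - y\, \mathbf{w}^T \mathbf{x}\}.
\]`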
17 changes: 11 additions & 6 deletions docs/mllib-optimization.md
@@ -138,6 +138,12 @@ vertical scalability issue (the number of training features) when computing the
explicitly in Newton's method. As a result, L-BFGS often achieves more rapid convergence than
other first-order optimization methods.
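
In sketch form (standard quasi-Newton notation, not MLlib-specific), each L-BFGS step is

`\[
w_{k+1} = w_k - \alpha_k H_k \nabla f(w_k),
\]`

where the inverse-Hessian approximation $H_k$ is built from only the last $m$ iterate and gradient differences $s_i = w_{i+1} - w_i$ and $y_i = \nabla f(w_{i+1}) - \nabla f(w_i)$, so no full Hessian matrix is ever formed or stored.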

### Choosing an Optimization Method

[Linear methods](mllib-linear-methods.html) use optimization internally, and some linear methods in MLlib support both SGD and L-BFGS.
Different optimization methods can have different convergence guarantees depending on the properties of the objective function; a full treatment is beyond the scope of this guide.
In general, when L-BFGS is available, we recommend using it instead of SGD, since L-BFGS tends to converge faster (in fewer iterations).
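
As a sketch under the assumption that the algorithm exposes its optimizer (as `LogisticRegressionWithLBFGS` does), selecting and tuning the L-BFGS-backed trainer might look like this; the parameter values are illustrative only:

{% highlight scala %}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainWithLBFGS(training: RDD[LabeledPoint]) = {
  val lr = new LogisticRegressionWithLBFGS()
  // Tune the underlying L-BFGS optimizer (illustrative values).
  lr.optimizer
    .setNumCorrections(10)    // number of corrections used to approximate the Hessian
    .setConvergenceTol(1e-4)  // stop once the relative improvement is small enough
    .setNumIterations(100)
    .setRegParam(0.01)
  lr.run(training)
}
{% endhighlight %}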

## Implementation in MLlib

### Gradient descent and stochastic gradient descent
@@ -168,10 +174,7 @@ descent. All updaters in MLlib use a step size at the t-th step equal to
* `regParam` is the regularization parameter when using L1 or L2 regularization.
* `miniBatchFraction` is the fraction of the total data that is sampled in
each iteration to compute the gradient direction.

* Sampling still requires a pass over the entire RDD, so decreasing `miniBatchFraction` may not speed up optimization much. Users will see the greatest speedup when the gradient is expensive to compute, since only the chosen samples are used to compute the gradient.
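
A sketch of setting the parameters listed above through an SGD-backed learner (here `SVMWithSGD`, whose `optimizer` member is a `GradientDescent` instance); the values are illustrative only:

{% highlight scala %}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainWithSGD(training: RDD[LabeledPoint]) = {
  val svm = new SVMWithSGD()
  // Configure the underlying GradientDescent optimizer (illustrative values).
  svm.optimizer
    .setStepSize(1.0)
    .setNumIterations(200)
    .setRegParam(0.1)
    .setMiniBatchFraction(1.0)  // 1.0 means every example is used in each iteration
  svm.run(training)
}
{% endhighlight %}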

### L-BFGS
L-BFGS is currently only a low-level optimization primitive in `MLlib`. If you want to use L-BFGS in various
@@ -359,13 +362,15 @@ public class LBFGSExample {
{% endhighlight %}
</div>
</div>
#### Developer's note

**Member Author:** I think this caused a .md generation problem in the old docs.

## Developer's notes

Since the Hessian is constructed approximately from previous gradient evaluations,
the objective function cannot be changed during the optimization process.
As a result, stochastic L-BFGS will not work naively by simply using mini-batches;
therefore, we do not provide this until it is better understood.

`Updater` is a class originally designed for gradient descent which computes
the actual gradient descent step. However, we are able to obtain the gradient and
loss of the regularized objective function for L-BFGS by ignoring the logic that applies
only to gradient descent, such as the adaptive step size. We will refactor