@@ -1,6 +1,7 @@
 ---
 layout: global
-title: <a href="mllib-guide.html">MLlib</a> - Basics
+title: Basics - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Basics
 ---
 
 * Table of contents
@@ -26,11 +27,11 @@ of the vector.
 <div data-lang="scala" markdown="1">
 
 The base class of local vectors is
-[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
-implementations: [`DenseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseVector) and
-[`SparseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
+[`Vector`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
+implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseVector) and
+[`SparseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
 using the factory methods implemented in
-[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) to create local vectors.
+[`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) to create local vectors.
 
 {% highlight scala %}
 import org.apache.spark.mllib.linalg.{Vector, Vectors}
@@ -53,11 +54,11 @@ Scala imports `scala.collection.immutable.Vector` by default, so you have to imp
 <div data-lang="java" markdown="1">
 
 The base class of local vectors is
-[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
-implementations: [`DenseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseVector) and
-[`SparseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
+[`Vector`](api/java/org/apache/spark/mllib/linalg/Vector.html), and we provide two
+implementations: [`DenseVector`](api/java/org/apache/spark/mllib/linalg/DenseVector.html) and
+[`SparseVector`](api/java/org/apache/spark/mllib/linalg/SparseVector.html). We recommend
 using the factory methods implemented in
-[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) to create local vectors.
+[`Vectors`](api/java/org/apache/spark/mllib/linalg/Vector.html) to create local vectors.
 
 {% highlight java %}
 import org.apache.spark.mllib.linalg.Vector;
@@ -78,13 +79,13 @@ MLlib recognizes the following types as dense vectors:
 
 and the following as sparse vectors:
 
-* MLlib's [`SparseVector`](api/pyspark/pyspark.mllib.linalg.SparseVector-class.html).
+* MLlib's [`SparseVector`](api/python/pyspark.mllib.linalg.SparseVector-class.html).
 * SciPy's
   [`csc_matrix`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
   with a single column
 
 We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented
-in [`Vectors`](api/pyspark/pyspark.mllib.linalg.Vectors-class.html) to create sparse vectors.
+in [`Vectors`](api/python/pyspark.mllib.linalg.Vectors-class.html) to create sparse vectors.
 
 {% highlight python %}
 import numpy as np
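The sparse representation the docs in this hunk describe — a size plus parallel index and value arrays — can be sketched without Spark. Below is a minimal pure-Python illustration of that `(size, indices, values)` triple; the helper names are hypothetical and not part of `pyspark.mllib`:

```python
def to_sparse(dense):
    """Convert a dense list to MLlib-style (size, indices, values)."""
    indices = [i for i, x in enumerate(dense) if x != 0.0]
    values = [dense[i] for i in indices]
    return (len(dense), indices, values)

def to_dense(size, indices, values):
    """Expand (size, indices, values) back into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The guide's running example vector (1.0, 0.0, 3.0):
print(to_sparse([1.0, 0.0, 3.0]))            # (3, [0, 2], [1.0, 3.0])
print(to_dense(3, [0, 2], [1.0, 3.0]))       # [1.0, 0.0, 3.0]
```

The two functions round-trip, which is the invariant `Vectors.sparse` and `SparseVector.toArray` maintain in MLlib itself.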
@@ -117,7 +118,7 @@ For multiclass classification, labels should be class indices staring from zero:
 <div data-lang="scala" markdown="1">
 
 A labeled point is represented by the case class
-[`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint).
+[`LabeledPoint`](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint).
 
 {% highlight scala %}
 import org.apache.spark.mllib.linalg.Vectors
@@ -134,7 +135,7 @@ val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
 <div data-lang="java" markdown="1">
 
 A labeled point is represented by
-[`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint).
+[`LabeledPoint`](api/java/org/apache/spark/mllib/regression/LabeledPoint.html).
 
 {% highlight java %}
 import org.apache.spark.mllib.linalg.Vectors;
@@ -151,7 +152,7 @@ LabeledPoint neg = new LabeledPoint(1.0, Vectors.sparse(3, new int[] {0, 2}, new
 <div data-lang="python" markdown="1">
 
 A labeled point is represented by
-[`LabeledPoint`](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html).
+[`LabeledPoint`](api/python/pyspark.mllib.regression.LabeledPoint-class.html).
 
 {% highlight python %}
 from pyspark.mllib.linalg import SparseVector
@@ -184,28 +185,40 @@ After loading, the feature indices are converted to zero-based.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-[`MLUtils.loadLibSVMFile`](api/mllib/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
+[`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
 examples stored in LIBSVM format.
 
 {% highlight scala %}
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.util.MLUtils
 import org.apache.spark.rdd.RDD
 
-val training: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
+val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
 {% endhighlight %}
 </div>
 
 <div data-lang="java" markdown="1">
-[`MLUtils.loadLibSVMFile`](api/mllib/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
+[`MLUtils.loadLibSVMFile`](api/java/org/apache/spark/mllib/util/MLUtils.html) reads training
 examples stored in LIBSVM format.
 
 {% highlight java %}
 import org.apache.spark.mllib.regression.LabeledPoint;
 import org.apache.spark.mllib.util.MLUtils;
-import org.apache.spark.rdd.RDDimport;
+import org.apache.spark.api.java.JavaRDD;
+
+JavaRDD<LabeledPoint> examples =
+  MLUtils.loadLibSVMFile(jsc.sc(), "mllib/data/sample_libsvm_data.txt").toJavaRDD();
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`MLUtils.loadLibSVMFile`](api/python/pyspark.mllib.util.MLUtils-class.html) reads training
+examples stored in LIBSVM format.
 
-RDD<LabeledPoint> training = MLUtils.loadLibSVMFile(jsc, "mllib/data/sample_libsvm_data.txt");
+{% highlight python %}
+from pyspark.mllib.util import MLUtils
+
+examples = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
 {% endhighlight %}
 </div>
 </div>
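The LIBSVM text format that `loadLibSVMFile` reads is simple enough to parse by hand, which makes the zero-based conversion mentioned in this hunk easy to see. A Spark-free sketch (a simplification for illustration, assuming well-formed input; not the actual MLUtils implementation):

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value index:value ...' LIBSVM record.

    LIBSVM feature indices are one-based; like MLUtils.loadLibSVMFile,
    we convert them to zero-based on load.
    """
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        idx, val = item.split(":")
        indices.append(int(idx) - 1)  # one-based -> zero-based
        values.append(float(val))
    return label, indices, values

print(parse_libsvm_line("1 1:0.5 3:2.0"))  # (1.0, [0, 2], [0.5, 2.0])
```

Each parsed record corresponds to one `LabeledPoint` with a sparse feature vector.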
@@ -227,10 +240,10 @@ We are going to add sparse matrix in the next release.
 <div data-lang="scala" markdown="1">
 
 The base class of local matrices is
-[`Matrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
-implementation: [`DenseMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
+[`Matrix`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
+implementation: [`DenseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
 Sparse matrix will be added in the next release. We recommend using the factory methods implemented
-in [`Matrices`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
+in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
 matrices.
 
 {% highlight scala %}
@@ -244,10 +257,10 @@ val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
 <div data-lang="java" markdown="1">
 
 The base class of local matrices is
-[`Matrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
-implementation: [`DenseMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
+[`Matrix`](api/java/org/apache/spark/mllib/linalg/Matrix.html), and we provide one
+implementation: [`DenseMatrix`](api/java/org/apache/spark/mllib/linalg/DenseMatrix.html).
 Sparse matrix will be added in the next release. We recommend using the factory methods implemented
-in [`Matrices`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
+in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to create local
 matrices.
 
 {% highlight java %}
@@ -269,6 +282,15 @@ and distributed matrices. Converting a distributed matrix to a different format
 global shuffle, which is quite expensive. We implemented three types of distributed matrices in
 this release and will add more types in the future.
 
+The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented distributed
+matrix without meaningful row indices, e.g., a collection of feature vectors.
+It is backed by an RDD of its rows, where each row is a local vector.
+We assume that the number of columns is not huge for a `RowMatrix`.
+An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
+which can be used for identifying rows and joins.
+A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix) format,
+backed by an RDD of its entries.
+
 ***Note***
 
 The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.
@@ -284,7 +306,7 @@ limited by the integer range but it should be much smaller in practice.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
+A [`RowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
 created from an `RDD[Vector]` instance. Then we can compute its column summary statistics.
 
 {% highlight scala %}
@@ -303,7 +325,7 @@ val n = mat.numCols()
 
 <div data-lang="java" markdown="1">
 
-A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
+A [`RowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be
 created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.
 
 {% highlight java %}
@@ -333,8 +355,8 @@ which could be faster if the rows are sparse.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-`RowMatrix#computeColumnSummaryStatistics` returns an instance of
-[`MultivariateStatisticalSummary`](api/mllib/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
+[`RowMatrix#computeColumnSummaryStatistics`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
+[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
 which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
 total count.
 
@@ -355,6 +377,31 @@ println(summary.numNonzeros) // number of nonzeros in each column
 val cov: Matrix = mat.computeCovariance()
 {% endhighlight %}
 </div>
+
+<div data-lang="java" markdown="1">
+
+[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
+[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
+which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
+total count.
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Matrix;
+import org.apache.spark.mllib.linalg.distributed.RowMatrix;
+import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
+
+RowMatrix mat = ... // a RowMatrix
+
+// Compute column summary statistics.
+MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
+System.out.println(summary.mean()); // a dense vector containing the mean value for each column
+System.out.println(summary.variance()); // column-wise variance
+System.out.println(summary.numNonzeros()); // number of nonzeros in each column
+
+// Compute the covariance matrix.
+Matrix cov = mat.computeCovariance();
+{% endhighlight %}
+</div>
 </div>
 
 ### IndexedRowMatrix
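The column-wise statistics this hunk documents can be sanity-checked with a tiny pure-Python version. This is an illustration only, not how `RowMatrix` computes them across partitions; it assumes (matching MLlib's summarizer) that variance is the unbiased sample variance:

```python
def column_summary(rows):
    """Column-wise mean, sample variance, and nonzero counts for a list of rows."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Unbiased (n - 1 denominator) sample variance per column.
    variances = [sum((x - m) ** 2 for x in c) / (n - 1)
                 for c, m in zip(cols, means)]
    nnz = [sum(1 for x in c if x != 0.0) for c in cols]
    return means, variances, nnz

means, variances, nnz = column_summary([[1.0, 0.0], [3.0, 4.0]])
print(means, variances, nnz)  # [2.0, 2.0] [2.0, 8.0] [2, 1]
```

Comparing such a reference against `computeColumnSummaryStatistics` on a small dataset is a quick way to check that rows were loaded as expected.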
@@ -366,9 +413,9 @@ an RDD of indexed rows, which each row is represented by its index (long-typed)
 <div data-lang="scala" markdown="1">
 
 An
-[`IndexedRowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
+[`IndexedRowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
 can be created from an `RDD[IndexedRow]` instance, where
-[`IndexedRow`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
+[`IndexedRow`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
 wrapper over `(Long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
 its row indices.
 
@@ -391,9 +438,9 @@ val rowMat: RowMatrix = mat.toRowMatrix()
 <div data-lang="java" markdown="1">
 
 An
-[`IndexedRowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
+[`IndexedRowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
 can be created from an `JavaRDD<IndexedRow>` instance, where
-[`IndexedRow`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
+[`IndexedRow`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRow.html) is a
 wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
 its row indices.
 
@@ -427,9 +474,9 @@ dimensions of the matrix are huge and the matrix is very sparse.
 <div data-lang="scala" markdown="1">
 
 A
-[`CoordinateMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
+[`CoordinateMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
 can be created from an `RDD[MatrixEntry]` instance, where
-[`MatrixEntry`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
+[`MatrixEntry`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
 wrapper over `(Long, Long, Double)`. A `CoordinateMatrix` can be converted to a `IndexedRowMatrix`
 with sparse rows by calling `toIndexedRowMatrix`. In this release, we do not provide other
 computation for `CoordinateMatrix`.
@@ -453,21 +500,21 @@ val indexedRowMatrix = mat.toIndexedRowMatrix()
 <div data-lang="java" markdown="1">
 
 A
-[`CoordinateMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
+[`CoordinateMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
 can be created from a `JavaRDD<MatrixEntry>` instance, where
-[`MatrixEntry`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
+[`MatrixEntry`](api/java/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
 wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to a `IndexedRowMatrix`
 with sparse rows by calling `toIndexedRowMatrix`.
 
-{% highlight scala %}
+{% highlight java %}
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
 import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
 import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
 
 JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
 // Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
-CoordinateMatrix mat = new CoordinateMatrix(entries);
+CoordinateMatrix mat = new CoordinateMatrix(entries.rdd());
 
 // Get its size.
 long m = mat.numRows();
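The `toIndexedRowMatrix` conversion referenced in this hunk amounts to grouping COO entries by row index into sparse rows. A Spark-free sketch of the idea (illustrative only; the real conversion runs as a distributed group-by over the entry RDD):

```python
from collections import defaultdict

def coo_to_indexed_rows(entries):
    """Group (i, j, value) COO entries into {row_index: [(j, value), ...]}."""
    rows = defaultdict(list)
    for i, j, v in entries:
        rows[i].append((j, v))
    # Sort column indices within each sparse row, as MLlib's sparse rows require.
    return {i: sorted(cols) for i, cols in rows.items()}

entries = [(0, 1, 1.5), (2, 0, -1.0), (0, 0, 2.0)]
print(coo_to_indexed_rows(entries))  # {0: [(0, 2.0), (1, 1.5)], 2: [(0, -1.0)]}
```

Rows with no entries (row 1 above) simply do not appear, which is why the result is an indexed-row matrix rather than a dense row matrix.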