docs/mllib-clustering.md
Lines changed: 82 additions & 13 deletions
@@ -270,23 +270,92 @@ for i in range(2):
270
270
271
271
## Power iteration clustering (PIC)
272
272
273
-
Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:
273
+
Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering vertices of a
274
+
graph given pairwise similarities as edge properties,
275
+
described in [Lin and Cohen, Power Iteration Clustering](http://www.icml2010.org/papers/387.pdf).
276
+
It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via
277
+
[power iteration](http://en.wikipedia.org/wiki/Power_iteration) and uses it to cluster vertices.
278
+
MLlib includes an implementation of PIC using GraphX as its backend.
279
+
It takes an `RDD` of `(srcId, dstId, similarity)` tuples and outputs a model with the clustering assignments.
280
+
The similarities must be nonnegative.
281
+
PIC assumes that the similarity measure is symmetric.
282
+
A pair `(srcId, dstId)` regardless of the ordering should appear at most once in the input data.
283
+
If a pair is missing from the input, its similarity is treated as zero.
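The input constraints above can be illustrated with a minimal plain-Python sketch. It does not use Spark or GraphX; `build_affinity` and `similarity` are hypothetical helper names chosen for illustration, not part of MLlib.

```python
def build_affinity(similarities):
    """Build a symmetric lookup from (srcId, dstId, similarity) tuples.

    Hypothetical helper: enforces nonnegative similarities and rejects a
    pair that appears more than once in either order.
    """
    affinity = {}
    seen = set()
    for src, dst, sim in similarities:
        if sim < 0:
            raise ValueError("similarities must be nonnegative")
        pair = frozenset((src, dst))  # same key for (a, b) and (b, a)
        if pair in seen:
            raise ValueError(f"pair ({src}, {dst}) appears more than once")
        seen.add(pair)
        # Store both directions because the measure is assumed symmetric.
        affinity.setdefault(src, {})[dst] = sim
        affinity.setdefault(dst, {})[src] = sim
    return affinity


def similarity(affinity, i, j):
    """Return the similarity of i and j; missing pairs count as zero."""
    return affinity.get(i, {}).get(j, 0.0)
```

Storing each edge in both directions is one simple way to honor the symmetry assumption while still accepting each pair only once.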
284
+
MLlib's PIC implementation takes the following (hyper-)parameters:
285
+
286
+
* `k`: number of clusters
287
+
* `maxIterations`: maximum number of power iterations
288
+
* `initializationMode`: initialization mode. This can be either "random", which is the default,
289
+
to use a random vector as vertex properties, or "degree" to use the normalized sum of each vertex's similarities.
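To make the roles of `k`, `maxIterations`, and `initializationMode` concrete, here is a small single-machine sketch of the general technique in plain Python. It is not MLlib's GraphX-based implementation: the function names, the `seed` argument, the fixed iteration count without an early stopping test, and the simple 1-D k-means step are all assumptions made for illustration.

```python
import itertools
import random


def kmeans_1d(values, k, iterations=50):
    """Cluster a {vertex: scalar} map with a basic 1-D Lloyd's k-means."""
    points = sorted(values.values())
    # Spread the initial centers across the sorted values.
    centers = [points[(2 * c + 1) * len(points) // (2 * k)] for c in range(k)]
    nearest = lambda x: min(range(k), key=lambda c: abs(x - centers[c]))
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for x in points:
            groups[nearest(x)].append(x)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else centers[c]
                   for c, g in enumerate(groups)]
    return {i: nearest(x) for i, x in values.items()}


def power_iteration_clustering(similarities, k, max_iterations=20,
                               initialization_mode="random", seed=0):
    """Sketch of PIC on (srcId, dstId, similarity) tuples (illustrative only)."""
    nbrs = {}
    for src, dst, sim in similarities:
        nbrs.setdefault(src, {})[dst] = sim
        nbrs.setdefault(dst, {})[src] = sim  # symmetric similarity
    degree = {i: sum(nbrs[i].values()) for i in nbrs}
    if any(d <= 0 for d in degree.values()):
        raise ValueError("this sketch assumes every vertex has positive degree")

    if initialization_mode == "degree":
        raw = dict(degree)  # sum of similarities per vertex
    elif initialization_mode == "random":
        rng = random.Random(seed)
        raw = {i: rng.random() for i in nbrs}
    else:
        raise ValueError("initialization_mode must be 'random' or 'degree'")
    total = sum(raw.values())
    v = {i: x / total for i, x in raw.items()}  # normalize to unit L1 norm

    for _ in range(max_iterations):
        # One power iteration with the row-normalized affinity matrix.
        nxt = {i: sum(w * v[j] for j, w in nbrs[i].items()) / degree[i]
               for i in nbrs}
        norm = sum(abs(x) for x in nxt.values())
        v = {i: x / norm for i, x in nxt.items()}

    # Cluster vertices by their component of the pseudo-eigenvector.
    return kmeans_1d(v, k)


# Example: a 4-clique and a 3-clique joined by one weak edge.
edges = [(a, b, 1.0) for a, b in itertools.combinations(range(4), 2)]
edges += [(a, b, 1.0) for a, b in itertools.combinations(range(4, 7), 2)]
edges.append((3, 4, 0.01))
assignments = power_iteration_clustering(
    edges, k=2, max_iterations=15, initialization_mode="degree")
```

In this sketch the iteration is deliberately stopped after a fixed number of steps: running power iteration to full convergence would drive the vector toward a constant and erase the cluster structure, which is why the intermediate pseudo-eigenvector is the one that gets clustered.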
274
290
275
-
* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
276
-
* calculates the principal eigenvalue and eigenvector
277
-
* Clusters each of the input points according to their principal eigenvector component value
291
+
**Examples**
292
+
293
+
In the following, we show code snippets to demonstrate how to use PIC in MLlib.