Commit 2dd921f

update PIC user guide and add a Java example
1 parent a8eb92d commit 2dd921f

3 files changed: 132 additions and 13 deletions

docs/mllib-clustering.md

Lines changed: 82 additions & 13 deletions
@@ -270,23 +270,92 @@ for i in range(2):
270270

271271
## Power iteration clustering (PIC)
272272

273-
Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:
273+
Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering vertices of a
274+
graph given pairwise similarities as edge properties,
275+
described in [Lin and Cohen, Power Iteration Clustering](http://www.icml2010.org/papers/387.pdf).
276+
It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via
277+
[power iteration](http://en.wikipedia.org/wiki/Power_iteration) and uses it to cluster vertices.
278+
MLlib includes an implementation of PIC using GraphX as its backend.
279+
It takes an `RDD` of `(srcId, dstId, similarity)` tuples and outputs a model with the clustering assignments.
280+
The similarities must be nonnegative.
281+
PIC assumes that the similarity measure is symmetric.
282+
A pair `(srcId, dstId)`, regardless of the ordering, should appear at most once in the input data.
283+
If a pair is missing from the input, its similarity is treated as zero.
284+
MLlib's PIC implementation takes the following (hyper-)parameters:
285+
286+
* `k`: number of clusters
287+
* `maxIterations`: maximum number of power iterations
288+
* `initializationMode`: initialization mode. This can be either "random", which is the default,
288+
to use a random vector as vertex properties, or "degree" to use the normalized sum of similarities.
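
The pseudo-eigenvector computation and the "degree" initialization described above can be sketched in plain Java on a small dense matrix. This is only an illustration under stated assumptions, not MLlib's GraphX-based implementation: the class and method names (`PowerIterationSketch`, `rowNormalize`, `degreeInit`, `powerIterate`) are hypothetical, the loop runs a fixed number of steps instead of using a convergence test, and the final step that clusters the entries of the vector (k-means in the paper) is omitted.

```java
import java.util.Arrays;

/** Minimal dense sketch of the power iteration behind PIC (hypothetical, not MLlib code). */
final class PowerIterationSketch {

  /** Returns the row-normalized affinity matrix W, where W[i][j] = A[i][j] / sum_j A[i][j]. */
  static double[][] rowNormalize(double[][] a) {
    int n = a.length;
    double[][] w = new double[n][n];
    for (int i = 0; i < n; i++) {
      double degree = 0.0;
      for (double x : a[i]) degree += x;
      if (degree <= 0.0) {
        throw new IllegalArgumentException("vertex " + i + " has no positive similarity");
      }
      for (int j = 0; j < n; j++) w[i][j] = a[i][j] / degree;
    }
    return w;
  }

  /** "degree" initialization: each vertex's summed similarity divided by the grand total. */
  static double[] degreeInit(double[][] a) {
    double[] v = new double[a.length];
    double total = 0.0;
    for (int i = 0; i < a.length; i++) {
      for (double x : a[i]) v[i] += x;
      total += v[i];
    }
    for (int i = 0; i < v.length; i++) v[i] /= total;
    return v;
  }

  /** Runs maxIterations steps of v <- W v, rescaling to unit L1 norm after each step. */
  static double[] powerIterate(double[][] w, double[] v0, int maxIterations) {
    double[] v = v0.clone();
    for (int it = 0; it < maxIterations; it++) {
      double[] next = new double[v.length];
      double norm = 0.0;
      for (int i = 0; i < w.length; i++) {
        for (int j = 0; j < w.length; j++) next[i] += w[i][j] * v[j];
        norm += Math.abs(next[i]);
      }
      for (int i = 0; i < next.length; i++) next[i] /= norm;
      v = next;
    }
    return v;
  }

  public static void main(String[] args) {
    // Symmetric similarities for the chain 0-1-2-3-4-5; pairs not listed stay at zero.
    double[][] a = new double[6][6];
    double[] chain = {0.9, 0.9, 0.9, 0.1, 0.9};
    for (int i = 0; i < chain.length; i++) {
      a[i][i + 1] = chain[i];
      a[i + 1][i] = chain[i];
    }
    double[] v = powerIterate(rowNormalize(a), degreeInit(a), 10);
    System.out.println(Arrays.toString(v));
  }
}
```

For a row-normalized matrix, running the iteration to convergence drives the vector toward a constant, so the number of iterations is kept small; the intermediate vector is what carries the cluster structure.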
274290

275-
* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
276-
* calculates the principal eigenvalue and eigenvector
277-
* Clusters each of the input points according to their principal eigenvector component value
291+
**Examples**
292+
293+
In the following, we show code snippets to demonstrate how to use PIC in MLlib.
294+
295+
<div class="codetabs">
296+
<div data-lang="scala" markdown="1">
297+
298+
[`PowerIterationClustering`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClustering)
299+
implements the PIC algorithm.
300+
It takes an `RDD` of `(srcId: Long, dstId: Long, similarity: Double)` tuples representing the
301+
affinity matrix.
302+
Calling `PowerIterationClustering.run` returns a
303+
[`PowerIterationClusteringModel`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClusteringModel),
304+
which contains the computed clustering assignments.
278305

279-
Details of this algorithm are found within [Power Iteration Clustering, Lin and Cohen]{www.icml2010.org/papers/387.pdf}
306+
{% highlight scala %}
307+
import org.apache.spark.mllib.clustering.PowerIterationClustering
308+
import org.apache.spark.rdd.RDD
280309

281-
Example outputs for a dataset inspired by the paper - but with five clusters instead of three- have he following output from our implementation:
310+
val similarities: RDD[(Long, Long, Double)] = ...
311+
312+
val pic = new PowerIterationClustering()
313+
.setK(3)
314+
.setMaxIterations(20)
315+
val model = pic.run(similarities)
316+
317+
model.assignments.foreach { case (vertexId, clusterId) =>
318+
println(s"$vertexId -> $clusterId")
319+
}
320+
{% endhighlight %}
321+
322+
A full example that produces the experiment described in the PIC paper can be found under
323+
[`examples/`](https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala).
324+
325+
</div>
282326

283-
<p style="text-align: center;">
284-
<img src="img/PIClusteringFiveCirclesInputsAndOutputs.png"
285-
title="The Property Graph"
286-
alt="The Property Graph"
287-
width="50%" />
288-
<!-- Images are downsized intentionally to improve quality on retina displays -->
289-
</p>
327+
<div data-lang="java" markdown="1">
328+
329+
[`PowerIterationClustering`](api/java/org/apache/spark/mllib/clustering/PowerIterationClustering.html)
330+
implements the PIC algorithm.
331+
It takes a `JavaRDD` of `(srcId: Long, dstId: Long, similarity: Double)` tuples representing the
332+
affinity matrix.
333+
Calling `PowerIterationClustering.run` returns a
334+
[`PowerIterationClusteringModel`](api/java/org/apache/spark/mllib/clustering/PowerIterationClusteringModel.html),
335+
which contains the computed clustering assignments.
336+
337+
{% highlight java %}
338+
import scala.Tuple2;
339+
import scala.Tuple3;
340+
341+
import org.apache.spark.api.java.JavaRDD;
342+
import org.apache.spark.mllib.clustering.PowerIterationClustering;
343+
import org.apache.spark.mllib.clustering.PowerIterationClusteringModel;
344+
345+
JavaRDD<Tuple3<Long, Long, Double>> similarities = ...
346+
347+
PowerIterationClustering pic = new PowerIterationClustering()
348+
.setK(2)
349+
.setMaxIterations(10);
350+
PowerIterationClusteringModel model = pic.run(similarities);
351+
352+
for (Tuple2<Object, Object> assignment: model.assignments().toJavaRDD().collect()) {
353+
System.out.println(assignment._1() + " -> " + assignment._2());
354+
}
355+
{% endhighlight %}
356+
</div>
357+
358+
</div>
290359

291360
## Latent Dirichlet allocation (LDA)
292361

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
package org.apache.spark.examples.mllib;
2+
3+
import scala.Tuple2;
4+
import scala.Tuple3;
5+
6+
import com.google.common.collect.Lists;
7+
8+
import org.apache.spark.SparkConf;
9+
import org.apache.spark.api.java.JavaRDD;
10+
import org.apache.spark.api.java.JavaSparkContext;
11+
import org.apache.spark.mllib.clustering.PowerIterationClustering;
12+
import org.apache.spark.mllib.clustering.PowerIterationClusteringModel;
13+
14+
/**
15+
* Java example for graph clustering using power iteration clustering (PIC).
16+
*/
17+
public class JavaPowerIterationClusteringExample {
18+
public static void main(String[] args) {
19+
SparkConf sparkConf = new SparkConf().setAppName("JavaPowerIterationClusteringExample");
20+
JavaSparkContext sc = new JavaSparkContext(sparkConf);
21+
22+
@SuppressWarnings("unchecked")
23+
JavaRDD<Tuple3<Long, Long, Double>> similarities = sc.parallelize(Lists.newArrayList(
24+
new Tuple3<Long, Long, Double>(0L, 1L, 0.9),
25+
new Tuple3<Long, Long, Double>(1L, 2L, 0.9),
26+
new Tuple3<Long, Long, Double>(2L, 3L, 0.9),
27+
new Tuple3<Long, Long, Double>(3L, 4L, 0.1),
28+
new Tuple3<Long, Long, Double>(4L, 5L, 0.9)));
29+
30+
PowerIterationClustering pic = new PowerIterationClustering()
31+
.setK(2)
32+
.setMaxIterations(10);
33+
PowerIterationClusteringModel model = pic.run(similarities);
34+
35+
for (Tuple2<Object, Object> assignment: model.assignments().toJavaRDD().collect()) {
36+
System.out.println(assignment._1() + " -> " + assignment._2());
37+
}
38+
39+
sc.stop();
40+
}
41+
}
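
The example above lists each pair once by hand. For larger inputs, the documented input rules (nonnegative similarities, each unordered pair at most once) could be enforced before building the RDD. The following is a minimal sketch in plain Java, assuming a hypothetical `SimilarityInput` helper and `Edge` holder that are not part of MLlib; treating conflicting duplicates as an error is a design choice made here, not documented behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-processing helper for PIC input; not part of MLlib. */
final class SimilarityInput {

  /** One (srcId, dstId, similarity) entry. */
  static final class Edge {
    final long src;
    final long dst;
    final double similarity;

    Edge(long src, long dst, double similarity) {
      this.src = src;
      this.dst = dst;
      this.similarity = similarity;
    }
  }

  /**
   * Reorders every pair so that src <= dst, drops exact repeats of an unordered pair,
   * and rejects negative or NaN similarities and conflicting values for the same pair.
   */
  static List<Edge> canonicalize(List<Edge> raw) {
    Map<String, Edge> unique = new LinkedHashMap<>();
    for (Edge e : raw) {
      if (Double.isNaN(e.similarity) || e.similarity < 0.0) {
        throw new IllegalArgumentException("similarity must be nonnegative: " + e.similarity);
      }
      long lo = Math.min(e.src, e.dst);
      long hi = Math.max(e.src, e.dst);
      String key = lo + ":" + hi;
      Edge previous = unique.putIfAbsent(key, new Edge(lo, hi, e.similarity));
      if (previous != null && previous.similarity != e.similarity) {
        throw new IllegalArgumentException("conflicting similarities for pair " + key);
      }
    }
    return new ArrayList<>(unique.values());
  }

  public static void main(String[] args) {
    // (1, 0) and (0, 1) describe the same unordered pair, so only one entry survives.
    List<Edge> cleaned = canonicalize(Arrays.asList(
        new Edge(1L, 0L, 0.9),
        new Edge(0L, 1L, 0.9),
        new Edge(1L, 2L, 0.5)));
    for (Edge e : cleaned) {
      System.out.println(e.src + " " + e.dst + " " + e.similarity);
    }
  }
}
```

Each surviving entry could then be converted to a `Tuple3<Long, Long, Double>` and passed to `sc.parallelize` as in the example.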

mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala

Lines changed: 9 additions & 0 deletions
@@ -17,6 +17,7 @@
1717

1818
package org.apache.spark.mllib.clustering
1919

20+
import org.apache.spark.api.java.JavaRDD
2021
import org.apache.spark.{Logging, SparkException}
2122
import org.apache.spark.annotation.Experimental
2223
import org.apache.spark.graphx._
@@ -115,6 +116,14 @@ class PowerIterationClustering private[clustering] (
115116
pic(w0)
116117
}
117118

119+
/**
120+
* A Java-friendly version of [[PowerIterationClustering.run]].
121+
*/
122+
def run(similarities: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)])
123+
: PowerIterationClusteringModel = {
124+
run(similarities.rdd.asInstanceOf[RDD[(Long, Long, Double)]])
125+
}
126+
118127
/**
119128
* Runs the PIC algorithm.
120129
*
