Commit 85e9d09
[SPARK-5519][MLLIB] add user guide with example code for fp-growth
The API is still not very Java-friendly because `Array[Item]` in `freqItemsets` is recognized as `Object` in Java. We might want to define a case class to wrap the return pair to make it Java-friendly.

Author: Xiangrui Meng <[email protected]>

Closes #4661 from mengxr/SPARK-5519 and squashes the following commits:

58ccc25 [Xiangrui Meng] add user guide with example code for fp-growth
1 parent 5aecdcf commit 85e9d09
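As a standalone sketch of the Java-friendliness issue the commit message describes (plain Java, no Spark dependency; `ItemsetCastSketch` is an illustrative name, not part of this commit): the itemset half of each returned pair reaches Java as a plain `Object` and must be cast back to `Object[]`, which is exactly what the examples in this commit do.

```java
import java.util.Arrays;

public class ItemsetCastSketch {
  public static void main(String[] args) {
    // The Scala API returns (Array[Item], Long) pairs; because Array[Item]
    // is generic, Java sees the first element as a plain Object and must
    // cast it back to Object[] before using it.
    Object itemset = new String[] {"a", "b"};      // what Java receives
    String rendered = Arrays.toString((Object[]) itemset);
    System.out.println(rendered);                  // prints "[a, b]"
  }
}
```

A case class wrapping the pair, as the message suggests, would let Java callers avoid this cast entirely.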

File tree

4 files changed, +216 -0 lines changed

docs/mllib-frequent-pattern-mining.md

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
---
layout: global
title: Frequent Pattern Mining - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Frequent Pattern Mining
---

Mining frequent items, itemsets, subsequences, or other substructures is usually among the
first steps in analyzing a large-scale dataset, and it has been an active research topic in
data mining for years.
We refer users to Wikipedia's [association rule learning](http://en.wikipedia.org/wiki/Association_rule_learning)
for more information.
MLlib provides a parallel implementation of FP-growth,
a popular algorithm for mining frequent itemsets.

## FP-growth

The FP-growth algorithm is described in the paper
[Han et al., Mining frequent patterns without candidate generation](http://dx.doi.org/10.1145/335191.335372),
where "FP" stands for frequent pattern.
Given a dataset of transactions, the first step of FP-growth is to calculate item frequencies and identify frequent items.
Different from [Apriori-like](http://en.wikipedia.org/wiki/Apriori_algorithm) algorithms designed for the same purpose,
the second step of FP-growth uses a suffix tree (FP-tree) structure to encode transactions without explicitly generating
candidate sets, which are usually expensive to generate.
After the second step, the frequent itemsets can be extracted from the FP-tree.
In MLlib, we implemented a parallel version of FP-growth called PFP,
as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
PFP distributes the work of growing FP-trees based on the suffixes of transactions,
and is hence more scalable than a single-machine implementation.
We refer users to the papers for more details.

MLlib's FP-growth implementation takes the following (hyper-)parameters:

* `minSupport`: the minimum support for an itemset to be identified as frequent.
For example, if an item appears in 3 out of 5 transactions, it has a support of 3/5 = 0.6.
* `numPartitions`: the number of partitions used to distribute the work.

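To make the `minSupport` arithmetic above concrete, here is a minimal standalone sketch (plain Java, not the MLlib implementation; `SupportSketch` and `frequentItems` are illustrative names) that computes single-item supports over a toy transaction set and keeps the items meeting the threshold:

```java
import java.util.*;

public class SupportSketch {
  // Returns the items whose support (fraction of transactions containing
  // the item) is at least minSupport.
  static Map<String, Double> frequentItems(List<List<String>> transactions, double minSupport) {
    Map<String, Integer> counts = new HashMap<>();
    for (List<String> t : transactions) {
      for (String item : new HashSet<>(t)) {  // count each item once per transaction
        counts.merge(item, 1, Integer::sum);
      }
    }
    Map<String, Double> frequent = new TreeMap<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      double support = (double) e.getValue() / transactions.size();
      if (support >= minSupport) {
        frequent.put(e.getKey(), support);
      }
    }
    return frequent;
  }

  public static void main(String[] args) {
    List<List<String>> transactions = Arrays.asList(
        Arrays.asList("r", "z", "h"),
        Arrays.asList("z", "y", "x"),
        Arrays.asList("s", "x", "z"),
        Arrays.asList("z", "y"),
        Arrays.asList("z"));
    // "z" appears in 5/5 transactions (support 1.0); every other item
    // appears in at most 2/5 (support 0.4), below the 0.6 threshold.
    System.out.println(frequentItems(transactions, 0.6));  // prints "{z=1.0}"
  }
}
```

MLlib applies the same threshold to whole itemsets, not just single items, and does the counting in parallel across `numPartitions` partitions.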
**Examples**

<div class="codetabs">
<div data-lang="scala" markdown="1">

[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) implements the
FP-growth algorithm.
It takes an `RDD` of transactions, where each transaction is an `Array` of items of a generic type.
Calling `FPGrowth.run` with transactions returns an
[`FPGrowthModel`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel)
that stores the frequent itemsets with their frequencies.

{% highlight scala %}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}

val transactions: RDD[Array[String]] = ...

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)

model.freqItemsets.collect().foreach { case (itemset, freq) =>
  println(itemset.mkString("[", ",", "]") + ", " + freq)
}
{% endhighlight %}

</div>

<div data-lang="java" markdown="1">

[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) implements the
FP-growth algorithm.
It takes a `JavaRDD` of transactions, where each transaction is an `Iterable` of items of a generic type.
Calling `FPGrowth.run` with transactions returns an
[`FPGrowthModel`](api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html)
that stores the frequent itemsets with their frequencies.

{% highlight java %}
import java.util.Arrays;
import java.util.List;

import scala.Tuple2;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.fpm.FPGrowth;
import org.apache.spark.mllib.fpm.FPGrowthModel;

JavaRDD<List<String>> transactions = ...

FPGrowth fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10);

FPGrowthModel<String> model = fpg.run(transactions);

for (Tuple2<Object, Long> s: model.javaFreqItemsets().collect()) {
  System.out.println("(" + Arrays.toString((Object[]) s._1()) + "): " + s._2());
}
{% endhighlight %}

</div>
</div>

docs/mllib-guide.md

Lines changed: 2 additions & 0 deletions
@@ -34,6 +34,8 @@ filtering, dimensionality reduction, as well as underlying optimization primitiv
   * singular value decomposition (SVD)
   * principal component analysis (PCA)
 * [Feature extraction and transformation](mllib-feature-extraction.html)
+* [Frequent pattern mining](mllib-frequent-pattern-mining.html)
+  * FP-growth
 * [Optimization (developer)](mllib-optimization.html)
   * stochastic gradient descent
   * limited-memory BFGS (L-BFGS)

examples/src/main/java/org/apache/spark/examples/mllib/JavaFPGrowthExample.java
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.examples.mllib;

import java.util.ArrayList;
import java.util.Arrays;

import scala.Tuple2;

import com.google.common.collect.Lists;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.fpm.FPGrowth;
import org.apache.spark.mllib.fpm.FPGrowthModel;

/**
 * Java example for mining frequent itemsets using FP-growth.
 */
public class JavaFPGrowthExample {

  public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("JavaFPGrowthExample");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);

    // TODO: Read a user-specified input file.
    @SuppressWarnings("unchecked")
    JavaRDD<ArrayList<String>> transactions = sc.parallelize(Lists.newArrayList(
      Lists.newArrayList("r z h k p".split(" ")),
      Lists.newArrayList("z y x w v u t s".split(" ")),
      Lists.newArrayList("s x o n r".split(" ")),
      Lists.newArrayList("x z y m t s q e".split(" ")),
      Lists.newArrayList("z".split(" ")),
      Lists.newArrayList("x z y r q t p".split(" "))), 2);

    FPGrowth fpg = new FPGrowth()
      .setMinSupport(0.3);
    FPGrowthModel<String> model = fpg.run(transactions);

    for (Tuple2<Object, Long> s: model.javaFreqItemsets().collect()) {
      System.out.println(Arrays.toString((Object[]) s._1()) + ", " + s._2());
    }

    sc.stop();
  }
}

examples/src/main/scala/org/apache/spark/examples/mllib/FPGrowthExample.scala
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.examples.mllib

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.{SparkContext, SparkConf}

/**
 * Example for mining frequent itemsets using FP-growth.
 */
object FPGrowthExample {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("FPGrowthExample")
    val sc = new SparkContext(conf)

    // TODO: Read a user-specified input file.
    val transactions = sc.parallelize(Seq(
      "r z h k p",
      "z y x w v u t s",
      "s x o n r",
      "x z y m t s q e",
      "z",
      "x z y r q t p").map(_.split(" ")), numSlices = 2)

    val fpg = new FPGrowth()
      .setMinSupport(0.3)
    val model = fpg.run(transactions)

    model.freqItemsets.collect().foreach { case (itemset, freq) =>
      println(itemset.mkString("[", ",", "]") + ", " + freq)
    }

    sc.stop()
  }
}

0 commit comments
